Group-Chat Agent Evaluation

Bibliography & Sources

Every cited benchmark, dataset, framework, and construct

Every named resource cited across the two reports, consolidated. The eval-mechanics sources are benchmarks, datasets, and frameworks (from the Eval Survey); the conversation-science sources are the foundational theory (from What to Eval For).

Verification column: each entry is marked as independently confirmed real and correctly attributed during the research run; flagged inaccurate or self-corrected (see the report's inline note); or not verifiable. Raw verified data: eval-survey · dynamics.

Eval-mechanics sources (39)

Benchmarks, datasets, and frameworks for measuring group-chat agents.

| Resource | Kind | Year | What it is |
| --- | --- | --- | --- |
| Addressee and Response Selection for Multi-Party Conversation (Ouchi & Tsuboi) | dataset | 2016 | Foundational EMNLP 2016 paper that formalized the joint task of selecting BOTH whom an agent addresses and what it says in a multi-party conversation. Released a large… |
| Addressee and Response Selection for Multi-Party Conversation (Ubuntu IRC benchmark) | benchmark | 2016 | The foundational EMNLP 2016 paper (Ouchi & Tsuboi) that defines the joint Addressee-and-Response-Selection (ARS) task on the Ubuntu Multiparty Conversation Corpus: given… |
| Addressee and Response Selection (Ubuntu IRC / Hu et al.) | dataset | 2018 | Canonical pre-LLM multi-party task and dataset built from Ubuntu IRC chat logs, where speakers play sender/addressee/observer roles. The system must pick both the correct… |
| A Large-Scale Corpus for Conversation Disentanglement (Kummerfeld et al.) / irc-disentanglement | dataset | 2019 | ACL 2019 release of 77,563 #Ubuntu/#Linux IRC messages manually annotated with reply-to (parent-child) links forming reply-structure graphs that both disentangle interleaved… |
| Who Is Speaking to Whom? W2W model (Le, Hu et al.) | paper | 2019 | EMNLP-IJCNLP 2019 paper introducing the who-to-whom (W2W) model that identifies the addressee of EVERY utterance in a session jointly, not just the next response. Uses… |
| Molweni | dataset | 2020 | A multi-party dialogue machine-reading-comprehension dataset sampled from the Ubuntu Chat Corpus: 9,754 dialogues, 86,042 utterances, 30,066 question-answer pairs, annotated… |
| Molweni | dataset | 2020 | COLING 2020 machine-reading-comprehension dataset over multiparty dialogue, sampled from the Ubuntu Chat Corpus: 10,000 dialogs / 88,303 utterances, 30,066 questions (incl.… |
| TurnGPT | paper | 2020 | A GPT-2-based language model (Ekstedt & Skantze) that predicts turn-shifts by adding Transition Relevance Place (TRP) tokens to the vocabulary, projecting turn completion from… |
| WHO Says WHAT to WHOM: A Survey of Multi-Party Conversations | paper | 2022 | IJCAI 2022 survey framing multi-party conversation research around the three coupled questions of WHO (speaker), WHAT (utterance), and to WHOM (addressee), surveying tasks,… |
| NormBank (SCENE taxonomy) | dataset | 2023 | A knowledge bank of 155k situational social norms (Ziems et al., ACL 2023) where each norm is grounded in a multivalent sociocultural frame: setting, agent roles, attributes,… |
| SOTOPIA / SOTOPIA-Eval | benchmark | 2023 | An open-ended environment that simulates goal-driven social interactions between LLM agents who role-play diverse character profiles with private goals and relationship… |
| Large Language Models Know What To Say But Not When To Speak (TRP benchmark) | paper | 2024 | An EMNLP 2024 Findings paper (Umair, Sarathy, de Ruiter) that introduces a dataset of participant-labeled within-turn Transition Relevance Places (TRPs) in unscripted spoken… |
| Lla-VAP: LSTM Ensemble of Llama and VAP for Turn-Taking Prediction | paper | 2024 | Combines an LLM (Llama, semantic/syntactic context) with a Voice Activity Projection model (acoustic cues) via an LSTM ensemble to predict turn-taking opportunities, fusing… |
| MUCA + MUS (Multi-User Chat Assistant / Multi-User Simulator) | framework | 2024 | Described as the first LLM framework dedicated to multi-user group conversations, organized around the 3W design (What to say, When to respond, Who to answer) via a Sub-topic… |
| MUCA: Multi-User Chat Assistant framework | framework | 2024 | An LLM framework for facilitating group text conversations whose Utterance Strategies Arbitrator explicitly decides the What/When/Who of a bot utterance, using an 'in-context… |
| PersonaGym / PersonaScore | benchmark | 2024 | The first dynamic evaluation framework for persona agents (200 personas, 10k questions, 150 environments), with PersonaScore as an automated, human-aligned metric grounded in… |
| RENOVI: Remediating Norm Violations in Socio-Cultural Conversations | benchmark | 2024 | A large-scale corpus of 9,258 multi-turn dialogues (512 human-authored + 8,746 ChatGPT-synthesized) annotated with social norms, designed to evaluate detecting and remediating… |
| Addressee Recognition in Multi-modal Multi-party Dialogue (LLM benchmark) | benchmark | 2025 | A benchmark built on a multi-modal corpus of triadic (3-participant) discussions that tests whether an LLM can identify the addressee: who is being spoken to / who should take… |
| An LLM Benchmark for Addressee Recognition in Multi-modal Multi-party Dialogue | benchmark | 2025 | A benchmark built on a multimodal triadic (3-party) dialogue corpus with addressee annotations (explicit addressees occur in ~20% of turns), testing whether LLMs can identify… |
| An LLM Benchmark for Addressee Recognition in Multi-modal Multi-party Dialogue (Inoue et al., TEIDAN) | benchmark | 2025 | Kyoto University benchmark testing whether modern LLMs (GPT-4o) can do addressee recognition and next-speaker prediction in spontaneous triadic (3-person) dialogue, using the… |
| Beyond Words: Multimodal LLM Knows When to Speak | paper | 2025 | Builds a dataset annotated for turn-taking labels, backchannel signals (e.g. 'mm-hmm'), and speech timing from Fisher, MAHNOB-HCI, and Harper Valley Bank corpora, training a… |
| DEBATE: Benchmark for Role-Playing LLM Agents in Multi-Agent, Long-Form Debates | benchmark | 2025 | A large-scale benchmark (30,707 messages, 2,832 U.S. participants, 708 groups, 107 topics) that measures whether multi-agent role-playing LLMs reproduce authentic human group… |
| DICE-Bench | benchmark | 2025 | Dialogue-based Interactive Calling Evaluation Benchmark: the first multi-round, multi-party benchmark for function/tool-calling grounded in realistic group-chat data. 1,607… |
| HSII (How Social Is It?) | benchmark | 2025 | A benchmark explicitly built to assess LLMs as autonomous social agents in multi-user, multi-turn settings (avg 6.72 participants, 7.8 turns per scenario), as opposed to… |
| MAGPIE | benchmark | 2025 | Multi-AGent contextual PrIvacy Evaluation: ~200 high-stakes tasks (earlier version: 158 scenarios across 15 domains) evaluating privacy preservation in multi-agent,… |
| Multi-Party Conversational Agents: A Survey | paper | 2025 | A survey organizing multi-party conversational-agent research, with explicit sub-sections on Turn Detection (when to speak) and Addressee Selection (whom to address),… |
| Multi-Party Conversational Agents: A Survey | paper | 2025 | A 2025 survey of multi-party conversational agents that organizes the field around the sub-capabilities required for group settings, including who-speaks-next / turn-taking,… |
| MultiAgentBench | benchmark | 2025 | A benchmark evaluating LLM-based multi-agent systems across interactive scenarios with both cooperative (mutual-goal) and competitive (conflicting-goal) settings, supporting… |
| MultiAgentBench (MARBLE) | benchmark | 2025 | A benchmark suite (Zhu et al.) for LLM multi-agent systems across cooperative (research collab, Minecraft build, DB diagnosis, coding) and competitive (bargaining, Werewolf… |
| Multimodal Conversation Structure Understanding (MCSU) | benchmark | 2025 | A 2025 benchmark for evaluating (multimodal) LLMs on the structural fabric of multi-party conversation, including speaker/addressee and reply-to relations, beyond surface… |
| ProMediate | framework | 2025 | A socio-cognitive framework (USC + Microsoft) for evaluating proactive AI mediator agents in multi-topic, multi-party negotiations. Includes a simulation testbed with… |
| SAGE | framework | 2025 | A top-down/bottom-up knowledge-grounded user simulator for multi-turn agent evaluation (Columbia DAPLab, Findings of EACL 2026). Grounds simulated users in business logic… |
| tau2-bench (τ²-bench) | benchmark | 2025 | Sierra's benchmark for tool-agent-user interaction. τ²-bench extends τ-bench to a dual-control setting (Telecom domain) where BOTH the simulated user and the agent can call… |
| The Social Laboratory: A Psychometric Framework for Multi-Agent LLM Evaluation | framework | 2025 | A psychometric evaluation framework (Zarreen Reza) that assesses LLMs as social actors inside multi-agent debates rather than in isolation, using a 3-round multi-party debate… |
| Triadic Multi-party Voice Activity Projection (VAP) for Turn-taking | paper | 2025 | First extension of Voice Activity Projection to triadic (3-party) spoken conversation, predicting each speaker's future voice activity from acoustics to determine who takes the… |
| GroupMemBench | benchmark | 2026 | A benchmark for LLM agent MEMORY specifically in multi-party conversations, motivated by the fact that nearly all memory benchmarks assume a dyadic single-user setup while real… |
| Mind the Sim2Real Gap in User Simulation for Agentic Tasks | paper | 2026 | A CMU LTI study quantifying how faithfully LLM user simulators replicate real human behavior in agent interactions, and how that mismatch distorts benchmark scores. |
| MPCEval: A Benchmark for Multi-Party Conversation Generation | benchmark | 2026 | A standardized, task-aware framework for evaluating multi-party conversation generation, covering both next-message prediction and full-conversation generation across varied… |
| RealUserSim | framework | 2026 | A user-simulation framework grounded in real behavioral data: extracts 7,275 executable behavioral profiles from 14,000+ authentic human-LLM conversations (WildChat) and uses… |

Conversation-science sources (25)

Foundational constructs from conversation analysis, the social psychology of groups, and sociolinguistics/pragmatics that ground the eval criteria.

| Construct | Seminal source | Discipline |
| --- | --- | --- |
| Adjacency pairs & conditional relevance | Schegloff & Sacks 1973, "Opening up closings", Semiotica 8:289–327; elaborated in Schegloff 2007, "Sequence Organization in Interaction" | Conversation Analysis & Interactional Linguistics (organization of multi-party talk) |
| Audience Design & Style-Shifting | Bell 1984, "Language Style as Audience Design" (Language in Society 13:145–204) | Sociolinguistics & Pragmatics (multi-party conversation) |
| Audience effects / social facilitation & evaluation apprehension | Zajonc 1965, "Social Facilitation" (Science), mere-presence drive theory; Cottrell 1972 evaluation-apprehension refinement | Social psychology of group conversation and small-group dynamics |
| Common ground & grounding (least collaborative effort, grounding criterion) | Clark & Brennan 1991, "Grounding in Communication" (in Resnick et al., eds., Perspectives on Socially Shared Cognition); building on Clark & Wilkes-Gibbs 1986 | Social psychology of group conversation and small-group dynamics |
| Communication Accommodation Theory (convergence / divergence) | Giles 1973; Giles, Coupland & Coupland 1991 (Contexts of Accommodation); Giles & Ogay 2007 review | Social psychology of group conversation and small-group dynamics |
| Contextualization Cues, Code-Switching & Register | Gumperz 1982, "Discourse Strategies" (conversational code-switching, contextualization cues) | Sociolinguistics & Pragmatics (multi-party conversation) |
| Cooperative Principle & Conversational Maxims (Implicature) | Grice 1975, "Logic and Conversation" (in Syntax and Semantics 3) | Sociolinguistics & Pragmatics (multi-party conversation) |
| Cultural Variation in Face & Politeness (Discernment / Wakimae) | Matsumoto 1988 and Ide 1989 (critiques of Brown & Levinson using Japanese); concept of wakimae/discernment politeness | Sociolinguistics & Pragmatics (multi-party conversation) |
| Face & Face-Threatening Acts (FTAs) | Brown & Levinson 1987, "Politeness: Some Universals in Language Usage" (building on Goffman); also Brown & Levinson 1978 | Sociolinguistics & Pragmatics (multi-party conversation) |
| Face-work, Deference & Demeanor | Goffman 1955, "On Face-Work" (Psychiatry 18:213–231); Goffman 1956/1967, "The Nature of Deference and Demeanor" | Sociolinguistics & Pragmatics (multi-party conversation) |
| Floor management with 3+ parties (selection, schisming, addressee vs. next-speaker) | Sacks, Schegloff & Jefferson 1974 (next-speaker selection); Egbert 1997 (schisming); Auer 2018 / Lerner 2003 (gaze, addressing, next-speaker selection in 3-party talk) | Conversation Analysis & Interactional Linguistics (organization of multi-party talk) |
| Floor-control & participation inequality (turn-taking, dominance, silencing) | Sacks, Schegloff & Jefferson 1974 (turn-taking systematics); Edelsky 1981 "Who's got the floor?" (singly-developed F1 vs. collaborative F2 floor); conversational-dominance literature | Social psychology of group conversation and small-group dynamics |
| Group polarization | Moscovici & Zavalloni 1969, "The group as a polarizer of attitudes" (J. Personality & Social Psychology); related risky-shift work, Stoner 1961 | Social psychology of group conversation and small-group dynamics |
| Indirect Speech Acts | Searle 1975, "Indirect Speech Acts" (in Syntax and Semantics 3); Searle 1969, "Speech Acts" | Sociolinguistics & Pragmatics (multi-party conversation) |
| Informational vs. normative social influence (conformity) | Deutsch & Gerard 1955, "A Study of Normative and Informational Social Influences upon Individual Judgment" (J. Abnormal & Social Psychology); rooted in Asch 1951/1956 conformity line experiments | Social psychology of group conversation and small-group dynamics |
| Overlap & overlap-resolution | Schegloff 2000, "Overlapping talk and the organization of turn-taking for conversation", Language in Society 29(1):1–63 (extends Sacks/Schegloff/Jefferson 1974) | Conversation Analysis & Interactional Linguistics (organization of multi-party talk) |
| Participation Framework & Footing | Goffman 1981, "Footing" (in Forms of Talk) | Sociolinguistics & Pragmatics (multi-party conversation) |
| Participation framework & footing (Goffman) | Goffman 1981, "Footing", in Forms of Talk, University of Pennsylvania Press; extended by C. Goodwin & M. H. Goodwin 2004, "Participation" | Conversation Analysis & Interactional Linguistics (organization of multi-party talk) |
| Recipient design (audience design) | Sacks & Schegloff (recipient design, e.g., Sacks 1992 Lectures; Sacks & Schegloff 1979 on reference); cf. H. H. Clark & Murphy 1982 "audience design"; Giles' accommodation | Conversation Analysis & Interactional Linguistics (organization of multi-party talk) |
| Repair (self/other-initiation, self/other-repair) | Schegloff, Jefferson & Sacks 1977, "The Preference for Self-Correction in the Organization of Repair in Conversation", Language 53:361–382 | Conversation Analysis & Interactional Linguistics (organization of multi-party talk) |
| Sequence organization & expansion (pre-/insert-/post-) | Schegloff 2007, "Sequence Organization in Interaction: A Primer in Conversation Analysis, Vol. 1", Cambridge University Press | Conversation Analysis & Interactional Linguistics (organization of multi-party talk) |
| Social loafing / free-riding (and effort responsibility) | Latané, Williams & Harkins 1979, "Many hands make light the work: The causes and consequences of social loafing" (J. Personality & Social Psychology); Karau & Williams 1993 meta-analysis | Social psychology of group conversation and small-group dynamics |
| Status hierarchies & expectation states in talk | Berger, Cohen & Zelditch 1972 and Berger, Fisek, Norman & Zelditch 1977 (Expectation States / Status Characteristics Theory); Ridgeway & Berger syntheses | Social psychology of group conversation and small-group dynamics |
| Theory of mind & perspective-taking in groups (egocentric anchoring) | Keysar, Barr, Balin & Brauner 2000, "Taking Perspective in Conversation" (Psychological Science), on egocentric anchoring & adjustment; Premack & Woodruff 1978 on theory of mind | Social psychology of group conversation and small-group dynamics |
| Turn-taking systematics (TRPs & turn-allocation) | Sacks, Schegloff & Jefferson 1974, "A Simplest Systematics for the Organization of Turn-Taking for Conversation", Language 50:696–735 | Conversation Analysis & Interactional Linguistics (organization of multi-party talk) |