Bibliography & Sources

Every named resource cited across the two reports, consolidated in one place. The eval-mechanics sources (from the Eval Survey) are benchmarks, datasets, frameworks, and papers; the conversation-science sources (from What to Eval For) are the foundational theory.

Verification column: ✓ independently confirmed real / correctly attributed during the research run · ⚠ flagged inaccurate or self-corrected (see the report's inline note) · — not verifiable. Raw verified data: eval-survey · dynamics.

Eval-mechanics sources (39)

Benchmarks, datasets, frameworks, and papers for measuring group-chat agents.

Resource	Kind	Year	Verification	What it is
Addressee and Response Selection for Multi-Party Conversation (Ouchi & Tsuboi)	dataset	2016	✓	Foundational EMNLP 2016 paper that formalized the joint task of selecting BOTH whom an agent addresses and what it says in a multi-party conversation. Released a large…
Addressee and Response Selection for Multi-Party Conversation (Ubuntu IRC benchmark)	benchmark	2016	✓	The foundational EMNLP 2016 paper (Ouchi & Tsuboi) that defines the joint Addressee-and-Response-Selection (ARS) task on the Ubuntu Multiparty Conversation Corpus: given…
Addressee and Response Selection (Ubuntu IRC / Hu et al.)	dataset	2018	✓	Canonical pre-LLM multi-party task and dataset built from Ubuntu IRC chat logs, where speakers play sender/addressee/observer roles. The system must pick both the correct…
A Large-Scale Corpus for Conversation Disentanglement (Kummerfeld et al.) / irc-disentanglement	dataset	2019	✓	ACL 2019 release of 77,563 #Ubuntu/#Linux IRC messages manually annotated with reply-to (parent-child) links forming reply-structure graphs that both disentangle interleaved…
Who Is Speaking to Whom? W2W model (Le, Hu et al.)	paper	2019	✓	EMNLP-IJCNLP 2019 paper introducing the who-to-whom (W2W) model that identifies the addressee of EVERY utterance in a session jointly, not just the next response. Uses…
Molweni	dataset	2020	✓	A multi-party dialogue machine-reading-comprehension dataset sampled from the Ubuntu Chat Corpus: 9,754 dialogues, 86,042 utterances, 30,066 question-answer pairs, annotated…
Molweni	dataset	2020	✓	COLING 2020 machine-reading-comprehension dataset over multiparty dialogue, sampled from the Ubuntu Chat Corpus: 10,000 dialogs / 88,303 utterances, 30,066 questions (incl.…
TurnGPT	paper	2020	✓	A GPT-2-based language model (Ekstedt & Skantze) that predicts turn-shifts by adding Transition Relevance Place (TRP) tokens to the vocabulary, projecting turn completion from…
WHO Says WHAT to WHOM: A Survey of Multi-Party Conversations	paper	2022	✓	IJCAI 2022 survey framing multi-party conversation research around the three coupled questions of WHO (speaker), WHAT (utterance), and to WHOM (addressee), surveying tasks,…
NormBank (SCENE taxonomy)	dataset	2023	✓	A knowledge bank of 155k situational social norms (Ziems et al., ACL 2023) where each norm is grounded in a multivalent sociocultural frame — setting, agent roles, attributes,…
SOTOPIA / SOTOPIA-Eval	benchmark	2023	✓	An open-ended environment that simulates goal-driven social interactions between LLM agents who role-play diverse character profiles with private goals and relationship…
Large Language Models Know What To Say But Not When To Speak (TRP benchmark)	paper	2024	✓	An EMNLP 2024 Findings paper (Umair, Sarathy, de Ruiter) that introduces a dataset of participant-labeled within-turn Transition Relevance Places (TRPs) in unscripted spoken…
Lla-VAP: LSTM Ensemble of Llama and VAP for Turn-Taking Prediction	paper	2024	✓	Combines an LLM (Llama, semantic/syntactic context) with a Voice Activity Projection model (acoustic cues) via an LSTM ensemble to predict turn-taking opportunities, fusing…
MUCA + MUS (Multi-User Chat Assistant / Multi-User Simulator)	framework	2024	✓	Described as the first LLM framework dedicated to multi-user group conversations, organized around the 3W design (What to say, When to respond, Who to answer) via a Sub-topic…
MUCA: Multi-User Chat Assistant framework	framework	2024	✓	An LLM framework for facilitating group text conversations whose Utterance Strategies Arbitrator explicitly decides the What/When/Who of a bot utterance, using an 'in-context…
PersonaGym / PersonaScore	benchmark	2024	✓	The first dynamic evaluation framework for persona agents (200 personas, 10k questions, 150 environments), with PersonaScore as an automated, human-aligned metric grounded in…
RENOVI: Remediating Norm Violations in Socio-Cultural Conversations	benchmark	2024	✓	A large-scale corpus of 9,258 multi-turn dialogues (512 human-authored + 8,746 ChatGPT-synthesized) annotated with social norms, designed to evaluate detecting and remediating…
Addressee Recognition in Multi-modal Multi-party Dialogue (LLM benchmark)	benchmark	2025	✓	A benchmark built on a multi-modal corpus of triadic (3-participant) discussions that tests whether an LLM can identify the addressee — who is being spoken to / who should take…
An LLM Benchmark for Addressee Recognition in Multi-modal Multi-party Dialogue	benchmark	2025	✓	A benchmark built on a multimodal triadic (3-party) dialogue corpus with addressee annotations (explicit addressees occur in ~20% of turns), testing whether LLMs can identify…
An LLM Benchmark for Addressee Recognition in Multi-modal Multi-party Dialogue (Inoue et al., TEIDAN)	benchmark	2025	✓	Kyoto University benchmark testing whether modern LLMs (GPT-4o) can do addressee recognition and next-speaker prediction in spontaneous triadic (3-person) dialogue, using the…
Beyond Words: Multimodal LLM Knows When to Speak	paper	2025	✓	Builds a dataset annotated for turn-taking labels, backchannel signals (e.g. 'mm-hmm'), and speech timing from Fisher, MAHNOB-HCI, and Harper Valley Bank corpora, training a…
DEBATE: Benchmark for Role-Playing LLM Agents in Multi-Agent, Long-Form Debates	benchmark	2025	✓	A large-scale benchmark (30,707 messages, 2,832 U.S. participants, 708 groups, 107 topics) that measures whether multi-agent role-playing LLMs reproduce authentic human group…
DICE-Bench	benchmark	2025	✓	Dialogue-based Interactive Calling Evaluation Benchmark: the first multi-round, multi-party benchmark for function/tool-calling grounded in realistic group-chat data. 1,607…
HSII (How Social Is It?)	benchmark	2025	✓	A benchmark explicitly built to assess LLMs as autonomous social agents in multi-user, multi-turn settings (avg 6.72 participants, 7.8 turns per scenario), as opposed to…
MAGPIE	benchmark	2025	✓	Multi-AGent contextual PrIvacy Evaluation: ~200 high-stakes tasks (earlier version: 158 scenarios across 15 domains) evaluating privacy preservation in multi-agent,…
Multi-Party Conversational Agents: A Survey	paper	2025	✓	A survey organizing multi-party conversational-agent research, with explicit sub-sections on Turn Detection (when to speak) and Addressee Selection (whom to address),…
Multi-Party Conversational Agents: A Survey	paper	2025	✓	A 2025 survey of multi-party conversational agents that organizes the field around the sub-capabilities required for group settings, including who-speaks-next / turn-taking,…
MultiAgentBench	benchmark	2025	✓	A benchmark evaluating LLM-based multi-agent systems across interactive scenarios with both cooperative (mutual-goal) and competitive (conflicting-goal) settings, supporting…
MultiAgentBench (MARBLE)	benchmark	2025	✓	A benchmark suite (Zhu et al.) for LLM multi-agent systems across cooperative (research collab, Minecraft build, DB diagnosis, coding) and competitive (bargaining, Werewolf…
Multimodal Conversation Structure Understanding (MCSU)	benchmark	2025	✓	A 2025 benchmark for evaluating (multimodal) LLMs on the structural fabric of multi-party conversation — including speaker/addressee and reply-to relations — beyond surface…
ProMediate	framework	2025	✓	A socio-cognitive framework (USC + Microsoft) for evaluating proactive AI mediator agents in multi-topic, multi-party negotiations. Includes a simulation testbed with…
SAGE	framework	2025	✓	A top-down/bottom-up knowledge-grounded user simulator for multi-turn agent evaluation (Columbia DAPLab, Findings of EACL 2026). Grounds simulated users in business logic…
tau2-bench (τ²-bench)	benchmark	2025	✓	Sierra's benchmark for tool-agent-user interaction. τ²-bench extends τ-bench to a dual-control setting (Telecom domain) where BOTH the simulated user and the agent can call…
The Social Laboratory: A Psychometric Framework for Multi-Agent LLM Evaluation	framework	2025	✓	A psychometric evaluation framework (Zarreen Reza) that assesses LLMs as social actors inside multi-agent debates rather than in isolation, using a 3-round multi-party debate…
Triadic Multi-party Voice Activity Projection (VAP) for Turn-taking	paper	2025	✓	First extension of Voice Activity Projection to triadic (3-party) spoken conversation, predicting each speaker's future voice activity from acoustics to determine who takes the…
GroupMemBench	benchmark	2026	✓	A benchmark for LLM agent MEMORY specifically in multi-party conversations, motivated by the fact that nearly all memory benchmarks assume a dyadic single-user setup while real…
Mind the Sim2Real Gap in User Simulation for Agentic Tasks	paper	2026	✓	A CMU LTI study quantifying how faithfully LLM user simulators replicate real human behavior in agent interactions, and how that mismatch distorts benchmark scores.
MPCEval: A Benchmark for Multi-Party Conversation Generation	benchmark	2026	✓	A standardized, task-aware framework for evaluating multi-party conversation generation, covering both next-message prediction and full-conversation generation across varied…
RealUserSim	framework	2026	✓	A user-simulation framework grounded in real behavioral data: extracts 7,275 executable behavioral profiles from 14,000+ authentic human-LLM conversations (WildChat) and uses…

Conversation-science sources (25)

Foundational constructs from conversation analysis, the social psychology of groups, and sociolinguistics/pragmatics that ground the eval criteria.

Construct	Seminal source	Discipline	Verification
Adjacency pairs & conditional relevance	Schegloff & Sacks 1973, 'Opening up closings', Semiotica 8:289–327; elaborated in Schegloff 2007, 'Sequence Organization in Interaction'	Conversation Analysis & Interactional Linguistics (organization of multi-party talk)	✓
Audience Design & Style-Shifting	Bell 1984, "Language Style as Audience Design" (Language in Society 13:145–204)	Sociolinguistics & Pragmatics (multi-party conversation)	✓
Audience effects / social facilitation & evaluation apprehension	Zajonc 1965, "Social Facilitation" (Science) — mere-presence drive theory; Cottrell 1972 evaluation-apprehension refinement	Social psychology of group conversation and small-group dynamics	✓
Common ground & grounding (least collaborative effort, grounding criterion)	Clark & Brennan 1991, "Grounding in Communication" (in Resnick et al., eds., Perspectives on Socially Shared Cognition); building on Clark & Wilkes-Gibbs 1986	Social psychology of group conversation and small-group dynamics	✓
Communication Accommodation Theory (convergence / divergence)	Giles 1973; Giles, Coupland & Coupland 1991 (Contexts of Accommodation); Giles & Ogay 2007 review	Social psychology of group conversation and small-group dynamics	✓
Contextualization Cues, Code-Switching & Register	Gumperz 1982, "Discourse Strategies" (conversational code-switching, contextualization cues)	Sociolinguistics & Pragmatics (multi-party conversation)	✓
Cooperative Principle & Conversational Maxims (Implicature)	Grice 1975, "Logic and Conversation" (in Syntax and Semantics 3)	Sociolinguistics & Pragmatics (multi-party conversation)	✓
Cultural Variation in Face & Politeness (Discernment / Wakimae)	Matsumoto 1988 and Ide 1989 (critiques of Brown & Levinson using Japanese); concept of wakimae/discernment politeness	Sociolinguistics & Pragmatics (multi-party conversation)	✓
Face & Face-Threatening Acts (FTAs)	Brown & Levinson 1987, "Politeness: Some Universals in Language Usage" (building on Goffman); also Brown & Levinson 1978	Sociolinguistics & Pragmatics (multi-party conversation)	✓
Face-work, Deference & Demeanor	Goffman 1955, "On Face-Work" (Psychiatry 18:213–231); Goffman 1956/1967, "The Nature of Deference and Demeanor"	Sociolinguistics & Pragmatics (multi-party conversation)	✓
Floor management with 3+ parties (selection, schisming, addressee vs. next-speaker)	Sacks, Schegloff & Jefferson 1974 (next-speaker selection); Egbert 1997 (schisming); Auer 2018 / Lerner 2003 (gaze, addressing, next-speaker selection in 3-party talk)	Conversation Analysis & Interactional Linguistics (organization of multi-party talk)	✓
Floor-control & participation inequality (turn-taking, dominance, silencing)	Sacks, Schegloff & Jefferson 1974 (turn-taking systematics); Edelsky 1981 "Who's got the floor?" (singly-developed F1 vs. collaborative F2 floor); conversational-dominance literature	Social psychology of group conversation and small-group dynamics	✓
Group polarization	Moscovici & Zavalloni 1969, "The group as a polarizer of attitudes" (J. Personality & Social Psychology); related risky-shift work, Stoner 1961	Social psychology of group conversation and small-group dynamics	✓
Indirect Speech Acts	Searle 1975, "Indirect Speech Acts" (in Syntax and Semantics 3); Searle 1969, "Speech Acts"	Sociolinguistics & Pragmatics (multi-party conversation)	✓
Informational vs. normative social influence (conformity)	Deutsch & Gerard 1955, "A Study of Normative and Informational Social Influences upon Individual Judgment" (J. Abnormal & Social Psychology); rooted in Asch 1951/1956 conformity line experiments	Social psychology of group conversation and small-group dynamics	✓
Overlap & overlap-resolution	Schegloff 2000, 'Overlapping talk and the organization of turn-taking for conversation', Language in Society 29(1):1–63 (extends Sacks/Schegloff/Jefferson 1974)	Conversation Analysis & Interactional Linguistics (organization of multi-party talk)	✓
Participation Framework & Footing	Goffman 1981, "Footing" (in Forms of Talk)	Sociolinguistics & Pragmatics (multi-party conversation)	✓
Participation framework & footing (Goffman)	Goffman 1981, 'Footing', in Forms of Talk, University of Pennsylvania Press; extended by C. Goodwin & M. H. Goodwin 2004, 'Participation'	Conversation Analysis & Interactional Linguistics (organization of multi-party talk)	✓
Recipient design (audience design)	Sacks & Schegloff (recipient design, e.g., Sacks 1992 Lectures; Sacks & Schegloff 1979 on reference); cf. H. H. Clark & Murphy 1982 'audience design'; Giles' accommodation	Conversation Analysis & Interactional Linguistics (organization of multi-party talk)	✓
Repair (self/other-initiation, self/other-repair)	Schegloff, Jefferson & Sacks 1977, 'The Preference for Self-Correction in the Organization of Repair in Conversation', Language 53:361–382	Conversation Analysis & Interactional Linguistics (organization of multi-party talk)	✓
Sequence organization & expansion (pre-/insert-/post-)	Schegloff 2007, 'Sequence Organization in Interaction: A Primer in Conversation Analysis, Vol. 1', Cambridge University Press	Conversation Analysis & Interactional Linguistics (organization of multi-party talk)	✓
Social loafing / free-riding (and effort responsibility)	Latané, Williams & Harkins 1979, "Many hands make light the work: The causes and consequences of social loafing" (J. Personality & Social Psychology); Karau & Williams 1993 meta-analysis	Social psychology of group conversation and small-group dynamics	✓
Status hierarchies & expectation states in talk	Berger, Cohen & Zelditch 1972 and Berger, Fisek, Norman & Zelditch 1977 (Expectation States / Status Characteristics Theory); Ridgeway & Berger syntheses	Social psychology of group conversation and small-group dynamics	✓
Theory of mind & perspective-taking in groups (egocentric anchoring)	Keysar, Barr, Balin & Brauner 2000, "Taking Perspective in Conversation" (Psychological Science) — egocentric anchoring & adjustment; Premack & Woodruff 1978 on theory of mind	Social psychology of group conversation and small-group dynamics	✓
Turn-taking systematics (TRPs & turn-allocation)	Sacks, Schegloff & Jefferson 1974, 'A Simplest Systematics for the Organization of Turn-Taking for Conversation', Language 50:696–735	Conversation Analysis & Interactional Linguistics (organization of multi-party talk)	✓