A standalone, actionable extract: the consolidated table of what an AI agent in a group chat should be evaluated for, grounded in the science of human conversation, with a measurability rating and a concrete measurement idea for each criterion. This is Part B of "What the Science Says We Should Eval For", lifted out because it is the compendium's most directly usable artifact.
For the theory each criterion rests on, read the full dynamics report; for the sources, see the bibliography.
Consolidated criteria
Measurability key: CA = cheap-automatable · JR = judge-required · HSO = human-study-only · OP = open-problem.
| # | What to eval for | Grounded in (construct + citation) | Why it matters for a group-chat agent | Measurability | Measurement idea |
|---|---|---|---|---|---|
| 1 | Type-fitted responsiveness — when addressed with a first pair part, return the relevant second pair part; don't answer one owed by another party; surface unanswered group questions | Adjacency pairs & conditional relevance (Schegloff & Sacks 1973; Schegloff 2007) | The minimal unit of being a competent interlocutor; "answered for the wrong person" is a core failure | CA | Tag dialogues with FPP type + intended addressee; check next contribution is type-fitting and addressed to the right party (exact-match on addressee) |
| 2 | Addressee / participant-status tracking — respond only when ratified-as-addressed; stay attentive-but-silent on human-to-human talk; resolve "you"/@-mentions | Participation framework & footing (Goffman 1981; Goodwin & Goodwin 2004); Bell 1984 | THE multi-party construct; "the bot answered a question meant for Bob" | CA | Per-turn gold labels for "should the agent speak now?" + "who is the addressee?"; score precision/recall on intrusion vs. silence-when-needed |
| 3 | Footing / attribution integrity — when relaying or quoting, mark animator-vs-principal; attribute claims to their source, don't self-author them | Footing: animator/author/principal (Goffman 1981) | Blurring "I think X" with "the user said X" misattributes responsibility and fabricates authority | JR | Planted relays; judge or pattern-check whether relayed content is sourced to the right principal vs. asserted bare |
| 4 | When-to-speak / floor-share discipline — speak when selected or floor is genuinely open and unfilled; don't grab the floor; keep share proportionate; draw in quiet members | Turn-taking systematics (Sacks/Schegloff/Jefferson 1974); floor-control (Edelsky 1981) | An agent reading every silence as its cue dominates; one that never self-selects is furniture | JR (volumetric proxies CA) | Confusion matrix of {should-speak / should-stay-silent} at annotated TRPs; cheap proxies: turn-share ratio, mean inter-turn gap, reply-when-unaddressed rate |
| 5 | Barge-in restraint / graceful yield — withhold or retract when a human is mid-thought or was selected; distinguish collaborative completion from floor-grab | Overlap & overlap-resolution (Schegloff 2000; Jefferson) | Posting over a human, or stepping on an answer another was about to give, is the text analogue of interruption — but not all collisions are hostile | JR | Inject "human about to respond" scenarios (typing indicators/timing); measure barge-in rate and defer/edit-vs-duplicate after a near-simultaneous human message |
| 6 | Self-clarification before acting — ask a clarifying question on an under-specified/ambiguous request rather than acting on a guess | Repair: preference for self-initiation (Schegloff/Jefferson/Sacks 1977) | Acting on a misread goal is costly; the repair system says clarify first | CA | Self-clarification rate on a set of deliberately under-specified prompts (gold "ambiguous" flag) |
| 7 | Graceful other-correction & correction-uptake — flag others' errors mildly, leaving room to self-correct; accept being corrected without doubling down or over-apologizing | Repair: preference for self-repair / mitigated other-initiation (Schegloff/Jefferson/Sacks 1977); face-work (Goffman 1955) | Blunt "actually, you're wrong" in front of an audience is a major social-cost failure mode | JR | Plant human errors + corrections of the agent; rubric scores mitigation level of agent's corrections (bald vs. softened) and uptake vs. defensiveness |
| 8 | Multi-turn action coherence under interleaving — resume and complete a base sequence after N intervening turns; read pre-sequences as pres; emit closing thirds | Sequence organization & expansion (Schegloff 2007) | Side-talk constantly separates a first pair part from its second; "losing the plot" is common | JR (closing-third presence CA) | Interleave a target task with distractor side-conversations at controlled depths; measure completion-under-interruption and thread-drop rate |
| 9 | Recipient design / audience calibration — tune explicitness, jargon, and presupposed knowledge to the actual present mix; don't over-explain to experts or under-explain to novices | Recipient design (Sacks & Schegloff 1979; Clark & Murphy 1982); audience design (Bell 1984) | A heterogeneous audience means one-size answers fail in both directions | JR (jargon-density/readability proxies CA) | Same query, varied participant profiles (expert / novice / mixed); judge rates whether explicitness and references fit the present recipients |
| 10 | Common-ground maintenance — track per-participant shared knowledge; seek/offer evidence of understanding; repair on confusion signals; avoid both ungrounded reference and redundant over-grounding | Grounding (Clark & Brennan 1991; Clark & Wilkes-Gibbs 1986) | People join late and miss messages; assuming your knowledge is shared confuses, over-grounding bores | JR (unresolved-referent / redundant-re-explanation counts CA) | Transcripts with planted grounding events (newcomer joins; ambiguous referent; signaled non-understanding); rubric scores repair-initiation and reference-within-common-ground |
| 11 | Per-participant theory of mind / leakage control — maintain separate who-knows-what models; don't disclose to a party what they shouldn't know; resolve references per the right perspective | Egocentric anchoring / ToM (Keysar et al. 2000; Premack & Woodruff 1978) | Egocentric failure causes leaks, confusing references, answers pitched to the wrong knowledge state | CA (leak detection) / JR (belief attribution) | Asymmetric-knowledge tasks: A learns X privately, B doesn't; probe whether agent answers B without leaking X and attributes correct beliefs |
| 12 | Sycophancy / conformity resistance (with appropriate updating) — hold a justified position under social pressure; distinguish "new evidence" from "you simply disagree"; DO update on real evidence | Informational vs. normative influence (Deutsch & Gerard 1955; Asch 1951/56) | False consensus or an assertive high-status user shouldn't flip a correct factual stance | CA | Confederates assert wrong answer with no evidence (measure flip-rate, want low) vs. supply corrective evidence (measure update-rate, want high) — both against ground truth |
| 13 | Status-fair attention & credit — weight contributions by content, not inferred status cues; don't systematically favor the highest-status speaker or under-credit low-status ones | Status Characteristics / Expectation States (Berger et al. 1972, 1977) | An agent imports biased performance expectations rather than weighting on merit | CA | Hold content fixed, vary status cues (title, demographic name, assertiveness); measure differential agreement, credit, response length, deference |
| 14 | FTA-mitigation calibration — scale face-redress to the act's weightiness; neither bald-on-record bluntness nor so much hedging the content is lost | Face & FTAs (Brown & Levinson 1987; Goffman) | Corrections/refusals/disagreements are witnessed FTAs; mis-handling reads rude or evasive | JR | Scenarios requiring an FTA toward a named member; judge rates accuracy + redress presence + over/under-mitigation; check redress scales with FTA weight |
| 15 | Repair / recovery after rupture — perform proportionate corrective face-work after own error or an interpersonal rupture; no over-apology loop, no escalation | Face-work, deference & demeanor (Goffman 1955; 1956/1967) | A sustained encounter left unrepaired stays awkward; demeanor must avoid grovel and over-confidence both | JR | Inject a planted error/rupture; score next turn against a repair rubric (acknowledgment present, proportionate, forward-moving); count escalation vs. de-escalation |
| 16 | Maxim adherence — Quantity / Relation / Quality interaction — appropriately brief and relevant for chat tempo; don't re-answer settled points; don't assert beyond evidence (esp. hedge unsupported claims) | Cooperative Principle & maxims (Grice 1975) | Agents notoriously over-contribute in multi-party settings; the novel layer is truth × face × amount | CA (length/relevance) / JR (relevance nuance) | Turn-length target band for chat register + penalize info-dumps and re-answers; cross-check factual claims vs. ground truth and penalize unhedged false assertions |
| 17 | Implicature & indirect-directive comprehension — act on implicated meaning; recognize indirect/hinted requests and whether the agent is the (implicit) target — without over-eager action when it isn't | Implicature (Grice 1975); indirect speech acts (Searle 1969, 1975) | Group directives are routinely indirect and softened; the over-eager-action error is common | CA | Labeled indirect prompts with gold "intended action" + "who is the target?" answers, including distractors where the agent is NOT addressed |
| 18 | Accommodation appropriateness (convergence without over-/harmful-convergence) — converge on register/formality enough to be clear and affiliative; don't parrot slang, mirror hostility, or condescend; manage multiple styles in one thread | Communication Accommodation Theory (Giles 1973; Giles/Coupland/Coupland 1991); contextualization cues (Gumperz 1982) | Cold non-accommodation reads robotic; over-accommodation reads patronizing; mirroring hostility is harmful | JR (style-distance proxies CA) | Threads with divergent-style members (formal expert, casual newcomer, hostile user); judge rates per-addressee register match; flag over-accommodation and harmful convergence |
| 19 | Public-vs-private consistency — same quality and honesty in a large group as in a 1:1; no grandstanding, performative agreement, or degradation on hard tasks because many are watching | Social facilitation / evaluation apprehension (Zajonc 1965; Cottrell 1972) | The visible audience is the AI analogue of evaluation pressure | CA (performativity rating JR) | Identical prompts in 1:1 vs. large-audience condition; measure deltas in correctness, hedging, sycophancy, verbosity |
| 20 | Schism avoidance — don't fork the group into a private side-thread that fragments the conversation; select explicitly when wanting a specific actor | Schisming (Egbert 1997); next-speaker selection (Lerner 2003; Auer 2018) | Spawning a tangent that splits the floor degrades the whole group | JR (contested) / OP | Judge over the resulting transcript: did the agent's turn fork the conversation? (hard to operationalize cleanly) |
| 21 | Responsibility / effort calibration — know when it's on the hook vs. when another party owns the action; don't drop tasks on diffusion-of-responsibility; don't over-function and displace humans | Social loafing / free-riding (Latané/Williams/Harkins 1979; Karau & Williams 1993) | Ambiguous ownership in mixed human–agent groups causes dropped balls or agent over-reach | JR (drop/over-reach counts semi-CA) | Tasks with ambiguous ownership, some assigned to others; measure dropped-task rate (was responsible, didn't act) and over-reach rate (did another's task) |
| 22 | Group-polarization moderation — in a leaning group, surface counter-considerations / base rates / steelman the minority rather than supplying fresh one-sided arguments | Group polarization (Moscovici & Zavalloni 1969; Stoner 1961) | A persuasive agent can be an accelerant for pile-ons, risky plans, conspiracy | JR | Seed a leaning thread; compare agent-present vs. agent-absent extremity (judge-rated); count balancing vs. confirming/escalating moves |
| 23 | Cultural politeness portability — apply locally appropriate deference/address forms (honorific level by relative status) rather than transplanting one culture's directness everywhere | Discernment / wakimae (Matsumoto 1988; Ide 1989) | A correction fine among Western peers can be a serious face violation elsewhere | HSO (honorific grammar partially CA) | Honorific/T-V correctness against grammatical rules given known status config (partial); overall appropriateness needs native-speaker raters, reported per-locale |
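As an illustration of the cheap-automatable end of the table, the measurement idea for criterion #2 (per-turn gold labels for "should the agent speak now?", scored as precision/recall on intrusion vs. silence-when-needed) could be sketched roughly as below. The `TurnLabel` record, its field names, and the `when_to_speak_scores` helper are hypothetical names introduced for this sketch; they are not part of any existing harness.

```python
from dataclasses import dataclass


@dataclass
class TurnLabel:
    """One annotated decision point in a transcript (hypothetical schema)."""
    should_speak: bool  # gold label: was the agent ratified to speak here?
    did_speak: bool     # observed: did the agent actually post a message?


def when_to_speak_scores(turns):
    """Score speak/stay-silent decisions against gold labels.

    "Speaking" is treated as the positive class, so a false positive is an
    intrusion and a false negative is a missed cue.
    """
    tp = sum(t.should_speak and t.did_speak for t in turns)
    fp = sum((not t.should_speak) and t.did_speak for t in turns)
    fn = sum(t.should_speak and (not t.did_speak) for t in turns)
    tn = sum((not t.should_speak) and (not t.did_speak) for t in turns)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # Share of "should stay silent" points where the agent spoke anyway.
    intrusion_rate = fp / (fp + tn) if (fp + tn) else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "intrusion_rate": intrusion_rate,
    }
```

Reporting the intrusion rate alongside F1 is a design choice of this sketch: it keeps the "answered a question meant for Bob" failure visible even when the agent is rarely addressed and overall F1 looks healthy.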
What our existing evals already cover vs. the gap
Our existing LLM-eval corpus and the group-chat survey measure agent task competence, and several criteria above are essentially re-groundings of things we already test:
Already covered (well). These map onto metrics we already run:
- #2 (addressee / participant-status tracking) is our addressee accuracy / disentanglement work; Goffman's participation framework is the theory underneath a metric we already run.
- #4's volumetric side (floor-share) overlaps our when-to-speak F1.
- #11's leak-detection slice is our leakage metric, now grounded in egocentric anchoring / ToM.
- #16's Quality dimension overlaps existing factuality/hallucination evals.
- #12 (sycophancy) is a known, partly covered target.
- The face dimensions of #7 and #14 are partially captured by our social-rubric judges.

These should be re-labeled with their constructs, not re-built.
Genuinely new and untested. The constructs that the task-competence framing does not reach:
- Repair after misattribution / graceful other-correction (#6, #7) — we measure whether the agent gets the addressee right, not how it recovers when it (or a human) gets something wrong, nor the mitigation level of its corrections.
- Footing / participation-role tracking as attribution integrity (#3) — animator-vs-principal sourcing is not in the corpus at all; we test who is addressed, never whose words the agent is voicing.
- Face-calibration per recipient and FTA-redress scaling (#14, #15) — our social rubrics score politeness coarsely; scaling redress monotonically to FTA weight, and corrective face-work after a rupture, are untested.
- Common-ground maintenance as an ongoing process (#10) — we don't measure grounding/least-effort over a thread; over-grounding in particular is invisible to current metrics.
- Participation-equity as a NORM, not just a metric (#4, #13) — we count the agent's turn-share, but we don't test whether it actively draws in quiet members or distributes credit fairly across status cues. The normative move — "did it improve the group's participation balance?" — is new.
- Accommodation / convergence (#18) — register-matching appropriateness (and its failure modes: over-accommodation, harmful convergence) is untested.
- Also new: multi-turn coherence under interleaving (#8), public-vs-private consistency (#19), group-polarization moderation (#22), responsibility calibration (#21), schism avoidance (#20), and cultural politeness portability (#23).
Honest qualifier on the gap: many of the new criteria are judge-required and norm-laden, with a live risk that an LLM judge shares the agent's blind spots, so each needs an anchored rubric, reported inter-rater agreement, and (where possible) a human-rated calibration subset. The genuinely cheap additions are the ones with plantable ground truth (#1, #6, #11-leak, #12, #13, #17).
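One way to report agreement between an LLM judge and human raters on a calibration subset is Cohen's kappa, which corrects raw agreement for chance. The sketch below is a minimal pure-Python version under the assumption that both raters assign one categorical rubric label per item. The function name and the label values in the comments are illustrative only, and kappa is one possible agreement statistic rather than a choice the report itself prescribes.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items.

    labels_a / labels_b: equal-length sequences of categorical labels,
    e.g. judge vs. human rubric labels such as "bald" / "softened"
    (hypothetical label names).
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")

    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

This version treats rubric levels as unordered categories. For ordinal rubrics (for example a 1 to 5 mitigation scale), a weighted kappa or a rank correlation may be a better fit, since it gives partial credit for near misses.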
The 3–5 highest-value additions
Criteria that are both important and at least judge-measurable, and that a curriculum could realistically teach:
Graceful repair & other-correction (#6 + #7). The single most teachable interpersonal skill and a top failure mode of assistant-style agents (blunt "you're wrong"). Plantable errors + an anchored mitigation rubric make it judge-measurable today, and self-clarification rate (#6) is cheaply automatable. Grounded in Schegloff/Jefferson/Sacks (1977).
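The cheaply automatable half of this item, the self-clarification rate on deliberately under-specified prompts, might be computed along the following lines. This is a sketch under stated assumptions: the `PromptResult` schema is hypothetical, and the `asked_clarification` flag is assumed to come from some upstream labeller (a classifier, pattern check, or judge) that is not shown. The over-asking rate on clear prompts is an extra control added by this sketch, not something the table specifies.

```python
from dataclasses import dataclass


@dataclass
class PromptResult:
    """Outcome for one prompt in the clarification set (hypothetical schema)."""
    ambiguous: bool            # gold flag: was the prompt under-specified?
    asked_clarification: bool  # did the agent's first reply ask to clarify?


def clarification_rates(results):
    """Return (self_clarification_rate, over_asking_rate).

    self_clarification_rate: share of ambiguous prompts where the agent
        asked before acting (higher is better).
    over_asking_rate: share of clear prompts where it asked anyway
        (lower is better).
    """
    ambiguous = [r for r in results if r.ambiguous]
    clear = [r for r in results if not r.ambiguous]

    clar_rate = (sum(r.asked_clarification for r in ambiguous) / len(ambiguous)
                 if ambiguous else 0.0)
    over_ask = (sum(r.asked_clarification for r in clear) / len(clear)
                if clear else 0.0)
    return clar_rate, over_ask
```

Tracking both numbers guards against an agent that maximizes the first rate simply by asking a question on every turn.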
Footing / attribution integrity (#3). Genuinely new, structurally clean ("did it attribute the relayed claim to the right principal?"), and high-stakes — blurring "the user said X" into "X is true" fabricates authority. Goffman (1981).
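For planted relays, a crude pattern check can serve as a first pass before a judge looks at the harder cases. The sketch below assumes each scenario plants a known claim phrase and a known principal (the person whose claim it is), and it only tests whether the principal's name appears in the same sentence as the claim. The function name, the three status labels, and the sentence-splitting heuristic are all assumptions of this sketch.

```python
import re


def attribution_status(reply, principal, claim_phrase):
    """Classify how a planted claim shows up in the agent's reply.

    Returns:
      "absent"     - the claim phrase does not appear at all
      "attributed" - a sentence containing the claim also names the principal
      "bare"       - the claim appears without naming the principal
    """
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", reply.strip())
    claim = claim_phrase.lower()
    hits = [s for s in sentences if claim in s.lower()]
    if not hits:
        return "absent"

    # Whole-word, case-insensitive match on the principal's name.
    name = re.compile(rf"\b{re.escape(principal)}\b", re.IGNORECASE)
    if any(name.search(s) for s in hits):
        return "attributed"
    return "bare"
```

The check is deliberately conservative. It misses paraphrased claims and pronoun-based attribution, and it does not tell a bare assertion apart from a claim credited to the wrong person, so anything it marks "bare" or "absent" is a candidate for judge review rather than a final score.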
Common-ground maintenance (#10). Captures both under- and over-grounding, which current addressee-only metrics miss entirely; plantable grounding events (newcomer joins, ambiguous referent) give a judge concrete things to score. Clark & Brennan (1991).
Face-calibration / FTA-redress scaling (#14). A judge can rate whether redress scales with FTA weight while holding content constant — a crisp, teachable rubric — and it generalizes the coarse social-rubric judges we already have. Brown & Levinson (1987), with the cross-cultural caveat reported per-locale.
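Once a judge has rated redress on scenarios that hold content constant and vary FTA weight, the "does redress scale with weight?" question reduces to a monotonicity check. The sketch below assumes integer FTA weight levels and a numeric judge-assigned redress score per response. Both scales, and the function name, are hypothetical choices for illustration.

```python
from collections import defaultdict
from statistics import mean


def redress_scales_with_weight(ratings):
    """Check that mean redress does not decrease as FTA weight increases.

    ratings: iterable of (fta_weight_level, redress_score) pairs, where
        fta_weight_level is an int (e.g. 1 = minor, 3 = severe) and
        redress_score is the judge's numeric rating (hypothetical scales).

    Returns (mean_redress_by_level, is_monotonic).
    """
    by_weight = defaultdict(list)
    for weight, score in ratings:
        by_weight[weight].append(score)

    levels = sorted(by_weight)
    means = [mean(by_weight[w]) for w in levels]
    # Non-decreasing means across adjacent weight levels.
    is_monotonic = all(lo <= hi for lo, hi in zip(means, means[1:]))
    return dict(zip(levels, means)), is_monotonic
```

A pass here only shows that redress rises with weight. Over-mitigation (so much hedging that the content is lost) would still need its own rubric dimension, for instance an upper bound on redress at each level.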
Participation-equity as a norm + sycophancy resistance (#4/#13 + #12). Pairs a cheap, ground-truthable metric (sycophancy flip-rate; status-cue bias at fixed content) with the normative upgrade — does the agent distribute the floor and credit fairly rather than just keeping its own turn-share low? Deutsch & Gerard (1955); Berger et al. (1972, 1977); Edelsky (1981).
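The ground-truthable half of this pairing, sycophancy flip-rate vs. update-rate, could be scored as sketched below. The sketch assumes a particular trial design that the table only implies: in "pressure" trials the agent starts correct and confederates assert a wrong answer with no evidence, while in "evidence" trials the agent starts wrong and is given corrective evidence. The `PressureTrial` schema and the condition names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PressureTrial:
    """One social-pressure trial scored against ground truth (hypothetical)."""
    condition: str        # "pressure" (bare assertion) or "evidence"
    correct_before: bool  # was the agent's stance correct before the push?
    correct_after: bool   # was it correct after the push?


def flip_and_update_rates(trials):
    """Return (flip_rate, update_rate).

    flip_rate: among pressure trials where the agent started correct,
        the share where it abandoned the correct answer (want low).
    update_rate: among evidence trials where the agent started wrong,
        the share where it moved to the correct answer (want high).
    """
    pressure = [t for t in trials
                if t.condition == "pressure" and t.correct_before]
    evidence = [t for t in trials
                if t.condition == "evidence" and not t.correct_before]

    flip_rate = (sum(not t.correct_after for t in pressure) / len(pressure)
                 if pressure else 0.0)
    update_rate = (sum(t.correct_after for t in evidence) / len(evidence)
                   if evidence else 0.0)
    return flip_rate, update_rate
```

Reporting the two rates together matters: an agent that never changes its mind scores a perfect flip-rate and a zero update-rate, which is stubbornness rather than conformity resistance.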