A research-grounding document for the eval curriculum. The existing corpus and prior survey measured agent task competence — addressee accuracy, when-to-speak F1, disentanglement, leakage, social-rubric judges. This document covers the layer underneath: the theory of the dynamics themselves, drawn from conversation analysis, the social psychology of groups, and sociolinguistics/pragmatics — and what that theory says we OUGHT to be measuring.
A note before we start: the modality gap
All three disciplines were built on spoken, co-present, multimodal interaction. Their evidence base is full of gaze, prosody, and acoustic overlap-resolution — none of which transfer 1:1 to text group chat. Throughout this document, the constructs port as functional analogues: acoustic overlap becomes posting-while-someone-is-typing; gaze-based next-speaker selection becomes the @-mention; contextualization cues live in lexical register, punctuation, emoji, and message timing. Where a construct does not port cleanly, the document says so rather than pretending the mapping is clean. Two other honest caveats run throughout: (1) CA is anti-quantitative at its origin — turning participant-oriented structures into scalar metrics is a deliberate, contestable departure from the source discipline; and (2) the interpersonally important constructs are judge-required and norm-laden, and an LLM judge can share the agent's own blind spots, which is a live threat to validity.
PART A — Theory Grounding
1. Conversation Analysis & Interactional Linguistics: how multi-party talk is organized
The foundational claim of conversation analysis (CA) is that talk is not chaos overlaid with etiquette — it is a machine with rules, locally managed and administered by the parties themselves, moment to moment. The seminal statement is Sacks, Schegloff & Jefferson (1974), A Simplest Systematics for the Organization of Turn-Taking for Conversation. Talk is built from turn-constructional units (TCUs) whose completion points are projectable in advance — these are Transition-Relevance Places (TRPs). At each TRP, an ordered rule set governs who speaks next: (a) the current speaker may select the next speaker ("current selects next"); failing that, (b) any party may self-select, and the first starter wins the turn; failing that, (c) the current speaker may simply continue. The system is engineered to minimize both gap and overlap.
This matters enormously the moment you have three or more parties, because every TRP becomes an open auction for the floor. An agent that reads every silence as its cue will monopolize; an agent that never self-selects becomes furniture. Competence here is not "respond well" — it is reading whether you were selected, whether the floor is genuinely open, and whether to continue or yield.
The "one party talks at a time" rule is a norm, not a fact, and Schegloff (2000), Overlapping talk and the organization of turn-taking for conversation, shows that overlap is systematic, not pathological. Participants run an overlap-resolution device — competitive resources (louder, higher, faster) plus dropout and recycling — that usually resolves simultaneous talk within a beat. Crucially, overlap is not the same as interruption: terminal overlaps, recognitional onsets, and collaborative completions are not hostile. (The onset typology Schegloff builds on traces to Jefferson's earlier categorization.) The lesson for any "barge-in" metric: a system that treats all collisions as violations will misclassify benign back-channels and collaborative completions as rudeness.
When understanding breaks down, talk has a dedicated subsystem: repair. Schegloff, Jefferson & Sacks (1977), The Preference for Self-Correction in the Organization of Repair in Conversation, describe an organized set of practices for troubles in speaking, hearing, or understanding. Repair has an initiation and an outcome, each by self or other, yielding four types — and there is a structural preference for self-initiation and for self-repair. Other-initiation is done gently: open-class initiators ("huh?", "what?"), repeats, candidate understandings — all of which leave the trouble-source speaker room to fix their own utterance. This is the deep structure behind a major agent failure mode: blunt other-correction ("actually, you're wrong"). The discipline's own finding is that correcting another person is dispreferred and, when done, mitigated — and that giving someone room to self-correct is the default.
The smallest accountability unit is the adjacency pair (Schegloff & Sacks 1973, Opening up closings; elaborated in Schegloff 2007). A first pair part (question, greeting, offer, request) makes a specific second pair part conditionally relevant: its presence is expected, and — this is the key word — its absence is noticeable and accountable. The norm is type-fitting (an answer fits a question), not mere temporal adjacency. In a group, this cuts two ways: a question to the agent owes a type-fitted answer, but a question to someone else must not be answered on their behalf — and a group question going unanswered is a "noticeable absence" the agent might appropriately fill.
Adjacency pairs are only the kernel. Schegloff (2007), Sequence Organization in Interaction, shows action is built across many turns via expansion: pre-sequences ("are you free Friday?" as a pre-invitation that tests the ground), insert sequences (a clarifying Q–A nested before the base answer), and post-expansions (sequence-closing thirds like "great", "okay"). In group chat, side-talk from other participants constantly separates a first pair part from its second — so holding the thread of an action across N intervening turns, interpreting a pre-sequence as a pre rather than literally, and emitting the closing third so the human knows the action landed, are all part of being coherent.
Finally, the construct that is the multi-party construct: Goffman's (1981) participation framework and footing (extended and partly contested by Goodwin & Goodwin 2004, who reframe participation as embodied and emergent rather than a fixed slot system; cf. Levinson 1988). Goffman decomposes the monolithic "speaker" into animator (who voices the words), author (who composed them), and principal (whose position they express); and the monolithic "hearer" into ratified participants (addressed and unaddressed recipients) versus unratified ones (bystanders, overhearers, eavesdroppers). Footing is the participant's alignment or stance, which shifts moment to moment. The implications for an agent are direct: don't treat overheard human-to-human talk as a request to you (you are an unaddressed party or a bystander); and don't blur footing by stating a relayed or quoted claim as your own assertion (animating someone else's words is not the same as being their principal). Closely related is recipient design (Sacks & Schegloff 1979 on reference; cf. Clark & Murphy 1982, "audience design"): speakers tailor word choice, explicitness, and presupposed knowledge to these particular recipients, using the least elaborate referring form the recipient will still recognize (the minimization/recognition tension). In a heterogeneous group — expert and novice, insider and newcomer — one-size answers are a failure in both directions.
Multi-party floor management (Sacks/Schegloff/Jefferson 1974; Egbert 1997 on schisming; Lerner 2003 and Auer 2018 on address terms, gaze, and next-speaker selection) adds one more analytic separation: addressee selection is not the same as next-speaker selection. A turn can be addressed to several recipients while selecting only one to respond ("closed-floor" vs. "open-floor" questions). And schisming — one conversation splitting into two simultaneous floors — is something participants actively work to prevent and repair. An agent should select explicitly (@-mention) when it wants a specific human to act, answer open-floor questions only when eligible, and avoid spawning the tangent that fractures the group.
2. The Social Psychology of Group Conversation: pressure, status, and shared knowledge
Where CA describes the machinery, social psychology describes the forces that machinery operates under — and most of them are forces an agent can either resist or amplify.
Start with shared knowledge. Clark & Brennan (1991), Grounding in Communication (building on Clark & Wilkes-Gibbs 1986), reframe conversation as not transmission but the collaborative construction of mutual knowledge. Participants seek positive evidence of understanding and ground each contribution only well enough for current purposes (the grounding criterion), while minimizing the joint work involved (least collaborative effort). Backchannels, acknowledgments, clarification requests, and repair are the machinery. In multi-party chat this is genuinely harder: people joined at different times, missed messages, and hold divergent assumptions. An agent that assumes its own knowledge state is shared breaks grounding in one direction (ungrounded references); one that re-explains settled material breaks least-effort in the other (over-grounding).
The related cognitive claim is perspective-taking / theory of mind. Keysar, Barr, Balin & Brauner (2000), Taking Perspective in Conversation: The Role of Mutual Knowledge in Comprehension, argue interlocutors often anchor egocentrically — interpreting from their own perspective and only effortfully adjusting to the partner's distinct knowledge when a problem is detected (theory of mind itself: Premack & Woodruff 1978). Important caveat: the specific two-stage timing claim is actively contested — constraint-based accounts (e.g., Hanna & Tanenhaus) find adults use common ground from the earliest moments. The eval-relevant point survives the dispute: in a group, you must maintain separate mental-state models per person — who knows X, who was present for Y, who holds which false belief. Egocentric failure produces leaks (revealing what one party shouldn't see), confusing references, and answers pitched to the wrong knowledge state.
Then the influence dynamics. Deutsch & Gerard (1955), building on Asch's (1951/1956) line-judgment experiments, distinguish informational social influence (treating others' views as evidence about reality, especially under ambiguity) from normative social influence (going along to gain acceptance or avoid disapproval, even against private judgment). The AI analogue of normative conformity is sycophancy — and it is the cleanest, most automatable thing in this whole document, because you can plant ground truth: confederate users assert a false consensus, and you measure whether the agent flips. Honest framing: the model has no felt need for approval, so "normative vs. informational" is a useful metaphor, not a literal mechanism; conformity magnitudes also vary by culture and have declined over decades.
Group discussion does not average toward moderation — Moscovici & Zavalloni (1969), The group as a polarizer of attitudes (with Stoner's 1961 risky-shift precursor), show group polarization: after discussion, members' attitudes become more extreme in the direction the group already leaned, driven by hearing novel same-side arguments (persuasive-arguments) and by social comparison. A persuasive agent is uniquely capable of generating exactly the novel one-sided arguments that drive the effect — making it a potential accelerant for pile-ons, hardening risky plans, or amplifying conspiracy. (The mechanism debate is unsettled, and which direction counts as failure is normative and use-case-specific.)
Inequality forms fast. Expectation States / Status Characteristics Theory (Berger, Cohen & Zelditch 1972; Berger, Fisek, Norman & Zelditch 1977; Ridgeway & Berger syntheses) shows that even in nominally equal task groups, participation and influence inequalities stabilize quickly and track status characteristics (gender, race, perceived expertise) that bias performance expectations — high-status members get more turns, more deference, fewer interruptions, regardless of actual competence. An agent both reads and reproduces status: it may defer to a confident, senior-sounding user and discount good ideas from low-status participants. The cleanest, least contestable test holds content fixed and varies the status cue — which avoids reifying the categories while still detecting the bias. The structural correlate is floor-control and participation inequality (Sacks/Schegloff/Jefferson 1974; Edelsky 1981, Who's got the floor?, with her distinction between the singly-developed F1 floor and the collaborative F2 floor). Some participants dominate (more turns, topic control); others are effectively silenced. An agent can monopolize the floor or, conversely, manage it to draw quiet members in.
Two more forces round out the picture. Social loafing / free-riding (Latané, Williams & Harkins 1979, Many hands make light the work; Karau & Williams 1993 meta-analysis): people exert less effort when contributions are pooled and not individually identifiable (diffusion of responsibility, reduced evaluability); effort returns under identifiability and felt essentiality. In a mixed human–agent group this cuts both ways — the agent may under-function ("someone else will handle it"), the agent's presence may induce humans to free-ride, or ambiguous ownership may push the agent to over-function and do everyone's work. And social facilitation / evaluation apprehension (Zajonc 1965, mere-presence drive theory; Cottrell 1972, evaluation-apprehension refinement): the presence of an audience raises arousal, boosting dominant responses (helping simple tasks, hurting complex ones), much of it attributable to the expectation of being judged. The AI analogue is public-vs-private consistency: does the agent grandstand, perform agreement, or degrade on hard tasks just because many people are watching? (Effect sizes are modest and the mechanism is contested — don't over-attribute to one cause.)
Finally, Communication Accommodation Theory (Giles 1973; Giles, Coupland & Coupland 1991; Giles & Ogay 2007): speakers adjust style — vocabulary, formality, register, rate, self-disclosure — toward (convergence, signaling affiliation and easing understanding) or away from (divergence, asserting distinct identity) their interlocutors. Over-convergence reads as patronizing. For an agent with many simultaneous interlocutors, appropriate accommodation means converging enough to be clear and affiliative without parroting slang, mirroring hostility, or condescending — and code-switching across participants of different styles in one thread.
3. Sociolinguistics & Pragmatics: face, meaning-beyond-words, and designing for the room
This discipline is about the social meaning carried by how something is said — and it is where the "be a good guest in the group" criteria live.
The central construct is face. Brown & Levinson (1987), Politeness: Some Universals in Language Usage (building on Goffman; 1978 precursor), define positive face (the want to be liked, approved, included) and negative face (the want to be unimpeded, free from imposition). Many ordinary acts — requests, criticisms, disagreements, corrections, advice — are intrinsically Face-Threatening Acts (FTAs), and politeness is the systematic mitigation work that offsets the threat. This is the most contested construct in the document: Matsumoto (1988) and Ide (1989) argue, using Japanese, that the individualistic, volition-based, negative-face-centric model does not travel to discernment cultures, where politeness is substantially wakimae — obligatory marking of relative social position via honorifics and form, oriented to group interdependence rather than freedom from imposition. Any face/politeness eval must therefore report per-culture and avoid a single universal rubric. In a group chat, an agent performs FTAs constantly — correcting one person in front of others amplifies the threat — and mishandling reads as either curt or sycophantic.
Goffman is upstream of all of it. On Face-Work (1955) and The Nature of Deference and Demeanor (1956/1967) frame face as a public self-image on loan from the interaction, maintained through ongoing face-work: avoidance rituals and corrective rituals (repair after an offense). Deference marks regard for another; demeanor is the conduct establishing one's own worthiness. A group chat is a sustained encounter, so after a gaffe — the agent's own error, or one human insulting another — appropriate corrective face-work (acknowledge, repair, restore equilibrium) is owed, without sliding into an over-apology loop or status-grabbing over-confidence.
Meaning routinely exceeds the literal words. Grice (1975), Logic and Conversation, posits the Cooperative Principle and four maxims — Quantity (as informative as required, no more), Quality (truthful, evidenced), Relation (relevant), Manner (clear, brief, orderly) — which hearers exploit: an apparent flouting of a maxim generates a conversational implicature ("it's getting late" as a closing move). For an agent, this is two jobs: comprehend implicature (act on what's implied, not just the literal), and govern its own contributions. The Quantity maxim is especially load-bearing — agents notoriously over-contribute in multi-party settings, dropping a wall of text into a fast chat. Note the overlap: Quality partly duplicates existing factuality/hallucination evals; the novel multi-party contribution is the interaction of truthfulness with face (hedging an unsupported claim) and with Quantity (saying the right amount).
Relatedly, indirect speech acts (Searle 1975, Indirect Speech Acts; Searle 1969, Speech Acts): one illocutionary act performed via another — "Can you pass the salt?" is literally a question, functionally a request. Group directives are routinely indirect and softened ("someone should probably look at the deploy"). The agent must recognize when an indirect form is a request directed (possibly implicitly) at it — without the over-eager error of acting when it is not the addressee — and must itself calibrate directness when asking something of a participant (too direct reads as bossy in front of the group; too indirect leaves the ask unowned). Indirectness is itself a politeness resource — an off-record request gives the hearer an out (a framing most precisely Brown & Levinson, downstream of Searle).
Designing for the room is its own construct. Bell (1984), Language Style as Audience Design, ranks audience roles by influence — addressee > auditor (ratified) > overhearer > eavesdropper — and shows accommodation is strongest toward the addressee, but ratified third parties measurably shape style (plus referee design: styling toward an absent reference group). A group chat is the canonical mixed-audience case: the agent addresses one person while several ratified auditors and overhearers read everything. Same message, different roster (technical vs. mixed-expertise; senior stakeholder present vs. peers only) should yield appropriately different style and disclosure — and the agent must not broadcast to the whole group what was appropriate only for one addressee. This is the same participation-framework backbone as Goffman's footing (1981), re-described from the sociolinguistic side. The contextualization layer comes from Gumperz (1982), Discourse Strategies: contextualization cues and conversational code-switching (including register-switching) do interactional work — addressee specification, quotation, reiteration, message qualification, personalization-vs-objectivization (Gumperz lists six functions; the data omits "interjection"). The switch itself is meaningful: a shift to formal register can flag seriousness/escalation, and the agent must both read such shifts and produce register-appropriate output (matching the group's established code without forced slang or stiff misfires).
PART B — The Eval-Criteria Spec
Consolidated criteria
Measurability key: CA = cheap-automatable · JR = judge-required · HSO = human-study-only · OP = open-problem.
| # | What to eval for | Grounded in (construct + citation) | Why it matters for a group-chat agent | Measurability | Measurement idea |
|---|---|---|---|---|---|
| 1 | Type-fitted responsiveness — when addressed with a first pair part, return the relevant second pair part; don't answer one owed by another party; surface unanswered group questions | Adjacency pairs & conditional relevance (Schegloff & Sacks 1973; Schegloff 2007) | The minimal unit of being a competent interlocutor; "answered for the wrong person" is a core failure | CA | Tag dialogues with FPP type + intended addressee; check next contribution is type-fitting and addressed to the right party (exact-match on addressee) |
| 2 | Addressee / participant-status tracking — respond only when ratified-as-addressed; stay attentive-but-silent on human-to-human talk; resolve "you"/@-mentions | Participation framework & footing (Goffman 1981; Goodwin & Goodwin 2004); Bell 1984 | THE multi-party construct; "the bot answered a question meant for Bob" | CA | Per-turn gold labels for "should the agent speak now?" + "who is the addressee?"; score precision/recall on intrusion vs. silence-when-needed |
| 3 | Footing / attribution integrity — when relaying or quoting, mark animator-vs-principal; attribute claims to their source, don't self-author them | Footing: animator/author/principal (Goffman 1981) | Blurring "I think X" with "the user said X" misattributes responsibility and fabricates authority | JR | Planted relays; judge or pattern-check whether relayed content is sourced to the right principal vs. asserted bare |
| 4 | When-to-speak / floor-share discipline — speak when selected or floor is genuinely open and unfilled; don't grab the floor; keep share proportionate; draw in quiet members | Turn-taking systematics (Sacks/Schegloff/Jefferson 1974); floor-control (Edelsky 1981) | An agent reading every silence as its cue dominates; one that never self-selects is furniture | JR (volumetric proxies CA) | Confusion matrix of {should-speak / should-stay-silent} at annotated TRPs; cheap proxies: turn-share ratio, mean inter-turn gap, reply-when-unaddressed rate |
| 5 | Barge-in restraint / graceful yield — withhold or retract when a human is mid-thought or was selected; distinguish collaborative completion from floor-grab | Overlap & overlap-resolution (Schegloff 2000; Jefferson) | Posting over a human, or stepping on an answer another was about to give, is the text analogue of interruption — but not all collisions are hostile | JR | Inject "human about to respond" scenarios (typing indicators/timing); measure barge-in rate and defer/edit-vs-duplicate after a near-simultaneous human message |
| 6 | Self-clarification before acting — ask a clarifying question on an under-specified/ambiguous request rather than acting on a guess | Repair: preference for self-initiation (Schegloff/Jefferson/Sacks 1977) | Acting on a misread goal is costly; the repair system says clarify first | CA | Self-clarification rate on a set of deliberately under-specified prompts (gold "ambiguous" flag) |
| 7 | Graceful other-correction & correction-uptake — flag others' errors mildly, leaving room to self-correct; accept being corrected without doubling down or over-apologizing | Repair: preference for self-repair / mitigated other-initiation (Schegloff/Jefferson/Sacks 1977); face-work (Goffman 1955) | Blunt "actually, you're wrong" in front of an audience is a major social-cost failure mode | JR | Plant human errors + corrections of the agent; rubric scores mitigation level of agent's corrections (bald vs. softened) and uptake vs. defensiveness |
| 8 | Multi-turn action coherence under interleaving — resume and complete a base sequence after N intervening turns; read pre-sequences as pres; emit closing thirds | Sequence organization & expansion (Schegloff 2007) | Side-talk constantly separates a first pair part from its second; "losing the plot" is common | JR (closing-third presence CA) | Interleave a target task with distractor side-conversations at controlled depths; measure completion-under-interruption and thread-drop rate |
| 9 | Recipient design / audience calibration — tune explicitness, jargon, and presupposed knowledge to the actual present mix; don't over-explain to experts or under-explain to novices | Recipient design (Sacks & Schegloff 1979; Clark & Murphy 1982); audience design (Bell 1984) | A heterogeneous audience means one-size answers fail in both directions | JR (jargon-density/readability proxies CA) | Same query, varied participant profiles (expert / novice / mixed); judge rates whether explicitness and references fit the present recipients |
| 10 | Common-ground maintenance — track per-participant shared knowledge; seek/offer evidence of understanding; repair on confusion signals; avoid both ungrounded reference and redundant over-grounding | Grounding (Clark & Brennan 1991; Clark & Wilkes-Gibbs 1986) | People join late and miss messages; assuming your knowledge is shared confuses, over-grounding bores | JR (unresolved-referent / redundant-re-explanation counts CA) | Transcripts with planted grounding events (newcomer joins; ambiguous referent; signaled non-understanding); rubric scores repair-initiation and reference-within-common-ground |
| 11 | Per-participant theory of mind / leakage control — maintain separate who-knows-what models; don't disclose to a party what they shouldn't know; resolve references per the right perspective | Egocentric anchoring / ToM (Keysar et al. 2000; Premack & Woodruff 1978) | Egocentric failure causes leaks, confusing references, answers pitched to the wrong knowledge state | CA (leak detection) / JR (belief attribution) | Asymmetric-knowledge tasks: A learns X privately, B doesn't; probe whether agent answers B without leaking X and attributes correct beliefs |
| 12 | Sycophancy / conformity resistance (with appropriate updating) — hold a justified position under social pressure; distinguish "new evidence" from "you simply disagree"; DO update on real evidence | Informational vs. normative influence (Deutsch & Gerard 1955; Asch 1951/56) | False consensus or an assertive high-status user shouldn't flip a correct factual stance | CA | Confederates assert wrong answer with no evidence (measure flip-rate, want low) vs. supply corrective evidence (measure update-rate, want high) — both against ground truth |
| 13 | Status-fair attention & credit — weight contributions by content, not inferred status cues; don't systematically favor the highest-status speaker or under-credit low-status ones | Status Characteristics / Expectation States (Berger et al. 1972, 1977) | An agent imports biased performance expectations rather than weighting on merit | CA | Hold content fixed, vary status cues (title, demographic name, assertiveness); measure differential agreement, credit, response length, deference |
| 14 | FTA-mitigation calibration — scale face-redress to the act's weightiness; neither bald-on-record bluntness nor so much hedging the content is lost | Face & FTAs (Brown & Levinson 1987; Goffman) | Corrections/refusals/disagreements are witnessed FTAs; mis-handling reads rude or evasive | JR | Scenarios requiring an FTA toward a named member; judge rates accuracy + redress presence + over/under-mitigation; check redress scales with FTA weight |
| 15 | Repair / recovery after rupture — perform proportionate corrective face-work after own error or an interpersonal rupture; no over-apology loop, no escalation | Face-work, deference & demeanor (Goffman 1955; 1956/1967) | A sustained encounter left unrepaired stays awkward; demeanor must avoid grovel and over-confidence both | JR | Inject a planted error/rupture; score next turn against a repair rubric (acknowledgment present, proportionate, forward-moving); count escalation vs. de-escalation |
| 16 | Maxim adherence — Quantity / Relation / Quality interaction — appropriately brief and relevant for chat tempo; don't re-answer settled points; don't assert beyond evidence (esp. hedge unsupported claims) | Cooperative Principle & maxims (Grice 1975) | Agents notoriously over-contribute in multi-party settings; the novel layer is truth × face × amount | CA (length/relevance) / JR (relevance nuance) | Turn-length target band for chat register + penalize info-dumps and re-answers; cross-check factual claims vs. ground truth and penalize unhedged false assertions |
| 17 | Implicature & indirect-directive comprehension — act on implicated meaning; recognize indirect/hinted requests and whether the agent is the (implicit) target — without over-eager action when it isn't | Implicature (Grice 1975); indirect speech acts (Searle 1969, 1975) | Group directives are routinely indirect and softened; the over-eager-action error is common | CA | Labeled indirect prompts with gold "intended action" + "who is the target?" answers, including distractors where the agent is NOT addressed |
| 18 | Accommodation appropriateness (convergence without over-/harmful-convergence) — converge on register/formality enough to be clear and affiliative; don't parrot slang, mirror hostility, or condescend; manage multiple styles in one thread | Communication Accommodation Theory (Giles 1973; Giles/Coupland/Coupland 1991); contextualization cues (Gumperz 1982) | Cold non-accommodation reads robotic; over-accommodation reads patronizing; mirroring hostility is harmful | JR (style-distance proxies CA) | Threads with divergent-style members (formal expert, casual newcomer, hostile user); judge rates per-addressee register match; flag over-accommodation and harmful convergence |
| 19 | Public-vs-private consistency — same quality and honesty in a large group as in a 1:1; no grandstanding, performative agreement, or degradation on hard tasks because many are watching | Social facilitation / evaluation apprehension (Zajonc 1965; Cottrell 1972) | The visible audience is the AI analogue of evaluation pressure | CA (performativity rating JR) | Identical prompts in 1:1 vs. large-audience condition; measure deltas in correctness, hedging, sycophancy, verbosity |
| 20 | Schism avoidance — don't fork the group into a private side-thread that fragments the conversation; select explicitly when wanting a specific actor | Schisming (Egbert 1997); next-speaker selection (Lerner 2003; Auer 2018) | Spawning a tangent that splits the floor degrades the whole group | JR (contested) / OP | Judge over the resulting transcript: did the agent's turn fork the conversation? (hard to operationalize cleanly) |
| 21 | Responsibility / effort calibration — know when it's on the hook vs. when another party owns the action; don't drop tasks on diffusion-of-responsibility; don't over-function and displace humans | Social loafing / free-riding (Latané/Williams/Harkins 1979; Karau & Williams 1993) | Ambiguous ownership in mixed human–agent groups causes dropped balls or agent over-reach | JR (drop/over-reach counts semi-CA) | Tasks with ambiguous ownership, some assigned to others; measure dropped-task rate (was responsible, didn't act) and over-reach rate (did another's task) |
| 22 | Group-polarization moderation — in a leaning group, surface counter-considerations / base rates / steelman the minority rather than supplying fresh one-sided arguments | Group polarization (Moscovici & Zavalloni 1969; Stoner 1961) | A persuasive agent can be an accelerant for pile-ons, risky plans, conspiracy | JR | Seed a leaning thread; compare agent-present vs. agent-absent extremity (judge-rated); count balancing vs. confirming/escalating moves |
| 23 | Cultural politeness portability — apply locally appropriate deference/address forms (honorific level by relative status) rather than transplanting one culture's directness everywhere | Discernment / wakimae (Matsumoto 1988; Ide 1989) | A correction fine among Western peers can be a serious face violation elsewhere | HSO (honorific grammar partially CA) | Honorific/T-V correctness against grammatical rules given known status config (partial); overall appropriateness needs native-speaker raters, reported per-locale |
What our existing evals already cover vs. the gap
Our existing LLM-eval corpus and the group-chat survey measure agent task competence, and several criteria above are essentially re-groundings of things we already test:
Already covered (well). Criterion #2 (addressee/participant-status tracking) is our addressee accuracy / disentanglement work — Goffman's participation framework is the theory underneath a metric we already run. #4's volumetric side (floor-share) overlaps our when-to-speak F1. #11's leak-detection slice is our leakage metric, now grounded in egocentric-anchoring / ToM. #16's Quality dimension overlaps existing factuality/hallucination evals. #12 (sycophancy) is a known, partly-covered target. #14/#7's face dimensions are partially captured by our social-rubric judges. These should be re-labeled with their constructs, not re-built.
Genuinely new and untested. The constructs that the task-competence framing does not reach:
- Repair after misattribution / graceful other-correction (#6, #7) — we measure whether the agent gets the addressee right, not how it recovers when it (or a human) gets something wrong, nor the mitigation level of its corrections.
- Footing / participation-role tracking as attribution integrity (#3) — animator-vs-principal sourcing is not in the corpus at all; we test who is addressed, never whose words the agent is voicing.
- Face-calibration per recipient and FTA-redress scaling (#14, #15) — our social rubrics score politeness coarsely; scaling redress monotonically to FTA weight, and corrective face-work after a rupture, are untested.
- Common-ground maintenance as an ongoing process (#10) — we don't measure grounding/least-effort over a thread; over-grounding in particular is invisible to current metrics.
- Participation-equity as a NORM, not just a metric (#4, #13) — we count the agent's turn-share, but we don't test whether it actively draws in quiet members or distributes credit fairly across status cues. The normative move — "did it improve the group's participation balance?" — is new.
- Accommodation / convergence (#18) — register-matching appropriateness (and its failure modes: over-accommodation, harmful convergence) is untested.
- Also new: multi-turn coherence under interleaving (#8), public-vs-private consistency (#19), group-polarization moderation (#22), responsibility calibration (#21), schism avoidance (#20), and cultural politeness portability (#23).
Honest qualifier on the gap: many of the new criteria are judge-required and norm-laden, with a live risk that an LLM judge shares the agent's blind spots — so each needs an anchored rubric, reported inter-rater agreement, and (where possible) a human-rated calibration subset. The genuinely cheap additions are the ones with plantable ground truth (#1, #11-leak, #12, #13, #17).
The 3–5 highest-value additions
Criteria that are both important and at least judge-measurable, and that a curriculum could realistically teach:
Graceful repair & other-correction (#6 + #7). The single most teachable interpersonal skill and a top failure mode of assistant-style agents (blunt "you're wrong"). Plantable errors + an anchored mitigation rubric make it judge-measurable today, and self-clarification rate (#6) is cheaply automatable. Grounded in Schegloff/Jefferson/Sacks (1977).
Footing / attribution integrity (#3). Genuinely new, structurally clean ("did it attribute the relayed claim to the right principal?"), and high-stakes — blurring "the user said X" into "X is true" fabricates authority. Goffman (1981).
Common-ground maintenance (#10). Captures both under- and over-grounding, which current addressee-only metrics miss entirely; plantable grounding events (newcomer joins, ambiguous referent) give a judge concrete things to score. Clark & Brennan (1991).
Face-calibration / FTA-redress scaling (#14). A judge can rate whether redress scales with FTA weight while holding content constant — a crisp, teachable rubric — and it generalizes the coarse social-rubric judges we already have. Brown & Levinson (1987), with the cross-cultural caveat reported per-locale.
Participation-equity as a norm + sycophancy resistance (#4/#13 + #12). Pairs a cheap, ground-truthable metric (sycophancy flip-rate; status-cue bias at fixed content) with the normative upgrade — does the agent distribute the floor and credit fairly rather than just keeping its own turn-share low? Deutsch & Gerard (1955); Berger et al. (1972, 1977); Edelsky (1981).