Main Results
We comprehensively evaluated leading LLMs across all four game levels using M3-Bench's three evaluation modules (BTA, RPA, and CCA). The results reveal substantial differences in social-behavior competence across models and expose process-level inconsistencies that outcome-only evaluation cannot detect.
Key Findings:
- Models show divergent performance across BTA, RPA, and CCA views, confirming that behavioral outcomes alone are insufficient for comprehensive evaluation.
- Some models achieve high BTA scores while scoring markedly lower on RPA or CCA, indicating inconsistencies among their actions, reasoning, and communication.
- Performance generally degrades from Level 1 to Level 4 as game complexity increases, with dynamic strategic games (Avalon, Werewolf, Kuhn Poker) posing the greatest challenge.
- Human players demonstrate more balanced performance than the evaluated models across all three evaluation dimensions.