Main Results
We comprehensively evaluated leading LLMs across all four game levels using M3-Bench's three evaluation modules (BTA, RPA, and CCA). The results reveal substantial differences in social-behavior competence across models and expose process-level inconsistencies that outcome-only evaluation cannot detect.
Key Findings:
- Models show divergent performance across BTA, RPA, and CCA views, confirming that behavioral outcomes alone are insufficient for comprehensive evaluation.
- Some models achieve high BTA scores while scoring markedly lower on RPA or CCA, indicating inconsistencies among their actions, reasoning, and communication.
- Performance generally degrades from Level 1 to Level 4 as game complexity increases, with dynamic strategic games (Avalon, Werewolf, Kuhn Poker) posing the greatest challenge.
- Human players demonstrate more balanced performance than the evaluated models across all three evaluation dimensions.