pingpong

A benchmark for role-playing LLMs

| Model name | Final score | Length norm score | Refusal ratio | Stay in character score | Language fluency score | Entertainment score | Num situations | Avg length |
|---|---|---|---|---|---|---|---|---|
| claude_3_opus | 8.24 | 7.80 | 0.28 | 8.52 | 8.43 | 7.78 | 64 | 952 |
| claude_3_5_sonnet | 8.04 | 8.04 | 0.23 | 8.39 | 8.28 | 7.46 | 64 | 423 |
| wizardlm_2_8x22b | 8.03 | 7.48 | 0.00 | 8.25 | 8.36 | 7.46 | 64 | 1093 |
| claude_3_haiku | 7.98 | 7.72 | 0.06 | 8.28 | 8.23 | 7.44 | 64 | 729 |
| magnum_72b | 7.86 | 7.43 | 0.00 | 8.13 | 8.16 | 7.30 | 64 | 931 |
| gpt_4o_mini | 7.78 | 7.76 | 0.00 | 8.14 | 8.07 | 7.12 | 64 | 440 |
| gemma2_27b_it | 7.71 | 7.71 | 0.00 | 8.06 | 8.00 | 7.07 | 64 | 247 |
| gpt_4o | 7.69 | 7.67 | 0.00 | 8.02 | 8.05 | 6.99 | 64 | 435 |
| mistral_large_2407 | 7.62 | 7.62 | 0.00 | 8.00 | 7.93 | 6.94 | 64 | 301 |
| gemma2_9b_it_abl | 7.50 | 7.50 | 0.00 | 7.83 | 7.83 | 6.82 | 64 | 188 |
| mistral_nemo | 7.43 | 7.43 | 0.00 | 7.83 | 7.86 | 6.62 | 64 | 190 |
| nous_hermes_3_405b | 7.24 | 7.24 | 0.03 | 7.59 | 7.54 | 6.60 | 64 | 274 |
| mythomax_13b | 7.11 | 7.11 | 0.00 | 7.19 | 7.76 | 6.39 | 64 | 346 |
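
The page does not say how the final score is derived. Judging only from the numbers in the table, it appears to track the unweighted mean of the stay-in-character, language fluency, and entertainment scores. The sketch below checks that reading against the rows above. The formula is an assumption inferred from the data, not a documented part of the benchmark, and the names `ROWS`, `mean_of_components`, and `max_gap` are hypothetical. It makes no attempt to reproduce the length norm score.

```python
# Rows copied from the leaderboard table:
# (model, final, stay_in_character, language_fluency, entertainment)
ROWS = [
    ("claude_3_opus", 8.24, 8.52, 8.43, 7.78),
    ("claude_3_5_sonnet", 8.04, 8.39, 8.28, 7.46),
    ("wizardlm_2_8x22b", 8.03, 8.25, 8.36, 7.46),
    ("claude_3_haiku", 7.98, 8.28, 8.23, 7.44),
    ("magnum_72b", 7.86, 8.13, 8.16, 7.30),
    ("gpt_4o_mini", 7.78, 8.14, 8.07, 7.12),
    ("gemma2_27b_it", 7.71, 8.06, 8.00, 7.07),
    ("gpt_4o", 7.69, 8.02, 8.05, 6.99),
    ("mistral_large_2407", 7.62, 8.00, 7.93, 6.94),
    ("gemma2_9b_it_abl", 7.50, 7.83, 7.83, 6.82),
    ("mistral_nemo", 7.43, 7.83, 7.86, 6.62),
    ("nous_hermes_3_405b", 7.24, 7.59, 7.54, 6.60),
    ("mythomax_13b", 7.11, 7.19, 7.76, 6.39),
]


def mean_of_components(in_character: float, fluency: float, entertainment: float) -> float:
    """Assumed formula: unweighted mean of the three component scores."""
    return (in_character + fluency + entertainment) / 3


def max_gap(rows) -> float:
    """Largest absolute difference between the reported final score and the assumed mean."""
    return max(
        abs(final - mean_of_components(in_char, fluency, ent))
        for _, final, in_char, fluency, ent in rows
    )


if __name__ == "__main__":
    for model, final, in_char, fluency, ent in ROWS:
        estimate = mean_of_components(in_char, fluency, ent)
        print(f"{model:22s} reported={final:.2f} mean={estimate:.3f}")
    print(f"largest gap: {max_gap(ROWS):.4f}")
```

Every row agrees with the assumed mean to within 0.01. The small differences that remain are consistent with the published components having been rounded to two decimals before display.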