pingpong

A benchmark for role-playing LLMs

| Model name | Final score | Length norm score | Refusal ratio | Stay in character score | Language fluency score | Entertainment score | Num situations | Avg length |
|---|---|---|---|---|---|---|---|---|
| claude_3_5_sonnet | 8.12 | 7.84 | 0.28 | 8.46 | 7.98 | 7.91 | 64 | 387 |
| claude_3_haiku | 7.91 | 7.33 | 0.02 | 8.15 | 7.99 | 7.58 | 64 | 520 |
| llama31_405b_it | 7.83 | 7.49 | 0.02 | 8.19 | 7.76 | 7.55 | 64 | 415 |
| llama31_70b_it | 7.70 | 7.37 | 0.00 | 8.13 | 7.51 | 7.47 | 64 | 410 |
| wizardlm_2_8x22b | 7.64 | 6.70 | 0.00 | 7.78 | 7.84 | 7.32 | 64 | 699 |
| gpt_4o_mini | 7.52 | 7.51 | 0.00 | 7.84 | 7.60 | 7.11 | 64 | 275 |
| gpt_4o | 7.46 | 7.46 | 0.02 | 7.77 | 7.63 | 6.96 | 64 | 237 |
| gemma2_27b_it | 7.45 | 7.45 | 0.00 | 7.94 | 7.44 | 6.97 | 64 | 249 |
| llama31_8b_it | 7.40 | 7.40 | 0.02 | 7.71 | 7.39 | 7.12 | 64 | 267 |
| saiga_gemma2_9b | 7.32 | 7.32 | 0.00 | 7.56 | 7.40 | 7.00 | 64 | 266 |
| mini_magnum_12b_v1_1 | 7.29 | 6.97 | 0.00 | 7.56 | 7.54 | 6.76 | 64 | 405 |
| magnum_72b | 7.24 | 7.20 | 0.00 | 7.51 | 7.47 | 6.75 | 64 | 289 |
| saiga_tlite_8b | 7.11 | 7.11 | 0.00 | 7.34 | 7.49 | 6.51 | 64 | 171 |
| gemma2_9b_it | 7.10 | 7.10 | 0.00 | 7.48 | 7.06 | 6.78 | 64 | 174 |
| saiga_llama3_8b_v7 | 6.89 | 6.89 | 0.00 | 7.17 | 7.28 | 6.23 | 64 | 203 |
| gemma2_2b_it_abl | 6.17 | 6.17 | 0.00 | 6.44 | 6.51 | 5.55 | 64 | 140 |
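
The leaderboard is ordered by final score, but the ranking can shift when a different column is used. The following is a minimal sketch of re-ranking by the length-normalized score, assuming a few rows copied from the leaderboard as whitespace-separated text. The `Entry` field names and the `parse_rows` helper are hypothetical and are not part of the benchmark's own code.

```python
from dataclasses import dataclass

# Four rows copied from the leaderboard, whitespace-separated (illustrative subset).
ROWS = """\
claude_3_5_sonnet 8.12 7.84 0.28 8.46 7.98 7.91 64 387
wizardlm_2_8x22b 7.64 6.70 0.00 7.78 7.84 7.32 64 699
gpt_4o_mini 7.52 7.51 0.00 7.84 7.60 7.11 64 275
gemma2_2b_it_abl 6.17 6.17 0.00 6.44 6.51 5.55 64 140
"""


@dataclass
class Entry:
    # Field names are assumptions that mirror the leaderboard column headers.
    model: str
    final: float
    length_norm: float
    refusal_ratio: float
    in_character: float
    fluency: float
    entertainment: float
    num_situations: int
    avg_length: int


def parse_rows(text: str) -> list[Entry]:
    """Turn whitespace-separated leaderboard rows into Entry records."""
    entries = []
    for line in text.strip().splitlines():
        name, *nums = line.split()
        scores = [float(x) for x in nums[:6]]  # the six score/ratio columns
        entries.append(Entry(name, *scores, int(nums[6]), int(nums[7])))
    return entries


entries = parse_rows(ROWS)

# Rank the same models by two different columns, highest first.
by_final = [e.model for e in sorted(entries, key=lambda e: e.final, reverse=True)]
by_length_norm = [
    e.model for e in sorted(entries, key=lambda e: e.length_norm, reverse=True)
]

print(by_final)
print(by_length_norm)
```

With these four rows, wizardlm_2_8x22b drops from second to third place when ranked by length-normalized score, which is consistent with it having the largest Avg length (699) among them.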