pingpong

A benchmark for role-playing LLMs

English leaderboard, v2

Last updated: 2024-09-19 21:21:32

| # | Model name | Length norm score | Avg score | Refusal ratio | Stay in character score | Language fluency score | Entertain score | Num cases | Avg length |
|---|---|---|---|---|---|---|---|---|---|
| 1 | claude_3_5_sonnet | 4.65 | 4.65 | 0.28 | 4.74 | 4.93 | 4.29 | 64 | 418 |
| 1 | llama_31_405b_it | 4.65 | 4.65 | 0.06 | 4.68 | 4.93 | 4.35 | 64 | 548 |
| 1 | llama_31_70b_it | 4.65 | 4.66 | 0.00 | 4.71 | 4.93 | 4.33 | 64 | 562 |
| 2 | gpt_4o_mini | 4.56 | 4.56 | 0.00 | 4.60 | 4.94 | 4.13 | 64 | 457 |
| 2 | claude_3_opus | 4.56 | 4.71 | 0.22 | 4.75 | 4.92 | 4.46 | 64 | 1032 |
| 2 | gemma2_ataraxy_9b | 4.52 | 4.52 | 0.00 | 4.60 | 4.79 | 4.17 | 64 | 358 |
| 2 | gemma2_27b_it | 4.51 | 4.51 | 0.00 | 4.56 | 4.92 | 4.06 | 64 | 291 |
| 3 | mistral_nemo_gutenberg_12b_v2 | 4.51 | 4.57 | 0.00 | 4.65 | 4.80 | 4.25 | 64 | 664 |
| 3 | gpt_4o | 4.50 | 4.50 | 0.00 | 4.56 | 4.94 | 4.02 | 64 | 484 |
| 3 | command_r_plus_104b_0824 | 4.50 | 4.50 | 0.00 | 4.58 | 4.90 | 4.04 | 64 | 553 |
| 3 | llama_31_8b_it | 4.50 | 4.51 | 0.02 | 4.50 | 4.83 | 4.20 | 64 | 568 |
| 3 | magnum_v2_123b | 4.50 | 4.59 | 0.00 | 4.54 | 4.94 | 4.28 | 64 | 768 |
| 3 | gemini_pro_1_5 | 4.50 | 4.50 | 0.02 | 4.54 | 4.88 | 4.07 | 64 | 265 |
| 3 | qwen2_72b_it | 4.49 | 4.49 | 0.00 | 4.49 | 4.93 | 4.06 | 64 | 510 |
| 3 | llama_31_euryale_70b_v2_2 | 4.48 | 4.48 | 0.02 | 4.48 | 4.88 | 4.08 | 64 | 384 |
| 3 | llama_3_lunaris_8b | 4.48 | 4.54 | 0.00 | 4.53 | 4.89 | 4.20 | 64 | 673 |
| 3 | nous_hermes_3_405b | 4.47 | 4.47 | 0.00 | 4.41 | 4.90 | 4.10 | 64 | 471 |
| 4 | mistral_large_123b_2407 | 4.45 | 4.45 | 0.02 | 4.55 | 4.86 | 3.94 | 64 | 325 |
| 4 | wizardlm_2_8x22b | 4.41 | 4.57 | 0.00 | 4.62 | 4.92 | 4.18 | 64 | 1143 |
| 5 | llama_31_8b_stheno_v3_4 | 4.37 | 4.45 | 0.00 | 4.44 | 4.77 | 4.14 | 64 | 736 |
| 5 | deepseek_chat_v2_0628 | 4.35 | 4.35 | 0.00 | 4.34 | 4.94 | 3.77 | 64 | 399 |
| 5 | claude_3_haiku | 4.35 | 4.43 | 0.03 | 4.36 | 4.89 | 4.04 | 64 | 750 |
| 5 | solar_pro | 4.33 | 4.33 | 0.00 | 4.30 | 4.92 | 3.77 | 63 | 300 |
| 6 | star_command_r_32b_v1 | 4.32 | 4.40 | 0.00 | 4.37 | 4.81 | 4.03 | 64 | 748 |
| 7 | llama_31_70b_arliai_rpmax_v1_1 | 4.20 | 4.22 | 0.00 | 4.07 | 4.79 | 3.81 | 63 | 587 |
| 7 | arliai_rpmax_12b_v1_1 | 4.17 | 4.25 | 0.02 | 4.27 | 4.55 | 3.93 | 64 | 743 |
| 8 | lyra4_gutenberg_12b | 4.14 | 4.30 | 0.00 | 4.38 | 4.71 | 3.81 | 64 | 1133 |
| 8 | mistral_nemo_starcannon_12b | 4.14 | 4.27 | 0.02 | 4.20 | 4.76 | 3.84 | 64 | 940 |
| 8 | jamba_1_5_large | 4.14 | 4.14 | 0.00 | 4.06 | 4.79 | 3.56 | 64 | 345 |
| 8 | mistral_nemo_12b | 4.13 | 4.13 | 0.00 | 4.22 | 4.80 | 3.38 | 64 | 224 |
| 8 | qwen2_7b_it | 4.11 | 4.11 | 0.02 | 4.01 | 4.79 | 3.53 | 64 | 354 |
| 9 | mythomax_13b | 4.01 | 4.01 | 0.00 | 3.83 | 4.80 | 3.39 | 64 | 388 |
| 10 | phi_3_5_mini_4b_it | 3.96 | 4.04 | 0.00 | 3.81 | 4.81 | 3.49 | 64 | 768 |