pingpong

A benchmark for role-playing LLMs


Russian leaderboard, v2

Last updated: 2024-09-19 15:28:20

| # | Model name | Length norm score | Avg score | Refusal ratio | Stay in character score | Language fluency score | Entertainment score | Num cases | Avg length |
|---|---|---|---|---|---|---|---|---|---|
| 1 | claude_3_5_sonnet | 4.63 | 4.68 | 0.30 | 4.80 | 4.80 | 4.44 | 64 | 388 |
| 2 | gemini_pro_1_5 | 4.49 | 4.49 | 0.02 | 4.60 | 4.75 | 4.13 | 64 | 213 |
| 2 | gpt_4o_mini | 4.49 | 4.49 | 0.00 | 4.62 | 4.82 | 4.04 | 64 | 329 |
| 2 | gpt_4o | 4.47 | 4.47 | 0.02 | 4.61 | 4.82 | 3.99 | 64 | 301 |
| 2 | gemma2_ataraxy_9b | 4.45 | 4.45 | 0.00 | 4.61 | 4.53 | 4.21 | 64 | 302 |
| 3 | nous_hermes_3_405b | 4.44 | 4.44 | 0.00 | 4.53 | 4.74 | 4.05 | 62 | 286 |
| 3 | claude_3_opus | 4.44 | 4.62 | 0.05 | 4.72 | 4.67 | 4.48 | 64 | 753 |
| 3 | gemma2_ifable_9b | 4.43 | 4.43 | 0.00 | 4.60 | 4.46 | 4.24 | 64 | 314 |
| 3 | qwen25_32b_it | 4.42 | 4.42 | 0.00 | 4.54 | 4.71 | 4.01 | 64 | 267 |
| 3 | qwen2_72b_it | 4.41 | 4.41 | 0.00 | 4.43 | 4.85 | 3.96 | 64 | 242 |
| 3 | llama_31_405b_it | 4.41 | 4.54 | 0.00 | 4.66 | 4.69 | 4.26 | 64 | 536 |
| 3 | gemma2_27b_it | 4.41 | 4.41 | 0.00 | 4.63 | 4.73 | 3.88 | 64 | 210 |
| 4 | command_r_plus_104b_0824 | 4.37 | 4.47 | 0.00 | 4.52 | 4.73 | 4.16 | 64 | 470 |
| 4 | mistral_nemo_gutenberg_12b_v2 | 4.36 | 4.52 | 0.00 | 4.53 | 4.73 | 4.30 | 64 | 661 |
| 4 | llama_31_70b_it | 4.33 | 4.44 | 0.00 | 4.61 | 4.38 | 4.31 | 64 | 499 |
| 4 | gemma2_9b_it_sppo_iter3 | 4.32 | 4.32 | 0.00 | 4.54 | 4.38 | 4.05 | 64 | 226 |
| 5 | claude_3_haiku | 4.32 | 4.46 | 0.00 | 4.45 | 4.79 | 4.13 | 64 | 589 |
| 5 | magnum_v2_123b | 4.28 | 4.39 | 0.00 | 4.39 | 4.66 | 4.11 | 64 | 506 |
| 5 | qwen25_14b_it | 4.27 | 4.27 | 0.00 | 4.35 | 4.58 | 3.89 | 64 | 278 |
| 6 | command_r_35b_0824 | 4.20 | 4.20 | 0.00 | 4.15 | 4.79 | 3.67 | 64 | 209 |
| 6 | gemma2_9b_it_simpo | 4.20 | 4.20 | 0.00 | 4.45 | 4.10 | 4.05 | 64 | 322 |
| 6 | command_r_plus_104b_0424 | 4.20 | 4.34 | 0.00 | 4.33 | 4.63 | 4.07 | 64 | 615 |
| 6 | deepseek_chat_v2_0628 | 4.18 | 4.19 | 0.00 | 4.21 | 4.66 | 3.69 | 64 | 337 |
| 7 | wizardlm_2_8x22b | 4.12 | 4.31 | 0.00 | 4.29 | 4.49 | 4.15 | 64 | 832 |
| 7 | llama_31_8b_it | 4.09 | 4.09 | 0.00 | 4.30 | 4.17 | 3.80 | 64 | 325 |
| 8 | gemma2_9b_it | 4.03 | 4.03 | 0.00 | 4.34 | 3.93 | 3.81 | 64 | 224 |
| 8 | gemma2_9b_it_abl | 4.03 | 4.03 | 0.00 | 4.19 | 4.18 | 3.71 | 64 | 162 |
| 8 | jamba_1_5_large | 3.99 | 3.99 | 0.00 | 4.07 | 4.50 | 3.38 | 64 | 203 |
| 9 | mini_magnum_12b_v1_1 | 3.96 | 4.08 | 0.00 | 4.02 | 4.50 | 3.72 | 64 | 575 |
| 9 | saiga_llama3_8b | 3.94 | 3.94 | 0.00 | 3.93 | 4.57 | 3.32 | 64 | 207 |
| 9 | qwen2_7b_it | 3.94 | 3.94 | 0.00 | 3.77 | 4.61 | 3.42 | 64 | 276 |
| 9 | ruadapt_llama3_kto_abl | 3.93 | 3.96 | 0.00 | 4.02 | 4.26 | 3.58 | 64 | 357 |
| 10 | llama_31_euryale_70b_v2_2 | 3.49 | 3.56 | 0.00 | 3.85 | 3.25 | 3.57 | 63 | 439 |
| 11 | vikhr_gemma_2b_it | 2.81 | 2.90 | 0.00 | 3.00 | 2.60 | 3.09 | 63 | 576 |
| 11 | phi_35_mini_4b_it | 2.81 | 2.85 | 0.00 | 2.94 | 2.62 | 2.99 | 64 | 417 |
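For readers who want to slice the leaderboard by a different column than the length-normalized total, here is a minimal Python sketch using only the standard library. The embedded rows are copied from the table above; the column names are shorthand chosen for this snippet, not part of the benchmark itself.

```python
import csv
import io

# A few rows copied from the Russian v2 leaderboard above.
# Columns: rank group, model, length-normalized score, average score,
# refusal ratio, stay-in-character, language fluency, entertainment,
# number of cases, average response length.
LEADERBOARD_CSV = """\
rank,model,length_norm_score,avg_score,refusal_ratio,in_character,fluency,entertainment,num_cases,avg_length
1,claude_3_5_sonnet,4.63,4.68,0.30,4.80,4.80,4.44,64,388
2,gemini_pro_1_5,4.49,4.49,0.02,4.60,4.75,4.13,64,213
2,gpt_4o_mini,4.49,4.49,0.00,4.62,4.82,4.04,64,329
3,claude_3_opus,4.44,4.62,0.05,4.72,4.67,4.48,64,753
"""

def load_rows(text: str) -> list[dict]:
    """Parse the CSV text, converting every field except `model` to float."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for key in row:
            if key != "model":
                row[key] = float(row[key])
        rows.append(row)
    return rows

rows = load_rows(LEADERBOARD_CSV)

# Re-rank by the entertainment score instead of the length-normalized total.
by_entertainment = sorted(rows, key=lambda r: r["entertainment"], reverse=True)
print([r["model"] for r in by_entertainment])
```

On these four rows, sorting by entertainment promotes claude_3_opus (4.48) above claude_3_5_sonnet (4.44), illustrating how the length-normalized headline score and the per-aspect scores can disagree.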