pingpong

Model name	Final score	Length-normalized score	Refusal ratio	Stay-in-character score	Language fluency score	Entertainment score	Number of situations	Average length
claude_3_5_sonnet	8.12	7.84	0.28	8.46	7.98	7.91	64	387
claude_3_haiku	7.91	7.33	0.02	8.15	7.99	7.58	64	520
llama31_405b_it	7.83	7.49	0.02	8.19	7.76	7.55	64	415
llama31_70b_it	7.70	7.37	0.00	8.13	7.51	7.47	64	410
wizardlm_2_8x22b	7.64	6.70	0.00	7.78	7.84	7.32	64	699
gpt_4o_mini	7.52	7.51	0.00	7.84	7.60	7.11	64	275
gpt_4o	7.46	7.46	0.02	7.77	7.63	6.96	64	237
gemma2_27b_it	7.45	7.45	0.00	7.94	7.44	6.97	64	249
llama31_8b_it	7.40	7.40	0.02	7.71	7.39	7.12	64	267
saiga_gemma2_9b	7.32	7.32	0.00	7.56	7.40	7.00	64	266
mini_magnum_12b_v1_1	7.29	6.97	0.00	7.56	7.54	6.76	64	405
magnum_72b	7.24	7.20	0.00	7.51	7.47	6.75	64	289
saiga_tlite_8b	7.11	7.11	0.00	7.34	7.49	6.51	64	171
gemma2_9b_it	7.10	7.10	0.00	7.48	7.06	6.78	64	174
saiga_llama3_8b_v7	6.89	6.89	0.00	7.17	7.28	6.23	64	203
gemma2_2b_it_abl	6.17	6.17	0.00	6.44	6.51	5.55	64	140