Model name | Final score | Length-normalized score | Refusal ratio | Stay-in-character score | Language fluency score | Entertainment score | Number of situations | Average length |
---|---|---|---|---|---|---|---|---|
claude_3_opus | 8.24 | 7.80 | 0.28 | 8.52 | 8.43 | 7.78 | 64 | 952 |
claude_3_5_sonnet | 8.04 | 8.04 | 0.23 | 8.39 | 8.28 | 7.46 | 64 | 423 |
wizardlm_2_8x22b | 8.03 | 7.48 | 0.00 | 8.25 | 8.36 | 7.46 | 64 | 1093 |
claude_3_haiku | 7.98 | 7.72 | 0.06 | 8.28 | 8.23 | 7.44 | 64 | 729 |
magnum_72b | 7.86 | 7.43 | 0.00 | 8.13 | 8.16 | 7.30 | 64 | 931 |
gpt_4o_mini | 7.78 | 7.76 | 0.00 | 8.14 | 8.07 | 7.12 | 64 | 440 |
gemma2_27b_it | 7.71 | 7.71 | 0.00 | 8.06 | 8.00 | 7.07 | 64 | 247 |
gpt_4o | 7.69 | 7.67 | 0.00 | 8.02 | 8.05 | 6.99 | 64 | 435 |
mistral_large_2407 | 7.62 | 7.62 | 0.00 | 8.00 | 7.93 | 6.94 | 64 | 301 |
gemma2_9b_it_abl | 7.50 | 7.50 | 0.00 | 7.83 | 7.83 | 6.82 | 64 | 188 |
mistral_nemo | 7.43 | 7.43 | 0.00 | 7.83 | 7.86 | 6.62 | 64 | 190 |
nous_hermes_3_405b | 7.24 | 7.24 | 0.03 | 7.59 | 7.54 | 6.60 | 64 | 274 |
mythomax_13b | 7.11 | 7.11 | 0.00 | 7.19 | 7.76 | 6.39 | 64 | 346 |
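
The rows above can be cross-checked with a short script. The sketch below copies the score columns from the table and does two things: it measures how far each final score is from the unweighted mean of the three component scores (stay in character, language fluency, entertainment), and it re-ranks the models by the third column, the length-normalized score. The averaging relationship is only an observation from the reported numbers, not a documented formula, and the names `ROWS`, `mean_of_components`, `max_gap`, and `by_length_norm` are hypothetical helpers introduced for illustration.

```python
# Rows copied from the table above:
# (model, final, length_norm, stay_in_character, fluency, entertainment)
ROWS = [
    ("claude_3_opus",      8.24, 7.80, 8.52, 8.43, 7.78),
    ("claude_3_5_sonnet",  8.04, 8.04, 8.39, 8.28, 7.46),
    ("wizardlm_2_8x22b",   8.03, 7.48, 8.25, 8.36, 7.46),
    ("claude_3_haiku",     7.98, 7.72, 8.28, 8.23, 7.44),
    ("magnum_72b",         7.86, 7.43, 8.13, 8.16, 7.30),
    ("gpt_4o_mini",        7.78, 7.76, 8.14, 8.07, 7.12),
    ("gemma2_27b_it",      7.71, 7.71, 8.06, 8.00, 7.07),
    ("gpt_4o",             7.69, 7.67, 8.02, 8.05, 6.99),
    ("mistral_large_2407", 7.62, 7.62, 8.00, 7.93, 6.94),
    ("gemma2_9b_it_abl",   7.50, 7.50, 7.83, 7.83, 6.82),
    ("mistral_nemo",       7.43, 7.43, 7.83, 7.86, 6.62),
    ("nous_hermes_3_405b", 7.24, 7.24, 7.59, 7.54, 6.60),
    ("mythomax_13b",       7.11, 7.11, 7.19, 7.76, 6.39),
]


def mean_of_components(row):
    """Unweighted mean of the three component scores (an assumed relationship)."""
    _, _, _, in_character, fluency, entertainment = row
    return (in_character + fluency + entertainment) / 3


# Largest absolute gap between the reported final score and the component mean.
# Because every table value is rounded to two decimals, a gap of up to about
# 0.01 is compatible with the final score being a plain average.
max_gap = max(abs(row[1] - mean_of_components(row)) for row in ROWS)

# Re-rank by the length-normalized score instead of the final score.
by_length_norm = sorted(ROWS, key=lambda row: row[2], reverse=True)

if __name__ == "__main__":
    print(f"max |final - mean(components)| = {max_gap:.4f}")
    for rank, row in enumerate(by_length_norm, start=1):
        print(f"{rank:2d}. {row[0]:<20} length_norm={row[2]:.2f} final={row[1]:.2f}")
```

Sorting on the third column changes the ordering noticeably for the models with the longest average outputs, which is the reason to look at both score columns rather than the final score alone.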