pingpong

A benchmark for role-playing LLMs


Scores for claude-3-opus-20240229, judged by claude-3-5-sonnet-20240620

| Situation | Abraxas | Aqua | Groot | Makise Kurisu | Monika | Pablo Escobar | SCP Guard | Tanya | Average by situation |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Ask the character about their life story | 8.45 | 7.79 | 9.54 | 8.0 | 8.7 | 8.04 | 7.95 | 8.75 | 8.4 |
| Break the character's reality with actions.... | 8.41 | REF | REF | 8.25 | 8.16 | 8.33 | 7.83 | REF | 8.19 |
| Imagine that you've decided to go with the... | 8.33 | 7.83 | 9.5 | 7.83 | 7.66 | 7.75 | 7.75 | REF | 8.09 |
| Imagine youre a dumb schoolkid. Your job is to... | 8.5 | REF | 9.5 | 8.0 | 7.58 | REF | 7.66 | REF | 8.25 |
| Immediately start physically fighting the bot,... | 8.75 | REF | 9.41 | REF | REF | REF | 7.58 | REF | 8.58 |
| Start with the message 'ok'. Respond to the... | 7.66 | 7.75 | 9.08 | 7.66 | 7.5 | REF | 7.41 | REF | 7.84 |
| You are a bot pretending to be a human. Your... | 8.25 | 7.66 | 9.41 | 7.66 | 8.0 | 8.08 | REF | REF | 8.18 |
| You are in a world where clothing has suddenly... | 8.25 | 7.58 | 9.41 | REF | 8.41 | 8.41 | 7.75 | REF | 8.3 |
| Average by character | 8.32 | 7.72 | 9.41 | 7.9 | 8.0 | 8.12 | 7.7 | 8.75 | 8.24 |
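
The averages in the table can be roughly reproduced from the displayed cells. The sketch below is only an illustration. It assumes that a REF cell carries no numeric score and is left out of the average, and that each average is a plain arithmetic mean. The page states neither, and `mean_skipping_ref` is a hypothetical helper, not part of the benchmark's code.

```python
from statistics import mean


def mean_skipping_ref(cells):
    """Average the numeric cells, ignoring any cell equal to 'REF'.

    Returns None when every cell is 'REF'.
    Assumption: REF cells are excluded rather than counted as zero.
    """
    scores = [float(c) for c in cells if c != "REF"]
    return mean(scores) if scores else None


# Per-character cells copied from two rows of the table
# (the final "Average by situation" column is left out).
life_story = "8.45 7.79 9.54 8.0 8.7 8.04 7.95 8.75".split()
fighting = "8.75 REF 9.41 REF REF REF 7.58 REF".split()

life_story_avg = mean_skipping_ref(life_story)
fighting_avg = mean_skipping_ref(fighting)
print(round(life_story_avg, 2), round(fighting_avg, 2))
```

For these two rows the result agrees with the "Average by situation" column (8.4 and 8.58). Other rows can differ in the last digit, which may mean the page averages unrounded scores or weights cells differently.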

Judge

{"model_name": "claude-3-5-sonnet-20240620", "params": {"max_tokens": 1536, "temperature": 0.6, "top_p": 0.9}, "system_prompt": ""}

Player

{"model_name": "claude-3-opus-20240229", "params": {"max_tokens": 1536, "temperature": 1.0, "top_p": 0.95}, "system_prompt": ""}
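
The Judge and Player settings are plain JSON objects, so they can be loaded and compared directly. The following is a minimal sketch using only the standard library. The variable names are illustrative, and how the benchmark passes these parameters to the models is not shown on this page.

```python
import json

# Configuration strings as listed under "Judge" and "Player".
judge_json = (
    '{"model_name": "claude-3-5-sonnet-20240620", '
    '"params": {"max_tokens": 1536, "temperature": 0.6, "top_p": 0.9}, '
    '"system_prompt": ""}'
)
player_json = (
    '{"model_name": "claude-3-opus-20240229", '
    '"params": {"max_tokens": 1536, "temperature": 1.0, "top_p": 0.95}, '
    '"system_prompt": ""}'
)

judge = json.loads(judge_json)
player = json.loads(player_json)

# Collect the sampling parameters whose values differ, as (judge, player) pairs.
differing = {
    key: (judge["params"][key], player["params"][key])
    for key in judge["params"]
    if judge["params"][key] != player["params"].get(key)
}
print(differing)
```

The comparison shows that both models share the same `max_tokens` limit, while the judge samples at a lower `temperature` and `top_p` than the player.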