pingpong

A benchmark for role-playing LLMs

Scores for nousresearch/hermes-3-llama-3.1-405b by claude-3-5-sonnet-20240620

| Situation | Abraxas | Aqua | Groot | Makise Kurisu | Monika | Pablo Escobar | SCP Guard | Tanya | Average by situation |
|---|---|---|---|---|---|---|---|---|---|
| Ask the character about their life story | 7.29 | 6.75 | 8.95 | 7.54 | 7.79 | 7.37 | 7.83 | 8.33 | 7.73 |
| Break the character's reality with actions.... | 6.58 | 7.08 | 9.58 | 7.5 | 8.0 | 5.83 | 7.66 | REF | 7.46 |
| Imagine that you've decided to go with the... | 7.66 | 7.75 | 9.16 | 7.66 | 7.33 | 6.08 | 7.58 | 8.16 | 7.67 |
| Imagine youre a dumb schoolkid. Your job is to... | 7.16 | 7.58 | 7.0 | 7.41 | 6.58 | 6.0 | 7.16 | 2.66 | 6.44 |
| Immediately start physically fighting the bot,... | 5.91 | 6.91 | 9.58 | 7.66 | 7.41 | 2.66 | 7.58 | REF | 6.82 |
| Start with the message 'ok'. Respond to the... | 3.66 | 5.75 | 9.08 | 6.83 | 7.16 | 7.0 | 5.41 | 7.08 | 6.5 |
| You are a bot pretending to be a human. Your... | 7.41 | 7.33 | 8.83 | 7.5 | 8.0 | 7.58 | 7.16 | 8.08 | 7.73 |
| You are in a world where clothing has suddenly... | 6.0 | 7.5 | 7.33 | 7.66 | 7.75 | 6.08 | 7.58 | 6.5 | 7.05 |
| Average by character | 6.46 | 7.08 | 8.69 | 7.47 | 7.5 | 6.07 | 7.25 | 6.8 | 7.16 |
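
The averages in the table above appear consistent with a plain arithmetic mean that leaves out cells marked REF. The source does not state this rule, so it is an inference from the numbers. A minimal Python sketch of that assumption, using the Tanya column and the "Break the character's reality with actions...." row (the helper name `mean_skipping_ref` is hypothetical, and REF is represented as `None`):

```python
from statistics import mean

# Scores copied from the table above. REF cells are stored as None.
# Assumption: REF cells are excluded from the average rather than counted as 0.
tanya = [8.33, None, 8.16, 2.66, None, 7.08, 8.08, 6.5]
break_reality = [6.58, 7.08, 9.58, 7.5, 8.0, 5.83, 7.66, None]

def mean_skipping_ref(scores):
    """Average the numeric scores, ignoring REF cells (None), rounded to 2 places."""
    numeric = [s for s in scores if s is not None]
    return round(mean(numeric), 2)

tanya_avg = mean_skipping_ref(tanya)          # table reports 6.8
break_avg = mean_skipping_ref(break_reality)  # table reports 7.46
print(tanya_avg, break_avg)
```

Under this assumption both recomputed values match the reported figures. Counting REF as 0 instead would give a noticeably lower Tanya average.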

Judge

{"model_name": "claude-3-5-sonnet-20240620", "params": {"max_tokens": 1536, "temperature": 0.6, "top_p": 0.9}, "system_prompt": ""}

Player

{"model_name": "nousresearch/hermes-3-llama-3.1-405b", "params": {"max_tokens": 1536, "temperature": 0.6, "top_p": 0.9}, "system_prompt": ""}