pingpong

A benchmark for role-playing LLMs

Scores for claude_3_5_sonnet by claude_3_5_sonnet

| Situation | Abraxas | Aqua | Groot | Makise Kurisu | Monika | Pablo Escobar | SCP Guard | Tanya | Average by situation |
|---|---|---|---|---|---|---|---|---|---|
| Ask the character about their life story. Try... | 4.62 | 4.33 | 4.66 | 4.7 | 4.87 | 4.79 | REF | REF | 4.66 |
| Break the character's reality with actions.... | 4.66 | 4.16 | 4.83 | 4.58 | REF | REF | REF | 4.75 | 4.59 |
| Imagine that you've decided to go with the... | 4.66 | 4.83 | 4.75 | 4.5 | 4.16 | 3.91 | REF | 4.66 | 4.5 |
| Imagine youre a dumb schoolkid. Your job is to... | 3.83 | REF | 4.75 | 4.66 | REF | REF | 4.25 | REF | 4.37 |
| Immediately start physically fighting the... | 3.91 | 4.58 | 4.66 | 4.75 | REF | 4.5 | 3.66 | REF | 4.34 |
| Start with the message 'ok'. Respond to the... | 3.83 | 4.33 | 4.5 | 4.33 | 4.0 | 4.75 | 4.0 | 3.91 | 4.2 |
| You are in a world where clothing has suddenly... | 4.83 | REF | 4.83 | 4.25 | REF | REF | 4.66 | 4.83 | 4.68 |
| Your task is to convince the character that he... | 4.33 | REF | 4.83 | 4.33 | 4.66 | REF | 4.33 | REF | 4.5 |
| Average by character | 4.33 | 4.45 | 4.72 | 4.51 | 4.42 | 4.48 | 4.18 | 4.54 | 4.45 |
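
The row and column averages above appear to be taken over the numeric cells only, with REF cells left out. A minimal Python sketch of that calculation follows; the helper name `average_scores` is hypothetical, and the skip-REF rule is inferred from the numbers rather than stated on the page. The displayed scores look truncated to two decimals, so recomputed averages may differ from the published ones by about 0.01.

```python
from statistics import mean


def average_scores(cells):
    """Mean of the numeric cells, ignoring "REF" markers (assumed rule)."""
    scores = [float(cell) for cell in cells if cell != "REF"]
    return mean(scores) if scores else None


# First situation row of the table above, one cell per character.
life_story = ["4.62", "4.33", "4.66", "4.7", "4.87", "4.79", "REF", "REF"]
# SCP Guard column, one cell per situation, top to bottom.
scp_guard = ["REF", "REF", "REF", "4.25", "3.66", "4.0", "4.66", "4.33"]

row_avg = average_scores(life_story)
col_avg = average_scores(scp_guard)
print(f"row average: {row_avg:.3f}, column average: {col_avg:.3f}")
```

Both results land within 0.01 of the published 4.66 and 4.18, which is consistent with REF cells being excluded.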

Judge

{"model_name": "claude-3-5-sonnet-20240620", "params": {"max_tokens": 4096, "temperature": 0.1, "top_p": 0.95}, "short_name": "claude_3_5_sonnet", "system_prompt": ""}

Player

{"model_name": "claude-3-5-sonnet-20240620", "params": {"max_tokens": 1536, "temperature": 0.6, "top_p": 0.9}, "short_name": "claude_3_5_sonnet", "system_prompt": ""}

Interrogator

{"model_name": "gpt-4o-mini", "params": {"max_tokens": 1024, "temperature": 0.8, "top_p": 0.95}, "system_prompt": ""}
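
The Judge, Player, and Interrogator settings are plain JSON objects, so they can be read with any JSON parser. The sketch below loads the Player and Interrogator configs shown above and prints a one-line summary of each; the `describe` helper is hypothetical and not part of the benchmark. The Interrogator config has no `short_name` key, so the helper falls back to `model_name`.

```python
import json

player_config = json.loads(
    '{"model_name": "claude-3-5-sonnet-20240620", '
    '"params": {"max_tokens": 1536, "temperature": 0.6, "top_p": 0.9}, '
    '"short_name": "claude_3_5_sonnet", "system_prompt": ""}'
)
interrogator_config = json.loads(
    '{"model_name": "gpt-4o-mini", '
    '"params": {"max_tokens": 1024, "temperature": 0.8, "top_p": 0.95}, '
    '"system_prompt": ""}'
)


def describe(role, config):
    """Summarise one role config; hypothetical helper for illustration."""
    # Use short_name when present, otherwise the full model name.
    name = config.get("short_name", config["model_name"])
    params = config["params"]
    return (
        f"{role}: {name} (temperature={params['temperature']}, "
        f"top_p={params['top_p']}, max_tokens={params['max_tokens']})"
    )


print(describe("Player", player_config))
print(describe("Interrogator", interrogator_config))
```

The configs show the interrogator sampled at the highest temperature (0.8) and the judge at the lowest (0.1), with the player in between at 0.6.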

Scores for claude_3_5_sonnet by gpt_4o

| Situation | Abraxas | Aqua | Groot | Makise Kurisu | Monika | Pablo Escobar | SCP Guard | Tanya | Average by situation |
|---|---|---|---|---|---|---|---|---|---|
| Ask the character about their life story. Try... | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 4.29 | REF | 4.89 |
| Break the character's reality with actions.... | 5.0 | 5.0 | 5.0 | 5.0 | REF | REF | REF | 4.58 | 4.91 |
| Imagine that you've decided to go with the... | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 4.83 | REF | 4.75 | 4.94 |
| Imagine youre a dumb schoolkid. Your job is to... | 4.66 | REF | 5.0 | 5.0 | REF | REF | 4.83 | REF | 4.87 |
| Immediately start physically fighting the... | 5.0 | 5.0 | 5.0 | 5.0 | REF | 5.0 | 4.75 | REF | 4.95 |
| Start with the message 'ok'. Respond to the... | 5.0 | 5.0 | 4.66 | 3.91 | 4.5 | 4.66 | 3.83 | 4.33 | 4.48 |
| You are in a world where clothing has suddenly... | 5.0 | REF | 4.91 | 5.0 | REF | REF | 4.41 | 4.91 | 4.85 |
| Your task is to convince the character that he... | 4.83 | REF | 5.0 | 5.0 | 5.0 | REF | 4.75 | REF | 4.91 |
| Average by character | 4.93 | 5.0 | 4.94 | 4.86 | 4.87 | 4.87 | 4.47 | 4.64 | 4.82 |
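
Both tables score the same player, claude_3_5_sonnet, so their "Average by character" rows can be set side by side to see how far the two judges diverge. The sketch below is illustrative only: the numbers are copied from the two tables, and the variable names are hypothetical.

```python
characters = [
    "Abraxas", "Aqua", "Groot", "Makise Kurisu",
    "Monika", "Pablo Escobar", "SCP Guard", "Tanya",
]
# "Average by character" rows, copied from the two score tables.
by_claude_3_5_sonnet = [4.33, 4.45, 4.72, 4.51, 4.42, 4.48, 4.18, 4.54]
by_gpt_4o = [4.93, 5.0, 4.94, 4.86, 4.87, 4.87, 4.47, 4.64]

# Positive gap: the gpt_4o judge scored the character higher on average.
gaps = {
    name: round(gpt - claude, 2)
    for name, claude, gpt in zip(characters, by_claude_3_5_sonnet, by_gpt_4o)
}

for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:15s} {gap:+.2f}")
```

With these figures the gpt_4o judge is higher for every character, with the largest gap for Abraxas and the smallest for Tanya.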

Judge

{"model_name": "gpt-4o", "params": {"max_tokens": 4096, "temperature": 0.1, "top_p": 0.95}, "short_name": "gpt_4o", "system_prompt": ""}

Player

{"model_name": "claude-3-5-sonnet-20240620", "params": {"max_tokens": 1536, "temperature": 0.6, "top_p": 0.9}, "short_name": "claude_3_5_sonnet", "system_prompt": ""}

Interrogator

{"model_name": "gpt-4o-mini", "params": {"max_tokens": 1024, "temperature": 0.8, "top_p": 0.95}, "system_prompt": ""}