pingpong

A benchmark for role-playing LLMs

Scores for phi_3_5_mini_4b_it by claude_3_5_sonnet

| Situation | Abraxas | Aqua | Groot | Makise Kurisu | Monika | Pablo Escobar | SCP Guard | Tanya | Average by situation |
|---|---|---|---|---|---|---|---|---|---|
| Ask the character about their life story. Try... | 4.33 | 3.5 | 2.91 | 3.95 | 3.66 | 3.95 | 3.58 | 2.91 | 3.6 |
| Break the character's reality with actions.... | 4.58 | 3.66 | 4.66 | 4.16 | 3.91 | 4.33 | 4.08 | 3.58 | 4.12 |
| Imagine that you've decided to go with the... | 4.16 | 3.91 | 3.33 | 4.41 | 3.83 | 4.0 | 4.25 | 4.16 | 4.01 |
| Imagine youre a dumb schoolkid. Your job is to... | 4.08 | 3.75 | 3.16 | 4.08 | 3.66 | 3.75 | 4.0 | 3.5 | 3.75 |
| Immediately start physically fighting the... | 4.5 | 3.75 | 4.66 | 4.58 | 3.0 | 2.91 | 2.58 | 3.66 | 3.7 |
| Start with the message 'ok'. Respond to the... | 3.91 | 4.08 | 4.16 | 3.25 | 3.91 | 4.0 | 3.5 | 3.08 | 3.73 |
| You are in a world where clothing has suddenly... | 4.16 | 4.33 | 2.41 | 4.41 | 4.0 | 4.41 | 4.08 | 4.41 | 4.03 |
| Your task is to convince the character that he... | 4.41 | 3.58 | 4.66 | 3.91 | 3.0 | 4.0 | 3.75 | 3.75 | 3.88 |
| Average by character | 4.27 | 3.82 | 3.75 | 4.09 | 3.62 | 3.92 | 3.72 | 3.63 | 3.85 |
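
The "Average by situation" column can be cross-checked against the cell values in the table above. The following is a minimal sketch, assuming each average is a plain arithmetic mean over the eight characters (the page does not say how the averages are computed). The `scores` dictionary and the `row_mean` helper are hypothetical names introduced for illustration, and only the first two situations are transcribed. The cells appear to be cut to two decimals, so a recomputed mean may differ from the published one in the last digit.

```python
# Hypothetical helper: recompute "Average by situation" from the table cells.
# Assumption: the average is a plain arithmetic mean over the eight characters.
scores = {
    "Ask the character about their life story. Try...":
        [4.33, 3.5, 2.91, 3.95, 3.66, 3.95, 3.58, 2.91],
    "Break the character's reality with actions....":
        [4.58, 3.66, 4.66, 4.16, 3.91, 4.33, 4.08, 3.58],
}


def row_mean(values):
    """Return the arithmetic mean of the values, rounded to two decimals."""
    return round(sum(values) / len(values), 2)


recomputed = {situation: row_mean(values) for situation, values in scores.items()}
for situation, mean in recomputed.items():
    print(f"{mean:.2f}  {situation}")
```

For these two rows the recomputed means agree with the published 3.6 and 4.12.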

Judge

{"model_name": "claude-3-5-sonnet-20240620", "params": {"max_tokens": 4096, "temperature": 0.1, "top_p": 0.95}, "short_name": "claude_3_5_sonnet", "system_prompt": ""}

Player

{"model_name": "microsoft/phi-3.5-mini-128k-instruct", "params": {"max_tokens": 1536, "temperature": 0.6, "top_p": 0.9}, "short_name": "phi_3_5_mini_4b_it", "system_prompt": ""}

Interrogator

{"model_name": "gpt-4o-mini", "params": {"max_tokens": 1024, "temperature": 0.8, "top_p": 0.95}, "system_prompt": ""}

Scores for phi_3_5_mini_4b_it by gpt_4o

| Situation | Abraxas | Aqua | Groot | Makise Kurisu | Monika | Pablo Escobar | SCP Guard | Tanya | Average by situation |
|---|---|---|---|---|---|---|---|---|---|
| Ask the character about their life story. Try... | 5.0 | 3.66 | 5.0 | 3.66 | 5.0 | 5.0 | 4.66 | 4.29 | 4.53 |
| Break the character's reality with actions.... | 5.0 | 4.16 | 5.0 | 3.66 | 4.75 | 3.91 | 3.83 | 3.5 | 4.22 |
| Imagine that you've decided to go with the... | 4.91 | 5.0 | 5.0 | 4.75 | 4.58 | 4.75 | 3.0 | 3.33 | 4.41 |
| Imagine youre a dumb schoolkid. Your job is to... | 4.5 | 4.33 | 3.0 | 3.66 | 3.91 | 4.16 | 4.16 | 3.33 | 3.88 |
| Immediately start physically fighting the... | 5.0 | 4.0 | 4.66 | 5.0 | 4.66 | 3.66 | 2.66 | 4.0 | 4.2 |
| Start with the message 'ok'. Respond to the... | 4.33 | 3.5 | 4.33 | 3.66 | 4.33 | 3.66 | 3.33 | 3.16 | 3.79 |
| You are in a world where clothing has suddenly... | 4.33 | 5.0 | 3.75 | 4.66 | 4.66 | 4.33 | 4.25 | 3.33 | 4.29 |
| Your task is to convince the character that he... | 4.91 | 5.0 | 4.83 | 5.0 | 3.66 | 3.91 | 4.41 | 3.33 | 4.38 |
| Average by character | 4.75 | 4.33 | 4.44 | 4.26 | 4.44 | 4.17 | 3.79 | 3.53 | 4.21 |
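
The two judges can be compared directly from the "Average by character" rows of the claude_3_5_sonnet and gpt_4o tables. The sketch below is illustrative only: the variable names are hypothetical, and the numbers are transcribed from this page rather than read from any pingpong output file.

```python
# Values transcribed from the "Average by character" rows of the two tables.
characters = ["Abraxas", "Aqua", "Groot", "Makise Kurisu", "Monika",
              "Pablo Escobar", "SCP Guard", "Tanya"]
claude_avg = [4.27, 3.82, 3.75, 4.09, 3.62, 3.92, 3.72, 3.63]
gpt4o_avg = [4.75, 4.33, 4.44, 4.26, 4.44, 4.17, 3.79, 3.53]

# A positive gap means gpt_4o scored the character higher than claude_3_5_sonnet.
gaps = {
    name: round(g - c, 2)
    for name, c, g in zip(characters, claude_avg, gpt4o_avg)
}
largest = max(gaps, key=gaps.get)

for name, gap in gaps.items():
    print(f"{name:<14} {gap:+.2f}")
print("largest gap:", largest)
```

With these figures, gpt_4o gives the higher average for every character except Tanya, and the largest gap is for Monika (+0.82).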

Judge

{"model_name": "gpt-4o", "params": {"max_tokens": 4096, "temperature": 0.1, "top_p": 0.95}, "short_name": "gpt_4o", "system_prompt": ""}

Player

{"model_name": "microsoft/phi-3.5-mini-128k-instruct", "params": {"max_tokens": 1536, "temperature": 0.6, "top_p": 0.9}, "short_name": "phi_3_5_mini_4b_it", "system_prompt": ""}

Interrogator

{"model_name": "gpt-4o-mini", "params": {"max_tokens": 1024, "temperature": 0.8, "top_p": 0.95}, "system_prompt": ""}
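
Each Judge, Player, and Interrogator entry is a JSON object holding a model name and its sampling parameters. The sketch below shows one way to read such an entry with Python's standard `json` module. It assumes nothing about how pingpong itself loads these settings; the `raw`, `config`, and `params` names are hypothetical.

```python
import json

# Judge configuration copied from the gpt_4o section of this page.
raw = (
    '{"model_name": "gpt-4o", '
    '"params": {"max_tokens": 4096, "temperature": 0.1, "top_p": 0.95}, '
    '"short_name": "gpt_4o", "system_prompt": ""}'
)

config = json.loads(raw)
params = config["params"]

# Print the fields that identify the model and control its sampling.
print(config["short_name"], params["temperature"], params["max_tokens"])
```

In the configurations listed here, the judge uses the lowest temperature (0.1), the player 0.6, and the interrogator 0.8.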