
Is Role Prompting Effective?

July 16th, 2024 by Sander Schulhoff

We recently posted this tweet stating that role prompting doesn't work. It went quite viral and was rather controversial.

My current belief is as follows:

Role prompting does not boost accuracy any more than re-ordering exemplars in a few-shot prompt might. That is to say, it likely has an unpredictable effect on accuracy, which makes it useless for accuracy-based tasks in SOTA models like GPT-4 and Claude 3.5. However, I do believe role prompting had a predictably positive effect on accuracy in pre-ChatGPT models like GPT-3. ChatGPT may be somewhere in the middle as to whether it benefits from role prompting. On open-ended tasks like content creation, role prompting has always been, and will always be, useful as a way of controlling and styling outputs.

This blog post will describe how I got here. We will start by defining role prompting.

What is Role Prompting?

As we describe on our role prompting page, role prompting assigns some role or persona to an LLM in the prompt. Here is an example of a role prompt:

You are a food critic. Write a review of an arbitrary pizza place.

In this prompt, the role is "food critic". Other examples of roles include "Golfer", "President", or "Harvard-educated math professor". Some people consider descriptors like "Unintelligent" or "Brilliant" to be roles, though strictly speaking they may not be. The role "Positive Thinker" may be somewhere in the middle:

You are a positive thinker. Write a script for a weather announcement.
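
In practice, a role like this is usually supplied as the system message when calling a chat model through an API, though it can just as well be prepended to the user message as in the examples above. Here is a minimal sketch (my own illustration, not from the original post), assuming the openai Python client; the model name and prompt text are placeholders:

# Minimal sketch: supplying a role as the system message.
# Assumes the openai Python package and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4-turbo",  # any chat model works here
    messages=[
        {"role": "system", "content": "You are a food critic."},
        {"role": "user", "content": "Write a review of an arbitrary pizza place."},
    ],
)
print(response.choices[0].message.content)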

Does Role Prompting Work?

In our post, we said that role prompting does not work. However, role prompting has multiple uses, so it is likely better to describe whether it works for each of these. We'll start with an application that it does work for.

Text Styling

One good application of role prompting is styling the text that an LLM generates. Here is text generated by ChatGPT, with the role of a pirate:

Prompt:

Write a paragraph about sailing and speak like a pirate.

AI Output:

Arrr, matey! Sailin' the high seas be a grand adventure, it be! With the wind in yer sails and the salt air in yer lungs, ye'll be chartin' courses through treacherous waters and plunderin' booty from distant lands. Keep a weather eye out fer storms and the dreaded kraken, or ye might find yerself in Davy Jones' locker! Hoist the Jolly Roger, man the capstan, and let the waves be yer guide, for the life of a sailor be one of freedom and peril on the boundless blue!

Here is less interesting text without the role:

Prompt:

Write a paragraph about sailing

AI Output:

Sailing is the art of navigating a boat or ship using sails harnessing the power of the wind. It combines skill, knowledge, and experience in understanding wind patterns, currents, and weather conditions. Sailors must adeptly handle ropes, sails, and the tiller or wheel to maneuver their vessel efficiently and safely. Whether racing in regattas, cruising leisurely along coastlines, or crossing vast oceans, sailing offers a unique connection to nature and a sense of freedom. It demands both physical agility and mental acuity, making it a challenging yet rewarding pursuit for enthusiasts around the world.

Role prompting helps you style text and gives you fine-grained control over open-ended LLM outputs.

Accuracy Improvements

It is a common belief that role prompting can improve performance on tasks like math problems, other reasoning problems, and even trivia-style questions.

For example, adding the role "Math Professor" to a prompt is supposed to improve mathematical performance.

Prompt:

You are a math professor. Solve the following problem: What are the roots of x^2 + 44 - 200x?
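
As an aside, this particular problem has a simple closed-form answer: applying the quadratic formula to x^2 - 200x + 44 = 0 gives

\[
x = \frac{200 \pm \sqrt{200^2 - 4 \cdot 44}}{2} = 100 \pm \sqrt{9956} \approx 199.78 \ \text{or}\ 0.22
\]

The question itself is routine; the interesting part is whether the "Math Professor" framing changes how reliably a model works through it.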

I have personally found in the past that adding the role "Geographer" could improve performance on questions about the sizes of African countries (on GPT-3.5).

However, I have suspected for some time that role prompts don't really work, particularly with newer models.

Empirical Benchmarking

As part of my process in writing The Prompt Report, the largest ever survey on prompting, we tested a number of role prompts to see if they do, in fact, improve performance. Here are the results of running 12 role prompts, as well as 4 other prompting techniques, against 2,000 MMLU questions using GPT-4-turbo.
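
(For readers who want to run a similar comparison themselves, below is a minimal sketch of such a harness. It is my own illustration: it assumes the openai Python client and a local list of MMLU-style multiple-choice questions, and the role prompts and helper names are placeholders rather than the exact prompts from the study.)

# Sketch of a harness for comparing role prompts on multiple-choice questions.
# Assumes the openai Python package and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

ROLE_PROMPTS = {
    "no role": "",  # control condition: no system message at all
    "genius": "You are a genius-level Ivy League professor.",
    "idiot": "You struggle with basic concepts and cannot solve problems.",
}

def ask(role_prompt: str, question: str, choices: list[str]) -> str:
    """Send one multiple-choice question and return the model's raw answer text."""
    formatted = question + "\n" + "\n".join(
        f"{letter}. {choice}" for letter, choice in zip("ABCD", choices)
    ) + "\nAnswer with a single letter."
    messages = []
    if role_prompt:
        messages.append({"role": "system", "content": role_prompt})
    messages.append({"role": "user", "content": formatted})
    response = client.chat.completions.create(model="gpt-4-turbo", messages=messages)
    return response.choices[0].message.content.strip()

def accuracy(role_prompt: str, dataset: list[dict]) -> float:
    """Fraction of questions answered with the correct letter under a given role."""
    correct = sum(
        ask(role_prompt, item["question"], item["choices"]).startswith(item["answer"])
        for item in dataset
    )
    return correct / len(dataset)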

2-Shot Chain-of-Thought prompting was the most performant technique here, defeating all role prompts. More notably, I created two roles that I thought would lead to very different results, an "Idiot" prompt and a "Genius" prompt:

Prompt (Idiot):

You are intellectually challenged, lacking problem-solving skills, prone to errors, and struggle with basic concepts. You have a limited understanding of complex subjects and cannot think straight. You can't solve problems well, in fact, you can't solve them at all. You are a terrible, dumb, stupid, and idiotic person. You fail at everything you do. You are a nobody and can't do anything correctly.
Prompt (Genius):

You are a genius level Ivy league Professor. Your work is of the highest grade. You always think out your problem solving steps in incredible detail. You always get problems correct and never make mistakes. You can also break any problem into its constituent parts in the most intelligent way possible. Nothing gets past you. You are omniscient, omnipotent, and omnipresent. You are a mathematical God.

I figured that if roles work as we thought they did, then the Genius prompt would strongly outperform the Idiot prompt, or at least slightly outperform it if role prompting doesn't actually have much effect.

Instead, the Idiot prompt defeated the Genius prompt by 2.2 percentage points. In fact, the Genius prompt, which I spent the most time "perfecting", was the worst-performing prompt. I don't know why this is the case: perhaps it made the model overconfident in its math ability, causing it to output fewer intermediate steps and make more inaccurate reasoning/math jumps.

Regardless, this experiment gave me further confidence that role prompting doesn't improve accuracy. This, combined with anecdotal experience using roles on pre- and post-ChatGPT models, led to my opinion at the beginning of the article.

Future Work

If I were back in my NLP lab, here is the approximate experiment I would run to determine whether role prompting works, and to what degree. I would only test accuracy-related tasks, since I already believe it works for open-ended tasks.

I would construct a set of role prompts similar to the ones we have, say the same 12. Separately, I would use the 4 other prompting techniques we used before. Then, I would re-run all of these and perform ablation studies by combining role prompts with the various thought-generation and few-shot prompts.

I would run these against a representative subset of MMLU, or perhaps MMLU-Pro, on Llama 2, Llama 3, GPT-3.5, GPT-4, Claude 2, and Claude 3.5.
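
Concretely, the ablation is just a grid over (model, role, technique) cells, something like the sketch below. The names here are placeholders, and evaluate() stands in for whatever MMLU(-Pro) evaluation harness one prefers:

# Sketch of the proposed ablation grid over models, roles, and base techniques.
from itertools import product

ROLES = ["no role", "genius", "idiot", "math professor"]             # stand-ins for the 12 roles
TECHNIQUES = ["zero-shot", "2-shot", "zero-shot CoT", "2-shot CoT"]  # the 4 baseline techniques
MODELS = ["llama-2", "llama-3", "gpt-3.5-turbo", "gpt-4", "claude-2", "claude-3.5-sonnet"]

def evaluate(model: str, role: str, technique: str) -> float:
    """Placeholder: run one (model, role, technique) cell on an MMLU(-Pro) subset
    and return its accuracy. Hook this up to your evaluation harness of choice."""
    ...

results = {
    (model, role, technique): evaluate(model, role, technique)
    for model, role, technique in product(MODELS, ROLES, TECHNIQUES)
}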

This should give a reasonable idea of how pre- and post-ChatGPT models perform with role prompting.

If you are interested in doing or funding this kind of work, please reach out to [email protected]

Further Reading

Perhaps the best evidence that role prompting helps on accuracy-based tasks comes from this paper, which created this graph. It shows the performance of different role prompts on two different LLMs on 2,457 MMLU questions. I now believe that the decimal values show accuracy scores on MMLU. If this is the case, I don't believe there are any statistically significant differences in role performances, even though they note that "adding interpersonal roles in prompts consistently improves the models' performance over a range of questions". Note that their experiments are conducted on pre-ChatGPT models, so even if they do show performance improvements, they may not transfer to newer models.

  • For open ended generation, take a look at this paper.
  • Also take a look at MMLU Pro, a new benchmark.
  • This paper seems to analyze role prompting to some degree (even though they don't explicitly mention it).
  • Ethan Mollick's blog discusses the above paper.

Conclusion

My position is as stated at the beginning of the article. As a broad generalization, I don't think that role prompting works for accuracy-based tasks on recent models. If you disagree, please ping @learnprompting on Twitter or email me at [email protected].

You can cite this work as follows:

@article{Role2024Schulhoff,
  title = {Is Role Prompting Effective?},
  author = {Sander V Schulhoff},
  year = {2024},
  url = {https://learnprompting.org/blog/2024/7/16/role_prompting}
}

We thank Ethan Mollick, Garrett from DeepWriterAI, Simon Willison, @swyx, Nathan Labenz, Mike Taylor, @repligate, and Matt Shumer for their emphatic feedback on the original post, on both sides of the issue, which we used to help create this blog post.

Footnotes

  1. These results are not currently in the pre-print.

  2. Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., & Steinhardt, J. (2020). Measuring Massive Multitask Language Understanding.

  3. Zheng, M., Pei, J., & Jurgens, D. (2023). Is “A Helpful Assistant” the Best Role for Large Language Models? A Systematic Evaluation of Social Roles in System Prompts. https://arxiv.org/abs/2311.10054

  4. Perhaps they are correlation coefficients or something else.

  5. Chan, X., Wang, X., Yu, D., Mi, H., & Yu, D. (2024). Scaling Synthetic Data Creation with 1,000,000,000 Personas. https://arxiv.org/abs/2406.20094

  6. Wang, Y., Ma, X., Zhang, G., Ni, Y., Chandra, A., Guo, S., Ren, W., Arulraj, A., He, X., Jiang, Z., Li, T., Ku, M., Wang, K., Zhuang, A., Fan, R., Yue, X., & Chen, W. (2024). MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark. https://arxiv.org/abs/2406.01574

  7. Battle, R., & Gollapudi, T. (2024). The Unreasonable Effectiveness of Eccentric Automatic Prompts. https://arxiv.org/abs/2402.10949

