Is Role Prompting Effective?
July 16th, 2024 by Sander Schulhoff
We recently posted this tweet stating that Role Prompting doesn't work. It went quite viral and was rather controversial.
My current belief is as follows:
Role prompting does not boost accuracy any more than re-ordering the exemplars in a few-shot prompt might. That is to say, it likely has an unpredictable effect on accuracy, which makes it useless for accuracy-based tasks in SOTA models like GPT-4 and Claude 3.5. However, I do believe role prompting had a predictably positive effect on accuracy in pre-ChatGPT models like GPT-3. ChatGPT itself may be somewhere in the middle as to whether it benefits from role prompting. On open-ended tasks like content creation, role prompting has always been, and will always be, useful as a way of controlling and styling outputs.
This blog post will describe how I got here. We will start by defining role prompting.
What is Role Prompting?
As we describe on our role prompting page, role prompting assigns some role or persona to an LLM in the prompt. Here is an example of a role prompt:
You are a food critic. Write a review of an arbitrary pizza place.
In this prompt, the role is "food critic". Other examples of roles include "Golfer", "President", or "Harvard-educated math professor". Some people consider descriptors like "Unintelligent" or "Brilliant" to be roles, though strictly speaking they may not be. The role "Positive Thinker" may be somewhere in the middle:
You are a positive thinker. Write a script for a weather announcement.
Does Role Prompting Work?
In our post, we said that role prompting does not work. However, role prompting has multiple uses, so it is better to ask whether it works for each of them. We'll start with an application for which it does work.
Text Styling
One good application of role prompting is styling the text that an LLM generates. For example, giving ChatGPT the role of a pirate makes it write in a colorful pirate voice, while the same request without the role produces much less interesting text.
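In code, this kind of styling is just a role-assigning system message prepended to the request. Here is a minimal sketch using the OpenAI Python SDK; the model name, role, and task are illustrative assumptions, not the exact ones behind the examples above:

```python
# A minimal sketch of role-based styling with the OpenAI Python SDK (v1+).
# Assumes OPENAI_API_KEY is set; the model and prompts are illustrative.
from openai import OpenAI

client = OpenAI()

def generate(role: str | None, task: str) -> str:
    """Run `task`, optionally prefixed with a role-assigning system message."""
    messages = []
    if role:
        messages.append({"role": "system", "content": f"You are {role}."})
    messages.append({"role": "user", "content": task})
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=messages,
    )
    return response.choices[0].message.content or ""

task = "Write a short paragraph about today's weather."
print(generate("a pirate", task))  # styled, pirate-voiced output
print(generate(None, task))        # plain output for comparison
```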
Role prompting helps style text and gives you fine-grained control over open-ended LLM outputs.
Accuracy Improvements
It is a common belief that role prompting can improve performance on math problems, other reasoning tasks, and even trivia-related questions.
For example, adding the role "Math Professor" to a prompt is supposed to improve mathematical performance.
I have personally found in the past that adding the role "Geographer" could improve performance on questions about African country sizes (on GPT-3.5).
However, I have suspected for some time that role prompts don't really work, particularly with newer models.
Empirical Benchmarking
As part of my process in writing The Prompt Report, the largest ever survey on prompting, we tested a number of role prompts to see if they do, in fact, improve performance. We ran 12 role prompts, as well as 4 other prompting techniques, against 2,000 MMLU questions using GPT-4-turbo.
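For concreteness, here is a minimal sketch of what such an evaluation loop can look like. It reuses the hypothetical `generate` helper from the styling sketch above; the roles, question format, and answer-extraction heuristic are illustrative assumptions, not our actual harness:

```python
# A minimal sketch of the evaluation loop, not our exact harness. Reuses the
# hypothetical `generate` helper from the styling sketch; the roles, question
# format, and answer-extraction regex are illustrative assumptions.
import re

ROLES = ["a genius professor", "an idiot", "a helpful assistant", None]

def format_question(item: dict) -> str:
    choices = "\n".join(
        f"({letter}) {text}" for letter, text in zip("ABCD", item["choices"])
    )
    return f"{item['question']}\n{choices}\nAnswer with a single letter."

def extract_answer(output: str) -> str | None:
    match = re.search(r"\b([ABCD])\b", output)
    return match.group(1) if match else None

def accuracy(role: str | None, items: list[dict]) -> float:
    correct = sum(
        extract_answer(generate(role, format_question(item))) == item["answer"]
        for item in items
    )
    return correct / len(items)

# Tiny stand-in for a real MMLU subset loaded via e.g. the `datasets` library.
sample = [{
    "question": "What is the capital of France?",
    "choices": ["Berlin", "Madrid", "Paris", "Rome"],
    "answer": "C",
}]
for role in ROLES:
    print(role, accuracy(role, sample))
```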
2-Shot Chain-of-Thought prompting was the most performant technique here, defeating all role prompts. More notably, I created two roles that I thought would lead to very different results: an "Idiot" prompt and a "Genius" prompt.
I figured that if roles work as we thought they did, then the Genius prompt would strongly outperform the Idiot prompt, or at least slightly outperform it if role prompting doesn't actually have much effect.
Instead, the Idiot prompt defeated the Genius prompt by 2.2 percentage points. In fact, the Genius prompt, which I spent the most time "perfecting", was the worst-performing prompt. I don't know why this is the case: perhaps it made the model overconfident in its math ability, causing it to output fewer intermediate steps and make larger, less accurate reasoning jumps.
Regardless, this experiment gave me further confidence that role prompting doesn't improve accuracy. This, combined with anecdotal experience using roles on pre- and post-ChatGPT models, led to the opinion stated at the beginning of this article.
Future Work
If I were back in my NLP lab, here is the approximate experiment I would run to determine whether role prompting works, and to what degree. I would only test accuracy-related tasks--I already believe it works for open-ended tasks.
I would construct a set of role prompts similar to the ones we tested--say, the same 12. Separately, I would use the 4 other prompting techniques from before. Then I would re-run all of these and perform ablation studies, combining the role prompts with the various thought-generation and few-shot prompts.
I would run these against a representative subset of MMLU, or perhaps MMLU-Pro, on Llama 2, Llama 3, GPT-3.5, GPT-4, Claude 2, and Claude 3.5.
This should give a reasonable idea of how pre- and post-ChatGPT models perform with role prompting.
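To make the size of that ablation concrete, here is a sketch of the experimental grid, with hypothetical stand-in labels for the roles, techniques, and model versions:

```python
# A sketch of the ablation grid described above; every label below is a
# hypothetical stand-in for the actual roles, techniques, and model versions.
from itertools import product

ROLES = [None, "genius professor", "idiot", "geographer"]  # up to the same 12
TECHNIQUES = ["vanilla", "zero-shot-CoT", "2-shot", "2-shot-CoT"]
MODELS = ["llama-2", "llama-3", "gpt-3.5", "gpt-4", "claude-2", "claude-3.5"]

grid = list(product(MODELS, TECHNIQUES, ROLES))
print(f"{len(grid)} (model, technique, role) cells to evaluate")
# Each cell would be scored with an accuracy function like the one sketched
# earlier, on the same fixed MMLU or MMLU-Pro subset for comparability.
```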
If you are interested in doing or funding this kind of work, please reach out to [email protected].
Further Reading
Perhaps the best evidence that role prompting helps on accuracy-based tasks comes from this paper, which includes a graph showing the performance of different role prompts on two different LLMs across 2,457 MMLU questions. I now believe that the decimal values in that graph are accuracy scores on MMLU. If that is the case, I don't believe there are any statistically significant differences between roles, even though the authors note that "adding interpersonal roles in prompts consistently improves the models' performance over a range of questions". Note that their experiments were conducted on pre-ChatGPT models, so even if they do show performance improvements, those may not transfer to newer models.
- For open ended generation, take a look at this paper.
- Also take a look at MMLU-Pro, a new benchmark.
- This paper seems to analyze role prompting to some degree (even though they don't explicitly mention it).
- Ethan Mollick's blog discusses the above paper.
Conclusion
My position is as stated at the beginning of the article. As a broad generalization, I don't think that role prompting works for accuracy-based tasks on recent models. If you disagree, please ping @learnprompting on Twitter or email me at [email protected].
You can cite this work as follows:
@article{Role2024Schulhoff,
  title = {Is Role Prompting Effective?},
  author = {Sander V Schulhoff},
  year = {2024},
  url = {https://learnprompting.org/blog/2024/7/16/role_prompting}
}
We thank Ethan Mollick, Garrett from DeepWriterAI, Simon Willison, @swyx, Nathan Labenz, Mike Taylor, @repligate, and Matt Shumer for their emphatic feedback on the original post, on both sides of the issue, which we used to help create this blog post.
Footnotes
1. These results are not currently in the pre-print.
2. Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., & Steinhardt, J. (2020). Measuring Massive Multitask Language Understanding. https://arxiv.org/abs/2009.03300
3. Zheng, M., Pei, J., & Jurgens, D. (2023). Is "A Helpful Assistant" the Best Role for Large Language Models? A Systematic Evaluation of Social Roles in System Prompts. https://arxiv.org/abs/2311.10054
4. Perhaps they are correlation coefficients or something else.
5. Chan, X., Wang, X., Yu, D., Mi, H., & Yu, D. (2024). Scaling Synthetic Data Creation with 1,000,000,000 Personas. https://arxiv.org/abs/2406.20094
6. Wang, Y., Ma, X., Zhang, G., Ni, Y., Chandra, A., Guo, S., Ren, W., Arulraj, A., He, X., Jiang, Z., Li, T., Ku, M., Wang, K., Zhuang, A., Fan, R., Yue, X., & Chen, W. (2024). MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark. https://arxiv.org/abs/2406.01574
7. Battle, R., & Gollapudi, T. (2024). The Unreasonable Effectiveness of Eccentric Automatic Prompts. https://arxiv.org/abs/2402.10949