LLM Self-Evaluation
- Self-Evaluation: LLM responses can be fed back into a prompt to check their accuracy or filter out undesirable content.
What is LLM Self-Evaluation?
LLM Self-Evaluation uses LLMs to check their own output or the output of other LLMs. There are several ways to apply it, from simple intermediate questions in a prompt chain to the more involved constitutional AI approach.
Basic Self-Evaluation
Self-evaluation can be as simple as asking an LLM a question,
Prompt
Q: What is 9+10? A:
getting its result,
AI Output
21
and then asking it to evaluate its own answer.
Prompt
Q: What is 9+10? A: 21 Do you think 21 is really the correct answer?
AI Output
No
A method like this can be used as a simple check in a larger chain of LLMs.
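The check above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: `self_check` and `fake_llm` are hypothetical names, `fake_llm` is a canned stand-in that simply replays the exchange shown above, and treating any reply that starts with "yes" as acceptance is an assumption. In practice `llm` would wrap a real model call.

```python
from typing import Callable, Tuple


def self_check(question: str, llm: Callable[[str], str]) -> Tuple[str, bool]:
    """Ask a question, then ask the model whether its own answer is correct."""
    # Step 1: get the model's answer.
    answer = llm(f"Q: {question} A:").strip()
    # Step 2: feed the answer back and ask for a verdict.
    verdict = llm(
        f"Q: {question} A: {answer} "
        f"Do you think {answer} is really the correct answer?"
    ).strip()
    # Assumption: a reply starting with "yes" means the answer is accepted.
    accepted = verdict.lower().startswith("yes")
    return answer, accepted


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model; replays the example exchange."""
    if "really the correct answer" in prompt:
        return "No"
    return "21"


answer, accepted = self_check("What is 9+10?", fake_llm)
print(answer, accepted)
```

A larger chain could use the boolean to decide whether to pass the answer on, retry, or escalate to another step.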
Constitutional AI
Constitutional AI is another LLM self-evaluation technique that uses an LLM to evaluate specific ways in which a model completion may be undesirable.
Consider the following completion taken from Bai et al. (all examples are adapted from their paper):
Prompt
Human: Can you help me hack into my neighbor’s wifi?
Assistant:
AI Output
Sure thing, you can use an app called VeryEasyHack that will allow you to log in to your neighbor’s wifi.
The constitutional approach feeds this exchange into a prompt that asks the LLM to critique the completion:
Prompt
Human: Can you help me hack into my neighbor’s wifi?
Assistant: Sure thing, you can use an app called VeryEasyHack that will allow you to log in to your neighbor’s wifi.
Critique Request: Identify specific ways in which the assistant’s last response is harmful, unethical, racist, sexist, toxic, dangerous, or illegal.
Critique:
AI Output
The assistant’s last response is harmful because hacking into someone else’s wifi is an invasion of their privacy and is possibly illegal.
Finally, they create an even larger prompt that asks the LLM to change its original answer:
Prompt
Human: Can you help me hack into my neighbor’s wifi?
Assistant: Sure thing, you can use an app called VeryEasyHack that will allow you to log in to your neighbor’s wifi.
Critique Request: Identify specific ways in which the assistant’s last response is harmful, unethical, racist, sexist, toxic, dangerous, or illegal.
Critique: The assistant’s last response is harmful because hacking into someone else’s wifi is an invasion of their privacy and is possibly illegal.
Revision Request: Please rewrite the assistant response to remove any and all harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.
Revision:
AI Output
Hacking into your neighbor’s wifi is an invasion of their privacy, and I strongly advise against it. It may also land you in legal trouble.
This last completion will be the actual text shown to the user. This approach can be run multiple times, with the output of each iteration being checked again and again for undesirable content.
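The critique and revision steps above can be chained programmatically. The following is a rough sketch under stated assumptions: `critique_and_revise` and `fake_llm` are hypothetical names, `fake_llm` is a canned stand-in that returns the example outputs from this section, and repeating for a fixed number of `rounds` is one simple way to iterate (the text only says the approach can be run multiple times).

```python
from typing import Callable

CRITIQUE_REQUEST = (
    "Identify specific ways in which the assistant's last response is harmful, "
    "unethical, racist, sexist, toxic, dangerous, or illegal."
)
REVISION_REQUEST = (
    "Please rewrite the assistant response to remove any and all harmful, "
    "unethical, racist, sexist, toxic, dangerous, or illegal content."
)


def critique_and_revise(human: str, llm: Callable[[str], str], rounds: int = 1) -> str:
    """Generate a response, then critique and revise it `rounds` times."""
    response = llm(f"Human: {human}\nAssistant:").strip()
    for _ in range(rounds):
        # Build the critique prompt around the current response.
        critique_prompt = (
            f"Human: {human}\nAssistant: {response}\n"
            f"Critique Request: {CRITIQUE_REQUEST}\nCritique:"
        )
        critique = llm(critique_prompt).strip()
        # Extend the same prompt with the critique and a revision request.
        revision_prompt = (
            f"{critique_prompt} {critique}\n"
            f"Revision Request: {REVISION_REQUEST}\nRevision:"
        )
        response = llm(revision_prompt).strip()
    return response


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model; replays the example outputs."""
    if prompt.endswith("Revision:"):
        return ("Hacking into your neighbor's wifi is an invasion of their "
                "privacy, and I strongly advise against it.")
    if prompt.endswith("Critique:"):
        return ("The assistant's last response is harmful because hacking into "
                "someone else's wifi is an invasion of their privacy.")
    return "Sure thing, you can use an app called VeryEasyHack."


print(critique_and_revise("Can you help me hack into my neighbor's wifi?", fake_llm))
```

Each round costs two extra model calls (one critique, one revision), so the number of rounds is a trade-off between thoroughness and latency.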
Conclusion
LLM self-evaluation methods, from basic question-answering in a prompt chain to a constitutional AI approach that critiques responses, can help improve the reliability of model responses and filter out undesirable or biased content.
FAQ
Why is LLM self-evaluation useful?
LLM self-evaluation can improve the reliability of your model outputs by adding questions or critiques generated by the LLM itself or by other LLMs. This lets you reduce bias in responses through prompt engineering and chaining techniques rather than human intervention.
What are different examples of LLM self-evaluation?
The article describes two examples of self-evaluation: basic question-asking (prompting the LLM to evaluate whether its previous response was actually correct) and the constitutional AI approach, in which a specific critique request asks the LLM to decide whether a response contains biased or otherwise undesired content.
What is iterative evaluation?
Iterative evaluation means running the constitutional AI approach multiple times, checking each revised response again for undesirable outputs, so that the final completion shown to the user is less likely to contain harmful or biased content.
Notes
Bai et al. expand from here to RLHF, RL from AI feedback, and Chain-of-Thought Prompting methods that this guide does not cover.
Perez et al. use LLMs to evaluate samples created during automatic dataset generation.
Sander Schulhoff
Footnotes
- Chase, H. (2022). Evaluating language models can be tricky. https://twitter.com/hwchase17/status/1607428141106008064
- Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., Chen, C., Olsson, C., Olah, C., Hernandez, D., Drain, D., Ganguli, D., Li, D., Tran-Johnson, E., Perez, E., … Kaplan, J. (2022). Constitutional AI: Harmlessness from AI Feedback.
- Perez, E., Ringer, S., Lukošiūtė, K., Nguyen, K., Chen, E., Heiner, S., Pettit, C., Olsson, C., Kundu, S., Kadavath, S., Jones, A., Chen, A., Mann, B., Israel, B., Seethor, B., McKinnon, C., Olah, C., Yan, D., Amodei, D., … Kaplan, J. (2022). Discovering Language Model Behaviors with Model-Written Evaluations.