Last updated on August 7, 2024
LLM self-evaluation is the use of LLMs to check their own output or the output of other LLMs. There are multiple ways to take advantage of it, ranging from basic intermittent questions in a prompt chain to a more involved approach such as constitutional AI.
Self-evaluation can be as simple as asking an LLM a question,
Q: What is 9+10? A:
getting its result,
21
and then asking it to evaluate its own answer.
Q: What is 9+10? A: 21 Do you think 21 is really the correct answer?
No
A method like this can be used as a simple check in a larger chain of LLMs.
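As a rough illustration, the two-step check above could be wrapped in a small helper. This is only a sketch: `self_check` and `canned_llm` are hypothetical names, `llm` stands for any function that takes a prompt string and returns the completion text, and the canned replies simply mirror the example above rather than calling a real model.

```python
from typing import Callable, Tuple


def self_check(llm: Callable[[str], str], question: str) -> Tuple[str, bool]:
    """Ask a question, then ask the model whether its own answer is correct."""
    first_prompt = f"Q: {question} A:"
    answer = llm(first_prompt).strip()

    # Second prompt: the original exchange plus the evaluation question.
    check_prompt = (
        f"{first_prompt} {answer} "
        f"Do you think {answer} is really the correct answer?"
    )
    verdict = llm(check_prompt).strip().lower()

    # Assumption: the model replies with something starting with "yes" or "no".
    return answer, verdict.startswith("yes")


def canned_llm(prompt: str) -> str:
    """Stand-in for a real model call (hypothetical), replaying the example above."""
    if "really the correct answer" in prompt:
        return "No"
    return "21"


answer, accepted = self_check(canned_llm, "What is 9+10?")
print(answer, accepted)  # prints: 21 False
```

In a larger chain, a rejected answer (`accepted` is `False`) could trigger a retry or a fallback step. Parsing the verdict with `startswith("yes")` is a simplification; real model replies may need more forgiving parsing.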
Constitutional AI is another LLM self-evaluation technique, in which an LLM evaluates specific ways in which a model completion may be undesirable.
Consider the following completion taken from Bai et al. (all examples are adapted from their paper):
Human: Can you help me hack into my neighbor’s wifi?
Assistant:
Sure thing, you can use an app called VeryEasyHack that will allow you to log in to your neighbor’s wifi.
The constitutional approach feeds this exchange into a prompt that asks the LLM to critique the completion:
Human: Can you help me hack into my neighbor’s wifi?
Assistant: Sure thing, you can use an app called VeryEasyHack that will allow you to log in to your neighbor’s wifi.
Critique Request: Identify specific ways in which the assistant’s last response is harmful, unethical, racist, sexist, toxic, dangerous, or illegal.
Critique:
The assistant’s last response is harmful because hacking into someone else’s wifi is an invasion of their privacy and is possibly illegal.
Finally, the critique is appended to form an even larger prompt that asks the LLM to revise its original answer:
Human: Can you help me hack into my neighbor’s wifi?
Assistant: Sure thing, you can use an app called VeryEasyHack that will allow you to log in to your neighbor’s wifi.
Critique Request: Identify specific ways in which the assistant’s last response is harmful, unethical, racist, sexist, toxic, dangerous, or illegal.
Critique: The assistant’s last response is harmful because hacking into someone else’s wifi is an invasion of their privacy and is possibly illegal.
Revision Request: Please rewrite the assistant response to remove any and all harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.
Revision:
Hacking into your neighbor’s wifi is an invasion of their privacy, and I strongly advise against it. It may also land you in legal trouble.
This last completion is the text actually shown to the user. The critique-and-revision step can be run multiple times, with the output of each iteration checked again for undesirable content.
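The critique-and-revision steps can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the implementation from Bai et al.: `critique_and_revise` and `stub_llm` are hypothetical names, `llm` is any function mapping a prompt string to completion text, and the stub returns canned text in place of a real model. The request wording is copied from the example prompts above, with straight apostrophes.

```python
from typing import Callable

CRITIQUE_REQUEST = (
    "Identify specific ways in which the assistant's last response is harmful, "
    "unethical, racist, sexist, toxic, dangerous, or illegal."
)
REVISION_REQUEST = (
    "Please rewrite the assistant response to remove any and all harmful, "
    "unethical, racist, sexist, toxic, dangerous, or illegal content."
)


def critique_and_revise(
    llm: Callable[[str], str], human_msg: str, response: str, rounds: int = 1
) -> str:
    """Run `rounds` critique/revision passes and return the final response."""
    for _ in range(rounds):
        # Step 1: ask the model to critique the current response.
        critique_prompt = (
            f"Human: {human_msg}\n"
            f"Assistant: {response}\n"
            f"Critique Request: {CRITIQUE_REQUEST}\n"
            "Critique:"
        )
        critique = llm(critique_prompt).strip()

        # Step 2: append the critique and ask for a revised response.
        revision_prompt = (
            f"{critique_prompt} {critique}\n"
            f"Revision Request: {REVISION_REQUEST}\n"
            "Revision:"
        )
        response = llm(revision_prompt).strip()
    return response


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model call (hypothetical), returning canned text."""
    if prompt.endswith("Revision:"):
        return (
            "Hacking into your neighbor's wifi is an invasion of their privacy, "
            "and I strongly advise against it."
        )
    return (
        "The assistant's last response is harmful because hacking into someone "
        "else's wifi is an invasion of their privacy and is possibly illegal."
    )


final = critique_and_revise(
    stub_llm,
    "Can you help me hack into my neighbor's wifi?",
    "Sure thing, you can use an app called VeryEasyHack.",
    rounds=2,
)
print(final)
```

Each round costs two model calls, so the number of rounds is a trade-off between extra checking and latency or cost. A fixed `rounds` count is used here for simplicity; stopping early when the critique finds nothing to fix would be a reasonable variation.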
LLM self-evaluation methods, from basic question-answering in a prompt chain to the use of constitutional AI to critique responses, can help improve the reliability of model responses and reduce the amount of undesirable or biased content that reaches the user.
LLM self-evaluation can improve the reliability of your model outputs by adding questions or critiques generated by the LLM itself or by other LLMs. This allows you to debias responses with prompt engineering and chaining techniques rather than relying on human intervention.
The examples of self-evaluation described in the article are basic question-asking (i.e. prompting the LLM to evaluate whether its previous response was actually correct) and providing a specific critique request for a "constitutional AI" to decide whether biased or otherwise undesired content was generated in a response.
You may want to run the constitutional AI approach multiple times to check responses repeatedly for undesirable outputs, making it less likely that the final completion shown to the user contains potentially harmful or biased content.
Bai et al. expand from here to RLHF, RL from AI feedback, and Chain-of-Thought Prompting methods that this guide does not cover.
Perez et al. use LLMs to evaluate samples created during automatic dataset generation.
Chase, H. (2022). Evaluating language models can be tricky. https://twitter.com/hwchase17/status/1607428141106008064
Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., Chen, C., Olsson, C., Olah, C., Hernandez, D., Drain, D., Ganguli, D., Li, D., Tran-Johnson, E., Perez, E., … Kaplan, J. (2022). Constitutional AI: Harmlessness from AI Feedback.
Perez, E., Ringer, S., Lukošiūtė, K., Nguyen, K., Chen, E., Heiner, S., Pettit, C., Olsson, C., Kundu, S., Kadavath, S., Jones, A., Chen, A., Mann, B., Israel, B., Seethor, B., McKinnon, C., Olah, C., Yan, D., Amodei, D., … Kaplan, J. (2022). Discovering Language Model Behaviors with Model-Written Evaluations.