Self-Verification Prompting
- Error Correction: Self-Verification helps LLMs fix mistakes in multi-step reasoning by verifying conclusions against the original context.
- Dual Process: It involves generating multiple answers and verifying them by checking if conclusions match the initial conditions.
- Improved Performance: Self-Verification boosts accuracy in reasoning tasks, including commonsense reasoning, and enhances high-performing models like InstructGPT.
What is Self-Verification Prompting?
Chain-of-Thought (CoT) Prompting helps Large Language Models (LLMs) simulate the human thinking process by constructing intermediate reasoning steps before stating a conclusion. But when a task requires multiple reasoning steps, a small mistake in an early step can propagate to later steps and produce a wrong answer, and CoT has no error-correction mechanism. Some methods mitigate this issue by training a separate verifier to evaluate the accuracy of the generated response, but training such a verifier requires large amounts of human-labeled, task-specific data as well as computing resources.
Humans can self-verify their answers by using the conclusion to predict the original conditions provided in the question. If an original condition can be derived from the conclusion, the answer is likely correct. Self-Verification Prompting mimics this behavior: it evaluates the correctness of a generated response by using that response to predict the conditions in the original context.
The Self-Verification process consists of two steps:
- Forward reasoning: The LLM generates multiple candidate answers with CoT prompting, using sampling decoding.
- Backward verification: Each candidate answer from the previous step is verified by checking whether the original conditions can be predicted from it. The candidate that gets the most votes, that is, the one that correctly predicts the conditions most often, is the final answer.
How to Use Self-Verification Prompting?
Let's use Self-Verification prompting to generate an answer for the following question:
Prompt
Jackie has 10 apples. Adam has 8 apples. How many more apples does Jackie have than Adam?
Forward Reasoning
- Generate sample answers using Few-Shot Chain-of-Thought and sampling decoding. You can adjust the temperature, top-k, or top-p values to get a variety of answers.
Parameter | Supported values | Use |
---|---|---|
Temperature | Floating-point number in the range 0.0 (same as greedy decoding) to 2.0 (maximum creativity) | Higher values lead to greater variability |
Top K | Integer in the range 1 to 100 | Higher values lead to greater variability |
Top P | Floating-point number in the range 0.0 to 1.0 | Unless you change the value, this setting is not used |
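The forward-reasoning step can be sketched as follows. This is a minimal illustration, not the authors' implementation: `llm` is a hypothetical callable standing in for whatever model API you use, the few-shot exemplar is made up, and the canned completions exist only so the snippet runs without a model.

```python
import re
from collections import Counter
from typing import Callable, Optional

# Hypothetical few-shot exemplar; a real prompt would contain several.
FEW_SHOT = (
    "Q: Tom has 3 pens. He buys 2 more. How many pens does Tom have?\n"
    "A: Tom starts with 3 pens and buys 2 more, so 3+2=5. The answer is 5.\n\n"
)

def extract_answer(completion: str) -> Optional[str]:
    """Return the number that follows 'answer is', or None if there is none."""
    match = re.search(r"answer is\s*(-?\d+(?:\.\d+)?)", completion, re.IGNORECASE)
    return match.group(1) if match else None

def sample_candidates(
    question: str,
    llm: Callable[[str, float], str],  # assumed signature: (prompt, temperature) -> text
    k: int = 5,
    temperature: float = 0.7,
) -> Counter:
    """Sample k CoT completions and count the distinct final answers."""
    prompt = f"{FEW_SHOT}Q: {question}\nA:"
    answers = []
    for _ in range(k):
        answer = extract_answer(llm(prompt, temperature))
        if answer is not None:  # skip completions with no parsable answer
            answers.append(answer)
    return Counter(answers)

# Usage with canned completions standing in for real sampled outputs.
canned = iter([
    "Jackie has 10 apples, so Jackie has 10+8=18 more apples than Adam, and the answer is 18.",
    "Jackie has 10 apples, so Jackie has 10-8=2 more apples than Adam, and the answer is 2.",
    "10-8=2. The answer is 2.",
])
candidates = sample_candidates(
    "Jackie has 10 apples. Adam has 8 apples. How many more apples does Jackie have than Adam?",
    llm=lambda prompt, temperature: next(canned),
    k=3,
)
```

Each distinct key in `candidates` is one candidate answer to carry into backward verification.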
Answer A1:
Answer A2:
Backward Verification
This step consists of multiple sub-steps. Let's go through each of them:
Rewritten Candidate Conclusion
Rewrite the original question with the candidate's answer in a declarative form. You can use the following prompt template:
Please change the questions and answers into complete declarative sentences [q] The answer is [y].
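Filling the template is plain string substitution. The sketch below assumes `[q]` is the original question and `[y]` is a candidate answer; the function name is hypothetical, and the resulting prompt would still need to be sent to the LLM to obtain the declarative sentence.

```python
# Template text copied from the article; {question} and {answer} replace [q] and [y].
REWRITE_TEMPLATE = (
    "Please change the questions and answers into complete declarative sentences "
    "{question} The answer is {answer}."
)

def build_rewrite_prompt(question: str, answer: str) -> str:
    """Return the rewrite prompt for one candidate answer."""
    return REWRITE_TEMPLATE.format(question=question.strip(), answer=answer)

question = "Jackie has 10 apples. Adam has 8 apples. How many more apples does Jackie have than Adam?"
prompt_a2 = build_rewrite_prompt(question, "2")
```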
Declarative form for Answer A1:
Declarative form for Answer A2:
Rewritten Condition / Condition Masking
- Mask one of the conditions in the declarative and prepare new questions for the LLM. The new questions can be either true-false questions or questions asking the LLM to predict the masked value. Sample questions from the above declarative form:
  - Jackie has X apples, and Adam has 8 apples, so Jackie has 18 more apples than Adam. What is the value of X?
  - Jackie has X apples. Adam has 8 apples. Jackie has 2 more apples than Adam. What is the value of X?
- True-false questions are suitable for non-arithmetic tasks. We won't be using them in our demonstration, but they can be of the form:
  - Adam has 8 apples, so Jackie has 18 more apples than Adam. Jackie has a total of 10 apples. Is it correct (True or False)?
  - Adam has 8 apples. Jackie has 2 more apples than Adam. Jackie has a total of 10 apples. Is it correct (True or False)?
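For arithmetic problems, condition masking can be approximated with a regular expression. The sketch below is an assumption-laden simplification: it treats every number as a condition, assumes the conditions come before the conclusion in the declarative sentence, and uses a hypothetical helper name.

```python
import re
from typing import List, Tuple

NUMBER = re.compile(r"-?\d+(?:\.\d+)?")

def mask_conditions(declarative: str, n_conditions: int) -> List[Tuple[str, str]]:
    """Mask each of the first n_conditions numbers in turn.

    Returns (masked question, hidden value) pairs. Numbers after the first
    n_conditions (assumed to be the conclusion) are left untouched.
    """
    pairs = []
    for match in list(NUMBER.finditer(declarative))[:n_conditions]:
        masked = declarative[:match.start()] + "X" + declarative[match.end():]
        pairs.append((masked + " What is the value of X?", match.group()))
    return pairs

declarative_a2 = "Jackie has 10 apples. Adam has 8 apples. Jackie has 2 more apples than Adam."
masked_a2 = mask_conditions(declarative_a2, n_conditions=2)
```

The first pair reproduces the second sample question above, with the hidden value 10 kept aside for scoring.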
Verification
- Finally, pass the rewritten conditions to the LLM and have it predict the masked value (or true/false). For each rewritten condition, repeat the process P times (say, P = 5) and increment the score of the respective condition each time the LLM predicts the masked condition correctly.
- The answer corresponding to the condition with maximum votes is the final answer.
Here, the latter condition correctly predicts the value of X more often, and hence the answer that corresponds to this rewritten condition, i.e., "A2: Jackie has 10 apples, so Jackie has 10-8=2 more apples than Adam, and the answer is 2.", is the correct answer.
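The scoring loop can be sketched as below. To keep the snippet runnable without a model, `toy_verifier` is a hypothetical deterministic stand-in that solves X = Adam's apples + the stated difference; a real verifier would be an LLM call whose sampled outputs vary between repetitions.

```python
import re
from typing import Callable, Dict, List, Tuple

def verification_scores(
    masked: Dict[str, List[Tuple[str, str]]],  # candidate answer -> (masked question, hidden value)
    verifier: Callable[[str], str],            # assumed signature: question -> predicted value
    p: int = 5,
) -> Dict[str, int]:
    """Count, per candidate, how often the verifier recovers the hidden value."""
    return {
        answer: sum(verifier(q) == truth for q, truth in questions for _ in range(p))
        for answer, questions in masked.items()
    }

def pick_final(scores: Dict[str, int]) -> str:
    """Return the highest-scoring candidate (the first one listed on a tie)."""
    return max(scores, key=scores.get)

def toy_verifier(question: str) -> str:
    """Stand-in for the LLM: solves X = Adam's apples + stated difference."""
    adam = int(re.search(r"Adam has (\d+)", question).group(1))
    diff = int(re.search(r"has (\d+) more", question).group(1))
    return str(adam + diff)

masked = {
    "18": [("Jackie has X apples, and Adam has 8 apples, so Jackie has 18 more "
            "apples than Adam. What is the value of X?", "10")],
    "2": [("Jackie has X apples. Adam has 8 apples. Jackie has 2 more apples "
           "than Adam. What is the value of X?", "10")],
}
scores = verification_scores(masked, toy_verifier, p=5)
final = pick_final(scores)
```

With the candidate 18, the masked question implies X = 8 + 18 = 26, which contradicts the original 10, so it scores 0. With the candidate 2, X = 8 + 2 = 10 matches on all five repetitions.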
What Are Self-Verification Prompting Results?
- Self-Verification improves the performance of prior methods on all datasets. It also achieves new state-of-the-art (SOTA) performance on 6 of the 8 datasets.
- Even high-performing forward reasoning models like InstructGPT improve by an average of 2.33% when using Self-Verification, implying that models with strong forward reasoning capabilities also benefit from the Self-Verification mechanism.
- The Self-Verification technique can effectively improve the accuracy of commonsense reasoning models.
Impact of the verification stage on accuracy
Limitations of Self-Verification Prompting
- The effectiveness of Self-Verification relies on the ability of the model to generate the correct answer as one of the candidate answers. As such, smaller language models may not benefit from this technique.
- The method requires generating multiple candidate inference chains. This increases the computational cost for inference.
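A rough way to see the extra inference cost is to count model calls. The estimate below is an assumption, not a figure from the paper: it supposes one call per forward sample, one rewrite call per distinct candidate, and one call per masked condition per repetition.

```python
def estimate_llm_calls(k_samples: int, n_candidates: int, n_conditions: int, p_repeats: int) -> int:
    """Estimate total model calls for one question under the stated assumptions."""
    forward = k_samples                                # sampled CoT completions
    rewrite = n_candidates                             # one declarative rewrite per candidate
    verify = n_candidates * n_conditions * p_repeats   # masked-condition predictions
    return forward + rewrite + verify

# Apple example: 5 samples, 2 distinct candidates, 2 conditions each, P = 5.
total_calls = estimate_llm_calls(k_samples=5, n_candidates=2, n_conditions=2, p_repeats=5)
```

Under these assumptions the apple example needs 27 calls, compared with a single call for one greedy CoT answer.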
Conclusion
Just like humans, LLMs are capable of self-verifying their own answers without relying on an external trained model for answer verification. Self-verification improves the accuracy and reliability of LLMs in reasoning tasks without the need for separate models trained on human-labeled data.
Bhuwan Bhatt
Footnotes
- Wei, J., et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.
- Shen, J., Yin, Y., Li, L., Shang, L., Jiang, X., Zhang, M., & Liu, Q. (2021). Generate & Rank: A Multi-task Framework for Math Word Problems. https://arxiv.org/abs/2109.03034
- Weng, Y., et al. (2022). Large Language Models are Better Reasoners with Self-Verification. https://arxiv.org/abs/2212.09561