Skeleton-of-Thought Prompting
- Skeleton-of-Thought (SoT) prompting enhances response generation by first creating a basic structure (skeleton) and then expanding it in parallel, reducing latency.
- Two-stage process: SoT divides generation into a skeleton phase followed by a detailed expansion phase, improving efficiency and speed.
- Faster inference: SoT delivers a speed-up of more than 2x on 8 out of 12 models, which makes it attractive for latency-sensitive applications.
- Answer quality: In 60% of cases, SoT generates answers with quality comparable to or better than normal sequential generation.
- Limitations include higher token usage costs and potential quality issues when points in the skeleton are interdependent.
What is Skeleton-of-Thought Prompting?
Most state-of-the-art Large Language Models (LLMs) rely on sequential decoding, which can lead to high latency. In contrast, humans approach problem-solving by first creating an outline or skeleton of their answer, then filling in details and supporting evidence.
Skeleton-of-Thought (SoT) Prompting mimics this outline-first process. It first instructs the LLM to generate a basic answer structure (the skeleton) and then expands each point into a detailed response. To reduce latency, the point expansions run in parallel, through parallel API calls or batched decoding, instead of being decoded one after another.
How to Use Skeleton-of-Thought Prompting?
SoT generates answers in two stages:
- Skeleton stage
- Point-expanding stage
Skeleton Stage
In the skeleton stage, SoT utilizes the skeleton prompt template to generate a skeleton answer.
Prompt
[User:] You're an organizer responsible for only giving the skeleton (not the full content) for answering the question. Provide the skeleton in a list of points (numbered 1., 2., 3., etc.) to answer the question. Instead of writing a full sentence, each skeleton point should be very short with only 3~5 words. Generally, the skeleton should have 3~10 points. Now, please provide the skeleton for the following question.
{question}
Skeleton:
[Assistant:] 1.
This skeleton prompt can be fed directly to the LLM to get the skeleton answer.
Let's use SoT to gain tips on reducing carbon emissions on a personal level.
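As a minimal sketch of this stage, the snippet below fills the skeleton template with that question and splits a numbered skeleton into separate points. The helper names `build_skeleton_prompt` and `parse_skeleton` are hypothetical, and the sample completion is a hand-written placeholder, not real model output. Because the template pre-fills the assistant turn with `1.`, the sketch assumes the caller prepends `1.` to whatever the model returns.

```python
import re

# Skeleton prompt template from this section; only {question} is substituted.
SKELETON_TEMPLATE = (
    "You're an organizer responsible for only giving the skeleton (not the full "
    "content) for answering the question. Provide the skeleton in a list of points "
    "(numbered 1., 2., 3., etc.) to answer the question. Instead of writing a full "
    "sentence, each skeleton point should be very short with only 3~5 words. "
    "Generally, the skeleton should have 3~10 points. Now, please provide the "
    "skeleton for the following question.\n"
    "{question}\n"
    "Skeleton:\n"
)


def build_skeleton_prompt(question: str) -> str:
    """Fill the skeleton template with the user's question."""
    return SKELETON_TEMPLATE.format(question=question)


def parse_skeleton(text: str) -> list[tuple[int, str]]:
    """Split a numbered skeleton ("1. ...", "2. ...") into (index, point) pairs."""
    pattern = re.compile(r"^\s*(\d+)\.\s*(.+?)\s*$", flags=re.MULTILINE)
    return [(int(m.group(1)), m.group(2)) for m in pattern.finditer(text)]


question = "How can I reduce my carbon emissions on a personal level?"
prompt = build_skeleton_prompt(question)

# Placeholder completion written by hand for illustration (NOT real model output).
completion = " Use public transport\n2. Eat less meat\n3. Save energy at home"
skeleton = "1." + completion  # restore the pre-filled "1." from the assistant turn

print(parse_skeleton(skeleton))
```

The parsed `(index, point)` pairs are what the point-expanding stage consumes, one request per pair.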
Point-Expanding Stage
In this stage, SoT utilizes the point-expanding prompt template to expand the answer generated in the previous stage.
Prompt
[User:] You're responsible for continuing the writing of one and only one point in the overall answer to the following question.
{question}
The skeleton of the answer is
{skeleton}
Continue and only continue the writing of point {point index}. Write it very shortly in 1~2 sentence and do not continue with other points!
[Assistant:] {point index}.{Point skeleton}
The LLM receives the question and the skeleton from the previous stage and is asked to expand one point at a time. This is repeated for every point in the skeleton. Because each expansion is requested independently of the others, the requests can run in parallel to speed up inference.
- For LLMs with only API access, multiple parallel API requests can be sent to the provider.
- For LLMs running locally, the point-expanding requests can be processed together as a single batch.
Now, let's use the point-expanding prompt to expand our previous skeleton.
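A rough sketch of the API-access case follows: one point-expanding request per skeleton point, sent concurrently with a thread pool. The `call_llm(prompt, assistant_prefix)` callable is a hypothetical stand-in for whatever client is in use; it is assumed to return only the continuation text. The stub `fake_llm` exists only so the sketch runs without a model, and its output is a placeholder.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Point-expanding template from this section.
POINT_TEMPLATE = (
    "You're responsible for continuing the writing of one and only one point in "
    "the overall answer to the following question.\n"
    "{question}\n"
    "The skeleton of the answer is\n"
    "{skeleton}\n"
    "Continue and only continue the writing of point {index}. Write it very "
    "shortly in 1~2 sentence and do not continue with other points!"
)


def expand_skeleton(
    question: str,
    points: list[tuple[int, str]],
    call_llm: Callable[[str, str], str],  # hypothetical: (prompt, assistant_prefix) -> continuation
    max_workers: int = 8,
) -> str:
    """Expand every skeleton point concurrently and join the results in order."""
    skeleton = "\n".join(f"{i}. {p}" for i, p in points)

    def expand(item: tuple[int, str]) -> str:
        index, point = item
        prompt = POINT_TEMPLATE.format(question=question, skeleton=skeleton, index=index)
        prefix = f"{index}. {point}"  # pre-filled start of the assistant turn
        return prefix + call_llm(prompt, prefix)

    # pool.map returns results in input order, so points stay in skeleton order
    # even if the requests finish at different times.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return "\n".join(pool.map(expand, points))


def fake_llm(prompt: str, assistant_prefix: str) -> str:
    """Stub used only for this demo; a real client call would go here."""
    return " [expanded text would appear here]"


points = [(1, "Use public transport"), (2, "Eat less meat")]  # illustrative skeleton
print(expand_skeleton("How can I reduce my carbon emissions?", points, fake_llm))
```

Threads suit this case because each request mostly waits on the network. For a locally hosted model, the same prompts would instead be passed to the model as one batch.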
What Are Skeleton-of-Thought Prompting Results?
- On 8 out of 12 models, SoT obtains a speed-up of at least 2x.
Speed-up gained after employing SoT (Ning et al.)
- The quality of answers generated by SoT is comparable to or better than that of normal generation in 60% of the cases.
Quality evaluation across two metrics: FastChat and LLMZoo (Ning et al.)
Limitations of Skeleton-of-Thought Prompting
- The quality of generated answers using the SoT approach was evaluated using GPT-4 judges without any involvement of human experts. Hence, the answer quality evaluation isn't perfect.
- SoT doesn't consider dependencies between points in the skeleton. As a result, when there is interdependence between points in the skeleton, the generated detailed answer may not be comprehensive.
- LLMs accessed through an API are billed by token usage. Because every point-expanding request resends the question and the full skeleton, SoT can increase token usage and therefore cost.
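As a rough illustration of that token overhead, prompt tokens can be estimated from the two templates: one skeleton request, plus one point-expanding request per point that carries the question and the whole skeleton again. The function name and all token counts below are made-up assumptions for the arithmetic, not measurements from the paper.

```python
def sot_prompt_tokens(
    question_tokens: int,
    skeleton_tokens: int,
    num_points: int,
    skeleton_instr_tokens: int,
    expand_instr_tokens: int,
) -> int:
    """Estimate total prompt (input) tokens sent across all SoT requests."""
    # Stage 1: skeleton instructions + question.
    skeleton_request = skeleton_instr_tokens + question_tokens
    # Stage 2: each request repeats the instructions, the question and the skeleton.
    per_point_request = expand_instr_tokens + question_tokens + skeleton_tokens
    return skeleton_request + num_points * per_point_request


# Illustrative numbers only: a 20-token question, a 30-token skeleton with 5 points,
# and instruction texts of 80 and 60 tokens.
total = sot_prompt_tokens(
    question_tokens=20,
    skeleton_tokens=30,
    num_points=5,
    skeleton_instr_tokens=80,
    expand_instr_tokens=60,
)
print(total)  # 650, versus 20 prompt tokens for sending the question alone
```

Under these assumptions the prompt-token overhead grows linearly with the number of skeleton points, which is worth checking against a provider's input-token pricing before adopting SoT.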
Conclusion
SoT can more than double inference speed on most of the tested models by expanding skeleton points in parallel. It is easy to implement, requiring only the two prompt templates shown above wrapped around the original question. However, the quality of the generated response may not be optimal, so human evaluation is needed before deciding to use SoT in a production environment.
Valeriia Kuka
Valeriia Kuka, Head of Content at Learn Prompting, is passionate about making AI and ML accessible. Valeriia previously grew a 60K+ follower AI-focused social media account, earning reposts from Stanford NLP, Amazon Research, Hugging Face, and AI researchers. She has also worked with AI/ML newsletters and global communities with 100K+ members and authored clear and concise explainers and historical articles.
Footnotes
- Ning, X., et al. (2023). Skeleton-of-Thought: Prompting LLMs for Efficient Parallel Generation.