Chain of Code (CoC)
Large language models (LLMs) have demonstrated strong abilities in solving complex reasoning tasks like writing programs, answering science questions, and more. Traditional techniques such as Chain-of-Thought (CoT) prompting improve LLMs' performance on reasoning tasks by encouraging the model to break a problem down into intermediate reasoning steps expressed in natural language.
Chain-of-Thought is based on semantic reasoning that involves understanding and processing the meaning of words and sentences. While semantic reasoning works well for tasks that depend on language and context, it can struggle with precise numerical or symbolic operations.
Program of Thoughts (PoT) goes a step beyond CoT by generating code to represent its reasoning steps. The code is then executed by a code interpreter. PoT excels at tasks like arithmetic but struggles with semantic tasks that are hard to encode in code.
Chain of Code (CoC) is a framework designed to overcome these limitations by combining the strengths of both code execution and language-based reasoning.
CoC merges ideas behind CoT and PoT by having the model generate a mix of executable code and "semantic placeholders." When the code interpreter runs the program, it handles precise operations, and for parts that are ambiguous or non-executable, an LM-based simulation module, called an LMulator, uses language-based reasoning to simulate what the output should be based on the context. This dual mechanism combines the precision of code execution with the flexible reasoning of language models.
Key Differences: CoC vs. Program of Thoughts (PoT)
- PoT: The model writes a full program and relies entirely on a code interpreter to run it. It works best when all parts of the task can be explicitly coded.
- CoC: CoC extends PoT by interweaving a code interpreter with an LMulator. This allows CoC to handle parts of the problem that are inherently semantic or ambiguous by simulating the execution of non-executable code segments.
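The practical consequence of this difference is easy to see: a plain interpreter, as PoT uses, simply crashes on a semantic helper that has no implementation. Here is a minimal, hypothetical illustration (it reuses the `is_fruit()` helper discussed later in this article):

```python
# A plain interpreter (the PoT path) fails on an undefined semantic helper:
try:
    exec("object_is_fruit = is_fruit('orange')")
except NameError as err:
    print(err)  # name 'is_fruit' is not defined
```

Under CoC, this exception is exactly the point where the LMulator takes over, as described below.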
How Is CoC Built and How Does It Work?
CoC follows a two-step process:
Step 1: Code Generation
Given a problem, the LM is prompted to generate a program that lays out a step-by-step solution. The generated code is a mix of:
- Executable components: Sections that the interpreter can run (e.g., loops, arithmetic operations).
- Semantic placeholders: For parts that require semantic judgment or cannot be directly translated into code (such as checking if an item is a fruit or detecting sarcasm), the model inserts functions like `is_fruit()` or `get_country()`.
- When the interpreter encounters such a function, it cannot execute it. Instead, the LMulator steps in: it uses language-based reasoning to simulate what the output should be based on the context.
- The simulated result is then incorporated into the overall program state, allowing the process to continue seamlessly.
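In practice, the generation prompt is few-shot and task-specific, but as a rough sketch (the wording below is our own illustration, not the paper's actual prompt), the request might look like this:

```python
# Hypothetical code-generation prompt; the instruction to leave semantic
# helpers unimplemented is what produces the placeholders described above.
prompt = """
Q: I have an orange, a violin, two peaches, an apple, a pepper, and three plums.
How many fruits do I have?

Write Python code that solves this step by step. When a step needs semantic
judgment, call an undefined helper such as is_fruit(obj) instead of
implementing it.
"""
```

The fruit-counting program in the example section below is the kind of output this step produces.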
Step 2: Code Execution with an LMulator
CoC runs the code line by line. For each line:

- If executable, the interpreter updates the program state with precise calculations.
- If not executable, the LMulator simulates the expected output, and the program state is updated accordingly.

This interplay enables CoC to handle a wide range of tasks that mix numerical precision with semantic reasoning.
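The paper does not tie CoC to one implementation, but a minimal sketch of this execute-or-simulate loop might look like the following (the function names and the hard-coded simulation are our own stand-ins; a real LMulator would prompt an LM with the program, the current state, and the failing line):

```python
def simulate_with_lm(line: str, state: dict) -> None:
    """Stand-in for the LMulator. A real system would ask an LM to predict
    the state update for the failing line; here one judgment is hard-coded."""
    target = line.split("=")[0].strip()
    state[target] = True  # e.g., the LM judges is_fruit('orange') to be True

def run_chain_of_code(program: str) -> dict:
    """Run a program line by line, falling back to LM simulation on failure.
    (Multi-line constructs like loops need extra bookkeeping, omitted here.)"""
    state: dict = {}
    for line in program.splitlines():
        if not line.strip():
            continue
        try:
            exec(line, {}, state)          # precise, executable path
        except Exception:
            simulate_with_lm(line, state)  # semantic, simulated path
    return state

final_state = run_chain_of_code(
    "x = 2 + 3\n"
    "flag = is_fruit('orange')\n"  # raises NameError, so the LMulator steps in
    "y = x * 2\n"
)
print(final_state)  # {'x': 5, 'flag': True, 'y': 10}
```

Note how the program state persists across executed and simulated lines, which is what lets the two modes interleave seamlessly.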
Example: Counting Fruits

Problem
I have an orange, a violin, two peaches, an apple, a pepper, and three plums. How many fruits do I have?
Step 1: Code Generation
The model generates the following code:
```python
# Define a dictionary of objects with their counts
objects = {"orange": 1, "violin": 1, "peaches": 2, "apple": 1, "pepper": 1, "plum": 3}

# Initialize the fruit counter
num_fruits = 0

# Iterate over each object
for obj in objects:
    # Check if the object is a fruit (this may not be executable directly)
    object_is_fruit = is_fruit(obj)
    if object_is_fruit:
        num_fruits += objects[obj]

# Final answer
answer = num_fruits
```
Step 2: Code Execution with LMulator
The interpreter sets up the dictionary and initializes `num_fruits` to 0. It correctly processes the loop structure and the arithmetic for summing counts.

When the interpreter reaches `is_fruit(obj)`, it cannot execute this function because it requires semantic judgment. Here, the LMulator is invoked:

- For each object, the LMulator simulates the expected outcome based on context. For example:
  - For `"orange"`, `"peaches"`, `"apple"`, and `"plum"`, it returns `True`.
  - For `"violin"` and `"pepper"`, it returns `False`.
- These simulated values are used to update `num_fruits`.

After processing all objects with a mix of execution and simulation, the final computed value of `answer` reflects the correct count of fruits: 7 (1 orange + 2 peaches + 1 apple + 3 plums).
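As a sanity check, here is a runnable stand-in for this walkthrough. The stub `is_fruit()` below is hypothetical: it hard-codes the judgments the LMulator would produce, so the generated program can run end to end (in a real CoC system this function would stay undefined and each call would be simulated):

```python
def is_fruit(obj: str) -> bool:
    """Stub standing in for the LMulator's simulated semantic judgments."""
    simulated = {"orange": True, "violin": False, "peaches": True,
                 "apple": True, "pepper": False, "plum": True}
    return simulated[obj]

objects = {"orange": 1, "violin": 1, "peaches": 2, "apple": 1, "pepper": 1, "plum": 3}

num_fruits = 0
for obj in objects:
    if is_fruit(obj):
        num_fruits += objects[obj]

answer = num_fruits
print(answer)  # 7
```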
Find a Chain-of-Code demo and more information about using CoC in Google Colab.
How CoC Differs from Existing Techniques
| Method | Key Idea | Executes Code? | Handles Non-Executable Steps? | Improvement in Reasoning |
|---|---|---|---|---|
| Chain of Thought (CoT) | Uses step-by-step natural language reasoning | No | No | Limited on numeric/symbolic tasks |
| Program of Thoughts (PoT) | Writes and executes code for problem-solving | Yes | No | Effective primarily for arithmetic tasks |
| ScratchPad | Tracks intermediate steps through text-based simulation | No | Yes | Works only in text-based reasoning |
| Chain of Code (CoC) | Combines code execution with LM-based simulation | Yes | Yes | Integrates the best of both approaches |
Benefits and Applications
- CoC is suitable for tasks that require both detailed computation (such as mathematics) and understanding of language (such as commonsense reasoning).
- The framework is adaptable to various challenges, including numerical calculations, logical operations, and even data processing tasks.
- Improved accuracy: Experiments have shown that CoC can outperform traditional methods on benchmark datasets, especially in tasks that combine semantic and numeric reasoning.
Limitations and Future Directions
While Chain of Code offers a powerful approach, it also faces some challenges:
- Running both a code interpreter and an LMulator can be slower than direct text-based reasoning.
- Some steps might still be difficult to express in code or to simulate accurately with the LMulator.
- Future work could focus on refining how the LM simulates non-executable steps to further boost performance.
Conclusion
Chain of Code (CoC) offers a new way for language models to tackle complex reasoning tasks by merging code execution with language-based simulation. This hybrid approach allows models to solve problems more precisely by executing parts of a generated program while still reasoning through more abstract, semantic aspects.
For additional information, visit the project website.
Valeriia Kuka, Head of Content at Learn Prompting, is passionate about making AI and ML accessible. Valeriia previously grew a 60K+ follower AI-focused social media account, earning reposts from Stanford NLP, Amazon Research, Hugging Face, and AI researchers. She has also worked with AI/ML newsletters and global communities with 100K+ members and authored clear and concise explainers and historical articles.
Footnotes
- Li, C., Liang, J., Zeng, A., Chen, X., Hausman, K., Sadigh, D., Levine, S., Fei-Fei, L., Xia, F., & Ichter, B. (2024). Chain of Code: Reasoning with a Language Model-Augmented Code Emulator. https://arxiv.org/abs/2312.04474