
🟦 Chain of Code (CoC)

🟦 This article is rated medium
Reading Time: 4 minutes
Last updated on March 11, 2025

Valeriia Kuka

Large language models (LLMs) have demonstrated strong abilities in solving complex reasoning tasks, from writing programs to answering science questions. Traditional techniques such as Chain-of-Thought (CoT) prompting improve LLMs' performance on reasoning tasks by encouraging the model to break a problem down into intermediate reasoning steps expressed in natural language.

Chain-of-Thought relies on semantic reasoning, which involves understanding and processing the meaning of words and sentences. While semantic reasoning works well for tasks that depend on language and context, it can struggle with precise numerical or symbolic operations.

Program of Thoughts (PoT) goes a step beyond CoT by generating code to represent the reasoning steps. The code is then executed by a code interpreter. PoT excels at tasks like arithmetic but struggles with semantic tasks that are hard to encode in code.

Chain of Code (CoC) is a framework designed to overcome these limitations by combining the strengths of both code execution and language-based reasoning.

CoC merges the ideas behind CoT and PoT by having the model generate a mix of executable code and "semantic placeholders." When the code interpreter runs the program, it handles the precise operations; for parts that are ambiguous or non-executable, an LM-based simulation module, called an LMulator, uses language-based reasoning to simulate what the output should be based on the context. This dual mechanism combines the precision of code execution with the flexible reasoning of language models.
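As a minimal illustration (these function and variable names are hypothetical, not from the paper), a generated program might interleave the two kinds of lines:

price = 3 * 4.50               # executable: the interpreter computes 13.5
tone = detect_sarcasm(review)  # semantic placeholder: undefined, so the
                               # LMulator infers the result from context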

Key Differences: CoC vs. Program of Thoughts (PoT)

  • PoT: The model writes a full program and relies entirely on a code interpreter to run it. It works best when all parts of the task can be explicitly coded.

  • CoC: It extends PoT by interweaving a code interpreter with an LMulator. This allows CoC to handle parts of the problem that are inherently semantic or ambiguous by simulating the execution of non-executable code segments.

How Is CoC Built and How Does It Work?

CoC follows a two-step process:

Step 1: Code Generation

Given a problem, the LM is prompted to generate a program that lays out a step-by-step solution. The generated code is a mix of:

  • Executable components: Sections that the interpreter can run (e.g., loops, arithmetic operations).
  • Semantic placeholders: For parts that require semantic judgment or cannot be directly translated into code (such as checking if an item is a fruit or detecting sarcasm), the model inserts functions like is_fruit() or get_country().
    • When the interpreter encounters such a function, it cannot execute it. Instead, the LMulator steps inβ€”it uses language-based reasoning to simulate what the output should be based on the context.
    • The simulated result is then incorporated into the overall program state, allowing the process to continue seamlessly.

Step 2: Code Execution with an LMulator

CoC runs the code line by line.

For each line:

  • If executable, the interpreter updates the program state with precise calculations.

  • If not executable, the LMulator simulates the expected output, and the program state is updated accordingly.

This interplay enables CoC to handle a wide range of tasks that mix numerical precision with semantic reasoning, as the sketch below illustrates.
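Here is a minimal sketch of this execution loop in Python. The helper names (query_lm, lm_backed, run_chain_of_code) are hypothetical, not from the paper, and the paper's implementation interleaves execution and simulation line by line; this sketch approximates the same behavior by detecting undefined placeholder functions up front and routing their calls to the LM, while everything else runs on the ordinary Python interpreter:

import ast
import builtins

def query_lm(prompt: str):
    # Hypothetical LMulator call: ask a language model to predict a value.
    # Stubbed here; a real system would send `prompt` to an LM.
    raise NotImplementedError

def lm_backed(name: str):
    # Wrap a semantic placeholder so that calling it defers to the LM.
    def simulate(*args):
        return query_lm(f"In this program, what should {name}{args} return?")
    return simulate

def run_chain_of_code(program: str) -> dict:
    tree = ast.parse(program)
    called = {node.func.id for node in ast.walk(tree)
              if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)}
    defined = {node.name for node in ast.walk(tree)
               if isinstance(node, ast.FunctionDef)}
    namespace = {}
    # Calls with no definition (e.g., is_fruit) become LM-simulated functions.
    for name in called - defined - set(dir(builtins)):
        namespace[name] = lm_backed(name)
    exec(program, namespace)  # executable lines run on the real interpreter
    return namespace

Applied to the fruit-counting program in the next section, is_fruit would be routed to the LMulator, while the dictionary setup, loop, and additions execute natively.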

Example: Counting Fruits

Problem


I have an orange, a violin, two peaches, an apple, a pepper, and three plums. How many fruits do I have?

Step 1: Code Generation

The model generates the following code:

# Define a dictionary of objects with their counts
objects = {"orange": 1, "violin": 1, "peaches": 2, "apple": 1, "pepper": 1, "plum": 3}

# Initialize the fruit counter
num_fruits = 0

# Iterate over each object
for obj in objects:
    # Check if the object is a fruit (this may not be executable directly)
    object_is_fruit = is_fruit(obj)
    if object_is_fruit:
        num_fruits += objects[obj]

# Final answer
answer = num_fruits

Step 2: Code Execution with LMulator

The interpreter sets up the dictionary and initializes num_fruits to 0. It correctly processes the loop structure and arithmetic for summing counts.

When the interpreter reaches is_fruit(obj), it cannot execute this function because it requires semantic judgment. Here, the LMulator is invoked:

  • For each object, the LMulator simulates the expected outcome based on context. For example:
    • For "orange", "peaches", "apple", and "plum", it returns True.
    • For "violin" and "pepper", it returns False.
  • These simulated values are used to update num_fruits.

After processing all objects with a mix of execution and simulation, the final computed value of answer reflects the correct count of fruits.
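To make the trace concrete, here is an illustrative stand-in in which the LMulator's judgments are hard-coded to match the simulation above (in a real system, each is_fruit call would be answered by an LM):

# Illustrative stand-in: hard-code the semantic judgments the LMulator
# would produce for this problem.
def is_fruit(obj: str) -> bool:
    return obj in {"orange", "peaches", "apple", "plum"}

objects = {"orange": 1, "violin": 1, "peaches": 2, "apple": 1,
           "pepper": 1, "plum": 3}

num_fruits = 0
for obj in objects:
    if is_fruit(obj):
        num_fruits += objects[obj]

answer = num_fruits
print(answer)  # 7 = 1 orange + 2 peaches + 1 apple + 3 plums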

Note

Find a Chain of Code demo and more information about using CoC in Google Colab.

How CoC Differs from Existing Techniques

| Method | Key Idea | Executes Code? | Handles Non-Executable Steps? | Improvement in Reasoning |
|---|---|---|---|---|
| Chain of Thought (CoT) | Uses step-by-step natural language reasoning | No | No | Limited on numeric/symbolic tasks |
| Program of Thoughts (PoT) | Writes and executes code for problem-solving | Yes | No | Effective primarily for arithmetic tasks |
| ScratchPad | Tracks intermediate steps through text-based simulation | No | Yes | Works only in text-based reasoning |
| Chain of Code (CoC) | Combines code execution with LM-based simulation | Yes | Yes | Integrates the best of both approaches |

Benefits and Applications

  • Versatility: CoC suits tasks that require both detailed computation (such as mathematics) and language understanding (such as commonsense reasoning).

  • Adaptability: The framework handles a variety of challenges, including numerical calculations, logical operations, and data processing tasks.

  • Improved accuracy: Experiments show that CoC can outperform traditional methods on benchmark datasets, especially on tasks that combine semantic and numeric reasoning.

Limitations and Future Directions

While Chain of Code offers a powerful approach, it also faces some challenges:

  1. Running both a code interpreter and an LMulator can be slower than direct text-based reasoning.

  2. Some tasks are still difficult to simulate accurately with code-based reasoning.

Future work could focus on refining how the LM simulates non-executable steps to further boost performance.

Conclusion

Chain of Code (CoC) offers a new way for language models to tackle complex reasoning tasks by merging code execution with language-based simulation. This hybrid approach allows models to solve problems more precisely by executing parts of a generated program while still reasoning through more abstract, semantic aspects.

Note

For additional information, visit the project website.

Valeriia Kuka

Valeriia Kuka, Head of Content at Learn Prompting, is passionate about making AI and ML accessible. Valeriia previously grew a 60K+ follower AI-focused social media account, earning reposts from Stanford NLP, Amazon Research, Hugging Face, and AI researchers. She has also worked with AI/ML newsletters and global communities with 100K+ members and authored clear and concise explainers and historical articles.

Footnotes

  1. Li, C., Liang, J., Zeng, A., Chen, X., Hausman, K., Sadigh, D., Levine, S., Fei-Fei, L., Xia, F., & Ichter, B. (2024). Chain of Code: Reasoning with a Language Model-Augmented Code Emulator. https://arxiv.org/abs/2312.04474