Note: Generative AI services are used as assistants in this blog post!

image generated by Bing Image Creator

This post is a summary of the paper Tree of Thoughts: Deliberate Problem Solving with Large Language Models.

## Introduction

There’s been a lot of progress with language models recently. They’re doing a great job with many tasks. But, they’re kind of stuck in a pattern where they make decisions one step at a time, just moving along from start to finish. This isn’t the best when they need to think forward, go back to adjust something, or when the first few decisions are really important.

source: https://arxiv.org/pdf/2305.10601.pdf

Now, there’s a new solution called “Tree of Thoughts” or ToT. It’s an improvement on the typical method used for giving prompts to language models, which is called the “Chain of Thought.” The ToT approach lets language models look at larger pieces of text, known as “thoughts.” This means that these models can take their time, consider different options, follow different paths of reasoning, and even evaluate their own decisions to determine what to do next. They can also plan for the future or go back to fix something when they need to make big decisions.

The tests have shown that ToT really improves how well language models can solve problems. This was especially true with three new tasks that required a lot of planning or searching: a game called 24, Creative Writing, and Mini Crosswords. For example, in the Game of 24, the old method only solved 4% of the tasks, but the new ToT method had a success rate of 74%! All the prompts used in the tests are available on their repo:

You might find it interesting to take a look!

## System 1 vs System 2

Studies focusing on the “dual process” theory propose that individuals have two distinct methods of tackling decisions. One, labeled as “System 1”, operates swiftly, automatically, and often unconsciously. The other, known as “System 2”, functions in a slow, intentional, and conscious manner. These two systems were proposed by the Nobel laureate Daniel Kahneman, and they’ve been linked to an array of mathematical models utilized in machine learning.

Take for example the field of reinforcement learning. Researchers have delved into the conditions that prompt humans and other creatures to engage in simple, association-based “model-free” learning or a more thoughtful, deliberative “model-based” planning.

The basic associative, token-level decisions made by language models (LMs) can be likened to Kahneman’s “System 1”. However, they might actually perform better if they also incorporated some characteristics of “System 2” — the more mindful, planning-oriented process. Such a “System 2” augmentation would have two key features. First, it would hold onto and consider a variety of alternatives for each decision instead of settling on just one. Second, it would actively assess its current position and foresee or backtrack to make overarching decisions. This approach could potentially enhance the overall decision-making process and efficacy of LMs.

## Background

### Input-output (IO) prompting

Input-output prompting in language models refers to the technique of formatting an input to a language model in such a way that it is directed or prompted to produce a desired output. This concept is especially important in models like GPT, which are generative and can produce a wide range of responses based on the given input.

Essentially, input-output prompting involves crafting the input text to contain cues or hints that guide the language model towards generating the kind of output you’re aiming for.

### Chain-of-thoughts (CoT) prompting

In certain situations, the relationship between the input (x) and the output (y) may not be straightforward. For example, in the case of a complex math problem where the input is the question and the output is the final numerical answer, the path to the solution involves several intermediate steps.

The idea behind CoT prompting is to introduce a sequence of intermediate steps, referred to as a “chain of thoughts” (z1, …, zn). Each “thought” (zi) in this chain is a coherent language sequence representing an intermediate step toward solving the problem. In the case of a math question, for example, each zi might represent an intermediate equation or step in the problem-solving process.

The solution process with CoT prompting unfolds sequentially. Each thought zi is sampled (or generated) based on the input x and all preceding thoughts z1, …, zi-1. The final output y is then sampled based on the input x and all thoughts z1, …, zn.

In practice, the chain of thoughts [z1,…,n] and the output y are sampled as a continuous language sequence given the input x. The decomposition of thoughts (i.e., whether each zi is a phrase, a sentence, or a paragraph) remains ambiguous.

In essence, CoT prompting aims to make the model’s problem-solving process more transparent and tractable by generating a sequence of coherent and meaningful intermediate steps or “thoughts”. This can be particularly useful in complex tasks where the steps to the solution are not immediately obvious from the input.

### Self-Consistency with Chain-of-Thought (CoT-SC) prompting

CoT-SC is an ensemble method. In this approach, k independent and identically distributed (i.i.d.) chains of thought are sampled: [z^(i)_1…n, y^(i)] for i from 1 to k, given the input x. Here, each i represents a different chain of thought.

The method then selects the most frequent output (y) among these k chains as the final output. This is done by finding the y with the maximum count (arg maxy) across all chains of thought, where count is denoted by #{i | y^(i) = y}, i.e., the number of times the output y appears across all k chains.

CoT-SC improves upon CoT because it acknowledges the fact that there are often different thought processes that can lead to the solution of the same problem (for example, different ways to prove the same theorem). By exploring multiple chains of thought (each potentially representing a different way of thinking or approaching the problem), CoT-SC provides a richer set of intermediate steps, which can lead to a more robust and reliable output.

However, within each chain, there is no local exploration of different thought steps, i.e., the method doesn’t consider alternative paths within each individual chain of thought.

Additionally, the “most frequent” heuristic for selecting the final output only applies well when the output space is limited, such as in multi-choice question-answering tasks, where there are only a limited number of possible outputs (the multiple-choice options).

## Tree of Thoughts (ToT)

This research highlights that human problem-solving often involves exploring a combinatorial problem space, which is visualized as a tree where nodes represent partial solutions and branches represent different actions or steps that can modify these solutions. The choice of which branch to take is typically guided by heuristics or rules of thumb.

The research also identifies two key limitations in current language model (LM) approaches to problem-solving:

1) Locally, these models don’t explore different continuations or branches within a single thought process, thereby missing out on alternative paths to the solution.

2) Globally, these models lack planning, lookahead, or backtracking mechanisms to evaluate different options, which are essential components of the heuristic-guided search characteristic of human problem-solving.

ToT allows language models to explore multiple reasoning paths by framing any problem as a search over a tree. Each node in this tree represents a partial solution, comprising the input and the sequence of thoughts so far, denoted by [x, z1…i].

Tree of Thoughts (ToT) is designed to overcome the identified limitations of current language model (LM) problem-solving methods. ToT implementation involves addressing four key questions:

1- Decomposition of the intermediate process into thought steps: This refers to how the process or problem should be broken down into individual steps or ‘thoughts’ that the model can follow.

2- Generation of potential thoughts from each state: From a given node or state in the tree (i.e., a particular stage in the problem-solving process), the model should be able to generate or explore various potential ‘thoughts’ or steps.

They propose two strategies for generating k candidates for the next thought step, given a tree state s = [x, z1…i] in the Tree of Thoughts (ToT) framework:

Sample Independent and Identically Distributed (i.i.d.) thoughts: This strategy involves sampling i.i.d. thoughts from a Chain-of-Thought (CoT) prompt. For each j from 1 to k, a thought z(j) is sampled based on the current state s. This strategy works well when the thought space is rich, meaning there are many possible next steps or thoughts the model can consider, and each thought is relatively large or complex (e.g., a paragraph). Sampling thoughts independently from this space encourages diversity in the generated thoughts.

Propose thoughts sequentially: In this strategy, the model proposes thoughts sequentially using a “propose prompt”. This means that a sequence of thoughts [z(1), …, z(k)] is sampled based on the current state s. This strategy works better when the thought space is more constrained, such as when each thought is relatively small or simple (e.g., a word or a line). By proposing different thoughts in the same context sequentially, this method helps avoid duplication in the generated thoughts.

3- Heuristic evaluation of states: This involves developing a method for the model to evaluate the quality or potential success of each state (i.e., each potential step or solution) in a heuristic manner, similar to the way humans use rules of thumb or intuitive judgments in problem-solving.

This step, the state evaluator, estimates the progress each state or stage in the problem-solving process makes towards the final solution. This heuristic guides the search algorithm in deciding which states to continue exploring and the order in which to do so. The passage introduces a novel approach where the language model (LM) is used to deliberately reason about states, which is proposed as a more flexible and sample-efficient alternative to the traditional programmed or learned heuristics.

Two strategies are presented for evaluating states, and these can be used either independently or together:

Value each state independently: In this strategy, each state s in the frontier (set of states currently being explored, denoted as S) is evaluated independently. A value prompt is used to reason about the state s and generate a scalar value v, which could be a numerical score (e.g., 1–10) or a classification (e.g., sure/likely/impossible) that can be heuristically turned into a value. The evaluation might involve various forms of reasoning, including lookahead simulations and commonsense judgment, which can help promote “good” states and eliminate “bad” states. The value evaluations don’t need to be perfect, but only approximate.

Vote across states: In this strategy, states are evaluated collectively rather than independently. A “good” state s* is voted out based on a vote prompt that compares different states in S. This method is useful when it’s hard to directly evaluate the progress of the problem (e.g., when assessing the coherency of a passage), making it more practical to compare different partial solutions and vote for the most promising one. This approach is similar to a step-wise self-consistency strategy, where the decision on which state to explore next is cast as a multi-choice question, and the LM samples are used to vote for it.

For both strategies, the LM can be prompted multiple times to aggregate the value or vote results, trading off time, resources, and costs for more reliable and robust heuristics.

4- Choice of search algorithm: The method must specify the algorithm that the model will use to navigate the tree and search for solutions. One can plug and play different search algorithms depending on the tree structure.

Depending on the structure of the problem, different search algorithms can be used. Here are two:

(a) Breadth-first search (BFS): This algorithm maintains a set of the ‘b’ most promising states at each step. It is used in cases like the Game of 24 and Creative Writing, where the tree depth is limited (T ≤ 3) and the initial thought steps can be evaluated and pruned to a small set (b ≤ 5). BFS systematically explores all the nodes at the present depth before moving on to nodes at the next depth level.

(b) Depth-first search (DFS): This algorithm explores the most promising state first and continues along this path until the final output is reached (t > T), or the state evaluator determines it’s impossible to solve the problem from the current state ‘s’ (if the value of state ‘s’ is less than or equal to a certain threshold). If the latter happens, the subtree from ‘s’ is pruned to trade exploration for exploitation. If either of these conditions are met, DFS backtracks to the parent state of ‘s’ to continue exploration. DFS explores as far as possible along each branch before backtracking.

By addressing these questions, the ToT approach aims to provide a framework that allows language models to explore multiple reasoning paths and evaluate different options in a more human-like manner.

## Experiments

The authors propose three tasks that are challenging when using standard Input-Output (IO) prompting or Chain-of-Thought (CoT) prompting with the state-of-the-art language model, GPT-4. They demonstrate how a deliberate search in Trees of Thoughts (ToT) produces superior results and also illuminates novel and promising methods for using language models to solve problems requiring search or planning. We will check one of their experiments on a game called “Game of 24”.

The authors present a task setup where they extract data from 4nums.com, a platform with 1,362 games that require mathematical problem-solving. They use a challenging subset of these games for testing. The goal of each task is to form a valid equation equalling 24, using the given input numbers exactly once. The performance is evaluated on the success rate across 100 games. Several baselines are set using standard Input-Output (IO) and Chain-of-Thought (CoT) prompts, the latter also involves self-consistency checks and iterative refinements based on ground truth feedback about equation correctness. To utilize the Tree of Thoughts (ToT) model, they decompose the task into three steps, each involving an intermediate equation. They apply a Breadth-First Search (BFS) within the ToT model and deliberately evaluate each candidate thought based on its feasibility to reach the final solution. Here are the results:

The results indicate that standard prompting methods such as Input-Output (IO), Chain of Thought (CoT), and Self-Consistency with Chain of Thought (CoT-SC) underperform in the task, only achieving success rates of 7.3%, 4.0%, and 9.0% respectively. Conversely, the Tree of Thoughts (ToT) model with a breadth of 1 reaches a 45% success rate, which increases to 74% with a breadth of 5. In a comparison between IO/CoT and ToT where best samples are selected, CoT performs better than IO, reaching a success rate of 49% with 100 samples, but still falls short of the success rate of ToT (b > 1). Error analysis shows that approximately 60% of CoT samples fail the task after generating the first step, revealing the limitations of direct left-to-right decoding.

You can refer to the main paper for other experiments.

## Comments