# visualization-of-thought

**Repository Path**: mirrors_microsoft/visualization-of-thought

## Basic Information

- **Project Name**: visualization-of-thought
- **Description**: [NeurIPS 2024] Repo for the "Visualization-of-Thought" dataset, construction code, and evaluation.
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2024-10-12
- **Last Updated**: 2025-09-06

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# Visualization-of-Thought (VoT)

## Overview

Visualization-of-Thought (VoT) prompting is designed to enhance the spatial reasoning abilities of large language models (LLMs) by visualizing their reasoning traces, thus guiding subsequent reasoning steps. This approach leverages the concept of the "mind's eye" in human cognition, which refers to the ability to visualize and manipulate mental images. By emulating this cognitive process, VoT has been applied to tasks such as natural language navigation, visual navigation, and visual tiling in 2D grid worlds, significantly improving the performance of LLMs in these areas.

![image](https://github.com/user-attachments/assets/cc34d9b5-3f34-4e2b-87fc-6d1d7d81bcb1)

## Getting Started

### Prerequisites

Before you begin, ensure you have met the following requirements:

- You have installed Python 3.x.
- You have installed Node.js (needed for data augmentation).

### Installation

Clone the repository and install the required dependencies:

```bash
git clone https://github.com/microsoft/Visualization-of-Thought.git
cd Visualization-of-Thought
pip install -r src/requirements.txt
```

### Dataset Preparation

You need to download the dataset and then generate prompts for the different settings. The dataset includes tasks designed to evaluate the spatial reasoning capabilities of LLMs:

1. **Natural Language Navigation**:
   - A square map defined by a sequence of random-walk instructions and associated objects.
   - Task: identify the associated object at a specified location determined by navigation instructions.
   - Data generation is implemented in the [SpatialEvalLLM](https://github.com/runopti/SpatialEvalLLM) repository. Use the following command to generate the data:

   ```bash
   python square.py --seed 8 --size 4 --steps 8 --maptype square --label_path ./labels/imagenetsimple.json --n_sample 200 --out_dir results_map_global --special_order snake_order
   ```

2. **Visual Tasks**: Download the dataset of visual tasks via this [link](https://github.com/microsoft/visualization-of-thought/raw/main/vot-dataset-visual-tasks.zip) and place it under the root folder of this repo.

   ```bash
   mkdir -p dataset
   unzip VoT-Dataset-Visual-Tasks.zip -d dataset
   cd src
   # fill in prompt templates for the different settings
   sh patch-prompt.sh ../dataset
   ```

   Please note that the prompts of the different settings (CoT/VoT/GPT-4V CoT) for each instance have been removed from this released version. The `patch-prompt.sh` script is provided to automatically fill in the prompt templates for all experiment settings across tasks. Prompt templates are stored in the `prompts` folder under each visual task. For example:

   - visual-navigation/route-planning/prompts/{setting}.txt
   - visual-navigation/next-step-prediction/prompts/{setting}.txt
   - visual-tiling/prompts/{setting}.txt

   To create your own prompt template for a new setting, simply add the template file under the `prompts` folder of the relevant task. An example of a template can be found in the [VoT template](https://github.com/microsoft/visualization-of-thought/blob/main/src/visual-tiling/prompts/0-shot-vot).

### Evaluation

Sample code is provided for each task to run experiments. You need to implement the `run_llm_client` function for each visual task, which writes the model response to the specified output path; a minimal sketch is shown below.
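For illustration, here is one way such a client could look. The exact signature expected by each task's `sample.py` is not documented in this README, so the `(prompt, output_path)` arguments, the `gpt-4` model name, and the use of the `openai` Python package are assumptions; adapt them to the interface in the sample code and to the endpoint you actually call.

```python
# Hypothetical sketch of an LLM client, assuming a (prompt, output_path) interface;
# the real signature expected by each task's sample.py may differ.
from pathlib import Path

from openai import OpenAI  # assumption: any chat-completion client works here

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def run_llm_client(prompt: str, output_path: str, model: str = "gpt-4") -> str:
    """Send the prompt to the model and write the raw response to output_path."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    text = response.choices[0].message.content or ""
    # persist the raw response so the evaluation script can parse it later
    Path(output_path).parent.mkdir(parents=True, exist_ok=True)
    Path(output_path).write_text(text, encoding="utf-8")
    return text
```

For the multimodal (GPT-4V) setting, the `desc_multimodal` field described in the schema below already follows the Azure OpenAI chat format, so it can likely be passed as the `messages` argument with little modification.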
Then run the following commands for evaluation:

```bash
python visual-navigation/route-planning/sample.py --jsonl-path ../dataset/visual-navigation/route-planning.jsonl --output-folder {output-folder} --setting {setting}
python visual-navigation/next-step-prediction/sample.py --jsonl-path ../dataset/visual-navigation/next-step-prediction.jsonl --output-folder {output-folder} --setting {setting}
python visual-tiling/sample.py --jsonl-path ../dataset/visual-tiling/visual-tiling.jsonl --output-folder {output-folder} --setting {setting}
```

The performance of the specified setting is printed in the terminal, along with the path of a log file that records all failing cases for debugging purposes.

Note that LLM-generated responses are parsed with regex patterns, and the default patterns implemented in the code target GPT-family models. You may need to **specify regex patterns for other models** and pay attention to "failing to parse" cases in the log file. To use the regex patterns we implemented for [LLaMA](https://github.com/microsoft/visualization-of-thought/blob/main/src/visual-navigation/next-step-prediction/llama-regex-patterns.txt) or your customized patterns, specify the `--regex-path` parameter when running the evaluation script.

## Dataset Schema

### Main Schema

| Field Name | Description | Type | Example |
|---|---|---|---|
| `desc` | Text input for LLMs. | String | "" |
| `desc_multimodal` | Message array input for MLLMs, containing text and images, following the [Azure OpenAI format](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/gpt-with-vision?tabs=rest%2Csystem-assigned%2Cresource#call-the-chat-completion-apis). | Array of Messages | `[{}]` |
| `answer` | A single string, or a list of strings for route-planning tasks. | String or String Array | "A" or `["left", "down"]` |
| `puzzle_path` | Folder of each prompt instance. | String | "puzzles/level-2/103/Tetromino T" |
| `config_path` | Folder of images and original spatial configurations. | String | "configurations/level-2/103" |
| `difficulty` | Difficulty level of the question or puzzle. | Integer | 2 |
| `instance_id` | Relative path to the instance identifier within the puzzle folder. | String | "103/Tetromino T" |

### Schema for Natural Language Navigation Task

| Field Name | Description | Type | Example |
|---|---|---|---|
| `question` | The navigation question. | String | "" |
| `answer` | The name of the object to be found. | String | "Sofa" |
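To get a feel for these fields, the snippet below reads one of the task files and prints the main schema fields of the first instance. It assumes the usual JSON Lines layout (one JSON object per line), as the `.jsonl` extension suggests; the path is the route-planning file used in the evaluation commands above.

```python
# Sketch: iterate over a VoT task file and inspect the schema fields listed above.
import json

jsonl_path = "../dataset/visual-navigation/route-planning.jsonl"  # any visual-task file works

with open(jsonl_path, encoding="utf-8") as f:
    for line in f:
        instance = json.loads(line)
        print(instance["instance_id"], "difficulty:", instance["difficulty"])
        print("text prompt:", instance["desc"][:200])          # input for text-only LLMs
        print("multimodal messages:", len(instance["desc_multimodal"]))
        print("gold answer:", instance["answer"])               # string, or list for route planning
        break  # show only the first instance
```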
## Code for data construction/augmentation

The dataset can be extended by specifying the difficulty. Please refer to the scripts that generate the dataset of visual tasks.

1. **Visual Navigation**:

   ```bash
   # make sure to switch to the src folder
   mkdir -p ../dataset
   sh visual-navigation/gen-data.sh ../dataset/visual-navigation
   ```

   Run the following commands to extend the dataset with a new difficulty level K:

   ```bash
   python visual-navigation/gen_all_paths.py --turn {K} --dest-folder ../dataset/visual-navigation/configurations/level-{K}
   python visual-navigation/route-planning/gen_puzzle.py --config-folder ../dataset/visual-navigation/configurations/level-{K} --puzzle-folder ../dataset/visual-navigation/route-planning/level-{K} --output-jsonl ../dataset/visual-navigation/route-planning.jsonl --difficulty {K}
   python visual-navigation/next-step-prediction/gen_puzzle.py --config-folder ../dataset/visual-navigation/configurations/level-{K} --puzzle-folder ../dataset/visual-navigation/next-step-prediction/level-{K} --output-jsonl ../dataset/visual-navigation/next-step-prediction.jsonl --difficulty {K}
   ```

2. **Visual Tiling**:

   ```bash
   # make sure to switch to the src folder
   mkdir -p ../dataset
   # uncomment the `npm install` line to install the node module dependencies
   sh visual-tiling/gen-data.sh ../dataset/visual-tiling
   ```

   Run the following commands to extend the dataset with a new difficulty level K, rectangle size, and set of polyomino pieces. For example, a 5 × 4 rectangle can be filled by "TTLII" (2 T pieces, 1 L piece, and 2 I pieces).

   ```bash
   cd visual-tiling/gen-solution
   node run.js --width=4 --height=5 --masked=K --dest=../dataset/visual-tiling/configurations/level-{K} --pieces='TTLII'
   cd ..
   python gen_puzzle.py --config-folder ../dataset/visual-tiling/configurations/level-{K} --puzzle-folder ../dataset/visual-tiling/puzzles/level-{K} --output-jsonl ../dataset/visual-tiling/visual-tiling.jsonl --difficulty {K}
   ```

## License

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.

## Contact

If you have any questions or suggestions, feel free to open an issue in the repository or contact the authors.

## Citation

If you use this dataset, please cite us:

```bibtex
@misc{wu2024mindseyellmsvisualizationofthought,
      title={Mind's Eye of LLMs: Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models},
      author={Wenshan Wu and Shaoguang Mao and Yadong Zhang and Yan Xia and Li Dong and Lei Cui and Furu Wei},
      year={2024},
      eprint={2404.03622},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2404.03622},
}
```