# ClusterLLM

**Repository Path**: bobosanguo/ClusterLLM

## Basic Information

- **Project Name**: ClusterLLM
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-10-11
- **Last Updated**: 2025-10-11

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# CLUSTERLLM: Large Language Models as a Guide for Text Clustering

![](image/overall_v6.jpg)

This is the official PyTorch implementation of paper [CLUSTERLLM: Large Language Models as a Guide for Text Clustering (EMNLP2023)](https://arxiv.org/abs/2305.14871).

## Install
```bash
pip install -r requirements.txt
```

## Datasets

Download zip file [here](https://drive.google.com/file/d/1TBq3vkfm3OZLi90GVH-PVNKi3fk1Vba7/view?usp=sharing) and unzip.

## Steps to run perspective experiments

### 1. Original embeddings
```bash
cd perspective/2_finetune
bash scripts/get_embedding.sh
```
The embeddings are produced in each folder of `datasets`. It will also save the clustering measures. Details instructions see bash script. E5 embeddings are produced with `scripts/get_embedding_e5.sh`.
### 2. Sample triplets
```bash
cd perspective/1_predict_triplet
bash scripts/triplet_sampling.sh
```
Sampled triplets will be produced in `perspective/1_predict_triplet/sampled_triplet_results`. Details instructions see bash script.
### 3. Predict triplets
First replace the openai keys in `perspective/1_predict_triplet/scripts/predict_triplet.sh`.
```bash
cd perspective/1_predict_triplet
bash scripts/predict_triplet.sh
```
Predicted triplets will be in `perspective/1_predict_triplet/predicted_triplet_results`. Details instructions see bash script.
### 4. Convert triplets
This step only converts the format.
```bash
cd perspective/2_finetune
bash scripts/convert_triplet.sh
bash scripts/convert_triplet_self.sh
```
Converted triplets will be in `perspective/2_finetune/converted_triplet_results`. Details instructions see bash script.
### 5. Finetune
```bash
cd perspective/2_finetune
bash scripts/finetune.sh
```
Finetuned model will be in `perspective/2_finetune/checkpoints`. Details instructions see bash script.
### 6. Finetune
```bash
cd perspective/2_finetune
bash scripts/get_embedding.sh
```
This time, switch to checkpoints. Clustering measures will be saved into checkpoint folder.

## Steps to run granularity experiments

### 1. Sample pairs
```bash
cd granularity
bash scripts/sample_pairs.sh
```
Sampled pairs will be saved in `sampled_pair_results`.
### [optional] Sample pairs for prompt
4 pairs will be sampled as in-context examples.
```bash
cd granularity
bash scripts/sample_pairs_for_prompt.sh
```
### 2. Predict pairs
First replace the openai keys in `granularity/scripts/predict_pairs.sh`.
```bash
cd granularity
bash scripts/predict_pairs.sh
```
Predicted pairs will be in `granularity/predicted_pair_results`. Also specify `prompt_file` to sampled the prompt.
### 3. Predict cluster num
```bash
cd granularity
bash scripts/predict_num_clusters.sh
```
Details instructions see bash script.

## Citation
```
@misc{zhang2023clusterllm,
      title={ClusterLLM: Large Language Models as a Guide for Text Clustering}, 
      author={Yuwei Zhang and Zihan Wang and Jingbo Shang},
      year={2023},
      eprint={2305.14871},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

## Thanks
Some of the code was adapted from:
* https://github.com/xlang-ai/instructor-embedding

## Contact
Yuwei Zhang yuz163@ucsd.edu