# ChatBench

**Repository Path**: mirrors_microsoft/ChatBench

## Basic Information

- **Project Name**: ChatBench
- **Description**: ChatBench Interactive Benchmark fine-tune pipeline
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-08-24
- **Last Updated**: 2025-08-26

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# chatbench

ChatBench simulator fine-tune project.

## ChatBench/

```bash
├── data/                              # released separately
│   ├── finetuning_experiments.ipynb   # raw ChatBench JSONL (train + val + test)
│   ├── prepare_splits.py              # data-prep script (see next section)
│   ├── split_0/
│   │   ├── train.jsonl
│   │   └── test.jsonl
│   ├── split_1/
│   │   ├── train.jsonl
│   │   └── test.jsonl
│   ├── ... up to ...
│   └── split_9/
│       ├── train.jsonl
│       └── test.jsonl
├── src/
│   ├── finetune.py                    # training script (model-specific examples provided as finetune_<model>.py)
│   ├── eval_chatbench.py              # paper ChatBench metrics
│   └── eval_ppl.py                    # perplexity evaluation script (model-specific examples provided as eval_ppl_<model>.py)
├── configs/
│   └── training_args.json             # optional HF Trainer config
├── results/
│   ├── logs/                          # where logs go
│   ├── models/                        # where checkpoints & model weights go
│   └── ppl/                           # tensorboard or Accelerate logs
├── .env                               # where the HF token goes (HF_TOKEN=...)
└── README.md
```

## Getting Started:

#### System packages and installation:

```bash
sudo apt update
sudo apt install -y git wget build-essential
sudo apt install -y python3 python3-venv python3-pip
python3 -m venv ~/venv
source ~/venv/bin/activate
pip install --upgrade pip setuptools
pip install -r requirements.txt
```

#### Verify GPU availability:

```bash
python - <<'EOF'
import torch
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
EOF
```

#### Data format:

We follow the same format that Azure OpenAI expects for chat fine-tuning:

```json
{"prompt": "<system + conversation so far>\nuser:", "completion": " <next user turn>"}
```

## Run fine-tuning (e.g. DistilGPT-2)

1. Choose a split

   ```bash
   export SPLIT_ID=split_1
   ```

2. Launch multi-GPU training via accelerate

   First configure:

   ```bash
   accelerate config
   ```

   - Choose “multi-GPU” (no DeepSpeed if you only want a simple setup).
   - Pick a default location for state.
   - Set fp16=True if your GPU supports mixed precision (e.g. RTX A6000).

   Then launch:

   ```bash
   accelerate launch src/finetune.py
   ```

   (Use our `src/finetune_<model>.py` version to replicate our released model weights.)
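The model-specific scripts (`src/finetune_<model>.py`) are the source of truth for the released weights and are not reproduced in this README. As orientation only, here is a minimal, hypothetical sketch of what a LoRA fine-tuning script of this shape can look like with Hugging Face `transformers`, `peft`, and `datasets`. The prompt template follows the Prompt ↦ Completion convention described in the Training Pipeline Overview below; the model name, LoRA hyperparameters, paths, and training arguments are placeholder assumptions, not the repository's actual settings.

```python
# Hypothetical sketch only: the repository's finetune.py / finetune_<model>.py are the
# source of truth. Model name, LoRA hyperparameters, paths, and the prompt template
# below are illustrative assumptions.
import json
import os

import torch
from datasets import Dataset
from peft import LoraConfig, TaskType, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, default_data_collator)

SPLIT_ID = os.environ.get("SPLIT_ID", "split_0")
MODEL_NAME = "distilgpt2"   # e.g. mistralai/Mistral-7B-v0.1, meta-llama/Meta-Llama-3-8B
MAX_LEN = 512

def to_pair(example):
    """Map a {"messages": [...]} record to one (prompt, completion) string pair."""
    msgs = {m["role"]: m["content"] for m in example["messages"]}  # one message per role
    prompt = f"[SYSTEM] {msgs['system']}\n\n{msgs['user']}\n\n[USER] "
    completion = msgs["assistant"] + " [END]"
    return prompt, completion

def tokenize(example, tokenizer):
    prompt, completion = to_pair(example)
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    completion_ids = tokenizer(completion, add_special_tokens=False)["input_ids"] + [tokenizer.eos_token_id]
    input_ids = (prompt_ids + completion_ids)[:MAX_LEN]
    # Loss only on the completion: prompt (and padding) positions are masked with -100.
    labels = ([-100] * len(prompt_ids) + completion_ids)[:MAX_LEN]
    pad = MAX_LEN - len(input_ids)
    return {
        "input_ids": input_ids + [tokenizer.pad_token_id] * pad,
        "attention_mask": [1] * len(input_ids) + [0] * pad,
        "labels": labels + [-100] * pad,
    }

def load_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Wrap the base model in a LoRA adapter; r / lora_alpha / target_modules are what end up
# in adapter_config.json (set target_modules explicitly for models PEFT cannot infer).
model = get_peft_model(model, LoraConfig(task_type=TaskType.CAUSAL_LM,
                                         r=16, lora_alpha=32, lora_dropout=0.05))

train_ds = Dataset.from_list(
    [tokenize(ex, tokenizer) for ex in load_jsonl(f"data/{SPLIT_ID}/train.jsonl")]
)

args = TrainingArguments(
    output_dir=f"results/models/{MODEL_NAME}_{SPLIT_ID}",
    per_device_train_batch_size=4,
    num_train_epochs=3,
    fp16=torch.cuda.is_available(),
    save_steps=500,
    logging_dir=f"results/logs/{MODEL_NAME}_{SPLIT_ID}",
)

Trainer(model=model, args=args, train_dataset=train_ds,
        data_collator=default_data_collator).train()
```

Under `accelerate launch`, the HF Trainer picks up the distributed environment automatically, which is why the README only configures `accelerate` once and then launches the script unchanged.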
## Evaluation

#### Validation losses:

Look at the end of each training run’s stdout: the Trainer prints `{'eval_loss': …}` after each epoch. You can also open TensorBoard pointed at the run’s log directory:

```bash
tensorboard --logdir logs/
```

#### Perplexity on Held-Out Test Set:

```bash
export SPLIT_ID=split_1
python src/eval_ppl.py
```

(Use our `src/eval_ppl_<model>.py` version to replicate our released model evaluation.)

## Automate N-Fold Runs & Evaluation:

You can use `run_all_train_ppl.sh` and the other model-specific shell scripts to run the full train + perplexity pipeline for each model.

#### Train & PPL:

Run `run_all_train_ppl.sh` to fine-tune and record validation perplexities.

```sh
chmod +x run_all_train_ppl.sh
./run_all_train_ppl_distilgpt2.sh distilgpt2

# More models:
# Mistral 7B
./run_all_train_ppl_mistral.sh mistralai/Mistral-7B-v0.1
# Llama 3 8B
./run_all_train_ppl_llama3.sh meta-llama/Meta-Llama-3-8B
```

#### Monitor the training:

In a new terminal, monitor the GPU load:

```sh
watch -n 1 nvidia-smi
```

If you don't want the header repeated, you can monitor the GPU load with:

```sh
nvidia-smi --query-gpu=index,utilization.gpu,utilization.memory,memory.total,memory.used \
  --format=csv --loop-ms=1000
```

#### ChatBench metrics:

_Note: Please use this [notebook](https://github.com/msrnyc/chatbench/blob/main/src/test_interactive_benchmark.ipynb) as an example of how to run end-to-end benchmark testing for the ChatBench paper._

Run `run_all_chatbench.sh` to get BLEU, ROUGE, accuracy correlation, and MAE across folds.

```sh
chmod +x run_all_chatbench.sh
./run_all_chatbench.sh distilgpt2

# More models:
# Mistral 7B
./run_all_chatbench.sh mistralai/Mistral-7B-v0.1
# Llama 3 8B
./run_all_chatbench.sh meta-llama/Meta-Llama-3-8B
```

#### Quick Sanity Check:

Run `run_quick_sanity.sh` to review the results.

```sh
chmod +x run_quick_sanity.sh
./run_quick_sanity.sh distilgpt2
# or
./run_quick_sanity.sh meta-llama/Meta-Llama-3-8B
```

## Output:

- `adapter_config.json`: your LoRA hyperparameters (r, α, target_modules…)
- `adapter_model.safetensors`: the *only* file you truly need to re-load your LoRA adapter
- `checkpoint-1500/`, `checkpoint-1550/`, …: checkpoint directories saved every `save_steps`; each contains a copy of `adapter_*.safetensors` plus the trainer state
- `config.json`: the base model’s transformer config (e.g. hidden_size etc.)
- `training_args.bin`: a binary dump of all your TrainingArguments
- `README.md`: auto-generated pointer on how to re-load with `PeftModel`
- `special_tokens_map.json`: maps your added special tokens to IDs
- `tokenizer_config.json`: tokenizer settings (truncation side, normalization…)
- `tokenizer.json` (or `tokenizer.model`): the actual tokenizer files (vocab + merges/spm)

## Training Pipeline Overview

#### What’s happening in the pipeline:

1. **Data format**

   Each line of your `train.jsonl` and `test.jsonl` looks like:

   ```json
   {
     "messages": [
       { "role": "system", "content": "…instructions…" },
       { "role": "user", "content": "How many bacteria…?" },
       { "role": "assistant", "content": "Answer: C" }
     ]
   }
   ```

   * **system**: the task instructions (e.g. “You are a human user interacting with an AI…”)
   * **user**: the human’s question or conversation context
   * **assistant**: the “next user turn” we actually want the model to learn to produce

2. **Prompt ↦ Completion**

   When we fine-tune, each example becomes a `(prompt, completion)` pair (the fine-tuning sketch above shows one way to build it):

   ```text
   prompt     = "[SYSTEM] …instructions…\n\nHow many bacteria…?\n\n[USER] "
   completion = "Answer: C [END]"
   ```

   In other words, we feed the model both the system instructions and the user’s prior context (up through `[USER] `), and train it to generate the assistant’s content (the “next user turn”) up through `[END]`.
3. **Perplexity evaluation**

   At evaluation time (via `eval_ppl.py`), for each user turn in the test set we:

   * Reconstruct the same `(prompt, completion)` pair
   * Mask out the prompt tokens (they contribute no loss)
   * Compute loss only on the tokens of the ground-truth completion
   * Aggregate over all tokens to get a corpus-level PPL

   That PPL directly measures “how well does my fine-tuned model assign probability to the true assistant response, given the exact same prompt?” A lower PPL means the fine-tuned model is closer to (more confident in) the human-like reply it was trained on. Concretely, the evaluation (see the masked-loss sketch after this overview):

   - Scores exactly the same pairs you trained on, i.e. every assistant turn, not user turns.
   - Reuses the fine-tune prompt format (`[SYSTEM] …\n\n…\n\n[USER] `) and completion (`assistant_content + " [END]"`).
   - Masks out prompt tokens exactly, computes loss only on the true assistant tokens, then aggregates into a corpus-level PPL.

In summary:

> **We treat “system + user” as the prompt, and the original assistant message as the completion. We fine-tune the LM to generate that completion, and then we measure perplexity on exactly that same split of `(prompt, completion)` pairs to see how well the model learned to imitate the human-AI conversation.**
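The repository's `src/eval_ppl.py` (and the model-specific `eval_ppl_<model>.py` variants) implement this evaluation and are not reproduced here. As orientation only, below is a minimal, hypothetical sketch of the masked-loss, corpus-level perplexity described in step 3, assuming a Hugging Face causal LM with a LoRA adapter saved under `results/models/`. The base model name, adapter path, and prompt template are placeholder assumptions, not the repository's actual settings.

```python
# Hypothetical sketch of the masked-loss, corpus-level perplexity from step 3.
# Not the repository's eval_ppl.py: base model, adapter path, and prompt template
# are placeholder assumptions.
import json
import math
import os

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

SPLIT_ID = os.environ.get("SPLIT_ID", "split_0")
BASE_MODEL = "distilgpt2"                                # base model the adapter was trained on
ADAPTER_DIR = f"results/models/{BASE_MODEL}_{SPLIT_ID}"  # illustrative output directory
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(ADAPTER_DIR)
model = PeftModel.from_pretrained(
    AutoModelForCausalLM.from_pretrained(BASE_MODEL), ADAPTER_DIR
).to(DEVICE).eval()

total_nll, total_tokens = 0.0, 0
with open(f"data/{SPLIT_ID}/test.jsonl") as f, torch.no_grad():
    for line in f:
        msgs = {m["role"]: m["content"] for m in json.loads(line)["messages"]}
        prompt = f"[SYSTEM] {msgs['system']}\n\n{msgs['user']}\n\n[USER] "
        completion = msgs["assistant"] + " [END]"

        prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
        completion_ids = tokenizer(completion, add_special_tokens=False)["input_ids"]
        input_ids = torch.tensor([prompt_ids + completion_ids], device=DEVICE)

        # -100 masks the prompt, so the loss covers only the ground-truth completion tokens.
        labels = torch.tensor([[-100] * len(prompt_ids) + completion_ids], device=DEVICE)

        loss = model(input_ids=input_ids, labels=labels).loss  # mean NLL over unmasked tokens
        total_nll += loss.item() * len(completion_ids)
        total_tokens += len(completion_ids)

print(f"{SPLIT_ID} corpus-level PPL: {math.exp(total_nll / total_tokens):.2f}")
```

Each per-example loss is a mean over that example's completion tokens, so it is re-weighted by the completion length before exponentiating, which matches the "aggregate over all tokens" step above.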
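The Output section above notes that `adapter_model.safetensors` is the only file you need to re-load the LoRA adapter, and that the auto-generated `README.md` points at `PeftModel`. A minimal, hypothetical re-loading and generation snippet (the base model name and adapter path are placeholders for whichever run you trained) could look like this:

```python
# Hypothetical sketch: re-attach a saved LoRA adapter to its base model for inference.
# "distilgpt2" and the adapter path are placeholders, not the released configuration.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

ADAPTER_DIR = "results/models/distilgpt2_split_1"

base = AutoModelForCausalLM.from_pretrained("distilgpt2")
model = PeftModel.from_pretrained(base, ADAPTER_DIR).eval()
tokenizer = AutoTokenizer.from_pretrained(ADAPTER_DIR)

# Prompt in the same [SYSTEM] ... [USER] format the model was fine-tuned on.
prompt = "[SYSTEM] You are a human user interacting with an AI…\n\nHow many bacteria…?\n\n[USER] "
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```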