# calibration
**Repository Path**: mcgrady164/calibration
## Basic Information
- **Project Name**: calibration
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2021-08-04
- **Last Updated**: 2021-08-04
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# Calibration of Pre-trained Transformers
Code and datasets for our EMNLP 2020 paper [Calibration of Pre-trained Transformers](https://arxiv.org/abs/2003.07892). If you found this project helpful, please consider citing our paper:
```bibtex
@inproceedings{desai-durrett-2020-calibration,
  author={Desai, Shrey and Durrett, Greg},
  title={{Calibration of Pre-trained Transformers}},
  year={2020},
  booktitle={Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)},
}
```
## Overview
Posterior calibration is a measure of how aligned a model's posterior probabilities are with empirical likelihoods. For example, a perfectly calibrated model that outputs 0.8 probability on 100 samples should get 80% of the samples correct. In this work, we analyze the calibration of two pre-trained Transformers (BERT and RoBERTa) on three tasks: natural language inference, paraphrase detection, and commonsense reasoning.
For natural language inference, we use [Stanford Natural Language Inference](https://nlp.stanford.edu/projects/snli/) (SNLI) and [Multi-Genre Natural Language Inference](https://www.nyu.edu/projects/bowman/multinli/) (MNLI). For paraphrase detection, we use [Quora Question Pairs](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs) (QQP) and [TwitterPPDB](https://languagenet.github.io/) (TPPDB). And, for commonsense reasoning, we use [Situations with Adversarial Generations](https://rowanzellers.com/swag/) (SWAG) and [HellaSWAG](https://rowanzellers.com/hellaswag/) (HSWAG).
To measure calibration error, we chiefly use expected calibration error (ECE), where ECE = 0 indicates perfect calibration. When used in-domain, pre-trained models ([BERT](https://arxiv.org/abs/1810.04805), [RoBERTa](https://arxiv.org/abs/1907.11692)) are generally **much better calibrated** than non-pre-trained models ([DA](https://arxiv.org/abs/1606.01933), [ESIM](https://arxiv.org/abs/1609.06038)). For example, on SWAG:
| Model | Accuracy | ECE |
|---------|:--------:|:----:|
| DA | 46.80 | 5.98 |
| ESIM | 52.09 | 7.01 |
| BERT | 79.40 | 2.49 |
| RoBERTa | 82.45 | 1.76 |
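For reference, ECE bins predictions by confidence and averages the per-bin gap between accuracy and mean confidence, weighted by bin size. Below is a minimal NumPy sketch of this computation; the function name, bin count, and array layout are illustrative and not taken from this repository's code.
```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: bin predictions by confidence, then average the per-bin gap
    between accuracy and mean confidence, weighted by bin size."""
    confidences = probs.max(axis=1)        # top predicted probability per example
    predictions = probs.argmax(axis=1)     # predicted class per example
    accuracies = (predictions == labels).astype(float)

    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(accuracies[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return 100.0 * ece  # reported as a percentage, matching the tables above
```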
To bring down calibration error, we experiment with two strategies. First, **temperature scaling** (TS; dividing the logits by a scalar temperature T learned on held-out data) almost always brings ECE below 1. Below, we show in-domain ECE with and without temperature scaling:
| Model | SNLI | QQP | SWAG |
|---------------|:----:|:----:|:----:|
| RoBERTa | 1.93 | 2.33 | 1.76 |
| RoBERTa (+TS) | 0.84 | 0.88 | 0.76 |
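For intuition, temperature scaling learns a single scalar T on held-out (dev) logits by minimizing negative log likelihood while the fine-tuned model stays frozen; dividing by T does not change the argmax, so accuracy is unaffected. The PyTorch sketch below illustrates the idea and is not the repository's `calibrate.py` implementation.
```python
import torch
import torch.nn.functional as F

def fit_temperature(dev_logits, dev_labels, lr=0.01, steps=500):
    """Learn a single scalar T on held-out logits by minimizing the
    negative log likelihood of softmax(logits / T)."""
    log_t = torch.zeros(1, requires_grad=True)   # optimize log T so that T stays positive
    optimizer = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = F.cross_entropy(dev_logits / log_t.exp(), dev_labels)
        loss.backward()
        optimizer.step()
    return log_t.exp().item()

# At test time, calibrated probabilities are softmax(test_logits / T).
```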
Second, deliberately inducing uncertainty via **label smoothing** (LS) helps calibrate posteriors out-of-domain. MLE training encourages models to be over-confident, and this over-confidence is typically unwarranted out-of-domain, where models should be more uncertain. We show out-of-domain ECE with and without label smoothing:
| Model | MNLI | TPPDB | HSWAG |
|--------------|:----:|:-----:|:-----:|
| RoBERTa-MLE | 3.62 | 9.55 | 11.93 |
| RoBERTa-LS | 4.50 | 8.91 | 2.14 |
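Label smoothing replaces the one-hot training target with a softened distribution: the gold class receives 1 − α of the probability mass and the remaining α is spread uniformly over the other classes. The PyTorch sketch below is a minimal illustration of such a loss; the default α = 0.1 is an assumed example value, not a setting from the paper.
```python
import torch
import torch.nn.functional as F

def label_smoothing_loss(logits, labels, alpha=0.1):
    """Cross entropy against a smoothed target: the gold class receives
    probability 1 - alpha, and alpha is split evenly among the rest."""
    n_classes = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    targets = torch.full_like(log_probs, alpha / (n_classes - 1))
    targets.scatter_(-1, labels.unsqueeze(-1), 1.0 - alpha)
    return -(targets * log_probs).sum(dim=-1).mean()
```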
Please see [our paper](https://arxiv.org/abs/2003.07892) for the complete set of experiments and results!
## Instructions
### Requirements
This repository has the following requirements:
- `numpy==1.18.1`
- `scikit-learn==0.22.1`
- `torch==1.2.0`
- `tqdm==4.42.1`
- `transformers==2.4.1`
Use the following instructions to set up the dependencies:
```bash
$ virtualenv -p python3.6 venv
$ source venv/bin/activate
$ pip install -r requirements.txt
```
### Obtaining Datasets
Because many of these tasks are either included in the GLUE benchmark or commonly used to evaluate pre-trained models, the test sets are blind. Therefore, we split the development set in half to obtain a non-blind, held-out test set. The dataset splits are shown below; additionally, you may [download the exact train/dev/test datasets](https://drive.google.com/file/d/1ro3Q7019AtGYSG76KeZQSq35lBi7lU3v/view?usp=sharing) used in our experiments. Place the archive in the repository root and unpack it with `tar -zxf calibration_data.tar.gz`; the training script expects the data under `calibration_data/`.
| Dataset | Train | Dev | Test |
|---------|:-------:|:------:|:------:|
| SNLI | 549,368 | 4,922 | 4,923 |
| MNLI | 392,702 | 4,908 | 4,907 |
| QQP | 363,871 | 20,216 | 20,217 |
| TPPDB | 46,667 | 5,060 | 5,060 |
| SWAG | 73,547 | 10,004 | 10,004 |
| HSWAG | 39,905 | 5,021 | 5,021 |
### Fine-tuning Models
Pre-trained models can be fine-tuned on any dataset (SNLI, MNLI, QQP, TwitterPPDB, SWAG, HellaSWAG). Below, we show an example script for fine-tuning BERT on SNLI. To use maximum likelihood estimation (MLE), set `LABEL_SMOOTHING=-1`; to use label smoothing (LS), set `LABEL_SMOOTHING` to a value in `(0, 1)`. Models were trained on a single NVIDIA 32GB V100 GPU, although 16GB GPUs can also be used with smaller batch sizes.
```bash
export DEVICE=0
export MODEL="bert-base-uncased" # options: bert-base-uncased, roberta-base
export TASK="SNLI" # options: SNLI, MNLI, QQP, TwitterPPDB, SWAG, HellaSWAG
export LABEL_SMOOTHING=-1 # options: -1 (MLE), or a value in (0, 1) for label smoothing
export MAX_SEQ_LENGTH=256
if [ "$MODEL" = "bert-base-uncased" ]; then
  BATCH_SIZE=16
  LEARNING_RATE=2e-5
  WEIGHT_DECAY=0
fi
if [ "$MODEL" = "roberta-base" ]; then
  BATCH_SIZE=32
  LEARNING_RATE=1e-5
  WEIGHT_DECAY=0.1
fi
python3 train.py \
  --device $DEVICE \
  --model $MODEL \
  --task $TASK \
  --ckpt_path "ckpt/${TASK}_${MODEL}.pt" \
  --output_path "output/${TASK}_${MODEL}.json" \
  --train_path "calibration_data/${TASK}/train.txt" \
  --dev_path "calibration_data/${TASK}/dev.txt" \
  --test_path "calibration_data/${TASK}/test.txt" \
  --epochs 3 \
  --batch_size $BATCH_SIZE \
  --learning_rate $LEARNING_RATE \
  --weight_decay $WEIGHT_DECAY \
  --label_smoothing $LABEL_SMOOTHING \
  --max_seq_length $MAX_SEQ_LENGTH \
  --do_train \
  --do_evaluate
```
### Evaluating Calibration
We evaluate calibration using the output files dumped in the previous step (when `--do_evaluate` is enabled). Below is an example script that evaluates the calibration of RoBERTa-MLE on QQP using temperature scaling. Note that we use QQP-dev to learn the temperature scaling hyperparameter `T`, then evaluate its performance on QQP-test.
```bash
export TRAIN_PATH="output/dev/QQP_QQP_roberta-base.json"
export TEST_PATH="output/test/QQP_QQP_roberta-base.json"
python3 calibrate.py \
  --train_path $TRAIN_PATH \
  --test_path $TEST_PATH \
  --do_train \
  --do_evaluate
```
Here is a sample output generated by the previous command:
```
*** training ***
temperature = 1.35
*** evaluating ***
accuracy = 90.92752906257729
confidence = 90.47990015170494
temperature = 1.35
neg log likelihood = 0.21548164472197104
expected error = 1.0882433139537815
max error = 2.4872166586668243
total error = 7.413702354436335
```