# CodeT5

**Repository Path**: ecust-dp/code-t5

## Basic Information

- **Project Name**: CodeT5
- **Description**: Code replication for the Vulnerability Identification experiments in the EMNLP 2021 paper "CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation"
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2024-05-17
- **Last Updated**: 2025-01-13

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

#### Project Introduction

Code replication for the Vulnerability Identification experiments in the EMNLP 2021 paper "CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation".

#### System Environment

**P40-1** (HPC platform application name: gym)
- CPU: 10 cores
- RAM: 100 GB
- GPU: NVIDIA Tesla P40 24 GB
- OS: Ubuntu 18.04

**P40-2** (HPC platform application name: Tensorflow_GPU)
- CPU: 10 cores
- RAM: 64 GB
- GPU: NVIDIA Tesla P40 24 GB
- OS: CentOS 7.7

**3090** (HPC platform application name: Desktop_GPU)
- CPU: 10 cores
- RAM: 20 GB
- GPU: NVIDIA GeForce RTX 3090
- OS: CentOS 7.8

#### Environment Setup

```
git clone https://gitee.com/ecust-dp/code-t5.git
cd code-t5
conda create --name CodeT5 python=3.8
conda activate CodeT5
pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt
```

Manually download **added_tokens.json**, **config.json**, **merges.txt**, **pytorch_model.bin**, **special_tokens_map.json**, **tokenizer_config.json**, and **vocab.json** from [Hugging Face](https://huggingface.co/Salesforce/codet5-base/tree/main) and upload them to the **/pretrained_models/codet5_base/** folder. Alternatively, download **codet5_base.tar.gz** from the HPC platform (Path: Ecust-SE/HuggingFace/), upload it to the **/pretrained_models/** folder, and extract the **codet5_base** folder containing the above files via:

```
tar -zxvf codet5_base.tar.gz
```

**Note:** The above steps also apply to
codet5-small/codet5_small; a similar procedure applies when using the pretrained files of the other baselines.

Manually download **test.jsonl**, **train.jsonl**, and **valid.jsonl** from [Google Drive](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/data/defect) and upload them to the **/data/defect/** folder. Alternatively, download **defect.tar.gz** from the HPC platform (Path: Ecust-SE/CodeT5-VI/), upload it to the **/data/** folder, and extract the **defect** folder containing the above files via:

```
tar -zxvf defect.tar.gz
```

#### Usage

Open the experiment script:

```
cd sh
vi exp_with_args.sh
```

and make the following modification:

Line 1: `WORKDIR="Path/to/code-t5"`

**Fine-tuning codet5_base**

```
python run_exp.py --model_tag codet5_base --task defect --sub_task none
```

**Fine-tuning codet5_small**

```
python run_exp.py --model_tag codet5_small --task defect --sub_task none
```

**Fine-tuning plbart_base**

```
python run_exp.py --model_tag bart_base --task defect --sub_task none
```

**Fine-tuning codebert_base**

```
python run_exp.py --model_tag codebert --task defect --sub_task none
```

**Fine-tuning roberta**

```
python run_exp.py --model_tag roberta --task defect --sub_task none
```

#### Results Comparison

Table 5: Results on the code defect detection task.

| Methods | Acc-paper | Acc-3090 | Acc-P40-1 | Acc-P40-2 |
|--------------|-----------|--------------|-------------|--------------|
| RoBERTa | 61.05 | 54.06 | 54.06 | 54.06 |
| CodeBERT | 62.08 | 63.91 | 63.54 | 63.54 |
| PLBART | 63.18 | 63.40 | | 64.06 |
| CodeT5-small | 63.40 | 63.69^2^ | **64.42** | **64.42^2^** |
| CodeT5-base0 | **65.78** | | | 63.87 |
| CodeT5-base | **65.78** | **64.39^2^** | 62.92 | 62.92^2^ |

**Notes:**
1. CodeT5-base0 denotes experiments run with the default run_gen.py; all other results were obtained with run_defect.py.
2. Boldface marks the best result.
3. ^2^ means the experiment was run ≥ 2 times with the same parameter settings (**default: epoch 10 + patience 2, valid best: epoch 5**) on the same machine and produced identical results.
4. The results of CodeT5-small on the 3090 are **63.47** and **63.87** for **epoch 15 + patience 5 (valid best: epoch 5)** and **epoch 30 + patience 10 (valid best: epoch 5)**, respectively.
5. The results of CodeT5-base on the 3090 are **64.39** and **63.40^2^** for **epoch 15 + patience 5 (valid best: epoch 3)** and **epoch 30 + patience 10 (valid best: epoch 2)**, respectively.

#### Analysis of Experimental Results

**Findings:**

1. The experimental results differ across machines, but are fixed for a specific CPU.
2. Compared with the 3090, the CodeT5 experiments conducted on the P40s yield contrary results.
3. The replicated RoBERTa results are identical across GPUs and are significantly lower than the Acc score reported in the original paper. In fact, the baseline results presented in the CodeT5 paper are identical to those reported in the PLBART paper, which suggests the authors may not have re-run these baselines and instead copied the numbers from PLBART.
4. The actual performance of PLBART is comparable to CodeT5, rather than the wide gap between them reported in the paper.
5. More experiments with different datasets and machines are needed to further validate the real performance of these pre-trained models; in addition, all models should be evaluated on more evaluation metrics.
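One quick sanity check for a suspiciously machine-invariant accuracy like RoBERTa's 54.06 is to compare it against the majority-class baseline of the test split: a classifier that collapses to always predicting the most frequent label will reproduce that baseline exactly on every machine. The sketch below is a hypothetical helper (it assumes the CodeXGLUE-style defect jsonl format where each line carries a `target` label; adjust the field name if your copy of data/defect/*.jsonl differs) and is illustrated on toy data rather than the real files:

```python
# Hedged sketch: majority-class baseline accuracy for a defect-detection
# jsonl split. Assumes each line is a JSON object with a "target" label
# (CodeXGLUE-style); not taken from this repository's own code.
import json
from collections import Counter

def majority_baseline_acc(jsonl_lines):
    """Accuracy of always predicting the most frequent label."""
    labels = [json.loads(line)["target"] for line in jsonl_lines]
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Toy imbalanced sample (not the real defect data): 3 negatives, 2 positives.
sample = [json.dumps({"func": "int f(void) { return 0; }", "target": t})
          for t in [0, 0, 0, 1, 1]]
print(majority_baseline_acc(sample))  # 0.6
```

If the real test split's majority-class ratio turns out to match 54.06%, that would support reading the RoBERTa replication as a degenerate constant predictor rather than a genuinely trained model.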