# Cspider11

**Repository Path**: Samuelcoding/Cspider11

## Basic Information

- **Project Name**: Cspider11
- **Description**: CSpider is a Chinese text-to-SQL dataset proposed by Westlake University at EMNLP 2019. It takes Spider as the source dataset and translates its questions into Chinese, uses SyntaxSQLNet as the baseline system for evaluation, and explores the additional challenges that arise in Chinese, including mapping Chinese questions onto English databases (question-to-DB mapping), Chinese word segmentation, and several other linguistic phenomena. Challenge link: https://taolusi.github.io/CSpider-explorer/
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2020-06-15
- **Last Updated**: 2020-12-18

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# CSpider: A Large-Scale Chinese Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task

CSpider is a large Chinese dataset for the complex and cross-domain semantic parsing and text-to-SQL task (natural language interfaces for relational databases). It is released with our EMNLP 2019 paper: [A Pilot Study for Chinese SQL Semantic Parsing](https://arxiv.org/abs/1909.13293).

This repo contains all code for evaluation, preprocessing, and all baselines used in our paper. Please refer to [the task site](https://taolusi.github.io/CSpider-explorer/) for a more general introduction and the leaderboard.

### Changelog

- `10/2019` We started a Chinese text-to-SQL task with the full dataset translated from [Spider](https://yale-lily.github.io/spider). The submission tutorial and our dataset can be found at our [task site](https://taolusi.github.io/CSpider-explorer/). Please follow it to get your results on the unreleased test data. We thank [Tao Yu](https://taoyds.github.io/) for sharing the test set with us.
- `9/2019` The dataset used in our EMNLP 2019 paper is re-divided based on the training and development sets from Spider. It can be downloaded from [here](https://drive.google.com/drive/folders/1SVAdUQqZ2UjjcSCSxhVXRPcXxIMu1r_C?usp=sharing). This dataset is released only to reproduce the results in our paper. To join the CSpider leaderboard and better compare with the original English results, please refer to our [task site](https://taolusi.github.io/CSpider-explorer/) for the full dataset.

### Citation

When you use the CSpider dataset, we would appreciate it if you cite the following:

```
@inproceedings{min2019pilot,
  title={A Pilot Study for Chinese SQL Semantic Parsing},
  author={Min, Qingkai and Shi, Yuefeng and Zhang, Yue},
  booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)},
  pages={3643--3649},
  year={2019}
}
```

Our dataset is based on [Spider](https://github.com/taoyds/spider/); please cite it too.

### Baseline models

#### Environment Setup

1. The code uses Python 2.7 and [PyTorch 0.2.0](https://pytorch.org/get-started/previous-versions/) with GPU support; we will update to newer Python and PyTorch versions soon.
2. Install PyTorch via conda: `conda install pytorch=0.2.0 -c pytorch`
3. Install the Python dependencies: `pip install -r requirements.txt`

#### Prepare Data, Embeddings, and Pretrained Models

1. Download the data, embeddings and databases:
    - To use the full dataset (recommended), download the train/dev data from [Google Drive](https://drive.google.com/drive/folders/1TxCUq1ydPuBdDdHF3MkHT-8zixluQuLa?usp=sharing) or [BaiduNetDisk](https://pan.baidu.com/s/1Dxj38wRbbTOe0t3mQ3qhMg) and evaluate on the unreleased test data following the submission tutorial on our [task site](https://taolusi.github.io/CSpider-explorer/). Specifically:
        - Put the downloaded `train.json` and `dev.json` under the `chisp/data/char/` directory. To use word-based methods, do the word segmentation first and put the resulting json files under the `chisp/data/word/` directory (see the segmentation sketch after this list).
        - Put the downloaded `char_emb.txt` under the `chisp/embedding/` directory. It is generated from the Tencent multilingual embeddings for the cross-lingual word embedding schema. To use the monolingual embedding schema, step 2 is necessary.
        - Put the downloaded `database` directory under the `chisp/` directory.
        - Put the downloaded `train_gold.sql` and `dev_gold.sql` under the `chisp/data/` directory.
    - To use the dataset re-divided from the original train and dev data in our paper, download the train/dev/test data from [here](https://drive.google.com/drive/folders/1SVAdUQqZ2UjjcSCSxhVXRPcXxIMu1r_C?usp=sharing). This dataset is released only to reproduce the results in our paper; results based on it cannot join the leaderboard. Specifically:
        - Put the downloaded `data`, `database` and `embedding` directories under the `chisp/` directory. You can then run all the experiments (step 2 is necessary) shown in our paper.
        - The `models` directory contains all the pretrained models.
2. (optional) Download the pretrained [GloVe](https://nlp.stanford.edu/data/wordvecs/glove.42B.300d.zip) embeddings and put them at `chisp/embedding/glove.%dB.%dd.txt`.
3. Generate the training files for each module: `python preprocess_data.py -s char|word`
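The word-based json files are not provided directly; any Chinese segmenter can be used to create them. Below is a minimal, illustrative sketch (not part of this repository) that assumes [jieba](https://github.com/fxsjy/jieba) is installed, that it is run from the `chisp/` directory, and that each example stores the question text in a `question` field and its tokens in a `question_toks` field as in the Spider JSON format; check the downloaded files for the actual field names.

```python
# -*- coding: utf-8 -*-
"""Illustrative word segmentation for the word-based setting (not part of chisp).

Assumes Spider-style fields "question"/"question_toks"; adjust if your
downloaded json files use different field names.
"""
import io
import json

import jieba  # assumed segmenter; any Chinese tokenizer works


def segment_file(src_path, dst_path):
    with io.open(src_path, encoding="utf-8") as f:
        examples = json.load(f)
    for ex in examples:
        # Replace the character-level tokens with jieba word segments.
        ex["question_toks"] = list(jieba.cut(ex["question"]))
    with io.open(dst_path, "w", encoding="utf-8") as f:
        f.write(json.dumps(examples, ensure_ascii=False, indent=2))


if __name__ == "__main__":
    for split in ("train", "dev"):
        segment_file("data/char/%s.json" % split, "data/word/%s.json" % split)
```

The output mirrors the char-based files, so `python preprocess_data.py -s word` should then be able to generate the word-based training files in the same way as the char-based ones.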
#### Folder/File Description

- ``data/`` contains:
  - ``char/`` for the character-based raw train/dev/test data; the corresponding processed dataset and saved models can be found at ``char/generated_datasets``.
  - ``word/`` for the word-based raw train/dev/test data; the corresponding processed dataset and saved models can be found at ``word/generated_datasets``.
- ``train.py`` is the main file for training. Use ``train_all.sh`` to train all the modules (see below).
- ``test.py`` is the main file for testing. It uses ``supermodel.py`` to call the trained modules and generate SQL queries. In practice, use ``test_gen.sh`` to generate SQL queries.
- ``evaluation.py`` is for evaluation. It uses ``process_sql.py``. In practice, use ``evaluation.sh`` to evaluate the generated SQL queries.

#### Training

Run ``train_all.sh`` to train all the modules. It looks like:

```
python train.py \
    --data_root path/to/char/or/word/based/generated_data \
    --save_dir path/to/save/trained/module \
    --train_component \
    --emb_path path/to/embeddings --col_emb_path path/to/corresponding/embeddings/for/column
```

#### Testing

Run ``test_gen.sh`` to generate SQL queries. ``test_gen.sh`` looks like:

```
python test.py \
    --test_data_path path/to/char/or/word/based/raw/dev/or/test/data \
    --models path/to/trained/module \
    --output_path path/to/print/generated/SQL \
    --emb_path path/to/embeddings --col_emb_path path/to/corresponding/embeddings/for/column
```

#### Evaluation

Run ``evaluation.sh`` to evaluate the generated SQL queries. ``evaluation.sh`` looks like:

```
python evaluation.py \
    --gold path/to/gold/dev/or/test/queries \
    --pred path/to/predicted/dev/or/test/queries \
    --etype evaluation/metric \
    --db path/to/database \
    --table path/to/tables
```

``evaluation.py`` follows the general evaluation process from [the Spider github page](https://github.com/taoyds/spider).

#### Acknowledgement

The implementation is based on [SyntaxSQLNet](https://github.com/taoyds/syntaxSQL). Please cite it too if you use this code.
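#### Example: End-to-End Run (Illustrative)

For orientation, the steps above chain together roughly as follows for the char-based setting. All concrete paths, the ``col`` component name and the ``match`` metric below are illustrative assumptions rather than values taken from this repository; the authoritative values are in ``train_all.sh``, ``test_gen.sh``, ``evaluation.sh`` and the downloaded data.

```
# 1. Generate the per-module training files from the char-based data.
python preprocess_data.py -s char

# 2. Train one module (train_all.sh repeats this for every module; "col" is an
#    example component name).
python train.py \
    --data_root data/char/generated_datasets \
    --save_dir saved_models/char \
    --train_component col \
    --emb_path embedding/char_emb.txt --col_emb_path embedding/char_emb.txt

# 3. Generate SQL for the dev set with the trained modules.
python test.py \
    --test_data_path data/char/dev.json \
    --models saved_models/char \
    --output_path predicted_dev.sql \
    --emb_path embedding/char_emb.txt --col_emb_path embedding/char_emb.txt

# 4. Compare the predictions against the gold queries ("match" assumes the
#    Spider-style exact-match metric; check evaluation.sh for the options used).
python evaluation.py \
    --gold data/dev_gold.sql \
    --pred predicted_dev.sql \
    --etype match \
    --db database \
    --table tables.json
```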