# CREAD: Combined Resolution of Ellipses and Anaphora in Dialogues

[**Paper**](https://arxiv.org/abs/2105.09914) | [**Task**](#Task-Description) | [**Dataset**](#Dataset) | [**Run Code**](#Run-the-Code) | [**Citation**](#Citation) | [**License**](#License) | [**Contact**](#Contact-Us)

This is the source code for the paper [CREAD: Combined Resolution of Ellipses and Anaphora in Dialogues](https://arxiv.org/abs/2105.09914). In this work, we propose a novel joint learning framework that models coreference resolution and query rewriting for complex, multi-turn dialogue understanding. The coreference resolution dataset [MuDoCo](https://github.com/facebookresearch/mudoco), augmented with our query rewrite annotations, is released as well.

## Task Description

Given an ongoing dialogue between a user and a dialogue assistant, the model is required to predict, for the user query, both the coreference links between the query and the dialogue context and a self-contained rewritten user query that is independent of the dialogue context.

## Dataset

The MuDoCo dataset is a public dataset that contains 7.5k task-oriented multi-turn dialogues across 6 domains (calling, messaging, music, news, reminders, weather). Each dialogue turn is annotated with coreference links (`links` field). Please refer to [MuDoCo](https://github.com/facebookresearch/mudoco) for more details.

In the **MuDoCo-QR-dataset** used in this work, we annotate the query rewrite for each utterance, including both user and system turns.
On top of the MuDoCo data format, we add three fields: `graded`, `rewrite_required` and `rewritten_utterance`. Most turns are annotated with a query rewrite (`graded` is true). Only 1.4% of dialogue turns, those with incomplete dialogue context (e.g., missing turns) in MuDoCo, are filtered out (`graded` is false). `rewrite_required` records whether the input utterance should be rewritten. `rewritten_utterance` is the rewritten query, which is the same as the utterance if `rewrite_required` is false. The resulting dataset is provided in the folder `MuDoCo-QR-dataset`.

```json
{
    "number": 3,
    "utterance": "Show me a live version that he moonwalks on .",
    "links": [
        [
            {
                "turn_id": 1,
                "text": "Michael Jackson",
                "span": {
                    "start": 5,
                    "end": 20
                }
            },
            {
                "turn_id": 3,
                "text": "he",
                "span": {
                    "start": 28,
                    "end": 30
                }
            }
        ]
    ],
    "graded": true,
    "rewritten_utterance": "Show me a live version that Michael Jackson moonwalks on",
    "rewrite_required": true
}
```

## Requirements

Python 3.6 and the packages in `requirements.txt`. Install them by running:

```console
>>> pip install -r requirements.txt
```

## Run the Code

Enter the `modeling` folder and follow the instructions below.

```console
>>> cd modeling
```

## Data Pre-processing

First, run the following command to prepare the data for training. The processed data will be stored in the `proc_data/` directory.

```console
>>> python utils/process_data.py
```

## Training

Run `train.sh` to train the model. The script calls `main.py` with default hyper-parameters.

```console
>>> bash train.sh [job_name]
```

The model checkpoint will be stored at `checkpoint/$job_name`, and the training log file at `log/$job_name.log`. A reference training log (`log/trained-cread.log`) is provided.

## Evaluation

Run `decode.sh` to decode using a trained model. `job_name` is the same as the one specified in training.

```console
>>> bash decode.sh [job_name]
```

The evaluation result, with both the generated rewritten utterances and the model performance, is recorded in `decode/$job_name.json`.
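
The output locations named in the Training and Evaluation sections can be collected in one place with a short script. This is only a minimal sketch: `job_artifacts` and `load_decode_result` are hypothetical helper names that are not part of this repository, the paths are assumed to be relative to the `modeling` folder, and nothing is assumed about the schema of the decoding file beyond it being valid JSON.

```python
import json
from pathlib import Path


def job_artifacts(job_name: str, root: Path = Path(".")) -> dict:
    """Return the output locations this README names for a given job.

    Hypothetical helper: `root` is assumed to be the `modeling` folder.
    """
    return {
        "checkpoint": root / "checkpoint" / job_name,
        "log": root / "log" / f"{job_name}.log",
        "decode": root / "decode" / f"{job_name}.json",
    }


def load_decode_result(job_name: str, root: Path = Path(".")):
    """Load decode/<job_name>.json, or return None if it does not exist yet."""
    path = job_artifacts(job_name, root)["decode"]
    if not path.is_file():
        return None
    with path.open(encoding="utf-8") as f:
        return json.load(f)


if __name__ == "__main__":
    # "trained-cread" is the job name of the reference log and decoding file.
    for name, path in job_artifacts("trained-cread").items():
        print(f"{name}: {path} (exists: {path.exists()})")

    result = load_decode_result("trained-cread")
    if isinstance(result, (dict, list)):
        # The file's schema is not documented here, so only report
        # its top-level JSON type and number of entries.
        print(type(result).__name__, len(result))
```

Running the script from the `modeling` folder after training and decoding shows which of the three outputs exist for a job.
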
A reference decoding file (`decode/trained-cread.json`) is provided.

## Key Hyper-parameters in main.py

- `task`: which task to perform. The default value `qr-coref` specifies our complete joint learning model. Set it to `qr` for the `qr-only` model variant or to `coref` for the `coref-only` model variant.
- `coref_layer_idx`: which GPT-2 layers to use for coreference resolution, e.g., "1,5,11" uses three layers. Each layer index is between 0 and 11 if the default GPT-2 small is used.
- `n_coref_head`: how many attention heads to use in each layer for coreference resolution. The value is between 1 and 12.
- `use_coref_attn`: whether to use the coref2qr attention mechanism.
- `use_binary_cls`: whether to use the binary rewriting classifier.

More detailed explanations of the other arguments can be found in `utils/utils.py`.

## Citation

```bibtex
@inproceedings{tseng-etal-2021-cread,
    title = "{CREAD}: Combined Resolution of Ellipses and Anaphora in Dialogues",
    author = "Tseng, Bo-Hsiang and
      Bhargava, Shruti and
      Lu, Jiarui and
      Moniz, Joel Ruben Antony and
      Piraviperumal, Dhivya and
      Li, Lin and
      Yu, Hong",
    booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jun,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.naacl-main.265",
    pages = "3390--3406",
}
```

## License

The code in this repository is licensed according to the [LICENSE](LICENSE) file.

## Contact Us

Please contact bht26@cam.ac.uk or hong_yu@apple.com, or raise an issue in this repository.