# cnn-dailymail

**Repository Path**: git_mirror/cnn-dailymail

## Basic Information

- **Project Name**: cnn-dailymail
- **Description**: https://gitee.com/projects/import/url
Code to obtain the CNN / Daily Mail dataset (non-anonymized) for summarization (Python3)
- **Primary Language**: Python
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2023-12-22
- **Last Updated**: 2024-04-08

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

This code produces the non-anonymized version of the CNN / Daily Mail summarization dataset for fine-tuning [BART](https://github.com/pytorch/fairseq/tree/master/examples/bart). It processes the dataset into the non-tokenized cased sample format expected by [BPE preprocessing](https://github.com/pytorch/fairseq/blob/master/examples/bart/README.cnn.md).

# Instructions

## 1. Download data
Download and unzip the `stories` directories from [here](http://cs.nyu.edu/~kcho/DMQA/) for both CNN and Daily Mail.

## 2. Process into .source and .target files
Run

```
python make_datafiles.py /path/to/cnn/stories /path/to/dailymail/stories
```

replacing `/path/to/cnn/stories` with the path to where you saved the `cnn/stories` directory that you downloaded; similarly for `dailymail/stories`.

For each of the URL lists (`all_train.txt`, `all_val.txt` and `all_test.txt`), the corresponding stories are read from file and written to text files `train.source`, `train.target`, `val.source`, `val.target`, and `test.source` and `test.target`. These will be placed in the newly created `cnn_dm` directory.

The output is now suitable for feeding to the BPE preprocessing step of BART fine-tuning.