# dl4marco-bert

**Repository Path**: lduml/dl4marco-bert

## Basic Information

- **Project Name**: dl4marco-bert
- **Description**: Re-ranks retrieved candidate documents with BERT, applied to the Microsoft MS MARCO dataset and to TREC-CAR
- **License**: BSD-3-Clause
- **Default Branch**: master

## README

# Notes

This repository builds a classification model with BERT and computes MRR, MAP, RPrec, MRR@10, NDCG, etc. from its predictions.

It is worth tracing how `item["log_probs"]` and `item["label_ids"]` are used:

- `create_model` returns `(loss, per_example_loss, log_probs)`, where `log_probs` is the (log-)softmax output.
- At prediction time, the model returns `(log_probs, label_ids)`: the predicted probabilities and the ground-truth labels.

```python
import numpy as np

# `results` is a list of (log_probs, label_ids) pairs collected at prediction time.
log_probs, labels = zip(*results)
log_probs = np.stack(log_probs).reshape(-1, 2)  # one (not relevant, relevant) pair per doc
labels = np.stack(labels)

scores = log_probs[:, 1]                 # score of the "relevant" class
pred_docs = scores.argsort()[::-1]       # doc indices, most to least relevant
gt = set(list(np.where(labels > 0)[0]))  # indices of the truly relevant docs
```

From `log_probs` we thus obtain the predicted ranking; for example, the predicted doc order might be 1, 3, 2, 5, 4 (1 most relevant, 4 least relevant). The evaluation score is then computed from `pred_docs` and `gt` (a minimal MRR@10 sketch is given at the end of the Introduction below).

# Passage Re-ranking with BERT

## Introduction

**\*\*\*\*\* Most of the code in this repository was copied from the original [BERT repository](https://github.com/google-research/bert). \*\*\*\*\***

This repository contains the code to reproduce our entry to the [MSMARCO passage ranking task](http://www.msmarco.org/leaders.aspx), which placed first by a large margin over the second-place entry. It also contains the code to reproduce our result on the [TREC-CAR dataset](http://trec-car.cs.unh.edu/), which is ~22 MAP points higher than the best entry from 2017 and a well-tuned BM25.

MSMARCO Passage Re-Ranking Leaderboard (Jan 8th 2019) | Eval MRR@10 | Dev MRR@10
------------------------------------- | :------: | :------:
1st Place - BERT (this code) | **35.87** | **36.53**
2nd Place - IRNet | 28.06 | 27.80
3rd Place - Conv-KNRM | 27.12 | 29.02

TREC-CAR Test Set (Automatic Annotations) | MAP
----------------------------------------------------- | :------:
BERT (this code) | **33.5**
BM25 [Anserini](https://github.com/castorini/Anserini/blob/master/docs/experiments-car17.md) | 15.6
[MacAvaney et al., 2017](https://trec.nist.gov/pubs/trec26/papers/MPIID5-CAR.pdf) (TREC-CAR 2017 Best Entry) | 14.8

The paper describing our implementation is [here](https://arxiv.org/abs/1901.04085).
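To tie the notes above to the leaderboard metric: MRR@10 is the reciprocal rank of the first relevant passage within the top 10, averaged over queries. Below is a minimal sketch, assuming a single query with `pred_docs` and `gt` as defined in the notes (this is an illustration, not the repo's evaluation code):

```python
import numpy as np

def mrr_at_10(pred_docs, gt):
    """Reciprocal rank of the first relevant doc in the top 10, else 0."""
    for rank, doc_id in enumerate(pred_docs[:10], start=1):
        if doc_id in gt:
            return 1.0 / rank
    return 0.0

# Toy example: predicted order 1, 3, 2, 5, 4, and doc 2 is the relevant one.
pred_docs = np.array([1, 3, 2, 5, 4])
gt = {2}
print(mrr_at_10(pred_docs, gt))  # 0.333..., since doc 2 is ranked third
```

The leaderboard numbers above are this quantity averaged over all dev or eval queries (reported x100 in the table).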
## Data

We made available the following data:

File | Description | Size | MD5
:----|:----|-----:|:----
[BERT_Large_trained_on_MSMARCO.zip](https://drive.google.com/open?id=1crlASTMlsihALlkabAQP6JTYIZwC1Wm8) | BERT-large trained on MS MARCO | 3.4 GB | `2616f874cdabadafc55626035c8ff8e8`
[BERT_Base_trained_on_MSMARCO.zip](https://drive.google.com/open?id=1cyUrhs7JaCJTTu-DjFUqP6Bs4f8a6JTX) | BERT-base trained on MS MARCO | 1.1 GB | `7a8c621e01c127b55dbe511812c34910`
[MSMARCO_tfrecord.tar.gz](https://drive.google.com/open?id=1IHFMLOMf2WqeQ0TuZx_j3_sf1Z0fc2-6) | MS MARCO TF Records | 9.1 GB | `c15d80fe9a56a2fb54eb7d94e2cfa4ef`
[BERT_Large_dev_run.tsv](https://drive.google.com/file/d/168BFaZyIaia1opBAZTI_CEH9XM8lHK63/view?usp=sharing) | BERT-large run dev set (~6980 queries x 1000 docs per query) | 121 MB | `bcbbe19bcb2549dea3f26168c2bc445b`
[BERT_Large_test_run.tsv](https://drive.google.com/file/d/1vDcyTODQk48xpbbcJax9I_cBJRilBEVm/view?usp=sharing) | BERT-large run test set (~6836 queries x 1000 docs per query) | 119 MB | `9779903606e5b545f491132d8c2cf292`
[BERT_Large_trained_on_TREC_CAR.tar.gz](https://drive.google.com/open?id=1fzcL2nzUJMUd0w4J5JIeASSrN4uHlSqP) | BERT-large trained on TREC-CAR | 3.4 GB | `8baedd876935093bfd2bdfa66f2279bc`
[BERT_Large_pretrained_on_TREC_CAR...](https://storage.googleapis.com/bert_treccar_data/pretrained_models/BERT_Large_pretrained_on_TREC_CAR_training_set_1M_iterations.tar.gz) | BERT-large pretrained on TREC-CAR's training set for 1M iterations | 3.4 GB | `9c6f2f8dbf9825899ee460ee52423b84`
[treccar_files.tar.gz](https://drive.google.com/open?id=16tk7HmLaqvU0oIO5L_H8elwqKn2cJUzG) | TREC-CAR queries, qrels, runs, and TF Records | 4.0 GB | `4e6b5580e0b2f2c709d76ac9c7e7f362`
[bert_predictions_test.run.tar.gz](https://drive.google.com/file/d/1bhTjtz_IK0ER5S-eV0AxyhjHCupLiukN/view?usp=sharing) | TREC-CAR 2017 Automatic Run reranked by BERT-Large | 71 MB | `d5c135c6cf5a6d25199bba29d43b58ba`

## MS MARCO

### Download and extract the data

First, we need to download and extract MS MARCO and BERT files:

```
DATA_DIR=./data
mkdir ${DATA_DIR}

wget https://msmarco.blob.core.windows.net/msmarcoranking/triples.train.small.tar.gz -P ${DATA_DIR}
wget https://msmarco.blob.core.windows.net/msmarcoranking/top1000.dev.tar.gz -P ${DATA_DIR}
wget https://msmarco.blob.core.windows.net/msmarcoranking/top1000.eval.tar.gz -P ${DATA_DIR}
wget https://msmarco.blob.core.windows.net/msmarcoranking/qrels.dev.small.tsv -P ${DATA_DIR}
wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-24_H-1024_A-16.zip -P ${DATA_DIR}

tar -xvf ${DATA_DIR}/triples.train.small.tar.gz -C ${DATA_DIR}
tar -xvf ${DATA_DIR}/top1000.dev.tar.gz -C ${DATA_DIR}
tar -xvf ${DATA_DIR}/top1000.eval.tar.gz -C ${DATA_DIR}
unzip ${DATA_DIR}/uncased_L-24_H-1024_A-16.zip -d ${DATA_DIR}
```

### Convert MS MARCO to TFRecord format

Next, we need to convert MS MARCO train, dev, and eval files to TFRecord files, which will be later consumed by BERT.
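For orientation before converting: the raw MS MARCO files are plain TSV. Below is a hedged sketch of reading them, assuming DATA_DIR=./data and the usual column layouts (query/positive/negative in the triples file; query_id/doc_id/query/passage in the top-1000 files); check a few lines of your local copies to confirm.

```python
import csv

# Training triples: one (query, relevant passage, non-relevant passage) per line.
with open("data/triples.train.small.tsv", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
    for query, positive, negative in reader:
        # The converter emits (query, positive, label=1) and (query, negative, label=0).
        break  # sketch: inspect only the first line

# Dev/eval candidates: the top-1000 BM25 passages per query.
with open("data/top1000.dev.tsv", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
    for query_id, doc_id, query, passage in reader:
        # Passages are grouped by query_id and re-ranked by BERT.
        break  # sketch: inspect only the first line
```

The converter tokenizes each query-passage pair into a single input sequence, truncating queries to `max_query_length` tokens and the whole sequence to `max_seq_length` tokens, matching the flags below.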
```
mkdir ${DATA_DIR}/tfrecord

python convert_msmarco_to_tfrecord.py \
  --output_folder=${DATA_DIR}/tfrecord \
  --vocab_file=${DATA_DIR}/uncased_L-24_H-1024_A-16/vocab.txt \
  --train_dataset_path=${DATA_DIR}/triples.train.small.tsv \
  --dev_dataset_path=${DATA_DIR}/top1000.dev.tsv \
  --eval_dataset_path=${DATA_DIR}/top1000.eval.tsv \
  --dev_qrels_path=${DATA_DIR}/qrels.dev.tsv \
  --max_query_length=64 \
  --max_seq_length=512 \
  --num_eval_docs=1000
```

For reference, here is the author's local Windows run of the converter on a 5,000-triple subset with 100 eval docs per query (the `E:\` paths are machine-specific; the leading `!` is Jupyter notebook syntax):

```bash
!python convert_msmarco_to_tfrecord.py \
  --output_folder=E:\deep_learning_data\2020_passage_ranking\tfrecord \
  --vocab_file=E:\deep_learning_data\2020_passage_ranking\BERT_Base_trained_on_MSMARCO\vocab.txt \
  --train_dataset_path=E:\deep_learning_data\2020_passage_ranking\triples.train.small.5000.tsv \
  --dev_dataset_path=E:\deep_learning_data\2020_passage_ranking\top1000.dev\top1000.dev.tsv \
  --eval_dataset_path=E:\deep_learning_data\2020_passage_ranking\top1000.eval\top1000.eval.tsv \
  --dev_qrels_path=E:\deep_learning_data\2020_passage_ranking\qrels.dev.tsv \
  --max_query_length=64 \
  --max_seq_length=512 \
  --num_eval_docs=100
```

This conversion takes 30-40 hours. Alternatively, you may download the [TFRecord files here](https://drive.google.com/open?id=1IHFMLOMf2WqeQ0TuZx_j3_sf1Z0fc2-6) (~23GB).

### Training

We can now start training. We highly recommend using the free TPUs in [our Google Colab](https://drive.google.com/open?id=1vaON2QlidC0rwZ8JFrdciWW68PYKb9Iu). Otherwise, a modern V100 GPU with 16GB of memory cannot fit even a small batch size of 2 when training a BERT Large model.
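For intuition about what run_msmarco.py trains: as described in the paper, BERT's [CLS] vector is fed to a single-layer neural network to obtain the probability that the passage is relevant, and passages are ranked by that probability. Below is a minimal numpy sketch of such a head (the function name and shapes are illustrative, not the repo's actual API):

```python
import numpy as np

def ranking_head(cls_vector, W, b):
    """Single-layer classifier over (not relevant, relevant) with a log-softmax output.

    cls_vector: BERT's [CLS] output, shape (hidden_size,)
    W: weights, shape (hidden_size, 2); b: bias, shape (2,)
    """
    logits = cls_vector @ W + b
    logits = logits - logits.max()                     # for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())  # log-softmax
    return log_probs

# Toy usage with random weights; the ranking score is log_probs[1],
# matching `scores = log_probs[:, 1]` in the notes at the top of this README.
rng = np.random.default_rng(0)
hidden_size = 1024  # BERT-Large
log_probs = ranking_head(rng.normal(size=hidden_size),
                         rng.normal(size=(hidden_size, 2)),
                         np.zeros(2))
```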
In case you opt not to use the Colab, here is the command line to start training:

```
python run_msmarco.py \
  --data_dir=${DATA_DIR}/tfrecord \
  --bert_config_file=${DATA_DIR}/uncased_L-24_H-1024_A-16/bert_config.json \
  --init_checkpoint=${DATA_DIR}/uncased_L-24_H-1024_A-16/bert_model.ckpt \
  --output_dir=${DATA_DIR}/output \
  --msmarco_output=True \
  --do_train=True \
  --do_eval=True \
  --num_train_steps=100000 \
  --num_warmup_steps=10000 \
  --train_batch_size=128 \
  --eval_batch_size=128 \
  --learning_rate=3e-6
```

For reference, the author's local run with a BERT-Base checkpoint and small batch sizes (machine-specific `E:\` paths; the leading `!` is Jupyter notebook syntax):

```
!python run_msmarco.py \
  --data_dir=E:\deep_learning_data\2020_passage_ranking/tfrecord \
  --bert_config_file=E:\deep_learning_model\tf_google_bert模型\uncased_L-12_H-768_A-12\bert_config.json \
  --init_checkpoint=E:\deep_learning_model\tf_google_bert模型\uncased_L-12_H-768_A-12\bert_model.ckpt \
  --output_dir=E:\deep_learning_data\2020_passage_ranking/output \
  --msmarco_output=True \
  --do_train=True \
  --do_eval=True \
  --num_train_steps=1000 \
  --num_warmup_steps=100 \
  --train_batch_size=4 \
  --eval_batch_size=8 \
  --learning_rate=3e-6
```

Training for 100k iterations takes approximately 30 hours on a TPU v3.

Alternatively, you can [download the trained model used in our submission here](https://drive.google.com/open?id=1crlASTMlsihALlkabAQP6JTYIZwC1Wm8) (~3.4GB). You can also [download a BERT Base model trained on MS MARCO here](https://drive.google.com/open?id=1cyUrhs7JaCJTTu-DjFUqP6Bs4f8a6JTX). This model leads to an MRR@10 that is ~2 points lower (34.7), but it is faster to train and evaluate, and it fits on a single 12GB GPU.

## TREC-CAR

The next sections describe how to reproduce our results on the [TREC-CAR](http://trec-car.cs.unh.edu/) dataset.

### Downloading qrels, run and TFRecord files

The next steps (indexing, retrieval, and TFRecord conversion) take many hours. Alternatively, you can skip them and download [the necessary files for training and evaluation here](https://drive.google.com/open?id=16tk7HmLaqvU0oIO5L_H8elwqKn2cJUzG) (~4.0GB), namely:

- queries (*.topics);
- query-relevant passage pairs (*.qrels);
- query-candidate passage pairs (*.run; see the parsing sketch below);
- TFRecord files (*.tf).

After downloading, you need to extract them to the TRECCAR_DIR folder:

```
TRECCAR_DIR=./treccar/
tar -xf treccar_files.tar.gz --directory ${TRECCAR_DIR}
```

You are then ready to jump to the training/evaluation section.
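The *.run files above (and those produced later by Anserini and by BERT) use the standard TREC run format: whitespace-separated `query_id Q0 doc_id rank score run_name`, where `Q0` is a legacy constant. A minimal parsing sketch, assuming that layout (the path in the usage comment is illustrative):

```python
from collections import defaultdict

def load_trec_run(path):
    """Parse a TREC run file into {query_id: [(doc_id, rank, score), ...]}."""
    run = defaultdict(list)
    with open(path) as f:
        for line in f:
            query_id, _q0, doc_id, rank, score, _run_name = line.split()
            run[query_id].append((doc_id, int(rank), float(score)))
    return run

# Illustrative usage once the files are extracted:
# run = load_trec_run("treccar/test.run")
```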
### Downloading and extracting the data

If you decide to index, retrieve, and convert to the TFRecord format yourself, you first need to download and extract the TREC-CAR data:

```
TRECCAR_DIR=./treccar/
DATA_DIR=./data
mkdir ${DATA_DIR}

wget http://trec-car.cs.unh.edu/datareleases/v2.0/paragraphCorpus.v2.0.tar.xz -P ${TRECCAR_DIR}
wget http://trec-car.cs.unh.edu/datareleases/v2.0/train.v2.0.tar.xz -P ${TRECCAR_DIR}
wget http://trec-car.cs.unh.edu/datareleases/v2.0/benchmarkY1-test.v2.0.tar.xz -P ${TRECCAR_DIR}
wget https://storage.googleapis.com/bert_treccar_data/pretrained_models/BERT_Large_pretrained_on_TREC_CAR_training_set_1M_iterations.tar.gz -P ${DATA_DIR}

tar -xf ${TRECCAR_DIR}/paragraphCorpus.v2.0.tar.xz -C ${TRECCAR_DIR}
tar -xf ${TRECCAR_DIR}/train.v2.0.tar.xz -C ${TRECCAR_DIR}
tar -xf ${TRECCAR_DIR}/benchmarkY1-test.v2.0.tar.xz -C ${TRECCAR_DIR}
tar -xzf ${DATA_DIR}/BERT_Large_pretrained_on_TREC_CAR_training_set_1M_iterations.tar.gz -C ${DATA_DIR}
```

### Indexing TREC-CAR

We need to index the corpus and retrieve documents with the BM25 algorithm for each query, so that we have query-document pairs for training. We index the TREC-CAR corpus using [Anserini](https://github.com/castorini/Anserini), an excellent toolkit for information retrieval research.

First, we need to install Maven, and clone and compile Anserini's repository:

```
sudo apt-get install maven
git clone https://github.com/castorini/Anserini.git
cd Anserini
mvn clean package appassembler:assemble
tar xvfz eval/trec_eval.9.0.4.tar.gz -C eval/ && cd eval/trec_eval.9.0.4 && make
cd ../ndeval && make
```

Now we can index the corpus (.cbor files):

```
sh target/appassembler/bin/IndexCollection -collection CarCollection \
  -generator LuceneDocumentGenerator -threads 40 \
  -input ${TRECCAR_DIR}/paragraphCorpus.v2.0 \
  -index ${TRECCAR_DIR}/lucene-index.car17.pos+docvectors+rawdocs \
  -storePositions -storeDocvectors -storeRawDocs
```

You should see a message like this after it finishes:

```
2019-01-15 20:26:28,742 INFO [main] index.IndexCollection (IndexCollection.java:578) - Total 29,794,689 documents indexed in 03:20:35
```

### Retrieving query-candidate document pairs

We now retrieve candidate documents for each query using the BM25 algorithm. But first, we need to convert the TREC-CAR files to a format that Anserini can consume.

We start by merging qrels folds 0, 1, 2, and 3 into a single file for training. Fold 4 will be the dev set.

```
for f in ${TRECCAR_DIR}/train/fold-[0-3]-base.train.cbor-hierarchical.qrels; do (cat "${f}"; echo); done > ${TRECCAR_DIR}/train.qrels
cp ${TRECCAR_DIR}/train/fold-4-base.train.cbor-hierarchical.qrels ${TRECCAR_DIR}/dev.qrels
cp ${TRECCAR_DIR}/benchmarkY1/benchmarkY1-test/test.pages.cbor-hierarchical.qrels ${TRECCAR_DIR}/test.qrels
```

Next, we extract the queries (the first column of the space-separated files):

```
cat ${TRECCAR_DIR}/train.qrels | cut -d' ' -f1 > ${TRECCAR_DIR}/train.topics
cat ${TRECCAR_DIR}/dev.qrels | cut -d' ' -f1 > ${TRECCAR_DIR}/dev.topics
cat ${TRECCAR_DIR}/test.qrels | cut -d' ' -f1 > ${TRECCAR_DIR}/test.topics
```

And remove all duplicated queries:

```
sort -u -o ${TRECCAR_DIR}/train.topics ${TRECCAR_DIR}/train.topics
sort -u -o ${TRECCAR_DIR}/dev.topics ${TRECCAR_DIR}/dev.topics
sort -u -o ${TRECCAR_DIR}/test.topics ${TRECCAR_DIR}/test.topics
```
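For reference, qrels files follow the standard TREC layout of space-separated `query_id 0 doc_id relevance`, which is why taking the first column yields the topics. A hedged Python equivalent of the `cut`/`sort -u` pipeline above:

```python
def unique_topics(qrels_path):
    """Python equivalent of `cut -d' ' -f1 | sort -u` on a TREC qrels file."""
    topics = set()
    with open(qrels_path) as f:
        for line in f:
            if line.strip():
                topics.add(line.split(" ", 1)[0])  # first space-separated column
    return sorted(topics)

# Illustrative usage:
# topics = unique_topics("treccar/train.qrels")
```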
We now retrieve the top-10 documents per query for the training and development sets:

```
nohup target/appassembler/bin/SearchCollection -topicreader Car -index ${TRECCAR_DIR}/lucene-index.car17.pos+docvectors+rawdocs -topics ${TRECCAR_DIR}/train.topics -output ${TRECCAR_DIR}/train.run -hits 10 -bm25 &

nohup target/appassembler/bin/SearchCollection -topicreader Car -index ${TRECCAR_DIR}/lucene-index.car17.pos+docvectors+rawdocs -topics ${TRECCAR_DIR}/dev.topics -output ${TRECCAR_DIR}/dev.run -hits 10 -bm25 &
```

And we retrieve the top-1,000 documents per query for the test set:

```
nohup target/appassembler/bin/SearchCollection -topicreader Car -index ${TRECCAR_DIR}/lucene-index.car17.pos+docvectors+rawdocs -topics ${TRECCAR_DIR}/test.topics -output ${TRECCAR_DIR}/test.run -hits 1000 -bm25 &
```

After it finishes, you should see an output message like this:

```
(SearchCollection.java:166) - [Finished] Ranking with similarity: BM25(k1=0.9,b=0.4)
2019-01-16 23:40:56,538 INFO [pool-2-thread-1] search.SearchCollection$SearcherThread (SearchCollection.java:167) - Run 2254 topics searched in 01:53:32
2019-01-16 23:40:56,922 INFO [main] search.SearchCollection (SearchCollection.java:499) - Total run time: 01:53:36
```

This retrieval step takes 40-80 hours for the training set. It can be sped up by increasing the number of threads (e.g., -threads 6) and loading the index into memory (the -inmem option).

### Measuring BM25 Performance (optional)

To make sure that indexing and retrieval worked correctly, we can measure the performance of the document lists retrieved with BM25:

```
eval/trec_eval.9.0.4/trec_eval -m map -m recip_rank -c ${TRECCAR_DIR}/test.qrels ${TRECCAR_DIR}/test.run
```

It is important to use the -c option, as it assigns a score of zero to queries for which no passage was returned. The output should look like this:

```
map        	all	0.1528
recip_rank 	all	0.2294
```

### Converting TREC-CAR to TFRecord

We can now convert the qrels (query-relevant document pairs), runs (query-candidate document pairs), and the corpus into training, dev, and test TFRecord files that will be consumed by BERT. (We first need to install the CBOR package: `pip install cbor`.)

```
python convert_treccar_to_tfrecord.py \
  --output_folder=${TRECCAR_DIR}/tfrecord \
  --vocab_file=${DATA_DIR}/uncased_L-24_H-1024_A-16/vocab.txt \
  --corpus=${TRECCAR_DIR}/paragraphCorpus/dedup.articles-paragraphs.cbor \
  --qrels_train=${TRECCAR_DIR}/train.qrels \
  --qrels_dev=${TRECCAR_DIR}/dev.qrels \
  --qrels_test=${TRECCAR_DIR}/test.qrels \
  --run_train=${TRECCAR_DIR}/train.run \
  --run_dev=${TRECCAR_DIR}/dev.run \
  --run_test=${TRECCAR_DIR}/test.run \
  --max_query_length=64 \
  --max_seq_length=512 \
  --num_train_docs=10 \
  --num_dev_docs=10 \
  --num_test_docs=1000
```

This step requires at least 64GB of RAM, as we load the entire corpus into memory.

### Training/Evaluating

Before starting training, you need to download a [BERT Large model pretrained on the training set of TREC-CAR](https://drive.google.com/open?id=1Ovc8DPtgQ411bUo-_UDSDVqpPsoWXvmG). This pretraining was necessary because the [official pre-trained BERT models](https://github.com/google-research/bert) were pre-trained on the full Wikipedia, and therefore they have seen, although in an unsupervised way, Wikipedia documents that are used in the test set of TREC-CAR. Thus, to avoid this leak of test data into training, we pre-trained the BERT re-ranker only on the half of Wikipedia used by TREC-CAR's training set.

As with MS MARCO training, we have made available [this Google Colab](https://colab.research.google.com/drive/1uIXKkxkEbwe2Z6-tGmbbH10ptwd2Tr0u) to train and evaluate on TREC-CAR.
In case you opt not to use the Colab, here is the command line to start training:

```
python run_treccar.py \
  --data_dir=${TRECCAR_DIR}/tfrecord \
  --bert_config_file=${DATA_DIR}/uncased_L-24_H-1024_A-16/bert_config.json \
  --init_checkpoint=${DATA_DIR}/pretrained_models_exp898_model.ckpt-1000000 \
  --output_dir=${TRECCAR_DIR}/output \
  --trec_output=True \
  --do_train=True \
  --do_eval=True \
  --num_train_steps=400000 \
  --num_warmup_steps=40000 \
  --train_batch_size=32 \
  --eval_batch_size=32 \
  --learning_rate=1e-6 \
  --max_dev_examples=3000 \
  --num_dev_docs=10 \
  --max_test_examples=None \
  --num_test_docs=1000
```

Because trec_output is set to True, this script produces a TREC-formatted run file, "bert_predictions_test.run". We can evaluate the final performance of our BERT model using the official TREC eval tool, which is included in Anserini:

```
eval/trec_eval.9.0.4/trec_eval -m map -m recip_rank -c ${TRECCAR_DIR}/test.qrels ${TRECCAR_DIR}/output/bert_predictions_test.run
```

And the output should be:

```
map        	all	0.3356
recip_rank 	all	0.4787
```

Our run file is [available here](https://drive.google.com/file/d/1bhTjtz_IK0ER5S-eV0AxyhjHCupLiukN/view?usp=sharing).

### Trained models

You can download our [BERT-Large model trained on TREC-CAR here](https://drive.google.com/open?id=1fzcL2nzUJMUd0w4J5JIeASSrN4uHlSqP).

#### How do I cite this work?

```
@article{nogueira2019passage,
  title={Passage Re-ranking with BERT},
  author={Nogueira, Rodrigo and Cho, Kyunghyun},
  journal={arXiv preprint arXiv:1901.04085},
  year={2019}
}
```