# pubmed_parser

**Repository Path**: git_mirror/pubmed_parser

## Basic Information

- **Project Name**: pubmed_parser
- **Description**: https://github.com/titipata/pubmed_parser
📋 A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset
- **Primary Language**: Python
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2023-12-22
- **Last Updated**: 2025-02-08

## README

# Pubmed Parser: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset

[![License](https://img.shields.io/badge/License-MIT-blue.svg)](https://github.com/titipata/pubmed_parser/blob/master/LICENSE) [![DOI](https://joss.theoj.org/papers/10.21105/joss.01979/status.svg)](https://doi.org/10.21105/joss.01979)
 [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3660006.svg)](https://doi.org/10.5281/zenodo.3660006) [![Build Status](https://travis-ci.com/titipata/pubmed_parser.svg?branch=master)](https://travis-ci.com/titipata/pubmed_parser)

Pubmed Parser is a Python library for parsing the [PubMed Open-Access (OA) subset](http://www.ncbi.nlm.nih.gov/pmc/tools/ftp/), [MEDLINE XML](https://www.nlm.nih.gov/bsd/licensee/) repositories, and [Entrez Programming Utilities (E-utils)](https://eutils.ncbi.nlm.nih.gov/). It uses the `lxml` library to parse this information into a Python dictionary that can be used directly in research workflows such as text mining and natural language processing pipelines.

For the available APIs and details about the datasets, please see our [wiki page](https://github.com/titipata/pubmed_parser/wiki) or [documentation page](http://titipata.github.io/pubmed_parser/). Below, we list some of the core functionalities with code examples.

## Available Parsers

* The `path` provided to a function can be the path to a compressed or uncompressed XML file. We provide example files in the [ `data` ](data/) folder.
* For website parsing, pause between requests. Please see the [copyright notice](https://www.ncbi.nlm.nih.gov/pmc/about/copyright/#copy-PMC) because your IP can get blocked if you try to download in bulk.

Below, we list available parsers from `pubmed_parser`.

  * [Parse PubMed OA XML information](#parse-pubmed-oa-xml-information)
  * [Parse PubMed OA citation references](#parse-pubmed-oa-citation-references)
  * [Parse PubMed OA images and captions](#parse-pubmed-oa-images-and-captions)
  * [Parse PubMed OA Paragraph](#parse-pubmed-oa-paragraph)
  * [Parse PubMed OA Table [WIP]](#parse-pubmed-oa-table-wip)
  * [Parse MEDLINE XML](#parse-medline-xml)
  * [Parse MEDLINE Grant ID](#parse-medline-grant-id)
  * [Parse MEDLINE XML from eutils website](#parse-medline-xml-from-eutils-website)
  * [Parse MEDLINE XML citations from website](#parse-medline-xml-citations-from-website)
  * [Parse Outgoing XML citations from website](#parse-outgoing-xml-citations-from-website)

### Parse PubMed OA XML information

We provide a simple parser for the PubMed Open Access Subset: pass an XML path or string to the function `parse_pubmed_xml`, and it returns a dictionary with the following information:

* `full_title` : article's title
* `abstract` : abstract
* `journal` : Journal name
* `pmid` : PubMed ID
* `pmc` : PubMed Central ID
* `doi` : DOI of the article
* `publisher_id` : publisher ID
* `author_list` : list of authors with affiliation keys in the following format

``` python
 [['last_name_1', 'first_name_1', 'aff_key_1'],
  ['last_name_1', 'first_name_1', 'aff_key_2'],
  ['last_name_2', 'first_name_2', 'aff_key_1'], ...]
 ```

* `affiliation_list` : list of affiliation keys and affiliation strings in the following format

``` python
 [['aff_key_1', 'affiliation_1'],
  ['aff_key_2', 'affiliation_2'], ...]
```

* `publication_year` : publication year
* `subjects` : subjects listed in the article, separated by semicolons. Sometimes it only contains the type of the article, such as a research article, review proceedings, etc.

``` python
import pubmed_parser as pp
dict_out = pp.parse_pubmed_xml(path)
```
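Because `author_list` repeats an author once per affiliation key, it can be handy to collapse it into one record per author. The sketch below shows one way to do that, assuming the list formats described above. `authors_with_affiliations` is a hypothetical helper, not part of `pubmed_parser`, and the sample values are illustrative rather than real parser output.

```python
def authors_with_affiliations(author_list, affiliation_list):
    """Collapse [last, first, aff_key] rows into one entry per author."""
    # Map each affiliation key to its affiliation string.
    aff_lookup = {key: text for key, text in affiliation_list}
    merged = {}
    for last_name, first_name, aff_key in author_list:
        # Dicts preserve insertion order, so authors keep their original order.
        affs = merged.setdefault((last_name, first_name), [])
        # Fall back to the raw key if no matching affiliation string exists.
        affs.append(aff_lookup.get(aff_key, aff_key))
    return [
        {"last_name": last, "first_name": first, "affiliations": affs}
        for (last, first), affs in merged.items()
    ]


# Illustrative values in the documented format (not real parser output).
sample_authors = [["Dennehy", "John J", "I1"],
                  ["Dennehy", "John J", "I2"],
                  ["Wang", "Ing-Nang", "I1"]]
sample_affiliations = [["I1", "Department of Biological Sciences"],
                       ["I2", "Biology Department, Queens College"]]
authors = authors_with_affiliations(sample_authors, sample_affiliations)
```

With the library installed, the same helper could presumably be called as `authors_with_affiliations(dict_out['author_list'], dict_out['affiliation_list'])`.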

### Parse PubMed OA citation references

The function `parse_pubmed_references` will process a PubMed Open Access XML file and return a list of dictionaries, one per reference the article cites. Each dictionary has the following keys:

* `pmid` : PubMed ID of the article
* `pmc` : PubMed Central ID of the article
* `article_title` : title of cited article
* `journal` : journal name
* `journal_type` : type of journal
* `pmid_cited` : PubMed ID of article that article cites
* `doi_cited` : DOI of article that article cites
* `year` : publication year as it appears in the reference (may include a letter suffix, e.g. 2007a)

``` python
dicts_out = pp.parse_pubmed_references(path) # return list of dictionary
```

### Parse PubMed OA images and captions

The function `parse_pubmed_caption` parses image captions from a given path to an XML file. It returns reference indices that you can use to refer back to the actual images. The function returns a list of dictionaries with the following keys:

* `pmid` : PubMed ID
* `pmc` : PubMed Central ID
* `fig_caption` : string of caption
* `fig_id` : reference ID for the figure (used to refer to it within the XML article)
* `fig_label` : label of the figure
* `graphic_ref` : reference to image file name provided from Pubmed OA

``` python
dicts_out = pp.parse_pubmed_caption(path) # return list of dictionary
```

### Parse PubMed OA Paragraph

If you are interested in parsing the text surrounding a citation, the library provides that functionality too. You can use `parse_pubmed_paragraph` to parse text and reference PMIDs. This function returns a list of dictionaries, where each entry has the following keys:

* `pmid` : PubMed ID
* `pmc` : PubMed Central ID
* `text` : full text of the paragraph
* `reference_ids` : list of reference codes within that paragraph. These IDs can be merged with the output from `parse_pubmed_references`.
* `section` : section of paragraph (e.g. Background, Discussion, Appendix, etc.)

``` python
dicts_out = pp.parse_pubmed_paragraph('data/6605965a.nxml', all_paragraph=False)
```
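The merge between paragraph `reference_ids` and the reference dictionaries could look like the sketch below. It assumes that each reference dictionary exposes its in-document reference code under a key named `ref_id`. That key is not listed in this README, so treat it as an assumption and adjust `ref_key` to whatever the parser actually returns. `attach_cited_pmids` is a hypothetical helper and the sample dictionaries are made up.

```python
def attach_cited_pmids(paragraphs, references, ref_key="ref_id"):
    """Return copies of the paragraph dicts with an added `pmids_cited` list."""
    # Index references by their in-document reference code.
    # ASSUMPTION: the code is stored under `ref_key` in each reference dict.
    by_ref = {
        ref[ref_key]: ref.get("pmid_cited")
        for ref in references
        if ref_key in ref
    }
    enriched = []
    for para in paragraphs:
        # Keep only reference codes that resolve to a non-empty PMID.
        pmids = [by_ref[r] for r in para.get("reference_ids", []) if by_ref.get(r)]
        enriched.append({**para, "pmids_cited": pmids})
    return enriched


# Made-up inputs shaped like the documented outputs.
sample_paragraphs = [{"pmid": "1", "text": "...", "reference_ids": ["B1", "B2", "B9"]}]
sample_references = [{"ref_id": "B1", "pmid_cited": "111"},
                     {"ref_id": "B2", "pmid_cited": ""}]
merged_paragraphs = attach_cited_pmids(sample_paragraphs, sample_references)
```

References without a PMID (empty string or `None`) and codes with no matching reference are skipped rather than raising an error, which seems the safer default for noisy XML.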

### Parse PubMed OA Table [WIP]

You can use `parse_pubmed_table` to parse tables from an XML file. This function returns a list of dictionaries, each with the following keys:

* `pmid` : PubMed ID
* `pmc` : PubMed Central ID
* `caption` : caption of the table
* `label` : label of the table
* `table_columns` : list of column names
* `table_values` : list of values inside the table
* `table_xml` : raw xml text of the table (return if `return_xml=True`)

``` python
dicts_out = pp.parse_pubmed_table('data/6605965a.nxml', return_xml=False)
```
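To work with a parsed table row by row, the column names can be zipped with the values. This is a minimal sketch that assumes `table_values` is a list of rows, each aligned with `table_columns`; the README does not spell out that layout, so verify it against real output first. `table_to_records` is a hypothetical helper and the sample table is made up.

```python
def table_to_records(table):
    """Convert one parsed table dict into a list of {column: value} rows."""
    columns = table["table_columns"]
    # ASSUMPTION: every entry of `table_values` is one row in column order.
    # zip() stops at the shorter sequence, so ragged rows are truncated.
    return [dict(zip(columns, row)) for row in table["table_values"]]


# Made-up table shaped like the documented output.
sample_table = {
    "label": "Table 1",
    "caption": "Example",
    "table_columns": ["Strain", "Lysis time"],
    "table_values": [["A", "45"], ["B", "52"]],
}
records = table_to_records(sample_table)
```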

### Parse MEDLINE XML

MEDLINE XML has a different XML format than PubMed Open Access. The structure of the XML files is described in the MEDLINE/PubMed DTD [here](https://www.nlm.nih.gov/databases/dtd/). You can use the function `parse_medline_xml` to parse that format. This function returns a list of dictionaries, where each element contains:

* `pmid` : PubMed ID
* `pmc` : PubMed Central ID
* `doi` : DOI
* `other_id` : Other IDs found, each separated by `;`
* `title` : title of the article
* `abstract` : abstract of the article
* `authors` : authors, each separated by `;`
* `mesh_terms` : list of MeSH terms with corresponding MeSH ID, each separated by `;` e.g. `'D000161:Acoustic Stimulation; D000328:Adult; ...`
* `publication_types` : list of publication types, each separated by `;` e.g. `'D016428:Journal Article'`
* `keywords` : list of keywords, each separated by `;`
* `chemical_list` : list of chemical terms, each separated by `;`
* `pubdate` : Publication date. Defaults to year information only.
* `journal` : journal of the given paper
* `medline_ta` : abbreviation of the journal name
* `nlm_unique_id` : NLM unique identification
* `issn_linking` : ISSN linkage, typically used to link with the Web of Science dataset
* `country` : Country extracted from journal information field
* `reference` : string of PMIDs, each separated by `;`, or list of references made to the article
* `delete` : boolean. If `False`, the paper got updated, so you might have two XMLs for the same paper. You can delete the record of a deleted paper because it got updated.
* `languages` : list of languages, separated by `;`
* `vernacular_title`: vernacular title. Defaults to an empty string when not available.

``` python
dicts_out = pp.parse_medline_xml('data/medline16n0902.xml.gz',
                                 year_info_only=False,
                                 nlm_category=False,
                                 author_list=False,
                                 reference_list=False) # return list of dictionary
```

To extract month and day information from PubDate, set `year_info_only=False`. Structured abstracts can also be parsed, and the `nlm_category` argument controls how each section or label is displayed.
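Several MEDLINE fields are returned as `;`-separated strings, so a small post-processing step is usually needed before analysis. The sketch below splits such fields and breaks `mesh_terms` into ID and term pairs, following the `'D000161:Acoustic Stimulation; D000328:Adult'` format shown above. `split_field` and `parse_mesh_terms` are hypothetical helpers, not part of `pubmed_parser`.

```python
def split_field(value, sep=";"):
    """Split a separator-delimited field into stripped, non-empty items."""
    return [item.strip() for item in (value or "").split(sep) if item.strip()]


def parse_mesh_terms(mesh_terms):
    """Turn 'D000161:Acoustic Stimulation; D000328:Adult' into (id, term) pairs."""
    pairs = []
    for item in split_field(mesh_terms):
        # Split on the first colon only, in case the term itself contains one.
        mesh_id, _, term = item.partition(":")
        pairs.append((mesh_id.strip(), term.strip()))
    return pairs


mesh_pairs = parse_mesh_terms("D000161:Acoustic Stimulation; D000328:Adult")
keyword_list = split_field("phage; lysis time; ")
```

The same `split_field` helper should apply to `authors`, `keywords`, `chemical_list`, and `languages`, since the list above describes them with the same separator.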

### Parse MEDLINE Grant ID

Use `parse_grant_id` to parse MEDLINE grant IDs from an XML file. This returns a list of dictionaries, each containing:

* `pmid` : PubMed ID
* `grant_id` : Grant ID
* `grant_acronym` : Acronym of grant
* `country` : country the grant funding comes from
* `agency` : Grant agency

If no grant ID is found, it will return `None`.
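Since the result can be `None`, downstream code should guard against it before iterating. The sketch below groups grant IDs by PMID and treats `None` as an empty result. `grants_by_pmid` is a hypothetical helper and the sample dictionaries are made up to match the keys listed above.

```python
from collections import defaultdict


def grants_by_pmid(grants):
    """Group grant IDs by PubMed ID; a `None` input yields an empty dict."""
    grouped = defaultdict(list)
    for grant in grants or []:  # `or []` covers the documented `None` case
        grouped[grant["pmid"]].append(grant["grant_id"])
    return dict(grouped)


# Made-up records shaped like the documented output.
sample_grants = [
    {"pmid": "100", "grant_id": "R01 XX000001", "grant_acronym": "XX",
     "country": "United States", "agency": "Example Agency"},
    {"pmid": "100", "grant_id": "R01 XX000002", "grant_acronym": "XX",
     "country": "United States", "agency": "Example Agency"},
    {"pmid": "200", "grant_id": "P30 YY000003", "grant_acronym": "YY",
     "country": "United States", "agency": "Example Agency"},
]
grouped_grants = grants_by_pmid(sample_grants)
```

With the library installed, the input would presumably come from a call such as `pp.parse_grant_id(path)`.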

### Parse MEDLINE XML from eutils website

You can use Pubmed Parser to parse XML files from [E-Utilities](http://www.ncbi.nlm.nih.gov/books/NBK25501/) using `parse_xml_web`. For this function, provide a single `pmid` as input to get a dictionary with the following keys:

* `title` : title
* `abstract` : abstract
* `journal` : journal
* `affiliation` : affiliation of first author
* `authors` : string of authors, separated by `;`
* `year` : Publication year
* `keywords` : keywords or MeSH terms of the article

``` python
dict_out = pp.parse_xml_web(pmid, save_xml=False)
```
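When fetching many records from the website, it is worth pausing between requests so that your IP does not get blocked. The sketch below wraps any single-ID fetch function in a loop with a pause. `fetch_with_pause` is a hypothetical helper, and the one-second default is an arbitrary assumption rather than an official NCBI limit.

```python
import time


def fetch_with_pause(ids, fetch, pause_seconds=1.0, sleep=time.sleep):
    """Call `fetch(doc_id)` for each ID, pausing between consecutive calls."""
    results = {}
    for index, doc_id in enumerate(ids):
        if index > 0:
            # Pause before every request except the first one.
            sleep(pause_seconds)
        results[doc_id] = fetch(doc_id)
    return results


# Stand-in fetch function and a no-op sleep so the demo runs instantly offline.
demo_results = fetch_with_pause(
    ["21810267", "12345678"],
    fetch=lambda pmid: {"pmid": pmid},
    sleep=lambda seconds: None,
)
```

With the library installed, `fetch` could be something like `lambda pmid: pp.parse_xml_web(pmid, save_xml=False)`. The `sleep` parameter exists only so the pause can be replaced in tests.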

### Parse MEDLINE XML citations from website

The function `parse_citation_web` allows you to get the citations to a given PubMed ID or PubMed Central ID. This will return a dictionary which contains the following keys

* `pmc` : PubMed Central ID
* `pmid` : PubMed ID
* `doi` : DOI of the article
* `n_citations` : number of citations for given articles
* `pmc_cited` : list of PMCs that cite the given PMC

``` python
dict_out = pp.parse_citation_web(doc_id, id_type='PMC')
```

### Parse Outgoing XML citations from website

The function `parse_outgoing_citation_web` allows you to get the articles a given article cites, given a PubMed ID or PubMed Central ID. This will return a dictionary which contains the following keys

* `n_citations` : number of cited articles
* `doc_id` : the document identifier given
* `id_type` : the type of identifier given. Either `'PMID'` or `'PMC'`
* `pmid_cited` : list of PMIDs cited by the article

``` python
dict_out = pp.parse_outgoing_citation_web(doc_id, id_type='PMID')
```

Identifiers should be passed as strings. PubMed Central IDs are the default and should be passed as strings *without* the `'PMC'` prefix. If no citations are found, or if no article is found matching `doc_id` in the indicated database, it will return `None`.
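Identifiers often arrive with the `PMC` prefix attached or as integers, so a small normalization step helps before calling the citation functions. `strip_pmc_prefix` below is a hypothetical helper, not part of `pubmed_parser`.

```python
def strip_pmc_prefix(doc_id):
    """Return the identifier as a string without a leading 'PMC' prefix."""
    doc_id = str(doc_id).strip()
    # Compare case-insensitively so 'pmc123' is handled like 'PMC123'.
    if doc_id.upper().startswith("PMC"):
        doc_id = doc_id[3:]
    return doc_id


clean_id = strip_pmc_prefix("PMC3166277")
```

The cleaned value could then be passed on, for example as `pp.parse_citation_web(clean_id, id_type='PMC')`.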

## Installation

You can install the most up-to-date version of the package directly from the repository

``` bash
pip install git+https://github.com/titipata/pubmed_parser.git
```

or install the most recent release from [PyPI](https://pypi.org/project/pubmed-parser/) using

``` bash
pip install pubmed-parser
```

or clone the repository and install using `pip`

``` bash
git clone https://github.com/titipata/pubmed_parser
pip install ./pubmed_parser
```

You can test your installation by running `pytest --cov=pubmed_parser tests/ --verbose`
in the root of the repository.

## Example snippet to parse PubMed OA dataset

An example usage is shown as follows

``` python
import pubmed_parser as pp
path_xml = pp.list_xml_path('data') # list all xml paths under directory
pubmed_dict = pp.parse_pubmed_xml(path_xml[0]) # dictionary output
print(pubmed_dict)

# Example output:
{'abstract': u"Background Despite identical genotypes and ...",
 'affiliation_list':
  [['I1', 'Department of Biological Sciences, ...'],
   ['I2', 'Biology Department, Queens College, and the Graduate Center ...']],
 'author_list':
  [['Dennehy', 'John J', 'I1'],
   ['Dennehy', 'John J', 'I2'],
   ['Wang', 'Ing-Nang', 'I1']],
 'full_title': u'Factors influencing lysis time stochasticity in bacteriophage \u03bb',
 'journal': 'BMC Microbiology',
 'pmc': '3166277',
 'pmid': '21810267',
 'publication_year': '2011',
 'publisher_id': '1471-2180-11-174',
 'subjects': 'Research Article'}
```

## Example Usage with PySpark

This is a snippet to parse the entire PubMed Open Access subset using [PySpark 2.1](https://spark.apache.org/docs/latest/api/python/index.html)

``` python
import os
import pubmed_parser as pp
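from pyspark.sql import Row, SparkSession

# Reuse the active session (e.g. in the pyspark shell) or create a new one.
spark = SparkSession.builder.getOrCreate()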

path_all = pp.list_xml_path('/path/to/xml/folder/')
path_rdd = spark.sparkContext.parallelize(path_all, numSlices=10000)
parse_results_rdd = path_rdd.map(lambda x: Row(file_name=os.path.basename(x),
                                               **pp.parse_pubmed_xml(x)))
pubmed_oa_df = parse_results_rdd.toDF() # Spark dataframe
pubmed_oa_df_sel = pubmed_oa_df[['full_title', 'abstract', 'doi',
                                 'file_name', 'pmc', 'pmid',
                                 'publication_year', 'publisher_id',
                                 'journal', 'subjects']] # select columns
pubmed_oa_df_sel.write.parquet('pubmed_oa.parquet', mode='overwrite') # write dataframe
```

See [scripts](https://github.com/titipata/pubmed_parser/tree/master/scripts)
folder for more information.

## Core Members

* [Titipat Achakulvisut](http://titipata.github.io)
* [Daniel E. Acuna](http://scienceofscience.org/about)

and [contributors](https://github.com/titipata/pubmed_parser/graphs/contributors)

## Dependencies

* [lxml](http://lxml.de/)
* [unidecode](https://pypi.python.org/pypi/Unidecode)
* [requests](http://docs.python-requests.org/en/master/)

## Citation

If you use Pubmed Parser, please cite it from [JOSS](https://joss.theoj.org/papers/10.21105/joss.01979) as follows

> Achakulvisut et al., (2020). Pubmed Parser: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset XML Dataset. Journal of Open Source Software, 5(46), 1979, https://doi.org/10.21105/joss.01979

or using BibTeX

```
@article{Achakulvisut2020,
  doi = {10.21105/joss.01979},
  url = {https://doi.org/10.21105/joss.01979},
  year = {2020},
  publisher = {The Open Journal},
  volume = {5},
  number = {46},
  pages = {1979},
  author = {Titipat Achakulvisut and Daniel Acuna and Konrad Kording},
  title = {Pubmed Parser: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset XML Dataset},
  journal = {Journal of Open Source Software}
}
```

## Contributions

We welcome contributions from anyone who would like to improve Pubmed Parser. You can create [GitHub issues](https://github.com/titipata/pubmed_parser/issues) to discuss questions or issues relating to the repository. We suggest reading our [Contributing Guidelines](https://github.com/titipata/pubmed_parser/blob/master/CONTRIBUTING.md) before creating issues, reporting bugs, or making a contribution to the repository.

## Acknowledgement

This package is developed in [Konrad Kording's Lab](http://kordinglab.com/) at the University of Pennsylvania. We would like to thank reviewers and the editor from [JOSS](https://joss.readthedocs.io/en/latest/) including [`tleonardi`](https://github.com/tleonardi), [`timClicks`](https://github.com/timClicks), and [`majensen`](https://github.com/majensen). They made our repository much better!

## License

MIT License Copyright (c) 2015-2020 Titipat Achakulvisut, Daniel E. Acuna