# tabula-py

**Repository Path**: zhanggaoqiangcxy/tabula-py

## Basic Information

- **Project Name**: tabula-py
- **Description**: Simple wrapper of tabula-java: extract table from PDF into pandas DataFrame
- **Primary Language**: Python
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 1
- **Created**: 2017-09-10
- **Last Updated**: 2021-06-21

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# tabula-py

[![Build Status](https://travis-ci.org/chezou/tabula-py.svg?branch=master)](https://travis-ci.org/chezou/tabula-py)

`tabula-py` is a simple Python wrapper of [tabula-java](https://github.com/tabulapdf/tabula-java), which can read tables from PDF files. You can read tables from a PDF and convert them into pandas DataFrames.

![](http://i.imgur.com/ODM8hst.jpg)

# Requirements

- Java
  - Confirmed working with Java 7, 8
- pandas

## OS

I have confirmed that it works on macOS and Ubuntu. I can't fully support Windows environments.

# Usage

## Install

```
pip install tabula-py
```

If you want to become a contributor, you can install the development dependencies of tabula-py as follows:

```
pip install -r requirements.txt -c constraints.txt
```

## Example

tabula-py enables you to extract tables from a PDF into a DataFrame or JSON. It can also extract tables from a PDF and save them to a file as CSV, TSV or JSON.

```py
import tabula

# Read pdf into DataFrame
# (pass options as keyword arguments, e.g. pages="all"; see Options)
df = tabula.read_pdf("test.pdf")

# Read remote pdf into DataFrame
df2 = tabula.read_pdf("https://github.com/tabulapdf/tabula-java/raw/master/src/test/resources/technology/tabula/arabic.pdf")

# convert PDF into CSV
tabula.convert_into("test.pdf", "output.csv", output_format="csv")

# convert all PDFs in a directory
tabula.convert_into_by_batch("input_directory", output_format='csv')
```

See [example notebook](./examples/tabula_example.ipynb)

## Options

- pages (str, int, `list` of `int`, optional):
  - An optional value specifying the pages to extract from.
    It allows `str`, `int`, or `list` of `int`.
  - Example: 1, '1-2,3', 'all' or [1,2]. Default is 1
- guess (bool, optional):
  - Guess the portion of the page to analyze per page.
- area (`list` of `float`, optional):
  - Portion of the page to analyze (top, left, bottom, right).
  - Example: [269.875, 12.75, 790.5, 561]. Default is entire page
- lattice (bool, optional):
  - [`spreadsheet` option is deprecated] Force PDF to be extracted using lattice-mode extraction (if there are ruling lines separating each cell, as in a PDF of an Excel spreadsheet).
- stream (bool, optional):
  - [`nospreadsheet` option is deprecated] Force PDF to be extracted using stream-mode extraction (if there are no ruling lines separating each cell).
- password (str, optional):
  - Password to decrypt document. Default is empty
- silent (bool, optional):
  - Suppress all stderr output.
- columns (list, optional):
  - X coordinates of column boundaries.
  - Example: [10.1, 20.2, 30.3]
- output_format (str, optional):
  - Format for output file or extracted object.
  - For `read_pdf()`: `json`, `dataframe`
  - For `convert_into()`: `csv`, `tsv`, `json`
- output_path (str, optional):
  - Output file path. Its file format depends on `output_format`.
  - Same as `--outfile` option of tabula-java.
- java_options (`list`, optional):
  - Set java options like `-Xmx256m`.
- pandas_options (`dict`, optional):
  - Set pandas options like `{'header': None}`.
- multiple_tables (bool, optional):
  - (Experimental) Extract multiple tables.
  - This option uses JSON as an intermediate format, so it may stop working if the tabula-java output format changes.

## FAQ

### I can't `from tabula import read_pdf`

If you have installed `tabula`, it conflicts with the namespace of `tabula-py`. You should install `tabula-py` after removing `tabula`.

```
pip uninstall tabula
pip install tabula-py
```

### Can I use option `xxx`?

Yes. You can use the `options` argument as follows. The format is the same as the CLI of tabula-java.
```py
tabula.read_pdf(file_path, options="--columns 10.1,20.2,30.3")
```

### How can I ignore useless area?

In short, you can extract with the `area` and `spreadsheet` options (`spreadsheet` is the deprecated name of `lattice`).

```py
In [4]: tabula.read_pdf('./table.pdf', spreadsheet=True, area=(337.29, 226.49, 472.85, 384.91))
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Out[4]:
  Unnamed: 0 Col2 Col3 Col4 Col5
0          A    B   12    R    G
1        NaN    R    T   23    H
2          B    B   33    R    A
3          C    T   99    E    M
4          D    I   12   34    M
5          E    I    I    W   90
6        NaN    1    2    W    h
7        NaN    4    3    E    H
8          F    E   E4    R    4
```

*How to use `area` option*

The tabula-java wiki explains how to specify the area:
https://github.com/tabulapdf/tabula-java/wiki/Using-the-command-line-tabula-extractor-tool#grab-coordinates-of-the-table-you-want

For example, using macOS's Preview, I got the area information of this [PDF](https://github.com/chezou/tabula-py/files/711877/table.pdf):

![image](https://cloud.githubusercontent.com/assets/916653/22047470/b201de24-dd6a-11e6-9cfc-7bc73e33e3b2.png)

```
java -jar ./target/tabula-1.0.1-jar-with-dependencies.jar -p all -a $y1,$x1,$y2,$x2 -o $csvfile $filename
```

given

```
Note the left, top, height, and width parameters and calculate the following:

y1 = top
x1 = left
y2 = top + height
x2 = left + width
```

I confirmed with tabula-java:

```
java -jar ./tabula/tabula-1.0.1-jar-with-dependencies.jar -a "337.29,226.49,472.85,384.91" table.pdf
```

Without the `-r` option (same as `--spreadsheet`), it does not work properly.

### I faced `CParserError`. How can I extract multiple tables?

Use the `multiple_tables` option.

Note: This option is experimental.
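
As a footnote to the `area` question above, the coordinate arithmetic from the tabula-java wiki can be sketched in a few lines of Python. The helper name `area_from_box` is hypothetical and not part of tabula-py, and the width and height below are back-computed from the area used in the FAQ purely for illustration.

```python
from typing import Tuple


def area_from_box(left: float, top: float,
                  width: float, height: float) -> Tuple[float, float, float, float]:
    """Convert a selection box into (top, left, bottom, right).

    Hypothetical helper, not part of tabula-py. It applies the rule
    quoted in the FAQ: y1 = top, x1 = left, y2 = top + height,
    x2 = left + width.
    """
    y1 = top
    x1 = left
    y2 = top + height
    x2 = left + width
    return (y1, x1, y2, x2)


# Illustrative measurements, back-computed from the FAQ's area
# (337.29, 226.49, 472.85, 384.91); they are not read from the PDF.
area = area_from_box(left=226.49, top=337.29, width=158.42, height=135.56)
```

The resulting tuple is in the (top, left, bottom, right) order that the `area` option expects, so it could be passed as `area=area` to `tabula.read_pdf`.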