dump_tables
Extract tables to JSON, CSV, or Markdown
Usage
pdftl <input> dump_tables [csv|markdown] [<page_spec>...] [output <output>]
Details
The dump_tables operation extracts tabular data from a PDF file and
outputs it as structured JSON by default, or as CSV or Markdown on request.
It uses the tablers library for table detection and extraction. Tables
are identified by their line/rectangle borders (lattice-style detection).
Note: This operation works only with native text-based PDFs. Scanned PDFs or PDFs where tables are rendered as images will not yield results.
Filtering
min_rows=N: exclude tables with fewer than N rows (e.g. min_rows=2)
min_cols=N: exclude tables with fewer than N columns
min_area=N: exclude tables whose bounding box area is less than N square points
no_empty: exclude tables where every cell is empty
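To make the filter semantics concrete, here is a minimal Python sketch that applies the same four rules to table entries shaped like the Output Schema. It is an illustration only, not pdftl's implementation: the keep_table function and the sample tables are hypothetical, and treating whitespace-only cells as empty is an assumption.

```python
def keep_table(table, min_rows=0, min_cols=0, min_area=0.0, no_empty=False):
    """Return True if a schema-shaped table entry passes all filters.

    Hypothetical helper mirroring the documented rules; pdftl's own
    logic may differ in edge cases.
    """
    x1, y1, x2, y2 = table["bbox"]
    area = abs(x2 - x1) * abs(y2 - y1)  # bounding box area in square points

    if table["rows"] < min_rows:
        return False
    if table["cols"] < min_cols:
        return False
    if area < min_area:
        return False
    if no_empty:
        # Assumption: null and whitespace-only text both count as empty.
        every_cell_empty = all(
            not (cell.get("text") or "").strip()
            for row in table["data"]
            for cell in row
        )
        if every_cell_empty:
            return False
    return True


# Hand-written sample entries, not real pdftl output.
small_empty = {"bbox": [0, 0, 10, 10], "rows": 1, "cols": 1,
               "data": [[{"text": ""}]]}
big_filled = {"bbox": [0, 0, 100, 50], "rows": 2, "cols": 2,
              "data": [[{"text": "a"}, {"text": "b"}],
                       [{"text": "c"}, {"text": None}]]}

kept = [t for t in (small_empty, big_filled)
        if keep_table(t, min_rows=2, min_cols=2, no_empty=True)]
print(len(kept))
```

All filters combine with a logical AND, so a table must pass every filter that is set in order to be kept.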
Output Schema
The output JSON contains a tables list. Each entry corresponds to a
detected table and includes:
page: The 1-indexed page number containing the table.
table_index: The 0-indexed position of this table among all tables on that page.
bbox: Bounding box of the table, [x1, y1, x2, y2], in PDF points.
rows: Number of rows detected.
cols: Number of columns detected.
data: A list of rows, each a list of cell objects with:
  text: The cell’s text content, or null for merged continuation slots.
  merged_left: true if this slot continues a cell from the left.
  merged_top: true if this slot continues a cell from above.
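As a sketch of how this schema might be consumed, the following Python snippet loads a document of this shape and flattens each table into a grid of strings. The SAMPLE document is hand-written for illustration rather than real pdftl output, the table_to_grid helper is hypothetical, and it assumes merged_left and merged_top are present as false on ordinary cells.

```python
import json

# Hand-written sample shaped like the schema above; not real pdftl output.
SAMPLE = """
{
  "tables": [
    {
      "page": 1,
      "table_index": 0,
      "bbox": [72.0, 100.0, 300.0, 160.0],
      "rows": 2,
      "cols": 3,
      "data": [
        [
          {"text": "Region", "merged_left": false, "merged_top": false},
          {"text": "Sales", "merged_left": false, "merged_top": false},
          {"text": null, "merged_left": true, "merged_top": false}
        ],
        [
          {"text": "North", "merged_left": false, "merged_top": false},
          {"text": "10", "merged_left": false, "merged_top": false},
          {"text": "12", "merged_left": false, "merged_top": false}
        ]
      ]
    }
  ]
}
"""


def table_to_grid(table):
    """Return rows of strings; merged continuation slots (null text) become ''."""
    return [[cell.get("text") or "" for cell in row] for row in table["data"]]


doc = json.loads(SAMPLE)
grids = [table_to_grid(table) for table in doc["tables"]]

for table, grid in zip(doc["tables"], grids):
    # page is 1-indexed, table_index is 0-indexed within that page.
    print(f"page {table['page']}, table {table['table_index']}: "
          f"{table['rows']} rows x {table['cols']} cols")
    for row in grid:
        print(" | ".join(row))
```

With real data, the same code could read the file written by the output argument (for example tables.json) instead of the inline sample.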
Output Formats
By default, output is JSON. Pass csv to output each table as a CSV block,
with blocks separated by a --- delimiter line. Pass markdown to output
tables in Markdown format.
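The CSV output can be split back into individual tables by cutting at the delimiter lines. The Python sketch below assumes the delimiter is a line containing exactly --- and that no cell spans multiple lines. The split_csv_blocks helper and the sample text are hypothetical and have not been checked against real pdftl output.

```python
import csv


def split_csv_blocks(text, delimiter="---"):
    """Split delimiter-separated CSV text into a list of tables (lists of rows)."""
    tables, current = [], []
    # The appended delimiter acts as a sentinel that flushes the final block.
    for line in text.splitlines() + [delimiter]:
        if line.strip() == delimiter:
            rows = [row for row in csv.reader(current) if row]  # skip blank lines
            if rows:
                tables.append(rows)
            current = []
        else:
            current.append(line)
    return tables


# Hand-written sample in the assumed layout; not real pdftl output.
sample = 'Region,Sales\nNorth,10\n---\nName,Note\n"Smith, J.",ok\n'
tables = split_csv_blocks(sample)
print(len(tables))
```

Parsing each block with the csv module rather than splitting on commas keeps quoted cells such as "Smith, J." intact.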
Dependency note
Table extraction requires the tablers library. Install it with:
pip install pdftl[dump-tables]
or directly:
pip install tablers
Examples
Print tables from in.pdf as JSON to stdout
pdftl in.pdf dump_tables
Save table data from in.pdf to tables.json
pdftl in.pdf dump_tables output tables.json
Save tables from in.pdf as CSV
pdftl in.pdf dump_tables csv output tables.csv
Print tables from in.pdf as Markdown
pdftl in.pdf dump_tables markdown
Extract tables from pages 1, 3, 4, and 5
pdftl in.pdf dump_tables 1 3-5
Skip likely-spurious tables
pdftl in.pdf dump_tables min_rows=2 min_cols=2 no_empty
Tags: info, tables, text
Source: pdftl.operations.dump_tables
Read online: https://pdftl.readthedocs.io/en/latest/operations/dump_tables.html
Type: Operation