dump_tables

Extract tables to JSON, CSV, or Markdown

Usage

pdftl <input> dump_tables [csv|markdown] [<page_spec>...] [output <output>]

Details

The dump_tables operation extracts tabular data from a PDF file and outputs it as structured JSON by default, or as CSV or Markdown on request.

It uses the tablers library for table detection and extraction. Tables are identified by their line/rectangle borders (lattice-style detection).

Note: This operation works only with native text-based PDFs. Scanned PDFs or PDFs where tables are rendered as images will not yield results.

Filtering

Pass any of the following optional arguments to exclude tables from the output:

  • min_rows=N — exclude tables with fewer than N rows (e.g. min_rows=2)

  • min_cols=N — exclude tables with fewer than N columns

  • min_area=N — exclude tables whose bounding box area is less than N square points

  • no_empty — exclude tables where every cell is empty
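The filter semantics can be illustrated by applying equivalent checks to already-extracted JSON, using the fields described under Output Schema. This is only a sketch, not the tool's implementation: the function name keep_table is hypothetical, the area is assumed to be width times height of bbox, and "empty" is assumed to mean a null or whitespace-only text value.

```python
def keep_table(table, min_rows=0, min_cols=0, min_area=0.0, no_empty=False):
    """Return True if a table entry passes all of the requested filters.

    Hypothetical helper: mirrors the documented filters on the JSON output,
    under the assumptions stated above.
    """
    x1, y1, x2, y2 = table["bbox"]
    # Assumption: bounding box area is width * height in square points.
    area = abs(x2 - x1) * abs(y2 - y1)

    if table["rows"] < min_rows:
        return False
    if table["cols"] < min_cols:
        return False
    if area < min_area:
        return False
    if no_empty:
        # Assumption: a cell is empty if its text is null or only whitespace.
        every_cell_empty = all(
            not (cell.get("text") or "").strip()
            for row in table["data"]
            for cell in row
        )
        if every_cell_empty:
            return False
    return True


# Invented example entry: one row, three columns, no visible text.
example = {
    "bbox": [0, 0, 10, 5],
    "rows": 1,
    "cols": 3,
    "data": [[{"text": ""}, {"text": None}, {"text": " "}]],
}
print(keep_table(example, min_rows=2))    # rejected: fewer than 2 rows
print(keep_table(example, no_empty=True)) # rejected: every cell is empty
```

Filtering at extraction time with the arguments above is normally simpler; a post-hoc check like this is mainly useful when experimenting with thresholds on a saved JSON file.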

Output Schema

The output JSON contains a tables list. Each entry corresponds to a detected table and includes:

  • page: The 1-indexed page number containing the table.

  • table_index: The 0-indexed position of this table among all tables on that page.

  • bbox: Bounding box of the table as [x1, y1, x2, y2], in PDF points.

  • rows: Number of rows detected.

  • cols: Number of columns detected.

  • data: A list of rows, each a list of cell objects with:

    • text: The cell’s text content, or null for merged continuation slots.

    • merged_left: true if this slot continues a cell from the left.

    • merged_top: true if this slot continues a cell from above.
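As an illustration of the schema above, the following sketch loads a JSON document of this shape and flattens each table into a plain grid of strings. The sample values are invented, and table_to_grid is a hypothetical helper, not part of pdftl. With fill_merged=True, continuation slots are filled from the neighbouring cell they continue; this assumes rows are rectangular.

```python
import json

# Invented sample shaped like the documented schema.
SAMPLE = """
{
  "tables": [
    {
      "page": 1,
      "table_index": 0,
      "bbox": [72.0, 100.0, 300.0, 160.0],
      "rows": 2,
      "cols": 3,
      "data": [
        [
          {"text": "Region", "merged_left": false, "merged_top": false},
          {"text": "Sales", "merged_left": false, "merged_top": false},
          {"text": null, "merged_left": true, "merged_top": false}
        ],
        [
          {"text": "North", "merged_left": false, "merged_top": false},
          {"text": "10", "merged_left": false, "merged_top": false},
          {"text": "12", "merged_left": false, "merged_top": false}
        ]
      ]
    }
  ]
}
"""


def table_to_grid(table, fill_merged=False):
    """Convert one table entry into a list of rows of strings.

    Null text becomes "" unless fill_merged is set, in which case a
    continuation slot copies the value of the cell it continues.
    """
    grid = []
    for r, row in enumerate(table["data"]):
        out = []
        for c, cell in enumerate(row):
            text = cell.get("text")
            if text is None and fill_merged:
                if cell.get("merged_left") and c > 0:
                    text = out[c - 1]
                elif cell.get("merged_top") and r > 0 and c < len(grid[r - 1]):
                    text = grid[r - 1][c]
            out.append("" if text is None else text)
        grid.append(out)
    return grid


doc = json.loads(SAMPLE)
for table in doc["tables"]:
    print(f"page {table['page']}, table {table['table_index']}: "
          f"{table['rows']} rows x {table['cols']} cols")
    for row in table_to_grid(table, fill_merged=True):
        print(row)
```

Leaving fill_merged off keeps merged slots as empty strings, which is closer to how a spreadsheet shows a merged cell.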

Output Formats

By default, output is JSON. Pass csv to output each table as a CSV block, with consecutive blocks separated by a --- delimiter line. Pass markdown to output the tables in Markdown format.
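The CSV output can be split back into individual tables by cutting at the delimiter lines. The sketch below assumes the delimiter is a line containing exactly --- and that no quoted cell contains a line break; split_csv_blocks is a hypothetical helper and the sample text is invented.

```python
import csv
import io

# Invented sample: two tables separated by a --- line.
SAMPLE = '''Region,Sales
North,10
---
Item,Qty,Note
Bolt,4,"zinc, plated"
'''


def split_csv_blocks(text, delimiter="---"):
    """Split CSV output into one list of rows per table.

    Assumption: each delimiter sits alone on its own line.
    """
    blocks, current = [], []
    for line in text.splitlines():
        if line.strip() == delimiter:
            blocks.append(current)
            current = []
        else:
            current.append(line)
    blocks.append(current)

    tables = []
    for block in blocks:
        if not any(line.strip() for line in block):
            continue  # skip empty blocks, e.g. after a trailing delimiter
        reader = csv.reader(io.StringIO("\n".join(block)))
        tables.append([row for row in reader if row])
    return tables


for index, rows in enumerate(split_csv_blocks(SAMPLE)):
    print(f"table {index}: {len(rows)} rows")
```

Using the csv module rather than splitting on commas keeps quoted cells such as "zinc, plated" intact.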

Dependency note

Table extraction requires the tablers library. Install it with:

pip install pdftl[dump-tables]

or directly:

pip install tablers

Examples

Print tables from in.pdf as JSON to stdout

pdftl in.pdf dump_tables

Save table data from in.pdf to tables.json

pdftl in.pdf dump_tables output tables.json

Save tables from in.pdf as CSV

pdftl in.pdf dump_tables csv output tables.csv

Print tables from in.pdf as Markdown

pdftl in.pdf dump_tables markdown

Extract tables from pages 1, 3, 4, and 5

pdftl in.pdf dump_tables 1 3-5

Skip likely-spurious tables

pdftl in.pdf dump_tables min_rows=2 min_cols=2 no_empty
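The commands above can also be driven from a script. The following sketch assembles the argument list in the order given by the Usage line and parses the JSON written to stdout. Both function names are hypothetical, and it assumes pdftl is on PATH and that stdout contains only the JSON document.

```python
import json
import subprocess


def build_command(pdf_path, fmt=None, pages=(), output=None):
    """Assemble the argv list for dump_tables, following the Usage line."""
    if fmt not in (None, "csv", "markdown"):
        raise ValueError("fmt must be None, 'csv', or 'markdown'")
    cmd = ["pdftl", pdf_path, "dump_tables"]
    if fmt:
        cmd.append(fmt)
    cmd.extend(pages)
    if output:
        cmd += ["output", output]
    return cmd


def dump_tables_json(pdf_path, pages=()):
    """Run pdftl and return the parsed JSON output.

    Assumption: nothing other than the JSON document is printed to stdout.
    Raises subprocess.CalledProcessError if pdftl exits with an error.
    """
    result = subprocess.run(
        build_command(pdf_path, pages=pages),
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)


print(build_command("in.pdf", pages=["1", "3-5"]))
print(build_command("in.pdf", fmt="csv", output="tables.csv"))
```

Calling dump_tables_json("in.pdf") would then return a dictionary with a tables list as described under Output Schema.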

Tags: info, tables, text

Source: pdftl.operations.dump_tables

Read online: https://pdftl.readthedocs.io/en/latest/operations/dump_tables.html

Type: Operation