diff_text

Diff the text content of two PDFs and output bounding boxes

Usage

pdftl <input_a> diff_text <input_b> [options...] [output <output>]
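
As a minimal sketch, the command line can be assembled from Python by following the usage line above. The helper name build_diff_text_argv and the file names are hypothetical, and spelling booleans as lowercase true/false is an assumption; only the argument order and the key=value option form come from this page.

```python
def build_diff_text_argv(input_a, input_b, options=None, output=None):
    """Assemble argv following the usage line:
    pdftl <input_a> diff_text <input_b> [options...] [output <output>]
    """
    argv = ["pdftl", input_a, "diff_text", input_b]
    # Options are passed as key=value tokens, e.g. granularity=line.
    for key, value in (options or {}).items():
        argv.append(f"{key}={value}")
    # The output file, if any, follows the literal keyword "output".
    if output is not None:
        argv += ["output", output]
    return argv


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    argv = build_diff_text_argv(
        "old.pdf",
        "new.pdf",
        options={"granularity": "line", "include_bboxes": "false"},
        output="changes.json",
    )
    print(" ".join(argv))
```

To run the command for real, pass the list to subprocess.run(argv, check=True); this requires pdftl on the PATH and both input files to exist.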

Details

Compares the text of two PDFs at a configurable granularity, taking into account where each piece of text sits on the page. Outputs a JSON array of changes, each with the bounding box coordinates of its location on the page.

Options

  • granularity=<level>: Controls how the diff engine groups text before comparing. Values: char, word, line, paragraph. With word, a typo is reported as one changed word rather than a scatter of character-level edits. (Default: word)

  • ignore_whitespace=<bool>: If true, drops changes where the only difference is whitespace (e.g., line breaks introduced by reflow). (Default: true)

  • ignore_soft_hyphens=<bool>: If true, strips \ufffe soft hyphens before comparing. Useful for ignoring hyphenation differences caused by text reflowing across margins. (Default: false)

  • include_bboxes=<bool>: If true, includes bounding box coordinates for every change. Set to false for text-only JSON output. (Default: true)

  • margin_top=<float>, margin_bottom=<float>, margin_left=<float>, margin_right=<float>: Filters out changes that fall entirely within these margins (in points). Useful for suppressing noise from page headers, footers, and marginalia. (Default: 0)
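
The effect of granularity and of the margin options can be illustrated with a small standalone sketch. This is not pdftl's implementation: it uses Python's standard difflib as a stand-in diff engine, and every name in it (tokenize, changed_spans, entirely_in_margin) is hypothetical. It also assumes a bounding box of the form (x0, y0, x1, y1) in points with the origin at the bottom-left of the page, which may differ from what the tool emits.

```python
import difflib
import re


def tokenize(text, granularity="word"):
    """Split text into the units that will be compared."""
    if granularity == "char":
        return list(text)
    if granularity == "word":
        return text.split()
    if granularity == "line":
        return text.splitlines()
    if granularity == "paragraph":
        # Assumption: paragraphs are separated by one or more blank lines.
        return [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    raise ValueError(f"unknown granularity: {granularity}")


def changed_spans(old, new, granularity="word"):
    """Return (tag, old_units, new_units) for every non-equal diff opcode."""
    a = tokenize(old, granularity)
    b = tokenize(new, granularity)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def entirely_in_margin(bbox, page_w, page_h,
                       top=0.0, bottom=0.0, left=0.0, right=0.0):
    """True if bbox lies wholly inside one of the margin bands.

    Assumes bbox = (x0, y0, x1, y1) in points, origin at bottom-left.
    A margin of 0 disables that band.
    """
    x0, y0, x1, y1 = bbox
    return (
        (top > 0 and y0 >= page_h - top)         # inside the top band
        or (bottom > 0 and y1 <= bottom)         # inside the bottom band
        or (left > 0 and x1 <= left)             # inside the left band
        or (right > 0 and x0 >= page_w - right)  # inside the right band
    )


if __name__ == "__main__":
    old, new = "The quick brown fox", "The quikc brown fox"
    print(changed_spans(old, new, "word"))  # one whole-word replacement
    print(changed_spans(old, new, "char"))  # several single-character edits
    # A header box near the top of a 612 x 792 pt page, with margin_top=50.
    print(entirely_in_margin((72, 760, 540, 775), 612, 792, top=50))
```

At word granularity the typo appears as a single replacement of "quick" by "quikc"; at char granularity the same typo splits into several one-character edits. The margin test is deliberately strict: a box that only partly overlaps a margin band is kept, matching the "fall entirely within" wording above.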

Tags: text, compare, utility

Source: pdftl.operations.diff_text

Read online: https://pdftl.readthedocs.io/en/latest/operations/diff_text.html

Type: Operation