diff_text
Diff the text content of two PDFs and output bounding boxes
Usage
pdftl <input_a> diff_text <input_b> [options...] [output <output>]
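To make the argument order in the usage line concrete, the following Python sketch assembles the command as an argument list. The helper name `build_diff_text_command` and the file names are hypothetical and not part of pdftl. Only the `pdftl <input_a> diff_text <input_b> [options...] [output <output>]` layout and the option names are taken from this page.

```python
import shlex


def build_diff_text_command(input_a, input_b, options=None, output=None):
    """Assemble argv as: pdftl <input_a> diff_text <input_b> [options...] [output <output>].

    Hypothetical helper for illustration only; it is not part of pdftl.
    """
    argv = ["pdftl", input_a, "diff_text", input_b]
    # Options are passed as key=value tokens, e.g. granularity=line.
    for key, value in (options or {}).items():
        argv.append(f"{key}={value}")
    # The trailing "output <output>" pair is optional.
    if output is not None:
        argv += ["output", output]
    return argv


# File names below are placeholders.
argv = build_diff_text_command(
    "old.pdf",
    "new.pdf",
    options={"granularity": "line", "ignore_soft_hyphens": "true"},
    output="changes.json",
)
print(shlex.join(argv))
```

The resulting list could be handed to `subprocess.run(argv, check=True)` on a machine where `pdftl` is installed.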
Details
Performs a highly granular, spatially-aware text comparison between two PDFs. Outputs a JSON array of structural changes, including precise bounding box coordinates for where the changes occurred on the page.
Options
granularity=<level>: Controls how the diff engine groups text before comparing. Options: char, word, line, paragraph. Using word prevents sub-word shredding on typos. (Default: word)
ignore_whitespace=<bool>: If true, drops changes where the only difference is invisible space (e.g., reflow line-breaks). (Default: true)
ignore_soft_hyphens=<bool>: If true, strips \ufffe soft hyphens before comparing. Useful for ignoring hyphenation differences caused by text reflowing across margins. (Default: false)
include_bboxes=<bool>: If true, includes spatial bounding box coordinates for every change. Turn this off for a cleaner, text-only JSON output. (Default: true)
margin_top=<float>, margin_bottom=<float>, margin_left=<float>, margin_right=<float>: Filters out changes that fall entirely within these margins (in points). Excellent for removing noisy page headers, footers, or marginalia. (Default: 0)
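The effect of granularity on a typo can be illustrated with Python's standard `difflib`. This is a minimal sketch of the general idea, not pdftl's diff engine. The helper `changed_segments` and the sample sentences are made up for the example.

```python
from difflib import SequenceMatcher


def changed_segments(old_units, new_units):
    """Return (old, new) pairs for every non-equal span between two sequences."""
    matcher = SequenceMatcher(None, old_units, new_units, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((old_units[i1:i2], new_units[j1:j2]))
    return changes


old_text = "Please recieve the file"
new_text = "Please receive the file"

# Character granularity: the typo is reported as small sub-word fragments.
char_changes = changed_segments(old_text, new_text)

# Word granularity: the whole misspelled word is reported as one replacement.
word_changes = changed_segments(old_text.split(), new_text.split())

print("char:", char_changes)
print("word:", word_changes)
```

Comparing by word yields a single change covering `recieve` and `receive`, whereas comparing by character splits the same typo into fragments shorter than the word.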
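One plausible reading of "fall entirely within these margins" is sketched below. It assumes bounding boxes are `(x0, y0, x1, y1)` tuples in points with the origin at the bottom-left of the page, and that a change is dropped when its box lies wholly inside any one margin band. Both the coordinate convention and the function name `inside_margins` are assumptions for illustration, not confirmed pdftl behavior.

```python
def inside_margins(bbox, page_width, page_height,
                   margin_top=0.0, margin_bottom=0.0,
                   margin_left=0.0, margin_right=0.0):
    """Return True if bbox lies entirely within one of the margin bands.

    Assumes bbox = (x0, y0, x1, y1) in points, origin at the bottom-left.
    A margin of 0 disables the corresponding band.
    """
    x0, y0, x1, y1 = bbox
    return (
        (margin_top > 0 and y0 >= page_height - margin_top)
        or (margin_bottom > 0 and y1 <= margin_bottom)
        or (margin_left > 0 and x1 <= margin_left)
        or (margin_right > 0 and x0 >= page_width - margin_right)
    )


# Hypothetical boxes on a 612 x 792 pt page with a 72 pt top margin.
header_box = (72, 750, 540, 765)      # wholly inside the top band
body_box = (72, 400, 540, 415)        # in the page body
straddling_box = (72, 710, 540, 730)  # crosses the margin boundary at y = 720

for box in (header_box, body_box, straddling_box):
    print(box, inside_margins(box, 612, 792, margin_top=72))
```

Under this reading, a box that only partly overlaps a margin is kept, because it does not fall entirely within the band.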
Tags: text, compare, utility
Source: pdftl.operations.diff_text
Read online: https://pdftl.readthedocs.io/en/latest/operations/diff_text.html
Type: Operation