grep
Match text patterns and get bounding boxes
Usage
pdftl <input> grep <pattern> [pages...] [key=val...] [output <output>]
Details
The grep operation searches the text content of a PDF for a specified regular
expression or literal string. It outputs structured JSON detailing matches,
page locations, context snippets, and precise coordinate bounding boxes.
Arguments:
<pattern>: The regular expression or text string to search for.
[pages...]: Optional page ranges (e.g., 1-5,9-end) to restrict the search. If omitted, the entire document is searched.
Search Options:
regex=<b>: If true, treats the pattern as a Python-compatible regular expression. If false, matches the pattern as a plain literal string. (Default: true)
ignore_case=<b> or i=<b>: If true, performs a case-insensitive search. (Default: false)
multiline=<b> or m=<b>: If true, ^ and $ match the start and end of each line. (Default: true)
dotall=<b> or s=<b>: If true, the . special character matches any character, including newlines. (Default: false)
max_count=<N>: Stop searching and parsing after locating <N> total matches.
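Because patterns are described as Python-compatible, the search options line up closely with flags in Python's standard re module. The sketch below illustrates those semantics on a plain string. It is not pdftl's implementation: the helper names compile_pattern and find_matches are hypothetical, and treating max_count as a simple cap on the number of returned matches is an assumption.

```python
import re
from itertools import islice


def compile_pattern(pattern, regex=True, ignore_case=False,
                    multiline=True, dotall=False):
    """Build a compiled pattern mirroring the documented option defaults."""
    flags = 0
    if ignore_case:
        flags |= re.IGNORECASE
    if multiline:
        flags |= re.MULTILINE
    if dotall:
        flags |= re.DOTALL
    # regex=false: escape metacharacters so the pattern matches literally.
    source = pattern if regex else re.escape(pattern)
    return re.compile(source, flags)


def find_matches(text, pattern, max_count=None, **options):
    """Return matched strings, stopping after max_count matches if given."""
    compiled = compile_pattern(pattern, **options)
    # islice with stop=None imposes no limit (assumed max_count behaviour).
    return [m.group(0) for m in islice(compiled.finditer(text), max_count)]


sample = "Total: $10.00\ntotal due\nA.B and AxB"
print(find_matches(sample, r"^total", ignore_case=True))  # both line starts
print(find_matches(sample, "A.B", regex=False))           # literal dot only
print(find_matches(sample, "A.B", max_count=1))           # first match only
```

With multiline left at its default of true, ^ anchors at the start of every line, which is why the first call finds a match on both the first and second lines of the sample.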
Context Options:
context=<N>: Number of surrounding lines of text to include before and after each match. (Default: 0)
before_context=<N>: Number of surrounding lines to include strictly before the match.
after_context=<N>: Number of surrounding lines to include strictly after the match.
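The context options can be pictured as slicing a window of lines around the matching line. The sketch below is a minimal illustration, assuming that context sets both sides and that before_context or after_context, when given, overrides its own side (the convention used by grep's -C, -B and -A). That precedence is an assumption, and context_window is a hypothetical helper, not part of pdftl.

```python
def context_window(lines, match_line, context=0,
                   before_context=None, after_context=None):
    """Return (before, after) lists of lines around a 1-indexed match line."""
    # Assumed precedence: a side-specific option overrides the shared one.
    before_n = context if before_context is None else before_context
    after_n = context if after_context is None else after_context
    idx = match_line - 1  # convert the 1-indexed line number to a list index
    before = lines[max(0, idx - before_n):idx]
    after = lines[idx + 1:idx + 1 + after_n]
    return before, after


page_lines = ["Invoice 42", "Item A", "Total: $10.00", "Thank you"]
before, after = context_window(page_lines, match_line=3, context=1)
print(before, after)
```

The window is clamped at the start and end of the line list, so a match on the first line simply yields an empty list of preceding lines.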
Typographic Filtering: You can restrict matches to text that meets specific visual criteria.
min_size=<F>, max_size=<F>: Only match text within a given point-size range.
font_match=<S>: Only match if the font name contains this substring (e.g., “Bold”).
require_bold=<b>: Only match if the text is explicitly bold.
require_italic=<b>: Only match if the text is explicitly italicized.
fonts=<b>: Always extract and output font metadata for matches. Automatically enabled if any typographic filters are used. (Default: false)
Output Format:
The results are written as a JSON object containing global metadata, a match metrics summary
block (count), and a list of hits. Each hit contains:
page, line: 1-indexed page and line numbers where the match begins.
text: The exact string matching the main query.
bboxes: Coordinate bounding boxes [x0, y0, x1, y1] grouped per line.
context_match: The full string of the line(s) containing the match.
match_start_idx, match_end_idx: 0-indexed character offsets marking where the match resides within context_match.
context_before, context_after: Arrays of surrounding context lines (if requested).
captures: If the regex utilizes capture groups (e.g., Invoice:\s*(\d+)), this array automatically populates with the group number, exact text, and precise bboxes for every distinct captured sub-pattern.
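A short Python sketch of consuming this output may help. The per-hit field names below follow the list above, but the sample document is hypothetical: the top-level key names ("count", "hits"), every value, and the treatment of match_end_idx as an exclusive offset are assumptions made for illustration. The union_bbox helper is likewise not part of pdftl.

```python
import json

# Hypothetical output: the top-level keys and all values are made up.
# Only the per-hit field names are taken from the documentation.
raw = """
{
  "count": 1,
  "hits": [
    {
      "page": 2,
      "line": 14,
      "text": "Total: $10.00",
      "bboxes": [[72.0, 500.0, 150.0, 512.0]],
      "context_match": "Invoice Total: $10.00 due",
      "match_start_idx": 8,
      "match_end_idx": 21,
      "captures": [
        {"group": 1, "text": "$10.00", "bboxes": [[110.0, 500.0, 150.0, 512.0]]}
      ]
    }
  ]
}
"""


def union_bbox(bboxes):
    """Smallest [x0, y0, x1, y1] box enclosing every per-line box."""
    x0s, y0s, x1s, y1s = zip(*bboxes)
    return [min(x0s), min(y0s), max(x1s), max(y1s)]


result = json.loads(raw)
for hit in result["hits"]:
    # Assumes match_end_idx is exclusive, as in Python slicing.
    start, end = hit["match_start_idx"], hit["match_end_idx"]
    matched = hit["context_match"][start:end]
    print(hit["page"], hit["line"], matched, union_bbox(hit["bboxes"]))
    for cap in hit.get("captures", []):
        print("  group", cap["group"], cap["text"], cap["bboxes"])
```

Since bboxes holds one box per line, a match that wraps across lines yields several boxes; merging them with something like union_bbox gives a single enclosing rectangle when only one region is needed.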
Examples
Find “Total:” followed by a dollar amount. The capture group reports the amount’s text and bounding box in captures.
pdftl in.pdf grep 'Total:\s*(\$\d+\.\d{2})'
Extract all text in the document set in bold at 18 pt or larger (for example, large bold headings).
pdftl in.pdf grep '.' regex=true min_size=18 require_bold=true
Tags: text, search, utility
Source: pdftl.operations.grep
Read online: https://pdftl.readthedocs.io/en/latest/operations/grep.html
Type: Operation