grep

Match text patterns and get bounding boxes

Usage

pdftl <input> grep <pattern> [pages...] [key=val...] [output <output>]

Details

The grep operation searches the text content of a PDF for a regular expression or literal string. It outputs structured JSON describing each match: its page location, context snippet, and coordinate bounding boxes.

Arguments:

  • <pattern>: The regular expression or text string to search for.

  • [pages...]: Optional page ranges (e.g., 1-5, 9-end) to restrict the search. If omitted, the entire document is searched.

Search Options:

  • regex=<b>: If true, treats the pattern as a Python-compatible regular expression. If false, matches the pattern as a plain literal string. (Default: true)

  • ignore_case=<b> or i=<b>: If true, performs a case-insensitive search. (Default: false)

  • multiline=<b> or m=<b>: If true, ^ and $ match the start and end of lines. (Default: true)

  • dotall=<b> or s=<b>: If true, the . special character matches any character, including newlines. (Default: false)

  • max_count=<N>: Stop searching and parsing after locating <N> total matches.
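
Because the pattern is described as a Python-compatible regular expression, the search options above can be illustrated with Python's own re module. This is a minimal sketch, assuming each option corresponds to the standard re flag of the same name and that regex=false behaves like escaping the pattern. build_pattern is a hypothetical helper, not part of pdftl, and the sample text is made up for illustration.

```python
import re

def build_pattern(pattern, regex=True, ignore_case=False, multiline=True, dotall=False):
    """Hypothetical helper: compile a pattern the way the options above describe."""
    flags = 0
    if ignore_case:
        flags |= re.IGNORECASE
    if multiline:
        flags |= re.MULTILINE
    if dotall:
        flags |= re.DOTALL
    # Assumption: regex=false means every character is matched literally.
    source = pattern if regex else re.escape(pattern)
    return re.compile(source, flags)

text = "Total: $12.50\ntotal due\nA.B"

# multiline=true (the documented default): ^ anchors at every line start.
starts = [m.group() for m in build_pattern(r"^\w+").finditer(text)]

# ignore_case=true matches both "Total" and "total".
totals = [m.group() for m in build_pattern("total", ignore_case=True).finditer(text)]

# regex=false: "." is a literal dot, so "AxB" is not a match.
literal = [m.group() for m in build_pattern("A.B", regex=False).finditer("AxB A.B")]

# max_count=<N> can be pictured as stopping after the first N matches.
first_two = [m.group() for _, m in zip(range(2), build_pattern(r"\w+").finditer(text))]
```

With regex=true, the same "A.B" pattern would also match "AxB", which is why literal mode is the safer choice when searching for text that contains punctuation.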

Context Options:

  • context=<N>: Number of surrounding lines of text to include before and after each match. (Default: 0)

  • before_context=<N>: Number of lines to include before the match only.

  • after_context=<N>: Number of lines to include after the match only.
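
The context options can be pictured as slicing a page's lines around the matching line. The following is a minimal sketch, assuming that context=<N> sets both sides and that before_context or after_context, when given, takes precedence on its own side; that precedence is an assumption, since the descriptions above do not state it. context_window is a hypothetical helper, not part of pdftl.

```python
def context_window(lines, match_line, context=0, before_context=None, after_context=None):
    """Return (before, after) lists of lines around a 1-indexed match_line."""
    # Assumption: a side-specific option overrides the symmetric context value.
    before_n = context if before_context is None else before_context
    after_n = context if after_context is None else after_context
    idx = match_line - 1  # convert to a 0-indexed list position
    before = lines[max(0, idx - before_n):idx]
    after = lines[idx + 1:idx + 1 + after_n]
    return before, after

# Made-up page text, one entry per line.
lines = ["Invoice", "Item A", "Total: $12.50", "Thank you", "Page 1"]

both = context_window(lines, 3, context=1)
only_before = context_window(lines, 3, before_context=2)
```

Near the top or bottom of a page the window simply shrinks, so fewer than <N> context lines may be returned.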

Typographic Filtering: You can restrict matches to text that meets specific visual criteria.

  • min_size=<F>, max_size=<F>: Only match text within a given point-size range.

  • font_match=<S>: Only match if the font name contains this substring (e.g., “Bold”).

  • require_bold=<b>: Only match if the text is explicitly bold.

  • require_italic=<b>: Only match if the text is explicitly italicized.

  • fonts=<b>: Always extract and output font metadata for matches. Automatically enabled if any typographic filters are used. (Default: false)

Output Format: The results are written as a JSON object containing global metadata, a summary block with match metrics (count), and a list of hits. Each hit contains:

  • page, line: 1-indexed page and line numbers where the match begins.

  • text: The exact string matching the main query.

  • bboxes: Coordinate bounding boxes [x0, y0, x1, y1] grouped per line.

  • context_match: The full string of the line(s) containing the match.

  • match_start_idx, match_end_idx: 0-indexed character offsets marking where the match resides within context_match.

  • context_before, context_after: Arrays of surrounding context lines (if requested).

  • captures: If the regex contains capture groups (e.g., Invoice:\s*(\d+)), this array holds one entry per captured group, giving the group number, its exact text, and its bboxes.
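
The hit fields listed above can be consumed with a few lines of Python. This is a sketch under stated assumptions: the sample JSON is hand-written for illustration and was not produced by pdftl, the top-level key "hits" and the capture keys "group", "text", and "bboxes" are guessed names, and match_end_idx is assumed to be exclusive, like the end of a Python slice. union_bbox is a hypothetical helper.

```python
import json

# Hypothetical output, hand-written for illustration; key names such as
# "hits" and the capture fields are assumptions, not confirmed pdftl output.
sample = json.loads("""
{
  "hits": [
    {
      "page": 2,
      "line": 14,
      "text": "Total: $12.50",
      "bboxes": [[72.0, 500.0, 150.0, 512.0]],
      "context_match": "Grand Total: $12.50 USD",
      "match_start_idx": 6,
      "match_end_idx": 19,
      "captures": [
        {"group": 1, "text": "$12.50", "bboxes": [[112.0, 500.0, 150.0, 512.0]]}
      ]
    }
  ]
}
""")

def union_bbox(bboxes):
    """Smallest [x0, y0, x1, y1] box enclosing every per-line box."""
    return [min(b[0] for b in bboxes), min(b[1] for b in bboxes),
            max(b[2] for b in bboxes), max(b[3] for b in bboxes)]

hit = sample["hits"][0]

# Assumption: match_end_idx is exclusive, like a Python slice end.
matched = hit["context_match"][hit["match_start_idx"]:hit["match_end_idx"]]

# Map each capture group number to its captured text.
amounts = {c["group"]: c["text"] for c in hit["captures"]}

# A match spanning two lines has one box per line; merge them into one.
enclosing = union_bbox([[72.0, 500.0, 150.0, 512.0], [72.0, 486.0, 110.0, 498.0]])
```

Merging per-line boxes is useful when a single rectangle is needed, for example to highlight or redact a match that wraps across lines.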

Examples

Find a “Total:” label followed by a dollar amount; the capture group reports the amount’s text and bounding box separately.

pdftl in.pdf grep 'Total:\s*(\$\d+\.\d{2})'

Extract all text in the document formatted as a large bold heading (bold, 18 pt or larger).

pdftl in.pdf grep '.' regex=true min_size=18 require_bold=true

Tags: text, search, utility

Source: pdftl.operations.grep

Read online: https://pdftl.readthedocs.io/en/latest/operations/grep.html

Type: Operation