`grep`

Match text patterns and get bounding boxes

Usage

pdftl <input> grep <pattern> [pages...] [key=val...] [output <output>]

Details

The grep operation searches the text content of a PDF for a regular expression or literal string. It outputs structured JSON describing each match: its page location, context snippets, and coordinate bounding boxes.

Arguments:

<pattern>: The regular expression or text string to search for.
[pages...]: Optional page ranges (e.g., 1-5, 9-end) to restrict the search. If omitted, the entire document is searched.

Search Options:

regex=: If true, treats the pattern as a Python-compatible regular expression. If false, matches the pattern as a plain literal string. (Default: true)
ignore_case= or i=: If true, performs a case-insensitive search. (Default: false)
multiline= or m=: If true, ^ and $ match the start and end of lines. (Default: true)
dotall= or s=: If true, the . special character matches any character, including newlines. (Default: false)
max_count=<N>: Stop searching and parsing after locating <N> total matches.
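
For readers unfamiliar with these flags, the following Python sketch shows how the equivalent settings behave in Python's own re module. It assumes, without confirmation from the pdftl source, that the options map onto re.IGNORECASE, re.MULTILINE, and re.DOTALL, and that regex=false behaves like escaping the pattern. build_pattern is a hypothetical helper, not part of pdftl.

```python
import re

def build_pattern(pattern, regex=True, ignore_case=False, multiline=True, dotall=False):
    """Hypothetical helper mirroring the documented defaults; not pdftl's API."""
    flags = 0
    if ignore_case:
        flags |= re.IGNORECASE
    if multiline:
        flags |= re.MULTILINE
    if dotall:
        flags |= re.DOTALL
    # With regex=False, metacharacters are escaped so the pattern matches literally.
    source = pattern if regex else re.escape(pattern)
    return re.compile(source, flags)

text = "Total: $10.00\ntotal: $2.50"  # made-up sample text

# Defaults: case-sensitive, and ^ anchors at the start of every line.
print([m.group() for m in build_pattern(r"^total").finditer(text)])
# Case-insensitive search over the same text.
print([m.group() for m in build_pattern(r"^total", ignore_case=True).finditer(text)])
# Literal search: "$" and "." are plain characters, not metacharacters.
print(build_pattern("$2.50", regex=False).search(text).group())
```

Trying a pattern this way before running it against a PDF can help separate regex mistakes from text-extraction surprises.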

Context Options:

context=<N>: Number of surrounding lines of text to include before and after each match. (Default: 0)
before_context=<N>: Number of surrounding lines to include strictly before the match.
after_context=<N>: Number of surrounding lines to include strictly after the match.
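
As an illustration of how these three options relate, here is a minimal Python sketch. It assumes that context=<N> sets both sides and that before_context or after_context overrides its own side when given; that precedence is an assumption, and context_window is a hypothetical helper rather than pdftl code.

```python
def context_window(lines, match_line, context=0, before_context=None, after_context=None):
    """Return (before, after) context lines around a 1-indexed match line.

    Hypothetical helper: the override precedence is assumed, not taken from pdftl.
    """
    before = context if before_context is None else before_context
    after = context if after_context is None else after_context
    idx = match_line - 1          # convert to a 0-indexed position
    start = max(0, idx - before)  # clamp at the first line
    return lines[start:idx], lines[idx + 1 : idx + 1 + after]

page_lines = ["Invoice 42", "Item A", "Total: $10.00", "Thank you"]  # made-up lines

print(context_window(page_lines, 3, context=1))
print(context_window(page_lines, 3, before_context=2))
```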

Typographic Filtering: You can restrict matches to text that meets specific visual criteria.

min_size=<F>, max_size=<F>: Only match text within a given point-size range.
font_match=<S>: Only match if the font name contains this substring (e.g., “Bold”).
require_bold=: Only match if the text is explicitly bold.
require_italic=: Only match if the text is explicitly italicized.
fonts=: Always extract and output font metadata for matches. Automatically enabled if any typographic filters are used. (Default: false)

Output Format: The results are written as a JSON object containing global metadata, a summary block with the match count (count), and a list of hits. Each hit contains:

page, line: 1-indexed page and line numbers where the match begins.
text: The exact string matching the main query.
bboxes: Coordinate bounding boxes [x0, y0, x1, y1] grouped per line.
context_match: The full string of the line(s) containing the match.
match_start_idx, match_end_idx: 0-indexed character offsets marking where the match resides within context_match.
context_before, context_after: Arrays of surrounding context lines (if requested).
captures: If the regex uses capture groups (e.g., Invoice:\s*(\d+)), this array holds one entry per captured group, giving the group number, the exact captured text, and its bboxes.
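
To show how these fields fit together, the following Python sketch post-processes a single hit. The sample data is invented for illustration. The top-level key name hits, the key name group inside each capture, and the treatment of match_end_idx as an exclusive offset are all assumptions that should be checked against real output.

```python
import json

# Invented sample shaped after the fields described above; not real pdftl output.
sample = json.loads("""
{
  "count": 1,
  "hits": [
    {
      "page": 2,
      "line": 14,
      "text": "Total: $10.00",
      "bboxes": [[72.0, 500.0, 160.0, 512.0]],
      "context_match": "Grand Total: $10.00 USD",
      "match_start_idx": 6,
      "match_end_idx": 19,
      "captures": [
        {"group": 1, "text": "$10.00", "bboxes": [[120.0, 500.0, 160.0, 512.0]]}
      ]
    }
  ]
}
""")

def union_bbox(bboxes):
    """Smallest [x0, y0, x1, y1] box enclosing all per-line boxes."""
    x0s, y0s, x1s, y1s = zip(*bboxes)
    return [min(x0s), min(y0s), max(x1s), max(y1s)]

for hit in sample["hits"]:
    start, end = hit["match_start_idx"], hit["match_end_idx"]
    # Assumes the end offset is exclusive, as in Python slicing.
    matched = hit["context_match"][start:end]
    print(hit["page"], hit["line"], matched, union_bbox(hit["bboxes"]))
    for cap in hit.get("captures", []):
        print("  group", cap["group"], cap["text"], union_bbox(cap["bboxes"]))
```

Because bboxes are grouped per line, a match that wraps across lines has several boxes; merging them as union_bbox does is one option, though keeping them separate is usually better for highlighting.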

Examples

Find the phrase “Total:” followed by a dollar amount, and use a capture group to extract the amount’s own bounding box.

pdftl in.pdf grep 'Total:\s*(\$\d+\.\d{2})'
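
The capture group in this pattern can be tried on its own with Python's re module, since the regex= option describes the pattern syntax as Python-compatible. The sample line below is made up, and the snippet only demonstrates the regex, not pdftl itself.

```python
import re

pattern = re.compile(r"Total:\s*(\$\d+\.\d{2})")
line = "Total:   $1234.50 due on receipt"  # made-up sample text

m = pattern.search(line)
print(m.group(0))  # the whole match, from "Total:" through the amount
print(m.group(1))  # capture group 1: the amount only
print(m.span(1))   # character offsets of the captured amount within the line

# Amounts with thousands separators are not covered by \d+ in this pattern.
print(pattern.search("Total: $1,234.50"))
```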

Extract all text in the document that is formatted as a large bold heading (bold, 18 pt or larger).

pdftl in.pdf grep '.' regex=true min_size=18 require_bold=true

Tags: text, search, utility

Source: pdftl.operations.grep

Read online: https://pdftl.readthedocs.io/en/latest/operations/grep.html

Type: Operation
