# `grep`

Match text patterns and get bounding boxes

## Usage

> pdftl `<input>` `grep` `<pattern>` `[pages...]` `[key=val...]` `[output <file>]`

## Details

The `grep` operation searches the text content of a PDF for a specified regular expression or literal string. It outputs structured JSON detailing matches, page locations, context snippets, and precise coordinate bounding boxes.

Arguments:

* `<pattern>`: The regular expression or text string to search for.
* `[pages...]`: Optional page ranges (e.g., `1-5`, `9-end`) to restrict the search. If omitted, the entire document is searched.

Search Options:

* `regex=`: If true, treats the pattern as a Python-compatible regular expression. If false, matches the pattern as a plain literal string. (Default: true)
* `ignore_case=` or `i=`: If true, performs a case-insensitive search. (Default: false)
* `multiline=` or `m=`: If true, `^` and `$` match the start and end of each line. (Default: true)
* `dotall=` or `s=`: If true, the `.` special character matches any character, including newlines. (Default: false)
* `max_count=`: Stop searching and parsing after locating this many total matches.

Context Options:

* `context=`: Number of surrounding lines of text to include before and after each match. (Default: 0)
* `before_context=`: Number of surrounding lines to include strictly before the match.
* `after_context=`: Number of surrounding lines to include strictly after the match.

Typographic Filtering:

You can restrict matches to text that meets specific visual criteria.

* `min_size=`, `max_size=`: Only match text within a given point-size range.
* `font_match=`: Only match if the font name contains this substring (e.g., "Bold").
* `require_bold=`: Only match if the text is explicitly bold.
* `require_italic=`: Only match if the text is explicitly italicized.
* `fonts=`: Always extract and output font metadata for matches. Automatically enabled if any typographic filters are used.
  (Default: false)

Output Format:

The results are written as a JSON object containing global metadata, a match metrics summary block (`count`), and a list of `hits`. Each hit contains:

* `page`, `line`: 1-indexed page and line numbers where the match begins.
* `text`: The exact string matching the main query.
* `bboxes`: Coordinate bounding boxes `[x0, y0, x1, y1]` grouped per line.
* `context_match`: The full string of the line(s) containing the match.
* `match_start_idx`, `match_end_idx`: 0-indexed character offsets marking where the match resides within `context_match`.
* `context_before`, `context_after`: Arrays of surrounding context lines (if requested).
* `captures`: If the regex utilizes capture groups (e.g., `Invoice:\s*(\d+)`), this array automatically populates with the `group` number, exact `text`, and precise `bboxes` for every distinct captured sub-pattern.

## Examples

> Find the phrase and automatically extract the monetary value's bounding box.

```
pdftl in.pdf grep 'Total:\s*(\$\d+\.\d{2})'
```

> Extract all text on the page formatted as a large bold heading.

```
pdftl in.pdf grep '.' regex=true min_size=18 require_bold=true
```

**Tags**: text, search, utility

*Source: pdftl.operations.grep*

*Read online: [https://pdftl.readthedocs.io/en/latest/operations/grep.html](https://pdftl.readthedocs.io/en/latest/operations/grep.html)*

*Type: Operation*
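The search options listed under Details describe Python-compatible regular expressions, so their effect can be previewed with Python's standard `re` module before running a search. The sketch below assumes that `ignore_case`, `multiline`, and `dotall` behave like `re.IGNORECASE`, `re.MULTILINE`, and `re.DOTALL`, and that `regex=false` is comparable to escaping the pattern. That mapping is an assumption for illustration, not a statement about how `pdftl` is implemented, and the sample text is made up.

```python
import re

# Hypothetical extracted text, three lines long.
text = "Invoice: 42\ntotal: $19.99\nTotal: $5.00"

# multiline=true (assumed to behave like re.MULTILINE):
# ^ matches at the start of every line.
multiline_hits = re.findall(r"^Total:", text, flags=re.MULTILINE)

# multiline=false: ^ matches only at the start of the whole text.
single_hits = re.findall(r"^Total:", text)

# ignore_case=true (assumed to behave like re.IGNORECASE).
ci_hits = re.findall(r"^total:", text, flags=re.MULTILINE | re.IGNORECASE)

# dotall=true (assumed to behave like re.DOTALL): "." also matches newlines,
# so a single match can span several lines.
dotall_match = re.search(r"Invoice:.*Total", text, flags=re.DOTALL)
plain_match = re.search(r"Invoice:.*Total", text)

# regex=false (assumed comparable to escaping): "$" and "." are literal.
literal_hits = re.findall(re.escape("$5.00"), text)

print(multiline_hits, single_hits, ci_hits)
print(dotall_match is not None, plain_match is None, literal_hits)
```

Trying a pattern this way on a short sample can help decide which options to pass on the command line, particularly `dotall`, which changes whether one match may cross line boundaries.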
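As a rough illustration of consuming the output described under Output Format, the following Python sketch reads a report and pulls out the matched substring, capture-group text, and a single enclosing box per hit. The sample JSON is hypothetical: it uses only the documented field names, the values are invented, and it assumes `match_end_idx` is an exclusive offset and that `bboxes` is a list of `[x0, y0, x1, y1]` lists. The helpers `union_bbox` and `summarize` are illustrative names, not part of `pdftl`.

```python
import json

# Hypothetical report using the documented field names; values are invented.
sample = """
{
  "hits": [
    {
      "page": 1,
      "line": 12,
      "text": "Total: $19.99",
      "bboxes": [[72.0, 500.0, 160.0, 512.0]],
      "context_match": "Grand Total: $19.99 due",
      "match_start_idx": 6,
      "match_end_idx": 19,
      "captures": [
        {"group": 1, "text": "$19.99",
         "bboxes": [[120.0, 500.0, 160.0, 512.0]]}
      ]
    }
  ]
}
"""


def union_bbox(bboxes):
    """Return the smallest [x0, y0, x1, y1] box enclosing all given boxes."""
    x0s, y0s, x1s, y1s = zip(*bboxes)
    return [min(x0s), min(y0s), max(x1s), max(y1s)]


def summarize(report):
    """Collect page, matched substring, capture text, and one box per hit."""
    rows = []
    for hit in report["hits"]:
        ctx = hit["context_match"]
        # Assumes match_end_idx is exclusive, like a Python slice bound.
        highlighted = ctx[hit["match_start_idx"]:hit["match_end_idx"]]
        groups = {c["group"]: c["text"] for c in hit.get("captures", [])}
        rows.append({
            "page": hit["page"],
            "highlighted": highlighted,
            "groups": groups,
            "bbox": union_bbox(hit["bboxes"]),
        })
    return rows


rows = summarize(json.loads(sample))
for row in rows:
    print(row)
```

Merging the per-line boxes into one rectangle is convenient for drawing a single highlight, but it discards the per-line grouping, so keep the original `bboxes` when a match wraps across lines.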