# `grep`

Match text patterns and get bounding boxes

## Usage
> pdftl `<input>` `grep` `<pattern>` `[pages...]` `[key=val...]` `[output` `<output>]`

## Details
The `grep` operation searches the text content of a PDF for a specified regular
expression or literal string. It outputs a structured JSON object detailing matches,
page locations, context snippets, and precise coordinate bounding boxes.

Arguments:
  * `<pattern>`: The regular expression or text string to search for.
  * `[pages...]`: Optional page ranges (e.g., `1-5`, `9-end`) to restrict the search.
    If omitted, the entire document is searched.

Search Options:
  * `regex=<b>`: If true, treats the pattern as a Python-compatible regular expression.
    If false, matches the pattern as a plain literal string. (Default: true)
  * `ignore_case=<b>` or `i=<b>`: If true, performs a case-insensitive search. (Default: false)
  * `multiline=<b>` or `m=<b>`: If true, `^` and `$` match the start and end of lines.
    (Default: true)
  * `dotall=<b>` or `s=<b>`: If true, the `.` special character matches any character,
    including newlines. (Default: false)
  * `max_count=<N>`: Stop searching and parsing after locating `<N>` total matches.
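
Because the pattern is described as Python-compatible, the search options plausibly correspond to flags in Python's `re` module. The sketch below illustrates the options under that assumption; it is not pdftl's actual implementation. `compile_pattern` is a hypothetical helper whose defaults mirror the ones listed above.

```python
import re


def compile_pattern(pattern, regex=True, ignore_case=False,
                    multiline=True, dotall=False):
    """Hypothetical helper: build a compiled pattern from grep-style options."""
    flags = 0
    if ignore_case:
        flags |= re.IGNORECASE
    if multiline:
        flags |= re.MULTILINE
    if dotall:
        flags |= re.DOTALL
    # With regex=False the pattern is escaped so it matches literally.
    source = pattern if regex else re.escape(pattern)
    return re.compile(source, flags)


text = "Total: $10.00\ntotal due\nEnd."

# multiline=True (the documented default): ^ anchors at every line start.
print(compile_pattern(r"^total", ignore_case=True).findall(text))

# regex=False: the dot is literal, so "End." matches but "Ends" would not.
print(compile_pattern("End.", regex=False).findall(text))

# dotall=True lets "." cross the newline between the first two lines.
print(bool(compile_pattern(r"\$10\.00.total", dotall=True).search(text)))
```

With `dotall` left at its default of false, the last pattern finds nothing, because `.` stops at the line break.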

Context Options:
  * `context=<N>`: Number of surrounding lines of text to include before and after each match.
    (Default: 0)
  * `before_context=<N>`: Number of surrounding lines to include strictly before the match.
  * `after_context=<N>`: Number of surrounding lines to include strictly after the match.
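
The context options can be pictured as a window of lines around the matching line. The sketch below is a hedged illustration: `context_window` is a hypothetical helper, and it assumes that `before_context` and `after_context`, when given, take precedence over `context`. The text above does not state how the options combine.

```python
def context_window(lines, match_line, context=0,
                   before_context=None, after_context=None):
    """Hypothetical helper: split surrounding lines into before/after lists.

    `match_line` is a 0-indexed position in `lines`. Assumption: the specific
    options, when given, take precedence over `context`.
    """
    before = context if before_context is None else before_context
    after = context if after_context is None else after_context
    start = max(0, match_line - before)  # clamp at the start of the list
    return lines[start:match_line], lines[match_line + 1:match_line + 1 + after]


lines = ["Invoice 42", "Item A", "Total: $10.00", "Thank you", "Page 1"]

print(context_window(lines, 2, context=1))
# (['Item A'], ['Thank you'])

print(context_window(lines, 2, context=1, before_context=5))
# (['Invoice 42', 'Item A'], ['Thank you'])
```

Asking for more lines than exist returns only what is available, which is why the second call yields two lines before the match rather than five.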

Typographic Filtering:
  You can restrict matches to text that meets specific visual criteria.
  * `min_size=<F>`, `max_size=<F>`: Only match text within a given point-size range.
  * `font_match=<S>`: Only match if the font name contains this substring (e.g., "Bold").
  * `require_bold=<b>`: Only match if the text is explicitly bold.
  * `require_italic=<b>`: Only match if the text is explicitly italicized.
  * `fonts=<b>`: Always extract and output font metadata for matches. Automatically enabled
    if any typographic filters are used. (Default: false)

Output Format:
  The results are written as a JSON object containing global metadata, a match metrics summary
  block (`count`), and a list of `hits`. Each hit contains:
  * `page`, `line`: 1-indexed page and line numbers where the match begins.
  * `text`: The exact string matching the main query.
  * `bboxes`: Coordinate bounding boxes `[x0, y0, x1, y1]` grouped per line.
  * `context_match`: The full string of the line(s) containing the match.
  * `match_start_idx`, `match_end_idx`: 0-indexed character offsets marking where the
    match resides within `context_match`.
  * `context_before`, `context_after`: Arrays of surrounding context lines (if requested).
  * `captures`: If the regex utilizes capture groups (e.g., `Invoice:\s*(\d+)`), this
    array automatically populates with the `group` number, exact `text`, and precise `bboxes`
    for every distinct captured sub-pattern.
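
A downstream script might consume these fields roughly as follows. The sample document is hand-written to match the field names described above; its values are illustrative and other top-level keys are omitted. Two details are assumptions rather than documented behavior: that `match_end_idx` is exclusive (Python slice semantics), and that `bboxes` is a list holding one `[x0, y0, x1, y1]` box per line.

```python
import json

# Hand-written sample shaped like the fields described above; the values are
# illustrative and other top-level keys are omitted.
sample = """
{
  "hits": [
    {
      "page": 3,
      "line": 12,
      "text": "Invoice: 1042",
      "bboxes": [[72.0, 700.0, 150.5, 712.0]],
      "context_match": "Ref A / Invoice: 1042 / Paid",
      "match_start_idx": 8,
      "match_end_idx": 21,
      "captures": [
        {"group": 1, "text": "1042", "bboxes": [[128.0, 700.0, 150.5, 712.0]]}
      ]
    }
  ]
}
"""

report = json.loads(sample)

for hit in report["hits"]:
    # Assumption: match_end_idx is exclusive, so a plain slice recovers the text.
    start, end = hit["match_start_idx"], hit["match_end_idx"]
    matched = hit["context_match"][start:end]
    print(f"page {hit['page']}, line {hit['line']}: {matched!r}")

    # Assumption: each capture lists one [x0, y0, x1, y1] box per line.
    for cap in hit.get("captures", []):
        for x0, y0, x1, y1 in cap["bboxes"]:
            print(f"  group {cap['group']} = {cap['text']!r}, width {x1 - x0:.1f}")
```

Using `hit.get("captures", [])` keeps the loop safe for patterns without capture groups, in case the key is absent in that situation.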

## Examples

> Find `Total:` followed by a dollar amount, and use a capture group to extract the amount's text and bounding box.
```
pdftl in.pdf grep 'Total:\s*(\$\d+\.\d{2})'
```

> Extract all text in the document formatted as a large bold heading.
```
pdftl in.pdf grep '.' regex=true min_size=18 require_bold=true
```


**Tags**: text, search, utility

*Source: pdftl.operations.grep*

*Read online: [https://pdftl.readthedocs.io/en/latest/operations/grep.html](https://pdftl.readthedocs.io/en/latest/operations/grep.html)*

*Type: Operation*