dump_data

Metadata, page and bookmark info (XML-escaped or JSON)

Usage

pdftl <input> dump_data [json] [output <output>]

Details

Extracts document-level metadata and structural information from the input PDF and prints it to the console, or writes it to a file if output is given.

This operation is the primary way to export data for inspection or for later use by the update_info operation. By default, all string values in the output are processed with XML-style escaping (e.g., < becomes &lt;).
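A consumer of the default output may want to reverse the escaping before using a value. The following is a minimal sketch, assuming the output uses only standard named and numeric entities (the exact set that pdftl emits is not listed here); unescape_value is a hypothetical helper, not part of pdftl.

```python
from html import unescape


def unescape_value(value: str) -> str:
    """Decode XML-style entities (e.g. &lt; and &amp;) back to characters.

    Assumption: the dump only contains standard named or numeric
    entities, all of which html.unescape understands.
    """
    return unescape(value)


print(unescape_value("Tom &amp; Jerry &lt;draft&gt;"))
```

The JSON output avoids this step, since JSON carries its own string escaping.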

Alternatively, passing the json parameter produces structured JSON output, which is often easier for other programs to parse.

Output Format Details (Stanza Format)

The default output is a plain-text, line-based, key-value format. It consists of simple top-level fields and multi-line “stanzas”. A stanza is a block of related data that begins with a marker line such as InfoBegin or BookmarkBegin.

The data from this command is consumed by update_info.
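The stanza format can be read with a short line-oriented parser. This is a sketch, not pdftl's own parser: parse_dump is a hypothetical helper, and it assumes that a line belongs to the current stanza only while its key starts with the stanza name (InfoKey under InfoBegin, for example). Values are kept as raw, still-escaped strings.

```python
def parse_dump(text):
    """Split dump_data text into top-level fields and a list of stanzas."""
    fields = {}
    stanzas = []
    current = None  # the stanza being filled, or None at top level
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # A marker line such as "InfoBegin" opens a new stanza.
        if line.endswith("Begin") and ":" not in line:
            current = {"type": line[: -len("Begin")]}
            stanzas.append(current)
            continue
        key, _, value = line.partition(":")
        value = value.strip()
        # Assumption: stanza keys share the stanza's name as a prefix.
        if current is not None and key.startswith(current["type"]):
            current[key] = value
        else:
            current = None
            fields[key] = value
    return fields, stanzas


sample = """InfoBegin
InfoKey: Title
InfoValue: Annual Report
PdfID0: 0123abcd
NumberOfPages: 12
BookmarkBegin
BookmarkTitle: Introduction
BookmarkLevel: 1
BookmarkPageNumber: 1
"""
fields, stanzas = parse_dump(sample)
print(fields)
print(stanzas)
```

The sample text is invented for illustration; real output may order fields and stanzas differently, which the prefix check is meant to tolerate.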

Top-Level Fields

These fields appear as simple Key: Value lines.

  • PdfID0: <hex_string>

    • The first part of the PDF’s unique file identifier.

    • Updatable by update_info.

  • PdfID1: <hex_string>

    • The second part of the PDF’s unique file identifier.

    • Not updatable by update_info.

  • NumberOfPages: <integer>

    • The total number of pages in the document.

Stanzas

Each stanza is a multi-line block describing a single record. All stanza types can be updated by update_info.

1. Info Stanza (Document Metadata)

Each metadata entry (e.g., Title, Author) gets its own stanza.

  • InfoBegin

  • InfoKey: <key_name> Standard keys include Title, Author, Subject, Keywords, Creator, Producer, CreationDate, ModDate.

  • InfoValue: <value_string>

2. Bookmark Stanza

Represents a single bookmark (outline) item.

  • BookmarkBegin

  • BookmarkTitle: <title_string>

  • BookmarkLevel: <integer> The nesting depth; 1 is the top level.

  • BookmarkPageNumber: <integer> The 1-indexed target page number.
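Because each bookmark carries only a level, the nesting has to be rebuilt by the consumer. The sketch below assumes that stanzas appear in document order, so that a bookmark's parent is the nearest preceding bookmark with a lower level; this ordering is an assumption, and build_outline is a hypothetical helper operating on dictionaries keyed by the field names above.

```python
def build_outline(bookmarks):
    """Nest flat bookmark records by BookmarkLevel (1 is the top level)."""
    root = []
    stack = []  # stack[i] is the most recent node at depth i + 1
    for bm in bookmarks:
        level = max(1, int(bm["BookmarkLevel"]))
        node = {
            "title": bm["BookmarkTitle"],
            "page": int(bm["BookmarkPageNumber"]),
            "children": [],
        }
        # Drop nodes at this depth or deeper; what remains ends in the parent.
        del stack[level - 1:]
        siblings = stack[-1]["children"] if stack else root
        siblings.append(node)
        stack.append(node)
    return root


flat = [
    {"BookmarkTitle": "Chapter 1", "BookmarkLevel": "1", "BookmarkPageNumber": "1"},
    {"BookmarkTitle": "Section 1.1", "BookmarkLevel": "2", "BookmarkPageNumber": "2"},
    {"BookmarkTitle": "Chapter 2", "BookmarkLevel": "1", "BookmarkPageNumber": "5"},
]
outline = build_outline(flat)
print(outline)
```

If a level skips a depth (for example 1 followed by 3), this sketch attaches the item to the nearest available ancestor rather than raising an error.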

3. PageMedia Stanza (Page-level Boxes)

Describes the geometry boxes of a single page. Coordinates are in PDF points (1/72 inch) and are space-separated (e.g., 0 0 595 842).

  • PageMediaBegin

  • PageMediaNumber: <integer> The 1-indexed page number.

  • PageMediaRotation: <0|90|180|270>

  • PageMediaRect: <x1> <y1> <x2> <y2> The MediaBox. Always present.

  • PageMediaDimensions: <width> <height>

  • PageMediaCropRect: <x1> <y1> <x2> <y2> Omitted if identical to PageMediaRect.

  • PageMediaBleedRect: <x1> <y1> <x2> <y2> Omitted if identical to PageMediaCropRect (or PageMediaRect if no crop).

  • PageMediaTrimRect: <x1> <y1> <x2> <y2> Omitted if identical to PageMediaCropRect (or PageMediaRect if no crop).
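Since boxes are omitted when they match their fallback, a reader has to fill them back in. The following sketch applies the omission rules listed above to a stanza held as a dictionary; parse_rect, effective_boxes and box_size are hypothetical helper names, and the rect values are assumed to be plain decimal numbers.

```python
def parse_rect(value):
    """Turn 'x1 y1 x2 y2' into a tuple of four floats."""
    x1, y1, x2, y2 = (float(part) for part in value.split())
    return (x1, y1, x2, y2)


def effective_boxes(stanza):
    """Resolve omitted boxes: crop falls back to media, bleed and trim to crop."""
    media = parse_rect(stanza["PageMediaRect"])
    crop = parse_rect(stanza["PageMediaCropRect"]) if "PageMediaCropRect" in stanza else media
    bleed = parse_rect(stanza["PageMediaBleedRect"]) if "PageMediaBleedRect" in stanza else crop
    trim = parse_rect(stanza["PageMediaTrimRect"]) if "PageMediaTrimRect" in stanza else crop
    return {"media": media, "crop": crop, "bleed": bleed, "trim": trim}


def box_size(rect):
    """Width and height of a rect in points."""
    x1, y1, x2, y2 = rect
    return (x2 - x1, y2 - y1)


page = {"PageMediaRect": "0 0 595 842", "PageMediaCropRect": "10 10 585 832"}
boxes = effective_boxes(page)
print(boxes["trim"], box_size(boxes["trim"]))
```

Here the trim box is absent from the stanza, so it resolves to the crop box.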

4. PageLabel Stanza (Logical Page Numbers)

Defines the numbering style for a range of pages, beginning at a given physical page.

  • PageLabelBegin

  • PageLabelNewIndex: <integer> The 1-indexed physical starting page for this numbering.

  • PageLabelStart: <integer> The starting number for this labelling (e.g., 1).

  • PageLabelPrefix: <string> A string prepended to each page label in the range (e.g., A-).

  • PageLabelNumStyle: <Style> Standard styles: DecimalArabic, UppercaseRoman, LowercaseRoman, UppercaseLetters, LowercaseLetters, NoNumber.
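To see how these fields combine, the sketch below computes the label shown for a physical page from a list of PageLabel stanzas. It is an illustration under assumptions: page_label and its helpers are hypothetical, a missing PageLabelStart is taken as 1, a missing prefix as empty, a missing style as DecimalArabic, and a page before the first range falls back to its physical number. The letter styles follow the PDF convention of A to Z, then AA to ZZ.

```python
def to_roman(n):
    """Uppercase Roman numeral for a positive integer."""
    pairs = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"),
             (90, "XC"), (50, "L"), (40, "XL"), (10, "X"), (9, "IX"),
             (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, symbol in pairs:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


def to_letters(n):
    """A..Z for 1..26, then AA..ZZ for 27..52, and so on."""
    return chr(ord("A") + (n - 1) % 26) * ((n - 1) // 26 + 1)


def format_number(n, style):
    if style == "DecimalArabic":
        return str(n)
    if style == "UppercaseRoman":
        return to_roman(n)
    if style == "LowercaseRoman":
        return to_roman(n).lower()
    if style == "UppercaseLetters":
        return to_letters(n)
    if style == "LowercaseLetters":
        return to_letters(n).lower()
    if style == "NoNumber":
        return ""
    raise ValueError(f"unknown style: {style}")


def page_label(page, ranges):
    """Label for a 1-indexed physical page, given PageLabel stanza dicts."""
    applicable = [r for r in ranges if int(r["PageLabelNewIndex"]) <= page]
    if not applicable:
        return str(page)  # assumption: no range means plain numbering
    r = max(applicable, key=lambda item: int(item["PageLabelNewIndex"]))
    number = int(r.get("PageLabelStart", 1)) + page - int(r["PageLabelNewIndex"])
    style = r.get("PageLabelNumStyle", "DecimalArabic")
    return r.get("PageLabelPrefix", "") + format_number(number, style)


ranges = [
    {"PageLabelNewIndex": "1", "PageLabelStart": "1", "PageLabelNumStyle": "LowercaseRoman"},
    {"PageLabelNewIndex": "5", "PageLabelStart": "1", "PageLabelPrefix": "A-",
     "PageLabelNumStyle": "DecimalArabic"},
]
print([page_label(p, ranges) for p in range(1, 8)])
```

With these two invented ranges, pages 1 to 4 are labelled in lowercase Roman numerals and pages from 5 onward restart at A-1.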

Examples

Print XML-escaped metadata for in.pdf

pdftl in.pdf dump_data

Save XML-escaped metadata for in.pdf to data.txt

pdftl in.pdf dump_data output data.txt

Print metadata for in.pdf in JSON format

pdftl in.pdf dump_data json
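The same invocations can be assembled from a script. This is a sketch assuming that pdftl is on the PATH and that, without an output argument, the dump goes to standard output; dump_data_command and dump_data_json are hypothetical helper names, and the layout of the JSON document is not assumed beyond it being valid JSON.

```python
import json
import subprocess


def dump_data_command(pdf_path, as_json=False, output=None):
    """Build argv following: pdftl <input> dump_data [json] [output <output>]."""
    cmd = ["pdftl", pdf_path, "dump_data"]
    if as_json:
        cmd.append("json")
    if output is not None:
        cmd += ["output", output]
    return cmd


def dump_data_json(pdf_path):
    """Run pdftl and parse its JSON dump (assumes the dump is on stdout)."""
    result = subprocess.run(
        dump_data_command(pdf_path, as_json=True),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return json.loads(result.stdout)


print(dump_data_command("in.pdf", output="data.txt"))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.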

Tags: info, metadata

Source: pdftl.operations.dump_data

Read online: https://pdftl.readthedocs.io/en/stable/operations/dump_data.html

Type: Operation