`dump_data`

Metadata, page and bookmark info (XML-escaped or JSON)

Usage

pdftl <input> dump_data [json] [output <output>]

Details

Extracts document-level metadata and structural information from the input PDF and prints it to the console (or a specified file).

This operation is the primary way to export data for inspection or for later use by the update_info operation. By default, all string values in the output are processed with XML-style escaping (e.g., < becomes &lt;).
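
For scripts that read the default output, the escaping has to be reversed before values are displayed or compared. The sketch below uses only the Python standard library. It assumes the escaping is limited to ordinary XML character and entity references; the exact set of characters pdftl escapes is not listed here, so treat the sample strings as illustrative.

```python
from html import unescape
from xml.sax.saxutils import escape

# Illustrative value, not taken from a real PDF.
raw = "Fish & Chips <draft>"

# escape() replaces &, < and > with XML entities.
escaped = escape(raw)
print(escaped)

# unescape() decodes named and numeric character references,
# so the round trip restores the original string.
restored = unescape(escaped)
print(restored == raw)
```

Decode values only after the line has been split into key and value, so that an escaped character is never mistaken for part of the format itself.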

Alternatively, passing the json parameter will produce a structured JSON output, which is often easier for other programs to parse.
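
A minimal sketch of calling the operation from Python and loading the JSON result follows. The helper names dump_data_argv and dump_data_json are hypothetical. The sketch assumes that pdftl is on the PATH and that the JSON is written to standard output when no output file is given; the layout of the JSON document is not described here, so the result is returned as-is without assuming any keys.

```python
import json
import subprocess


def dump_data_argv(input_pdf, as_json=False, output=None):
    """Build the argument list following the usage synopsis:
    pdftl <input> dump_data [json] [output <output>]
    """
    argv = ["pdftl", input_pdf, "dump_data"]
    if as_json:
        argv.append("json")
    if output is not None:
        argv += ["output", output]
    return argv


def dump_data_json(input_pdf):
    """Run pdftl and parse its standard output as JSON.

    Assumption: the JSON goes to stdout when no output file is named.
    """
    result = subprocess.run(
        dump_data_argv(input_pdf, as_json=True),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return json.loads(result.stdout)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.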

Output Format Details (Stanza Format)

The default output is a plain text, line-based, key-value format. It consists of both simple top-level fields and multi-line “stanzas”. A stanza is a block of related data that begins with a line like InfoBegin or BookmarkBegin.

The data from this command is consumed by update_info.
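
The format described above can be read with a small line-oriented parser. The sketch below is not pdftl's own parser; parse_dump is a hypothetical helper. It assumes that a stanza ends at the first line whose key does not share the stanza's prefix (Info, Bookmark, PageMedia or PageLabel), and it leaves values in their escaped form.

```python
STANZA_PREFIXES = ("Info", "Bookmark", "PageMedia", "PageLabel")


def parse_dump(text):
    """Split dump_data text into (top_level_fields, stanzas).

    Each stanza is a dict with a "type" entry (the prefix) plus its keys.
    """
    fields = {}
    stanzas = []
    current = None  # the stanza currently being filled, if any
    for line in text.splitlines():
        if not line.strip():
            continue
        marker = line.strip()
        if marker.endswith("Begin") and marker[:-5] in STANZA_PREFIXES:
            current = {"type": marker[:-5]}
            stanzas.append(current)
            continue
        # Split on the first colon only: values may contain colons.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a Key: Value line
        key = key.strip()
        value = value[1:] if value.startswith(" ") else value
        if current is not None and key.startswith(current["type"]):
            current[key] = value
        else:
            current = None  # assumption: a foreign key closes the stanza
            fields[key] = value
    return fields, stanzas


sample = """InfoBegin
InfoKey: Title
InfoValue: A &amp; B
PdfID0: abc123
NumberOfPages: 2
BookmarkBegin
BookmarkTitle: Intro: part 1
BookmarkLevel: 1
BookmarkPageNumber: 1
"""
fields, stanzas = parse_dump(sample)
print(fields)
print(stanzas)
```

All values are kept as strings; converting NumberOfPages or page numbers to integers is left to the caller.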

Top-Level Fields

These fields appear as simple Key: Value lines.

PdfID0: <hex_string>
- The first part of the PDF’s unique file identifier.
- Updatable by update_info.
PdfID1: <hex_string>
- The second part of the PDF’s unique file identifier.
- Not updatable by update_info.
NumberOfPages: <integer>
- The total number of pages in the document.

Stanzas

These are multi-line blocks, each describing a single record. These can all be updated by update_info.

1. Info Stanza (Document Metadata)

Each metadata entry (e.g., Title, Author) gets its own stanza.

InfoBegin
InfoKey: <key_name> Standard keys include Title, Author, Subject, Keywords, Creator, Producer, CreationDate, ModDate.
InfoValue: <value_string>

2. Bookmark Stanza

Represents a single bookmark (outline) item.

BookmarkBegin
BookmarkTitle: <title_string>
BookmarkLevel: <integer> (1 is top level)
BookmarkPageNumber: <integer> The 1-indexed target page number.
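
Because each stanza carries only a level, the outline hierarchy has to be rebuilt by the reader. A minimal sketch, assuming the stanzas appear in document order (each parent before its children) and have already been parsed into dicts keyed by field name; build_outline is a hypothetical helper:

```python
def build_outline(bookmarks):
    """Turn a flat list of bookmark dicts into a nested outline."""
    root = []
    # Stack of (level, children_list); level 0 is the virtual root.
    stack = [(0, root)]
    for bm in bookmarks:
        level = int(bm["BookmarkLevel"])
        if level < 1:
            raise ValueError("BookmarkLevel must be 1 or greater")
        node = {
            "title": bm["BookmarkTitle"],
            "page": int(bm["BookmarkPageNumber"]),
            "children": [],
        }
        # Climb back up until the top of the stack is a shallower level.
        while stack[-1][0] >= level:
            stack.pop()
        stack[-1][1].append(node)
        stack.append((level, node["children"]))
    return root


# Illustrative data, not taken from a real PDF.
flat = [
    {"BookmarkTitle": "Chapter 1", "BookmarkLevel": "1", "BookmarkPageNumber": "1"},
    {"BookmarkTitle": "Section 1.1", "BookmarkLevel": "2", "BookmarkPageNumber": "2"},
    {"BookmarkTitle": "Section 1.2", "BookmarkLevel": "2", "BookmarkPageNumber": "3"},
    {"BookmarkTitle": "Chapter 2", "BookmarkLevel": "1", "BookmarkPageNumber": "5"},
]
outline = build_outline(flat)
print([node["title"] for node in outline])
```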

3. PageMedia Stanza (Page-level Boxes)

Describes geometry boxes for a page. Coordinates are in PDF points, space-separated (e.g., 0 0 595 842).

PageMediaBegin
PageMediaNumber: <integer> The 1-indexed page number.
PageMediaRotation: <0|90|180|270>
PageMediaRect: <x1> <y1> <x2> <y2> (MediaBox) Always present.
PageMediaDimensions: <width> <height>
PageMediaCropRect: <x1> <y1> <x2> <y2> Omitted if identical to PageMediaRect.
PageMediaBleedRect: <x1> <y1> <x2> <y2> Omitted if identical to PageMediaCropRect (or PageMediaRect if no crop).
PageMediaTrimRect: <x1> <y1> <x2> <y2> Omitted if identical to PageMediaCropRect (or PageMediaRect if no crop).
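
Since identical boxes are omitted, a consumer has to apply the fallbacks itself to recover every box for a page. The sketch below follows the omission rules listed above; page_boxes, parse_rect and rect_size are hypothetical helpers, and the stanza is assumed to be a dict keyed by field name. rect_size only computes the extent of a rectangle; how PageMediaDimensions relates to rotation is not assumed.

```python
def parse_rect(value):
    """Parse 'x1 y1 x2 y2' into a tuple of four floats."""
    parts = tuple(float(p) for p in value.split())
    if len(parts) != 4:
        raise ValueError(f"expected four numbers, got {value!r}")
    return parts


def page_boxes(stanza):
    """Resolve all four boxes, filling in the omitted ones."""
    media = parse_rect(stanza["PageMediaRect"])  # always present
    # Crop falls back to media; bleed and trim fall back to crop.
    crop = parse_rect(stanza["PageMediaCropRect"]) if "PageMediaCropRect" in stanza else media
    bleed = parse_rect(stanza["PageMediaBleedRect"]) if "PageMediaBleedRect" in stanza else crop
    trim = parse_rect(stanza["PageMediaTrimRect"]) if "PageMediaTrimRect" in stanza else crop
    return {"media": media, "crop": crop, "bleed": bleed, "trim": trim}


def rect_size(rect):
    """Width and height of a rectangle, in PDF points."""
    x1, y1, x2, y2 = rect
    return (abs(x2 - x1), abs(y2 - y1))


# Illustrative stanza: only the trim box differs from the media box.
boxes = page_boxes({
    "PageMediaRect": "0 0 595 842",
    "PageMediaTrimRect": "10 10 585 832",
})
print(boxes["crop"], rect_size(boxes["trim"]))
```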

4. PageLabel Stanza (Logical Page Numbers)

Defines a page numbering style range.

PageLabelBegin
PageLabelNewIndex: <integer> The 1-indexed physical starting page for this numbering.
PageLabelStart: <integer> The starting number for this labelling (e.g., 1).
PageLabelPrefix: <string> String to prepend to page label (e.g., A-).
PageLabelNumStyle: <Style> Standard styles: DecimalArabic, UppercaseRoman, LowercaseRoman, UppercaseLetters, LowercaseLetters, NoNumber.
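
To see how these ranges translate into the label shown for a given physical page, consider the sketch below. It is an illustration, not pdftl's implementation: page_label and its helpers are hypothetical, letter styles follow the PDF convention (A to Z, then AA to ZZ), a missing PageLabelStart is assumed to mean 1, a missing PageLabelPrefix is assumed to mean no prefix, and pages before the first range fall back to the physical page number.

```python
import bisect

ROMAN = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"),
         (90, "XC"), (50, "L"), (40, "XL"), (10, "X"), (9, "IX"),
         (5, "V"), (4, "IV"), (1, "I")]


def to_roman(n):
    out = []
    for value, symbol in ROMAN:
        count, n = divmod(n, value)
        out.append(symbol * count)
    return "".join(out)


def to_letters(n):
    # 1 -> A, 26 -> Z, 27 -> AA, 28 -> BB (the letter is repeated)
    return chr(ord("A") + (n - 1) % 26) * ((n - 1) // 26 + 1)


def format_number(n, style):
    if style == "DecimalArabic":
        return str(n)
    if style == "UppercaseRoman":
        return to_roman(n)
    if style == "LowercaseRoman":
        return to_roman(n).lower()
    if style == "UppercaseLetters":
        return to_letters(n)
    if style == "LowercaseLetters":
        return to_letters(n).lower()
    if style == "NoNumber":
        return ""
    raise ValueError(f"unknown style: {style}")


def page_label(page, ranges):
    """Label for a 1-indexed physical page, given PageLabel stanza dicts."""
    ordered = sorted(ranges, key=lambda r: int(r["PageLabelNewIndex"]))
    starts = [int(r["PageLabelNewIndex"]) for r in ordered]
    i = bisect.bisect_right(starts, page) - 1  # last range starting at or before page
    if i < 0:
        return str(page)  # assumption: no range applies yet
    r = ordered[i]
    number = int(r.get("PageLabelStart", 1)) + (page - starts[i])
    return r.get("PageLabelPrefix", "") + format_number(number, r["PageLabelNumStyle"])


# Illustrative ranges: roman front matter, arabic body, prefixed appendix.
ranges = [
    {"PageLabelNewIndex": "1", "PageLabelStart": "1", "PageLabelNumStyle": "LowercaseRoman"},
    {"PageLabelNewIndex": "5", "PageLabelStart": "1", "PageLabelNumStyle": "DecimalArabic"},
    {"PageLabelNewIndex": "9", "PageLabelStart": "1", "PageLabelPrefix": "A-",
     "PageLabelNumStyle": "DecimalArabic"},
]
print([page_label(p, ranges) for p in (3, 4, 5, 10)])
```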

Examples

Print XML-escaped metadata for in.pdf

pdftl in.pdf dump_data

Save XML-escaped metadata for in.pdf to data.txt

pdftl in.pdf dump_data output data.txt

Print metadata for in.pdf in JSON format

pdftl in.pdf dump_data json

Tags: info, metadata

Source: pdftl.operations.dump_data

Read online: https://pdftl.readthedocs.io/en/stable/operations/dump_data.html

Type: Operation
