dump_images
Extract PDF embedded image metadata to JSON
Usage
pdftl <input> dump_images [<spec>...] [output <output>]
Details
The dump_images operation extracts metadata about embedded images in a PDF file.
It traverses the PDF’s content streams (including nested Form XObjects) to correctly calculate the absolute bounding boxes of all drawn images using the Current Transformation Matrix (CTM).
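To illustrate the CTM-based calculation described above, here is a minimal Python sketch. It is not pdftl's implementation; the helper names concat and image_bbox are hypothetical. It relies only on two PDF conventions: a matrix is written as six numbers [a b c d e f], and an image is painted into the unit square, which the CTM then maps onto the page.

```python
from itertools import product

# Hypothetical helpers for illustration; not part of pdftl.
IDENTITY = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)

def concat(m, ctm):
    """Apply m before ctm, as the PDF `cm` operator does (new CTM = m x CTM)."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = ctm
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )

def image_bbox(ctm):
    """Map the unit square through ctm and return [x_min, y_min, x_max, y_max]."""
    a, b, c, d, e, f = ctm
    xs, ys = [], []
    for x, y in product((0.0, 1.0), repeat=2):
        xs.append(a * x + c * y + e)
        ys.append(b * x + d * y + f)
    return [min(xs), min(ys), max(xs), max(ys)]

# An outer translation (e.g. on the page), then a scale inside a nested form.
page_ctm = concat((1, 0, 0, 1, 10, 20), IDENTITY)
image_ctm = concat((50, 0, 0, 50, 0, 0), page_ctm)
print(image_bbox(image_ctm))  # [10.0, 20.0, 60.0, 70.0]
```

Taking the minimum and maximum over all four corners keeps the box correct when the image is rotated or mirrored, which is why the corners are transformed individually rather than just the origin and the opposite corner.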
The operation outputs a JSON object containing page-level image metadata, including:
name: Internal PDF resource name
obj_id: PDF object number (shared across pages if the same image is reused)
bbox: Absolute bounding box coordinates [x_min, y_min, x_max, y_max] in PDF points
width_px: Native image width in pixels
height_px: Native image height in pixels
ppi_x: Horizontal resolution in pixels per inch, derived from bbox and pixel dimensions
ppi_y: Vertical resolution in pixels per inch, derived from bbox and pixel dimensions
colorspace: Colorspace family, e.g. /DeviceRGB, /DeviceCMYK, /ICCBased
bits: Bit depth per component
stream_bytes: Compressed stream size in bytes as stored in the PDF
format: Compression filter, e.g. flatedecode (PNG-style), dctdecode (JPEG)
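The ppi_x and ppi_y fields are described as derived from bbox and the pixel dimensions. Since one PDF point is 1/72 inch, the relationship can be sketched as below. The function name ppi_from_bbox is hypothetical, and pdftl's own rounding or edge-case handling may differ.

```python
POINTS_PER_INCH = 72.0  # one PDF point is 1/72 inch

def ppi_from_bbox(bbox, width_px, height_px):
    """Return (ppi_x, ppi_y) for an image drawn into bbox, given in PDF points.

    Hypothetical helper for illustration; not part of pdftl.
    """
    x_min, y_min, x_max, y_max = bbox
    width_in = (x_max - x_min) / POINTS_PER_INCH
    height_in = (y_max - y_min) / POINTS_PER_INCH
    if width_in <= 0 or height_in <= 0:
        raise ValueError("bbox has no area, so resolution is undefined")
    return width_px / width_in, height_px / height_in

# A 600 x 450 pixel image drawn 2 inches wide and 1.5 inches tall.
print(ppi_from_bbox([72, 72, 216, 180], 600, 450))  # (300.0, 300.0)
```

Because the calculation uses the drawn size rather than the stored size, the same image can report different ppi values on different placements.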
Note: If the same image object is drawn multiple times (e.g. as a tiling pattern), it will appear once per placement with its own bbox and ppi values. The obj_id field can be used to identify duplicate placements of the same underlying stream.
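As a sketch of how obj_id can be used to find repeated placements, the Python snippet below groups image records by that field. It assumes the records have already been collected into a flat list of dictionaries with the field names listed above; the exact nesting of the JSON output is not shown here, so that step is left out. The function name and the sample values are made up for illustration.

```python
from collections import defaultdict

def repeated_placements(records):
    """Group image records by obj_id and keep objects drawn more than once.

    Hypothetical helper; `records` is assumed to be a flat list of dicts.
    """
    groups = defaultdict(list)
    for record in records:
        groups[record["obj_id"]].append(record)
    return {obj_id: recs for obj_id, recs in groups.items() if len(recs) > 1}

# Made-up sample records: object 12 is drawn twice, object 15 once.
sample = [
    {"name": "/Im0", "obj_id": 12, "bbox": [0, 0, 100, 100]},
    {"name": "/Im0", "obj_id": 12, "bbox": [100, 0, 200, 100]},
    {"name": "/Im1", "obj_id": 15, "bbox": [50, 300, 250, 450]},
]

for obj_id, recs in repeated_placements(sample).items():
    print(obj_id, [r["bbox"] for r in recs])
```

Grouping on obj_id rather than name avoids false matches, since resource names are only meaningful within the resource dictionary that defines them.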
You can optionally provide page specifications to limit extraction to specific pages.
Examples
Print image metadata for in.pdf to console
pdftl in.pdf dump_images
Save image metadata for in.pdf to a file
pdftl in.pdf dump_images output images.json
Save image metadata for in.pdf to a file and save a copy of in.pdf
pdftl in.pdf dump_images output images.json --- output copy.pdf
Print image metadata for pages 1, 3, 4, and 5
pdftl in.pdf dump_images 1 3-5
Tags: info, metadata, images
Source: pdftl.operations.dump_images
Read online: https://pdftl.readthedocs.io/en/stable/operations/dump_images.html
Type: Operation