pdf-parser: A tool for parsing and analyzing PDF files to extract data or metadata

1. What is PDF-Parser?

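PDF-Parser (pdf-parser.py) is a command-line tool written by Didier Stevens and packaged in Kali Linux as pdf-parser. It is a separate project from peepdf. It parses a PDF file into its basic elements without rendering it, which makes it suitable for analyzing and extracting information from untrusted documents. It helps with: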

Extracting metadata, scripts, and embedded objects.
Detecting malicious content (like JavaScript exploits).
Analyzing PDF structure for forensic investigations.

2. How Does PDF-Parser Work?

PDF-Parser parses the internal structure of PDF files, including:

Objects (streams, dictionaries, arrays).
JavaScript code (often used in exploits).
Compressed data (Zlib/Flate decoding).
Embedded files (executables, documents).
Annotations & Actions (triggered on opening).

It helps security researchers, forensic analysts, and penetration testers analyze suspicious PDFs.
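To make the elements listed above concrete, the following sketch writes a small hand-made PDF-like file and counts a few structural keywords in its raw bytes. The function names make_sample_pdf and count_keyword are made up for this illustration, and the sample file has no xref table, so it is not a fully valid PDF. A byte-level count like this is only a rough approximation of what a real parser does: it does not decode streams or handle obfuscated names.

```shell
# Write a tiny, hand-made PDF-like file (illustrative only; it has no xref table).
make_sample_pdf() {
  cat > "$1" <<'EOF'
%PDF-1.4
1 0 obj
<< /Type /Catalog /Pages 2 0 R /OpenAction 3 0 R >>
endobj
2 0 obj
<< /Type /Pages /Kids [] /Count 0 >>
endobj
3 0 obj
<< /Type /Action /S /JavaScript /JS (app.alert\("hi"\);) >>
endobj
trailer
<< /Root 1 0 R >>
%%EOF
EOF
}

# Count how many times a fixed string occurs in the raw bytes of a file.
# $1 = file, $2 = keyword (for example endobj or /JavaScript)
count_keyword() {
  grep -a -o -F -- "$2" "$1" | wc -l | tr -d ' '
}

# Usage (commented out so the snippet has no side effects):
# make_sample_pdf sample_min.pdf
# count_keyword sample_min.pdf endobj
# count_keyword sample_min.pdf /OpenAction
```

In the sample, object 1 is a dictionary that points at object 3 through /OpenAction, and object 3 is the action that carries the JavaScript. That reference chain is what you follow by hand when inspecting a real file object by object.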

Installation

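PDF-Parser is pre-installed in Kali Linux. If it is missing, install the pdf-parser package:

Bash

sudo apt update && sudo apt install pdf-parser -y

Alternatively, download pdf-parser.py from Didier Stevens' DidierStevensSuite repository on GitHub and run it with Python.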
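Because the Kali package installs the command as pdf-parser while the upstream script is named pdf-parser.py, a script that has to work in both setups can probe for whichever name is on PATH. This is a minimal sketch; first_available is a hypothetical helper name, not part of the tool.

```shell
# Print the first command name from the argument list that exists on PATH.
# Returns non-zero if none of the names can be found.
first_available() {
  local name
  for name in "$@"; do
    if command -v "$name" >/dev/null 2>&1; then
      printf '%s\n' "$name"
      return 0
    fi
  done
  return 1
}

# Usage (commented out so the snippet has no side effects):
# PDF_PARSER=$(first_available pdf-parser pdf-parser.py) \
#   || echo "pdf-parser not found on PATH" >&2
```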

Basic Usage

Basic Syntax

Bash

pdf-parser [options] <PDF-file>

Common Examples

View PDF Structure

Bash

pdf-parser example.pdf

Search for JavaScript

Bash

pdf-parser --search javascript example.pdf

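Select Specific Objects

--object takes one or more object IDs separated by commas, not the word "all". Running pdf-parser without a selection option already prints every object.

Bash

pdf-parser --object 1,2 example.pdf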

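Decode Compressed Streams

--filter passes each stream through its declared filters (such as /FlateDecode) so that the decoded content is shown.

Bash

pdf-parser --filter example.pdf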

Advanced Usage

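Extract Embedded Files

The --extract option writes malformed content to a file; it does not extract attachments. Embedded file streams usually carry /Type /EmbeddedFile, so list those objects first, then decode and dump the one you want (8 is a placeholder object ID):

Bash

pdf-parser --type /EmbeddedFile example.pdf
pdf-parser --object 8 --filter --dump embedded.bin example.pdf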

Decode Streams (e.g., /FlateDecode)

Combine --filter (apply the stream's filters) with --raw (print the result unescaped):

Bash

pdf-parser --object 5 --filter --raw example.pdf

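Search for Open Actions (Malicious Triggers)

/OpenAction runs when the document is opened. Repeat the search with "/AA" to find additional actions tied to other events.

Bash

pdf-parser --search "/OpenAction" example.pdf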

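Check for JavaScript (Common Exploit Vector)

A keyword search shows where JavaScript is referenced; it does not identify a specific CVE. Run one search per keyword, for example once for "/JavaScript" and once for "/JS".

Bash

pdf-parser --search "/JavaScript" example.pdf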
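When triaging a file it is common to look for several risky keywords in turn. The sketch below wraps that in a loop, on the assumption that a single run honors only one --search value. The function name triage_keywords, the PDF_PARSER override variable, and the keyword list are illustrative choices, not something the tool defines.

```shell
# Run one pdf-parser search per keyword and label each result block.
# PDF_PARSER is a hypothetical override for the executable name
# (defaults to pdf-parser; set it to pdf-parser.py outside Kali).
triage_keywords() {
  local pdf=$1 kw
  for kw in /OpenAction /AA /JS /JavaScript /Launch /EmbeddedFile; do
    printf '== %s ==\n' "$kw"
    "${PDF_PARSER:-pdf-parser}" --search "$kw" "$pdf"
  done
}

# Usage (commented out so the snippet has no side effects):
# triage_keywords example.pdf
```

An empty block under a label means no object dictionary mentioned that keyword. Streams are not searched by --search, so a clean result here does not rule out content hidden inside compressed streams.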

Command-Line Options

Basic Options

Option	Description	Example
`--version`	Show program version	`pdf-parser --version`
`-h, --help`	Display help menu	`pdf-parser -h`
`-m, --man`	Print full manual	`pdf-parser -m`

Search & Filtering Options

Option	Description	Example
`-s SEARCH, --search=SEARCH`	Search for a string in objects (excluding streams)	`pdf-parser -s "/JavaScript" file.pdf`
`--searchstream=SEARCHSTREAM`	Search inside stream content	`pdf-parser --searchstream "eval(" file.pdf`
`--casesensitive`	Case-sensitive search in streams	`pdf-parser --searchstream "Shell" --casesensitive file.pdf`
`--regex`	Use regex for stream searches	`pdf-parser --searchstream "https?://" --regex file.pdf`
`-f, --filter`	Decode compressed streams (Flate, LZW, etc.)	`pdf-parser -f -o 5 file.pdf`
`-k KEY, --key=KEY`	Search for a specific dictionary key	`pdf-parser -k "/OpenAction" file.pdf`

Object Selection & Extraction

Option	Description	Example
`-o OBJECT, --object=OBJECT`	Inspect a specific object (comma-separated)	`pdf-parser -o 4,7 file.pdf`
`-r REFERENCE, --reference=REFERENCE`	Find objects referencing a given object	`pdf-parser -r 12 file.pdf`
`-t TYPE, --type=TYPE`	Select objects by their `/Type` value (e.g., `/Catalog`)	`pdf-parser -t /Catalog file.pdf`
`-e ELEMENTS, --elements=ELEMENTS`	Select element types (`c`=comment, `x`=xref, `t`=trailer, `s`=startxref, `i`=indirect object)	`pdf-parser -e t file.pdf`
`-x EXTRACT, --extract=EXTRACT`	Extract malformed content to a file	`pdf-parser -x output.bin file.pdf`
`-d DUMP, --dump=DUMP`	Dump stream content to a file	`pdf-parser -o 5 -d stream_data.bin file.pdf`

Output & Debugging

Option	Description	Example
`-w, --raw`	Print data and filter output raw (unescaped)	`pdf-parser -w -o 5 file.pdf`
`-a, --stats`	Display PDF statistics	`pdf-parser -a file.pdf`
`-v, --verbose`	Show malformed PDF elements	`pdf-parser -v file.pdf`
`-n, --nocanonicalizedoutput`	Disable output canonicalization	`pdf-parser -n file.pdf`
`-D, --debug`	Enable debug mode	`pdf-parser -D file.pdf`
`-j, --jsonoutput`	Generate JSON output	`pdf-parser -j file.pdf > output.json`

Advanced Features

Option	Description	Example
`-O, --objstm`	Parse `/ObjStm` (object stream) objects	`pdf-parser -O file.pdf`
`-H, --hash`	Calculate hashes of objects	`pdf-parser -H -o 5 file.pdf`
`--overridingfilters=OVERRIDINGFILTERS`	Override stream filters (e.g., `raw` for raw content)	`pdf-parser --overridingfilters raw -o 5 file.pdf`
`-g, --generate`	Generate a Python script to recreate the PDF	`pdf-parser -g file.pdf > recreate.py`
`--generateembedded=GENERATEEMBEDDED`	Generate Python code to embed an object as a file	`pdf-parser --generateembedded=3 file.pdf`

YARA Integration (Malware Detection)

Option	Description	Example
`-y YARA, --yara=YARA`	Scan streams with YARA rules	`pdf-parser -y /path/to/rules.yara file.pdf`
`--yarastrings`	Print YARA strings	`pdf-parser -y rules.yara --yarastrings file.pdf`
`--unfiltered`	Search in unfiltered streams (raw data)	`pdf-parser --searchstream "malware" --unfiltered file.pdf`
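For readers who have no rule file yet, this sketch writes a very small YARA rule that flags two JavaScript function names often seen in obfuscated scripts. The rule name demo_pdf_stream_js, the helper write_demo_rule, and the file name are invented for illustration. The rule is far too broad for real detection and will match benign documents as well.

```shell
# Write a minimal demonstration YARA rule to the given path.
write_demo_rule() {
  cat > "$1" <<'EOF'
rule demo_pdf_stream_js
{
    strings:
        $eval = "eval(" nocase
        $unescape = "unescape(" nocase
    condition:
        any of them
}
EOF
}

# Usage (commented out so the snippet has no side effects):
# write_demo_rule demo_rules.yara
# pdf-parser -y demo_rules.yara --yarastrings file.pdf
```

The YARA options depend on the Python yara module being installed alongside pdf-parser.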

Decoders & Custom Processing

Option	Description	Example
`--decoders=DECODERS`	Load decoder plugins (Python files, comma-separated)	`pdf-parser --decoders=decoder_xor1.py -f -o 5 file.pdf`
`--decoderoptions=DECODEROPTIONS`	Pass options to decoders (format depends on the decoder)	`pdf-parser --decoders=decoder_xor1.py --decoderoptions="key=0x41" file.pdf`

Practical Examples

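1. Extract JavaScript from a PDF

Search for objects that mention /JavaScript. If the script lives in a referenced stream object, decode that object next (12 is a placeholder ID):

Bash

pdf-parser -s "/JavaScript" file.pdf
pdf-parser -o 12 -f -w file.pdf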

2. Check for OpenAction Triggers (Common in Malicious PDFs)

Bash

pdf-parser -k "/OpenAction" file.pdf

3. Extract and Decode a Suspicious Stream

Bash

pdf-parser -o 12 -f -d decoded_data.bin file.pdf
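Once a stream has been dumped, a quick summary of the resulting file helps decide what to do with it next. The helper below is a sketch; summarize_dump is a hypothetical name, and it relies only on standard tools (wc, sha256sum, head, od) rather than on pdf-parser itself.

```shell
# Print the size, SHA-256 hash, and first 32 bytes of a dumped stream.
summarize_dump() {
  local f=$1
  printf 'size:   %s bytes\n' "$(wc -c < "$f" | tr -d ' ')"
  printf 'sha256: %s\n' "$(sha256sum "$f" | cut -d' ' -f1)"
  printf 'head:\n'
  head -c 32 "$f" | od -c
}

# Usage (commented out so the snippet has no side effects):
# summarize_dump decoded_data.bin
```

The hash can be looked up in threat-intelligence sources, and the leading bytes often reveal the payload type, for example "MZ" at the start of a Windows executable.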

4. Scan for Malware Using YARA

Bash

pdf-parser -y /usr/share/yara/rules/malware.yara --unfiltered file.pdf

5. Generate a Python Script to Rebuild the PDF

Bash

pdf-parser -g file.pdf > rebuild_pdf.py

Real-World Use Cases

Malware Analysis

Detect malicious PDFs with embedded JavaScript exploits (e.g., CVE-2008-2992, the Adobe Reader util.printf() overflow).
Extract payloads (shellcode, executables).

Forensic Investigations

Analyze metadata (author, creation date).
Identify hidden content in legal documents.

Penetration Testing

Test PDF upload vulnerabilities in web apps.
Craft malicious PDFs for social engineering.

Bug Bounty Hunting

Find XSS/PDF injection flaws.

Troubleshooting Tips

Issue: “File is not a PDF”

Ensure the file is not corrupted.
Confirm the file type with `file example.pdf`.
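If the file utility is not available, the header can be checked directly. This sketch looks for the "%PDF-" marker within the first 1024 bytes, on the assumption that the marker may be preceded by some junk bytes, which common readers tolerate. looks_like_pdf is a made-up helper name.

```shell
# Return success if "%PDF-" appears within the first 1024 bytes of the file.
looks_like_pdf() {
  head -c 1024 -- "$1" | grep -a -q -F -- '%PDF-'
}

# Usage (commented out so the snippet has no side effects):
# looks_like_pdf example.pdf && echo "header found" || echo "no PDF header"
```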

Issue: Encrypted PDF

If you know the password (or the user password is empty), decrypt the file with qpdf first; use pdfcrack to recover an unknown password:

Bash

qpdf --password='' --decrypt encrypted.pdf decrypted.pdf
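Before reaching for a decryption tool, it can help to confirm that the document declares encryption at all. The sketch below does a crude byte search for the /Encrypt key. has_encrypt_key is a hypothetical helper name, and the check is only a heuristic: it can report a false positive if the string appears in ordinary content, and it will miss a name written with hex escapes.

```shell
# Return success if the raw bytes of the file contain an /Encrypt key.
has_encrypt_key() {
  grep -a -q -F -- '/Encrypt' "$1"
}

# Usage (commented out so the snippet has no side effects):
# if has_encrypt_key encrypted.pdf; then
#   qpdf --password='' --decrypt encrypted.pdf decrypted.pdf
# fi
```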

Issue: No Output

Try --verbose for debugging:

Bash

pdf-parser -v example.pdf

Issue: Unsupported Features

Use peepdf (interactive mode) for deeper analysis:

Bash

peepdf -i example.pdf
