1. What is PDF-Parser?
PDF-Parser (pdf-parser.py) is a command-line tool written by Didier Stevens and packaged in Kali Linux as pdf-parser. It is a separate tool from peepdf. It parses a PDF file and reports its fundamental elements without rendering the document. It helps with:
- Extracting metadata, scripts, and embedded objects.
- Locating potentially malicious content (such as JavaScript exploits).
- Analyzing PDF structure for forensic investigations.
2. How Does PDF-Parser Work?
PDF-Parser parses the internal structure of PDF files, including:
- Objects (streams, dictionaries, arrays).
- JavaScript code (often used in exploits).
- Compressed data (Zlib/Flate decoding).
- Embedded files (executables, documents).
- Annotations & Actions (triggered on opening).
It helps security researchers, forensic analysts, and penetration testers analyze suspicious PDFs.
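To make the list above concrete, here is a minimal sketch that writes a tiny hand-made PDF and counts the same building blocks with plain grep. The file contents and the variable names (sample, obj_count, js_count) are invented for illustration, and the sample omits the xref table, so it is just enough structure to look at rather than a fully valid PDF. The pdf-parser call at the end runs only if the tool is installed.

```shell
# Create a throwaway sample file (hypothetical content, not a complete PDF)
sample=$(mktemp)
cat > "$sample" <<'EOF'
%PDF-1.4
1 0 obj
<< /Type /Catalog /Pages 2 0 R /OpenAction 3 0 R >>
endobj
2 0 obj
<< /Type /Pages /Kids [] /Count 0 >>
endobj
3 0 obj
<< /Type /Action /S /JavaScript /JS (app.alert(1);) >>
endobj
trailer
<< /Root 1 0 R >>
%%EOF
EOF

# Indirect objects start with "<id> <generation> obj"
obj_count=$(grep -c -E '^[0-9]+ [0-9]+ obj$' "$sample")
# Lines that mention the JavaScript action subtype
js_count=$(grep -c '/JavaScript' "$sample")
echo "objects=$obj_count javascript_lines=$js_count"

# If pdf-parser is installed, let it print its own statistics for comparison
if command -v pdf-parser >/dev/null 2>&1; then
  pdf-parser --stats "$sample" || true
fi
```

On this sample the grep counts come out as three objects and one JavaScript line. pdf-parser does the same kind of walk, but with a real tokenizer that also copes with obfuscated names and compressed streams, which grep cannot.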
Installation
PDF-Parser is pre-installed in Kali Linux. If missing, install it via:
Bash
sudo apt update && sudo apt install pdf-parser -y
Alternatively, download pdf-parser.py from Didier Stevens' GitHub repository (DidierStevensSuite).
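A quick way to confirm the installation is to check which name the tool is reachable under. This is a small sketch: the function name find_pdf_parser is made up here, and it assumes the tool is exposed either as pdf-parser (the Kali package name) or as pdf-parser.py (the upstream script name).

```shell
# find_pdf_parser: print the first command name that resolves on PATH.
# Returns 1 if neither candidate name is found.
find_pdf_parser() {
  for name in pdf-parser pdf-parser.py; do
    if command -v "$name" >/dev/null 2>&1; then
      echo "$name"
      return 0
    fi
  done
  return 1
}

if tool=$(find_pdf_parser); then
  echo "found: $tool"
else
  echo "pdf-parser not found; install it first"
fi
```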
Basic Usage
Basic Syntax
Bash
pdf-parser [options] <PDF-file>
Common Examples
- View PDF Structure
Bash
pdf-parser example.pdf
- Search for JavaScript
Bash
pdf-parser --search javascript example.pdf
- List All Indirect Objects
Bash
pdf-parser --elements i example.pdf
- Check for Compression
Bash
pdf-parser --filter example.pdf
Advanced Usage
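Extract Embedded Files
Bash
pdf-parser --type /EmbeddedFile example.pdf
pdf-parser --object 8 --filter --dump embedded.bin example.pdf  # 8 = an object ID reported by the first command
Decode Streams (e.g., /FlateDecode)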
Bash
pdf-parser --object 5 --filter --raw example.pdf
Search for Automatic Actions (Malicious Triggers)
Bash
pdf-parser --search "/OpenAction" example.pdf
pdf-parser --search "/AA" example.pdf
Search for JavaScript Keys (Common in Exploits)
Bash
pdf-parser --search "/JS" example.pdf
pdf-parser --search "/JavaScript" example.pdf
Command-Line Options
Basic Options
| Option | Description | Example |
|---|---|---|
| --version | Show program version | pdf-parser --version |
| -h, --help | Display help menu | pdf-parser -h |
| -m, --man | Print full manual | pdf-parser -m |
Search & Filtering Options
| Option | Description | Example |
|---|---|---|
| -s SEARCH, --search=SEARCH | Search for a string in objects (excluding streams) | pdf-parser -s "/JavaScript" file.pdf |
| --searchstream=SEARCHSTREAM | Search inside stream content | pdf-parser --searchstream "eval(" file.pdf |
| --casesensitive | Case-sensitive search in streams | pdf-parser --searchstream "Shell" --casesensitive file.pdf |
| --regex | Use regex for stream searches | pdf-parser --searchstream "https?://" --regex file.pdf |
| -f, --filter | Decode compressed streams (Flate, LZW, etc.) | pdf-parser -f -o 5 file.pdf |
| -k KEY, --key=KEY | Search for a specific dictionary key | pdf-parser -k "/OpenAction" file.pdf |
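The search options above lend themselves to a quick first-pass triage. The following is a minimal sketch, not a definitive workflow: the key list, the function name triage_pdf, and the PDF_PARSER override variable are all chosen here for illustration. It assumes each matching object is printed with a header line starting with "obj ", and it runs one search per invocation on the assumption that a repeated --search option would not be combined into a single query.

```shell
# Keys that commonly deserve a first look (list chosen for illustration)
TRIAGE_KEYS="/OpenAction /AA /JS /JavaScript /Launch /EmbeddedFile"

# triage_pdf FILE: print "<key>: <number of matching objects>" for each key.
# PDF_PARSER is a hypothetical override for the command name.
triage_pdf() {
  pdf=$1
  parser=${PDF_PARSER:-pdf-parser}
  for key in $TRIAGE_KEYS; do
    # Count object header lines ("obj <id> <generation>") in the output;
    # grep -c still prints 0 when nothing matches, so ignore its exit status
    hits=$("$parser" --search "$key" "$pdf" 2>/dev/null | grep -c '^obj ' || true)
    echo "$key: $hits"
  done
}

# Usage: triage_pdf suspicious.pdf
```

A non-zero count for /OpenAction together with /JS or /JavaScript is a reasonable cue to inspect those objects individually with -o.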
Object Selection & Extraction
| Option | Description | Example |
|---|---|---|
| -o OBJECT, --object=OBJECT | Inspect specific objects by ID (comma-separated) | pdf-parser -o 4,7 file.pdf |
| -r REFERENCE, --reference=REFERENCE | Find objects referencing a given object | pdf-parser -r 12 file.pdf |
| -t TYPE, --type=TYPE | Filter objects by their /Type value (e.g., /Catalog) | pdf-parser -t "/Catalog" file.pdf |
| -e ELEMENTS, --elements=ELEMENTS | Select element types (c=comment, x=xref, t=trailer, s=startxref, i=indirect object) | pdf-parser -e i file.pdf |
| -x EXTRACT, --extract=EXTRACT | Extract malformed content to a file | pdf-parser -x output.bin file.pdf |
| -d DUMP, --dump=DUMP | Dump stream content to a file | pdf-parser -o 5 -d stream_data.bin file.pdf |
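Object selection is most useful when chained with a search: find the IDs first, then inspect each object. The sketch below shows one way to do that under stated assumptions: the function names matching_ids and show_matches and the PDF_PARSER override are hypothetical, and the awk filter assumes object headers have the form "obj <id> <generation>".

```shell
# matching_ids KEY FILE: print the IDs of objects whose dictionary mentions KEY.
# PDF_PARSER is a hypothetical override for the command name.
matching_ids() {
  "${PDF_PARSER:-pdf-parser}" --search "$1" "$2" 2>/dev/null |
    awk '$1 == "obj" && NF == 3 { print $2 }'
}

# show_matches KEY FILE: print each matching object with its stream decoded
show_matches() {
  for id in $(matching_ids "$1" "$2"); do
    "${PDF_PARSER:-pdf-parser}" --object "$id" --filter --raw "$2"
  done
}

# Usage: show_matches /JavaScript suspicious.pdf
```

If a matching object only points at another object (for example "/JS 7 0 R"), the script itself lives in the referenced object, so a follow-up call with -o 7 would be the next step.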
Output & Debugging
| Option | Description | Example |
|---|---|---|
| -w, --raw | Print data and filter output raw instead of escaped | pdf-parser -o 5 -f -w file.pdf |
| -a, --stats | Display PDF statistics | pdf-parser -a file.pdf |
| -v, --verbose | Show malformed PDF elements | pdf-parser -v file.pdf |
| -n, --nocanonicalizedoutput | Disable output canonicalization | pdf-parser -n file.pdf |
| -D, --debug | Enable debug mode | pdf-parser -D file.pdf |
| -j, --jsonoutput | Generate JSON output | pdf-parser -j file.pdf > output.json |
Advanced Features
| Option | Description | Example |
|---|---|---|
| -O, --objstm | Parse /ObjStm (object stream) objects | pdf-parser -O file.pdf |
| -H, --hash | Calculate hashes of objects | pdf-parser -H -o 5 file.pdf |
| --overridingfilters=OVERRIDINGFILTERS | Override stream filters (e.g., raw for raw content) | pdf-parser --overridingfilters raw -o 5 file.pdf |
| -g, --generate | Generate a Python script to recreate the PDF | pdf-parser -g file.pdf > recreate.py |
| --generateembedded=GENERATEEMBEDDED | Generate Python code to embed an object as a file | pdf-parser --generateembedded=3 file.pdf |
YARA Integration (Malware Detection)
| Option | Description | Example |
|---|---|---|
| -y YARA, --yara=YARA | Scan streams with YARA rules | pdf-parser -y /path/to/rules.yara file.pdf |
| --yarastrings | Print YARA strings | pdf-parser -y rules.yara --yarastrings file.pdf |
| --unfiltered | Search in unfiltered streams (raw data) | pdf-parser --searchstream "malware" --unfiltered file.pdf |
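For readers who have no rule file yet, here is a minimal sketch of what one could look like and how it plugs into the options above. The rule name, its two strings, and the helper scan_pdf are invented for illustration, not a vetted detection rule. It also assumes pdf-parser's YARA support is available, which depends on the yara Python module being installed.

```shell
# Write a minimal rule file (hypothetical rule name and strings)
rules=$(mktemp)
cat > "$rules" <<'EOF'
rule suspicious_pdf_javascript
{
    strings:
        $eval = "eval(" nocase
        $unescape = "unescape(" nocase
    condition:
        any of them
}
EOF

# scan_pdf FILE: run the rule against the file's streams, if pdf-parser exists
scan_pdf() {
  if command -v pdf-parser >/dev/null 2>&1; then
    pdf-parser --yara "$rules" --yarastrings "$1"
  else
    echo "pdf-parser not found" >&2
    return 1
  fi
}

# Usage: scan_pdf suspicious.pdf
```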
Decoders & Custom Processing
| Option | Description | Example |
|---|---|---|
| --decoders=DECODERS | Load custom decoders (comma-separated) | pdf-parser --decoders=base64,xor file.pdf |
| --decoderoptions=DECODEROPTIONS | Pass options to decoders | pdf-parser --decoders=xor --decoderoptions="key=0x41" file.pdf |
Practical Examples
1. Extract JavaScript from a PDF
Bash
pdf-parser --search "/JavaScript" --filter file.pdf
2. Check for OpenAction Triggers (Common in Malicious PDFs)
Bash
pdf-parser -k "/OpenAction" file.pdf
3. Extract and Decode a Suspicious Stream
Bash
pdf-parser -o 12 -f -d decoded_data.bin file.pdf
4. Scan for Malware Using YARA
Bash
pdf-parser -y /usr/share/yara/rules/malware.yara --unfiltered file.pdf
5. Generate a Python Script to Rebuild the PDF
Bash
pdf-parser -g file.pdf > rebuild_pdf.py
Real-World Use Cases
- Malware Analysis
  - Detect malicious PDFs with embedded JavaScript exploits (e.g., CVE-2008-2992, the Adobe Reader util.printf overflow).
  - Extract payloads (shellcode, executables).
- Forensic Investigations
  - Analyze metadata (author, creation date).
  - Identify hidden content in legal documents.
- Penetration Testing
  - Test PDF upload vulnerabilities in web apps.
  - Craft malicious PDFs for social engineering.
- Bug Bounty Hunting
  - Find XSS/PDF injection flaws.
Troubleshooting Tips
Issue: “File is not a PDF”
- Ensure the file is not corrupted.
- Check the file type by running: file example.pdf
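When the file command is inconclusive, the header can be checked directly. This is a minimal sketch with a made-up helper name (has_pdf_header). It looks for the "%PDF-" marker anywhere in the first 1024 bytes, a deliberately lenient choice on the assumption that some readers tolerate leading junk before the header.

```shell
# has_pdf_header FILE: succeed if "%PDF-" appears in the first 1024 bytes
has_pdf_header() {
  head -c 1024 "$1" 2>/dev/null | grep -a -q '%PDF-'
}

# Usage:
#   if has_pdf_header example.pdf; then echo "looks like a PDF"; fi
```

A file that passes this check but still fails to parse is more likely truncated or deliberately malformed than mislabeled, which is where the -v option becomes useful.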
Issue: Encrypted PDF
- Use pdfcrack (to recover the password) or qpdf to decrypt first:
Bash
qpdf --password='' --decrypt encrypted.pdf decrypted.pdf
Issue: No Output
- Try --verbose for debugging:
Bash
pdf-parser -v example.pdf
Issue: Unsupported Features
- Use peepdf (interactive mode) for deeper analysis:
Bash
peepdf -i example.pdf