pdf-parser: A tool for parsing and analyzing PDF files to extract data or metadata

1. What is PDF-Parser?

PDF-Parser (pdf-parser.py) is a command-line tool written by Didier Stevens and packaged in Kali Linux as pdf-parser. It is a standalone tool, separate from peepdf, that parses the elements of a PDF file without rendering the document. It helps in:

  • Extracting metadata, scripts, and embedded objects.
  • Detecting malicious content (like JavaScript exploits).
  • Analyzing PDF structure for forensic investigations.

2. How Does PDF-Parser Work?

PDF-Parser parses the internal structure of PDF files, including:

  • Objects (streams, dictionaries, arrays).
  • JavaScript code (often used in exploits).
  • Compressed data (Zlib/Flate decoding).
  • Embedded files (executables, documents).
  • Annotations & Actions (triggered on opening).

It helps security researchers, forensic analysts, and penetration testers analyze suspicious PDFs.
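
To make the object structure concrete, here is a minimal sketch that writes a tiny hand-made PDF-like file and lists its indirect objects and trigger keys with plain grep. The function names and the sample content are illustrative assumptions, and this is not how pdf-parser is implemented: grep cannot see inside compressed streams, which is exactly the gap pdf-parser fills.

```shell
# Illustrative only: write a tiny PDF-like file (hypothetical content) to the given path.
make_sample_pdf() {
    cat > "$1" <<'EOF'
%PDF-1.4
1 0 obj
<< /Type /Catalog /Pages 2 0 R /OpenAction 3 0 R >>
endobj
2 0 obj
<< /Type /Pages /Kids [] /Count 0 >>
endobj
3 0 obj
<< /Type /Action /S /JavaScript /JS (app.alert\('hi'\);) >>
endobj
trailer
<< /Root 1 0 R >>
%%EOF
EOF
}

# Print the "N G obj" header line of every indirect object.
list_objects() {
    grep -E '^[0-9]+ [0-9]+ obj' "$1"
}

# Print (with line numbers) the lines containing keys that commonly trigger or carry code.
find_trigger_keys() {
    grep -n -E '/(OpenAction|AA|JavaScript|JS)([^A-Za-z]|$)' "$1"
}

sample="$(mktemp)"
make_sample_pdf "$sample"
list_objects "$sample"
find_trigger_keys "$sample"
```

On the sample file, the first call prints the three object headers and the second prints the catalog line (with /OpenAction) and the action line (with /JavaScript and /JS).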


Installation

PDF-Parser is pre-installed in Kali Linux. If missing, install it via:

Bash
sudo apt update && sudo apt install pdf-parser -y

Alternatively, download pdf-parser.py from Didier Stevens's DidierStevensSuite repository on GitHub and run it with Python.


Basic Usage

Basic Syntax

Bash
pdf-parser [options] <PDF-file>

Common Examples

  1. View PDF Structure
Bash
pdf-parser example.pdf
  2. Search for JavaScript
Bash
pdf-parser --search javascript example.pdf
  3. List Only Indirect Objects
Bash
pdf-parser --elements i example.pdf
  4. Decode Compressed Streams
Bash
pdf-parser --filter example.pdf

Advanced Usage

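Extract Embedded Files

pdf-parser has no single "extract everything" switch (--extract writes malformed content to a file). Locate the embedded-file stream first, then dump it by object ID (8 is a placeholder). If the stream does not declare /Type /EmbeddedFile, use --search "/EmbeddedFile" to find it instead.

Bash
pdf-parser --type /EmbeddedFile example.pdf
pdf-parser --object 8 --filter --dump embedded.bin example.pdf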

Decode Streams (e.g., /FlateDecode)

The --filter option does the decoding; --raw only prints the result unescaped.

Bash
pdf-parser --object 5 --filter --raw example.pdf

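Search for Open Actions (Malicious Triggers)

Bash
pdf-parser --search "/OpenAction" example.pdf

Repeat the search with "/AA" to find additional-action triggers, which fire on events such as opening a page.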

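Search for JavaScript Keys (Common Exploit Vector)

pdf-parser does not identify specific CVEs; it locates the objects that carry script so you can inspect them. Only one --search value is used per run, so search for each key separately:

Bash
pdf-parser --search "/JS" example.pdf
pdf-parser --search "/JavaScript" example.pdf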
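
Because each --search call takes one string, checking several keys means several runs. The following is a minimal wrapper sketch, assuming pdf-parser is on PATH. The PDF_PARSER variable (for example, to point at pdf-parser.py), the function name triage_keys, and the key list are illustrative choices for this sketch, not part of pdf-parser.

```shell
# Command to run; defaults to the Kali package name. PDF_PARSER is an assumed
# convenience variable for this sketch, not something pdf-parser reads itself.
PDF_PARSER="${PDF_PARSER:-pdf-parser}"

# Run one --search per key and print a labelled section for each.
triage_keys() {
    pdf="$1"
    for key in /OpenAction /AA /JS /JavaScript /Launch /EmbeddedFile; do
        echo "== objects containing $key =="
        "$PDF_PARSER" --search "$key" "$pdf"
    done
}
```

Call it as triage_keys example.pdf. An empty section means no indirect object (outside of stream data) contains that key.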

Command-Line Options


Basic Options

Option | Description | Example
--version | Show program version | pdf-parser --version
-h, --help | Display help menu | pdf-parser -h
-m, --man | Print full manual | pdf-parser -m

Search & Filtering Options

Option | Description | Example
-s SEARCH, --search=SEARCH | Search for a string in indirect objects (excluding streams) | pdf-parser -s "/JavaScript" file.pdf
--searchstream=SEARCHSTREAM | Search inside stream content | pdf-parser --searchstream "eval(" file.pdf
--casesensitive | Case-sensitive search in streams | pdf-parser --searchstream "Shell" --casesensitive file.pdf
--regex | Use regex for stream searches | pdf-parser --searchstream "https?://" --regex file.pdf
-f, --filter | Pass streams through their filters (FlateDecode, ASCIIHexDecode, ASCII85Decode, LZWDecode, RunLengthDecode) | pdf-parser -f -o 5 file.pdf
-k KEY, --key=KEY | Search for a specific dictionary key | pdf-parser -k "/OpenAction" file.pdf

Object Selection & Extraction

Option | Description | Example
-o OBJECT, --object=OBJECT | Inspect specific objects by ID (comma-separated) | pdf-parser -o 4,7 file.pdf
-r REFERENCE, --reference=REFERENCE | Find objects referencing a given object | pdf-parser -r 12 file.pdf
-t TYPE, --type=TYPE | Select objects by their /Type value (e.g., /Catalog) | pdf-parser -t "/Catalog" file.pdf
-e ELEMENTS, --elements=ELEMENTS | Select element kinds (c=comment, x=xref, t=trailer, s=startxref, i=indirect object) | pdf-parser -e t file.pdf
-x EXTRACT, --extract=EXTRACT | Extract malformed content to a file | pdf-parser -x output.bin file.pdf
-d DUMP, --dump=DUMP | Dump stream content to a file (add -f to dump the decoded stream) | pdf-parser -o 5 -d stream_data.bin file.pdf

Output & Debugging

Option | Description | Example
-w, --raw | Raw (unescaped) output for data and filter results; does not decode by itself | pdf-parser -w -o 5 file.pdf
-a, --stats | Display PDF statistics | pdf-parser -a file.pdf
-v, --verbose | Show malformed PDF elements | pdf-parser -v file.pdf
-n, --nocanonicalizedoutput | Disable output canonicalization | pdf-parser -n file.pdf
-D, --debug | Enable debug mode | pdf-parser -D file.pdf
-j, --jsonoutput | Generate JSON output | pdf-parser -j file.pdf > output.json
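
The --stats option is convenient for a first pass over many files. The following sketch loops over a directory and prints the statistics for each PDF. It assumes pdf-parser is on PATH; PDF_PARSER and batch_stats are hypothetical names introduced for this example.

```shell
# Command to run; PDF_PARSER is an assumed convenience variable for this sketch.
PDF_PARSER="${PDF_PARSER:-pdf-parser}"

# batch_stats DIR: print a header and the -a (stats) output for each *.pdf in DIR.
batch_stats() {
    dir="$1"
    for f in "$dir"/*.pdf; do
        # If the glob matched nothing, the literal pattern is left; skip it.
        [ -e "$f" ] || continue
        echo "### $f"
        "$PDF_PARSER" -a "$f"
    done
}
```

For example, batch_stats ./samples prints one labelled statistics block per file, which makes it easy to spot the documents that deserve a closer look.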

Advanced Features

Option | Description | Example
-O, --objstm | Parse /ObjStm (object stream) objects | pdf-parser -O file.pdf
-H, --hash | Calculate hashes of objects | pdf-parser -H -o 5 file.pdf
--overridingfilters=OVERRIDINGFILTERS | Override stream filters (e.g., raw for raw content) | pdf-parser --overridingfilters raw -o 5 file.pdf
-g, --generate | Generate a Python script to recreate the PDF | pdf-parser -g file.pdf > recreate.py
--generateembedded=GENERATEEMBEDDED | Generate Python code to embed an object as a file | pdf-parser --generateembedded=3 file.pdf

YARA Integration (Malware Detection)

Option | Description | Example
-y YARA, --yara=YARA | Scan streams with YARA rules | pdf-parser -y /path/to/rules.yara file.pdf
--yarastrings | Print YARA strings | pdf-parser -y rules.yara --yarastrings file.pdf
--unfiltered | Search in unfiltered streams (raw data) | pdf-parser --searchstream "malware" --unfiltered file.pdf

Decoders & Custom Processing

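Option | Description | Example
--decoders=DECODERS | Load decoder plugins (comma-separated) | pdf-parser --decoders=base64,xor file.pdf
--decoderoptions=DECODEROPTIONS | Pass options to decoders | pdf-parser --decoders=xor --decoderoptions="key=0x41" file.pdf

The decoder names and option strings in these examples are placeholders; the valid values depend on the decoder plugins you have available.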

Practical Examples

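1. Extract JavaScript from a PDF

Find the object that holds the script, then print its decoded stream (12 is a placeholder object ID):

Bash
pdf-parser --search "/JS" file.pdf
pdf-parser -o 12 -f -w file.pdf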

2. Check for OpenAction Triggers (Common in Malicious PDFs)

Bash
pdf-parser -k "/OpenAction" file.pdf

3. Extract and Decode a Suspicious Stream

Bash
pdf-parser -o 12 -f -d decoded_data.bin file.pdf

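4. Scan for Malware Using YARA

Point -y at your own rule file; the path below is an example. Add --unfiltered to scan the raw stream bytes instead of the decoded content.

Bash
pdf-parser -y /usr/share/yara/rules/malware.yara file.pdf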

5. Generate a Python Script to Rebuild the PDF

Bash
pdf-parser -g file.pdf > rebuild_pdf.py
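
When a dumped stream becomes evidence or is submitted to a scanner, recording its hash right away is useful. The sketch below combines example 3 with sha256sum. PDF_PARSER and dump_and_hash are hypothetical names for this example, and sha256sum is assumed to be available (it ships with GNU coreutils on Kali).

```shell
# Command to run; PDF_PARSER is an assumed convenience variable for this sketch.
PDF_PARSER="${PDF_PARSER:-pdf-parser}"

# dump_and_hash OBJECT_ID PDF OUTFILE
dump_and_hash() {
    obj="$1"; pdf="$2"; out="$3"
    # -o selects the object, -f applies its stream filters, -d writes the result.
    "$PDF_PARSER" -o "$obj" -f -d "$out" "$pdf" > /dev/null || return 1
    # Stop if nothing was written (for example, the object has no stream).
    [ -s "$out" ] || { echo "no stream data dumped for object $obj" >&2; return 1; }
    # Print the SHA-256 of the dumped data followed by the file name.
    sha256sum "$out"
}
```

For example, dump_and_hash 12 file.pdf decoded_data.bin writes the decoded stream of object 12 and prints its SHA-256.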

Real-World Use Cases

  1. Malware Analysis
  • Detect malicious PDFs with embedded JavaScript exploits (e.g., CVE-2008-2992, the Adobe Reader util.printf overflow).
  • Extract payloads (shellcode, executables).
  2. Forensic Investigations
  • Analyze metadata (author, creation date).
  • Identify hidden content in legal documents.
  3. Penetration Testing
  • Test PDF upload vulnerabilities in web apps.
  • Craft malicious PDFs for social engineering.
  4. Bug Bounty Hunting
  • Find XSS/PDF injection flaws.

Troubleshooting Tips

Issue: “File is not a PDF”

  • Ensure the file is not corrupted.
  • Check with file example.pdf.
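
If the file utility is unavailable, the header can be checked directly. This is a minimal sketch: PDF files start with a %PDF- marker, and because lenient readers tolerate some leading bytes, the check scans the first 1024 bytes. That window is an assumption modeled on reader behavior, not a strict reading of the specification, and looks_like_pdf is a hypothetical helper name.

```shell
# Succeeds (exit 0) if the "%PDF-" marker occurs in the first 1024 bytes of the file.
# Returns 2 if the file cannot be read, 1 if the marker is absent.
looks_like_pdf() {
    [ -r "$1" ] || return 2
    head -c 1024 "$1" | LC_ALL=C grep -q '%PDF-'
}
```

For example: looks_like_pdf example.pdf && echo "PDF header found".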

Issue: Encrypted PDF

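  • Decrypt first with qpdf (the empty password below works when the user password is blank, i.e. only an owner password is set); use pdfcrack to recover an unknown password: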
Bash
qpdf --password='' --decrypt encrypted.pdf decrypted.pdf
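
To decide quickly whether decryption is needed, you can look for the /Encrypt key, which an encrypted PDF references from its trailer (or cross-reference stream dictionary). This is a rough heuristic sketch with a hypothetical helper name; a plain byte search can produce false positives if the string happens to appear elsewhere in the file.

```shell
# Succeeds (exit 0) if the byte string "/Encrypt" appears anywhere in the file.
has_encrypt_key() {
    LC_ALL=C grep -q '/Encrypt' "$1"
}
```

For example: has_encrypt_key suspicious.pdf && echo "likely encrypted, decrypt before parsing".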

Issue: No Output

  • Try --verbose for debugging:
Bash
pdf-parser -v example.pdf

Issue: Unsupported Features

  • Use peepdf (interactive mode) for deeper analysis:
Bash
peepdf -i example.pdf
