pdf-parse


Pure TypeScript, cross-platform module for extracting text, images, and tabular data from PDFs. Run directly in your browser or in Node!

weekly downloads: 2.3M

version: v2.4.5

license: Apache-2.0

category: utility

keywords: pdf, pdf-parser, pdf-parse, pdf.js, pdfjs, pdfjs-dist, pdf2text, pdf2json, pdf2image, pdf2pic, pdf-to-text, pdf-to-image, pdf-viewer, pdf-table, pdf-tools, pdf-utils, pdf-screenshot, pdf-thumbnail

Overview

pdf-parse is a pure JavaScript/TypeScript library designed to extract text, metadata, and structured data from PDF files without relying on native dependencies. Built on top of PDF.js, it runs in both Node.js (v14+) and browser environments, which makes it a good fit for serverless functions, CI/CD pipelines, and client-side applications. With roughly 2.3 million weekly downloads, it has become a go-to solution for developers who need reliable PDF parsing without the complexity of system-level dependencies.

The library exposes a simple promise-based API that accepts PDF buffers and returns structured data including extracted text, page count, PDF metadata, and version information. Unlike heavier alternatives that focus on visual rendering or complex document structure extraction, pdf-parse prioritizes simplicity and ease of integration. It handles the most common PDF parsing scenario—extracting readable text from documents—with minimal configuration.

Developers choose pdf-parse when they need to process invoices, extract content from reports, index PDF documents for search, or convert PDFs to plain text for NLP processing. Because it has no native dependencies (it relies only on PDF.js, which is itself JavaScript), there are no compilation steps, no Python bridges, and no platform-specific binaries. The library supports customization through callback functions for page rendering logic, allowing developers to preserve formatting details such as line breaks based on Y-coordinate positioning. For teams working in containerized environments or browsers where native modules are impractical, pdf-parse provides a reliable, portable solution.

Quick Start

typescript

import fs from 'fs';
import pdf from 'pdf-parse';

// Basic text extraction from local file
async function extractText(filePath: string): Promise<string> {
  const dataBuffer = fs.readFileSync(filePath);
  const data = await pdf(dataBuffer);
  
  console.log(`Pages: ${data.numpages}`);
  console.log(`PDF Version: ${data.info.PDFFormatVersion}`);
  console.log(`Text length: ${data.text.length} characters`);
  
  return data.text;
}

// Custom page rendering to preserve line breaks
async function extractWithFormatting(filePath: string): Promise<string> {
  const dataBuffer = fs.readFileSync(filePath);
  
  const options = {
    max: 5, // Parse only first 5 pages
    pagerender: (pageData) => {
      let renderOptions = {
        normalizeWhitespace: false,
        disableCombineTextItems: false
      };
      
      return pageData.getTextContent(renderOptions)
        .then((textContent) => {
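          // Y coordinate of the previous item; undefined until the first item is seen
          let lastY: number | undefined;
          let text = '';
          
          for (const item of textContent.items) {
            // Compare against undefined explicitly so a valid Y of 0 is not treated as "unset"
            if (lastY === undefined || lastY === item.transform[5]) {
              text += item.str;
            } else {
              // The Y coordinate changed, so this item starts a new line
              text += '\n' + item.str;
            }
            lastY = item.transform[5];
          }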
          
          return text;
        });
    }
  };
  
  const data = await pdf(dataBuffer, options);
  return data.text;
}

// Usage with error handling
try {
  const text = await extractText('./document.pdf');
  const formattedText = await extractWithFormatting('./invoice.pdf');
  
  // Process extracted text (e.g., search, NLP, indexing)
  const keywords = text.match(/\b\w{6,}\b/g);
  console.log('Long words:', keywords?.slice(0, 10));
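} catch (error) {
  // `error` is `unknown` under strict TypeScript, so narrow it before reading `.message`
  const message = error instanceof Error ? error.message : String(error);
  console.error('PDF parsing failed:', message);
}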

Use Cases

Invoice and receipt processing: Extract line items, totals, and vendor information from PDF invoices for accounting automation systems. The pagerender callback can be customized to detect table-like structures based on text positioning, enabling structured data extraction from standardized invoice formats.
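The positional approach mentioned for invoices can be sketched as a small helper that groups text items into rows by Y coordinate and orders each row by X coordinate. This is a minimal illustration, not part of pdf-parse: the `PositionedItem` type, the `groupIntoRows` function, and the `yTolerance` parameter are hypothetical names, and the sketch assumes each item exposes a `str` and a six-element `transform` whose fifth and sixth entries are the X and Y positions, as PDF.js text items do.

```typescript
// Hypothetical minimal shape of a positioned text item (modeled on PDF.js
// text items, where transform[4] is X and transform[5] is Y).
interface PositionedItem {
  str: string;
  transform: number[];
}

// Group items into rows by Y coordinate (within a tolerance), then order
// each row's cells left to right by X coordinate.
function groupIntoRows(items: PositionedItem[], yTolerance = 2): string[][] {
  const rows: { y: number; cells: PositionedItem[] }[] = [];

  for (const item of items) {
    const y = item.transform[5];
    // Reuse an existing row if its Y is close enough, otherwise start a new one.
    const row = rows.find((r) => Math.abs(r.y - y) <= yTolerance);
    if (row) {
      row.cells.push(item);
    } else {
      rows.push({ y, cells: [item] });
    }
  }

  // PDF coordinates start at the bottom-left, so a larger Y is higher on the page.
  rows.sort((a, b) => b.y - a.y);

  return rows.map((row) =>
    row.cells
      .slice()
      .sort((a, b) => a.transform[4] - b.transform[4])
      .map((cell) => cell.str.trim())
      .filter((s) => s.length > 0)
  );
}

// Made-up sample data standing in for the items of one invoice page.
const sampleItems: PositionedItem[] = [
  { str: 'Total', transform: [1, 0, 0, 1, 50, 100] },
  { str: 'Widget', transform: [1, 0, 0, 1, 50, 120] },
  { str: '$40.00', transform: [1, 0, 0, 1, 300, 100.5] },
  { str: '2', transform: [1, 0, 0, 1, 200, 120] },
  { str: '$40.00', transform: [1, 0, 0, 1, 300, 120] },
];

console.log(groupIntoRows(sampleItems));
// [['Widget', '2', '$40.00'], ['Total', '$40.00']]
```

A tolerance is used because items on the same visual line often differ by a fraction of a unit in Y. Real invoices may also need column detection by X ranges, which this sketch does not attempt.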

Document search indexing: Build full-text search indices for PDF archives by extracting all text content and feeding it to search engines like Elasticsearch or Algolia. The library's ability to process PDFs in memory makes it suitable for batch processing large document collections in background jobs.
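Search engines generally work better with moderately sized documents than with one large blob per PDF, so a common preprocessing step is to split the extracted text into chunks before indexing. The sketch below assumes the text has already been extracted. The `SearchDocument` shape, the `toSearchDocuments` function, and the `maxChars` parameter are hypothetical and not tied to any particular search engine's API.

```typescript
// Hypothetical document shape to send to a search index.
interface SearchDocument {
  id: string;
  sourceFile: string;
  chunkIndex: number;
  content: string;
}

// Split extracted text into whitespace-normalized chunks of at most
// `maxChars` characters, breaking only between words. A single word longer
// than `maxChars` is kept whole in its own chunk.
function toSearchDocuments(
  sourceFile: string,
  text: string,
  maxChars = 1000
): SearchDocument[] {
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const chunks: string[] = [];
  let current = '';

  for (const word of words) {
    if (current.length > 0 && current.length + 1 + word.length > maxChars) {
      // Adding this word would exceed the limit, so close the current chunk.
      chunks.push(current);
      current = word;
    } else {
      current = current.length > 0 ? `${current} ${word}` : word;
    }
  }
  if (current.length > 0) {
    chunks.push(current);
  }

  return chunks.map((content, chunkIndex) => ({
    id: `${sourceFile}#${chunkIndex}`,
    sourceFile,
    chunkIndex,
    content,
  }));
}

const indexDocs = toSearchDocuments('report.pdf', 'one two three four', 9);
console.log(indexDocs.map((d) => d.content));
// ['one two', 'three', 'four']
```

Deriving the `id` from the file name and chunk index keeps re-indexing idempotent: processing the same PDF again overwrites the same documents rather than creating duplicates.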

Content migration and archival: Convert legacy PDF documentation to plain text or markdown for modern content management systems. The metadata extraction helps preserve document properties like author, creation date, and title during migration workflows.
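For migration, the document properties can be carried over as front matter ahead of the extracted text. The following is a rough sketch under stated assumptions: the `PdfInfo` type is a hypothetical subset of the PDF information dictionary (`Title`, `Author`, `CreationDate`), the function names are illustrative, and dates are assumed to use the PDF date format (`D:YYYYMMDD...`), with time and time zone ignored.

```typescript
// Hypothetical subset of the PDF document information dictionary.
interface PdfInfo {
  Title?: string;
  Author?: string;
  CreationDate?: string; // PDF date string, e.g. "D:20240115093000Z"
}

// Convert a PDF date string to YYYY-MM-DD. Month and day default to "01"
// when omitted; time and time zone are ignored.
function pdfDateToIso(value: string): string | undefined {
  const match = /^D:(\d{4})(\d{2})?(\d{2})?/.exec(value);
  if (!match) {
    return undefined;
  }
  const [, year, month = '01', day = '01'] = match;
  return `${year}-${month}-${day}`;
}

// Build a Markdown document with YAML front matter from metadata and text.
function toMarkdownWithFrontMatter(info: PdfInfo, text: string): string {
  const lines: string[] = ['---'];
  // JSON.stringify produces double-quoted strings, which are also valid YAML.
  if (info.Title) lines.push(`title: ${JSON.stringify(info.Title)}`);
  if (info.Author) lines.push(`author: ${JSON.stringify(info.Author)}`);
  const created = info.CreationDate ? pdfDateToIso(info.CreationDate) : undefined;
  if (created) lines.push(`created: ${created}`);
  lines.push('---', '', text.trim(), '');
  return lines.join('\n');
}

console.log(
  toMarkdownWithFrontMatter(
    { Title: 'Q1 Report', Author: 'Ana', CreationDate: 'D:20240115093000Z' },
    'Revenue grew in the first quarter.'
  )
);
```

Fields that are missing from the metadata are simply omitted from the front matter, so documents with sparse properties still produce valid output.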

Resume parsing and recruitment: Extract text from candidate resumes submitted as PDFs for keyword matching, skills extraction, and automated screening. The cross-platform nature allows this to run in serverless functions without cold-start penalties from native dependencies.
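Keyword matching on extracted resume text can be as simple as checking a skills list against the text, with some care around word boundaries so that "Java" does not match inside "JavaScript". The sketch below is illustrative only: `matchSkills` is a hypothetical helper that operates on already-extracted text and does not call pdf-parse.

```typescript
// Return the skills that appear in the text as standalone terms.
// Matching is case-insensitive. Boundaries are "any character that is not a
// letter or digit", which also works for skills containing punctuation,
// such as "C++" or "Node.js", where the regex \b anchor would not.
function matchSkills(resumeText: string, skills: string[]): string[] {
  const haystack = resumeText.toLowerCase();

  return skills.filter((skill) => {
    // Escape regex metacharacters so "C++" is treated literally.
    const escaped = skill.toLowerCase().replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
    const pattern = new RegExp(`(^|[^a-z0-9])${escaped}($|[^a-z0-9])`);
    return pattern.test(haystack);
  });
}

const resumeText = 'Built services in JavaScript and Node.js; some C++ experience.';
console.log(matchSkills(resumeText, ['JavaScript', 'Java', 'Node.js', 'C++', 'Go']));
// ['JavaScript', 'Node.js', 'C++']
```

Exact matching like this misses synonyms and abbreviations ("JS", "Golang"), so production screening usually adds an alias table or a dedicated NLP step on top.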

Automated report generation pipelines: Parse generated PDF reports to verify content accuracy, extract key metrics, or create text summaries. The max option enables parsing only the first few pages when full document processing isn't necessary, improving performance in validation workflows.
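Once the first pages of a report have been extracted, verifying a key metric usually comes down to locating a label and parsing the number that follows it. The sketch below assumes reports use a "Label: value" layout with optional currency sign and thousands separators. `extractMetric` is a hypothetical helper and the sample report text is made up.

```typescript
// Find `label` in the text and parse the number that follows it.
// Accepts an optional colon, an optional "$", and comma thousands separators.
// Returns undefined when the label or a number after it is not found.
function extractMetric(text: string, label: string): number | undefined {
  // Escape regex metacharacters in the label so it is matched literally.
  const escaped = label.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const pattern = new RegExp(
    `${escaped}\\s*:?\\s*\\$?(-?\\d[\\d,]*(?:\\.\\d+)?)`,
    'i'
  );
  const match = pattern.exec(text);
  if (!match) {
    return undefined;
  }
  const value = Number(match[1].replace(/,/g, ''));
  return Number.isNaN(value) ? undefined : value;
}

// Made-up report text standing in for the output of a PDF extraction step.
const reportText = 'Quarterly Summary\nTotal revenue: $1,234.50\nOrders: 42';

console.log(extractMetric(reportText, 'Total revenue')); // 1234.5
console.log(extractMetric(reportText, 'Orders'));        // 42
console.log(extractMetric(reportText, 'Refunds'));       // undefined
```

In a validation pipeline, the returned value would typically be compared against the figure from the source system, with a missing metric treated as a failure.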

Pros & Cons

Pros

+ Pure JavaScript with zero native dependencies: runs in browsers, Lambda, Docker containers, and CI environments without compilation
+ Simple promise-based API with minimal configuration required for basic text extraction use cases
+ Cross-platform compatibility across Node.js and browsers using the same codebase and PDF.js rendering engine
+ Lightweight footprint with customizable page rendering via callbacks for preserving line breaks and formatting
+ Active ecosystem with maintained TypeScript forks providing native type definitions and ESM support

Cons

− Basic table extraction only: there is no built-in table detection, so tables require manual Y-coordinate parsing logic in the pagerender callback
− Limited control over complex PDF features like forms, annotations, or encrypted documents with advanced permissions
− Image extraction requires additional configuration and doesn't provide high-level utilities for saving or processing extracted images
− Performance can degrade on very large PDFs (hundreds of pages) since all content is processed in memory
− No native support for password-protected PDFs: encrypted documents require external tools or alternative libraries

Install

bash

npm install pdf-parse

bash

pnpm add pdf-parse

bash

bun add pdf-parse
