Pure TypeScript, cross-platform module for extracting text, images, and tabular data from PDFs. Run directly in your browser or in Node!
pdf-parse is a pure JavaScript/TypeScript library designed to extract text, metadata, and structured data from PDF files without relying on native dependencies. Built on top of PDF.js, it runs seamlessly in both Node.js (v14+) and browser environments, making it ideal for serverless functions, CI/CD pipelines, and client-side applications. With over 2.2 million weekly downloads, it has become a go-to solution for developers who need reliable PDF parsing without the complexity of system-level dependencies.
The library exposes a simple promise-based API that accepts PDF buffers and returns structured data including extracted text, page count, PDF metadata, and version information. Unlike heavier alternatives that focus on visual rendering or complex document structure extraction, pdf-parse prioritizes simplicity and ease of integration. It handles the most common PDF parsing scenario—extracting readable text from documents—with minimal configuration.
Developers choose pdf-parse when they need to process invoices, extract content from reports, index PDF documents for search, or convert PDFs to plain text for NLP. Because it has no native dependencies, there are no compilation steps, no Python bridges, and no platform-specific binaries. The library supports customization through a page-rendering callback (the pagerender option), which lets developers preserve formatting details such as line breaks by comparing the Y-coordinates of successive text items. For teams working in containerized environments or browsers where native modules are impractical, pdf-parse provides a reliable, portable solution.
import fs from 'fs';
import pdf from 'pdf-parse';

// Basic text extraction from local file
async function extractText(filePath) {
  const dataBuffer = fs.readFileSync(filePath);
  const data = await pdf(dataBuffer);
  console.log(`Pages: ${data.numpages}`);
  console.log(`PDF Version: ${data.info.PDFFormatVersion}`);
  console.log(`Text length: ${data.text.length} characters`);
  return data.text;
}

// Custom page rendering to preserve line breaks
async function extractWithFormatting(filePath) {
  const dataBuffer = fs.readFileSync(filePath);
  const options = {
    max: 5, // Parse only first 5 pages
    pagerender: (pageData) => {
      const renderOptions = {
        normalizeWhitespace: false,
        disableCombineTextItems: false
      };
      return pageData.getTextContent(renderOptions)
        .then((textContent) => {
          let lastY;
          let text = '';
          for (const item of textContent.items) {
            const y = item.transform[5];
            // Same Y as the previous item (or first item): stay on the line.
            // Compare against undefined so that a Y of 0 is handled correctly.
            if (lastY === undefined || lastY === y) {
              text += item.str;
            } else {
              text += '\n' + item.str;
            }
            lastY = y;
          }
          return text;
        });
    }
  };
  const data = await pdf(dataBuffer, options);
  return data.text;
}

// Usage with error handling
try {
  const text = await extractText('./document.pdf');
  const formattedText = await extractWithFormatting('./invoice.pdf');
  // Process extracted text (e.g., search, NLP, indexing)
  const keywords = text.match(/\b\w{6,}\b/g);
  console.log('Long words:', keywords?.slice(0, 10));
} catch (error) {
  console.error('PDF parsing failed:', error.message);
}

Invoice and receipt processing: Extract line items, totals, and vendor information from PDF invoices for accounting automation systems. The pagerender callback can be customized to detect table-like structures based on text positioning, enabling structured data extraction from standardized invoice formats.
Document search indexing: Build full-text search indices for PDF archives by extracting all text content and feeding it to search engines like Elasticsearch or Algolia. The library's ability to process PDFs in memory makes it suitable for batch processing large document collections in background jobs.
Content migration and archival: Convert legacy PDF documentation to plain text or markdown for modern content management systems. The metadata extraction helps preserve document properties like author, creation date, and title during migration workflows.
Resume parsing and recruitment: Extract text from candidate resumes submitted as PDFs for keyword matching, skills extraction, and automated screening. The cross-platform nature allows this to run in serverless functions without cold-start penalties from native dependencies.
Automated report generation pipelines: Parse generated PDF reports to verify content accuracy, extract key metrics, or create text summaries. The max option enables parsing only the first few pages when full document processing isn't necessary, improving performance in validation workflows.
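The position-based table detection mentioned for invoice processing can be sketched roughly as follows. This is a minimal illustration, not part of pdf-parse: the helper name groupIntoRows and its yTolerance parameter are hypothetical, and the sample items are hand-written stand-ins for the text items a pagerender callback would receive. It assumes each item carries a str and a transform array whose fifth and sixth entries are the X and Y offsets.

```javascript
// Sketch: group PDF.js-style text items into table-like rows by position.
// `groupIntoRows` and `yTolerance` are hypothetical names, not pdf-parse APIs.
// Assumes item.transform[4] is the X offset and item.transform[5] is the Y offset.
function groupIntoRows(items, yTolerance = 2) {
  const rows = [];
  for (const item of items) {
    const x = item.transform[4];
    const y = item.transform[5];
    // Reuse an existing row whose baseline is within the tolerance.
    let row = rows.find((r) => Math.abs(r.y - y) <= yTolerance);
    if (!row) {
      row = { y, cells: [] };
      rows.push(row);
    }
    row.cells.push({ x, str: item.str });
  }
  // PDF coordinates start at the bottom-left, so larger Y values come first.
  rows.sort((a, b) => b.y - a.y);
  // Within each row, order cells left to right and keep only their text.
  return rows.map((row) =>
    row.cells.sort((a, b) => a.x - b.x).map((cell) => cell.str)
  );
}

// Hand-written sample items standing in for textContent.items.
const sampleItems = [
  { str: 'Widget', transform: [1, 0, 0, 1, 50, 700] },
  { str: '19.99', transform: [1, 0, 0, 1, 400, 700.5] },
  { str: 'Qty', transform: [1, 0, 0, 1, 300, 720] },
  { str: 'Item', transform: [1, 0, 0, 1, 50, 720] },
  { str: '2', transform: [1, 0, 0, 1, 300, 699.8] },
];

console.log(groupIntoRows(sampleItems));
```

Using a tolerance instead of strict equality on the Y value keeps items on the same row when their baselines differ by a fraction of a unit, which is common in generated invoices. A suitable tolerance depends on the document's font size, so the default of 2 is only a starting point.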
npm install pdf-parse
pnpm add pdf-parse
bun add pdf-parse