pdf-lib Is an Under-Appreciated Gem: A Tiny PDF API in TypeScript

A small self-hosted HTTP service for merging, splitting, rotating, and inspecting PDFs. Built on Hono and pdf-lib. Pure JavaScript, no native dependencies, a 190 MB Alpine image, 55 tests. Two afternoons of work.

Every backend I have ever touched eventually sprouts a PDF-shaped mole. Finance wants a monthly report that staples three sub-reports together. A lawyer wants pages five through ten of a 300-page filing. Support wants to flip a scanned form upside down because the intake workflow fed it in the wrong orientation. And every single time, some poor engineer opens a Jira ticket and writes a one-off script using pdftk, or drags a 400 MB Java container into the compose file for Apache PDFBox, or worse, shells out to a headless LibreOffice.

None of these is the right abstraction. The right abstraction is an HTTP service with four endpoints: /merge, /split, /rotate, /info. Small, boring, self-hosted, readable in an afternoon. I wrote one and called it pdf-merge-api. The punchline is that the entire thing is about 600 lines of TypeScript with zero native dependencies, because pdf-lib is an under-appreciated gem.

🔗 GitHub: https://github.com/sen-ltd/pdf-merge-api

This article is about three things I learned building it: why pure-JavaScript PDF manipulation is the right default in 2026, how to parse a range spec without regretting it, and why the magic-byte check in front of your PDF library is free security you should not skip.

The pdftk problem

If you google "merge PDFs linux command line", the first answer is pdftk. pdftk is a fine tool that happens to be packaged terribly. The original was written in Java and compiled with GCJ, which nobody maintains anymore. The modern replacement is pdftk-java, which requires a JRE. There are builds for Debian, RPM, and Homebrew, but the Alpine package ecosystem is a wasteland of forks, and you will spend an evening figuring out which one has a working libgcj dependency tree.
The alternatives are not much better: Ghostscript is a 200 MB install and a CVE magnet; qpdf is better but still native; mutool is part of the 300 MB MuPDF distribution. All of them get wrapped in a subprocess call from Node or Python or Go, and the glue code is the same shape every time (the first listing at the end of this post).

That glue has five failure modes: the tempdir might not be writable, the write might half-complete, the exec might time out, the exec might succeed but return a non-zero code, and the cleanup might throw on an unrelated error and mask the actual failure. The shell-escape surface for file names is an entire CVE class by itself. Every time you reach for a subprocess, you are trading one real problem for four operational ones.

pdf-lib fixes all of that

pdf-lib is pure JavaScript. It reads and writes PDFs as Uint8Array buffers in memory: no file I/O, no native addons, no node-gyp compilation on Alpine, no apk add poppler-utils in your Dockerfile. The full merge flow becomes the short mergePdfs function in the listings below, and that is the whole thing. No tempdirs. No shell escaping. No exec timeouts. The copyPages call takes care of duplicating resources (fonts, images, form fields) across documents correctly, which is the actual hard part of merging. I did not have to think about it.

The tradeoff, and this is the honest part because pure JS is rarely strictly better, is that pdf-lib does not rasterize pages. It cannot render a PDF to a PNG, and it cannot extract text. Those are separate problems with separate tools. If you need them, wire up pdf.js for rendering, or pdfjs-dist plus a text extractor, or run a different microservice for those jobs. I keep the services single-purpose on purpose. The "merge and inspect" service does not need to grow text extraction.

The range parser

The /split endpoint takes a ranges query string like 1-3,5,7-9 and returns a new PDF with exactly those pages, in that order. The shape of the string is borrowed straight from the print dialog you already know.
Parsing it is the kind of problem where your first instinct is "regex" and your second instinct, after writing the regex, is "I should have just written a parser." I skipped straight to the parser (the parseRanges listing below). Three design choices are worth pointing out:

- Order is preserved, not normalized. parseRanges('3,1,2', { maxPage: 10 }) returns [3, 1, 2]. If the caller wants sorted semantics, they can sort. Preserving order means /split can also reorder pages, which is a feature you don't realize you want until you need it.

- Overlaps produce duplicates. 1-3,2-4 returns [1, 2, 3, 2, 3, 4]. I could have deduplicated but chose not to: if a caller asks for a page twice, they almost certainly mean it, and duplicate pages in the output PDF are harmless. Surprising-but-useful beats helpful-but-lossy.

- Reverse ranges like 5-3 are allowed. This one was driven by a real case: I had a designer who wanted the last three pages of a report in reverse order to use as an appendix. Allowing 5-3 → [5, 4, 3] made the API do exactly what she asked, without forcing her to list pages individually.

And the error handling: a dedicated RangeParseError subclass. This matters at the HTTP layer because the error mapper can do if (err instanceof RangeParseError) return 422 instead of pattern-matching on strings. Named errors are how you keep an error handler honest as the codebase grows. Five named error classes, five distinct HTTP statuses, one switch. The route handlers throw from any depth and the response is correct.

Magic bytes as a security boundary

Here is the part I want to yell about. When a user uploads a file to a PDF processor, the safest thing you can do is not parse it as a PDF unless you are sure it is a PDF. That sounds circular, but it isn't. PDF parsers, all of them, including pdf-lib, are complex state machines that walk an untrusted byte stream. The less untrusted nonsense they see, the better. The cheapest possible filter is the one in the spec itself: a valid PDF file starts with the bytes %PDF-, literally 0x25 0x50 0x44 0x46 0x2d.
That is the first line of the PDF file format, and no compliant reader will accept a file without it. So the first thing my upload path does is run the five-byte check (the isPdfBytes listing below). Five bytes. Five microseconds. And it catches an entire zoo of impostors: misnamed HTML downloads, renamed ZIP archives, stray JPEGs, fuzzer garbage (the full list is further down). Every single one of those fails the check and returns a clean 415 response with a readable error message, long before pdf-lib's parser touches the bytes. Nothing allocates a giant cross-reference table. Nothing recursively walks an object stream. Nothing gets the chance to hit a parser bug. This is as close to free security as you ever get, and crucially, it is also a user-experience improvement, because the error message is "this file is not a PDF" instead of a ten-line stack trace from deep inside a parser.

Note that Content-Type: application/pdf in the multipart header is not sufficient. The client controls the headers. If you only trust what the client sends you, you have learned nothing from the last 20 years of web security. Always read the first five bytes.

Testing with inline fixtures

Here is my favorite thing about choosing pdf-lib: I never have to carry test fixtures. I was initially going to check a few tiny PDFs into tests/fixtures/, and it felt gross: binary blobs in the repo, git-diff churn when I regenerate them, different PDF readers rendering them slightly differently. Then I realized I could just generate the test PDFs from inside the test file, because pdf-lib is already a dependency (the makePdf listing below). No fixtures. No file I/O. No sampling mysteries.

55 tests across five files: 17 for the range parser, 5 for the magic-byte check, 14 for the PDF operations, 15 for the HTTP routes via app.request(), and 4 for config helpers. The whole suite runs in under 400 ms.

If you remember one thing from this article, remember this: exercising a real PDF in tests beats mocking every time. A round-trip through a real parser catches bugs that only exist in real parser state machines, bugs that mocks will never surface.
Operational details: a few small things are worth mentioning because they add up; the full list is at the end of the post, after the code listings. And an honesty section, because I care about the professional-language rule in my own writing: what the service deliberately does not do (no text extraction, no rasterization, no encrypted PDFs, no form filling, no streaming) is listed there too.

The shape of the service is "four verbs over multipart, JSON errors, no state." That is the shape I want more services to have. Four endpoints, four curl lines, one container. Drop it in your compose file the next time finance asks for the merge script, and keep the rest of your infrastructure boring.

The takeaway, beyond the specific service: reach for pure-JavaScript libraries for document manipulation when you can. Native dependencies are an operational tax that compounds over years. pdf-lib is the under-appreciated gem in this space. It solves 80% of the PDF problems a normal web application ever has, with zero ops footprint. The next time you are about to apk add poppler-utils, check whether pdf-lib can do what you need instead. The answer is often yes.


Code listings

The subprocess glue that every pdftk (or qpdf, or mutool) wrapper ends up looking like:

```ts
const tmp = await fs.mkdtemp('/tmp/pdf-');
await fs.writeFile(`${tmp}/a.pdf`, fileA);
await fs.writeFile(`${tmp}/b.pdf`, fileB);
const { stdout, stderr } = await exec(
  `pdftk ${tmp}/a.pdf ${tmp}/b.pdf cat output ${tmp}/out.pdf`
);
const result = await fs.readFile(`${tmp}/out.pdf`);
await fs.rm(tmp, { recursive: true });
return result;
```

The full merge flow with pdf-lib:

```ts
import { PDFDocument } from 'pdf-lib';

export async function mergePdfs(inputs: Uint8Array[]): Promise<Uint8Array> {
  const out = await PDFDocument.create();
  for (const input of inputs) {
    const src = await PDFDocument.load(input);
    const pages = await out.copyPages(src, src.getPageIndices());
    for (const page of pages) out.addPage(page);
  }
  return out.save();
}
```
The range parser:

```ts
export function parseRanges(spec: string, opts: { maxPage: number }): number[] {
  if (typeof spec !== 'string' || spec.trim() === '')
    throw new RangeParseError('range spec is empty');
  const parts = spec.split(',').map(p => p.trim()).filter(p => p.length > 0);
  if (parts.length === 0) throw new RangeParseError('range spec has no parts');
  const result: number[] = [];
  for (const part of parts) {
    if (part.includes('-')) {
      const halves = part.split('-');
      if (halves.length !== 2)
        throw new RangeParseError(`invalid range segment: ${part}`);
      const a = Number(halves[0]);
      const b = Number(halves[1]);
      if (!Number.isInteger(a) || !Number.isInteger(b))
        throw new RangeParseError(`range endpoints must be integers`);
      if (a < 1 || b < 1 || a > opts.maxPage || b > opts.maxPage)
        throw new RangeParseError(`range ${part} out of bounds`);
      if (a <= b) {
        for (let i = a; i <= b; i++) result.push(i);
      } else {
        // reverse range: "5-3" → [5,4,3]
        for (let i = a; i >= b; i--) result.push(i);
      }
    } else {
      const n = Number(part);
      if (!Number.isInteger(n) || n < 1 || n > opts.maxPage)
        throw new RangeParseError(`invalid page: ${part}`);
      result.push(n);
    }
  }
  return result;
}
```
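To make the three design choices concrete (order preserved, overlaps kept, reverse ranges allowed), here is a small demo. The parser is repeated inline, condensed and with simplified error messages, so the snippet runs on its own:

```ts
// Condensed copy of parseRanges from the listing above, inlined so this
// demo is self-contained. Error handling is simplified to one message.
class RangeParseError extends Error {}

function parseRanges(spec: string, opts: { maxPage: number }): number[] {
  const parts = spec.split(',').map(p => p.trim()).filter(p => p.length > 0);
  if (parts.length === 0) throw new RangeParseError('range spec is empty');
  const result: number[] = [];
  for (const part of parts) {
    const halves = part.split('-');
    if (halves.length > 2) throw new RangeParseError(`invalid segment: ${part}`);
    const a = Number(halves[0]);
    const b = Number(halves[halves.length - 1]); // single page: a === b
    if (!Number.isInteger(a) || !Number.isInteger(b) ||
        a < 1 || b < 1 || a > opts.maxPage || b > opts.maxPage)
      throw new RangeParseError(`invalid segment: ${part}`);
    const step = a <= b ? 1 : -1; // step runs backwards for reverse ranges
    for (let i = a; i !== b + step; i += step) result.push(i);
  }
  return result;
}

// Order is preserved, not normalized:
console.log(parseRanges('3,1,2', { maxPage: 10 }));   // [ 3, 1, 2 ]

// Overlaps produce duplicates:
console.log(parseRanges('1-3,2-4', { maxPage: 10 })); // [ 1, 2, 3, 2, 3, 4 ]

// Reverse ranges walk backwards:
console.log(parseRanges('5-3', { maxPage: 10 }));     // [ 5, 4, 3 ]
```

The `i !== b + step` loop is just a compact way to walk in either direction; the full listing above spells the two directions out, which is arguably clearer.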
The error mapper, one instanceof check per named error class:

```ts
export function errorResponse(c: Context, err: unknown): Response {
  if (err instanceof MissingFileError)
    return c.json({ error: 'missing_file' }, 422);
  if (err instanceof PayloadTooLargeError)
    return c.json({ error: 'payload_too_large', limit_mb: err.limitMb }, 413);
  if (err instanceof UnsupportedMediaTypeError)
    return c.json({ error: 'unsupported_media_type' }, 415);
  if (err instanceof RangeParseError)
    return c.json({ error: 'invalid_range', message: err.message }, 422);
  if (err instanceof Error)
    return c.json({ error: 'pdf_processing_failed', message: err.message }, 422);
  return c.json({ error: 'internal_error' }, 500);
}
```

The magic-byte check:

```ts
const PDF_MAGIC = new Uint8Array([0x25, 0x50, 0x44, 0x46, 0x2d]); // "%PDF-"

export function isPdfBytes(bytes: Uint8Array): boolean {
  if (bytes.length < PDF_MAGIC.length) return false;
  for (let i = 0; i < PDF_MAGIC.length; i++) {
    if (bytes[i] !== PDF_MAGIC[i]) return false;
  }
  return true;
}

export function assertPdfBytes(bytes: Uint8Array): void {
  if (!isPdfBytes(bytes)) {
    throw new UnsupportedMediaTypeError(
      'uploaded file does not start with %PDF- magic bytes',
    );
  }
}
```
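The named error classes behind that mapper are one-liners. The article never shows their definitions, so here is a plausible sketch; only the class names and the limitMb field (which errorResponse reads) come from the post, and the constructor shapes are my assumption:

```ts
// Hypothetical definitions for the named error classes used by errorResponse.
// Class names and the limitMb field match the article; everything else is assumed.
export class MissingFileError extends Error {}

export class UnsupportedMediaTypeError extends Error {}

export class RangeParseError extends Error {}

export class PayloadTooLargeError extends Error {
  constructor(public readonly limitMb: number) {
    super(`payload exceeds ${limitMb} MB limit`);
  }
}

// Subclassing Error is what keeps the HTTP mapping honest:
// one instanceof per status, no string matching.
const err: unknown = new PayloadTooLargeError(20);
console.log(err instanceof PayloadTooLargeError); // true
console.log(err instanceof Error);                // true
```

With classes like these, adding a sixth error type means adding one class and one instanceof branch, and nothing else in the codebase changes.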
Generating test fixtures inline, instead of checking binary blobs into the repo:

```ts
async function makePdf(pageCount: number, opts: { title?: string } = {}) {
  const doc = await PDFDocument.create();
  for (let i = 0; i < pageCount; i++) {
    doc.addPage([200, 200]).drawText(`p${i + 1}`);
  }
  if (opts.title) doc.setTitle(opts.title);
  return doc.save();
}

it('preserves metadata of first document in merge', async () => {
  const a = await makePdf(1, { title: 'First Report' });
  const b = await makePdf(1, { title: 'Second' });
  const merged = await mergePdfs([a, b]);
  const doc = await PDFDocument.load(merged);
  expect(doc.getTitle()).toBe('First Report');
});
```

Try it in 30 seconds

```sh
git clone https://github.com/sen-ltd/pdf-merge-api
cd pdf-merge-api
docker build -t pdf-merge-api .
docker run --rm -p 8000:8000 pdf-merge-api

# In another terminal:
curl -F "file=@a.pdf" -F "file=@b.pdf" \
  -o merged.pdf http://localhost:8000/merge
curl -F "file=@report.pdf" \
  "http://localhost:8000/split?ranges=1-2,5" -o excerpt.pdf
curl -F "file=@report.pdf" http://localhost:8000/info | jq
curl -F "file=@report.pdf" \
  "http://localhost:8000/rotate?rotation=90&pages=1,3" -o rotated.pdf
```
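Before the lists below, a quick sanity check of the magic-byte filter against exactly the impostor formats the service sees in practice. The check is restated inline so the snippet runs on its own:

```ts
// Restated from the isPdfBytes listing above, so this demo is self-contained.
const PDF_MAGIC = new Uint8Array([0x25, 0x50, 0x44, 0x46, 0x2d]); // "%PDF-"

function isPdfBytes(bytes: Uint8Array): boolean {
  if (bytes.length < PDF_MAGIC.length) return false;
  for (let i = 0; i < PDF_MAGIC.length; i++) {
    if (bytes[i] !== PDF_MAGIC[i]) return false;
  }
  return true;
}

const enc = new TextEncoder();
console.log(isPdfBytes(enc.encode('%PDF-1.7\n')));                 // true: real PDF header
console.log(isPdfBytes(enc.encode('<html><body>404')));            // false: misnamed HTML download
console.log(isPdfBytes(new Uint8Array([0x50, 0x4b, 0x03, 0x04]))); // false: ZIP ("PK\x03\x04")
console.log(isPdfBytes(new Uint8Array([0xff, 0xd8, 0xff])));       // false: JPEG marker
```

Note the length guard also rejects empty and truncated uploads for free, since anything shorter than five bytes cannot match.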
What the magic-byte check catches:

- HTML files that were misnamed .pdf by a browser auto-download
- ZIP archives (PK\x03\x04) that a user zipped and then renamed because the upload form rejected .zip
- JPEGs (\xff\xd8\xff) that an iOS device produced instead of a PDF
- Totally random garbage that a fuzzer is posting at you to crash pdf-lib

Operational details

- MAX_UPLOAD_MB via env var, default 20. Oversize uploads get 413 with a structured JSON body. Without this, one pathological request can OOM the process. 20 MB is a reasonable default for a document service that is not trying to be a document store.
- In-memory only. The service never touches disk. Uploads come in as Uint8Array, get processed in memory, and get written to the response stream. No tempdirs, no cleanup race conditions, no quota management. If a container falls over, there is no state to recover.
- createApp() factory. Tests drive the entire app via app.request(), Hono's fetch-spec-compatible handler, with real FormData objects. No sockets, no ports, no test servers to tear down. The multipart parser, the magic-byte check, and every happy and sad path of each route are exercised at unit-test speed.
- Structured JSON logs. One line per request on stdout: method, path, status, duration. Any log shipper (fluent-bit, vector, the CloudWatch agent) picks it up unchanged. I used to reach for pino for this; the replacement is seven lines of code, and it is clearer.
- Non-root, multi-stage Alpine build. The final image is 190 MB, mostly the Node runtime. No native packages. Starts in under 200 ms.

What it is not

- No text extraction. pdf-lib does not expose text content. If you need it, use pdf.js or a dedicated pdftxt-api. I deliberately keep this service single-purpose.
- No rasterization. You cannot get a PNG preview of page 1 out of this service. Same reason; different library for that job.
- No encrypted PDFs. pdf-lib has limited support for password-protected documents. I don't try.
- No form filling. pdf-lib can do this, but I didn't wire up an endpoint for it because I didn't have a use case. It would be an afternoon to add.
- No streaming. Everything is buffered in memory up to MAX_UPLOAD_MB. This is fine for documents; it would not be fine for a service handling 500 MB scans.
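One loose end from the operational list: the claim that the pino replacement is seven lines. The repo's actual code may differ, but a sketch in that spirit, with names of my own choosing, looks like this:

```ts
// A plausible seven-line structured request logger: one JSON object per line
// on stdout. The field set (method, path, status, duration) follows the
// article; the function name and shape are assumptions, not the repo's code.
function logRequest(method: string, path: string, status: number, ms: number): string {
  const line = JSON.stringify({ time: new Date().toISOString(), method, path, status, ms });
  console.log(line);
  return line;
}

// One line per request, trivially parseable by fluent-bit or vector:
logRequest('POST', '/merge', 200, 38);
```

Called from a tiny middleware that records the start time and the response status, this is the entire logging stack: no transport config, no child loggers, no dependency to audit.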