Tools: I Built a Tool to Convert Invoice PDFs into Excel - Here’s What Surprised Me

Source: Dev.to

I recently built a small tool that converts invoice PDFs into Excel spreadsheets. To be honest, when I started, I assumed this would be a fairly straightforward project that I could vibe code in a few hours. I'm a software engineer by day and I already run a semi-successful iOS app business, so I thought this would be a piece of cake.

The idea was simple: upload a PDF, extract a table, export a spreadsheet. But it turns out invoice PDFs are one of the messiest, most deceptive file formats I’ve ever worked with. What looked like a simple file conversion problem ended up teaching me a lot about how data actually exists in the real world, and why so many “simple automation” problems stay unsolved longer than you’d expect.

## Why this, and aren't there already converters?

The original idea came from my wife. She is an HR consultant working from home, and I've helped her do her books in the past. Together we've set up some pretty sophisticated spreadsheets for her, but they all depend on entering data from her invoices into the right cells in order to work out her total costs and so on. That data is usually extracted manually from the PDF invoices for the services she pays for, as well as from the invoicing platform she uses to charge for her own services.

After doing some digging, it turns out that a lot of people constantly receive invoices as PDFs, but (without paying for or building a sophisticated software system) spreadsheets are still the easiest way to analyse costs, track expenses, or prepare data for bookkeeping. So sooner or later, someone ends up manually copying numbers into Excel, which is exactly the system we had set up for my wife. I assumed there must be a better way to extract the data, but this is where I hit my first problem.
PDFs can be converted to CSVs in tons of places, but my issue was more specific to what my wife needed: invoices that contain relational data. Line items, quantities, tax, overall vendor details, invoice numbers and so on. This wasn't a case of getting arbitrary data into a list. The data needed to relate.

The problem is, PDFs aren’t data, they’re essentially drawings. The first surprise was understanding what a PDF actually is. When you open an invoice, you see rows, columns, totals, and neatly aligned tables. Your brain interprets structure instantly. But a PDF doesn’t store “tables” or “cells.” It stores positioned text: essentially, instructions for drawing characters at specific coordinates on a page. There’s no inherent relationship between those pieces of text. No rows. No columns. Just placement. Extracting structure means reconstructing meaning from layout, which is much harder than it sounds when just using OCR. In practice, it started to feel a bit like trying to rebuild a spreadsheet from a screenshot.

Line items, the individual rows listing products or services, are where everything seemed to break down when using generic converters. That’s where reality kicks in. Descriptions wrap onto multiple lines. Columns shift slightly between pages. Some vendors merge cells. Others add tax rows halfway through tables. Sometimes the same invoice format changes subtly month to month. Two invoices from the same company can look identical but parse completely differently. I quickly realised that extracting line items reliably was the actual problem worth solving. Not just “PDF to Excel.”

So I started asking around among my wife's friends who also keep their own books (one of whom is a self-employed accountant).

The biggest surprise: people didn’t want integrations. Most business professionals already use sophisticated tools connected directly to accounting software.
They don’t need another converter, it turns out. The people who did care were different:

- small business owners doing their own admin, like my wife's friends
- freelancers preparing expenses (I have a small network of business friends who gave me this feedback)
- ecommerce sellers tracking supplier costs

Basically, people who just needed numbers in Excel quickly. They didn’t want integrations or complex workflows. “I have a PDF. Give me a spreadsheet.” That simplicity changed how I thought about the product entirely.

Edge cases turned out to be the normal cases. In development, you naturally test with clean example documents. I did initially, and I thought it was working great until I tested with an invoice from a friend's iOS app business. The tool just didn't know what to do with only a handful of line items. Real invoices are not clean. In testing, I started seeing:

- scanned invoices instead of digital PDFs
- rotated pages
- inconsistent currency formats
- tables split across multiple pages
- strange VAT layouts

At first these felt like edge cases. Eventually I realised they were the norm. The messiness wasn’t unusual; it was standard. That was probably the biggest mindset shift: software often assumes ideal inputs, but real-world documents rarely cooperate.

## What actually works well

After experimenting with a lot of invoices, some patterns became clear. Conversion works best when:

- the PDF is digitally generated (not scanned)
- tables are consistently aligned
- vendors use stable templates
- text hasn’t been flattened into images

When those conditions are met, extracting structured spreadsheet data becomes surprisingly reliable. Understanding those limits turned out to be just as important as improving accuracy.

I realised that AI can play a role in interpreting data too. That was the real shift that got me from 60% accuracy to nearly 100%. I now use AI in the product to interpret the data structure and return a neatly formatted JSON object, and I sanitise the data in the app when it's returned. And the end result is perfect.

To be clear, the app is still in its infancy. I'm constantly tweaking it and running experiments. It's not making big bucks yet either. But my wife is happy. That is worth something in my books ;)

## Lessons from building a very small tool

Working on this changed how I think about software problems in general. A few things stood out:

- Boring problems are often real problems. Invoice handling isn’t exciting, but people deal with it constantly.
- Automation is rarely simple at the edges. The last 20% of cases take most of the effort.
- Spreadsheets are still universal. Despite endless SaaS tools, Excel remains the lowest common denominator for getting work done.
- Users value speed more than perfection. Saving someone an hour matters more than solving every possible case.

## Where the project is now

The result of all this experimentation became a small tool called BillToSheet, which focuses specifically on turning invoice PDFs into structured Excel or CSV files, especially extracting usable line items. My wife uses it, as do a few of her friends. It's now also starting to gain traction online, with users from around the world converting invoices for free.

It’s still evolving, and honestly the learning hasn’t stopped. Every new invoice format reveals another assumption I didn’t realise I was making. But that’s part of the appeal: small, focused tools can improve steadily as they encounter real-world use.

## A final thought

The biggest surprise wasn’t how hard PDFs are technically. It was realising how much everyday business still runs on documents that were never designed to be data. Invoices look structured because humans read them visually. Teaching software to understand that same structure turns out to be a much deeper challenge, and a strangely satisfying one to work on.
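
To make the earlier point about positioned text a bit more concrete, here is a minimal sketch of what "reconstructing rows from layout" can look like. It is an illustration only, not the code BillToSheet uses. The `Word` type, the `group_into_rows` function and the `y_tolerance` value are all hypothetical names made up for this example, and it assumes some PDF text extractor has already given you each word with its coordinates, with `y` increasing down the page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One piece of positioned text (hypothetical shape, not a real library type)."""
    text: str
    x: float  # horizontal position of the word's left edge
    y: float  # vertical position of the word (assumed to grow down the page)


def group_into_rows(words, y_tolerance=2.0):
    """Group words into rows by vertical position.

    Words are visited top to bottom. A word joins the current row if its y
    is within y_tolerance of the previous word's y; otherwise it starts a
    new row. Each row is then ordered left to right.
    """
    rows = []
    last_y = None
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        if last_y is None or abs(word.y - last_y) > y_tolerance:
            rows.append([])  # gap too large: start a new row
        rows[-1].append(word)
        last_y = word.y
    return [[w.text for w in sorted(row, key=lambda w: w.x)] for row in rows]


# Words in the arbitrary order a PDF might draw them, with slightly
# uneven y values, as often happens in real files.
words = [
    Word("120.00", 400, 100.4),
    Word("Consulting", 50, 100.0),
    Word("2", 300, 99.8),
    Word("Hosting", 50, 115.0),
    Word("1", 300, 115.2),
    Word("15.00", 400, 114.9),
]

for row in group_into_rows(words):
    print(row)
# ['Consulting', '2', '120.00']
# ['Hosting', '1', '15.00']
```

This works on a tidy example, which is rather the point: a description that wraps onto a second line would show up here as its own "row" with no quantity or price, and a column that shifts between pages would break any fixed assumption about which position holds which value. Those are exactly the cases described above as normal rather than exceptional.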