Tools: Docling CLI to parse PDFs and export them to multiple formats (2026)

What is Docling?

I'll be taking you through the process of parsing PDFs into structured formats.

Step 1: Set up

Step 2: Installing docling

Step 3: Creating input and output folders

Step 4: Changing the PDFs into HTML format

Step 5: Changing the PDFs into other formats

1. Markdown

2. JSON

3. Plain text

4. YAML

5. html_split_page

6. DocTags

7. VTT

Step 6: Analyzing the findings

1. PDF with tables

2. PDF with text and images

3. PDF with tables and paragraphs

Docling is an open source document processing library that converts various document formats into structured outputs.

Docling plays an important part in the RAG pipeline.

Step 1: Set up

- Create the project structure in your terminal:

    mkdir docling_cli
    cd docling_cli

- Create your virtual environment and activate it (Fedora).

Step 2: Installing docling

    pip install docling

Check Docling's version:

    docling --version

Step 3: Creating input and output folders

- Create a folder called data where you will store your PDFs.
- Create a new folder named outputs, and inside it create folders called markdown_outputs, html_outputs and json_outputs.

Step 4: Changing the PDFs into HTML format

From the folder that holds your PDFs, run:

    docling --to html *.pdf --output ~/Documents/docling_cli/outputs/html_outputs

Step 6: Analyzing the findings

I used three types of PDFs: one with tables, one with text and images, and one with tables and paragraphs. Here are my key findings:

- In HTML, the rows and columns came out better than they were in the original PDF.
- Markdown outputs were good too: the tables were written in Markdown format without losing anything.
- JSON broke everything down into nested objects.
- Plain text was good too, but not as good as Markdown.
- HTML lost the color of the images.
- Paragraphs came out cleanly as text in all formats.
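The folder setup and the conversion command described above can be combined into one small script. This is only a sketch under a few assumptions: it is run from the project root with the virtual environment active, the PDFs sit in a data folder, and the values passed to --to (md, json, html) match what your installed version lists under docling --help. The helper names out_dir_for and convert_all are made up for this example and are not part of Docling.

```shell
#!/bin/sh
# Sketch: create one output folder per format, then convert every PDF in data/.
set -eu

# Assumed layout: PDFs in ./data, results in ./outputs/<name>_outputs
DATA_DIR="${DATA_DIR:-data}"
OUT_ROOT="${OUT_ROOT:-outputs}"

# Values for --to. Assumed identifiers; confirm them with `docling --help`.
FORMATS="md json html"

# Hypothetical helper: map a format identifier to the folder names used in this post.
out_dir_for() {
  case "$1" in
    md) echo "$OUT_ROOT/markdown_outputs" ;;
    *)  echo "$OUT_ROOT/${1}_outputs" ;;
  esac
}

# Hypothetical helper: create the folders, then run docling once per format.
convert_all() {
  for fmt in $FORMATS; do
    mkdir -p "$(out_dir_for "$fmt")"
  done

  # Collect the PDFs; an unmatched glob stays literal, so check that the first entry exists.
  set -- "$DATA_DIR"/*.pdf
  if [ ! -e "$1" ]; then
    echo "No PDFs found in $DATA_DIR; created output folders only."
    return 0
  fi

  if ! command -v docling >/dev/null 2>&1; then
    echo "docling is not on PATH; activate the virtual environment first." >&2
    return 1
  fi

  for fmt in $FORMATS; do
    docling --to "$fmt" "$@" --output "$(out_dir_for "$fmt")"
  done
}

convert_all
```

To try another format, add its identifier to FORMATS. Because the script creates the output folders itself, they do not need to exist beforehand.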