# 5 AI Document Parsing Tools That Actually Work
2025-12-12
admin
Working with real-world documents is still a pain: PDFs, invoices, random exports from legacy tools. Half the work is just getting them into a clean, structured format your models can use.

This post is about that first step, the one that usually gets skipped in demos and tutorials: parsing and structuring the documents. The tools here handle OCR, layout, tables, forms and file formats so you can focus on the logic around them. I am walking through a few I actually like using, with short code snippets you can drop straight into your own projects.

## 1. Tensorlake

Tensorlake is a Document Ingestion API plus a serverless runtime for agentic data workflows, two big things in one place. You can send PDFs, Office files, images or raw text and get back well-structured content with preserved layout. In short, you can treat it as a Document Ingestion API that handles PDFs, Office files, scans and images, then build agent-style applications on top using their serverless runtime. So instead of wiring up OCR and background jobs with retry logic yourself, you get a single platform that parses, chunks and classifies documents, then feeds the results into your agents or tools.
If you are building invoice extractors, contract analyzers, or any data ingestion pipeline or agent that needs to actually read documents, Tensorlake sits right in the middle of your stack as the ingestion and workflow layer. The code examples below cover some common use cases. To extract text from a PDF, you install the SDK and use the DocumentAI client to upload the file, start a parse job and stream the markdown chunks once parsing is done. That is the basic flow you would use in a backend job that takes uploaded PDFs and turns them into LLM-friendly text for something like RAG or search. Once you have the chunks, you can push them straight into a vector store or a database. For more control over parsing, such as schema-driven output, see the Structured Extraction section of their docs; I leave it to you to explore further.

Running a small agentic app on top of Tensorlake is also straightforward: the city guide example below uses OpenAI Agents with tool calls. I will not walk through that code line by line here, since the post would get unnecessarily long; you can find a full explanation in their GitHub README. To run the application on Tensorlake Cloud, it first needs to be deployed. In short, Tensorlake takes care of spinning up containers, injecting secrets and keeping the function durable so it can retry tool calls without you building your own queue system. There is also a quick Tensorlake document ingestion demo that shows it working on a complex document.

## 2. Docling

Docling comes from the IBM Research team, is licensed under MIT (free and open to commercial use), and turns PDFs, Office docs, images, audio and more into a unified DoclingDocument format. You can then export that into markdown, HTML, DocTags or lossless JSON and plug it straight into RAG, agents or search.
It runs locally and comes with strong layout and table understanding, plus OCR and vision models for scanned or complex documents.

- Multi format parsing - PDF, DOCX, PPTX, XLSX, HTML, images, audio and more into one structured representation.
- Advanced PDF understanding - Page layout, reading order, tables, code, formulas and images handled out of the box.
- Multiple export targets - Export a single DoclingDocument to markdown, HTML, DocTags or structured JSON.
- Local and privacy friendly - Designed to run completely locally.
- Gen AI integrations - Hooks into LangChain, LlamaIndex, Haystack and others out of the box.

The basic flow is intentionally simple: create a converter, give it a source and then decide how you want to export the result. The example below shows the "one document in, one markdown document out" path that you would usually add to your indexing step. It gives you a single markdown document you can split into chunks and feed into a vector database. Docling also ships a CLI; there are plenty of extra flags for more complex use cases, so visit their documentation for details. There is also a quick video by Red Hat showing it in action.

## 3. Unstructured

Unstructured gives you an open source library plus a managed platform to turn unstructured content into structured data for LLM apps. It partitions PDFs, slides, HTML, Office files and images into a standard set of elements that downstream tools can easily consume. On top of that, the ingest layer adds connectors, chunking and embeddings so you can build full ETL-style pipelines around your document sources. The partition quickstart below is the core pattern you will see in most examples, and it is enough to plug into a RAG pipeline: you end up with a list of elements that know their category, which makes it easy to filter for titles, paragraphs or tables before using them further.
For real projects you usually need to process many files at once and save the outputs somewhere. Unstructured ships an ingest CLI built for exactly that. The batch example below runs a full pipeline that reads documents from LOCAL_FILE_INPUT_DIR, partitions them with the hi_res strategy, chunks them by title and writes the structured outputs into your output directory. From there, you can index or analyze them however you like. There is also a quick API quickstart to get an idea.

## 4. Amazon Textract

Amazon Textract is AWS's managed OCR and document analysis service that pulls text, handwriting, layout and structured data out of scanned documents and PDFs. It runs inside your AWS account, plugs into services like S3, Lambda, SNS and SQS, and is used at scale by companies like Paytm for document workflows. The first example below is the basic pattern if you just want the text out of a document: you read the file as bytes, call detect_document_text and print the lines Textract finds. To pull structured data from forms and tables, you use analyze_document with the FORMS and TABLES feature types and point Textract at a document in S3. There is a lot more you can do with Textract; for details, check out the Textract documentation. In production you usually wire this up with S3 triggers and Lambda so new documents are picked up and processed automatically. There is also a quick intro video to Amazon Textract.

## 5. Google Cloud Document AI

Document AI is Google Cloud's document stack that gives you ready-made processors for invoices, receipts, forms, IDs and general OCR. You pick a processor, send it a file and get back a Document object with text, structure, entities and layout info, not just raw strings. The nice part is how it fits into the rest of GCP (Google Cloud Platform): you can drop files into Cloud Storage, trigger processing with Pub/Sub, Cloud Functions or Cloud Run, then push clean data into BigQuery or your app. The example below shows the usual Python flow.
You create a processor in the console, grab its ID, then call it from your code. You send raw bytes plus the MIME type, Document AI runs the selected processor and you get back a Document object. For quick use cases, grabbing doc.text is enough. If you use a form-style processor, Document AI already marks fields as key-value pairs, which you can loop over and map into your own schema. This is the point where a scanned form basically becomes a Python dict. From here, you can push the data into BigQuery, Firestore or any other service you use on GCP. This is just a start, and there is a lot more to it; visit the documentation to learn more. There is also a quick introduction to Google Cloud Document AI worth watching.

## Conclusion

Which one would you actually want to use in your projects? If you know of any other handy AI tools that I haven't covered in this article, do share them in the comments section below. That's it for this article. Thank you so much for reading!
Install the Tensorlake SDK first:

```bash
pip install tensorlake
```
## Code Example: From PDF to markdown

```python
from tensorlake.documentai import DocumentAI, ParseStatus

doc_ai = DocumentAI(api_key="your-api-key")

# Upload and parse document
file_id = doc_ai.upload("/path/to/document.pdf")

# Start parsing
parse_id = doc_ai.parse(file_id)

# Wait until parsing is complete
result = doc_ai.wait_for_completion(parse_id)

if result.status == ParseStatus.SUCCESSFUL:
    # Each chunk is a piece of clean markdown
    for chunk in result.chunks:
        print(chunk.content)
```
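Once you have the chunks, a common next step is re-chunking the text before embedding it. Here is a minimal, dependency-free sketch of an overlapping character chunker; this helper is illustrative and not part of the Tensorlake SDK:

```python
def chunk_text(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into overlapping character chunks for embedding."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

# Stand-in for markdown produced by the parser
parsed_markdown = "# Invoice\n\nVendor: Acme Corp. Total: 1,200 USD.\n" * 30
chunks = chunk_text(parsed_markdown, size=200, overlap=40)
print(f"{len(chunks)} chunks, first starts with {chunks[0][:20]!r}")
```

Production systems usually chunk by tokens or by document structure instead, but the overlap idea is the same: neighboring chunks share context so a fact split across a boundary still lands whole in at least one chunk.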
## Code Example: Tiny agentic app on the Tensorlake runtime

```python
import os

from agents import Agent, Runner
from agents.tool import WebSearchTool, function_tool
from tensorlake.applications import application, function, run_local_application, Image

# Container image with the dependencies the function needs
FUNCTION_CONTAINER_IMAGE = Image(
    base_image="python:3.11-slim",
    name="city_guide_image",
).run("pip install openai openai-agents")


@function_tool
@function(
    description="Gets the weather for a city",
    secrets=["OPENAI_API_KEY"],
    image=FUNCTION_CONTAINER_IMAGE,
)
def get_weather_tool(city: str) -> str:
    agent = Agent(
        name="Weather Reporter",
        instructions="Use web search to find current weather in the city",
        tools=[WebSearchTool()],
    )
    result = Runner.run_sync(agent, f"City: {city}")
    return result.final_output.strip()


@application(tags={"type": "example", "use_case": "city_guide"})
@function(
    description="Creates a simple city guide",
    secrets=["OPENAI_API_KEY"],
    image=FUNCTION_CONTAINER_IMAGE,
)
def city_guide_app(city: str) -> str:
    agent = Agent(
        name="Guide Creator",
        instructions="Make a friendly city guide that includes the current temperature",
        tools=[get_weather_tool],
    )
    result = Runner.run_sync(agent, f"City: {city}")
    return result.final_output.strip()


if __name__ == "__main__":
    city = "Paris"
    if not os.environ.get("OPENAI_API_KEY"):
        print("Error: OPENAI_API_KEY is not set")
        raise SystemExit(1)
    request = run_local_application("city_guide_app", city)
    response = request.output()
    print(response)
```
## Deploying and running on Tensorlake Cloud

Set your Tensorlake API key in the shell session:

```bash
export TENSORLAKE_API_KEY="Paste your API key here"
```

Store your OpenAI key as a Tensorlake secret so the application can call OpenAI:

```bash
tensorlake secrets set OPENAI_API_KEY "Paste your API key here"
```

Deploy the application:

```bash
tensorlake deploy examples/readme_example/city_guide.py
```
Then run it remotely (this mirrors the test script at examples/readme_example/test_remote_app.py; on Tensorlake Cloud each function runs in its own isolated sandbox):

```python
from tensorlake.applications import run_remote_application

city = "San Francisco"

# Run the application remotely
request = run_remote_application("city_guide_app", city)
print(f"Request ID: {request.id}")

# Get the output
response = request.output()
print(response)
```
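For a sense of what the managed durability buys you: without it, calling a flaky tool reliably means writing your own retry loop with backoff. A rough, generic sketch of that boilerplate (nothing Tensorlake-specific; `flaky_tool` is a made-up stand-in for a transient API failure):

```python
import time

def call_with_retries(fn, attempts: int = 3, base_delay: float = 0.01):
    """Retry fn with exponential backoff; re-raise after the last attempt."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

calls = {"n": 0}

def flaky_tool():
    # Fails twice, then succeeds, simulating a transient API error
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "weather: 21C"

print(call_with_retries(flaky_tool))
```

A platform-managed runtime also persists state across retries and survives process restarts, which a loop like this does not.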
## Code Example - Convert and print markdown

```python
from docling.document_converter import DocumentConverter

source = "https://arxiv.org/pdf/2408.09869"  # can also be a local Path(...)

converter = DocumentConverter()
result = converter.convert(source)

markdown = result.document.export_to_markdown()
print(markdown)
```
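Once you have the markdown, a natural chunking strategy is to split on headings so each section becomes one chunk. A small stdlib-only sketch (my own helper, not a Docling API):

```python
def split_by_headings(markdown: str) -> list[str]:
    """Split a markdown string into sections, one per heading line."""
    sections: list[list[str]] = []
    for line in markdown.splitlines():
        # Start a new section at every heading (or for leading text)
        if line.startswith("#") or not sections:
            sections.append([])
        sections[-1].append(line)
    return ["\n".join(s).strip() for s in sections if "\n".join(s).strip()]

doc_md = "# Title\nIntro text.\n## Methods\nDetails here.\n## Results\nNumbers here."
for chunk in split_by_headings(doc_md):
    print(repr(chunk))
```

Heading-based chunks tend to stay topically coherent, which usually improves retrieval quality over fixed-size windows.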
## Code Example: Same idea from the CLI

Install the CLI:

```bash
pip install docling
```

Then run it:

```bash
# Convert a PDF at a URL to markdown on stdout
docling https://arxiv.org/pdf/2206.01062

# Use the GraniteDocling vision language model in the pipeline
docling --pipeline vlm --vlm-model granite_docling https://arxiv.org/pdf/2206.01062
```
## Code Example: Quickstart with partition

```python
from unstructured.partition.auto import partition

# Read and partition a document
elements = partition("example-docs/layout-parser-paper.pdf")

# Inspect a few elements
for el in elements[:5]:
    print(repr(el.category), "->", str(el)[:80], "...")
```
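The category-based filtering pattern looks like this. To keep the sketch runnable without the library, the elements here are plain dataclass stand-ins that mimic the `category` attribute real partition output carries:

```python
from dataclasses import dataclass

@dataclass
class FakeElement:
    """Stand-in for an unstructured element: a category plus its text."""
    category: str
    text: str

elements = [
    FakeElement("Title", "LayoutParser: A Unified Toolkit"),
    FakeElement("NarrativeText", "Recent advances in document analysis..."),
    FakeElement("Table", "Model | F1 | ..."),
    FakeElement("NarrativeText", "We introduce a new benchmark..."),
]

# Keep only the prose, the way you might before embedding for RAG
prose = [el.text for el in elements if el.category == "NarrativeText"]
print(len(prose), "narrative elements")
```

The same one-line filter works on real partition output, for example to route tables to a table-aware summarizer and prose straight to the embedder.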
## Code Example: Batch processing with Ingest CLI

```bash
# Chunk and partition an entire folder of files
unstructured-ingest \
  local \
  --input-path $LOCAL_FILE_INPUT_DIR \
  --output-dir $LOCAL_FILE_OUTPUT_DIR \
  --chunking-strategy by_title \
  --chunk-max-characters 1024 \
  --partition-by-api \
  --api-key $UNSTRUCTURED_API_KEY \
  --partition-endpoint $UNSTRUCTURED_API_URL \
  --strategy hi_res \
  --verbose
```
## Code Example: Detect text from a local file

```python
import boto3

textract = boto3.client("textract")  # uses your AWS credentials

file_path = "sample-doc.png"  # can be any image format
with open(file_path, "rb") as f:
    image_bytes = f.read()

response = textract.detect_document_text(
    Document={"Bytes": image_bytes}
)

for block in response["Blocks"]:
    if block["BlockType"] == "LINE":
        print(block["Text"])
```
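The response shape is easy to work with offline, too. Here is the same LINE filter run against a hand-written sample response; the dict below mirrors the documented `Blocks` structure but is fabricated for illustration:

```python
# Hand-written sample mirroring Textract's response shape
response = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "p1"},
        {"BlockType": "LINE", "Id": "l1", "Text": "INVOICE #001"},
        {"BlockType": "WORD", "Id": "w1", "Text": "INVOICE"},
        {"BlockType": "LINE", "Id": "l2", "Text": "Total: $1,200.00"},
    ]
}

# LINE blocks already aggregate their WORD children, so joining
# the lines gives you readable page text
lines = [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]
text = "\n".join(lines)
print(text)
```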
## Code Example: Extract tables and forms from S3

```python
import boto3

textract = boto3.client("textract")

bucket_name = "my-doc-bucket"
object_key = "invoices/invoice-001.png"

response = textract.analyze_document(
    Document={
        "S3Object": {
            "Bucket": bucket_name,
            "Name": object_key,
        }
    },
    FeatureTypes=["FORMS", "TABLES"],
)

print(f"Found {len(response['Blocks'])} blocks")

# Quick peek at found tables
for block in response["Blocks"]:
    if block["BlockType"] == "TABLE":
        print("Detected a table with Id:", block["Id"])
```
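With FORMS enabled, key-value pairs come back as KEY_VALUE_SET blocks linked through Relationships, so you have to join them yourself. A sketch of that pairing step; the sample blocks are hand-written to mirror the documented response shape, not real API output:

```python
def pair_key_values(blocks: list[dict]) -> dict[str, str]:
    """Join KEY_VALUE_SET blocks into {key text: value text} via Relationships."""
    by_id = {b["Id"]: b for b in blocks}

    def child_text(block: dict) -> str:
        # Concatenate the WORD children of a key or value block
        words = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                words += [by_id[i]["Text"] for i in rel["Ids"]]
        return " ".join(words)

    pairs = {}
    for block in blocks:
        if block["BlockType"] == "KEY_VALUE_SET" and "KEY" in block.get("EntityTypes", []):
            for rel in block.get("Relationships", []):
                if rel["Type"] == "VALUE":
                    for vid in rel["Ids"]:
                        pairs[child_text(block)] = child_text(by_id[vid])
    return pairs

# Hand-written sample: one key ("Total:") linked to one value ("$1,200.00 USD")
blocks = [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Total:"},
    {"Id": "w2", "BlockType": "WORD", "Text": "$1,200.00"},
    {"Id": "w3", "BlockType": "WORD", "Text": "USD"},
]
print(pair_key_values(blocks))
```

AWS also publishes helper libraries that do this traversal for you, but seeing it spelled out makes the Blocks model much less mysterious.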
## Code Example: Send a PDF and read the text

```python
from google.cloud import documentai

project_id = "your-project-id"
location = "us"
processor_id = "your-processor-id"
file_path = "path/to/document.pdf"

client = documentai.DocumentProcessorServiceClient()
name = client.processor_path(project_id, location, processor_id)

with open(file_path, "rb") as f:
    file_bytes = f.read()

raw_document = documentai.RawDocument(
    content=file_bytes,
    mime_type="application/pdf",
)

request = documentai.ProcessRequest(
    name=name,
    raw_document=raw_document,
)

result = client.process_document(request=request)
doc = result.document
print(doc.text[:1000])
```
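A detail worth knowing: most structured parts of a Document reference the full text by offsets rather than duplicating it. Here is a sketch of resolving a text anchor, using plain dicts in place of the proto objects; the field names mirror Document AI's `text_anchor.text_segments` with `start_index`/`end_index`, and the sample document is fabricated:

```python
def anchor_text(full_text: str, text_anchor: dict) -> str:
    """Resolve a text anchor into a string by slicing the document text."""
    parts = []
    for seg in text_anchor.get("text_segments", []):
        start = int(seg.get("start_index", 0))  # start_index is omitted when 0
        end = int(seg["end_index"])
        parts.append(full_text[start:end])
    return "".join(parts)

doc_text = "Invoice Number: INV-42\nDue Date: 2025-01-31\n"
entity = {"text_anchor": {"text_segments": [{"start_index": 16, "end_index": 22}]}}
print(anchor_text(doc_text, entity["text_anchor"]))
```

This is why entities and fields stay cheap even on large documents: they are just index ranges into one shared text blob.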
## Code Example: Turn a parsed form into fields

```python
def clean(text: str) -> str:
    return text.replace("\n", " ").strip()

form_doc = doc  # from the previous example, see above

fields = []
for page in form_doc.pages:
    for field in page.form_fields:
        name = clean(field.field_name.text_anchor.content)
        value = clean(field.field_value.text_anchor.content)
        conf = field.field_value.confidence
        fields.append((name, value, conf))

for name, value, conf in fields:
    print(f"{name}: {value} (conf {conf:.2f})")
```
Tensorlake gives you:

- A Document Ingestion API that turns messy files into clean markdown or structured JSON
- A serverless platform to run agentic workflows on top of that data

Tensorlake features:

- Multi format parsing: Parse PDFs, Office docs, spreadsheets, presentations, images and raw text to markdown or JSON.
- Layout aware output: Preserves tables, sections and reading order so your RAG or search stays aligned with the original document, which many other tools miss.
- Schema based extraction: Use JSON Schema or Pydantic models to pull out only the fields you care about.
- Agentic runtime: Decorate Python functions, run them in sandboxes and let Tensorlake handle scaling, retries and state.

Unstructured features:

- One partition API: autodetects file type and routes to the right parser for you.
- LLM friendly outputs: structured elements with text, metadata and coordinates when needed.
- Source and destination connectors: GitHub, S3 and more via the Ingest CLI and Python library.
- Hosted Partition Endpoint: offloads compute to their API when you want better models or scale.

Amazon Textract features:

- Structured extraction: pulls data from tables, forms and key value pairs, not just plain text.
- Layout and handwriting support: detects paragraphs, titles, layout elements and handwritten text in scans.
- AWS integration: works naturally with S3, Lambda, SNS, SQS and other AWS services.
- Sync and async APIs: low latency calls for single pages plus batch jobs for large multipage docs.
- Security and compliance: encryption, IAM and regional controls for regulated workloads.

What the Textract text detection example does:

- Textract analyzes the image or PDF and returns a list of Blocks that represent words, lines and other elements.
- You filter for blocks of type LINE and print their Text, which is enough for many basic OCR use cases or as a first step before sending text into an LLM.

Google Cloud Document AI features:

- Prebuilt processors: invoice, receipt, form, ID and general OCR processors that work out of the box.
- Tables and forms: key value pairs and tables straight from scanned PDFs and images.
- Custom models: custom extractors, classifiers and splitters when your docs do not match the prebuilt ones.
- Cloud pipelines: runs close to Cloud Storage, Cloud Run and Vertex AI so it is easy to wire into existing GCP setups.