Tools: How to Scrape Markdown for RAG Pipelines
Source: Dev.to
If you are building an AI application like a chatbot, a summarizer, or a research agent, you have likely run into the "garbage in, garbage out" problem. You want to let your users interact with a chatbot about your products, so you spin up a headless browser with Puppeteer, dump `document.body.innerHTML`, and feed it to OpenAI or Claude. The model then has to wade through markup, ads, and navigation before it ever reaches the content.

The solution is to stop scraping HTML and start extracting Markdown. In this tutorial, I'll show you how to use the Geekflare API to turn any webpage into LLM-ready Markdown.

LLMs love Markdown. It represents the structure of a document (headers, lists, tables) without the noise of HTML. We are going to use Node.js here, but you can use Python, Go, or any language you prefer. We are not going to use Puppeteer: we don't want to manage headless Chrome instances, so we will offload that to the API. The Geekflare Scraping API handles the rendering, blocking, and formatting.

## Connecting to an LLM for RAG

Now that you have clean Markdown, the cost savings are massive. If you send raw HTML to GPT 5.2, a standard blog post might cost you 4,000 tokens; the Markdown version of the same post is closer to 1,200. The pipeline code later in this post shows how the pieces fit together.
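To get a feel for those savings, here is a rough back-of-the-envelope check. It uses the common ~4 characters per token heuristic, which is only an approximation; real counts vary by model and tokenizer:

```javascript
// Rough token estimate using the ~4 characters/token rule of thumb.
// For billing-accurate numbers, use a real tokenizer such as tiktoken.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

const html = `<div class="content-wrapper">
  <h1 class="hero-title">The Future of AI</h1>
  <div class="ad-banner">...</div>
  <p class="text-body">AI is changing how we code...</p>
</div>`;

const markdown = `# The Future of AI

AI is changing how we code...`;

console.log('HTML tokens (approx):', estimateTokens(html));
console.log('Markdown tokens (approx):', estimateTokens(markdown));
```

Even on this tiny snippet the HTML version costs several times the tokens of the Markdown version, and the gap grows with real pages full of scripts and styles.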
## Conclusion

Building a scraping pipeline in-house is fun until you have to maintain it. Websites change their DOM structure, new anti-bot measures are deployed, and your IP gets banned. If your goal is to build an AI product, don't waste time building a scraper: offload the infrastructure so you can focus on the intelligence. Grab a scraping API key and try the Markdown extraction for yourself.
To see the difference concretely, here is the kind of HTML a scraper normally hands you:

```html
<div class="content-wrapper">
  <h1 class="hero-title">The Future of AI</h1>
  <div class="ad-banner">...</div>
  <p class="text-body">AI is changing how we code...</p>
</div>
```
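For contrast, here is what a naive do-it-yourself conversion of that snippet might look like. This is a minimal sketch with hand-rolled regexes that only understands `<h1>` and `<p>`; real-world HTML (nested tags, scripts, malformed markup) breaks this kind of approach quickly, which is exactly the maintenance burden the API takes off your hands:

```javascript
// Naive HTML-to-Markdown sketch: handles only <h1> and <p>, drops everything else.
// Illustration only; do not use regexes to parse production HTML.
function naiveHtmlToMarkdown(html) {
  const lines = [];
  for (const [, tag, text] of html.matchAll(/<(h1|p)[^>]*>([\s\S]*?)<\/\1>/g)) {
    lines.push(tag === 'h1' ? `# ${text.trim()}` : text.trim());
  }
  return lines.join('\n\n');
}

const sample = `<div class="content-wrapper">
  <h1 class="hero-title">The Future of AI</h1>
  <div class="ad-banner">...</div>
  <p class="text-body">AI is changing how we code...</p>
</div>`;

console.log(naiveHtmlToMarkdown(sample));
// # The Future of AI
//
// AI is changing how we code...
```

Note how the ad banner simply disappears: the boilerplate never makes it into the model's context.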
And here is the same content once it has been converted to Markdown:

```markdown
# The Future of AI

AI is changing how we code...
```
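Clean Markdown also makes chunking for RAG much easier, because headings give you natural chunk boundaries. A minimal sketch, splitting on lines that start with `#` (production pipelines usually also cap chunk size and add overlap):

```javascript
// Split a Markdown document into chunks, one per heading section.
// Each chunk keeps its heading, so the LLM sees the context it belongs to.
function chunkByHeadings(markdown) {
  const chunks = [];
  let current = [];
  for (const line of markdown.split('\n')) {
    if (/^#{1,6}\s/.test(line) && current.length > 0) {
      chunks.push(current.join('\n').trim());
      current = [];
    }
    current.push(line);
  }
  if (current.length > 0) chunks.push(current.join('\n').trim());
  return chunks.filter((c) => c.length > 0);
}

const doc = `# The Future of AI

AI is changing how we code...

## Tooling

Editors are getting smarter.`;

console.log(chunkByHeadings(doc).length); // 2
```

Try doing that with raw HTML and you are back to parsing `<div>` soup to find section boundaries.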
The scraping function itself is a single POST request. It asks the API for `format: 'markdown'` and returns the converted text so callers can use it:

```javascript
const axios = require('axios');

const GEEKFLARE_API_KEY = 'YOUR_API_KEY';

async function scrapeToMarkdown(targetUrl) {
  try {
    const response = await axios.post(
      'https://api.geekflare.com/webscraping',
      {
        url: targetUrl,
        format: 'markdown',
      },
      {
        headers: {
          'x-api-key': GEEKFLARE_API_KEY,
          'Content-Type': 'application/json',
        },
      }
    );

    console.log("--- SCRAPED MARKDOWN ---");
    console.log(response.data.data);
    // Return the Markdown so callers (like the RAG snippet) can use it.
    return response.data.data;
  } catch (error) {
    console.error("Scraping failed:", error.response ? error.response.data : error.message);
  }
}

scrapeToMarkdown('https://docs.docker.com/get-started/');
```
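If you are feeding a whole site into a RAG index, you will be scraping many URLs, and you do not want one blocked page to kill the run. A small sketch of batching with `Promise.allSettled` (here `scrapeToMarkdown` is assumed to return the Markdown string; the example uses a stubbed scraper so it runs without an API key):

```javascript
// Scrape a list of URLs, tolerating individual failures.
// Returns [{ url, markdown }] for the pages that succeeded.
async function scrapeMany(urls, scrape) {
  const results = await Promise.allSettled(urls.map((url) => scrape(url)));
  return results
    .map((result, i) => ({ url: urls[i], result }))
    .filter(({ result }) => result.status === 'fulfilled')
    .map(({ url, result }) => ({ url, markdown: result.value }));
}

// Stubbed scraper standing in for the real API call:
const fakeScrape = async (url) => {
  if (url.includes('broken')) throw new Error('blocked');
  return `# Page at ${url}`;
};

scrapeMany(['https://example.com/a', 'https://example.com/broken'], fakeScrape)
  .then((pages) => console.log(pages.length)); // 1
```

Swap `fakeScrape` for the real `scrapeToMarkdown` and you have a crawler loop; you may also want to throttle concurrency to respect the API's rate limits.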
Once the scraper returns Markdown, wiring it into an LLM call takes only a few lines:

```javascript
const markdown = await scrapeToMarkdown('https://example.com/article');

const completion = await openai.chat.completions.create({
  messages: [
    {
      role: "system",
      content: "You are a helpful assistant. Answer based on the context provided.",
    },
    {
      role: "user",
      content: `Context: ${markdown}\n\nQuestion: Summarize this article.`,
    },
  ],
  model: "gpt-5.2",
});
```

To recap why raw HTML is the wrong input for this pipeline:

- Token waste: raw HTML is roughly 60% boilerplate (divs, classes, scripts, styles), so you pay for tokens that carry no semantic meaning.
- Hallucinations: LLMs get confused by navigation bars, footers, and cookie banners.
- Bot detection: try to scrape a modern React site from your local server and you'll get blocked by Cloudflare or CAPTCHAs.

To follow along you need:

- A Geekflare API key
- Node.js installed

Create a file named `scrape.js` with the scraping function shown above and run it with `node scrape.js`.
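One last practical detail before you run the pipeline: even clean Markdown can blow past your context budget on very long pages. A small guard, again using the rough 4 characters/token heuristic (swap in a real tokenizer for accuracy; the truncation marker is just an illustrative choice):

```javascript
// Trim Markdown context to a token budget before building the prompt.
// Uses a rough ~4 chars/token estimate; a real tokenizer is more precise.
function fitToBudget(markdown, maxTokens) {
  const maxChars = maxTokens * 4;
  if (markdown.length <= maxChars) return markdown;
  return markdown.slice(0, maxChars) + '\n\n[...truncated...]';
}

const short = '# Title\n\nA short article.';
console.log(fitToBudget(short, 1000) === short); // true

const long = 'x'.repeat(10000);
console.log(fitToBudget(long, 1000).endsWith('[...truncated...]')); // true
```

For anything serious, prefer chunking plus retrieval over blind truncation, so the model sees the most relevant sections rather than just the first N tokens.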