Building indx.sh: Automating Content Discovery - How We Crawl GitHub for AI Resources
2026-02-02
admin
**TL;DR:** I built automated crawlers that discover AI coding prompts, skills, and MCP servers from GitHub, running daily via Vercel cron jobs. Here's how.

## The Manual Content Problem

When I launched indx.sh, I had a content problem. The AI coding ecosystem moves fast:

- New MCP servers pop up daily
- Developers publish cursor rules and skill definitions constantly
- Official repositories get updates
- Star counts change

Manually tracking all this? Impossible.

## The Solution: GitHub Crawlers

I built three automated crawlers that run daily:

- **Prompts Crawler** - Discovers `.cursorrules`, `CLAUDE.md`, and `copilot-instructions.md` files
- **Skills Crawler** - Finds repos with `SKILL.md` files
- **MCP Crawler** - Finds Model Context Protocol servers

All run as Vercel cron jobs, so the directory stays fresh without manual work.

## How the Prompts Crawler Works

The newest crawler searches for AI coding rules across multiple tools:

```js
const FILE_SEARCHES = [
  { query: 'filename:.cursorrules', tool: 'cursor' },
  { query: 'filename:CLAUDE.md', tool: 'claude-code' },
  { query: 'filename:copilot-instructions.md', tool: 'copilot' },
];

const REPO_SEARCHES = [
  'cursor-rules in:name,description',
  'awesome-cursorrules',
  'topic:cursor-rules',
];
```

For each match:

- Fetch the content from GitHub
- Generate a slug from owner-repo-filename
- Infer category and tags from content
- Auto-verify repos with 100+ stars
- Upsert to database

The first run indexed 175 prompts across Cursor, Claude Code, and Copilot.

## How the Skills Crawler Works

The key insight: GitHub's code search API lets you search by filename. `filename:SKILL.md` returns every repo with that file.

```js
// Search GitHub for SKILL.md files
const { items } = await searchGitHub('filename:SKILL.md');

for (const item of items) {
  const [owner, repo] = item.repository.full_name.split('/');

  // Fetch the actual SKILL.md content
  const content = await fetchFileContent(owner, repo, item.path);

  // Parse frontmatter (name, description, tags)
  const metadata = parseFrontmatter(content);

  // Upsert to database, keyed by a consistent owner-repo-path slug
  const slug = `${owner}-${repo}-${item.path}`;
  await prisma.skill.upsert({
    where: { slug },
    create: { slug, ...metadata, content, githubStars },
    update: { githubStars }, // Keep stars fresh
  });
}
```

## How the MCP Crawler Works

MCP servers are trickier - there's no single file convention. I use multiple search strategies:

```js
const SEARCH_STRATEGIES = [
  'mcp server in:name,description',
  'model context protocol server',
  'topic:mcp',
  '@modelcontextprotocol/server',
  'mcp server typescript',
  'mcp server python',
];
```

For each strategy:

- Search GitHub repos sorted by stars
- Filter for MCP-related content
- Fetch package.json for npm package names
- Infer categories from description/topics
- Mark official repos (from the modelcontextprotocol org) as verified

## The Cron Schedule

```json
{
  "crons": [
    { "path": "/api/cron/sync-github-stats", "schedule": "0 3 * * *" },
    { "path": "/api/cron/crawl-skills", "schedule": "0 4 * * *" },
    { "path": "/api/cron/crawl-mcp", "schedule": "0 5 * * *" },
    { "path": "/api/cron/crawl-prompts", "schedule": "0 6 * * *" }
  ]
}
```

- 3:00 AM - Sync GitHub star counts for existing resources
- 4:00 AM - Discover new skills
- 5:00 AM - Discover new MCP servers
- 6:00 AM - Discover new prompts/rules

## Rate Limiting Matters

GitHub's API has limits. Without a token: 10 requests/minute. With a token: 5,000 requests/hour. I handle this carefully:

- Small delays between requests
- Processing in batches (50 items per cron run)
- Graceful retry on rate limit errors

```js
if (res.status === 403) {
  const resetTime = res.headers.get('X-RateLimit-Reset');
  console.log(`Rate limited. Resets at ${new Date(resetTime * 1000)}`);
  await sleep(60000); // Wait and retry
}
```

## What I Learned

**1. Incremental is better than bulk**

Early versions tried to crawl everything at once. Timeouts, rate limits, chaos. Now I process 50 items per run and let the index accumulate.

**2. Deduplication by slug**

The same repo can appear in multiple search strategies. I generate consistent slugs (`owner-repo-path`) and upsert instead of insert.

**3. Don't trust descriptions**

Many repos have empty or useless descriptions. I fall back to "AI rules from {owner}/{repo}". Not pretty, but it works.

**4. Official = trusted**

Repos from the modelcontextprotocol, anthropics, or anthropic-ai orgs get auto-verified badges. Community repos need manual verification.

## Current Stats

After the crawlers have been running:

- 790+ MCP servers indexed
- 1,300+ skills discovered
- 300+ prompts/rules indexed
- Daily updates keep star counts fresh

## The Honest Struggle

GitHub search isn't perfect. I get false positives - repos that mention "mcp" but aren't MCP servers. Manual review still matters for quality.

Also: the 50-item limit per cron run means it takes days to fully index everything. Vercel's 10-second timeout for hobby plans is real.

## What's Next

- Better category inference using AI
- README parsing for richer descriptions
- Automatic quality scoring based on stars, activity, and docs
- User submissions to fill gaps

## Try It

Browse the auto-discovered resources at indx.sh:

- Rules & Prompts - Cursor, Claude Code, Copilot rules
- MCP Servers - sorted by GitHub stars
- Skills - searchable by name/tags

Got a resource that's not indexed? Submit it or wait for the crawlers to find it.

*This is part 2 of the "Building indx.sh" series.*
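As an appendix, a few sketches of the helpers mentioned above. The skills crawler calls a `parseFrontmatter` function; the real implementation isn't shown in this post, but a minimal version (assuming simple `key: value` frontmatter between `---` fences) might look like this:

```js
// Minimal frontmatter parser sketch. The shape of SKILL.md frontmatter
// (keys like name, description, tags) is assumed, not confirmed.
function parseFrontmatter(content) {
  const match = content.match(/^---\n([\s\S]*?)\n---/);
  if (!match) return {}; // no frontmatter block found

  const metadata = {};
  for (const line of match[1].split('\n')) {
    const idx = line.indexOf(':');
    if (idx === -1) continue; // skip lines without a key: value pair
    metadata[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
  }
  return metadata;
}

const md = '---\nname: pdf-tools\ndescription: Work with PDFs\n---\n# Body';
console.log(parseFrontmatter(md).name); // pdf-tools
```

A real crawler would want a YAML parser for nested values and lists; string splitting like this only covers the flat `key: value` case.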
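The MCP crawler "infers categories from description/topics". One plausible sketch is keyword matching; the categories and keywords below are illustrative, not the actual indx.sh taxonomy:

```js
// Rough keyword-based category inference. CATEGORY_KEYWORDS is a
// made-up example taxonomy; the real crawler's categories may differ.
const CATEGORY_KEYWORDS = {
  database: ['postgres', 'sqlite', 'mysql', 'database'],
  browser: ['puppeteer', 'playwright', 'browser'],
  search: ['search', 'index', 'query'],
};

function inferCategory(description, topics = []) {
  // Match against description text and repo topics, case-insensitively
  const haystack = `${description} ${topics.join(' ')}`.toLowerCase();
  for (const [category, keywords] of Object.entries(CATEGORY_KEYWORDS)) {
    if (keywords.some((kw) => haystack.includes(kw))) return category;
  }
  return 'other'; // fallback when nothing matches
}

console.log(inferCategory('A Postgres MCP server', ['mcp'])); // database
```

This is exactly the kind of heuristic the "Better category inference using AI" roadmap item would replace.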
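The batching-with-delays pattern from the rate limiting section can be sketched as a small loop; `processBatch`, `BATCH_SIZE`, and `DELAY_MS` are illustrative names, not the actual indx.sh code:

```js
// Process at most BATCH_SIZE items per cron run, pausing between
// requests so GitHub's rate limiter stays happy. Values are examples.
const BATCH_SIZE = 50;
const DELAY_MS = 200;

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function processBatch(items, handleItem) {
  let processed = 0;
  for (const item of items.slice(0, BATCH_SIZE)) {
    await handleItem(item); // e.g. fetch + upsert one repo
    await sleep(DELAY_MS);  // small delay between requests
    processed++;
  }
  return processed; // anything past BATCH_SIZE waits for the next run
}
```

Slicing rather than looping the full result set is what keeps each run inside Vercel's hobby-plan timeout; leftover items simply accumulate across daily runs.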
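Finally, the slug-based deduplication from "What I Learned" depends on the slug being deterministic, so the same repo found via two search strategies maps to one database row. A minimal sketch (`makeSlug` is a hypothetical helper, not the actual code):

```js
// Deterministic owner-repo-path slug: lowercase, with dots, slashes,
// and other non-alphanumerics collapsed to single dashes.
function makeSlug(owner, repo, path) {
  return [owner, repo, path]
    .join('-')
    .toLowerCase()
    .replace(/[^a-z0-9-]+/g, '-') // collapse '/', '.', etc. to '-'
    .replace(/-+/g, '-')          // squeeze repeated dashes
    .replace(/^-|-$/g, '');       // trim edge dashes
}

console.log(makeSlug('anthropics', 'skills', 'pdf/SKILL.md'));
// anthropics-skills-pdf-skill-md
```

With a slug like this as the unique key, `upsert` (rather than `insert`) makes re-crawling idempotent: a repo seen again just refreshes its star count instead of creating a duplicate row.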
Tags: tools, utilities, security tools, building, automating, content, discovery, crawl, github, resources