Tools: Building Indx.sh - Automating Content Discovery: How We Crawl GitHub for AI Resources

Source: Dev.to
**TL;DR:** I built automated crawlers that discover AI coding prompts, skills, and MCP servers from GitHub, running daily via Vercel cron jobs. Here's how.

## The Manual Content Problem

When I launched indx.sh, I had a content problem. The AI coding ecosystem moves fast:

- New MCP servers pop up daily
- Developers publish cursor rules and skill definitions constantly
- Official repositories get updates
- Star counts change

Manually tracking all this? Impossible.

## The Solution: GitHub Crawlers

I built three automated crawlers that run daily:

- **Prompts Crawler** - discovers `.cursorrules`, `CLAUDE.md`, and `copilot-instructions.md` files
- **Skills Crawler** - finds repos with `SKILL.md` files
- **MCP Crawler** - finds Model Context Protocol servers

All run as Vercel cron jobs, so the directory stays fresh without manual work.

## How the Prompts Crawler Works

The newest crawler searches for AI coding rules across multiple tools:

```javascript
const FILE_SEARCHES = [
  { query: 'filename:.cursorrules', tool: 'cursor' },
  { query: 'filename:CLAUDE.md', tool: 'claude-code' },
  { query: 'filename:copilot-instructions.md', tool: 'copilot' },
];

const REPO_SEARCHES = [
  'cursor-rules in:name,description',
  'awesome-cursorrules',
  'topic:cursor-rules',
];
```

For each match, the crawler will:

- Fetch the content from GitHub
- Generate a slug from owner-repo-filename
- Infer category and tags from content
- Auto-verify repos with 100+ stars
- Upsert to database

The first run indexed 175 prompts across Cursor, Claude Code, and Copilot.

## How the Skills Crawler Works

The key insight: GitHub's code search API lets you search by filename. `filename:SKILL.md` returns every repo with that file.

```javascript
// Search GitHub for SKILL.md files
const { items } = await searchGitHub('filename:SKILL.md');

for (const item of items) {
  // Fetch the actual SKILL.md content
  const content = await fetchFileContent(owner, repo, item.path);

  // Parse frontmatter (name, description, tags)
  const metadata = parseFrontmatter(content);

  // Upsert to database
  await prisma.skill.upsert({
    where: { slug },
    create: { ...metadata, content, githubStars },
    update: { githubStars }, // Keep stars fresh
  });
}
```

## How the MCP Crawler Works

MCP servers are trickier - there's no single file convention. I use multiple search strategies:

```javascript
const SEARCH_STRATEGIES = [
  'mcp server in:name,description',
  'model context protocol server',
  'topic:mcp',
  '@modelcontextprotocol/server',
  'mcp server typescript',
  'mcp server python',
];
```

The pipeline:

- Search GitHub repos sorted by stars
- Filter for MCP-related content
- Fetch package.json for npm package names
- Infer categories from description/topics
- Mark official repos (from the modelcontextprotocol org) as verified

## The Cron Schedule

```json
{
  "crons": [
    { "path": "/api/cron/sync-github-stats", "schedule": "0 3 * * *" },
    { "path": "/api/cron/crawl-skills", "schedule": "0 4 * * *" },
    { "path": "/api/cron/crawl-mcp", "schedule": "0 5 * * *" },
    { "path": "/api/cron/crawl-prompts", "schedule": "0 6 * * *" }
  ]
}
```

- **3:00 AM** - sync GitHub star counts for existing resources
- **4:00 AM** - discover new skills
- **5:00 AM** - discover new MCP servers
- **6:00 AM** - discover new prompts/rules

## Rate Limiting Matters

GitHub's API has limits. Without a token: 10 requests/minute. With a token: 5,000 requests/hour. I handle this carefully:

- Small delays between requests
- Process in batches (50 items per cron run)
- Graceful retry on rate limit errors

```javascript
if (res.status === 403) {
  const resetTime = res.headers.get('X-RateLimit-Reset');
  console.log(`Rate limited. Resets at ${new Date(resetTime * 1000)}`);
  await sleep(60000); // Wait and retry
}
```

## What I Learned

**1. Incremental is better than bulk**

Early versions tried to crawl everything at once. Timeouts, rate limits, chaos. Now I process 50 items per run and let it accumulate.

**2. Deduplication by slug**

The same repo can appear in multiple search strategies. I generate consistent slugs (owner-repo-path) and upsert instead of insert.

**3. Don't trust descriptions**

Many repos have empty or useless descriptions. I fall back to: "AI rules from {owner}/{repo}". Not pretty, but it works.

**4. Official = trusted**

Repos from the modelcontextprotocol, anthropics, or anthropic-ai orgs get auto-verified badges. Community repos need manual verification.

## Current Stats

After the crawlers have been running:

- 790+ MCP servers indexed
- 1,300+ skills discovered
- 300+ prompts/rules indexed
- Daily updates keep star counts fresh

## The Honest Struggle

GitHub search isn't perfect. I get false positives - repos that mention "mcp" but aren't MCP servers. Manual review still matters for quality.

Also: the 50-item limit per cron run means it takes days to fully index everything. Vercel's 10-second timeout for hobby plans is real.

## What's Next

- Better category inference using AI
- README parsing for richer descriptions
- Automatic quality scoring based on stars, activity, docs
- User submissions to fill gaps

## Try It

Browse the auto-discovered resources at indx.sh:

- **Rules & Prompts** - Cursor, Claude Code, Copilot rules
- **MCP Servers** - sorted by GitHub stars
- **Skills** - searchable by name/tags

Got a resource that's not indexed? Submit it or wait for the crawlers to find it.

*This is part 2 of the "Building indx.sh" series.*
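As a bonus for anyone building something similar: the skills crawler snippet calls a `parseFrontmatter` helper that isn't shown. A minimal sketch, assuming flat `key: value` YAML between `---` fences (a production version would use a real YAML parser such as gray-matter):

```javascript
// Hypothetical sketch of a parseFrontmatter helper; not the actual indx.sh
// code. Handles only flat `key: value` pairs, no nesting or lists.
function parseFrontmatter(markdown) {
  const match = markdown.match(/^---\n([\s\S]*?)\n---/);
  if (!match) return {}; // no frontmatter block at the top of the file

  const meta = {};
  for (const line of match[1].split('\n')) {
    const idx = line.indexOf(':');
    if (idx > 0) meta[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
  }
  return meta;
}
```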
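The slug-based dedup described above (owner-repo-path, then upsert) can be sketched roughly like this; the function name and normalization rules are illustrative, not the actual indx.sh implementation:

```javascript
// Reduce any discovered file to a stable slug so the same repo found by
// several search strategies upserts to one row instead of inserting duplicates.
function makeSlug(owner, repo, path) {
  return [owner, repo, path]
    .join('-')
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-') // collapse dots, slashes, spaces into dashes
    .replace(/^-+|-+$/g, '');    // trim leading/trailing dashes
}

// makeSlug('anthropics', 'skills', 'pdf/SKILL.md')
//   → 'anthropics-skills-pdf-skill-md'
```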
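The verification rules mentioned in the post (official orgs get auto-verified badges; the prompts crawler also auto-verifies repos with 100+ stars) might combine into a single check like the sketch below. Merging the two rules into one function is my assumption, not necessarily how indx.sh does it:

```javascript
// Org list and star threshold are taken from the post; the function name
// and the combined rule are illustrative.
const OFFICIAL_ORGS = new Set(['modelcontextprotocol', 'anthropics', 'anthropic-ai']);

function isAutoVerified(owner, stars) {
  return OFFICIAL_ORGS.has(owner.toLowerCase()) || stars >= 100;
}
```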
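Finally, the batching-plus-delay pattern from the rate limiting section, sketched end to end. The 50-item batch size is from the post; the delay value, `crawlBatch` name, and `processItem` callback are stand-ins for the real code:

```javascript
const BATCH_SIZE = 50;
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Process at most BATCH_SIZE items per cron run, pausing between requests
// so a single run stays well under GitHub's rate limits.
async function crawlBatch(items, processItem, delayMs = 500) {
  let processed = 0;
  for (const item of items.slice(0, BATCH_SIZE)) {
    await processItem(item); // e.g. fetch file content + upsert one record
    await sleep(delayMs);    // small delay between requests
    processed++;
  }
  return processed; // leftover items accumulate on the next cron run
}
```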