Tools: How I Run 77 Web Scrapers on a Schedule Without Breaking the Bank - Analysis

The Problem Nobody Talks About

In 2024, I was running 12 scrapers on my laptop. A cron job that silently died at 3 AM. Data gaps I only noticed when a client asked why their dashboard was empty.

By 2026, I manage 77 web scrapers. They run on schedule, retry on failure, alert me when something breaks, and cost me less than $15/month total. Here is the exact setup.

Building a scraper is the easy part. Running it reliably is the hard part. Most tutorials end at `python scraper.py`. They never cover:

- What happens when the target site changes its HTML?
- How do you retry failed runs without duplicate data?
- How do you monitor 77 scrapers without going insane?

Architecture: 3 Layers

```
Layer 1: Scrapers      (Python scripts, each <200 lines)
Layer 2: Orchestration (GitHub Actions / cron on VPS)
Layer 3: Monitoring    (dead simple: webhook → Telegram)
```

Layer 1: Keep Scrapers Stupid Simple

Each scraper does ONE thing:

- Fetch data from ONE source
- Parse it into JSON
- Save to ONE output file

```python
# scraper_hackernews.py — 40 lines total
import json
from datetime import datetime

import httpx

def scrape():
    resp = httpx.get("https://hacker-news.firebaseio.com/v0/topstories.json")
    story_ids = resp.json()[:30]

    stories = []
    for sid in story_ids:
        story = httpx.get(f"https://hacker-news.firebaseio.com/v0/item/{sid}.json").json()
        stories.append({
            "title": story.get("title"),
            "url": story.get("url"),
            "score": story.get("score"),
            "time": datetime.fromtimestamp(story.get("time", 0)).isoformat()
        })

    with open("output/hn_top30.json", "w") as f:
        json.dump(stories, f, indent=2)

    return len(stories)

if __name__ == "__main__":
    count = scrape()
    print(f"Scraped {count} stories")
```

Why this works: No classes. No abstractions. No framework. When HN changes something, I fix 1 line in 1 file.

Layer 2: GitHub Actions as Free Orchestration

For scrapers that run daily or hourly, GitHub Actions is unbeatable:

```yaml
# .github/workflows/scrape-hn.yml
name: Scrape Hacker News

on:
  schedule:
    - cron: "0 */6 * * *"   # Every 6 hours
  workflow_dispatch:

jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install httpx
      - run: python scraper_hackernews.py
      - name: Commit results
        run: |
          git config user.name "Scraper Bot"
          git config user.email "[email protected]"
          git add output/
          git diff --cached --quiet || git commit -m "data: HN $(date -u +%Y-%m-%d)"
          git push
```

Cost: $0. GitHub gives 2,000 free CI/CD minutes per month. At 4 runs/day x 2 min/run, that is 240 min/month per scraper, so I run 8 scrapers completely free.

Layer 3: Monitoring That Actually Works

Forget Grafana dashboards. For scrapers, you need exactly 2 alerts:

- Run failed → Telegram notification
- Data looks wrong → Telegram notification

```python
# monitor.py
import httpx, os, json

TELEGRAM_BOT = os.environ.get("TG_BOT_TOKEN")
CHAT_ID = os.environ.get("TG_CHAT_ID")

def alert(message: str):
    if TELEGRAM_BOT and CHAT_ID:
        httpx.post(
            f"https://api.telegram.org/bot{TELEGRAM_BOT}/sendMessage",
            json={"chat_id": CHAT_ID, "text": f"🚨 {message}"}
        )

def check_output(filepath: str, min_items: int = 1):
    try:
        with open(filepath) as f:
            data = json.load(f)
        if len(data) < min_items:
            alert(f"Low data: {filepath} has {len(data)} items (expected {min_items}+)")
    except Exception as e:
        alert(f"Failed: {filepath} — {e}")
```

The key insight: monitor data QUALITY, not just success/failure. A scraper can "succeed" and return 0 results because the site changed its structure.

What I Learned After 77 Scrapers

1. APIs > HTML Scraping (always)

50 of my 77 scrapers use public APIs, not HTML parsing. APIs are:

- 10x more stable (no CSS selector breakage)
- 5x faster (JSON vs parsing the DOM)
- Free (most APIs have generous free tiers)

2. Retry Strategy: Exponential Backoff

```python
import time

def scrape_with_retry(func, max_retries=3):
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as e:
            if attempt == max_retries - 1:
                alert(f"Final failure after {max_retries} attempts: {e}")
                raise
            time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, ...
```

3. The $15/Month Budget Breakdown

Want More?

I write about web scraping, APIs, and developer tools every week. Need a custom scraper? I build production-grade data extraction tools. 📧 [email protected]

Check out awesome-web-scraping-2026 — 130+ scraping tools, ranked and categorized.
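The retry wrapper above can be exercised end to end without hitting any network. This is a minimal, self-contained sketch: `alert` is stubbed to print instead of posting to Telegram, and `flaky_scrape` is a made-up scraper that fails twice before succeeding.

```python
import time

def alert(message: str):
    # Stub for the Telegram alert, so this sketch runs anywhere.
    print(f"ALERT: {message}")

def scrape_with_retry(func, max_retries=3):
    # Exponential backoff (1s, then 2s) between attempts; the last
    # failure triggers an alert and re-raises.
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as e:
            if attempt == max_retries - 1:
                alert(f"Final failure after {max_retries} attempts: {e}")
                raise
            time.sleep(2 ** attempt)

calls = {"n": 0}

def flaky_scrape():
    # Hypothetical scraper: transient errors on the first two tries.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("read timed out")
    return 30  # number of items scraped

print(scrape_with_retry(flaky_scrape))  # → 30, succeeds on the third attempt
```

Because the first two failures are swallowed and retried, no alert fires; only a failure on the final attempt would page you.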
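To see the data-quality check in action without a Telegram token, here is a variant of `check_output` that returns the alert text instead of sending it. The file names and item counts are illustrative, not from the article.

```python
import json
import os
import tempfile

def check_output(filepath: str, min_items: int = 1):
    # Same logic as the monitor, but returns the alert text
    # (or None when healthy) instead of posting to Telegram.
    try:
        with open(filepath) as f:
            data = json.load(f)
    except Exception as e:
        return f"Failed: {filepath} — {e}"
    if len(data) < min_items:
        return f"Low data: {filepath} has {len(data)} items (expected {min_items}+)"
    return None

tmp = tempfile.mkdtemp()
healthy = os.path.join(tmp, "hn_top30.json")
broken = os.path.join(tmp, "hn_broken.json")

with open(healthy, "w") as f:
    json.dump([{"title": f"story {i}"} for i in range(30)], f)
with open(broken, "w") as f:
    json.dump([], f)  # a run that "succeeded" but scraped nothing

print(check_output(healthy, min_items=10))  # None (no alert)
print(check_output(broken, min_items=10))   # Low data: ... has 0 items
```

The second file is exactly the silent-failure case the article warns about: the scraper exits 0, but the output is empty, and only the quality check catches it.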
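The free-tier arithmetic behind "8 scrapers completely free" can be sanity-checked in a few lines. All figures (2,000 free minutes, 2-minute runs, a 30-day month) come from the text; nothing here queries GitHub.

```python
# Sanity-check of the GitHub Actions free-tier math described above.
FREE_MINUTES_PER_MONTH = 2000
RUNS_PER_DAY = 4          # cron "0 */6 * * *" fires every 6 hours
MINUTES_PER_RUN = 2
DAYS_PER_MONTH = 30

minutes_per_scraper = RUNS_PER_DAY * MINUTES_PER_RUN * DAYS_PER_MONTH
free_scrapers = FREE_MINUTES_PER_MONTH // minutes_per_scraper

print(f"{minutes_per_scraper} min/month per scraper, {free_scrapers} scrapers free")
# → 240 min/month per scraper, 8 scrapers free
```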