Building the Ultimate Reddit Scraper: A Full-Featured, API-Free Data Collection Suite

Source: Dev.to

*December 2024 | By Sanjeev Kumar*

**TL;DR:** I built a complete Reddit scraper suite that requires zero API keys. It comes with a beautiful Streamlit dashboard, a REST API for integration with tools like Grafana and Metabase, a plugin system for post-processing, scheduled scraping, notifications, and much more. Best of all, it's completely open source.

🔗 GitHub: [reddit-universal-scraper](https://github.com/ksanjeev284/reddit-universal-scraper)

## The Problem

If you've ever tried to scrape Reddit data for analysis, research, or just personal projects, you know the pain:

- Reddit's API is heavily rate-limited (especially after the 2023 API changes)
- API keys require approval and are increasingly restricted
- Existing scrapers are often single-purpose: they scrape posts *or* comments, not both
- No easy way to visualize or analyze the data after scraping
- Running scrapes manually is tedious; you want automation

I decided to solve all of these problems at once.

---

## The Solution: Universal Reddit Scraper Suite

After weeks of development, I created a full-featured scraper:

| Feature | What It Does |
| --- | --- |
| 📊 Full Scraping | Posts, comments, images, videos, galleries: everything |
| 🚫 No API Keys | Uses Reddit's public JSON endpoints and mirrors |
| 📈 Web Dashboard | Beautiful 7-tab Streamlit UI for analysis |
| 🚀 REST API | Connect Metabase, Grafana, DuckDB, and more |
| 🔌 Plugin System | Extensible post-processing (sentiment analysis, deduplication, keywords) |
| 📅 Scheduled Scraping | Cron-style automation |
| 📧 Notifications | Discord & Telegram alerts when scrapes complete |
| 🐳 Docker Ready | One command to deploy anywhere |

---

## Architecture Deep Dive

### How It Works Without API Keys

The secret sauce is in the approach. Instead of using Reddit's official (and restricted) API, the scraper leverages:

- **Reddit's public JSON endpoints**: every Reddit page has a `.json` suffix that returns structured data
- **Multiple mirror fallbacks**: when one source is rate-limited, the scraper automatically rotates through alternatives like Redlib instances
- **Smart rate limiting**: built-in delays and cool-down periods to stay under the radar

```python
MIRRORS = [
    "https://old.reddit.com",
    "https://redlib.catsarch.com",
    "https://redlib.vsls.cz",
    "https://r.nf",
    "https://libreddit.northboot.xyz",
    "https://redlib.tux.pizza",
]
```

When one source fails, the scraper automatically tries the next. No manual intervention needed.

### The Core Scraping Engine

The scraper operates in three modes:

**1. Full Mode: the complete package**

```bash
python main.py --mode full --limit 100
```

This scrapes posts, downloads all media (images, videos, galleries), and fetches comments with their full thread hierarchy.

**2. History Mode: fast, metadata-only**

```bash
python main.py --mode history --limit 500
```

Perfect for quickly building a dataset of post metadata without the overhead of media downloads.

**3. Monitor Mode: live watching**

```bash
python main.py --mode monitor
```

Continuously checks for new posts every 5 minutes. Ideal for tracking breaking news or trending discussions.
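To make the fallback behaviour concrete, here is a minimal sketch of a fetch helper that walks the mirror list. It assumes each mirror serves the same `.json` listing endpoints and uses `requests` directly; the function name, parameters, and retry values are illustrative and not the project's actual implementation:

```python
import time
import requests

MIRRORS = [
    "https://old.reddit.com",
    "https://redlib.catsarch.com",
    "https://r.nf",
]

def fetch_listing(subreddit: str, limit: int = 100) -> dict:
    """Illustrative sketch: try each mirror's public JSON endpoint until one responds."""
    for base in MIRRORS:
        url = f"{base}/r/{subreddit}/new.json"
        try:
            resp = requests.get(
                url,
                params={"limit": limit},
                headers={"User-Agent": "reddit-universal-scraper-demo"},
                timeout=10,
            )
            if resp.status_code == 200:
                return resp.json()
        except requests.RequestException:
            pass  # this mirror is down or rate-limited; fall through to the next one
        time.sleep(3)  # cool-down between attempts, echoing the scraper's built-in delay
    raise RuntimeError(f"All mirrors failed for r/{subreddit}")
```

The actual engine also handles media downloads, comment threads, and the longer back-off when every mirror fails (see Performance Considerations below); this sketch only shows the rotation idea.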
---

## The Dashboard Experience

One of the standout features is the 7-tab Streamlit dashboard that makes data exploration a joy:

### 📊 Overview Tab

At a glance, see:

- Total posts and comments
- Cumulative score across all posts
- Media post breakdown
- Posts-over-time chart
- Top 10 posts by score

### 📈 Analytics Tab

This is where it gets interesting:

- **Sentiment Analysis**: run VADER-based sentiment scoring on your entire dataset
- **Keyword Cloud**: see the most frequently used terms
- **Best Posting Times**: data-driven insights on when posts get the most engagement

### 🔍 Search Tab

Full-text search across all scraped data, with filters for:

- Minimum score
- Post type (text, image, video, gallery, link)
- Author
- Custom sorting

### 💬 Comments Analysis

- View top-scoring comments
- See who the most active commenters are
- Track comment patterns over time

### ⚙️ Scraper Controls

Start new scrapes right from the dashboard. Configure:

- Target subreddit/user
- Post limits
- Mode (full/history)
- Media and comment toggles

### 📋 Job History

Full observability into every scrape job:

- Status tracking (running, completed, failed)
- Duration metrics
- Post/comment/media counts
- Error logging

### 🔌 Integrations

Pre-configured instructions for connecting:

- Metabase
- Grafana
- DreamFactory
- DuckDB

---

## The Plugin Architecture

I designed a plugin system to allow extensible post-processing. The architecture is simple but powerful:

```python
class Plugin:
    """Base class for all plugins."""

    name = "base"
    description = "Base plugin"
    enabled = True

    def process_posts(self, posts):
        return posts

    def process_comments(self, comments):
        return comments
```

### Built-in Plugins

**Sentiment Tagger** analyzes the emotional tone of every post and comment using VADER sentiment analysis:

```python
class SentimentTagger(Plugin):
    name = "sentiment_tagger"
    description = "Adds sentiment scores and labels to posts"

    def process_posts(self, posts):
        for post in posts:
            text = f"{post.get('title', '')} {post.get('selftext', '')}"
            score, label = analyze_sentiment(text)
            post['sentiment_score'] = score
            post['sentiment_label'] = label
        return posts
```

**Deduplicator** removes duplicate posts that may appear across multiple scraping sessions.

**Keyword Extractor** pulls out the most significant terms from your scraped content for trend analysis.

### Creating Your Own Plugin

Drop a new Python file in the `plugins/` directory:

```python
from plugins import Plugin

class MyCustomPlugin(Plugin):
    name = "my_plugin"
    description = "Does something cool"
    enabled = True

    def process_posts(self, posts):
        # Your logic here
        return posts
```

Enable plugins during scraping:

```bash
python main.py --mode full --plugins
```
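As a worked example of the plugin hooks, here is a hypothetical plugin that drops duplicate and low-score posts. The `ScoreFilterPlugin` class, its threshold, and the reliance on the `id` and `score` fields (which match the posts schema later in this post) are my own illustrative choices, not part of the project:

```python
from plugins import Plugin  # base class shown above


class ScoreFilterPlugin(Plugin):
    """Hypothetical plugin: keep unique posts above a minimum score."""

    name = "score_filter"
    description = "Drops duplicate and low-score posts"
    enabled = True

    MIN_SCORE = 10  # illustrative threshold

    def process_posts(self, posts):
        seen_ids = set()
        kept = []
        for post in posts:
            post_id = post.get("id")
            if post_id in seen_ids:
                continue  # duplicate from an overlapping scrape session
            if post.get("score", 0) < self.MIN_SCORE:
                continue  # below the engagement threshold
            seen_ids.add(post_id)
            kept.append(post)
        return kept
```

Because it subclasses `Plugin` and lives in `plugins/`, it should be picked up by the same `--plugins` flag as the built-ins.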
---

## REST API for External Integrations

The REST API opens up the scraper to a whole ecosystem of tools:

```bash
python main.py --api
# API at http://localhost:8000
# Docs at http://localhost:8000/docs
```

### Key Endpoints

| Endpoint | Description |
| --- | --- |
| `GET /posts` | List posts with filters (subreddit, limit, offset) |
| `GET /comments` | List comments |
| `GET /subreddits` | All scraped subreddits |
| `GET /jobs` | Job history |
| `GET /query?sql=...` | Raw SQL queries for power users |
| `GET /grafana/query` | Grafana-compatible time-series data |

### Real-World Integration: Grafana Dashboard

1. Install the "JSON API" or "Infinity" plugin in Grafana
2. Add a datasource pointing to `http://localhost:8000`
3. Use the `/grafana/query` endpoint for time-series panels

```sql
SELECT date(created_utc) AS time, COUNT(*) AS posts
FROM posts
GROUP BY date(created_utc)
```

Now you have a real-time dashboard tracking Reddit activity!

---

## Scheduled Scraping & Notifications

### Automation Made Easy

Set up recurring scrapes with cron-style scheduling:

```bash
# Scrape every 60 minutes
python main.py --schedule delhi --every 60

# With custom options
python main.py --schedule delhi --every 30 --mode full --limit 50
```

### Get Notified

Configure Discord or Telegram alerts when scrapes complete:

```bash
# Environment variables
export DISCORD_WEBHOOK_URL="https://discord.com/api/webhooks/..."
export TELEGRAM_BOT_TOKEN="123456:ABC..."
export TELEGRAM_CHAT_ID="987654321"
```

Now you get scrape summaries delivered directly to your preferred platform.
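To show how an external script might consume the API, here is a small `requests` call against the `/posts` endpoint. The parameter names come from the endpoint table above; the assumption that the response is a JSON list of post objects with the schema fields shown later is mine:

```python
import requests

API_BASE = "http://localhost:8000"  # where `python main.py --api` serves the REST API

# List recent posts for one subreddit, using the documented subreddit/limit/offset filters.
resp = requests.get(
    f"{API_BASE}/posts",
    params={"subreddit": "delhi", "limit": 25, "offset": 0},
    timeout=10,
)
resp.raise_for_status()

for post in resp.json():
    # Field names assumed to match the posts schema (title, score, num_comments).
    print(f"{post['score']:>6}  {post['num_comments']:>5}  {post['title'][:60]}")
```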
---

## Dry Run Mode: Test Before You Commit

One of my favorite features is dry run mode. It simulates the entire scrape without saving any data:

```bash
python main.py --mode full --limit 50 --dry-run
```

Output:

```
🧪 DRY RUN MODE - No data will be saved
🧪 DRY RUN COMPLETE!
📊 Would scrape: 100 posts
💬 Would scrape: 245 comments
```

Perfect for:

- Testing your scrape configuration
- Estimating data volume before committing
- Debugging without cluttering your dataset

---

## Docker Deployment

### Quick Start

```bash
docker build -t reddit-scraper .

# Run a scrape
docker run -v ./data:/app/data reddit-scraper --limit 100

# Run with plugins
docker run -v ./data:/app/data reddit-scraper --plugins
```

### Full Stack with Docker Compose

```bash
docker-compose up -d
```

This spins up:

- Dashboard at http://localhost:8501
- REST API at http://localhost:8000

### Deploy to Any VPS

```bash
ssh user@your-server-ip
git clone https://github.com/ksanjeev284/reddit-universal-scraper.git
cd reddit-universal-scraper
docker-compose up -d
```

Open the firewall:

```bash
sudo ufw allow 8000
sudo ufw allow 8501
```

You now have a production-ready Reddit scraping platform!

---

## Data Export Options

### CSV (Default)

All scraped data is saved as CSV files:

- `data/r_/posts.csv`
- `data/r_/comments.csv`

### Parquet (Analytics-Optimized)

Export to a columnar format for analytics tools:

```bash
python main.py --export-parquet
```

Query it directly with DuckDB:

```python
import duckdb
duckdb.query("SELECT * FROM 'data/parquet/*.parquet'").df()
```

---

## Database Maintenance

```bash
# Backup
python main.py --backup

# Optimize/vacuum
python main.py --vacuum

# View job history
python main.py --job-history
```

---

## Data Schema

### Posts Table

| Column | Description |
| --- | --- |
| id | Reddit post ID |
| title | Post title |
| author | Username |
| score | Net upvotes |
| num_comments | Comment count |
| post_type | text/image/video/gallery/link |
| selftext | Post body (for text posts) |
| created_utc | Timestamp |
| permalink | Reddit URL |
| is_nsfw | NSFW flag |
| flair | Post flair |
| sentiment_score | -1.0 to 1.0 (with plugins) |

### Comments Table

| Column | Description |
| --- | --- |
| comment_id | Comment ID |
| post_permalink | Parent post URL |
| author | Username |
| body | Comment text |
| score | Upvotes |
| depth | Nesting level |
| is_submitter | Whether the author is the OP |
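Putting the Parquet export and the schema above together, here is one way the DuckDB route could be used for a quick analysis. The glob path repeats the export example and the column names are taken from the posts table, but the specific query is just an illustration:

```python
import duckdb

# Average score and post volume per post type across all exported Parquet files.
df = duckdb.query("""
    SELECT
        post_type,
        COUNT(*)             AS posts,
        ROUND(AVG(score), 1) AS avg_score
    FROM 'data/parquet/*.parquet'
    GROUP BY post_type
    ORDER BY avg_score DESC
""").df()

print(df)
```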
---

## Use Cases

**Academic Research**

- Analyze subreddit community dynamics
- Track sentiment over time during events
- Study user engagement patterns

**Market Research**

- Monitor brand mentions
- Track product feedback
- Identify emerging trends

**Content Creation**

- Find popular topics in your niche
- Analyze what makes posts go viral
- Discover optimal posting times

**Data Journalism**

- Archive discussions around breaking news
- Analyze public sentiment during events
- Track narrative evolution

**Personal Projects**

- Build a dataset for ML training
- Create Reddit-based recommendation systems
- Archive communities you care about

---

## Performance Considerations

### Respect Reddit's Servers

The scraper includes built-in delays:

- 3-second cooldown between requests
- 30-second wait if all mirrors fail
- Automatic mirror rotation to distribute load

### Optimize Your Scrapes

- Use `--mode history` for faster, metadata-only scrapes
- Use `--no-media` if you don't need images/videos
- Use `--no-comments` for post-only data

### Handle Large Datasets

- Parquet export for analytics queries
- SQLite database for structured storage
- Automatic deduplication to avoid bloat

---

## What's Next? Roadmap

I'm actively developing new features:

- [ ] Async scraping for even faster data collection
- [ ] Multi-subreddit monitoring in a single command
- [ ] Email notifications in addition to Discord/Telegram
- [ ] Cloud deployment templates (AWS, GCP, Azure)
- [ ] Web-based scraper configuration (no CLI needed)

---

## Getting Started

### Prerequisites

- Python 3.10+
- pip

### Installation

```bash
# Clone the repo
git clone https://github.com/ksanjeev284/reddit-universal-scraper.git
cd reddit-universal-scraper

# Install dependencies
pip install -r requirements.txt

# Your first scrape
python main.py --mode full --limit 50

# Launch the dashboard
python main.py --dashboard
```

That's it! You're now scraping Reddit like a pro.

---

## Contributing

This is an open-source project and contributions are welcome, whether it's:

- Bug fixes
- New plugins
- Documentation improvements
- Feature suggestions

Open an issue or submit a PR on GitHub. If you found this useful, consider giving the project a ⭐ on GitHub!

---

## Connect

- GitHub: [@ksanjeev284](https://github.com/ksanjeev284)
- Project: [reddit-universal-scraper](https://github.com/ksanjeev284/reddit-universal-scraper)