Building the Ultimate Reddit Scraper: A Full-Featured, API-Free Data Collection Suite
December 2024 | By Sanjeev Kumar

TL;DR
I built a complete Reddit scraper suite that requires zero API keys. It comes with a beautiful Streamlit dashboard, REST API for integration with tools like Grafana and Metabase, plugin system for post-processing, scheduled scraping, notifications, and much more. Best of all—it’s completely open source.
🔗 GitHub: reddit-universal-scraper

The Problem
If you’ve ever tried to scrape Reddit data for analysis, research, or just personal projects, you know the pain:
- Reddit’s API is heavily rate-limited (especially after the 2023 API changes)
- API keys require approval and are increasingly restricted
- Existing scrapers are often single-purpose: they scrape posts OR comments, not both
- There is no easy way to visualize or analyze the data after scraping
- Running scrapes manually is tedious; you want automation
I decided to solve all of these problems at once.

The Dashboard Experience
One of the standout features is the 7-tab Streamlit dashboard that makes data exploration a joy:
📊 Overview Tab
At a glance, see:
- Total posts and comments
- Cumulative score across all posts
- Media post breakdown
- Posts-over-time chart
- Top 10 posts by score
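Every number on this tab comes straight from the scraped tables. As a rough illustration (not the dashboard's actual code), the same metrics could be computed from the CSV export with pandas and rendered with Streamlit; the file path below is an assumption, and the column names follow the Data Schema section later in this post:

import pandas as pd
import streamlit as st

posts = pd.read_csv("data/r_python/posts.csv")  # hypothetical path for an r/python scrape

# Headline metrics
col1, col2, col3 = st.columns(3)
col1.metric("Total posts", len(posts))
col2.metric("Cumulative score", int(posts["score"].sum()))
col3.metric("Media posts", int(posts["post_type"].isin(["image", "video", "gallery"]).sum()))

# Posts-over-time chart (adjust the timestamp parsing to however created_utc is stored)
day = pd.to_datetime(posts["created_utc"], errors="coerce").dt.date
st.line_chart(posts.assign(day=day).groupby("day").size())

# Top 10 posts by score
st.dataframe(posts.nlargest(10, "score")[["title", "score", "num_comments"]])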
📈 Analytics Tab
This is where it gets interesting:
- Sentiment Analysis: run VADER-based sentiment scoring on your entire dataset
- Keyword Cloud: see the most frequently used terms
- Best Posting Times: data-driven insights on when posts get the most engagement
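The "best posting times" view, for instance, boils down to grouping engagement by hour of day. A minimal sketch of that calculation, assuming the posts CSV and an epoch-seconds created_utc column (both assumptions on my part):

import pandas as pd

posts = pd.read_csv("data/r_python/posts.csv")  # hypothetical path

# Bucket posts by the hour of day they were created
hour = pd.to_datetime(posts["created_utc"], unit="s", errors="coerce").dt.hour

best_hours = (
    posts.assign(hour=hour)
         .groupby("hour")[["score", "num_comments"]]
         .mean()
         .sort_values("score", ascending=False)
)
print(best_hours.head(5))  # hours with the highest average engagement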
🔍 Search Tab
Full-text search across all scraped data with filters for:
- Minimum score
- Post type (text, image, video, gallery, link)
- Author
- Custom sorting
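A search like this maps naturally onto a SQL query against the SQLite store. A hypothetical equivalent (the database filename is an assumption; table and column names follow the Data Schema section):

import sqlite3

conn = sqlite3.connect("data/scraper.db")  # hypothetical database path

query = """
SELECT title, author, score, post_type, permalink
FROM posts
WHERE (title LIKE ? OR selftext LIKE ?)
  AND score >= ?
  AND post_type = ?
ORDER BY score DESC
LIMIT 50
"""
for title, author, score, post_type, permalink in conn.execute(
        query, ("%rust%", "%rust%", 100, "text")):
    print(f"{score:>6}  {title}  ({author})")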
💬 Comments Analysis
• View top-scoring comments
• See who the most active commenters are
• Track comment patterns over time
⚙️ Scraper Controls
Start new scrapes right from the dashboard! Configure:
- Target subreddit/user
- Post limits
- Mode (full/history)
- Media and comment toggles
📋 Job History
Full observability into every scrape job:
- Status tracking (running, completed, failed)
- Duration metrics
- Post/comment/media counts
- Error logging
🔌 Integrations
Pre-configured instructions for connecting:
- Metabase
- Grafana
- DreamFactory
- DuckDB

The Plugin Architecture
I designed a plugin system to allow extensible post-processing. The architecture is simple but powerful:
class Plugin:
    """Base class for all plugins."""
    name = "base"
    description = "Base plugin"
    enabled = True

    def process_posts(self, posts):
        return posts

    def process_comments(self, comments):
        return comments
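Plugins are just transformations over the scraped records, so a runner can apply them one after another. The project's actual loader isn't shown here; a minimal sketch of the idea (the function name is mine):

def run_plugins(plugins, posts, comments):
    """Apply each enabled plugin to the scraped posts and comments, in order."""
    for plugin in plugins:
        if not plugin.enabled:
            continue
        posts = plugin.process_posts(posts)
        comments = plugin.process_comments(comments)
    return posts, comments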
Built-in Plugins

1. Sentiment Tagger: Analyzes the emotional tone of every post and comment using VADER sentiment analysis:
class SentimentTagger(Plugin):
    name = "sentiment_tagger"
    description = "Adds sentiment scores and labels to posts"

    def process_posts(self, posts):
        for post in posts:
            text = f"{post.get('title', '')} {post.get('selftext', '')}"
            score, label = analyze_sentiment(text)
            post['sentiment_score'] = score
            post['sentiment_label'] = label
        return posts

2. Deduplicator: Removes duplicate posts that may appear across multiple scraping sessions (see the sketch after this list).
3. Keyword Extractor: Pulls out the most significant terms from your scraped content for trend analysis.
Creating Your Own Plugin
Drop a new Python file in the plugins/ directory:
from plugins import Plugin

class MyCustomPlugin(Plugin):
    name = "my_plugin"
    description = "Does something cool"
    enabled = True

    def process_posts(self, posts):
        # Your logic here
        return posts

Enable plugins during scraping:
python main.py python --mode full --plugins
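As a concrete example, a hypothetical plugin that drops low-score posts before they are written out could look like this (the class name and threshold are mine, not part of the project):

from plugins import Plugin

class MinScoreFilter(Plugin):
    """Hypothetical example: keep only posts above a score threshold."""
    name = "min_score_filter"
    description = "Filters out posts below a minimum score"
    enabled = True
    min_score = 10

    def process_posts(self, posts):
        return [p for p in posts if p.get("score", 0) >= self.min_score]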
REST API for External Integrations
The REST API opens up the scraper to a whole ecosystem of tools:
python main.py --api

# API at http://localhost:8000
# Docs at http://localhost:8000/docs

Key Endpoints
| Endpoint | Description |
| --- | --- |
| GET /posts | List posts with filters (subreddit, limit, offset) |
| GET /comments | List comments |
| GET /subreddits | All scraped subreddits |
| GET /jobs | Job history |
| GET /query?sql=... | Raw SQL queries for power users |
| GET /grafana/query | Grafana-compatible time-series data |
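Any HTTP client can consume these endpoints. Pulling recent posts into a script might look like the sketch below (the path and query parameters come from the table above; the exact response shape is an assumption):

import requests

BASE_URL = "http://localhost:8000"

# List posts, filtered and paginated via query parameters
resp = requests.get(f"{BASE_URL}/posts",
                    params={"subreddit": "python", "limit": 20, "offset": 0},
                    timeout=30)
resp.raise_for_status()

for post in resp.json():  # assuming a JSON array of post objects
    print(post.get("score"), post.get("title"))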
Real-World Integration: Grafana Dashboard
1. Install the “JSON API” or “Infinity” plugin in Grafana
2. Add a datasource pointing to http://localhost:8000
3. Use the /grafana/query endpoint for time-series panels

SELECT date(created_utc) as time, COUNT(*) as posts
FROM posts
GROUP BY date(created_utc)

Now you have a real-time dashboard tracking Reddit activity!
Get Notified
Configure Discord or Telegram alerts when scrapes complete:

export DISCORD_WEBHOOK_URL="https://discord.com/api/webhooks/..."
export TELEGRAM_BOT_TOKEN="123456:ABC..."
export TELEGRAM_CHAT_ID="987654321"
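Under the hood, a Discord notification is just an HTTP POST to the webhook URL. For reference, the mechanism looks roughly like this sketch (not the scraper's own code):

import os
import requests

def notify_discord(summary: str) -> None:
    webhook_url = os.environ.get("DISCORD_WEBHOOK_URL")
    if not webhook_url:
        return  # notifications are optional
    # Discord webhooks accept a simple JSON payload with a "content" field
    requests.post(webhook_url, json={"content": summary}, timeout=10)

notify_discord("✅ Scrape complete: 100 posts, 245 comments")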
Now scrape summaries land directly in your preferred platform.

Dry Run Mode: Test Before You Commit
One of my favorite features is dry run mode. It simulates the entire scrape without saving any data:
python main.py python --mode full --limit 50 --dry-run
Output:
🧪 DRY RUN MODE - No data will be saved
🧪 DRY RUN COMPLETE!
📊 Would scrape: 100 posts
💬 Would scrape: 245 comments
Perfect for:
- Testing your scrape configuration
- Estimating data volume before committing
- Debugging without cluttering your dataset

Docker Deployment
Quick Start

docker build -t reddit-scraper .

# Run a scrape
docker run -v ./data:/app/data reddit-scraper python --limit 100

# Run with plugins
docker run -v ./data:/app/data reddit-scraper python --plugins
Full Stack with Docker Compose
docker-compose up -d
This spins up:
- Dashboard at http://localhost:8501
- REST API at http://localhost:8000
Deploy to Any VPS
ssh user@your-server-ip
git clone https://github.com/ksanjeev284/reddit-universal-scraper.git
cd reddit-universal-scraper
docker-compose up -d
Open the firewall:
sudo ufw allow 8000
sudo ufw allow 8501
You now have a production-ready Reddit scraping platform!

Data Export Options
CSV (Default)
All scraped data is saved as CSV files:
- data/r_/posts.csv
- data/r_/comments.csv
Parquet (Analytics-Optimized)
Export to columnar format for analytics tools:
python main.py --export-parquet python
Query directly with DuckDB:
import duckdb
duckdb.query("SELECT * FROM 'data/parquet/*.parquet'").df()
Database Maintenance

# Backup
python main.py --backup

# Optimize/vacuum
python main.py --vacuum

# View job history
python main.py --job-history

Data Schema
Posts Table
| Column | Description |
| --- | --- |
| id | Reddit post ID |
| title | Post title |
| author | Username |
| score | Net upvotes |
| num_comments | Comment count |
| post_type | text/image/video/gallery/link |
| selftext | Post body (for text posts) |
| created_utc | Timestamp |
| permalink | Reddit URL |
| is_nsfw | NSFW flag |
| flair | Post flair |
| sentiment_score | -1.0 to 1.0 (with plugins) |
Comments Table
| Column | Description |
| --- | --- |
| comment_id | Comment ID |
| post_permalink | Parent post URL |
| author | Username |
| body | Comment text |
| score | Upvotes |
| depth | Nesting level |
| is_submitter | Whether the author is the OP |
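Because each comment stores its parent's URL in post_permalink, the two tables join cleanly; for example with pandas (the file paths are assumptions):

import pandas as pd

posts = pd.read_csv("data/r_python/posts.csv")
comments = pd.read_csv("data/r_python/comments.csv")

# Attach the parent post's title to every comment via the permalink relationship
merged = comments.merge(
    posts[["permalink", "title"]],
    left_on="post_permalink",
    right_on="permalink",
    how="left",
)
print(merged[["title", "author", "score", "body"]].head())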
Contributing
This is an open-source project and contributions are welcome! Whether it’s:
- Bug fixes
- New plugins
- Documentation improvements
- Feature suggestions
Open an issue or submit a PR. If you found this useful, consider giving the project a ⭐ on GitHub!

Connect
• GitHub: @ksanjeev284
• Project: reddit-universal-scraper
________________________________________
The Solution: Universal Reddit Scraper Suite
After weeks of development, I created a full-featured scraper that:
| Feature | What It Does |
| --- | --- |
| 📊 Full Scraping | Posts, comments, images, videos, galleries—everything |
| 🚫 No API Keys | Uses Reddit’s public JSON endpoints and mirrors |
| 📈 Web Dashboard | Beautiful 7-tab Streamlit UI for analysis |
| 🚀 REST API | Connect Metabase, Grafana, DuckDB, and more |
| 🔌 Plugin System | Extensible post-processing (sentiment analysis, deduplication, keywords) |
| 📅 Scheduled Scraping | Cron-style automation |
| 📧 Notifications | Discord & Telegram alerts when scrapes complete |
| 🐳 Docker Ready | One command to deploy anywhere |
________________________________________
Architecture Deep Dive
How It Works Without API Keys
The secret sauce is in the approach. Instead of using Reddit’s official (and restricted) API, I leverage:
- Reddit’s public JSON endpoints: append .json to almost any Reddit URL and it returns structured data
- Multiple mirror fallbacks: When one source is rate-limited, the scraper automatically rotates through alternatives like Redlib instances
- Smart rate limiting: Built-in delays and cool-down periods to stay under the radar
MIRRORS = [
"https://old.reddit.com",
"https://redlib.catsarch.com",
"https://redlib.vsls.cz",
"https://r.nf",
"https://libreddit.northboot.xyz",
"https://redlib.tux.pizza"
]
When one source fails, it automatically tries the next. No manual intervention needed.
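The fallback logic itself is straightforward. A simplified sketch of the idea, reusing the MIRRORS list above and the delays described later under Performance Considerations (this is my illustration, not the exact implementation):

import time
import requests

def fetch_json(path, mirrors, delay=3.0, retry_wait=30.0):
    """Try each mirror in turn, with a cooldown between requests."""
    headers = {"User-Agent": "reddit-universal-scraper"}
    for base in mirrors:
        try:
            resp = requests.get(f"{base}{path}", headers=headers, timeout=15)
            if resp.status_code == 200:
                return resp.json()
        except requests.RequestException:
            pass  # this mirror is down or blocked - fall through to the next one
        time.sleep(delay)  # cooldown before hitting the next source
    time.sleep(retry_wait)  # every mirror failed - back off before the caller retries
    return None

data = fetch_json("/r/python/new.json?limit=25", MIRRORS)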
The Core Scraping Engine
The scraper operates in three modes:
- Full Mode - The complete package
python main.py python --mode full --limit 100
This scrapes posts, downloads all media (images, videos, galleries), and fetches comments with their full thread hierarchy.
- History Mode - Fast metadata-only
python main.py python --mode history --limit 500
Perfect for quickly building a dataset of post metadata without the overhead of media downloads.
- Monitor Mode - Live watching
python main.py python --mode monitor
Continuously checks for new posts every 5 minutes. Ideal for tracking breaking news or trending discussions.
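Monitor mode is essentially a polling loop that remembers which post IDs it has already stored. In rough terms (reusing the hypothetical fetch_json sketch above; save_post stands in for whatever persistence the scraper actually uses):

import time

def monitor(subreddit, interval=300):  # 300 seconds = 5 minutes
    seen_ids = set()
    while True:
        listing = fetch_json(f"/r/{subreddit}/new.json?limit=25", MIRRORS) or {}
        for child in listing.get("data", {}).get("children", []):
            post = child["data"]
            if post["id"] not in seen_ids:
                seen_ids.add(post["id"])
                save_post(post)  # hypothetical persistence helper (CSV/SQLite)
        time.sleep(interval)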
________________________________________
Scheduled Scraping & Notifications
Automation Made Easy
Set up recurring scrapes with cron-style scheduling:
# Scrape every 60 minutes
python main.py --schedule delhi --every 60

# Scrape r/delhi every 30 minutes in full mode, 50 posts per run
python main.py --schedule delhi --every 30 --mode full --limit 50

________________________________________
Use Cases
- Academic Research
• Analyze subreddit community dynamics
• Track sentiment over time during events
• Study user engagement patterns
- Market Research
• Monitor brand mentions
• Track product feedback
• Identify emerging trends
- Content Creation
• Find popular topics in your niche
• Analyze what makes posts go viral
• Discover optimal posting times
- Data Journalism
• Archive discussions around breaking news
• Analyze public sentiment during events
• Track narrative evolution
- Personal Projects
• Build a dataset for ML training
• Create Reddit-based recommendation systems
• Archive communities you care about
________________________________________
Performance Considerations
Respect Reddit’s Servers
The scraper includes built-in delays:
- 3-second cooldown between requests
- 30-second wait if all mirrors fail
- Automatic mirror rotation to distribute load
Optimize Your Scrapes
• Use --mode history for faster metadata-only scrapes
• Use --no-media if you don’t need images/videos
• Use --no-comments for post-only data
Handle Large Datasets
• Parquet export for analytics queries
• SQLite database for structured storage
• Automatic deduplication to avoid bloat
________________________________________
What’s Next? Roadmap
I’m actively developing new features:
- [ ] Async scraping for even faster data collection
- [ ] Multi-subreddit monitoring in a single command
- [ ] Email notifications in addition to Discord/Telegram
- [ ] Cloud deployment templates (AWS, GCP, Azure)
- [ ] Web-based scraper configuration (no CLI needed)
________________________________________
Getting Started
Prerequisites
• Python 3.10+
• pip
Installation
# Clone the repo
git clone https://github.com/ksanjeev284/reddit-universal-scraper.git
cd reddit-universal-scraper

# Install dependencies
pip install -r requirements.txt

# Your first scrape (r/python, 50 posts)
python main.py python --mode full --limit 50

# Launch the dashboard
python main.py --dashboard

That’s it! You’re now scraping Reddit like a pro.