Stop Writing Selectors: How I Vibe Coded a Production AppSumo Scraper

Source: Dev.to

We’ve all been there. You ask an LLM like ChatGPT or Claude to write a simple web scraper for a site like AppSumo. It confidently spits out a script using `soup.select('.price-tag-123')`. You run it, and nothing happens. The classes are dynamic, the data is buried in a Next.js hydration blob, or the site’s anti-bot protection kicks you out before the page even loads.

This is the "Vibe Coding" bottleneck. You want to move from idea to execution using AI, but web scraping often forces you back into the weeds of manual DOM inspection and brittle CSS selectors.

We can break that cycle. This guide covers how to build a production-ready AppSumo scraper using Python and Playwright without writing a single manual CSS selector. Instead, we’ll use "hidden" data structures and AI-generated architecture to create a script that lasts.

## Why Standard LLMs Fail on AppSumo

If you try to build a scraper using a generic prompt, you’ll likely run into three major roadblocks:

- Dynamic/Tailwind classes: AppSumo uses utility-first CSS (Tailwind) and dynamic class names. An LLM might guess a selector like `.text-midnight`, but if the developers change the padding or color scheme, the scraper breaks.
- Client-side rendering: As a modern Next.js application, much of AppSumo’s data isn’t in the initial HTML. It’s loaded dynamically. If you use a simple requests-and-BeautifulSoup approach, you’ll often find yourself staring at an empty div.
- Hallucination: LLMs often imagine that websites have logical IDs like `#product-price`. AppSumo doesn’t work that way.

To build something reliable, stop looking at what the website looks like and start looking at how it stores its data.
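To make the idea concrete, here is a minimal sketch of what "looking at how a site stores its data" means: instead of guessing CSS classes, you search the raw HTML for embedded `application/ld+json` blobs. The sample HTML and field names below are hypothetical, not AppSumo's actual markup.

```python
import json
import re

# Hypothetical snippet of what a modern product page embeds (not real AppSumo markup).
SAMPLE_HTML = """
<html><body>
<div class="p-4 text-midnight">...rendered UI with throwaway class names...</div>
<script type="application/ld+json">
{"@type": "Product", "name": "Example Deal", "offers": {"price": "59.00"}}
</script>
</body></html>
"""

def find_json_ld_products(html: str) -> list[dict]:
    """Pull every application/ld+json blob out of raw HTML and keep Product entries."""
    products = []
    pattern = r'<script type="application/ld\+json">(.*?)</script>'
    for blob in re.findall(pattern, html, re.DOTALL):
        try:
            data = json.loads(blob)
        except json.JSONDecodeError:
            continue  # skip malformed blobs rather than crashing the scrape
        items = data if isinstance(data, list) else [data]
        products.extend(i for i in items if i.get("@type") == "Product")
    return products

products = find_json_ld_products(SAMPLE_HTML)
print(products[0]["name"])  # structured data recovered without a single CSS selector
```

The Tailwind classes can churn on every deploy, but the structured-data blob keeps the same shape because search engines depend on it.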
## The Solution: The AI Scraper Builder

Instead of asking a general-purpose AI to guess selectors, I used the ScrapeOps AI Scraper Builder. This tool analyzes a target URL and generates a Playwright script that targets the most stable data sources on the page: JSON-LD and `__NEXT_DATA__`.

By pasting an AppSumo product URL into the builder, we get a script that doesn’t care if a button turns from blue to green. It targets the raw data blobs the website uses to render itself.

## Code Walkthrough: Analyzing the Generated Script

Let’s look at the core script from the AppSumo Scrapers repository. We’ll focus on the Playwright implementation found in `python/playwright/product_data/scraper/appsumo.com_scraper_product_v1.py`.

### 1. The Data Schema

First, we define the requirements. Using Python dataclasses ensures the script remains type-safe and structured.

```python
@dataclass
class ScrapedData:
    name: str = ""
    brand: str = ""
    price: float = 0.0
    preDiscountPrice: float = 0.0
    currency: str = "USD"
    availability: str = "in_stock"
    aggregateRating: Dict[str, Any] = field(default_factory=dict)
    description: str = ""
    features: List[str] = field(default_factory=list)
    images: List[Dict[str, str]] = field(default_factory=list)
    url: str = ""
```

### 2. Extraction Without Selectors

This is the most critical part of the script. Instead of searching for a price inside a `<span>`, the script evaluates a JavaScript block to find the JSON-LD (structured data) and `__NEXT_DATA__` (Next.js state) objects.

```python
async def extract_data(page: Page) -> Optional[ScrapedData]:
    # Extraction via JSON-LD
    json_ld_data = await page.evaluate("""() => {
        const scripts = Array.from(document.querySelectorAll('script[type="application/ld+json"]'));
        for (const s of scripts) {
            try {
                const data = JSON.parse(s.innerText);
                const findProduct = (obj) => {
                    if (Array.isArray(obj)) return obj.find(item => item['@type'] === 'Product');
                    if (obj['@type'] === 'Product') return obj;
                    return null;
                };
                const product = findProduct(data);
                if (product) return product;
            } catch (e) {}
        }
        return null;
    }""")
```

AppSumo, like many modern sites, embeds a JSON object containing the product name, price, and reviews for SEO purposes. This JSON is highly structured and rarely changes, making it significantly more reliable than CSS selectors.

### 3. Handling Proxies and Anti-Bot Measures

AppSumo employs anti-bot measures that block standard headless browsers. The generated script handles this using playwright-stealth and the ScrapeOps Proxy integrated directly into the browser launch:

```python
# ScrapeOps Residential Proxy Configuration
PROXY_CONFIG = {
    "server": "http://residential-proxy.scrapeops.io:8181",
    "username": "scrapeops",
    "password": API_KEY
}

async def run_scraper():
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            proxy=PROXY_CONFIG
        )
        context = await browser.new_context()
        page = await context.new_page()
        await stealth_async(page)  # Apply stealth patterns
```

## Handling Concurrency and Pipelines

To make this production-ready, the script includes a `DataPipeline` class that handles deduplication and saves data in JSONL format. JSONL is ideal for scraping because it allows you to stream data to a file line by line. If the script crashes on the 500th page, you preserve the first 499 results.

```python
class DataPipeline:
    def __init__(self, jsonl_filename="output.jsonl"):
        self.items_seen = set()
        self.jsonl_filename = jsonl_filename

    def is_duplicate(self, input_data):
        item_key = input_data.get("productId")
        if item_key in self.items_seen:
            return True
        self.items_seen.add(item_key)
        return False

    def add_data(self, scraped_data: ScrapedData):
        data_dict = asdict(scraped_data)
        if not self.is_duplicate(data_dict):
            with open(self.jsonl_filename, mode="a", encoding="UTF-8") as f:
                f.write(json.dumps(data_dict) + "\n")
```

## Running the Scraper

To run this yourself, follow these steps. The repository includes implementations for Python, Node.js, Selenium, and BeautifulSoup.

1. Clone the repo:

```bash
git clone https://github.com/scraper-bank/AppSumo.com-Scrapers.git
cd AppSumo.com-Scrapers/python/playwright
```

2. Install dependencies:

```bash
pip install playwright playwright-stealth
playwright install chromium
```

3. Add your API key: get a free key from ScrapeOps and paste it into the `API_KEY` variable in the script.

4. Execute:

```bash
python product_data/scraper/appsumo.com_scraper_product_v1.py
```

## The Result: Structured Data

The result is a clean, structured JSONL file. There are no HTML tags or messy whitespace, just data ready for a database or spreadsheet:

```json
{
  "name": "Triplo AI",
  "brand": "Triplo AI",
  "price": 59.0,
  "preDiscountPrice": 102.0,
  "currency": "USD",
  "availability": "in_stock",
  "aggregateRating": {"ratingValue": 4.9, "reviewCount": 128},
  "category": "Productivity"
}
```

## To Wrap Up

Vibe coding is a fast way to build, but it requires a specific strategy for the web. By moving away from brittle CSS selectors and toward structured data blobs like JSON-LD, you can build scrapers that are both faster to write and harder to break.

- Don’t fight the DOM: look for `__NEXT_DATA__` or `ld+json` scripts first.
- Use specialized tools: the ScrapeOps AI Scraper Builder handles the heavy lifting of script generation.
- Think in pipelines: use JSONL and deduplication for production-grade data.

For more examples, including Node.js versions and search page scrapers, check out the full AppSumo Scrapers GitHub repository.
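The walkthrough above only showed the JSON-LD path, so as a companion here is a minimal sketch of the other data source the takeaways mention: the `__NEXT_DATA__` blob that Next.js embeds as a JSON script tag. The sample HTML and the `props.pageProps.deal` layout are hypothetical illustrations, not AppSumo's actual state shape.

```python
import json
import re

# Hypothetical page snippet; Next.js sites embed hydration state in a
# <script id="__NEXT_DATA__" type="application/json"> tag.
SAMPLE_HTML = """
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"deal": {"name": "Example Deal", "price": 59.0}}}}
</script>
"""

def extract_next_data(html: str) -> dict:
    """Parse the __NEXT_DATA__ JSON blob out of raw HTML; empty dict if absent."""
    match = re.search(
        r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
        html,
        re.DOTALL,
    )
    if not match:
        return {}
    return json.loads(match.group(1))

state = extract_next_data(SAMPLE_HTML)
# Drill into the (assumed) state shape defensively with .get() chains.
deal = state.get("props", {}).get("pageProps", {}).get("deal", {})
print(deal)  # prints {'name': 'Example Deal', 'price': 59.0}
```

In a Playwright context you would feed `await page.content()` into `extract_next_data`; the same parsing logic applies.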