The Modern Scrapy Developer's Guide (Part 1): Building Your First Spider
2025-12-16
admin
Scrapy can feel daunting. It's a massive, powerful framework, and the documentation can be overwhelming for a newcomer. Where do you even begin?

In this definitive guide, we will walk you through, step by step, how to build a real, multi-page crawling spider. You will go from an empty folder to a clean JSON file of structured data in about 15 minutes. We'll use modern, async/await Python and cover project setup, finding selectors, following links (crawling), and saving your data.

## What We'll Build

We will build a Scrapy spider that crawls the "Fantasy" category on books.toscrape.com, follows the "Next" button to crawl every page in that category, follows the link for every book, and scrapes the name, price, and URL of all 48 books, saving the result to a clean books.json file. A preview of our final spider code appears in the first code block below.

## Prerequisites & Setup

Before we start, you'll need Python 3.x installed. We'll also be using a virtual environment to keep our dependencies clean. You can use standard pip or a modern package manager like uv. First, create a project folder and activate a virtual environment, then install Scrapy.

## Step 1: Initialize Your Project

With Scrapy installed, we can use its built-in command-line tools to generate our project boilerplate. First, create the project itself with scrapy startproject. You'll see a tutorial folder and a scrapy.cfg file appear; this folder contains all your project's logic. Next, generate our first spider with scrapy genspider. If you look in tutorial/spiders/, you'll now see books.py. This is where we'll write our code.

## Step 2: Configure Your Settings

Before we write our spider, let's quickly adjust two settings in tutorial/settings.py. By default, Scrapy respects robots.txt files.
This is good practice, but our test site (toscrape.com) doesn't have one, which can cause a 404 error in our logs, so we'll turn it off for this tutorial. Scrapy is also polite by default and runs slowly; since toscrape.com is a test site built for scraping, we can speed it up.

Warning: these settings are for this test site only. When scraping in the wild, you must be mindful of your target site and use respectful DOWNLOAD_DELAY and CONCURRENT_REQUESTS values.

## Step 3: Finding Our Selectors (with scrapy shell)

To scrape a site, we need to tell Scrapy what data to get. We do this with CSS selectors, and the scrapy shell is the best tool for finding them. Launch the shell on our target category page: it downloads the page and gives you an interactive shell with a response object. You can even type view(response) to open the page in your browser exactly as Scrapy sees it!

Let's find the data we need. By inspecting the page, we see each book sits in an article.product_pod, with its link inside an h3; getall() gives us a clean list of all the URLs. At the bottom of the page, the "Next" button sits in an li.next; get() returns the single link we need for pagination. Finally, open a shell on a product page to find the selectors for the book data. Perfect. We now have all the selectors we need.

## Step 4: Building the Spider (Crawling & Parsing)

Now open tutorial/spiders/books.py, delete the boilerplate, and replace it with the final spider code shown below. The code is clean and efficient: response.follow is smart enough to resolve relative URLs (like page-2.html) for us.

## Step 5: Running The Spider & Saving Data

We're ready to run. Go to your terminal (at the project root) and run scrapy crawl books. You'll see Scrapy start up and, in the logs, all 48 items being scraped! But we want to save this data. Scrapy has a built-in Feed Exporter that makes this easy; we just use the -o (output) flag. The spider runs again, but this time a new books.json file appears in your project root, containing all 48 items, perfectly structured.
## Conclusion & Next Steps

Today you built a powerful, modern, async Scrapy crawler. You learned how to set up a project, find selectors, follow links, and handle pagination. This is just the starting block.
```python
# The final spider we'll build
import scrapy


class BooksSpider(scrapy.Spider):
    name = "books"
    allowed_domains = ["toscrape.com"]
    url: str = "https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html"

    async def start(self):
        yield scrapy.Request(self.url, callback=self.parse_listpage)

    async def parse_listpage(self, response):
        product_urls = response.css("article.product_pod h3 a::attr(href)").getall()
        for url in product_urls:
            yield response.follow(url, callback=self.parse_book)

        next_page_url = response.css("li.next a::attr(href)").get()
        if next_page_url:
            yield response.follow(next_page_url, callback=self.parse_listpage)

    async def parse_book(self, response):
        yield {
            "name": response.css("h1::text").get(),
            "price": response.css("p.price_color::text").get(),
            "url": response.url,
        }
```
```bash
# Create a new folder
mkdir scrapy_project
cd scrapy_project

# Option 1: Using standard pip + venv
python -m venv .venv
source .venv/bin/activate  # On Windows, use: .venv\Scripts\activate

# Option 2: Using uv (a fast, modern alternative)
uv init
```
```bash
# Option 1: Using pip
pip install scrapy

# Option 2: Using uv
uv add scrapy
source .venv/bin/activate
```
```bash
# The 'scrapy startproject' command creates the project structure
# The '.' tells it to use the current folder
scrapy startproject tutorial .
```
```bash
# The 'genspider' command creates a new spider file
# Usage: scrapy genspider <spider_name> <allowed_domain>
scrapy genspider books toscrape.com
```
```python
# tutorial/settings.py

# Find this line and change it to False
ROBOTSTXT_OBEY = False
```
```python
# tutorial/settings.py

# Uncomment or add these lines
CONCURRENT_REQUESTS = 16
DOWNLOAD_DELAY = 0
```
```bash
scrapy shell https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html
```
```python
# In scrapy shell:
>>> response.css("article.product_pod h3 a::attr(href)").getall()
['../../../../the-host_979/index.html',
 '../../../../the-hunted_978/index.html',
 ...]
```
```python
# In scrapy shell:
>>> response.css("li.next a::attr(href)").get()
'page-2.html'
```
```bash
# Exit the shell and open a new one:
scrapy shell https://books.toscrape.com/catalogue/the-host_979/index.html
```
```python
# In scrapy shell:
>>> response.css("h1::text").get()
'The Host'
>>> response.css("p.price_color::text").get()
'£25.82'
```
```python
# tutorial/spiders/books.py
import scrapy


class BooksSpider(scrapy.Spider):
    name = "books"
    allowed_domains = ["toscrape.com"]

    # This is our starting URL (the first page of the Fantasy category)
    url: str = "https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html"

    # This is the modern, async version of 'start_requests'.
    # It's called once when the spider starts.
    async def start(self):
        # We yield our first request, sending the response to 'parse_listpage'
        yield scrapy.Request(self.url, callback=self.parse_listpage)

    # This function handles the *category page*
    async def parse_listpage(self, response):
        # 1. Get all product URLs using the selector we found
        product_urls = response.css("article.product_pod h3 a::attr(href)").getall()

        # 2. For each product URL, follow it and send the response to 'parse_book'
        for url in product_urls:
            yield response.follow(url, callback=self.parse_book)

        # 3. Find the 'Next' page URL
        next_page_url = response.css("li.next a::attr(href)").get()

        # 4. If a 'Next' page exists, follow it and send the response
        #    back to *this same function*
        if next_page_url:
            yield response.follow(next_page_url, callback=self.parse_listpage)

    # This function handles the *product page*
    async def parse_book(self, response):
        # We yield a dictionary of the data we want
        yield {
            "name": response.css("h1::text").get(),
            "price": response.css("p.price_color::text").get(),
            "url": response.url,
        }
```
```bash
scrapy crawl books
```
```bash
scrapy crawl books -o books.json
```
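If you'd rather not pass -o on every run, Scrapy's FEEDS setting configures the same export declaratively (FEEDS is a real Scrapy setting; the exact options shown here are one reasonable configuration, not the only one):

```python
# tutorial/settings.py
# Equivalent to 'scrapy crawl books -o books.json', except that
# 'overwrite': True replaces the file on each run instead of appending.
FEEDS = {
    "books.json": {
        "format": "json",
        "overwrite": True,
    },
}
```

With this in place, a plain `scrapy crawl books` writes the JSON file every time.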
## What's Next? Join the Community

- 💬 TALK: Stuck on this Scrapy code? Ask the maintainers and 5k+ devs in our Discord.
- ▶️ WATCH: This post was based on our video! Watch the full walkthrough on our YouTube channel.
- 📩 READ: Want more? In Part 2, we'll cover Scrapy Items and Pipelines. Get the Extract newsletter so you don't miss it.