# The Modern Scrapy Developer's Guide (Part 3): Auto-Generating Page Objects with Web Scraping Co-pilot


Source: Dev.to

Welcome to Part 3 of our Modern Scrapy series. The refactor we did in Part 2 was a huge improvement, but it was still a lot of manual work. We had to:

- Manually create our BookItem and BookListPage schemas.
- Manually create the bookstoscrape_com.py Page Object file.
- Manually use scrapy shell to find all the CSS selectors.
- Manually write all the @field parsers.

What if you could do all of that in about 30 seconds?

In this guide, we'll show you how to use the Web Scraping Co-pilot (our VS Code extension) to automatically write 100% of your Items, Page Objects, and even your unit tests. We'll take our simple spider from Part 1 and upgrade it to the professional scrapy-poet architecture from Part 2, but this time, the AI will do all the heavy lifting.

## Prerequisites & Setup

This tutorial assumes you have:

- Completed Part 1 (see above).
- Visual Studio Code installed.
- The Web Scraping Co-pilot extension (which we'll install now).

## Step 1: Installing Web Scraping Co-pilot

Inside VS Code, go to the "Extensions" tab and search for Web Scraping Co-pilot (published by Zyte). Once installed, you'll see a new icon in your sidebar. Open it, and it will automatically detect your Scrapy project. It may ask to install a few dependencies like pytest; allow it to do so. This setup process ensures your environment is ready for AI-powered generation.

## Step 2: Auto-Generating our BookItem

Let's start with the spider from Part 1. Our goal is to create a Page Object for our BookItem and add even more fields than we did in Part 2.

In the Co-pilot chat window:

- Select "Web Scraping."
- Write a prompt like this: "Create a page object for the item BookItem using the sample URL https://books.toscrape.com/catalogue/the-host_979/index.html"

The Co-pilot will now:

- Check your project: It will confirm you have scrapy-poet and pytest (and will offer to install them if you don't).
- Add scrapy-poet settings: It will automatically add the ADDONS and SCRAPY_POET_DISCOVER settings to your settings.py file (see the sketch just below this list).
- Create your items.py: It will create a new BookItem class, but this time it will intelligently add all the fields it can find on the page.
- Create Fixtures: It creates a fixtures folder with the saved HTML and expected JSON output for testing.
- Write the Page Object: It creates the tutorial/pages/bookstoscrape_com.py file and writes the entire Page Object, complete with all parsing logic and selectors, for all the new fields.
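For reference, the scrapy-poet configuration it adds usually looks something like the sketch below. The exact add-on priority and the module path in SCRAPY_POET_DISCOVER are assumptions based on scrapy-poet's standard setup and this tutorial's project layout, so the lines the Co-pilot writes in your project may differ slightly.

```python
# tutorial/settings.py (illustrative sketch, not the Co-pilot's literal output)

# Enable the scrapy-poet add-on so Page Objects and items can be injected
# into spider callbacks.
ADDONS = {
    "scrapy_poet.Addon": 300,
}

# Point scrapy-poet at the package(s) to scan for @handle_urls Page Objects.
SCRAPY_POET_DISCOVER = ["tutorial.pages"]
```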
Here is the auto-generated Item:

```python
# tutorial/items.py (Auto-Generated!)
import attrs


@attrs.define
class BookItem:
    """
    The structured data we extract from a book *detail* page.
    """

    name: str
    price: str
    url: str
    availability: str       # <-- New!
    number_of_reviews: int  # <-- New!
    upc: str                # <-- New!
```

And here is the Page Object it wrote, complete with selectors for every field:

```python
# tutorial/pages/bookstoscrape_com.py (Auto-Generated!)
from web_poet import WebPage, handle_urls, field, returns

from tutorial.items import BookItem


@handle_urls("books.toscrape.com/catalogue")
@returns(BookItem)
class BookDetailPage(WebPage):
    """
    This Page Object handles parsing data from book detail pages.
    """

    @field
    def name(self) -> str:
        return self.response.css("h1::text").get()

    @field
    def price(self) -> str:
        return self.response.css("p.price_color::text").get()

    @field
    def url(self) -> str:
        return self.response.url

    # All of this was written for us!
    @field
    def availability(self) -> str:
        return self.response.css("p.availability::text").getall()[1].strip()

    @field
    def number_of_reviews(self) -> int:
        return int(self.response.css("table tr:last-child td::text").get())

    @field
    def upc(self) -> str:
        return self.response.css("table tr:first-child td::text").get()
```

In 30 seconds, the Co-pilot has done everything we did manually in Part 2, but better: it even added more fields.
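Because the Page Object is plain Python, you can also poke at it by hand before touching the generated tests. The snippet below is a hand-written sketch, not something the Co-pilot produces: the script name, fixture path, and printed fields are assumptions for the sample URL, so adjust them to whatever HTML you actually saved.

```python
# scratch_check.py (hand-written sketch; the fixture path is an assumption)
import asyncio
from pathlib import Path

from web_poet import HttpResponse

from tutorial.pages.bookstoscrape_com import BookDetailPage


def main():
    # Load a locally saved copy of the sample detail page
    html = Path("fixtures/the-host.html").read_bytes()
    response = HttpResponse(
        url="https://books.toscrape.com/catalogue/the-host_979/index.html",
        body=html,
    )

    # Instantiate the Page Object directly; no crawl or Scrapy engine needed
    page = BookDetailPage(response=response)

    # to_item() is async, so drive it with asyncio
    book = asyncio.run(page.to_item())
    print(book.name, book.price, book.upc, book.number_of_reviews)


if __name__ == "__main__":
    main()
```

This is also a handy way to debug a single selector later without re-running the whole spider.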
## Step 3: Running the AI-Generated Tests

The best part? The Co-pilot also wrote unit tests for you. It created a tests folder with test_bookstoscrape_com.py. You can just click "Run Tests" in the Co-pilot UI (or run pytest in your terminal).

```
$ pytest
================ test session starts ================
...
tests/test_bookstoscrape_com.py::test_book_detail[book_0] PASSED
tests/test_bookstoscrape_com.py::test_book_detail[book_1] PASSED
...
================ 8 tests passed in 0.10s ================
```

Your parsing logic is now fully tested, and you didn't write a single line of test code.

## Step 4: Refactoring the Spider (The Easy Way)

Now, we just update our tutorial/spiders/books.py to use this new architecture, just like in Part 2.

```python
# tutorial/spiders/books.py
import scrapy

# Import our new, auto-generated Item class
from tutorial.items import BookItem


class BooksSpider(scrapy.Spider):
    name = "books"
    # ... (rest of spider from Part 1) ...

    async def parse_listpage(self, response):
        product_urls = response.css("article.product_pod h3 a::attr(href)").getall()
        for url in product_urls:
            # We just tell Scrapy to call parse_book
            yield response.follow(url, callback=self.parse_book)

        next_page_url = response.css("li.next a::attr(href)").get()
        if next_page_url:
            yield response.follow(next_page_url, callback=self.parse_listpage)

    # We ask for the BookItem, and scrapy-poet does the rest!
    async def parse_book(self, response, book: BookItem):
        yield book
```
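At this point you can already run the spider and inspect the richer items it yields. A minimal run that writes everything to a file (books.json is just an example filename; -O overwrites it on each run):

```
$ scrapy crawl books -O books.json
```

Each record should now include availability, number_of_reviews, and upc alongside the original name, price, and url.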
## Step 5: Auto-Generating our BookListPage

We can repeat the exact same process for our list page to finish the refactor. In the Co-pilot chat window, write:

"Create a page object for the list item BookListPage using the sample URL https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html"

- The Co-pilot will create the BookListPage item in items.py.
- It will create the BookListPageObject in bookstoscrape_com.py with the parsers for book_urls and next_page_url (see the sketch at the end of this step).
- It will write and pass the tests.

Now we can update our spider one last time to be fully architected.

```python
# tutorial/spiders/books.py (FINAL VERSION)
import scrapy

from tutorial.items import BookItem, BookListPage  # Import both


class BooksSpider(scrapy.Spider):
    # ... (name, allowed_domains, url) ...

    async def start(self):
        yield scrapy.Request(self.url, callback=self.parse_listpage)

    # We now ask for the BookListPage item!
    async def parse_listpage(self, response, page: BookListPage):
        # All parsing logic is GONE from the spider.
        for url in page.book_urls:
            yield response.follow(url, callback=self.parse_book)

        if page.next_page_url:
            yield response.follow(page.next_page_url, callback=self.parse_listpage)

    async def parse_book(self, response, book: BookItem):
        yield book
```

Our spider is now just a "crawler." It has zero parsing logic. All the hard work of finding selectors and writing parsers was automated by the Co-pilot.
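The post doesn't print the generated list-page code, so here is a rough sketch of what it typically ends up looking like, based on the fields the final spider uses (book_urls and next_page_url). The class and item names follow the step above, but the URL pattern, selectors, and exact structure are assumptions; your generated files may differ.

```python
# Sketch of the list-page code the Co-pilot generates; not its literal output.
import attrs
from web_poet import WebPage, handle_urls, field, returns


# Goes in tutorial/items.py
@attrs.define
class BookListPage:
    book_urls: list
    next_page_url: str


# Goes in tutorial/pages/bookstoscrape_com.py, next to BookDetailPage
@handle_urls("books.toscrape.com/catalogue/category")
@returns(BookListPage)
class BookListPageObject(WebPage):
    """Parses a category listing page into book URLs and the next-page link."""

    @field
    def book_urls(self) -> list:
        # Relative URLs of every book shown on this listing page
        return self.response.css("article.product_pod h3 a::attr(href)").getall()

    @field
    def next_page_url(self) -> str:
        # Relative URL of the "next" pagination link; None on the last page
        return self.response.css("li.next a::attr(href)").get()
```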
## Conclusion: The "Hybrid Developer"

The Web Scraping Co-pilot doesn't replace you. It accelerates you. It automates the 90% of work that is "grunt work" (finding selectors, writing boilerplate, creating tests) so you can focus on the 10% that matters: crawling logic, strategy, and handling complex sites. This is how we, as the maintainers of Scrapy, build spiders professionally.

## What's Next? Join the Community.

💬 TALK: Stuck on this Scrapy code? Ask the maintainers and 5k+ devs in our Discord.

▶️ WATCH: This post was based on our video! Watch the full walkthrough on our YouTube channel.

📩 READ: Want more? Get the Extract newsletter so you don't miss the next part of this series.