Now that we've set up our environment, it's time to dive into Scrapy. Scrapy is a powerful web scraping framework that provides a lot of built-in functionality for extracting data from websites. In this section, we'll cover what Scrapy is, why it's useful, and how to get started with a simple project.
Scrapy is an open-source web crawling framework for Python used to build web scrapers. It's built on top of Twisted, an asynchronous networking library, which lets it handle many requests concurrently.
There are several reasons to choose Scrapy for your web scraping needs:
- Robustness: Scrapy is well-suited to large-scale and complex scraping tasks, and includes built-in features for handling errors and retrying failed requests.
- Speed: Because Scrapy handles requests asynchronously, it can fetch many pages in parallel, significantly reducing the time required to scrape a site.
- Extensibility: Scrapy is highly extensible, letting you plug in new functionality through middlewares, extensions, and pipelines.
- Ease of Use: Despite its sophistication, Scrapy provides a convenient command-line interface that streamlines creating new projects and spiders, making it relatively easy to use once you understand the basics.
- Comprehensiveness: Scrapy provides everything you need in one package, from making requests and selecting data to handling item pipelines, stats collection, caching, and more.
To begin, we'll create a new Scrapy project. In your terminal, navigate to the directory where you want the project to live, then run:

```
scrapy startproject myproject
```
Replace "myproject" with whatever you want to name your project. This command creates a new folder with the project name, and sets up the necessary files and directory structure for a Scrapy project.
A Scrapy project typically includes the following components:
- Spiders: Classes you define to scrape information from a website (or a group of websites). They must subclass scrapy.Spider and define the initial requests to make.
- Items: Custom Python objects that define the structure of the data you're scraping (a sketch of an item and a pipeline follows this list).
- Item Pipeline: A series of components that process the items returned by spiders, typically used for cleaning and storing data.
- Middlewares: Hooks into the Scrapy engine that let you add custom functionality or extend existing features.
- Settings: A configuration file (settings.py) that customizes the behavior of Scrapy's components, including default request headers, timeout values, maximum concurrent requests, and more.
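To make the Items and Item Pipeline pieces concrete, here is a minimal sketch. The QuoteItem fields (text, author, tags) and the CleanTextPipeline class are illustrative names chosen to match the quotes example below; they are not generated for you by startproject.

```python
# items.py -- describes the shape of each scraped record (illustrative fields)
import scrapy

class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()


# pipelines.py -- a tiny pipeline that strips stray whitespace from the quote text
class CleanTextPipeline:
    def process_item(self, item, spider):
        if item.get('text'):
            item['text'] = item['text'].strip()
        return item
```

A pipeline only runs once it's registered in settings.py, for example ITEM_PIPELINES = {'myproject.pipelines.CleanTextPipeline': 300}, where the number sets its order relative to other pipelines.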
A "spider" in Scrapy is a class that handles the fetching and parsing of web pages. Here's an example of a simple spider that scrapes quotes from http://quotes.toscrape.com:
```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
```
To run your spider, save it in a file (for example quotes_spider.py) inside your project's spiders/ directory, then, from the project's top-level directory, run:

```
scrapy crawl quotes
```
This command should output the data scraped from the page to your command line.
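If you'd rather save the results than read them off the console, Scrapy's feed exports can write them straight to a file. For example:

```
scrapy crawl quotes -o quotes.json
```

The output format is inferred from the file extension; JSON, JSON Lines, CSV, and XML are supported out of the box.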
Scrapy is a feature-rich framework designed for efficiency and convenience when scraping websites. It handles a lot of the heavy lifting involved in web scraping, allowing you to focus on the data you need to extract. This introduction serves as the groundwork for more advanced topics and techniques in web scraping with Scrapy.