After covering the basics of creating a Scrapy project and your first spider, it's time to delve into more advanced techniques that can be used with Scrapy to handle more complex web scraping tasks.
Scrapy spiders can be enhanced to handle a variety of complex situations, such as dynamically generated content, spider arguments, pagination, and more.
Many websites have their content spread across multiple pages. Here's how you might handle pagination by sending requests to subsequent pages:
import scrapy

class PaginatedSpider(scrapy.Spider):
    name = 'paginated_quotes'
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
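As a side note, Scrapy responses also provide a follow() shortcut that resolves relative URLs for you, so the explicit urljoin call is not strictly necessary; the last lines of parse could equally be written as:

        # response.follow resolves the relative href against the current page URL
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)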
Sometimes, you may want to pass arguments to your spiders. Scrapy allows you to pass these arguments via the crawl command:
scrapy crawl my_spider -a category=electronics
You can access these arguments in your spider:
import scrapy

class MySpider(scrapy.Spider):
    name = 'my_spider'

    def __init__(self, category='', *args, **kwargs):
        super().__init__(*args, **kwargs)
        # The value passed with -a category=... arrives here as a keyword argument
        self.start_urls = [f'http://example.com/categories/{category}']

    def parse(self, response):
        # ... scraping logic here ...
        pass
For websites that rely heavily on JavaScript to render their content, plain Scrapy requests only see the initial HTML, not the dynamically loaded data. In such cases, integrating Scrapy with a headless browser such as Splash, via the scrapy-splash package, can be useful.
First, you'd need to run a Splash server (see the Splash documentation for installation and setup details) and then configure Scrapy to use Splash by updating your project settings:
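For example, if Docker is available, the Splash project publishes a ready-made image; one common way to start a local instance on the default port (which matches the SPLASH_URL setting below) is:

docker run -p 8050:8050 scrapinghub/splash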
# settings.py
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
Then, use SplashRequest in your spider:
import scrapy
from scrapy_splash import SplashRequest

class JSSpider(scrapy.Spider):
    name = "js_spider"

    def start_requests(self):
        url = 'http://example.com/dynamic_content'
        # 'wait' gives Splash time to let the page's JavaScript finish rendering
        yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        # ... scraping logic for JS-rendered content ...
        pass
While the basic spiders above just yield Python dictionaries, Scrapy also provides the Item class for defining the structure of your scraped data, along with Item Loaders, which offer a more convenient way to populate those items.
Define the structure of your scraped items (typically in your project's items.py):
import scrapy

class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
Item Loaders provide a way to populate your items:
import scrapy
from scrapy.loader import ItemLoader

from myproject.items import QuoteItem

class QuotesWithLoaderSpider(scrapy.Spider):
    name = "quotes_with_loader"
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        # Scope each loader to a single quote so every quote becomes its own item
        for quote in response.css('div.quote'):
            loader = ItemLoader(item=QuoteItem(), selector=quote)
            loader.add_css('text', 'span.text::text')
            loader.add_css('author', 'span small::text')
            loader.add_css('tags', 'div.tags a.tag::text')
            yield loader.load_item()
Item Loaders are particularly useful when you want to preprocess the data before storing it in the item. You can define input and output processors for this purpose.
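For example, a loader subclass, sketched here against the QuoteItem defined above (the class name and the exact cleanup steps are illustrative, not part of Scrapy), could strip whitespace on input and collapse single-valued fields on output. On recent Scrapy versions the built-in processors live in the itemloaders package that ships with Scrapy:

from itemloaders.processors import Join, MapCompose, TakeFirst
from scrapy.loader import ItemLoader

class QuoteLoader(ItemLoader):
    # Input processors run as values are added with add_css()/add_value();
    # output processors run when load_item() is called.
    default_output_processor = TakeFirst()

    text_in = MapCompose(str.strip)
    author_in = MapCompose(str.strip)
    tags_out = Join(', ')

Using QuoteLoader in place of ItemLoader in the spider above keeps the scraping code unchanged while the cleanup happens inside the loader.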
Scrapy is highly extensible, with options to include custom middlewares, extensions, and pipelines.
- Middlewares allow you to modify Scrapy's request/response processing.
- Extensions can add functionality to Scrapy (e.g., sending email notifications upon certain events).
- Pipelines are perfect for processing the data once it has been extracted, such as validating, cleaning, or storing it in a database (a short pipeline sketch follows below).
You can create these components and include them in your project's settings to extend Scrapy's capabilities to fit your specific scraping needs.
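For instance, a minimal item pipeline, sketched here with a hypothetical class name, could validate and clean quote items before they are stored:

# pipelines.py
from scrapy.exceptions import DropItem

class CleanQuotesPipeline:
    # Hypothetical pipeline: drop items without text and strip stray whitespace.
    def process_item(self, item, spider):
        if not item.get('text'):
            raise DropItem('Missing quote text')
        item['text'] = item['text'].strip()
        return item

Pipelines are enabled by listing them in the project settings, where the number controls the order in which they run (lower values run first):

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.CleanQuotesPipeline': 300,
}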
These advanced techniques and features make Scrapy a powerful tool for tackling a wide range of web scraping projects. With these skills, you can handle more complex websites and data extraction scenarios.