# 5. Advanced Scrapy Techniques

After covering the basics of creating a Scrapy project and writing your first spider, it's time to delve into more advanced techniques for handling complex web scraping tasks.

## Writing More Complex Spiders

Scrapy spiders can be enhanced to handle a variety of complex situations, such as pagination, spider arguments, and dynamically generated content.

### Handling Pagination

Many websites have their content spread across multiple pages. Here's how you might handle pagination by sending requests to subsequent pages:

```python
import scrapy

class PaginatedSpider(scrapy.Spider):
    name = 'paginated_quotes'
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        # Extract each quote block on the current page
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        # Queue a request for the next page, if there is one
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
```
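
As a design note, recent Scrapy versions (1.4+) also provide `response.follow`, which resolves relative URLs for you, so the explicit `urljoin` call can be dropped. A minimal sketch of the same pagination loop using it (the spider name and the trimmed-down item dict are just illustrative):

```python
import scrapy

class PaginatedFollowSpider(scrapy.Spider):
    name = 'paginated_quotes_follow'  # illustrative name
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').get()}

        # response.follow accepts the relative href directly
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```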

### Using Spider Arguments

Sometimes, you may want to pass arguments to your spiders, for example to select a category to crawl. Scrapy lets you pass these arguments with the `-a` option of the `crawl` command:

```bash
scrapy crawl my_spider -a category=electronics
```

You can access these arguments in your spider:

```python
import scrapy

class MySpider(scrapy.Spider):
    name = 'my_spider'

    def __init__(self, category='', **kwargs):
        super().__init__(**kwargs)
        # Build the start URL from the category passed via -a
        self.start_urls = [f'http://example.com/categories/{category}']

    def parse(self, response):
        # ... scraping logic here ...
        pass
```

### Handling JavaScript-Rendered Pages

For websites that rely heavily on JavaScript to render content, plain Scrapy requests may not return the data you need, because it is only loaded dynamically in the browser. In such cases, integrating Scrapy with a JavaScript rendering service like Splash can be useful.

First, install the `scrapy-splash` package and run a Splash server (see the Splash documentation for installation and setup details), then configure Scrapy to use Splash by updating your project settings:

```python
# settings.py
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```

Then, use `SplashRequest` in your spider:

```python
import scrapy
from scrapy_splash import SplashRequest

class JSSpider(scrapy.Spider):
    name = "js_spider"

    def start_requests(self):
        url = 'http://example.com/dynamic_content'
        # Ask Splash to render the page and wait 0.5s for JavaScript to run
        yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        # ... scraping logic for JS-rendered content ...
        pass
```

## Item Loaders and Input/Output Processors

While the spiders above simply yield Python dictionaries, Scrapy also provides the Item class and Item Loaders, which offer a more structured and convenient way to define and populate your scraped data.

### Defining Items

Define the structure of your scraped items:

```python
import scrapy

class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
```

### Using Item Loaders

Item Loaders provide a way to populate your items:

```python
import scrapy
from scrapy.loader import ItemLoader
from myproject.items import QuoteItem

class QuotesWithLoaderSpider(scrapy.Spider):
    name = "quotes_with_loader"
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        # Build one item per quote block, scoping the loader to that block
        for quote in response.css('div.quote'):
            loader = ItemLoader(item=QuoteItem(), selector=quote)
            loader.add_css('text', 'span.text::text')
            loader.add_css('author', 'span small::text')
            loader.add_css('tags', 'div.tags a.tag::text')
            yield loader.load_item()
```

Item Loaders are particularly useful when you want to preprocess the data before storing it in the item. You can define input and output processors for this purpose.
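
As a brief, hedged sketch of what that can look like: the variant of `QuoteItem` below declares processors in the field metadata using Scrapy's built-in `MapCompose` and `TakeFirst` (imported from the `itemloaders` package that ships with Scrapy 2.x); the `strip_quotes` helper is purely illustrative.

```python
import scrapy
from itemloaders.processors import MapCompose, TakeFirst

def strip_quotes(value):
    # Illustrative input-processing step: trim whitespace and curly quote marks
    return value.strip().strip('\u201c\u201d')

class QuoteItem(scrapy.Item):
    # MapCompose runs each extracted value through the given functions (input);
    # TakeFirst keeps the first non-empty value instead of a list (output)
    text = scrapy.Field(input_processor=MapCompose(strip_quotes),
                        output_processor=TakeFirst())
    author = scrapy.Field(output_processor=TakeFirst())
    tags = scrapy.Field()  # default behaviour: keep the full list of tags
```

With these processors in place, the Item Loader spider above would yield cleaned, single-valued `text` and `author` fields while still collecting `tags` as a list.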

## Extending Scrapy with Middlewares, Extensions, and Pipelines

Scrapy is highly extensible, with options to include custom middlewares, extensions, and pipelines.

- **Middlewares** let you hook into and modify Scrapy's request/response processing.
- **Extensions** add functionality to Scrapy (e.g., sending email notifications on certain events).
- **Pipelines** process items once they have been extracted, such as validating, cleaning, or storing them in a database.

You can create these components and register them in your project settings to tailor Scrapy to your specific scraping needs, as in the pipeline sketch below.
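
For instance, here is a minimal item pipeline sketch (the class name, module path, and drop rule are illustrative assumptions, not part of an existing project) that cleans quotes and discards empty ones before they reach storage:

```python
# pipelines.py
from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem

class CleanQuotesPipeline:
    """Strip whitespace from the quote text and drop items without any text."""

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        # Assumes 'text' is a single string (e.g. via the TakeFirst processor above)
        text = (adapter.get('text') or '').strip()
        if not text:
            raise DropItem("Missing quote text")
        adapter['text'] = text
        return item
```

Enable it by adding the class to `ITEM_PIPELINES` in `settings.py`; the number controls the order in which pipelines run (lower runs first):

```python
# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.CleanQuotesPipeline': 300,
}
```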


These advanced techniques and features make Scrapy a powerful tool for tackling a wide range of web scraping projects. With these skills, you can handle more complex websites and data extraction scenarios.