## 5. Advanced Scrapy Techniques

After covering the basics of creating a Scrapy project and writing your first spider, it's time to delve into more advanced techniques that Scrapy offers for handling more complex web scraping tasks.

### Writing More Complex Spiders

Scrapy spiders can be enhanced to handle a variety of complex situations, such as dynamically generated content, spider arguments, pagination, and more.

#### Handling Pagination

Many websites spread their content across multiple pages. Here's how you might handle pagination by following links to subsequent pages:

```py
import scrapy

class PaginatedSpider(scrapy.Spider):
    name = 'paginated_quotes'
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
```
#### Using Spider Arguments

Sometimes you may want to pass arguments to your spiders. Scrapy lets you supply them via the `crawl` command:

```sh
scrapy crawl my_spider -a category=electronics
```
You can access these arguments in your spider:

```py
import scrapy

class MySpider(scrapy.Spider):
    name = 'my_spider'

    def __init__(self, category='', **kwargs):
        self.start_urls = [f'http://example.com/categories/{category}']
        super().__init__(**kwargs)

    def parse(self, response):
        # ... scraping logic here ...
        pass
```
#### Handling JavaScript-Rendered Pages

For websites that rely heavily on JavaScript to render content, traditional Scrapy requests might not fetch the data, since it's loaded dynamically. In such cases, integrating Scrapy with a JavaScript rendering service like Splash can be useful.

First, you'd need to run a Splash server (see the [Splash documentation](https://splash.readthedocs.io/en/stable/install.html) for installation and setup details) and then configure Scrapy to use Splash by updating your project settings:

```py
# settings.py
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```
Then, use `SplashRequest` in your spider:

```py
import scrapy
from scrapy_splash import SplashRequest

class JSSpider(scrapy.Spider):
    name = "js_spider"

    def start_requests(self):
        url = 'http://example.com/dynamic_content'
        yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        # ... scraping logic for JS-rendered content ...
        pass
```
### Item Loaders and Input/Output Processors

While basic spiders simply yield Python dictionaries, Scrapy also provides the `Item` class and Item Loaders, which offer a more structured and convenient way to define and populate your scraped data.

#### Defining Items

Define the structure of your scraped items:

```py
import scrapy

class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
```
#### Using Item Loaders

Item Loaders provide a convenient way to populate your items:
```py
import scrapy
from scrapy.loader import ItemLoader
from myproject.items import QuoteItem

class QuotesWithLoaderSpider(scrapy.Spider):
    name = "quotes_with_loader"
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        for quote in response.css('div.quote'):
            # Use each quote's div as the selector context so one item is built per quote
            loader = ItemLoader(item=QuoteItem(), selector=quote)
            loader.add_css('text', 'span.text::text')
            loader.add_css('author', 'span small::text')
            loader.add_css('tags', 'div.tags a.tag::text')
            yield loader.load_item()
```
Item Loaders are particularly useful when you want to preprocess the data before storing it in the item. You can define input and output processors for this purpose.
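As a minimal sketch (assuming the `QuoteItem` defined above; in recent Scrapy releases the built-in processors live in the `itemloaders` package, while older versions exposed them as `scrapy.loader.processors`), a custom loader might declare its processors like this:

```py
from itemloaders.processors import Join, MapCompose, TakeFirst
from scrapy.loader import ItemLoader

class QuoteLoader(ItemLoader):
    # Input processors run on each extracted value as it is added to the loader
    default_input_processor = MapCompose(str.strip)
    # Output processors run once per field when load_item() assembles the item
    default_output_processor = TakeFirst()
    # For tags, keep every value and join them into one comma-separated string
    tags_out = Join(', ')
```
Instantiating `QuoteLoader` instead of `ItemLoader` in the spider above would strip whitespace from every extracted value, collapse single-valued fields like `text` and `author` into plain strings, and join the tags.
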
### Extending Scrapy with Middlewares, Extensions, and Pipelines

Scrapy is highly extensible, with options to plug in custom middlewares, extensions, and pipelines.

- **Middlewares** allow you to modify how Scrapy processes requests and responses.
- **Extensions** add functionality to Scrapy itself (e.g., sending email notifications when certain events occur).
- **Pipelines** are ideal for processing data once it has been extracted, such as validating, cleaning, or storing it in a database (a minimal pipeline sketch follows below).

You can create these components and register them in your project's settings to extend Scrapy's capabilities to fit your specific scraping needs.
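For illustration, here is a minimal sketch of an item pipeline that validates scraped quotes, together with the setting that enables it. The `myproject.pipelines` module path and the class name are assumptions matching the earlier examples, not part of any fixed API:

```py
# pipelines.py
from scrapy.exceptions import DropItem

class ValidateQuotePipeline:
    def process_item(self, item, spider):
        # Drop any item that was scraped without its quote text
        if not item.get('text'):
            raise DropItem("Missing quote text")
        return item
```

```py
# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.ValidateQuotePipeline': 300,  # lower numbers run earlier
}
```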
----------

These advanced techniques and features make Scrapy a powerful tool for tackling a wide range of web scraping projects. With these skills, you can handle more complex websites and data extraction scenarios.