## 5. Advanced Scrapy Techniques

After covering the basics of creating a Scrapy project and writing your first spider, it's time to delve into more advanced techniques that Scrapy offers for handling complex web scraping tasks.

### Writing More Complex Spiders

Scrapy spiders can be enhanced to handle a variety of complex situations, such as pagination, spider arguments, dynamically generated content, and more.

#### Handling Pagination

Many websites spread their content across multiple pages. Here's how you might handle pagination by sending requests to each subsequent page:

```py
import scrapy

class PaginatedSpider(scrapy.Spider):
    name = 'paginated_quotes'
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        # Extract each quote on the current page
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        # Follow the "Next" link, if present, and parse it with this same callback
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
```
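
As a side note (an alternative not mentioned in the original text), Scrapy's `Response.follow` shortcut accepts relative URLs directly, so the last few lines of `parse` could be written a little more compactly:

```py
# Inside parse(), after yielding the quotes on the current page
next_page = response.css('li.next a::attr(href)').get()
if next_page is not None:
    # response.follow resolves the relative URL for you
    yield response.follow(next_page, callback=self.parse)
```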

#### Using Spider Arguments

Sometimes you may want to pass arguments to your spiders, for example to choose which category of a site to scrape. Scrapy lets you pass these arguments via the `-a` option of the `crawl` command:

```sh
scrapy crawl my_spider -a category=electronics
```

You can access these arguments in your spider:

```py
import scrapy

class MySpider(scrapy.Spider):
    name = 'my_spider'

    def __init__(self, category='', **kwargs):
        super().__init__(**kwargs)
        # Build the start URL from the argument passed on the command line
        self.start_urls = [f'http://example.com/categories/{category}']

    def parse(self, response):
        # ... scraping logic here ...
        pass
```

#### Handling JavaScript-Rendered Pages

Websites that rely heavily on JavaScript to render their content may return little useful data to traditional Scrapy requests, because the content is loaded dynamically in the browser. In such cases, integrating Scrapy with a headless browser like Splash can be useful.
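
As a rough sketch of the setup (assuming you use the `scrapy-splash` integration package and run Splash with Docker, the approach its documentation describes), the installation steps look something like this:

```sh
# Install the Scrapy-Splash integration package
pip install scrapy-splash

# Start a local Splash instance (it listens on port 8050 by default)
docker run -p 8050:8050 scrapinghub/splash
```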

Once a Splash server is running (see the [Splash documentation](https://splash.readthedocs.io/en/stable/install.html) for installation and setup details), configure Scrapy to use it by updating your project settings:

```py
# settings.py
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```

Then, use `SplashRequest` in your spider:

```py
import scrapy
from scrapy_splash import SplashRequest

class JSSpider(scrapy.Spider):
    name = "js_spider"

    def start_requests(self):
        url = 'http://example.com/dynamic_content'
        # Render the page in Splash, waiting briefly so JavaScript can finish
        yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        # ... scraping logic for JS-rendered content ...
        pass
```

### Item Loaders and Input/Output Processors

While basic spiders simply yield Python dictionaries, Scrapy also provides the `Item` class and Item Loaders, which offer a more structured and convenient way to define and populate scraped data.

#### Defining Items

Define the structure of your scraped items:

```py
import scrapy

class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
```
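
For instance (a small illustrative snippet, not part of the original file), an item behaves much like a dictionary whose keys are restricted to the declared fields:

```py
item = QuoteItem(text="To be, or not to be.", author="Shakespeare", tags=["life"])
print(item['author'])  # fields are read and written like dictionary keys
```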

#### Using Item Loaders

Item Loaders provide a way to populate your items:

```py
import scrapy
from scrapy.loader import ItemLoader
from myproject.items import QuoteItem

class QuotesWithLoaderSpider(scrapy.Spider):
    name = "quotes_with_loader"
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        loader = ItemLoader(item=QuoteItem(), response=response)
        loader.add_css('text', 'div.quote span.text::text')
        loader.add_css('author', 'span small::text')
        loader.add_css('tags', 'div.tags a.tag::text')
        yield loader.load_item()
```

Item Loaders are particularly useful when you want to preprocess the data before storing it in the item. You can define input and output processors for this purpose.
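
As a brief, hedged sketch (assuming a reasonably recent Scrapy release, where the processor helpers live in the bundled `itemloaders` package, and reusing the quote fields from above), processors can be declared on a loader subclass: input processors run on every extracted value, and output processors decide what finally lands in the field:

```py
from itemloaders.processors import Join, MapCompose, TakeFirst
from scrapy.loader import ItemLoader

class QuoteLoader(ItemLoader):
    # Collected values are lists; TakeFirst keeps only the first non-empty one
    default_output_processor = TakeFirst()

    # Input processors run on each value as it is added to the loader
    text_in = MapCompose(str.strip)
    author_in = MapCompose(str.strip)

    # Keep every tag, but join them into a single comma-separated string
    tags_out = Join(', ')
```

Using `QuoteLoader(item=QuoteItem(), response=response)` in the spider above would then yield stripped, single-valued `text` and `author` fields instead of one-element lists.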

### Extending Scrapy with Middlewares, Extensions, and Pipelines

Scrapy is highly extensible, with options to include custom middlewares, extensions, and pipelines:

- **Middlewares** allow you to modify Scrapy's request/response processing.
- **Extensions** add functionality to Scrapy itself (e.g., sending email notifications upon certain events).
- **Pipelines** are ideal for processing the data once it has been extracted, such as validating, cleaning, or storing it in a database (see the sketch below).

You can create these components and register them in your project's settings to extend Scrapy's capabilities to fit your specific scraping needs.
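
As a minimal, hedged sketch of an item pipeline (the class name, validation rules, and the `myproject` module path are illustrative assumptions, not part of the original text), a pipeline implements `process_item` and is enabled through the `ITEM_PIPELINES` setting:

```py
# pipelines.py
from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem

class CleanQuotePipeline:
    """Validate and lightly clean each scraped quote before it is exported or stored."""

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)  # works with both plain dicts and scrapy.Item objects
        if not adapter.get('text'):
            raise DropItem("Missing quote text")  # discard incomplete items
        adapter['author'] = (adapter.get('author') or '').strip()
        return item
```

```py
# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.CleanQuotePipeline': 300,  # lower numbers run earlier
}
```

(`itemadapter` ships as a dependency of recent Scrapy releases; with plain dict items, ordinary dictionary access would work just as well.)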
----------
These advanced techniques and features make Scrapy a powerful tool for tackling a wide range of web scraping projects. With these skills, you can handle more complex websites and data extraction scenarios.
