- Inside your terminal, run the following to set up your environment:
a. Run "sudo apt update"
b. "sudo apt install python3-pip"
c. "sudo apt install python3-scrapy"
d. "pip install Scrapy==2.11.0"
e. "pip install scraperapi-sdk==0.2.2"
f. "pip install python_dotenv==1.0.0"
g. "pip install pandas==1.3.5" - Run "scrapy startproject {project name}" inside the directory you wish to store this scrapy project in.
- If you get the error "AttributeError: module 'lib' has no attribute 'OpenSSL_add_all_algorithms'", you may need to run the following:
a. python3 -m pip install --upgrade cryptography pyOpenSSL
b. sudo apt-get install libssl-dev
- Once you have created a Scrapy project, put the Python files from this directory into the spiders folder of your own project. Each Scrapy project gets a spiders folder automatically (assuming the startproject step above was completed correctly).
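For reference, every spider file in the spiders folder follows the same basic shape. The sketch below uses a hypothetical spider name and a placeholder URL; it is not one of the spiders shipped in this directory, only an illustration of the structure Scrapy expects.

    import scrapy


    class ExampleSpider(scrapy.Spider):
        # Hypothetical name; the real spiders define their own names
        # (e.g. "farfetch_women_get_urls"), which is what "scrapy crawl <name>" refers to.
        name = "example_spider"
        start_urls = ["https://example.com"]  # placeholder URL

        def parse(self, response):
            # Yield one item per response; Scrapy writes these rows to the file passed with -o.
            yield {"url": response.url, "title": response.css("title::text").get()}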
- Inside the settings.py file (which is automatically created in each Scrapy project), make sure the fields are set as follows:
a. "HTTPERROR_ALLOWED_CODES = [401, 404, 405]"
b. "ROBOTSTXT_OBEY = False"
c. "DOWNLOAD_DELAY = 0.5" (May need to set it to 1 if 0.5 overwhelms the site server)
d. 'USER_AGENT = "Mozilla/5.0"'
e. "CONCURRENT_REQUESTS = 32"
f. "CONCURRENT_REQUESTS_PER_DOMAIN = 32"
g. "CONCURRENT_REQUESTS_PER_IP = 16"
h. "COOKIES_ENABLED = False" - In the directory where you will be running your scrapy spiders, do the following:
a. Create a .env file
b. Inside the .env file, add the following line:
Scraper_API_Key='{enter Scraper API key}'
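The spiders in this directory are expected to read this key themselves; the snippet below is only a minimal sketch of how the python_dotenv package installed above can load it. The variable name matches the .env line, but the placement and error handling are assumptions, not part of the project.

    import os

    from dotenv import load_dotenv  # from the python_dotenv package installed above

    # Read key=value pairs from the .env file in the directory you run
    # "scrapy crawl" from, and expose them as environment variables.
    load_dotenv()

    scraper_api_key = os.getenv("Scraper_API_Key")
    if not scraper_api_key:
        raise RuntimeError("Scraper_API_Key is missing from the .env file")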
- To retrieve the URLs, run the following:
scrapy crawl farfetch_women_get_urls -o farfetch_new_existing_deleted/FarFetch_Women_Urls.csv -t csv
The spider name is "farfetch_women_get_urls", so it goes immediately after "scrapy crawl"
-o: path of the CSV file where the scraped output is saved
-t: specifies the output format, in our case csv
- To run a scrape for the item URLs, run the following:
scrapy crawl farfetch_women_update_database -o farfetch_new_existing_deleted/FarFetch_Women_New.csv -t csv
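Since pandas is part of the install list above, you can sanity-check the output CSVs once a crawl finishes. This is an optional illustrative snippet, not a required step:

    import pandas as pd

    # Load the CSVs written via the -o flag above and print a quick summary.
    urls = pd.read_csv("farfetch_new_existing_deleted/FarFetch_Women_Urls.csv")
    new_items = pd.read_csv("farfetch_new_existing_deleted/FarFetch_Women_New.csv")

    print(f"{len(urls)} URLs collected, {len(new_items)} new items scraped")
    print(new_items.head())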
- If there is an error running the Scrapy spiders, make sure the Twisted version is 21.7.0 and the Scrapy version is 2.11.0.
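A quick way to confirm both versions (assuming Python 3.8+, where importlib.metadata is in the standard library):

    from importlib.metadata import version

    # Compare against the pins used above: Scrapy 2.11.0 and Twisted 21.7.0.
    for package in ("Scrapy", "Twisted"):
        print(package, version(package))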