- Inside your terminal, run the following to set up your environment:
a. Run "sudo apt update"
b. "sudo apt install python3-pip"
c. "sudo apt install python3-scrapy"
d. "pip install Scrapy==2.11.0"
e. "pip install scraperapi-sdk==0.2.2"
f. "pip install python_dotenv==1.0.0"
g. "pip install pandas==1.3.5" - Run "scrapy startproject {project name}" inside the directory you wish to store this scrapy project in.
- If you get the error "AttributeError: module 'lib' has no attribute 'OpenSSL_add_all_algorithms'", you may need to run the following:
a. python3 -m pip install --upgrade cryptography pyOpenSSL
b. sudo apt-get install libssl-dev
- Once you have created a Scrapy project, put the Python files from this directory into the spiders folder of your own project. Each Scrapy project gets a spiders folder automatically (assuming the startproject step above was completed correctly).
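For reference, every spider file in the spiders folder follows the same basic shape. The sketch below uses a hypothetical spider name and a placeholder URL; it is not one of the spiders shipped in this directory, only an illustration of the structure Scrapy expects.

    import scrapy


    class ExampleSpider(scrapy.Spider):
        # Hypothetical name; the real spiders define their own names
        # (e.g. "farfetch_women_get_urls"), which is what "scrapy crawl <name>" refers to.
        name = "example_spider"
        start_urls = ["https://example.com"]  # placeholder URL

        def parse(self, response):
            # Yield one item per response; Scrapy writes these rows to the file passed with -o.
            yield {"url": response.url, "title": response.css("title::text").get()}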
- Inside the settings.py file (which is automatically created in each Scrapy project), make sure the fields are set as follows:
a. "HTTPERROR_ALLOWED_CODES = [401, 404, 405]"
b. "ROBOTSTXT_OBEY = False"
c. "DOWNLOAD_DELAY = 0.5" (May need to set it to 1 if 0.5 overwhelms the site server)
d. 'USER_AGENT = "Mozilla/5.0"'
e. "CONCURRENT_REQUESTS = 32"
f. "CONCURRENT_REQUESTS_PER_DOMAIN = 32"
g. "CONCURRENT_REQUESTS_PER_IP = 16"
h. "COOKIES_ENABLED = False" - In the directory where you will be running your scrapy spiders, do the following:
a. Create a .env file
b. Inside the .env file, add the following line:
Scraper_API_Key='{enter Scraper API key}'
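The spiders in this directory are expected to read this key themselves; the snippet below is only a minimal sketch of how the python_dotenv package installed above can load it. The variable name matches the .env line, but the placement and error handling are assumptions, not part of the project.

    import os

    from dotenv import load_dotenv  # from the python_dotenv package installed above

    # Read key=value pairs from the .env file in the directory you run
    # "scrapy crawl" from, and expose them as environment variables.
    load_dotenv()

    scraper_api_key = os.getenv("Scraper_API_Key")
    if not scraper_api_key:
        raise RuntimeError("Scraper_API_Key is missing from the .env file")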
- To retrieve the URLs, run the following:
scrapy crawl farfetch_women_get_urls -o farfetch_new_existing_deleted/FarFetch_Women_Urls.csv -t csv
The spider name is "farfetch_women_get_urls", so it goes immediately after "scrapy crawl"
-o: path of the CSV file where the scraped output is saved
-t: specifies the output format, in our case csv
- To run a scrape for the item URLs, run the following:
scrapy crawl farfetch_women_update_database -o farfetch_new_existing_deleted/FarFetch_Women_New.csv -t csv
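Since pandas is part of the install list above, you can sanity-check the output CSVs once a crawl finishes. This is an optional illustrative snippet, not a required step:

    import pandas as pd

    # Load the CSVs written via the -o flag above and print a quick summary.
    urls = pd.read_csv("farfetch_new_existing_deleted/FarFetch_Women_Urls.csv")
    new_items = pd.read_csv("farfetch_new_existing_deleted/FarFetch_Women_New.csv")

    print(f"{len(urls)} URLs collected, {len(new_items)} new items scraped")
    print(new_items.head())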
- If there is an error running the Scrapy spiders, make sure the Twisted version is 21.7.0 and the Scrapy version is 2.11.0.
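A quick way to confirm both versions (assuming Python 3.8+, where importlib.metadata is in the standard library):

    from importlib.metadata import version

    # Compare against the pins used above: Scrapy 2.11.0 and Twisted 21.7.0.
    for package in ("Scrapy", "Twisted"):
        print(package, version(package))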