Commit de12fbc

add images for notebooks
1 parent 1055234 commit de12fbc

6 files changed

+363
-0
lines changed

WebData/Scrape_dynamic.ipynb

Lines changed: 363 additions & 0 deletions
@@ -0,0 +1,363 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {},
6+
"source": [
7+
"# Scraping JavaScript data (\"dynamic webpages\")\n",
8+
"### by [Jason DeBacker](http://jasondebacker.com), October 2017 (with thanks to [Adam Rennhoff](http://mtweb.mtsu.edu/rennhoff/) )\n",
9+
"\n",
10+
"This notebook provides a tutorial and examples showing how to scrape webpages with JavaScript data.\n",
11+
"\n",
12+
"## Example: scrape the store locations for Walgreens pharmacies.\n",
13+
"\n",
14+
"Sometimes the webpage will be more complicated. As an example, suppose that I want to scrape store locations of Walgreens pharmacies.\n",
15+
"\n",
16+
"![Walgreen Locations Screenshot](files/images/WalgreenLocations.png)\n",
17+
"\n",
18+
"Notice that I am searching for stores near zip code 29205 but this fact is NOT displayed in the url, which is `https://www.walgreens.com/storelocator/find.jsp`. In other words, the zip code is not part of the url so it would not be possible to loop over different locations the way we did with the Wikipedia pages.\n",
19+
"\n",
20+
"What can we do? Lets look at the request that is being sent to Walgreens.com to see if we can mimic the request that is being sent. To do this, we need to use the \"Inspect\" tool to look at the network data (\"XHR\") (Note that the format of the inspect tool will vary depending on the internet browser you are using - in the screenshots below, I'm using Safari Version 11.0):\n",
21+
"\n",
22+
"![Walgreen Inspect Screenshot](files/images/WalgreensInspect.png)\n",
23+
"\n",
24+
"This will take some trial and error but you can see a list of requests under \"Resources\" and then \"XHR\". I have clicked on the second search result in the list. Notice that this is showing the address of the first result (4467 DEVINE ST). This tells me that this is the request I want to mimic.\n",
25+
"\n",
26+
"\n",
27+
"In order to figure out the format of my request, I need to click on the drop down menu that says \"Response\" and select \"Request\" from this menu. Then click on the \"show details sidebar\" icon to show details of the request.\n",
28+
"\n",
29+
"![Walgreen Request Screenshot](files/images/RequestType.png)\n",
30+
"\n",
31+
"The \"request payload\" in this case is: `{\"q\":\" Columbia, SC 29205\",\"r\":\"50\",\"lat\":33.9900337,\"lng\":-80.99815760000001,\"requestType\":\"locator\",\"s\":\"15\",\"p\":\"1\"}`\n",
32+
"\n",
33+
"The request payload tells us what we need to send to the URL so that the server returns the information we want.\n",
34+
"\n",
35+
"There are three things from the details sidebar that we'll also need: \"Location\", \"Request and Response\", and \"Request Headers\".\n",
36+
"\n",
37+
"* \"Location\" will tell us the URL to which we make our request.\n",
38+
"* \"Request and Response\" will tell us the method (POST).\n",
39+
"* \"Request Headers\" will tell us the information that needs to be in the header of our request -- in the Best Buy example, we had to send a \"User-Agent\" so that it looked like we were coming from a web browser like Chrome or Firefox\n",
40+
"\n",
41+
"Let's try to make that request using the `requests` library in Python."
42+
]
43+
},
44+
{
45+
"cell_type": "code",
46+
"execution_count": 22,
47+
"metadata": {},
48+
"outputs": [
49+
{
50+
"name": "stdout",
51+
"output_type": "stream",
52+
"text": [
53+
"<Response [200]>\n"
54+
]
55+
}
56+
],
57+
"source": [
58+
"import requests\n",
59+
"import json\n",
60+
"\n",
61+
"url = 'https://customersearch.walgreens.com/storelocator/v1/stores/search' # from Headers Request URL\n",
62+
"\n",
63+
"# Request payloads\n",
64+
"pay = {\n",
65+
" \"q\":\" Columbia, SC 29205\",\n",
66+
" \"r\":\"50\",\"lat\":33.9900337,\n",
67+
" \"lng\":-80.99815760000001,\n",
68+
" \"requestType\":\"locator\",\n",
69+
" \"s\":\"15\",\n",
70+
" \"p\":\"1\"\n",
71+
"}\n",
72+
"\n",
73+
"# Request headers\n",
74+
"heads = {\n",
75+
" \"Accept\" : \"application/json, text/plain, */*\",\n",
76+
" \"Accept-Encoding\" : \"gzip, deflate, br\",\n",
77+
" \"Accept-Language\" : \"en-US,en;q=0.8\",\n",
78+
" \"Connection\" : \"keep-alive\",\n",
79+
" \"Content-Length\" : \"105\",\n",
80+
" \"Content-Type\" : \"application/json;charset=UTF-8\",\n",
81+
" \"Host\" : \"customersearch.walgreens.com\",\n",
82+
" \"Origin\" : \"https://www.walgreens.com\",\n",
83+
" \"Referer\" : \"https://www.walgreens.com/storelocator/find.jsp?tab=store+locator&requestType=locator\",\n",
84+
" \"User-Agent\" : \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Safari/604.1.38\" \n",
85+
"}\n",
86+
"\n",
87+
"# Making our POST request using the headers and payload\n",
88+
"response = requests.post(url, data = json.dumps(pay), headers = heads)\n",
89+
"print(response)"
90+
]
91+
},
92+
{
93+
"cell_type": "markdown",
94+
"metadata": {},
95+
"source": [
96+
"A response code of 200 means that the request was properly sent and received. Success!\n",
97+
"\n",
98+
"Our next step is to get the data into a usable format by parsing the JSON and remembering the structure of the response we receive (from the \"Preview\" tab)."
99+
]
100+
},
101+
{
102+
"cell_type": "code",
103+
"execution_count": 18,
104+
"metadata": {},
105+
"outputs": [
106+
{
107+
"data": {
108+
"text/plain": [
109+
"3"
110+
]
111+
},
112+
"execution_count": 18,
113+
"metadata": {},
114+
"output_type": "execute_result"
115+
}
116+
],
117+
"source": [
118+
"data = response.json() # our requested information is now saved as JSON data\n",
119+
"len(data)"
120+
]
121+
},
122+
{
123+
"cell_type": "markdown",
124+
"metadata": {},
125+
"source": [
126+
"We can only see the first two elements in the image above but our JSON data has three elements: \"filter\", \"results\", and \"summary\". The \"results\" key contains the information we really want."
127+
]
128+
},
129+
{
130+
"cell_type": "code",
131+
"execution_count": 19,
132+
"metadata": {},
133+
"outputs": [
134+
{
135+
"data": {
136+
"text/plain": [
137+
"25"
138+
]
139+
},
140+
"execution_count": 19,
141+
"metadata": {},
142+
"output_type": "execute_result"
143+
}
144+
],
145+
"source": [
146+
"results = data['results']\n",
147+
"len(results)"
148+
]
149+
},
150+
{
151+
"cell_type": "markdown",
152+
"metadata": {},
153+
"source": [
154+
"Our \"results\" variable holds data for 15 stores. Go back and look at the code: in the payload, we set the parameter 's' to 15.\n",
155+
"\n",
156+
"Let's look at the results for the first store in the listings..."
157+
]
158+
},
159+
{
160+
"cell_type": "code",
161+
"execution_count": 20,
162+
"metadata": {},
163+
"outputs": [
164+
{
165+
"data": {
166+
"text/plain": [
167+
"{'distance': '1.2',\n",
168+
" 'latitude': '33.992543',\n",
169+
" 'longitude': '-80.977481',\n",
170+
" 'mapUrl': 'https://maps.googleapis.com/maps/api/staticmap?size=451x451&markers=icon:http://www.walgreens.com/images/gmap/markers/point_wag.png|shadow:true|33.9900337,-80.99815760000001&client=gme-walgreens&sensor=false',\n",
171+
" 'store': {'address': {'city': 'COLUMBIA',\n",
172+
" 'state': 'SC',\n",
173+
" 'street': '4467 DEVINE ST',\n",
174+
" 'zip': '29205'},\n",
175+
" 'emergencyCode': '0',\n",
176+
" 'pharmacyCloseTime': '11PM',\n",
177+
" 'pharmacyOpenTime': '7AM',\n",
178+
" 'phone': [{'areaCode': '803', 'number': '7872527 ', 'type': 'store'}],\n",
179+
" 'serviceIndicators': [{'code': 't4hr', 'name': 'Store Open 24 hours'},\n",
180+
" {'code': 'dt', 'name': 'Drive-Thru Pharmacy'},\n",
181+
" {'code': 'imn', 'name': 'Immunizations'},\n",
182+
" {'code': 'fs', 'name': 'Flu Shot'},\n",
183+
" {'code': 'phi', 'name': 'One Hour Photo'},\n",
184+
" {'code': 'rb', 'name': 'Redbox'}],\n",
185+
" 'storeBrand': 'Walgreens',\n",
186+
" 'storeCloseTime': '12AM',\n",
187+
" 'storeNumber': '6136',\n",
188+
" 'storeOpenTime': '12AM',\n",
189+
" 'storeType': '01',\n",
190+
" 'telePharmacyKiosk': False,\n",
191+
" 'timeZone': 'EA'},\n",
192+
" 'storeSeoUrl': '/locator/walgreens-4467+devine+st-columbia-sc-29205/id=6136'}"
193+
]
194+
},
195+
"execution_count": 20,
196+
"metadata": {},
197+
"output_type": "execute_result"
198+
}
199+
],
200+
"source": [
201+
"results[0]"
202+
]
203+
},
204+
{
205+
"cell_type": "markdown",
206+
"metadata": {},
207+
"source": [
208+
"It may be difficult to see in the output but most of the information we would want is contained in the 'store' element:"
209+
]
210+
},
211+
{
212+
"cell_type": "code",
213+
"execution_count": 10,
214+
"metadata": {},
215+
"outputs": [
216+
{
217+
"data": {
218+
"text/plain": [
219+
"{'address': {'city': 'COLUMBIA',\n",
220+
" 'state': 'SC',\n",
221+
" 'street': '4467 DEVINE ST',\n",
222+
" 'zip': '29205'},\n",
223+
" 'emergencyCode': '0',\n",
224+
" 'pharmacyCloseTime': '11PM',\n",
225+
" 'pharmacyOpenTime': '7AM',\n",
226+
" 'phone': [{'areaCode': '803', 'number': '7872527 ', 'type': 'store'}],\n",
227+
" 'serviceIndicators': [{'code': 't4hr', 'name': 'Store Open 24 hours'},\n",
228+
" {'code': 'dt', 'name': 'Drive-Thru Pharmacy'},\n",
229+
" {'code': 'imn', 'name': 'Immunizations'},\n",
230+
" {'code': 'fs', 'name': 'Flu Shot'},\n",
231+
" {'code': 'phi', 'name': 'One Hour Photo'},\n",
232+
" {'code': 'rb', 'name': 'Redbox'}],\n",
233+
" 'storeBrand': 'Walgreens',\n",
234+
" 'storeCloseTime': '12AM',\n",
235+
" 'storeNumber': '6136',\n",
236+
" 'storeOpenTime': '12AM',\n",
237+
" 'storeType': '01',\n",
238+
" 'telePharmacyKiosk': False,\n",
239+
" 'timeZone': 'EA'}"
240+
]
241+
},
242+
"execution_count": 10,
243+
"metadata": {},
244+
"output_type": "execute_result"
245+
}
246+
],
247+
"source": [
248+
"results[0]['store'] # first store on the list"
249+
]
250+
},
251+
{
252+
"cell_type": "code",
253+
"execution_count": 11,
254+
"metadata": {},
255+
"outputs": [
256+
{
257+
"data": {
258+
"text/plain": [
259+
"{'address': {'city': 'COLUMBIA',\n",
260+
" 'state': 'SC',\n",
261+
" 'street': '1941 BLOSSOM ST',\n",
262+
" 'zip': '29205'},\n",
263+
" 'emergencyCode': '0',\n",
264+
" 'pharmacyCloseTime': '9PM',\n",
265+
" 'pharmacyOpenTime': '9AM',\n",
266+
" 'phone': [{'areaCode': '803', 'number': '2121015 ', 'type': 'store'}],\n",
267+
" 'serviceIndicators': [{'code': 'dt', 'name': 'Drive-Thru Pharmacy'},\n",
268+
" {'code': 'imn', 'name': 'Immunizations'},\n",
269+
" {'code': 'fs', 'name': 'Flu Shot'},\n",
270+
" {'code': 'phi', 'name': 'One Hour Photo'}],\n",
271+
" 'storeBrand': 'Walgreens',\n",
272+
" 'storeCloseTime': '10PM',\n",
273+
" 'storeNumber': '11433',\n",
274+
" 'storeOpenTime': '7AM',\n",
275+
" 'storeType': '01',\n",
276+
" 'telePharmacyKiosk': False,\n",
277+
" 'timeZone': 'EA'}"
278+
]
279+
},
280+
"execution_count": 11,
281+
"metadata": {},
282+
"output_type": "execute_result"
283+
}
284+
],
285+
"source": [
286+
"results[1]['store'] # second store on the list"
287+
]
288+
},
289+
{
290+
"cell_type": "markdown",
291+
"metadata": {},
292+
"source": [
293+
"Suppose that for a research question, I am interested in knowing which Walgreens locations offer flu shots. After some exploration, I see that a \"serviceIndicators\" code of \"fs\" indicates that flu shots are offered at that location. We can loop through the 15 returned stores to print out a list of the stores that offer flu shots."
294+
]
295+
},
296+
{
297+
"cell_type": "code",
298+
"execution_count": 16,
299+
"metadata": {},
300+
"outputs": [
301+
{
302+
"name": "stdout",
303+
"output_type": "stream",
304+
"text": [
305+
"The Walgreens at 4467 DEVINE ST offers flu shots.\n",
306+
"The Walgreens at 1941 BLOSSOM ST offers flu shots.\n",
307+
"The Walgreens at 3501 FOREST DR offers flu shots.\n",
308+
"The Walgreens at 7801 GARNERS FERRY RD offers flu shots.\n",
309+
"The Walgreens at 1537 CHARLESTON HWY offers flu shots.\n",
310+
"The Walgreens at 2224 AUGUSTA RD offers flu shots.\n",
311+
"The Walgreens at 1223 SAINT ANDREWS RD offers flu shots.\n",
312+
"The Walgreens at 9001 TWO NOTCH RD offers flu shots.\n",
313+
"The Walgreens at 1010 OLD BARNWELL RD offers flu shots.\n",
314+
"The Walgreens at 2725 CLEMSON RD offers flu shots.\n",
315+
"The Walgreens at 5220 SUNSET BLVD offers flu shots.\n",
316+
"The Walgreens at 7412 BROAD RIVER RD offers flu shots.\n",
317+
"The Walgreens at 175 FORUM DR offers flu shots.\n",
318+
"The Walgreens at 4520 HARD SCRABBLE RD offers flu shots.\n",
319+
"The Walgreens at 1532 LAKE MURRAY BLVD offers flu shots.\n"
320+
]
321+
}
322+
],
323+
"source": [
324+
"# Loop over the returned stores\n",
325+
"for j in range(len(results)):\n",
326+
"    # For each store, loop over its serviceIndicators to find 'fs' (flu shots)\n",
327+
"    for i in results[j]['store']['serviceIndicators']:\n",
328+
"        if i['code'] == 'fs':\n",
329+
"            print('The Walgreens at ' + str(results[j]['store']['address']['street']) + ' offers flu shots.')"
330+
]
331+
},
332+
{
333+
"cell_type": "code",
334+
"execution_count": null,
335+
"metadata": {
336+
"collapsed": true
337+
},
338+
"outputs": [],
339+
"source": []
340+
}
341+
],
342+
"metadata": {
343+
"kernelspec": {
344+
"display_name": "Python 3",
345+
"language": "python",
346+
"name": "python3"
347+
},
348+
"language_info": {
349+
"codemirror_mode": {
350+
"name": "ipython",
351+
"version": 3
352+
},
353+
"file_extension": ".py",
354+
"mimetype": "text/x-python",
355+
"name": "python",
356+
"nbconvert_exporter": "python",
357+
"pygments_lexer": "ipython3",
358+
"version": "3.6.2"
359+
}
360+
},
361+
"nbformat": 4,
362+
"nbformat_minor": 2
363+
}
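
The notebook hard-codes one payload and one service code. As a minimal sketch of how those two steps could be generalized, the block below builds a payload of the same shape for any search and filters a `results` list by any `serviceIndicators` code. The helper names `build_payload` and `stores_with_service` are hypothetical and not part of the notebook. Reading `r` as a search radius, `s` as the number of stores returned, and `p` as a page number is an assumption based only on the payload shown in the notebook, not on any Walgreens documentation. The sample records are abbreviated copies of the first two stores in the notebook output, so the sketch runs without a network request.

```python
import json


def build_payload(query, lat, lng, page=1, page_size=15, radius=50):
    """Return a dict shaped like the request payload shown in the notebook.

    The meanings of radius ('r'), page_size ('s') and page ('p') are assumptions.
    """
    return {
        "q": query,
        "r": str(radius),
        "lat": lat,
        "lng": lng,
        "requestType": "locator",
        "s": str(page_size),
        "p": str(page),
    }


def stores_with_service(results, code):
    """Return the street address of every store whose serviceIndicators include `code`."""
    streets = []
    for result in results:
        store = result.get("store", {})
        indicators = store.get("serviceIndicators", [])
        if any(ind.get("code") == code for ind in indicators):
            streets.append(store["address"]["street"])
    return streets


# Abbreviated copies of the first two records from the notebook output.
sample_results = [
    {"store": {"address": {"street": "4467 DEVINE ST"},
               "serviceIndicators": [{"code": "t4hr", "name": "Store Open 24 hours"},
                                     {"code": "fs", "name": "Flu Shot"},
                                     {"code": "rb", "name": "Redbox"}]}},
    {"store": {"address": {"street": "1941 BLOSSOM ST"},
               "serviceIndicators": [{"code": "dt", "name": "Drive-Thru Pharmacy"},
                                     {"code": "fs", "name": "Flu Shot"}]}},
]

payload = build_payload(" Columbia, SC 29205", 33.9900337, -80.99815760000001)
body = json.dumps(payload)  # the string that would be passed as data= to requests.post

flu = stores_with_service(sample_results, "fs")
redbox = stores_with_service(sample_results, "rb")
print(flu)
print(redbox)
```

With a live response, `stores_with_service(data['results'], 'fs')` would play the role of the nested loop in the last code cell of the notebook. If the server accepts other values of `p`, calling `build_payload(..., page=2)` might be one way to request further results, but that behavior is untested here.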

WebData/images/RequestType.png

227 KB

WebData/images/UGA1980results.png

147 KB

WebData/images/WalgreenLocations.png

382 KB

WebData/images/WalgreensInspect.png

612 KB

WebData/images/YP_search.png

501 KB
