<small> First Edition: Nov 22 - Dec 22, 2019</small>
<small> Second Edition: July, 2021</small>
</sub>
</div>
[<< Day 21](../21_Day_Classes_and_objects/21_classes_and_objects.md) | [Day 23 >>](../23_Day_Virtual_environment/23_virtual_environment.md)
Web scraping is the process of extracting and collecting data from websites and storing it on a local machine or in a database.
In this section, we will use the beautifulsoup and requests packages to scrape data. The package version we are using is beautifulsoup 4.
To start scraping websites you need _requests_, _beautifulSoup4_ and a _website_.
```sh
pip install requests
pip install beautifulsoup4
```
To scrape data from websites, a basic understanding of HTML tags and CSS selectors is needed. We target content from a website using HTML tags, classes and/or ids.
Let us import the requests and BeautifulSoup modules.
```py
import requests
from bs4 import BeautifulSoup
```
Let us declare a url variable for the website which we are going to scrape.
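The full example code is not shown in this excerpt. Below is a minimal sketch of the request-and-parse steps, assuming the Boston University facts and stats page from exercise 1 as the target and assuming the page contains at least one HTML table; the `table` and row handling are illustrative assumptions, not the exact original example.

```py
import requests
from bs4 import BeautifulSoup

# Assumption: we target the Boston University facts and stats page from exercise 1
url = 'http://www.bu.edu/president/boston-university-facts-stats/'

response = requests.get(url)
content = response.content  # we get all the content from the website
soup = BeautifulSoup(content, 'html.parser')

print(soup.title.get_text())  # the text inside the page <title> tag
print(response.status_code)   # 200 means the request succeeded

# Assumption: the page contains at least one <table>; print the cells of its first row
table = soup.find('table')
for td in table.find('tr').find_all('td'):
    print(td.text)
```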
If you run this code, you can see that the extraction is only half done. You can continue it yourself, as it is part of exercise 1.
For reference, check the [beautifulsoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start).
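The text above also mentions targeting content with CSS selectors. Here is a minimal, self-contained sketch of BeautifulSoup's `select()` and `select_one()` methods, using a hypothetical HTML snippet (not taken from any real page):

```py
from bs4 import BeautifulSoup

# Assumption: a small hypothetical HTML snippet, used only to illustrate CSS selectors
html = '<div id="main-header" class="facts-wrapper"><a href="https://example.com">a link</a></div>'
soup = BeautifulSoup(html, 'html.parser')

links = soup.select('a')                  # all <a> tags
wrappers = soup.select('.facts-wrapper')  # elements with class="facts-wrapper"
header = soup.select_one('#main-header')  # the single element with id="main-header"

print([a.get('href') for a in links])  # ['https://example.com']
print(header.name)                     # div
```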
🌕 You are so special, you are progressing every day. You are left with only eight days on your way to greatness. Now do some exercises for your brain and muscles.
## 💻 Exercises: Day 22
1. Scrape the following website and store the data as a json file (url = 'http://www.bu.edu/president/boston-university-facts-stats/'). A sketch of the json-saving step is shown after this list.
2. Extract the table in this url (https://archive.ics.uci.edu/ml/datasets.php) and change it to a json file.
3. Scrape the presidents table and store the data as json (https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States). The table is not very structured and the scraping may take a very long time.
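All three exercises ask you to store the scraped data as a json file. A minimal sketch of that last step using the standard json module, assuming you have already collected the rows into a list of dictionaries (the `rows` values and file name are placeholders, not real scraped data):

```py
import json

# Assumption: rows is a placeholder for the list of dictionaries you build while scraping
rows = [
    {'column_1': 'value 1', 'column_2': 'value 2'},
    {'column_1': 'value 3', 'column_2': 'value 4'},
]

# write the collected rows to a json file
with open('scraped_data.json', 'w', encoding='utf-8') as f:
    json.dump(rows, f, ensure_ascii=False, indent=4)
```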