Commit be71350

scrapping
1 parent 241ca3b commit be71350

2 files changed: +84 -12 lines changed


22_Day_Web_scraping/22_web_scraping.md

+11 -12
@@ -9,10 +9,8 @@
 
 <sub>Author:
 <a href="https://www.linkedin.com/in/asabeneh/" target="_blank">Asabeneh Yetayeh</a><br>
-<small> First Edition: Nov 22 - Dec 22, 2019</small>
+<small> Second Edition: July, 2021</small>
 </sub>
-
-</div>
 </div>
 
 [<< Day 21](../21_Day_Classes_and_objects/21_classes_and_objects.md) | [Day 23 >>](../23_Day_Virtual_environment/23_virtual_environment.md)
@@ -36,28 +34,28 @@ Web scraping is the process of extracting and collecting data from websites and
 
 In this section, we will use beautifulsoup and requests package to scrape data. The package version we are using is beautifulsoup 4.
 
-To start scraping websites you need _requests_, _beautifoulSoup4_ and _website_.
+To start scraping websites you need _requests_, _beautifoulSoup4_ and a _website_.
 
 ```sh
 pip install requests
 pip install beautifulsoup4
 ```
 
-To scrape data from websites, basic understanding of HTML tags and css selectors is needed. We target content from a website using HTML tags, classes or/and ids.
-Let's import the requests and BeautifulSoup module
+To scrape data from websites, basic understanding of HTML tags and CSS selectors is needed. We target content from a website using HTML tags, classes or/and ids.
+Let us import the requests and BeautifulSoup module
 
 ```py
 import requests
 from bs4 import BeautifulSoup
 ```
 
-Let's declare url variable for the website which we are going to scrape.
+Let us declare url variable for the website which we are going to scrape.
 
 ```py
 
 import requests
 from bs4 import BeautifulSoup
-url = 'http://mlr.cs.umass.edu/ml/datasets.html'
+url = 'https://archive.ics.uci.edu/ml/datasets.php'
 
 # Lets use the requests get method to fetch the data from url
 
@@ -76,7 +74,7 @@ Using beautifulSoup to parse content from the page
 ```py
 import requests
 from bs4 import BeautifulSoup
-url = 'http://mlr.cs.umass.edu/ml/datasets.html'
+url = 'https://archive.ics.uci.edu/ml/datasets.php'
 
 response = requests.get(url)
 content = response.content # we get all the content from the website
@@ -97,12 +95,13 @@ for td in table.find('tr').find_all('td'):
 If you run this code, you can see that the extraction is half done. You can continue doing it because it is part of exercise 1.
 For reference check the [beautifulsoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start)
 
-🌕 You are so special, you are progressing everyday.You are left with only eight days to your way to greatness. Now do some exercises for your brain and for your muscle.
+🌕 You are so special, you are progressing everyday. You are left with only eight days to your way to greatness. Now do some exercises for your brain and muscles.
 
 ## 💻 Exercises: Day 22
 
-1. Extract the table in this url (http://mlr.cs.umass.edu/ml/datasets.html) and change it to a json file
-2. Scrape the presidents table and store the data as json(https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States)
+1. Scrape the following website and store the data as json file(url = 'http://www.bu.edu/president/boston-university-facts-stats/').
+1. Extract the table in this url (https://archive.ics.uci.edu/ml/datasets.php) and change it to a json file
+2. Scrape the presidents table and store the data as json(https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States). The table is not very structured and the scrapping may take very long time.
 
 🎉 CONGRATULATIONS ! 🎉
 
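Editor's note on the exercises in the diff above: the "change it to a json file" step could be sketched roughly as follows. This is a minimal sketch, not the author's solution. It assumes the cell text has already been pulled out of the page (for example with `find_all('td')` and `.text`) into plain Python lists. The `header` and `rows` values and the `rows_to_records` helper are hypothetical placeholders, not data from any of the sites named in the exercises.

```python
import json

# Hypothetical cell text standing in for what a scraper would collect;
# these column names and values are placeholders, not real scraped data.
header = ["Name", "Data Types", "Instances"]
rows = [
    ["Example A ", "Multivariate", "100"],
    ["Example B", " Text", "250"],
]


def rows_to_records(header, rows):
    """Pair each row's cells with the column names, one dict per row."""
    return [dict(zip(header, (cell.strip() for cell in row))) for row in rows]


records = rows_to_records(header, rows)

# json.dumps returns a string; json.dump(records, file_object) would write
# the same structure to an already opened file instead.
print(json.dumps(records, indent=2))
```

Keeping the conversion in a small function separates the scraping step from the serialization step, so the same helper could be reused for any table once its cells are available as lists.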
+73
@@ -0,0 +1,73 @@
+[
+{
+"category": "Community",
+"Student Body": "34,589",
+"Living Alumni": "398,195",
+"Total Employees": "10,517",
+"Faculty": "4,171",
+"Nondegree Students": "2,008",
+"Graduate & Professional Students": "15,645",
+"Undergraduate Students": "16,936"
+},
+{
+"category": "Campus",
+"Classrooms": "834",
+"Buildings": "370",
+"Laboratories": "1,681",
+"Libraries": "21",
+"Campus Area (acres)": "169"
+},
+{
+"category": "Academics",
+"Study Abroad Programs": "70+",
+"Average Class Size": "27",
+"Faculty": "4,171",
+"Student/Faculty Ratio": "10:1",
+"Schools and Colleges": "17",
+"Programs of Study": "300+"
+},
+{
+"category": "Grant & Contract Awards",
+"Research Awards": "$574.1M",
+"BMC Clinical Research Grants": "$88.0M"
+},
+{
+"category": "Undergraduate Financial Aid & Scholarships",
+"Average Total Need-Based Financial Aid": "$46,252",
+"Average Need-Based Grant/Scholarship": "$40,969",
+"Grants & Scholarships (need-based)": "$275.6M",
+"Grants & Scholarships (non-need-based)": "$28.7M"
+},
+{
+"category": "Student Life",
+"Community Service Hours": "1.6M+",
+"Alternative Service Breaks Participants": "300+",
+"BU on Social": "new accounts daily",
+"Cultural & Religious Organizations": "60+",
+"Community Service & Justice Organizations": "80+",
+"Academic & Professional Organizations": "120+",
+"Art & Performance Organizations": "60+",
+"Student Organizations": "450+",
+"First-Year Student Outreach Project Volunteers": "800+"
+},
+{
+"category": "Research",
+"Faculty Publications": "6,000+",
+"Student UROP Participants": "450+",
+"Centers & Institutes": "130+"
+},
+{
+"category": "International Community",
+"Global Initiatives": "300+",
+"Cultural Student Groups": "40+",
+"Alumni Countries": "180+",
+"International Students": "11,000+"
+},
+{
+"category": "Athletics",
+"Intramural Sports & Tournaments": "15+",
+"Club and Intramural Sports Participants": "7,000+",
+"Club Sports Teams": "50",
+"Varsity Sports": "24"
+}
+]
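Editor's note on the JSON file above: every value is stored as a string, including plain counts such as "1,681". A reader who wants to compute with these values might convert the plain counts to integers and leave entries like "300+" or "10:1" untouched. The snippet below is a minimal sketch of that idea using two abridged records copied from the file. The `to_number` helper is a hypothetical name introduced here and is not part of the commit.

```python
import json

# Two records copied (abridged) from the JSON file added in this commit.
raw = """
[
  {"category": "Campus", "Classrooms": "834", "Buildings": "370", "Laboratories": "1,681"},
  {"category": "Academics", "Average Class Size": "27", "Student/Faculty Ratio": "10:1", "Programs of Study": "300+"}
]
"""


def to_number(value):
    """Return an int for plain counts such as "1,681"; otherwise return the string unchanged."""
    digits = value.replace(",", "")
    return int(digits) if digits.isdecimal() else value


records = json.loads(raw)

# Keep the category label as text and convert every other value where possible.
cleaned = [
    {key: (value if key == "category" else to_number(value)) for key, value in record.items()}
    for record in records
]

print(json.dumps(cleaned, indent=2))
```

Values with a suffix or unit ("300+", "$574.1M", "10:1") are deliberately left as strings, since turning them into numbers would require a decision about what the suffix means.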
