
Commit 9a61d57

Merge pull request #804 from Gud-will/main
Automated Website Url scraper
2 parents 21a9d77 + 2e32d14 commit 9a61d57

File tree

3 files changed: +72 −0 lines changed


websiteurl_scraper/README.md

+24
@@ -0,0 +1,24 @@
<h1 align="center">Automated Url Scraper</h1>

<p>This Python script retrieves all the webpage links from a given URL.

A link can also be present inside a button or an action.
</p>

<h2>Libraries needed</h2>

<ul><h3>ssl</h3>
<p>This ships with Python's standard library, so no separate installation is needed.</p>
</ul>

<ul><h3>urllib</h3>
<p>This also ships with Python's standard library.</p></ul>

<ul><h3>BeautifulSoup4</h3>
<p>This can be installed by using pip install beautifulsoup4.</p>
</ul>

<h2>Inputs</h2>
<p>We need to give a valid website link as input.</p>

<h2>Outputs</h2>

<p>The program accesses the website and prints all the links present on the page.</p>
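
For illustration, a run might look like this (the example URL and the printed link are illustrative, not taken from the original commit; actual output depends on the page):

Enter your URL: https://example.com
Here are your final links
https://www.iana.org/domains/example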

websiteurl_scraper/requirements.txt

+12
@@ -0,0 +1,12 @@
We need three libraries in order for this script to work.

The 1st is ssl.
This ships with Python's standard library, so it does not need to be installed separately,
and it helps us to tackle website certificate issues.

The 2nd is urllib.
This also ships with Python's standard library,
and it helps us to access the URL.

The 3rd is bs4, which is BeautifulSoup.
This can be installed by using pip install beautifulsoup4.
It helps us to read the page and access its information.
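
As a side note, a conventional requirements.txt is a plain list of pip-installable packages rather than prose; a minimal sketch for this script would contain a single entry, since ssl and urllib ship with Python:

beautifulsoup4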

websiteurl_scraper/webUrlscraper.py

+36
@@ -0,0 +1,36 @@
# Using ssl and urllib.request to read the contents of the URL
# ssl helps us to skip certificate verification

import ssl
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup

# build an SSL context that skips certificate verification,
# so pages with broken or self-signed certificates still load
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

# getting in the website link
url = input("Enter your URL: ")
try:
    # trying to access the page; a browser-like User-Agent
    # avoids being blocked by sites that reject bare urllib requests
    request = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    page = urlopen(request, context=ctx).read()
    # using BeautifulSoup to parse the contents of the page
    soup = BeautifulSoup(page, 'html.parser')
    # find_all('a') returns every anchor tag on the page
    links = soup.find_all('a')
    # collecting the href attribute of each anchor that has one
    finalLinks = [link.get('href') for link in links if link.get('href')]
    print("Here are your final links")
    # printing the final completed list
    for link in finalLinks:
        print(link)
except Exception as e:
    print(str(e))
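
Many pages use relative hrefs (e.g. /about), which the script above would print as-is. A minimal sketch of how one might resolve them against the base URL with urllib.parse.urljoin (an addition for illustration, not part of the original commit; the base and hrefs below are hypothetical):

from urllib.parse import urljoin

base = "https://example.com/docs/"
hrefs = ["/about", "page2.html", "https://other.site/x"]

# urljoin resolves each href against the base URL: absolute URLs
# pass through unchanged, relative ones are resolved against base
absolute = [urljoin(base, h) for h in hrefs]
for link in absolute:
    print(link)
# -> https://example.com/about
#    https://example.com/docs/page2.html
#    https://other.site/x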
