
Commit ec2765c

Merge pull request #611 from NavanshGoel/main
Adding the python file for scraping data from CarDekho.com
2 parents 223eac9 + d9ed7a8 commit ec2765c

File tree

3 files changed: +89 −0 lines changed

cardekho_scraper/README.md

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
# Description

Script to automatically scroll and scrape data from the Indian car website CarDekho.com using Selenium and BeautifulSoup.

# Python requirements

Python 3.0 or newer

## How to use it

Run the following command to install the necessary dependencies from the requirements.txt file. Then run the Python file to start the scraping. The output is stored in a CSV file with the necessary column names and corresponding data.

```
pip install -r requirements.txt
```
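After a run, the output CSV (the script writes `carScrap.csv`) can be sanity-checked with pandas. A minimal sketch; the single row of sample values below is made up purely to keep the snippet self-contained:

```python
import pandas as pd

# Stand-in for carScrap.csv: same columns the scraper writes, one fake row
pd.DataFrame({
    "Car Name": ["Tata Nexon"],
    "Price": ["Rs. 8.00 Lakh"],
    "Engine": ["1199 cc"],
    "Mileage": ["17.4 kmpl"],
}).to_csv("carScrap.csv", index=False, encoding="utf-8")

df = pd.read_csv("carScrap.csv")
print(df.columns.tolist())  # → ['Car Name', 'Price', 'Engine', 'Mileage']
```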
## Editor:

Use VS Code or a Jupyter/Spyder notebook to run the Python file. You can also create a virtual environment and then install the dependencies from requirements.txt.
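The virtual-environment route can be sketched as follows (a POSIX-shell sketch; the scraper's filename is not shown in this diff, so the run step is left as a commented placeholder, and a temp dir stands in for the project folder):

```shell
set -e
cd "$(mktemp -d)"              # stand-in for the cardekho_scraper/ folder
python3 -m venv .venv          # create an isolated environment
. .venv/bin/activate           # activate it (POSIX shell)
python -c 'import sys; print(sys.prefix)'   # now points inside .venv
# pip install -r requirements.txt   # then install the dependencies
# python <scraper file>.py          # and run the scraper
```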
Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
'''
Import the necessary libraries
'''
# !pip install selenium
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import time
import pandas as pd
from bs4 import BeautifulSoup as soup

'''
Define the browser/driver and open the desired webpage
'''
# Selenium 4 takes the driver path via a Service object
driver = webdriver.Chrome(
    service=Service('D:\\Softwares\\chromedriver_win32\\chromedriver.exe')
)
driver.get('https://www.cardekho.com/filter/new-cars')
'''
Keep scrolling automatically and extract the data from the webpage and store it
'''
for i in range(0, 20):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(1)
    driver.execute_script("window.scrollTo(0, \
        (document.body.scrollHeight)*0.73)")
    time.sleep(1)
res = driver.execute_script("return document.documentElement.outerHTML")
driver.quit()
psoup = soup(res, "lxml")
containers = psoup.findAll(
    "div", {"class": "gsc_col-md-12 gsc_col-sm-12 gsc_col-xs-12 append_list"}
)
cars = []
prices = []
engines = []
mileages = []
for i in containers:
    # Car name comes from the listing image's alt text; fall back to " "
    # so all four lists stay the same length for the DataFrame below
    img = i.div.img if i.div is not None else None
    cars.append(img["alt"] if img is not None else " ")
    price = i.findAll("div", {"class": "price"})
    q = price[0].text
    # Keep everything before the first "*" footnote marker
    s = ""
    for h in q:
        if h != "*":
            s += h
        else:
            break
    prices.append(s)
    m = i.findAll("div", {"class": "dotlist"})
    f = m[0].findAll("span", {"title": "Mileage"})
    if len(f) != 0:
        mileages.append(f[0].text)
    else:
        mileages.append(" ")
    e = m[0].findAll("span", {"title": "Engine Displacement"})
    if len(e) != 0:
        engines.append(e[0].text)
    else:
        engines.append(" ")
df = pd.DataFrame(
    {
        'Car Name': cars,
        'Price': prices,
        'Engine': engines,
        'Mileage': mileages
    }
)
df.to_csv('carScrap.csv', index=False, encoding='utf-8')
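The character-by-character loop above trims each price string at the first `*` footnote marker. The same step can be sketched as a small stand-alone helper (`clean_price` is a hypothetical name, not part of the script):

```python
def clean_price(raw: str) -> str:
    """Return the part of a price string before the first '*' footnote marker."""
    # split("*", 1)[0] mirrors the loop: copy characters until the first '*'
    return raw.split("*", 1)[0]

print(clean_price("Rs. 5.65 - 8.77 Lakh*"))  # → Rs. 5.65 - 8.77 Lakh
```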

cardekho_scraper/requirements.txt

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
selenium
pandas
lxml
bs4
