Intern: Omokhoa Oshose Tosayoname
Intern ID: CA/DF1/71570
Duration: 20th May 2026 – 20th June 2026
This project builds and evaluates multiple machine learning regression models to predict the selling price of used cars based on features such as present showroom price, kilometres driven, manufacturing year, fuel type, seller type, transmission, and number of previous owners. The dataset contains 301 used car listings spanning model years 2003 to 2018.
Business Question: What factors most strongly determine the resale value of a used car, and how accurately can we predict selling price from these features?
Data Loading --> EDA & Visualisation --> Feature Engineering & Preprocessing
--> Model Training --> Evaluation & Comparison --> Business Insights
CodeAlpha_CarPricePrediction/
├── data/
│ ├── car_data.csv # Raw dataset
│ └── *.png # All generated visualisations
├── notebooks/
│ └── car_price_prediction.ipynb # Main notebook (fully executed)
├── requirements.txt
└── README.md
| Feature | Type | Description |
|---|---|---|
| Car_Name | Categorical | Model name of the car |
| Year | Numeric | Year of manufacture |
| Selling_Price | Numeric | Price at which car was sold (Lakhs INR) — target |
| Present_Price | Numeric | Current ex-showroom price (Lakhs INR) |
| Driven_kms | Numeric | Total kilometres driven |
| Fuel_Type | Categorical | Petrol / Diesel / CNG |
| Selling_type | Categorical | Dealer / Individual |
| Transmission | Categorical | Manual / Automatic |
| Owner | Numeric | Number of previous owners |
The selling price is right-skewed, with most cars priced under 10 Lakhs INR. Log transformation brings it closer to a normal distribution.
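A minimal sketch of the log transformation described above, using a handful of made-up prices (illustrative values, not rows from `car_data.csv`). The `skewness` helper is a hypothetical name introduced here, and the notebook may use a different transform or library call.

```python
import math

def skewness(values):
    """Population skewness: the third standardized moment."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Illustrative right-skewed prices in Lakhs INR (made up for this example)
prices = [0.5, 1.2, 2.5, 3.0, 4.5, 5.25, 6.0, 9.5, 18.0, 35.0]

# log1p(x) = log(1 + x) stays finite even for prices close to zero
log_prices = [math.log1p(p) for p in prices]

raw_skew = skewness(prices)
log_skew = skewness(log_prices)

# expm1 inverts log1p, so values on the log scale map back to Lakhs INR
recovered = [math.expm1(v) for v in log_prices]
```

On these toy values the skewness after the transform is much closer to zero than before, which is the effect the distribution plot illustrates.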
Petrol and Manual transmission cars dominate the dataset. Dealer listings outnumber Individual ones.
Diesel cars command a higher median resale price than Petrol. Automatic transmission cars sell at a premium, and Dealer-listed cars are priced higher than Individual listings.
Selling price declines with car age. The depreciation is steepest in the first 5 years, stabilising for older models.
Present Price has the strongest positive correlation with Selling Price. Kilometres driven shows a weak negative relationship.
Premium and luxury models lead in average resale value, with significant spread across the dataset.
Listings peak around the 2015–2017 model years. As expected, newer cars command higher average prices.
All points below the diagonal represent depreciated vehicles. Diesel cars tend to retain value better relative to their showroom price.
Present Price has the highest correlation with Selling Price. Car Age and Driven_kms are negatively correlated with price.
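For reference, the correlations discussed above are Pearson coefficients in a typical heatmap. The sketch below computes one by hand, assuming Pearson correlation is what the notebook reports. The `pearson` helper and the toy values are hypothetical and are not taken from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Toy values for illustration only (prices in Lakhs INR, age in years)
present_price = [3.0, 5.0, 8.0, 12.0]
selling_price = [2.0, 3.5, 5.0, 9.0]
car_age = [12, 9, 10, 8]

r_price = pearson(present_price, selling_price)  # strongly positive here
r_age = pearson(car_age, selling_price)          # negative here
```

A value near +1 or -1 indicates a strong linear relationship, and a value near 0 indicates a weak one, which is how the heatmap cells are read.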
Engineered features added to improve model performance:
| Feature | Formula | Rationale |
|---|---|---|
| Car_Age | 2026 - Year | More interpretable than Year |
| KM_per_Year | Driven_kms / (Car_Age + 1) | Usage intensity per year |
| Age_x_KM | Car_Age × Driven_kms | Joint depreciation proxy |
| Price_per_KM | Present_Price / (Driven_kms + 1) | Value retention per km |
| Brand_Avg_Price | Mean Selling_Price per brand | Brand prestige encoding |
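The formulas in the table can be sketched with pandas as follows. This is an illustration under stated assumptions, not the notebook's actual code: the function name `add_engineered_features` is hypothetical, the brand is assumed to be the first word of `Car_Name`, and the three rows are made-up toy data.

```python
import pandas as pd

REFERENCE_YEAR = 2026  # reference year from the Car_Age formula in the table

def add_engineered_features(df, brand_avg=None):
    """Return a copy of df with the five engineered columns added.

    brand_avg: optional Series mapping brand -> mean Selling_Price.
    Passing means computed on the training split keeps test-set
    targets out of the Brand_Avg_Price feature.
    """
    out = df.copy()
    out["Car_Age"] = REFERENCE_YEAR - out["Year"]
    out["KM_per_Year"] = out["Driven_kms"] / (out["Car_Age"] + 1)
    out["Age_x_KM"] = out["Car_Age"] * out["Driven_kms"]
    out["Price_per_KM"] = out["Present_Price"] / (out["Driven_kms"] + 1)

    # Assumption: the brand is the first word of Car_Name
    brand = out["Car_Name"].str.split().str[0].str.lower()
    if brand_avg is None:
        brand_avg = out.groupby(brand)["Selling_Price"].mean()
    # Brands missing from the lookup fall back to the mean of the lookup
    out["Brand_Avg_Price"] = brand.map(brand_avg).fillna(brand_avg.mean())
    return out

# Toy rows for illustration only (not taken from car_data.csv)
cars = pd.DataFrame({
    "Car_Name": ["alpha one", "alpha two", "beta x"],
    "Year": [2016, 2014, 2018],
    "Selling_Price": [4.0, 3.0, 9.0],
    "Present_Price": [6.0, 5.5, 12.0],
    "Driven_kms": [30000, 45000, 9999],
})

features = add_engineered_features(cars)
```

Because `Brand_Avg_Price` is derived from the target, computing it on the full dataset before splitting lets test-set prices influence the feature. One way to avoid that is to compute the brand means on the training split only and pass them in through `brand_avg` when transforming the test split.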
| Model | Features Used |
|---|---|
| Linear Regression | All 11 features |
| Ridge Regression | All 11 features |
| Lasso Regression | All 11 features |
| Random Forest | All 11 features |
| Gradient Boosting | All 11 features |
| XGBoost | All 11 features |
| Model | R² | RMSE (Lakhs INR) | MAE (Lakhs INR) |
|---|---|---|---|
| Gradient Boosting | 0.9697 | 0.8355 | 0.5060 |
| XGBoost | 0.9654 | 0.8925 | 0.5522 |
| Random Forest | 0.9472 | 1.1030 | 0.6678 |
| Linear Regression | 0.8493 | 1.8629 | 1.1484 |
| Lasso Regression | 0.8488 | 1.8663 | 1.1401 |
| Ridge Regression | 0.8468 | 1.8785 | 1.1540 |
Best model: Gradient Boosting (R² = 0.9697)
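The three metrics in the table follow standard definitions, sketched here in plain Python. The `regression_metrics` helper is a hypothetical name and the numbers are made up for illustration; they are not model output from this project.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (r2, rmse, mae) for paired actual and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n

    ss_res = sum(e ** 2 for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares

    r2 = 1 - ss_res / ss_tot          # share of variance explained
    rmse = math.sqrt(ss_res / n)      # penalises large errors more heavily
    mae = sum(abs(e) for e in errors) / n  # average absolute error
    return r2, rmse, mae

# Illustrative values in Lakhs INR (made up, not from the notebook)
actual = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]

r2, rmse, mae = regression_metrics(actual, predicted)
```

Assuming the errors are reported on the original price scale, RMSE and MAE are in the same unit as `Selling_Price`, so the best model's MAE of 0.5060 corresponds to an average miss of about 0.5 Lakhs INR per car.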
The best models cluster tightly around the perfect prediction line, with minimal scatter at higher price points.
Residuals are scattered randomly around zero, which suggests a good fit with no obvious systematic bias.
Both ensemble models agree: Present Price and Brand_Avg_Price are the dominant predictors.
The 95% confidence band shows the model's prediction range across the test set, sorted by actual price.
- Present showroom price is the strongest single predictor of resale value across all models.
- Diesel cars retain value better than Petrol; Automatic transmission cars command a premium.
- First-owner cars (Owner = 0) retain significantly more value than second or third-owner vehicles.
- Car age is a key depreciation driver; the steepest drop occurs within the first 5 years.
- Brand prestige (encoded via brand average price) is highly informative for tree-based models.
- Ensemble models (Gradient Boosting, XGBoost, Random Forest) significantly outperform linear models on this dataset, suggesting non-linear relationships between features and price.

To run the notebook locally:

- Clone the repository: `git clone https://github.com/Tosa9/CodeAlpha_CarPricePrediction.git`, then `cd CodeAlpha_CarPricePrediction`
- Install dependencies: `pip install -r requirements.txt`
- Launch the notebook: `jupyter notebook notebooks/car_price_prediction.ipynb`

Car Price Prediction Dataset — Kaggle
CodeAlpha Data Science Internship | Task 3
#CodeAlpha #DataScience #MachineLearning #CarPricePrediction #XGBoost #Python