Car Price Prediction in Python with Proxies

Data is the fuel that powers machine learning models. When building a model to predict used car prices, we need a rich dataset with details on various car makes, models, specs, etc. Web scraping classifieds sites can provide diverse and up-to-date data, but sites often block scrapers. Using proxies is key for reliable and scalable web scraping.

In this guide, I'll demonstrate how to leverage proxies to scrape used car listings data and then use it to train price prediction models in Python.

Why Proxies are Crucial for Web Scraping

Web scraping without proxies has major downsides:

  • Blocks – Sites can identify scrapers by IP address and block your requests. This halts data collection.
  • Limited scale – Sites typically tolerate only so many requests per minute from a single IP; exceeding that risks a block. Proxies let you spread requests across many IPs and scale scraping massively.
  • No customization – It's hard to test different scraping configurations from one IP address. Proxies let you customize cookies, user agents, etc.

Proxies solve all of these issues by routing your requests through intermediate proxy servers with unique IPs. This masks scrapers and enables customization and massive scale.
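
To make the routing concrete, here's a minimal sketch using the requests library. The proxy URL below is a placeholder; substitute your provider's real endpoint and credentials:

import requests

# Placeholder proxy URL -- use your provider's actual endpoint and credentials
proxy_url = 'http://USERNAME:PASSWORD@proxy.example.com:8000'

response = requests.get(
    'https://httpbin.org/ip',
    proxies={'http': proxy_url, 'https': proxy_url},
    timeout=10,
)
print(response.json())  # reports the proxy's IP, not your own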

Top proxy services like Bright Data provide reliable proxies tailored for web scraping. Their networks maintain large pools of fresh residential IPs that mimic real users, making detection unlikely, and they handle proxy rotation automatically behind the scenes.

Scraping Used Cars Listings

For this demonstration, we'll scrape used car listings from Mobile.de, Germany's largest car classifieds platform.

Here's how to integrate Bright Data proxies into a Python scraper with Scrapy. Bright Data exposes its rotating pool through a single proxy endpoint; the host, port, and zone credentials below are placeholders, so copy the real values from your account dashboard:

import scrapy
from scrapy.crawler import CrawlerProcess

# Placeholder credentials -- copy the real endpoint and zone login
# from your Bright Data dashboard
PROXY_URL = 'http://brd-customer-CUSTOMER_ID-zone-ZONE:PASSWORD@brd.superproxy.io:22225'

class CarSpider(scrapy.Spider):
    name = 'cars'

    def start_requests(self):
        yield scrapy.Request(
            url='https://www.mobile.de/cars/used/page-1/',
            meta={'proxy': PROXY_URL},
        )

    # rest of spider code

Bright Data rotates the IPs behind this single endpoint automatically: each request can exit from a different residential address, so scraping keeps running around the clock without IP blocks.

Key points:

  • Integrate a proxy by passing it in the meta dict of each Scrapy Request
  • Authenticate with your Bright Data zone credentials in the proxy URL
  • IPs rotate automatically on the provider side to prevent blocks
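
To turn each listings page into dataset rows, the spider also needs a parse callback inside CarSpider. The CSS selectors below are hypothetical placeholders; inspect Mobile.de's live markup to find the real ones:

    def parse(self, response):
        # Hypothetical selectors -- Mobile.de's real markup will differ
        for listing in response.css('div.listing'):
            yield {
                'make': listing.css('.make::text').get(),
                'model': listing.css('.model::text').get(),
                'year': listing.css('.year::text').get(),
                'mileage': listing.css('.mileage::text').get(),
                'price': listing.css('.price::text').get(),
            }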

After scraping thousands of listings, we can aggregate the data into a clean CSV. Here's a sample:

make,model,year,mileage,fuelType,gearbox,powerHP,price
volkswagen,golf,2018,41000,petrol,automatic,150,21790
bmw,series-3,2019,19560,diesel,automatic,150,31990
mercedes-benz,e-class,2017,125000,diesel,automatic,245,18490
audi,a3,2020,16900,petrol,automatic,150,31490

Now we can load this into a pandas DataFrame to train ML models.

Exploratory Data Analysis

Let's start by importing pandas and matplotlib:

import pandas as pd
from matplotlib import pyplot as plt

Then read in the CSV and check out the data:

df = pd.read_csv('cars.csv')
print(df.head())
print(df.info())

There are 15,000 rows of used car data with details like make, model, mileage, horsepower, and price.

Next we can create some visualizations to explore the data:

plt.scatter(df.mileage, df.price)
plt.title("Mileage vs Price")
plt.show()

plt.scatter(df.powerHP, df.price)
plt.title("Horsepower vs Price")
plt.show()

These plots reveal basic relationships between the features and price, which we can validate with correlation analysis.
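
One quick way to quantify these trends is to check how each numeric column correlates with price:

# Correlation with price (closer to +/-1 means a stronger linear relationship)
print(df.select_dtypes(include='number').corr()['price'].sort_values(ascending=False))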

Data Cleaning

Before model training, we need to clean the data:

Handle missing values:

df = df.dropna()

Remove outliers, computing z-scores on the numeric columns only:

import numpy as np
from scipy import stats

numeric = df.select_dtypes(include=np.number)
z = np.abs(stats.zscore(numeric))
df = df[(z < 3).all(axis=1)]

Encode categoricals (make and model are text columns too, so they must be encoded before modeling):

df = pd.get_dummies(df, columns=['make', 'model', 'fuelType', 'gearbox'])

Training Price Prediction Models

With clean data in hand, we're ready to train some models.

First, we'll split the data into training and test sets:

from sklearn.model_selection import train_test_split

X = df.drop('price', axis=1) 
y = df.price
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Then we can train a baseline Linear Regression model:

from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train, y_train)

And evaluate it on the test data:

from sklearn.metrics import r2_score

predictions = lr.predict(X_test)
r2 = r2_score(y_test, predictions)
print(f"R-squared: {r2}") # R-squared: 0.82

We get an R-squared of 0.82, meaning the model explains 82% of the variance in car prices.
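
R-squared measures explained variance; for an error figure in actual euros, we can also compute the mean absolute error as an optional extra check:

from sklearn.metrics import mean_absolute_error

# Average absolute gap between predicted and true prices, in euros
mae = mean_absolute_error(y_test, predictions)
print(f"MAE: {mae:.0f} EUR")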

Next, we can try a more complex Gradient Boosting Regressor:

from sklearn.ensemble import GradientBoostingRegressor

gbr = GradientBoostingRegressor()  
gbr.fit(X_train, y_train)  

predictions = gbr.predict(X_test)
r2 = r2_score(y_test, predictions)
print(f"R-squared: {r2}") # R-squared: 0.91

The Gradient Boosting model improves R-squared substantially to 0.91, a very strong score.
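
As a quick sanity check on what drives the predictions, we can inspect the fitted model's feature importances:

# Rank features by how much the boosted trees rely on them
importances = pd.Series(gbr.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))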

Proxies made it possible to scrape a large, up-to-date dataset without blocks, and that richer data is what drove the strong model accuracy.
