Predicting FORD Trucks with R & Python (TensorFlow)

A lost game?

Pere Fuster
4 min readApr 10, 2021

Abstract

This article offers practical guidance on applying Statistical Learning with both R and Python (via TensorFlow). It was first proposed as a deliverable for the subject Statistical Learning, taught by Dr. Javier Nogales at UC3M.

Keywords: TensorFlow, R, Caret, Deep Learning, Neural Networks, Stacking, Bagging, Gradient boosting, Bayesian Classifiers, Support Vector Machines.

Introduction & Set up

This article seeks to show how to undertake a data science problem. In this case, the task is to predict whether a truck involved in a crash in the city of Chicago is a Ford truck or not.

This is a very challenging case: how can you anticipate the brand of a truck based only on traffic data? Let's see to what extent we can.

In my view, analytics projects have eight stages:

  1. First discussion: here, as analysts, we need to understand the question fully and why/how it'll impact our organisation.
  2. Make a proposal: we also have to understand the data available (and the information available, since it's not all about data), consider several options, and understand their trade-offs.
  3. Project kick-off: agree on the scope, deliverables and timeline with all stakeholders.
  4. Exploratory analysis: inspect, clean and visualise the data.
  5. Modelling: build and tune candidate models.
  6. Testing: validate the models on data they haven't seen.
  7. Deployment: put the chosen model into production.
  8. Testing and maintenance: monitor performance over time and retrain when needed.

Business Case

Here we’re asked to develop a model for the compulsory subject of Statistical Learning in the Master of Statistics at UC3M. This course is taught by Javier Nogales.

Data Inspection

Most trucks in the data are branded FORD, which wasn't expected.

The data comes from tables belonging to the Chicago Data Portal. The first dataset, Vehicles, contains information about vehicles (or units, as they are identified in crash reports) involved in a traffic crash, whilst the Crashes dataset shows information about each traffic crash (the sampling units) on city streets within the City of Chicago limits and under the jurisdiction of the Chicago Police Department (CPD). This dataset contains every reported traffic crash event, regardless of the statute of limitations.

All the information about these datasets can be found on the Chicago Data Portal website:

• Vehicles dataset (945K rows x 72 columns): https://data.cityofchicago.org/Transportation/Traffic-Crashes-Vehicles/68nd-jvt3

• Crashes dataset (463K rows x 49 columns): https://data.cityofchicago.org/Transportation/Traffic-Crashes-Crashes/85ca-t3if
These tables are joined for the analytical application.

I joined the tables with a left join on the crashes table's primary key (CRASH_RECORD_ID).
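On a small toy example, a left join of this kind can be sketched as follows (the frames and the WEATHER_CONDITION column are illustrative stand-ins, not the real Chicago data; `indicator=True` is an optional extra that shows how many vehicle rows found a matching crash):

```python
import pandas as pd

# Toy stand-ins for the Vehicles and Crashes tables (illustrative data only)
vehicles = pd.DataFrame({
    'CRASH_RECORD_ID': ['a', 'a', 'b', 'c'],
    'VEHICLE_TYPE': ['TRUCK - SINGLE UNIT'] * 4,
})
crashes = pd.DataFrame({
    'CRASH_RECORD_ID': ['a', 'b'],
    'WEATHER_CONDITION': ['CLEAR', 'RAIN'],
})

# Left join: every vehicle row is kept; crash details attach where the key matches
df = pd.merge(vehicles, crashes, how='left', on='CRASH_RECORD_ID', indicator=True)
print(df['_merge'].value_counts())  # 'left_only' rows had no matching crash
```

Vehicle 'c' keeps its row but gets a missing WEATHER_CONDITION, which is exactly the behaviour we want: no crash-report unit is silently dropped.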

I started in Python, but given the massive amount of data and my limited expertise, time flew and I decided to do the statistical part in R.

Exploratory Data Analysis and Second-Stage Data Processing

Here, we start looking at what's inside the data and making decisions about which variables we want to keep. For example, besides identifying variables that are constant or lack analytical meaning, we will also look at the possible correlation between the predictors and the response.

import pandas as pd
import numpy as np

vehicles = pd.read_csv("C:/Users/Carmen/OneDrive/Archivos - Todo/1 - Master Statistics/Period 2/Traffic_Crashes_-_Vehicles.csv")
crashes = pd.read_csv("C:/Users/Carmen/OneDrive/Archivos - Todo/1 - Master Statistics/Period 2/Traffic_Crashes_-_Crashes.csv", sep=',')

# Keep only single-unit trucks
vehicles = vehicles[(vehicles['VEHICLE_TYPE'] == 'TRUCK - SINGLE UNIT') |
                    (vehicles['VEHICLE_TYPE'] == 'SINGLE UNIT TRUCK WITH TRAILER')]
vehicles.tail(2)

# Impute missing axle counts with a mean that "burns" the first and last 10% of rows
burn = round(vehicles.shape[0] * 0.1)
b = vehicles['AXLE_CNT'][burn:]
b = b[: round(b.shape[0] - burn)]
vehicles['AXLE_CNT'] = vehicles['AXLE_CNT'].fillna(np.mean(b))
del burn, b

# Bin TOTAL_VEHICLE_LENGTH into length categories ('Address' is a temporary column)
vehicles['Address'] = np.where(vehicles['TOTAL_VEHICLE_LENGTH'] > 60, 4, 0)
vehicles['Address'] = np.where(vehicles['TOTAL_VEHICLE_LENGTH'] <= 60, 3, vehicles['Address'])
vehicles['Address'] = np.where(vehicles['TOTAL_VEHICLE_LENGTH'] <= 48, 2, vehicles['Address'])
vehicles['Address'] = np.where(vehicles['TOTAL_VEHICLE_LENGTH'] <= 24, 1, vehicles['Address'])

# Quick look at the data before merging
vehicles['AREA_11_I'].value_counts()
vehicles.info()

# Left join vehicles with crashes on the crashes table's primary key
df = pd.merge(vehicles, crashes, how='left', on='CRASH_RECORD_ID')
df = df[(df['VEHICLE_TYPE'] == 'TRUCK - SINGLE UNIT') |
        (df['VEHICLE_TYPE'] == 'SINGLE UNIT TRUCK WITH TRAILER')]
del vehicles
del crashes

# Keep only columns that are at most 85% missing
data = df.loc[:, (df.isnull().sum(axis=0) / df.shape[0] <= 0.85)]

# Drop the sparse AREA_* columns
areas = data.columns[data.columns.str.startswith('AREA')]
data.drop(areas, axis=1, inplace=True)
del areas

# Drop identifiers and columns with no analytical meaning
data = data.drop(['REPORT_TYPE', 'RD_NO_x', 'VEHICLE_ID', 'STREET_NAME', 'STREET_NO',
                  'CRASH_UNIT_ID', 'CRASH_RECORD_ID', 'DEVICE_CONDITION',
                  'ALIGNMENT', 'CRASH_DATE'], axis=1)

# Check whether any constant columns remain
data.columns[data.nunique() <= 1]
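With the frame cleaned, the binary response can be built from the vehicle make. A minimal sketch, assuming the make column is called MAKE (the `sample` frame below is illustrative, not the real data):

```python
import pandas as pd

# Illustrative sample; in practice this would come from the cleaned 'data' frame
sample = pd.DataFrame({'MAKE': ['FORD', 'FORD', 'CHEVROLET', 'FREIGHTLINER', 'FORD']})

# Binary target: 1 if the truck is a Ford, 0 otherwise
sample['is_ford'] = (sample['MAKE'] == 'FORD').astype(int)

# Class balance: since most trucks are Fords, raw accuracy alone is misleading,
# and a baseline that always predicts "FORD" is the score to beat
print(sample['is_ford'].mean())
```

Checking this proportion first is what makes the earlier observation (most trucks are branded FORD) actionable for model evaluation.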

Helpful Resources:

  • TensorFlow for Dummies
