Competition: The Ultimate Student Hunt

25 minute read

Github link to notebook

Hackathon - The Ultimate Student Hunt

Analytics Vidhya hosted a student-only hackathon over predicting the number of visitors to parks given a set amount of variables. This was my first data competition, and proved to be a huge learning experience.

Here’s a brief overview of the data from the contest rules listed on the website:

  • ID: Unique ID
  • Park_ID: Unique ID for Parks
  • Date: Calendar Date
  • Direction_Of_Wind: Direction of winds in degrees
  • Average_Breeze_Speed: Daily average Breeze speed
  • Max_Breeze_Speed: Daily maximum Breeze speed
  • Min_Breeze_Speed: Daily minimum Breeze speed
  • Var1: A continuous feature
  • Average_Atmospheric_Pressure: Daily average atmospheric pressure
  • Max_Atmospheric_Pressure: Daily maximum atmospheric pressure
  • Min_Atmospheric_Pressure: Daily minimum atmospheric pressure
  • Min_Ambient_Pollution: Daily minimum Ambient pollution
  • Max_Ambient_Pollution: Daily maximum Ambient pollution
  • Average_Moisture_In_Park: Daily average moisture
  • Max_Moisture_In_Park: Daily maximum moisture
  • Min_Moisture_In_Park: Daily minimum moisture
  • Location_Type: Location Type (1/2/3/4)
  • Footfall: The target variable, daily Footfall

Summary

This problem involved predicting the number of visitors (footfall) to parks on a given day with given conditions, which ultimately makes it a time series problem. Specifically, it provided ten years as a training set, and five years for the test set. This notebook is an annotated version of my final submission which ranked 13th on the leaderboard. On a side note, the hackathon was only open for nine days, so there is of course a lot of room for improvement in this notebook.

My process for this hackathon was as follows:

  1. Initial Exploration: All data projects should begin with an initial exploration to understand the data itself. I initially used the pandas profiling package, but excluded it from my final submission since it generates a lengthy report. I left both a quick df.describe() and df.head() to showcase summary statistics and an example of the data.
  2. Outliers: I created boxplots to look for outliers visually. This can be done mathematically when you are more familiar with the data (using methods such as interquartile ranges), but the distributions of the variables produced a significant amount of data points that would’ve been considered outliers with this methodology.
  3. Missing Values: I first sorted the dataframe by date and park ID, then used the msno package to visually examine missing values. After seeing fairly regular trends of missing values, I plotted histograms of the missing values by park IDs to see if I could fill them by linearly interpolating. After seeing that certain park IDs were completely missing some values, I built random forest models to impute them. This is a brute-force method that is CPU intensive, but was a trade-off for the limited time frame.
  4. Feature Engineering: This was almost non-existant in this competition due to the anonymity of the data. I used daily and weekly averages of the individual variables in both the end model and missing value imputation models.
  5. Model Building: I initially started by creating three models using random forests, gradient boosted trees, and AdaBoost. The gradient boosted trees model outperformed the other two, so I stuck with that and scrapped the other two.
  6. Hyperparameter Tuning: This was my first time using gradient boosted trees, so I took a trial-and-error approach by adjusting various parameters and running them through cross validation to see how differently they performed. I found that just adjusting the number of trees and max depth obtained the best results in this situation.
  7. Validation: I used both a holdout cross-validation and k-folds (with 10 folds) to check for overfitting. The hackathon also had a “solution checker” for your predicted values (specifically for the first two years of the test set - the final score of the competition was on the full five years of the test set, so it is very important to not overfit) that provided a score, which I used in combination with the cross validation results.

Here is the annotated code for my final submission:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.cross_validation import cross_val_score, train_test_split, KFold
from sklearn.preprocessing import Imputer

# For exploratory data analysis
import missingno as msno  # Visualizes missing values

%matplotlib inline
df = pd.read_csv('crime.csv')  # Training set - ignore the name

df_test = pd.read_csv('Test_pyI9Owa.csv')  # Testing set

Exploratory Data Analysis

df.describe()
ID Park_ID Direction_Of_Wind Average_Breeze_Speed Max_Breeze_Speed Min_Breeze_Speed Var1 Average_Atmospheric_Pressure Max_Atmospheric_Pressure Min_Atmospheric_Pressure Min_Ambient_Pollution Max_Ambient_Pollution Average_Moisture_In_Park Max_Moisture_In_Park Min_Moisture_In_Park Location_Type Footfall
count 1.145390e+05 114539.000000 110608.000000 110608.000000 110603.000000 110605.000000 106257.000000 74344.000000 74344.000000 74344.000000 82894.000000 82894.000000 114499.000000 114499.000000 114499.000000 114539.000000 114539.000000
mean 3.517595e+06 25.582596 179.587146 34.255340 51.704297 17.282553 18.802545 8331.545949 8356.053468 8305.692510 162.806138 306.555698 248.008970 283.917082 202.355331 2.630720 1204.217192
std 1.189083e+05 8.090592 85.362934 17.440065 22.068301 14.421844 38.269851 80.943971 76.032983 87.172258 90.869627 38.188020 28.898084 15.637930 46.365728 0.967435 248.385651
min 3.311712e+06 12.000000 1.000000 3.040000 7.600000 0.000000 0.000000 7982.000000 8037.000000 7890.000000 4.000000 8.000000 102.000000 141.000000 48.000000 1.000000 310.000000
25% 3.414820e+06 18.000000 111.000000 22.040000 38.000000 7.600000 0.000000 8283.000000 8311.000000 8252.000000 80.000000 288.000000 231.000000 279.000000 171.000000 2.000000 1026.000000
50% 3.517039e+06 26.000000 196.000000 30.400000 45.600000 15.200000 0.830000 8335.000000 8358.000000 8311.000000 180.000000 316.000000 252.000000 288.000000 207.000000 3.000000 1216.000000
75% 3.619624e+06 33.000000 239.000000 42.560000 60.800000 22.800000 21.580000 8382.000000 8406.000000 8362.000000 244.000000 336.000000 270.000000 294.000000 237.000000 3.000000 1402.000000
max 3.725639e+06 39.000000 360.000000 154.280000 212.800000 129.200000 1181.090000 8588.000000 8601.000000 8571.000000 348.000000 356.000000 300.000000 300.000000 300.000000 4.000000 1925.000000
df.head()
ID Park_ID Date Direction_Of_Wind Average_Breeze_Speed Max_Breeze_Speed Min_Breeze_Speed Var1 Average_Atmospheric_Pressure Max_Atmospheric_Pressure Min_Atmospheric_Pressure Min_Ambient_Pollution Max_Ambient_Pollution Average_Moisture_In_Park Max_Moisture_In_Park Min_Moisture_In_Park Location_Type Footfall
0 3311712 12 01-09-1990 194.0 37.24 60.8 15.2 92.1300 8225.0 8259.0 8211.0 92.0 304.0 255.0 288.0 222.0 3 1406
1 3311812 12 02-09-1990 285.0 32.68 60.8 7.6 14.1100 8232.0 8280.0 8205.0 172.0 332.0 252.0 297.0 204.0 3 1409
2 3311912 12 03-09-1990 319.0 43.32 60.8 15.2 35.6900 8321.0 8355.0 8283.0 236.0 292.0 219.0 279.0 165.0 3 1386
3 3312012 12 04-09-1990 297.0 25.84 38.0 7.6 0.0249 8379.0 8396.0 8358.0 272.0 324.0 225.0 261.0 192.0 3 1365
4 3312112 12 05-09-1990 207.0 28.88 45.6 7.6 0.8300 8372.0 8393.0 8335.0 236.0 332.0 234.0 273.0 183.0 3 1413

Outliers

We’ll do our outlier detection visually with box plots. Rather than determining outliers mathematically (such as using the interquartile range), we’ll simply look for any points that aren’t contiguous.

df_box = df.drop(['ID', 'Park_ID', 'Average_Atmospheric_Pressure', 'Max_Atmospheric_Pressure'
                  , 'Min_Atmospheric_Pressure', 'Footfall', 'Date'], axis = 1)
plt.figure(figsize = (20,10))
sns.boxplot(data=df_box)
<matplotlib.axes._subplots.AxesSubplot at 0x1774dfe6630>

Var1 seems to potentially have outliers, but since it is undefined, it is difficult to determine if these are anomalies or noisy/incorrect data. We’ll leave them for now.

df_box = df[['Average_Atmospheric_Pressure', 'Max_Atmospheric_Pressure'
                  , 'Min_Atmospheric_Pressure']]
plt.figure(figsize = (20,10))
sns.boxplot(data=df_box)
<matplotlib.axes._subplots.AxesSubplot at 0x1774f62a518>

Max atmospheric pressure (and by result, average atmospheric pressure) have a few non-contiguous values, but they don’t seem egregious enough to deal with for the time being.

# Converting date field to datetime and extracting date components
df['Date'] = pd.to_datetime(df['Date'])

df['Year'] = pd.DatetimeIndex(df['Date']).year
df['Month'] = pd.DatetimeIndex(df['Date']).month
df['Day'] = pd.DatetimeIndex(df['Date']).day
df['Week'] = pd.DatetimeIndex(df['Date']).week
df['WeekDay'] = pd.DatetimeIndex(df['Date']).dayofweek


# Repeating for the test set
df_test['Date'] = pd.to_datetime(df_test['Date'])

df_test['Year'] = pd.DatetimeIndex(df_test['Date']).year
df_test['Month'] = pd.DatetimeIndex(df_test['Date']).month
df_test['Day'] = pd.DatetimeIndex(df_test['Date']).day
df_test['Week'] = pd.DatetimeIndex(df_test['Date']).week
df_test['WeekDay'] = pd.DatetimeIndex(df_test['Date']).dayofweek


# Lastly, combining to use for building models to predict missing predictors 
df_full = df.append(df_test)

Missing Values

Since this is ultimately a time series problem, we’ll begin with sorting the values. Then, I’m going to use a useful package for visualizing missing values.

# Sorting by date and park
df = df.sort_values(['Date', 'Park_ID'], ascending=[1, 1])
df_full = df_full.sort_values(['Date', 'Park_ID'], ascending=[1, 1])
# Visualizing missing values
msno.matrix(df_full)

# Checking which Park IDs missing values occur in
plt.subplot(221)
df_full[df_full['Direction_Of_Wind'].isnull() == True]['Park_ID'].hist()
plt.subplot(222)
df_full[df_full['Average_Breeze_Speed'].isnull() == True]['Park_ID'].hist()
plt.subplot(223)
df_full[df_full['Max_Breeze_Speed'].isnull() == True]['Park_ID'].hist()
plt.subplot(224)
df_full[df_full['Min_Breeze_Speed'].isnull() == True]['Park_ID'].hist()
<matplotlib.axes._subplots.AxesSubplot at 0x1774f091550>

df_full[df_full['Var1'].isnull() == True]['Park_ID'].hist()
plt.title('Var1 Missing Park IDs')
<matplotlib.text.Text at 0x1774f6d1470>

plt.subplot(221)
df_full[df_full['Average_Atmospheric_Pressure'].isnull() == True]['Park_ID'].hist()
plt.subplot(222)
df_full[df_full['Max_Atmospheric_Pressure'].isnull() == True]['Park_ID'].hist()
plt.subplot(223)
df_full[df_full['Min_Atmospheric_Pressure'].isnull() == True]['Park_ID'].hist()
<matplotlib.axes._subplots.AxesSubplot at 0x1774f78fb70>

plt.subplot(221)
df_full[df_full['Max_Ambient_Pollution'].isnull() == True]['Park_ID'].hist()
plt.subplot(222)
df_full[df_full['Min_Ambient_Pollution'].isnull() == True]['Park_ID'].hist()
<matplotlib.axes._subplots.AxesSubplot at 0x1774e0245c0>

We can see here that most missing values are re-occurring in the same parks. This means we can’t interpolate our missing values, and filling with the mean/median/mode is to over-generalized, so we should build models to predict our missing values.

Msno has a heatmap that shows the co-occurrance of missing values, which will be helpful in determining how to construct our models.

# Co-occurrence of missing values
msno.heatmap(df)

Feature Engineering

Daily & Weekly Averages

Before building models to predict the missing values, we’ll begin with calculating daily and weekly averages across all parks to assist wth our predictions.

There is a lot of repetition here due to re-running the same code for both the training and testing set.

Training Set

# Gathering the daily averages of predictors

# Wind
avg_daily_breeze = df['Average_Breeze_Speed'].groupby(df['Date']).mean().to_frame().reset_index()  # Group by day
avg_daily_breeze.columns = ['Date', 'Avg_Daily_Breeze']  # Renaming the columns for the join
df = df.merge(avg_daily_breeze, how = 'left')  # Joining onto the original dataframe

max_daily_breeze = df['Max_Breeze_Speed'].groupby(df['Date']).mean().to_frame().reset_index()
max_daily_breeze.columns = ['Date', 'Max_Daily_Breeze']
df = df.merge(max_daily_breeze, how = 'left')

min_daily_breeze = df['Min_Breeze_Speed'].groupby(df['Date']).mean().to_frame().reset_index()
min_daily_breeze.columns = ['Date', 'Min_Daily_Breeze']
df = df.merge(min_daily_breeze, how = 'left')


# Var1
var1_daily = df['Var1'].groupby(df['Date']).mean().to_frame().reset_index()
var1_daily.columns = ['Date', 'Var1_Daily']
df = df.merge(var1_daily, how = 'left')


# Atmosphere & Pollution
avg_daily_atmo = df['Average_Atmospheric_Pressure'].groupby(df['Date']).mean().to_frame().reset_index()
avg_daily_atmo.columns = ['Date', 'Avg_Daily_Atmosphere']
df = df.merge(avg_daily_atmo, how = 'left')

max_daily_atmo = df['Max_Atmospheric_Pressure'].groupby(df['Date']).mean().to_frame().reset_index()
max_daily_atmo.columns = ['Date', 'Max_Daily_Atmosphere']
df = df.merge(max_daily_atmo, how = 'left')

min_daily_atmo = df['Min_Atmospheric_Pressure'].groupby(df['Date']).mean().to_frame().reset_index()
min_daily_atmo.columns = ['Date', 'Min_Daily_Atmosphere']
df = df.merge(min_daily_atmo, how = 'left')

max_daily_pollution = df['Max_Ambient_Pollution'].groupby(df['Date']).mean().to_frame().reset_index()
max_daily_pollution.columns = ['Date', 'Max_Daily_Pollution']
df = df.merge(max_daily_pollution, how = 'left')

min_daily_pollution = df['Min_Ambient_Pollution'].groupby(df['Date']).mean().to_frame().reset_index()
min_daily_pollution.columns = ['Date', 'Min_Daily_Pollution']
df = df.merge(min_daily_pollution, how = 'left')


# Moisture
avg_daily_moisture = df['Average_Moisture_In_Park'].groupby(df['Date']).mean().to_frame().reset_index()
avg_daily_moisture.columns = ['Date', 'Avg_Daily_moisture']
df = df.merge(avg_daily_moisture, how = 'left')

max_daily_moisture = df['Max_Moisture_In_Park'].groupby(df['Date']).mean().to_frame().reset_index()
max_daily_moisture.columns = ['Date', 'Max_Daily_moisture']
df = df.merge(max_daily_moisture, how = 'left')

min_daily_moisture = df['Min_Moisture_In_Park'].groupby(df['Date']).mean().to_frame().reset_index()
min_daily_moisture.columns = ['Date', 'Min_Daily_moisture']
df = df.merge(min_daily_moisture, how = 'left')
# Repeating with weekly averages of predictors

# Wind
avg_weekly_breeze = df['Average_Breeze_Speed'].groupby((df['Year'], df['Week'])).mean().to_frame().reset_index()
avg_weekly_breeze.columns = ['Year', 'Week', 'Avg_Weekly_Breeze']
df = df.merge(avg_weekly_breeze, how = 'left')

max_weekly_breeze = df['Max_Breeze_Speed'].groupby((df['Year'], df['Week'])).mean().to_frame().reset_index()
max_weekly_breeze.columns = ['Year', 'Week', 'Max_Weekly_Breeze']
df = df.merge(max_weekly_breeze, how = 'left')

min_weekly_breeze = df['Min_Breeze_Speed'].groupby((df['Year'], df['Week'])).mean().to_frame().reset_index()
min_weekly_breeze.columns = ['Year', 'Week', 'Min_Weekly_Breeze']
df = df.merge(min_weekly_breeze, how = 'left')


# Var 1
var1_weekly = df['Var1'].groupby((df['Year'], df['Week'])).mean().to_frame().reset_index()
var1_weekly.columns = ['Year', 'Week', 'Var1_Weekly']
df = df.merge(var1_weekly, how = 'left')


# Atmosphere & Pollution
avg_weekly_atmo = df['Average_Atmospheric_Pressure'].groupby((df['Year'], df['Week'])).mean().to_frame().reset_index()
avg_weekly_atmo.columns = ['Year', 'Week', 'Avg_Weekly_Atmosphere']
df = df.merge(avg_weekly_atmo, how = 'left')

max_weekly_atmo = df['Max_Atmospheric_Pressure'].groupby((df['Year'], df['Week'])).mean().to_frame().reset_index()
max_weekly_atmo.columns = ['Year', 'Week', 'Max_Weekly_Atmosphere']
df = df.merge(max_weekly_atmo, how = 'left')

min_weekly_atmo = df['Min_Atmospheric_Pressure'].groupby((df['Year'], df['Week'])).mean().to_frame().reset_index()
min_weekly_atmo.columns = ['Year', 'Week', 'Min_Weekly_Atmosphere']
df = df.merge(min_weekly_atmo, how = 'left')

max_weekly_pollution = df['Max_Ambient_Pollution'].groupby((df['Year'], df['Week'])).mean().to_frame().reset_index()
max_weekly_pollution.columns = ['Year', 'Week', 'Max_Weekly_Pollution']
df = df.merge(max_weekly_pollution, how = 'left')

min_weekly_pollution = df['Min_Ambient_Pollution'].groupby((df['Year'], df['Week'])).mean().to_frame().reset_index()
min_weekly_pollution.columns = ['Year', 'Week', 'Min_Weekly_Pollution']
df = df.merge(min_weekly_pollution, how = 'left')


# Moisture
avg_weekly_moisture = df['Average_Moisture_In_Park'].groupby((df['Year'], df['Week'])).mean().to_frame().reset_index()
avg_weekly_moisture.columns = ['Year', 'Week', 'Avg_Weekly_Moisture']
df = df.merge(avg_weekly_moisture, how = 'left')

max_weekly_moisture = df['Max_Moisture_In_Park'].groupby((df['Year'], df['Week'])).mean().to_frame().reset_index()
max_weekly_moisture.columns = ['Year', 'Week', 'Max_Weekly_Moisture']
df = df.merge(max_weekly_moisture, how = 'left')

min_weekly_moisture = df['Min_Moisture_In_Park'].groupby((df['Year'], df['Week'])).mean().to_frame().reset_index()
min_weekly_moisture.columns = ['Year', 'Week', 'Min_Weekly_Moisture']
df = df.merge(min_weekly_moisture, how = 'left')

Testing Set

# Gathering the daily averages of predictors

# Wind
avg_daily_breeze = df_test['Average_Breeze_Speed'].groupby(df_test['Date']).mean().to_frame().reset_index()
avg_daily_breeze.columns = ['Date', 'Avg_Daily_Breeze']
df_test = df_test.merge(avg_daily_breeze, how = 'left')

max_daily_breeze = df_test['Max_Breeze_Speed'].groupby(df_test['Date']).mean().to_frame().reset_index()
max_daily_breeze.columns = ['Date', 'Max_Daily_Breeze']
df_test = df_test.merge(max_daily_breeze, how = 'left')

min_daily_breeze = df_test['Min_Breeze_Speed'].groupby(df_test['Date']).mean().to_frame().reset_index()
min_daily_breeze.columns = ['Date', 'Min_Daily_Breeze']
df_test = df_test.merge(min_daily_breeze, how = 'left')


# Var1
var1_daily = df_test['Var1'].groupby(df_test['Date']).mean().to_frame().reset_index()
var1_daily.columns = ['Date', 'Var1_Daily']
df_test = df_test.merge(var1_daily, how = 'left')


# Atmosphere & Pollution
avg_daily_atmo = df_test['Average_Atmospheric_Pressure'].groupby(df_test['Date']).mean().to_frame().reset_index()
avg_daily_atmo.columns = ['Date', 'Avg_Daily_Atmosphere']
df_test = df_test.merge(avg_daily_atmo, how = 'left')
                        
max_daily_atmo = df_test['Max_Atmospheric_Pressure'].groupby(df_test['Date']).mean().to_frame().reset_index()
max_daily_atmo.columns = ['Date', 'Max_Daily_Atmosphere']
df_test = df_test.merge(max_daily_atmo, how = 'left')
                        
min_daily_atmo = df_test['Min_Atmospheric_Pressure'].groupby(df_test['Date']).mean().to_frame().reset_index()
min_daily_atmo.columns = ['Date', 'Min_Daily_Atmosphere']
df_test = df_test.merge(min_daily_atmo, how = 'left')
                        
max_daily_pollution = df_test['Max_Ambient_Pollution'].groupby(df_test['Date']).mean().to_frame().reset_index()
max_daily_pollution.columns = ['Date', 'Max_Daily_Pollution']
df_test = df_test.merge(max_daily_pollution, how = 'left')
                        
min_daily_pollution = df_test['Min_Ambient_Pollution'].groupby(df_test['Date']).mean().to_frame().reset_index()
min_daily_pollution.columns = ['Date', 'Min_Daily_Pollution']
df_test = df_test.merge(min_daily_pollution, how = 'left')


# Moisture
avg_daily_moisture = df_test['Average_Moisture_In_Park'].groupby(df_test['Date']).mean().to_frame().reset_index()
avg_daily_moisture.columns = ['Date', 'Avg_Daily_moisture']
df_test = df_test.merge(avg_daily_moisture, how = 'left')
                        
max_daily_moisture = df_test['Max_Moisture_In_Park'].groupby(df_test['Date']).mean().to_frame().reset_index()
max_daily_moisture.columns = ['Date', 'Max_Daily_moisture']
df_test = df_test.merge(max_daily_moisture, how = 'left')
                        
min_daily_moisture = df_test['Min_Moisture_In_Park'].groupby(df_test['Date']).mean().to_frame().reset_index()
min_daily_moisture.columns = ['Date', 'Min_Daily_moisture']
df_test = df_test.merge(min_daily_moisture, how = 'left')
# Repeating with weekly averages of predictors

# Wind
avg_weekly_breeze = df_test['Average_Breeze_Speed'].groupby((df_test['Year'], df_test['Week'])).mean().to_frame().reset_index()
avg_weekly_breeze.columns = ['Year', 'Week', 'Avg_Weekly_Breeze']
df_test = df_test.merge(avg_weekly_breeze, how = 'left')

max_weekly_breeze = df_test['Max_Breeze_Speed'].groupby((df_test['Year'], df_test['Week'])).mean().to_frame().reset_index()
max_weekly_breeze.columns = ['Year', 'Week', 'Max_Weekly_Breeze']
df_test = df_test.merge(max_weekly_breeze, how = 'left')

min_weekly_breeze = df_test['Min_Breeze_Speed'].groupby((df_test['Year'], df_test['Week'])).mean().to_frame().reset_index()
min_weekly_breeze.columns = ['Year', 'Week', 'Min_Weekly_Breeze']
df_test = df_test.merge(min_weekly_breeze, how = 'left')


# Var 1
var1_weekly = df_test['Var1'].groupby((df_test['Year'], df_test['Week'])).mean().to_frame().reset_index()
var1_weekly.columns = ['Year', 'Week', 'Var1_Weekly']
df_test = df_test.merge(var1_weekly, how = 'left')


# Atmosphere & Pollution
avg_weekly_atmo = df_test['Average_Atmospheric_Pressure'].groupby((df_test['Year'], df_test['Week'])).mean().to_frame().reset_index()
avg_weekly_atmo.columns = ['Year', 'Week', 'Avg_Weekly_Atmosphere']
df_test = df_test.merge(avg_weekly_atmo, how = 'left')

max_weekly_atmo = df_test['Max_Atmospheric_Pressure'].groupby((df_test['Year'], df_test['Week'])).mean().to_frame().reset_index()
max_weekly_atmo.columns = ['Year', 'Week', 'Max_Weekly_Atmosphere']
df_test = df_test.merge(max_weekly_atmo, how = 'left')

min_weekly_atmo = df_test['Min_Atmospheric_Pressure'].groupby((df_test['Year'], df_test['Week'])).mean().to_frame().reset_index()
min_weekly_atmo.columns = ['Year', 'Week', 'Min_Weekly_Atmosphere']
df_test = df_test.merge(min_weekly_atmo, how = 'left')

max_weekly_pollution = df_test['Max_Ambient_Pollution'].groupby((df_test['Year'], df_test['Week'])).mean().to_frame().reset_index()
max_weekly_pollution.columns = ['Year', 'Week', 'Max_Weekly_Pollution']
df_test = df_test.merge(max_weekly_pollution, how = 'left')

min_weekly_pollution = df_test['Min_Ambient_Pollution'].groupby((df_test['Year'], df_test['Week'])).mean().to_frame().reset_index()
min_weekly_pollution.columns = ['Year', 'Week', 'Min_Weekly_Pollution']
df_test = df_test.merge(min_weekly_pollution, how = 'left')


# Moisture
avg_weekly_moisture = df_test['Average_Moisture_In_Park'].groupby((df_test['Year'], df_test['Week'])).mean().to_frame().reset_index()
avg_weekly_moisture.columns = ['Year', 'Week', 'Avg_Weekly_Moisture']
df_test = df_test.merge(avg_weekly_moisture, how = 'left')

max_weekly_moisture = df_test['Max_Moisture_In_Park'].groupby((df_test['Year'], df_test['Week'])).mean().to_frame().reset_index()
max_weekly_moisture.columns = ['Year', 'Week', 'Max_Weekly_Moisture']
df_test = df_test.merge(max_weekly_moisture, how = 'left')

min_weekly_moisture = df_test['Min_Moisture_In_Park'].groupby((df_test['Year'], df_test['Week'])).mean().to_frame().reset_index()
min_weekly_moisture.columns = ['Year', 'Week', 'Min_Weekly_Moisture']
df_test = df_test.merge(min_weekly_moisture, how = 'left')
df_full = df.append(df_test)

Handling Missing Values

  • Using random forests for all missing value prediction for imputation
  • For values with missing average, minimum, and maximum values, will first predict the average, then use that in predicting the minimum, then use both in predicting the maximum.

This section is relatively lengthy, and I used alot of copying/pasting. There are better ways to handle this for something that would be used for production, but it got the job done for this application.

Average Atmospheric Pressure

X = df_full[df_full['Average_Atmospheric_Pressure'].isnull() == False].drop(['ID', 'Footfall', 'Date', 'Year', 'Average_Atmospheric_Pressure'
                                                              ,'Max_Atmospheric_Pressure'
                                                              ,'Min_Atmospheric_Pressure'
                                                              ,'Min_Ambient_Pollution'
                                                              ,'Max_Ambient_Pollution'], axis = 1)

imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
X = imp.fit_transform(X)

y = df_full['Average_Atmospheric_Pressure'].dropna()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30)

rfr_avg_atmosphere = RandomForestRegressor(n_estimators = 150, n_jobs = 4)
rfr_avg_atmosphere.fit(X_train, y_train)
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=150, n_jobs=4, oob_score=False, random_state=None,
           verbose=0, warm_start=False)
np.mean(cross_val_score(rfr_avg_atmosphere, X_test, y_test))
0.99347389326468905
# Predicting the missing values and filling them in the dataframe
# Training data
X_avg_atmosphere = df[df['Average_Atmospheric_Pressure'].isnull() == True].drop(['ID', 'Footfall', 'Date', 'Year', 'Average_Atmospheric_Pressure'
                                                              ,'Max_Atmospheric_Pressure'
                                                              ,'Min_Atmospheric_Pressure'
                                                              ,'Min_Ambient_Pollution'
                                                              ,'Max_Ambient_Pollution'], axis = 1)

X_avg_atmosphere = imp.fit_transform(X_avg_atmosphere)

avg_atmosphere_prediction = rfr_avg_atmosphere.predict(X_avg_atmosphere)
avg_atmosphere_prediction = pd.DataFrame({'ID':df.ix[(df['Average_Atmospheric_Pressure'].isnull() == True)]['ID']
                                          ,'avg_atmo_predict':avg_atmosphere_prediction})

df = df.merge(avg_atmosphere_prediction, how = 'left', on = 'ID')

df.Average_Atmospheric_Pressure.fillna(df.avg_atmo_predict, inplace=True)
del df['avg_atmo_predict']
# Predicting the missing values and filling them in the dataframe
# Test data
X_avg_atmosphere = df_test[df_test['Average_Atmospheric_Pressure'].isnull() == True].drop(['ID', 'Date', 'Average_Atmospheric_Pressure', 'Year'
                                                              ,'Max_Atmospheric_Pressure'
                                                              ,'Min_Atmospheric_Pressure'
                                                              ,'Min_Ambient_Pollution'
                                                              ,'Max_Ambient_Pollution'], axis = 1)

X_avg_atmosphere = imp.fit_transform(X_avg_atmosphere)

avg_atmosphere_prediction = rfr_avg_atmosphere.predict(X_avg_atmosphere)
avg_atmosphere_prediction = pd.DataFrame({'ID':df_test.ix[(df_test['Average_Atmospheric_Pressure'].isnull() == True)]['ID']
                                          ,'avg_atmo_predict':avg_atmosphere_prediction})

df_test = df_test.merge(avg_atmosphere_prediction, how = 'left', on = 'ID')

df_test.Average_Atmospheric_Pressure.fillna(df_test.avg_atmo_predict, inplace=True)
del df_test['avg_atmo_predict']

Max Atmospheric Pressure

X = df_full[df_full['Max_Atmospheric_Pressure'].isnull() == False].drop(['ID', 'Footfall', 'Date', 'Year'
                                                              ,'Max_Atmospheric_Pressure'
                                                              ,'Min_Atmospheric_Pressure'
                                                              ,'Min_Ambient_Pollution'
                                                              ,'Max_Ambient_Pollution'], axis = 1)

X = imp.fit_transform(X)

y = df_full['Max_Atmospheric_Pressure'].dropna()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30)

rfr_max_atmosphere = RandomForestRegressor(n_estimators = 150, n_jobs = 4)
rfr_max_atmosphere.fit(X_train, y_train)
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=150, n_jobs=4, oob_score=False, random_state=None,
           verbose=0, warm_start=False)
np.mean(cross_val_score(rfr_max_atmosphere, X_test, y_test))
0.9954164885643183
# Predicting the missing values and filling them in the dataframe
# Training data
X_max_atmosphere = df[df['Max_Atmospheric_Pressure'].isnull() == True].drop(['ID', 'Footfall', 'Date', 'Year'
                                                              ,'Max_Atmospheric_Pressure'
                                                              ,'Min_Atmospheric_Pressure'
                                                              ,'Min_Ambient_Pollution'
                                                              ,'Max_Ambient_Pollution'], axis = 1)

X_max_atmosphere = imp.fit_transform(X_max_atmosphere)

max_atmosphere_prediction = rfr_max_atmosphere.predict(X_max_atmosphere)
max_atmosphere_prediction = pd.DataFrame({'ID':df.ix[(df['Max_Atmospheric_Pressure'].isnull() == True)]['ID']
                                          ,'max_atmo_predict':max_atmosphere_prediction})

df = df.merge(max_atmosphere_prediction, how = 'left', on = 'ID')

df.Max_Atmospheric_Pressure.fillna(df.max_atmo_predict, inplace=True)
del df['max_atmo_predict']
# Predicting the missing values and filling them in the dataframe
# Test data
X_max_atmosphere = df_test[df_test['Max_Atmospheric_Pressure'].isnull() == True].drop(['ID', 'Date', 'Year'
                                                              ,'Max_Atmospheric_Pressure'
                                                              ,'Min_Atmospheric_Pressure'
                                                              ,'Min_Ambient_Pollution'
                                                              ,'Max_Ambient_Pollution'], axis = 1)

X_max_atmosphere = imp.fit_transform(X_max_atmosphere)

max_atmosphere_prediction = rfr_max_atmosphere.predict(X_max_atmosphere)
max_atmosphere_prediction = pd.DataFrame({'ID':df_test.ix[(df_test['Max_Atmospheric_Pressure'].isnull() == True)]['ID']
                                          ,'max_atmo_predict':max_atmosphere_prediction})

df_test = df_test.merge(max_atmosphere_prediction, how = 'left', on = 'ID')

df_test.Max_Atmospheric_Pressure.fillna(df_test.max_atmo_predict, inplace=True)
del df_test['max_atmo_predict']

Min Atmospheric Pressure

X = df_full[df_full['Min_Atmospheric_Pressure'].isnull() == False].drop(['ID', 'Footfall', 'Date', 'Year'
                                                              ,'Min_Atmospheric_Pressure'
                                                              ,'Max_Ambient_Pollution'
                                                              ,'Min_Ambient_Pollution'], axis = 1)

X = imp.fit_transform(X)

y = df_full['Min_Atmospheric_Pressure'].dropna()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30)

rfr_min_atmosphere = RandomForestRegressor(n_estimators = 150, n_jobs = 4)
rfr_min_atmosphere.fit(X_train, y_train)
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=150, n_jobs=4, oob_score=False, random_state=None,
           verbose=0, warm_start=False)
np.mean(cross_val_score(rfr_min_atmosphere, X_test, y_test))
0.99499363701433363
# Predicting the missing values and filling them in the dataframe
# Training data
X_min_atmosphere = df[df['Min_Atmospheric_Pressure'].isnull() == True].drop(['ID', 'Footfall', 'Date', 'Year'
                                                              ,'Min_Atmospheric_Pressure'
                                                              ,'Min_Ambient_Pollution'
                                                              ,'Max_Ambient_Pollution'], axis = 1)

X_min_atmosphere = imp.fit_transform(X_min_atmosphere)

min_atmosphere_prediction = rfr_min_atmosphere.predict(X_min_atmosphere)
min_atmosphere_prediction = pd.DataFrame({'ID':df.ix[(df['Min_Atmospheric_Pressure'].isnull() == True)]['ID']
                                          ,'min_atmo_predict':min_atmosphere_prediction})

df = df.merge(min_atmosphere_prediction, how = 'left', on = 'ID')

df.Min_Atmospheric_Pressure.fillna(df.min_atmo_predict, inplace=True)
del df['min_atmo_predict']
# Predicting the missing values and filling them in the dataframe
# Test data
X_min_atmosphere = df_test[df_test['Min_Atmospheric_Pressure'].isnull() == True].drop(['ID', 'Date', 'Year'
                                                              ,'Min_Atmospheric_Pressure'
                                                              ,'Min_Ambient_Pollution'
                                                              ,'Max_Ambient_Pollution'], axis = 1)

X_min_atmosphere = imp.fit_transform(X_min_atmosphere)

min_atmosphere_prediction = rfr_min_atmosphere.predict(X_min_atmosphere)
min_atmosphere_prediction = pd.DataFrame({'ID':df_test.ix[(df_test['Min_Atmospheric_Pressure'].isnull() == True)]['ID']
                                          ,'min_atmo_predict':min_atmosphere_prediction})

df_test = df_test.merge(min_atmosphere_prediction, how = 'left', on = 'ID')

df_test.Min_Atmospheric_Pressure.fillna(df_test.min_atmo_predict, inplace=True)
del df_test['min_atmo_predict']

Max Ambient Pollution

X = df_full[df_full['Max_Ambient_Pollution'].isnull() == False].drop(['ID', 'Footfall', 'Date', 'Year'
                                                              ,'Min_Ambient_Pollution'
                                                              ,'Max_Ambient_Pollution'], axis = 1)

X = imp.fit_transform(X)

y = df_full['Max_Ambient_Pollution'].dropna()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30)

rfr_max_pollution = RandomForestRegressor(n_estimators = 150, n_jobs = 4)
rfr_max_pollution.fit(X_train, y_train)
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=150, n_jobs=4, oob_score=False, random_state=None,
           verbose=0, warm_start=False)
np.mean(cross_val_score(rfr_max_pollution, X_test, y_test))
0.80003476426720599
# Predicting the missing values and filling them in the dataframe
# Training data
X_max_pollution = df[df['Max_Ambient_Pollution'].isnull() == True].drop(['ID', 'Footfall', 'Date', 'Year'
                                                              ,'Min_Ambient_Pollution'
                                                              ,'Max_Ambient_Pollution'], axis = 1)

X_max_pollution = imp.fit_transform(X_max_pollution)

max_pollution_prediction = rfr_max_pollution.predict(X_max_pollution)
max_pollution_prediction = pd.DataFrame({'ID':df.ix[(df['Max_Ambient_Pollution'].isnull() == True)]['ID']
                                          ,'max_pollution_predict':max_pollution_prediction})

df = df.merge(max_pollution_prediction, how = 'left', on = 'ID')

df.Max_Ambient_Pollution.fillna(df.max_pollution_predict, inplace=True)
del df['max_pollution_predict']
# Predicting the missing values and filling them in the dataframe
# Testing data
X_max_pollution = df_test[df_test['Max_Ambient_Pollution'].isnull() == True].drop(['ID', 'Date', 'Year'
                                                              ,'Min_Ambient_Pollution'
                                                              ,'Max_Ambient_Pollution'], axis = 1)

X_max_pollution = imp.fit_transform(X_max_pollution)

max_pollution_prediction = rfr_max_pollution.predict(X_max_pollution)
max_pollution_prediction = pd.DataFrame({'ID':df_test.ix[(df_test['Max_Ambient_Pollution'].isnull() == True)]['ID']
                                          ,'max_pollution_predict':max_pollution_prediction})

df_test = df_test.merge(max_pollution_prediction, how = 'left', on = 'ID')

df_test.Max_Ambient_Pollution.fillna(df_test.max_pollution_predict, inplace=True)
del df_test['max_pollution_predict']

Min Ambient Pollution

X = df_full[df_full['Min_Ambient_Pollution'].isnull() == False].drop(['ID', 'Footfall', 'Date', 'Year'
                                                              ,'Min_Ambient_Pollution'], axis = 1)

X = imp.fit_transform(X)

y = df_full['Min_Ambient_Pollution'].dropna()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30)

rfr_min_pollution = RandomForestRegressor(n_estimators = 150, n_jobs = 4)
rfr_min_pollution.fit(X_train, y_train)
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=150, n_jobs=4, oob_score=False, random_state=None,
           verbose=0, warm_start=False)
np.mean(cross_val_score(rfr_min_pollution, X_test, y_test))
0.75850550015000273
# Predicting the missing values and filling them in the dataframe
# Training data
X_min_pollution = df[df['Min_Ambient_Pollution'].isnull() == True].drop(['ID', 'Footfall', 'Date', 'Year'
                                                              ,'Min_Ambient_Pollution'], axis = 1)

X_min_pollution = imp.fit_transform(X_min_pollution)

min_pollution_prediction = rfr_min_pollution.predict(X_min_pollution)
min_pollution_prediction = pd.DataFrame({'ID':df.ix[(df['Min_Ambient_Pollution'].isnull() == True)]['ID']
                                          ,'min_pollution_predict':min_pollution_prediction})

df = df.merge(min_pollution_prediction, how = 'left', on = 'ID')

df.Min_Ambient_Pollution.fillna(df.min_pollution_predict, inplace=True)
del df['min_pollution_predict']
# Predicting the missing values and filling them in the dataframe
# Testing data
X_min_pollution = df_test[df_test['Min_Ambient_Pollution'].isnull() == True].drop(['ID', 'Date', 'Year'
                                                              ,'Min_Ambient_Pollution'], axis = 1)

X_min_pollution = imp.fit_transform(X_min_pollution)

min_pollution_prediction = rfr_min_pollution.predict(X_min_pollution)
min_pollution_prediction = pd.DataFrame({'ID':df_test.ix[(df_test['Min_Ambient_Pollution'].isnull() == True)]['ID']
                                          ,'min_pollution_predict':min_pollution_prediction})

df_test = df_test.merge(min_pollution_prediction, how = 'left', on = 'ID')

df_test.Min_Ambient_Pollution.fillna(df_test.min_pollution_predict, inplace=True)
del df_test['min_pollution_predict']

Average Breeze Speed

X = df_full[df_full['Average_Breeze_Speed'].isnull() == False].drop(['ID', 'Footfall', 'Date', 'Year'
                                                              ,'Average_Breeze_Speed'
                                                              , 'Max_Breeze_Speed'
                                                              , 'Min_Breeze_Speed'
                                                              , 'Direction_Of_Wind'], axis = 1)

X = imp.fit_transform(X)

y = df_full['Average_Breeze_Speed'].dropna()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30)

rfr_avg_breeze = RandomForestRegressor(n_estimators = 150, n_jobs = 4)
rfr_avg_breeze.fit(X_train, y_train)
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=150, n_jobs=4, oob_score=False, random_state=None,
           verbose=0, warm_start=False)
np.mean(cross_val_score(rfr_avg_breeze, X_test, y_test))
0.91409857146873208
# Predicting the missing values and filling them in the dataframe
# Training data
X_avg_breeze = df[df['Average_Breeze_Speed'].isnull() == True].drop(['ID', 'Footfall', 'Date', 'Year'
                                                              ,'Average_Breeze_Speed'
                                                              , 'Max_Breeze_Speed'
                                                              , 'Min_Breeze_Speed'
                                                              , 'Direction_Of_Wind'], axis = 1)

X_avg_breeze = imp.fit_transform(X_avg_breeze)

avg_breeze_prediction = rfr_avg_breeze.predict(X_avg_breeze)
avg_breeze_prediction = pd.DataFrame({'ID':df.ix[(df['Average_Breeze_Speed'].isnull() == True)]['ID']
                                          ,'avg_breeze_predict':avg_breeze_prediction})

df = df.merge(avg_breeze_prediction, how = 'left', on = 'ID')

df.Average_Breeze_Speed.fillna(df.avg_breeze_predict, inplace=True)
del df['avg_breeze_predict']
# Predicting the missing values and filling them in the dataframe
# Testing data
X_avg_breeze = df_test[df_test['Average_Breeze_Speed'].isnull() == True].drop(['ID', 'Date', 'Year'
                                                              ,'Average_Breeze_Speed'
                                                              , 'Max_Breeze_Speed'
                                                              , 'Min_Breeze_Speed'
                                                              , 'Direction_Of_Wind'], axis = 1)

X_avg_breeze = imp.fit_transform(X_avg_breeze)

avg_breeze_prediction = rfr_avg_breeze.predict(X_avg_breeze)
avg_breeze_prediction = pd.DataFrame({'ID':df_test.ix[(df_test['Average_Breeze_Speed'].isnull() == True)]['ID']
                                          ,'avg_breeze_predict':avg_breeze_prediction})

df_test = df_test.merge(avg_breeze_prediction, how = 'left', on = 'ID')

df_test.Average_Breeze_Speed.fillna(df_test.avg_breeze_predict, inplace=True)
del df_test['avg_breeze_predict']

Max Breeze Speed

X = df_full[df_full['Max_Breeze_Speed'].isnull() == False].drop(['ID', 'Footfall', 'Date', 'Year'
                                                              , 'Max_Breeze_Speed'
                                                              , 'Min_Breeze_Speed'
                                                              , 'Direction_Of_Wind'], axis = 1)

X = imp.fit_transform(X)

y = df_full['Max_Breeze_Speed'].dropna()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30)

rfr_max_breeze = RandomForestRegressor(n_estimators = 150, n_jobs = 4)
rfr_max_breeze.fit(X_train, y_train)
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=150, n_jobs=4, oob_score=False, random_state=None,
           verbose=0, warm_start=False)
np.mean(cross_val_score(rfr_max_breeze, X_test, y_test))
0.92781368606209191
# Predicting the missing values and filling them in the dataframe
# Training data
X_max_breeze = df[df['Max_Breeze_Speed'].isnull() == True].drop(['ID', 'Footfall', 'Date', 'Year'
                                                              , 'Max_Breeze_Speed'
                                                              , 'Min_Breeze_Speed'
                                                              , 'Direction_Of_Wind'], axis = 1)

X_max_breeze = imp.fit_transform(X_max_breeze)

max_breeze_prediction = rfr_max_breeze.predict(X_max_breeze)
max_breeze_prediction = pd.DataFrame({'ID':df.ix[(df['Max_Breeze_Speed'].isnull() == True)]['ID']
                                          ,'max_breeze_predict':max_breeze_prediction})

df = df.merge(max_breeze_prediction, how = 'left', on = 'ID')

df.Max_Breeze_Speed.fillna(df.max_breeze_predict, inplace=True)
del df['max_breeze_predict']
# Predicting the missing values and filling them in the dataframe
# Testing data
X_max_breeze = df_test[df_test['Max_Breeze_Speed'].isnull() == True].drop(['ID', 'Date', 'Year'
                                                              , 'Max_Breeze_Speed'
                                                              , 'Min_Breeze_Speed'
                                                              , 'Direction_Of_Wind'], axis = 1)

X_max_breeze = imp.fit_transform(X_max_breeze)

max_breeze_prediction = rfr_max_breeze.predict(X_max_breeze)
max_breeze_prediction = pd.DataFrame({'ID':df_test.ix[(df_test['Max_Breeze_Speed'].isnull() == True)]['ID']
                                          ,'max_breeze_predict':max_breeze_prediction})

df_test = df_test.merge(max_breeze_prediction, how = 'left', on = 'ID')

df_test.Max_Breeze_Speed.fillna(df_test.max_breeze_predict, inplace=True)
del df_test['max_breeze_predict']

Min Breeze Speed

X = df_full[df_full['Min_Breeze_Speed'].isnull() == False].drop(['ID', 'Footfall', 'Date', 'Year'
                                                              , 'Min_Breeze_Speed'
                                                              , 'Direction_Of_Wind'], axis = 1)

X = imp.fit_transform(X)

y = df_full['Min_Breeze_Speed'].dropna()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30)

rfr_min_breeze = RandomForestRegressor(n_estimators = 150, n_jobs = 4)
rfr_min_breeze.fit(X_train, y_train)
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=150, n_jobs=4, oob_score=False, random_state=None,
           verbose=0, warm_start=False)
np.mean(cross_val_score(rfr_min_breeze, X_test, y_test))
0.88292032860270131
# Predicting the missing values and filling them in the dataframe
# Training data
X_min_breeze = df[df['Min_Breeze_Speed'].isnull() == True].drop(['ID', 'Footfall', 'Date', 'Year'
                                                              , 'Min_Breeze_Speed'
                                                              , 'Direction_Of_Wind'], axis = 1)

X_min_breeze = imp.fit_transform(X_min_breeze)

min_breeze_prediction = rfr_min_breeze.predict(X_min_breeze)
min_breeze_prediction = pd.DataFrame({'ID':df.ix[(df['Min_Breeze_Speed'].isnull() == True)]['ID']
                                          ,'min_breeze_predict':min_breeze_prediction})

df = df.merge(min_breeze_prediction, how = 'left', on = 'ID')

df.Min_Breeze_Speed.fillna(df.min_breeze_predict, inplace=True)
del df['min_breeze_predict']
# Predicting the missing values and filling them in the dataframe
# Testing data
X_min_breeze = df_test[df_test['Min_Breeze_Speed'].isnull() == True].drop(['ID', 'Date', 'Year'
                                                              , 'Min_Breeze_Speed'
                                                              , 'Direction_Of_Wind'], axis = 1)

X_min_breeze = imp.fit_transform(X_min_breeze)

min_breeze_prediction = rfr_min_breeze.predict(X_min_breeze)
min_breeze_prediction = pd.DataFrame({'ID':df_test.ix[(df_test['Min_Breeze_Speed'].isnull() == True)]['ID']
                                          ,'min_breeze_predict':min_breeze_prediction})

df_test = df_test.merge(min_breeze_prediction, how = 'left', on = 'ID')

df_test.Min_Breeze_Speed.fillna(df_test.min_breeze_predict, inplace=True)
del df_test['min_breeze_predict']

Average Moisture

79 missing values, causing high error on test set

X = df_full[df_full['Average_Moisture_In_Park'].isnull() == False].drop(['ID', 'Footfall', 'Date', 'Year'
                                                              , 'Average_Moisture_In_Park'
                                                              , 'Min_Moisture_In_Park'
                                                              , 'Max_Moisture_In_Park'
                                                              , 'Direction_Of_Wind'], axis = 1)

X = imp.fit_transform(X)

y = df_full['Average_Moisture_In_Park'].dropna()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30)

rfr_avg_moisture = RandomForestRegressor(n_estimators = 150, n_jobs = 4)
rfr_avg_moisture.fit(X_train, y_train)
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=150, n_jobs=4, oob_score=False, random_state=None,
           verbose=0, warm_start=False)
np.mean(cross_val_score(rfr_avg_moisture, X_test, y_test))
0.87629324767915051
# Predicting the missing values and filling them in the dataframe
# Training data
X_avg_moisture = df[df['Average_Moisture_In_Park'].isnull() == True].drop(['ID', 'Footfall', 'Date', 'Year'
                                                              , 'Average_Moisture_In_Park'
                                                              , 'Min_Moisture_In_Park'
                                                              , 'Max_Moisture_In_Park'
                                                              , 'Direction_Of_Wind'], axis = 1)

X_avg_moisture = imp.fit_transform(X_avg_moisture)

avg_moisture_prediction = rfr_avg_moisture.predict(X_avg_moisture)
avg_moisture_prediction = pd.DataFrame({'ID':df.ix[(df['Average_Moisture_In_Park'].isnull() == True)]['ID']
                                          ,'avg_moisture_predict':avg_moisture_prediction})

df = df.merge(avg_moisture_prediction, how = 'left', on = 'ID')

df.Average_Moisture_In_Park.fillna(df.avg_moisture_predict, inplace=True)
del df['avg_moisture_predict']
# Predicting the missing values and filling them in the dataframe
# Testing data
X_avg_moisture = df_test[df_test['Average_Moisture_In_Park'].isnull() == True].drop(['ID', 'Date', 'Year'
                                                              , 'Average_Moisture_In_Park'
                                                              , 'Min_Moisture_In_Park'
                                                              , 'Max_Moisture_In_Park'
                                                              , 'Direction_Of_Wind'], axis = 1)

X_avg_moisture = imp.fit_transform(X_avg_moisture)

avg_moisture_prediction = rfr_avg_moisture.predict(X_avg_moisture)
avg_moisture_prediction = pd.DataFrame({'ID':df_test.ix[(df_test['Average_Moisture_In_Park'].isnull() == True)]['ID']
                                          ,'avg_moisture_predict':avg_moisture_prediction})

df_test = df_test.merge(avg_moisture_prediction, how = 'left', on = 'ID')

df_test.Average_Moisture_In_Park.fillna(df_test.avg_moisture_predict, inplace=True)
del df_test['avg_moisture_predict']

Min Moisture

X = df_full[df_full['Min_Moisture_In_Park'].isnull() == False].drop(['ID', 'Footfall', 'Date', 'Year'
                                                              , 'Min_Moisture_In_Park'
                                                              , 'Max_Moisture_In_Park'
                                                              , 'Direction_Of_Wind'], axis = 1)

X = imp.fit_transform(X)

y = df_full['Min_Moisture_In_Park'].dropna()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30)

rfr_min_moisture = RandomForestRegressor(n_estimators = 150, n_jobs = 4)
rfr_min_moisture.fit(X_train, y_train)
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=150, n_jobs=4, oob_score=False, random_state=None,
           verbose=0, warm_start=False)
np.mean(cross_val_score(rfr_min_moisture, X_test, y_test))
0.93591464075660369
# Predicting the missing values and filling them in the dataframe
# Training data
X_min_moisture = df[df['Min_Moisture_In_Park'].isnull() == True].drop(['ID', 'Footfall', 'Date', 'Year'
                                                              , 'Min_Moisture_In_Park'
                                                              , 'Max_Moisture_In_Park'
                                                              , 'Direction_Of_Wind'], axis = 1)

X_min_moisture = imp.fit_transform(X_min_moisture)

min_moisture_prediction = rfr_min_moisture.predict(X_min_moisture)
min_moisture_prediction = pd.DataFrame({'ID':df.ix[(df['Min_Moisture_In_Park'].isnull() == True)]['ID']
                                          ,'min_moisture_predict':min_moisture_prediction})

df = df.merge(min_moisture_prediction, how = 'left', on = 'ID')

df.Min_Moisture_In_Park.fillna(df.min_moisture_predict, inplace=True)
del df['min_moisture_predict']
# Predicting the missing values and filling them in the dataframe
# Testing data
X_min_moisture = df_test[df_test['Min_Moisture_In_Park'].isnull() == True].drop(['ID', 'Date', 'Year'
                                                              , 'Min_Moisture_In_Park'
                                                              , 'Max_Moisture_In_Park'
                                                              , 'Direction_Of_Wind'], axis = 1)

X_min_moisture = imp.fit_transform(X_min_moisture)

min_moisture_prediction = rfr_min_moisture.predict(X_min_moisture)
min_moisture_prediction = pd.DataFrame({'ID':df_test.ix[(df_test['Min_Moisture_In_Park'].isnull() == True)]['ID']
                                          ,'min_moisture_predict':min_moisture_prediction})

df_test = df_test.merge(min_moisture_prediction, how = 'left', on = 'ID')

df_test.Min_Moisture_In_Park.fillna(df_test.min_moisture_predict, inplace=True)
del df_test['min_moisture_predict']

Max Moisture

X = df_full[df_full['Max_Moisture_In_Park'].isnull() == False].drop(['ID', 'Footfall', 'Date', 'Year'
                                                              , 'Max_Moisture_In_Park'
                                                              , 'Direction_Of_Wind'], axis = 1)

X = imp.fit_transform(X)

y = df_full['Max_Moisture_In_Park'].dropna()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30)

rfr_max_moisture = RandomForestRegressor(n_estimators = 150, n_jobs = 4)
rfr_max_moisture.fit(X_train, y_train)
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=150, n_jobs=4, oob_score=False, random_state=None,
           verbose=0, warm_start=False)
np.mean(cross_val_score(rfr_max_moisture, X_test, y_test))
0.8239425796574974
# Predicting the missing values and filling them in the dataframe
# Training data
X_max_moisture = df[df['Max_Moisture_In_Park'].isnull() == True].drop(['ID', 'Footfall', 'Date', 'Year'
                                                              , 'Max_Moisture_In_Park'
                                                              , 'Direction_Of_Wind'], axis = 1)

X_max_moisture = imp.fit_transform(X_max_moisture)

max_moisture_prediction = rfr_max_moisture.predict(X_max_moisture)
max_moisture_prediction = pd.DataFrame({'ID':df.ix[(df['Max_Moisture_In_Park'].isnull() == True)]['ID']
                                          ,'max_moisture_predict':max_moisture_prediction})

df = df.merge(max_moisture_prediction, how = 'left', on = 'ID')

df.Max_Moisture_In_Park.fillna(df.max_moisture_predict, inplace=True)
del df['max_moisture_predict']
# Predicting the missing values and filling them in the dataframe
# Testing data
X_max_moisture = df_test[df_test['Max_Moisture_In_Park'].isnull() == True].drop(['ID', 'Date', 'Year'
                                                              , 'Max_Moisture_In_Park'
                                                              , 'Direction_Of_Wind'], axis = 1)

X_max_moisture = imp.fit_transform(X_max_moisture)

max_moisture_prediction = rfr_max_moisture.predict(X_max_moisture)
max_moisture_prediction = pd.DataFrame({'ID':df_test.ix[(df_test['Max_Moisture_In_Park'].isnull() == True)]['ID']
                                          ,'max_moisture_predict':max_moisture_prediction})

df_test = df_test.merge(max_moisture_prediction, how = 'left', on = 'ID')

df_test.Max_Moisture_In_Park.fillna(df_test.max_moisture_predict, inplace=True)
del df_test['max_moisture_predict']

Var1

X = df_full[df_full['Var1'].isnull() == False].drop(['ID', 'Footfall', 'Date', 'Year'
                                                    , 'Var1'
                                                    ], axis = 1)

X = imp.fit_transform(X)

y = df_full['Var1'].dropna()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30)

rfr_var1 = RandomForestRegressor(n_estimators = 150, n_jobs = 4)
rfr_var1.fit(X_train, y_train)
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=150, n_jobs=4, oob_score=False, random_state=None,
           verbose=0, warm_start=False)
np.mean(cross_val_score(rfr_var1, X_test, y_test))
0.61193989909532365
# Predicting the missing values and filling them in the dataframe
# Training data
X_var1 = df[df['Var1'].isnull() == True].drop(['ID', 'Footfall', 'Date', 'Year'
                                                              , 'Var1'
                                                              ], axis = 1)

X_var1 = imp.fit_transform(X_var1)

var1_prediction = rfr_var1.predict(X_var1)
var1_prediction = pd.DataFrame({'ID':df.ix[(df['Var1'].isnull() == True)]['ID']
                                          ,'var1_predict':var1_prediction})

df = df.merge(var1_prediction, how = 'left', on = 'ID')

df.Var1.fillna(df.var1_predict, inplace=True)
del df['var1_predict']
# Predicting the missing values and filling them in the dataframe
# Testing data
X_var1 = df_test[df_test['Var1'].isnull() == True].drop(['ID', 'Date', 'Year'
                                                              , 'Var1'
                                                              ], axis = 1)

X_var1 = imp.fit_transform(X_var1)

var1_prediction = rfr_var1.predict(X_var1)
var1_prediction = pd.DataFrame({'ID':df_test.ix[(df_test['Var1'].isnull() == True)]['ID']
                                          ,'var1_predict':var1_prediction})

df_test = df_test.merge(var1_prediction, how = 'left', on = 'ID')

df_test.Var1.fillna(df_test.var1_predict, inplace=True)
del df_test['var1_predict']

Direction of Wind

X = df_full[df_full['Direction_Of_Wind'].isnull() == False].drop(['ID', 'Footfall', 'Date', 'Year'
                                                    , 'Direction_Of_Wind'
                                                    ], axis = 1)

X = imp.fit_transform(X)

y = df_full['Direction_Of_Wind'].dropna()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30)

rfr_wind_dir = RandomForestRegressor(n_estimators = 150, n_jobs = 4)
rfr_wind_dir.fit(X_train, y_train)
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=150, n_jobs=4, oob_score=False, random_state=None,
           verbose=0, warm_start=False)
np.mean(cross_val_score(rfr_wind_dir, X_test, y_test))
0.67839725370662685
# Predicting the missing values and filling them in the dataframe
# Training data
X_wind_dir = df[df['Direction_Of_Wind'].isnull() == True].drop(['ID', 'Footfall', 'Date', 'Year'
                                                              , 'Direction_Of_Wind'
                                                              ], axis = 1)

X_wind_dir = imp.fit_transform(X_wind_dir)

wind_dir_prediction = rfr_wind_dir.predict(X_wind_dir)
wind_dir_prediction = pd.DataFrame({'ID':df.ix[(df['Direction_Of_Wind'].isnull() == True)]['ID']
                                          ,'wind_dir_predict':wind_dir_prediction})

df = df.merge(wind_dir_prediction, how = 'left', on = 'ID')

df.Direction_Of_Wind.fillna(df.wind_dir_predict, inplace=True)
del df['wind_dir_predict']
# Predicting the missing values and filling them in the dataframe
# Testing data
X_wind_dir = df_test[df_test['Direction_Of_Wind'].isnull() == True].drop(['ID', 'Date', 'Year'
                                                              , 'Direction_Of_Wind'
                                                              ], axis = 1)

X_wind_dir = imp.fit_transform(X_wind_dir)

wind_dir_prediction = rfr_wind_dir.predict(X_wind_dir)
wind_dir_prediction = pd.DataFrame({'ID':df_test.ix[(df_test['Direction_Of_Wind'].isnull() == True)]['ID']
                                          ,'wind_dir_predict':wind_dir_prediction})

df_test = df_test.merge(wind_dir_prediction, how = 'left', on = 'ID')

df_test.Direction_Of_Wind.fillna(df_test.wind_dir_predict, inplace=True)
del df_test['wind_dir_predict']

Checking for all missing values being accounted for:

df_missing_check = df.append(df_test)
df_missing_check = df_missing_check.sort_values(['Date', 'Park_ID'], ascending=[1, 1])
msno.matrix(df_missing_check)

Model Building

Gradient Boosted Trees

  • Max depth of 5 is most effective

  • Outperformed both random forests and AdaBoost

X = df.drop(['ID', 'Footfall', 'Date', 'Year'
             , 'Location_Type', 'Average_Atmospheric_Pressure', 'Max_Atmospheric_Pressure'
             , 'Var1', 'Max_Ambient_Pollution', 'Min_Atmospheric_Pressure'
             , 'Max_Breeze_Speed', 'Min_Breeze_Speed', 'Min_Ambient_Pollution'
             , 'Max_Moisture_In_Park'
            ], axis = 1)
imp = Imputer(missing_values='NaN', strategy='median', axis=0)
X = imp.fit_transform(X)

y = df['Footfall']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30)
gbr = GradientBoostingRegressor(n_estimators = 300
                               , max_depth = 5
                               )
gbr.fit(X_train, y_train)
GradientBoostingRegressor(alpha=0.9, init=None, learning_rate=0.1, loss='ls',
             max_depth=5, max_features=None, max_leaf_nodes=None,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=300,
             presort='auto', random_state=None, subsample=1.0, verbose=0,
             warm_start=False)
# Regular cross validation
np.mean(cross_val_score(gbr, X_test, y_test, n_jobs = 3))
0.95762687239052224
# K-fold cross validation

k_fold = KFold(len(y), n_folds=10, shuffle=True, random_state=0)
cross_val_score(gbr, X, y, cv=k_fold, n_jobs=3)
array([ 0.96254909,  0.96338802,  0.96328219,  0.96139247,  0.96212329,
        0.9647377 ,  0.96253464,  0.96374171,  0.96351884,  0.96228093])

Examining the distance of predictions from the actual, and looking for common characteristics among the parts with the biggest differences.

# Plot of error over time
y_pred = gbr.predict(X_test)

cv_error = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred, 'Error': (y_test - y_pred)})
cv_error = pd.merge(df, cv_error, left_index=True, right_index=True)

error_plot = cv_error[['Date', 'Error']]
error_plot = error_plot.set_index('Date')
error_plot.plot(figsize = (20,10))
<matplotlib.axes._subplots.AxesSubplot at 0x1774e13a518>

cv_error.sort_values('Error').head()
ID Park_ID Date Direction_Of_Wind Average_Breeze_Speed Max_Breeze_Speed Min_Breeze_Speed Var1 Average_Atmospheric_Pressure Max_Atmospheric_Pressure ... Max_Weekly_Atmosphere Min_Weekly_Atmosphere Max_Weekly_Pollution Min_Weekly_Pollution Avg_Weekly_Moisture Max_Weekly_Moisture Min_Weekly_Moisture Actual Error Predicted
95947 3686224 24 2000-02-12 178.000000 16.720000 38.000000 0.000000 1.66 8300.000000 8317.000000 ... 8339.769841 8302.928571 318.951724 153.875862 246.730159 284.904762 200.650794 528 -250.086489 778.086489
22191 3388817 17 1992-11-10 138.026667 78.072267 97.077333 58.773333 0.00 8005.033333 8043.626667 ... 8285.809524 8245.206349 291.910448 153.671642 254.158163 278.739796 225.091837 1011 -235.739930 1246.739930
18749 3388419 19 1992-07-10 83.000000 31.160000 45.600000 15.200000 0.00 8341.000000 8355.000000 ... 8341.722222 8287.119048 324.059701 160.686567 240.734694 281.693878 186.000000 1035 -228.127802 1263.127802
18751 3388421 21 1992-07-10 81.000000 27.360000 45.600000 15.200000 0.00 8348.000000 8358.000000 ... 8341.722222 8287.119048 324.059701 160.686567 240.734694 281.693878 186.000000 999 -227.231848 1226.231848
22247 3394917 17 1992-11-12 138.166667 78.330667 97.330667 59.026667 0.00 8005.033333 8043.626667 ... 8285.809524 8245.206349 291.910448 153.671642 254.158163 278.739796 225.091837 652 -222.944937 874.944937

5 rows × 50 columns

cv_error.sort_values('Error').tail()
ID Park_ID Date Direction_Of_Wind Average_Breeze_Speed Max_Breeze_Speed Min_Breeze_Speed Var1 Average_Atmospheric_Pressure Max_Atmospheric_Pressure ... Max_Weekly_Atmosphere Min_Weekly_Atmosphere Max_Weekly_Pollution Min_Weekly_Pollution Avg_Weekly_Moisture Max_Weekly_Moisture Min_Weekly_Moisture Actual Error Predicted
69981 3562439 39 1997-07-13 204.000000 25.080000 45.600000 0.000000 0.000000 8000.260000 8042.993333 ... 8361.619048 8293.492063 315.517730 194.496454 250.790816 287.770408 208.040816 1703 182.894997 1520.105003
3429 3335939 39 1991-01-05 20.000000 36.480000 53.200000 22.800000 0.000000 8000.400000 8042.313333 ... 8362.590278 8287.125000 299.838926 111.087248 265.745455 290.850000 225.940909 1323 184.154605 1138.845395
26230 3403224 24 1993-04-03 159.000000 26.600000 38.000000 15.200000 0.000000 8341.000000 8358.000000 ... 8381.301587 8309.087302 303.657143 147.142857 252.214286 287.311224 201.443878 1230 195.827119 1034.172881
11167 3352417 17 1991-10-13 140.313333 78.072267 97.026667 59.077333 0.000000 8005.033333 8044.040000 ... 8408.166667 8359.476190 309.082707 173.593985 252.872449 284.448980 204.811224 1536 203.969139 1332.030861
48807 3486833 33 1995-06-18 67.000000 21.280000 38.000000 0.000000 192.631933 8420.000000 8441.000000 ... 8362.158730 8313.317460 317.285714 182.400000 232.301020 282.260204 176.326531 1663 206.940381 1456.059619

5 rows × 50 columns

Viewing variable importance - used to discard unnecessary variables

# Plotting feature importance

gbr_labels = df_test.drop(['ID', 'Date', 'Year'
             , 'Location_Type', 'Average_Atmospheric_Pressure', 'Max_Atmospheric_Pressure'
             , 'Var1', 'Max_Ambient_Pollution', 'Min_Atmospheric_Pressure'
             , 'Max_Breeze_Speed', 'Min_Breeze_Speed', 'Min_Ambient_Pollution'
             , 'Max_Moisture_In_Park'
                            ], axis = 1)

feature_importance = gbr.feature_importances_

feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5

plt.figure(figsize=(10,15))
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, gbr_labels.columns[sorted_idx])
plt.xlabel('Relative Importance')
plt.title('Variable Importance')
plt.show()

Outputting Predictions

# Predicting for submission
X_submission = df_test.drop(['ID', 'Date', 'Year'
             , 'Location_Type', 'Average_Atmospheric_Pressure', 'Max_Atmospheric_Pressure'
             , 'Var1', 'Max_Ambient_Pollution', 'Min_Atmospheric_Pressure'
             , 'Max_Breeze_Speed', 'Min_Breeze_Speed', 'Min_Ambient_Pollution'
             , 'Max_Moisture_In_Park'
                            ], axis = 1)
X_submission = imp.fit_transform(X_submission)
y_submission = gbr.predict(X_submission)
# Exporting results
df_submission = pd.DataFrame(y_submission, index = df_test['ID'])
df_submission.columns = ['Footfall']
df_submission.to_csv('test.csv')
# Saving train/test with imputed null values
df.to_csv('train_imputed.csv')
df_test.to_csv('test_imputed.csv')

Categories:

Updated: