Competition: The Ultimate Student Hunt

Hackathon - The Ultimate Student Hunt

Analytics Vidhya hosted a student-only hackathon on predicting the number of visitors to parks from a fixed set of variables. This was my first data competition, and it proved to be a huge learning experience.

Here’s a brief overview of the data from the contest rules listed on the website:

ID: Unique ID
Park_ID: Unique ID for Parks
Date: Calendar Date
Direction_Of_Wind: Direction of winds in degrees
Average_Breeze_Speed: Daily average Breeze speed
Max_Breeze_Speed: Daily maximum Breeze speed
Min_Breeze_Speed: Daily minimum Breeze speed
Var1: A continuous feature
Average_Atmospheric_Pressure: Daily average atmospheric pressure
Max_Atmospheric_Pressure: Daily maximum atmospheric pressure
Min_Atmospheric_Pressure: Daily minimum atmospheric pressure
Min_Ambient_Pollution: Daily minimum Ambient pollution
Max_Ambient_Pollution: Daily maximum Ambient pollution
Average_Moisture_In_Park: Daily average moisture
Max_Moisture_In_Park: Daily maximum moisture
Min_Moisture_In_Park: Daily minimum moisture
Location_Type: Location Type (1/2/3/4)
Footfall: The target variable, daily Footfall

Summary

This problem involved predicting the number of visitors (footfall) to parks on a given day under given conditions, which ultimately makes it a time series problem. The organizers provided ten years of data as a training set and five years as a test set. This notebook is an annotated version of my final submission, which ranked 13th on the leaderboard. As a side note, the hackathon was only open for nine days, so there is a lot of room for improvement in this notebook.

My process for this hackathon was as follows:

Initial Exploration: All data projects should begin with an initial exploration to understand the data itself. I initially used the pandas profiling package, but excluded it from my final submission since it generates a lengthy report. I left both a quick df.describe() and df.head() to showcase summary statistics and an example of the data.
Outliers: I created boxplots to look for outliers visually. This can be done mathematically when you are more familiar with the data (using methods such as interquartile ranges), but the distributions of the variables produced a significant amount of data points that would’ve been considered outliers with this methodology.
Missing Values: I first sorted the dataframe by date and park ID, then used the msno package to visually examine missing values. After seeing fairly regular trends of missing values, I plotted histograms of the missing values by park IDs to see if I could fill them by linearly interpolating. After seeing that certain park IDs were completely missing some values, I built random forest models to impute them. This is a brute-force method that is CPU intensive, but was a trade-off for the limited time frame.
Feature Engineering: This was almost nonexistent in this competition due to the anonymity of the data. I used daily and weekly averages of the individual variables in both the final model and the missing value imputation models.
Model Building: I started by building three models: random forests, gradient boosted trees, and AdaBoost. The gradient boosted trees model outperformed the other two, so I kept it and dropped the others.
Hyperparameter Tuning: This was my first time using gradient boosted trees, so I took a trial-and-error approach by adjusting various parameters and running them through cross validation to see how differently they performed. I found that just adjusting the number of trees and max depth obtained the best results in this situation.
Validation: I used both holdout validation and k-fold cross-validation (with 10 folds) to check for overfitting. The hackathon also had a “solution checker” that scored your predicted values on the first two years of the test set. Since the final competition score was computed on the full five years, it was very important not to overfit to the checker, so I used its score in combination with the cross-validation results.
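
As a rough illustration of the tuning and validation steps above, here is a minimal sketch on synthetic data. It is not the competition code: the data, the small parameter grid, and the names `X_demo`, `y_demo`, `results`, and `holdout_r2` are all assumptions made for the example, and it uses the current `sklearn.model_selection` module.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic stand-in data (assumption): 300 rows, 5 features
rng = np.random.RandomState(0)
X_demo = rng.rand(300, 5)
y_demo = 10 * X_demo[:, 0] + 5 * X_demo[:, 1] ** 2 + rng.normal(0, 0.1, 300)

# Keep a holdout split aside for a final check
X_train, X_hold, y_train, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0)

# Hypothetical grid over the two parameters mentioned above
kfold = KFold(n_splits=10, shuffle=True, random_state=0)
results = {}
for n_estimators in (50, 100):
    for max_depth in (2, 3):
        model = GradientBoostingRegressor(
            n_estimators=n_estimators, max_depth=max_depth, random_state=0)
        scores = cross_val_score(model, X_train, y_train, cv=kfold)
        results[(n_estimators, max_depth)] = scores.mean()

# Refit the best setting on the training split and score it on the holdout
best_params = max(results, key=results.get)
best_model = GradientBoostingRegressor(
    n_estimators=best_params[0], max_depth=best_params[1], random_state=0)
best_model.fit(X_train, y_train)
holdout_r2 = best_model.score(X_hold, y_hold)
```

Because the real data is a time series, a time-ordered split such as `TimeSeriesSplit` may give a more honest estimate than shuffled folds. That is a suggestion, not what was done in this submission.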

Here is the annotated code for my final submission:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
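# Note: written against an older scikit-learn. On current releases these two
# imports fail: the cross-validation helpers now live in sklearn.model_selection,
# and Imputer has been replaced by sklearn.impute.SimpleImputer.
from sklearn.cross_validation import cross_val_score, train_test_split, KFold
from sklearn.preprocessing import Imputer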

# For exploratory data analysis
import missingno as msno  # Visualizes missing values

%matplotlib inline

df = pd.read_csv('crime.csv')  # Training set - ignore the name

df_test = pd.read_csv('Test_pyI9Owa.csv')  # Testing set

Exploratory Data Analysis

df.describe()

	ID	Park_ID	Direction_Of_Wind	Average_Breeze_Speed	Max_Breeze_Speed	Min_Breeze_Speed	Var1	Average_Atmospheric_Pressure	Max_Atmospheric_Pressure	Min_Atmospheric_Pressure	Min_Ambient_Pollution	Max_Ambient_Pollution	Average_Moisture_In_Park	Max_Moisture_In_Park	Min_Moisture_In_Park	Location_Type	Footfall
count	1.145390e+05	114539.000000	110608.000000	110608.000000	110603.000000	110605.000000	106257.000000	74344.000000	74344.000000	74344.000000	82894.000000	82894.000000	114499.000000	114499.000000	114499.000000	114539.000000	114539.000000
mean	3.517595e+06	25.582596	179.587146	34.255340	51.704297	17.282553	18.802545	8331.545949	8356.053468	8305.692510	162.806138	306.555698	248.008970	283.917082	202.355331	2.630720	1204.217192
std	1.189083e+05	8.090592	85.362934	17.440065	22.068301	14.421844	38.269851	80.943971	76.032983	87.172258	90.869627	38.188020	28.898084	15.637930	46.365728	0.967435	248.385651
min	3.311712e+06	12.000000	1.000000	3.040000	7.600000	0.000000	0.000000	7982.000000	8037.000000	7890.000000	4.000000	8.000000	102.000000	141.000000	48.000000	1.000000	310.000000
25%	3.414820e+06	18.000000	111.000000	22.040000	38.000000	7.600000	0.000000	8283.000000	8311.000000	8252.000000	80.000000	288.000000	231.000000	279.000000	171.000000	2.000000	1026.000000
50%	3.517039e+06	26.000000	196.000000	30.400000	45.600000	15.200000	0.830000	8335.000000	8358.000000	8311.000000	180.000000	316.000000	252.000000	288.000000	207.000000	3.000000	1216.000000
75%	3.619624e+06	33.000000	239.000000	42.560000	60.800000	22.800000	21.580000	8382.000000	8406.000000	8362.000000	244.000000	336.000000	270.000000	294.000000	237.000000	3.000000	1402.000000
max	3.725639e+06	39.000000	360.000000	154.280000	212.800000	129.200000	1181.090000	8588.000000	8601.000000	8571.000000	348.000000	356.000000	300.000000	300.000000	300.000000	4.000000	1925.000000

df.head()

	ID	Park_ID	Date	Direction_Of_Wind	Average_Breeze_Speed	Max_Breeze_Speed	Min_Breeze_Speed	Var1	Average_Atmospheric_Pressure	Max_Atmospheric_Pressure	Min_Atmospheric_Pressure	Min_Ambient_Pollution	Max_Ambient_Pollution	Average_Moisture_In_Park	Max_Moisture_In_Park	Min_Moisture_In_Park	Location_Type	Footfall
0	3311712	12	01-09-1990	194.0	37.24	60.8	15.2	92.1300	8225.0	8259.0	8211.0	92.0	304.0	255.0	288.0	222.0	3	1406
1	3311812	12	02-09-1990	285.0	32.68	60.8	7.6	14.1100	8232.0	8280.0	8205.0	172.0	332.0	252.0	297.0	204.0	3	1409
2	3311912	12	03-09-1990	319.0	43.32	60.8	15.2	35.6900	8321.0	8355.0	8283.0	236.0	292.0	219.0	279.0	165.0	3	1386
3	3312012	12	04-09-1990	297.0	25.84	38.0	7.6	0.0249	8379.0	8396.0	8358.0	272.0	324.0	225.0	261.0	192.0	3	1365
4	3312112	12	05-09-1990	207.0	28.88	45.6	7.6	0.8300	8372.0	8393.0	8335.0	236.0	332.0	234.0	273.0	183.0	3	1413

Outliers

We’ll do our outlier detection visually with box plots. Rather than determining outliers mathematically (such as using the interquartile range), we’ll simply look for any points that aren’t contiguous.

df_box = df.drop(['ID', 'Park_ID', 'Average_Atmospheric_Pressure', 'Max_Atmospheric_Pressure'
                  , 'Min_Atmospheric_Pressure', 'Footfall', 'Date'], axis = 1)
plt.figure(figsize = (20,10))
sns.boxplot(data=df_box)

[Figure: box plots of the predictors other than atmospheric pressure]

Var1 appears to have outliers, but since the variable is undefined, it is difficult to tell whether these are genuine anomalies or noisy/incorrect data. We’ll leave them for now.
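
For comparison, here is a minimal sketch of the interquartile-range rule mentioned earlier. Using the Var1 quartiles from df.describe() above (25%: 0.0, 75%: 21.58), the upper fence would be 21.58 + 1.5 × 21.58 = 53.95, well below the maximum of 1181.09, which is why this rule flags so many points on skewed columns. The function name `iqr_outlier_mask` and the sample values are made up for the illustration.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Return a boolean mask marking values outside the Tukey fences."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (series < lower) | (series > upper)

# Hypothetical skewed sample: mostly small values with a long right tail
demo = pd.Series([0.0, 0.0, 0.0, 0.0, 0.5, 1.0, 2.0, 20.0, 90.0, 1000.0])
mask = iqr_outlier_mask(demo)
n_flagged = int(mask.sum())  # number of values outside the fences
```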

df_box = df[['Average_Atmospheric_Pressure', 'Max_Atmospheric_Pressure'
                  , 'Min_Atmospheric_Pressure']]
plt.figure(figsize = (20,10))
sns.boxplot(data=df_box)

[Figure: box plots of average, maximum, and minimum atmospheric pressure]

Max atmospheric pressure (and, as a result, average atmospheric pressure) has a few non-contiguous values, but they don’t seem egregious enough to deal with for the time being.

# Converting date field to datetime and extracting date components
# The dates look like DD-MM-YYYY (see df.head() above), so parse them day-first
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)

df['Year'] = pd.DatetimeIndex(df['Date']).year
df['Month'] = pd.DatetimeIndex(df['Date']).month
df['Day'] = pd.DatetimeIndex(df['Date']).day
df['Week'] = pd.DatetimeIndex(df['Date']).week
df['WeekDay'] = pd.DatetimeIndex(df['Date']).dayofweek


# Repeating for the test set (assumed to use the same date format)
df_test['Date'] = pd.to_datetime(df_test['Date'], dayfirst=True)

df_test['Year'] = pd.DatetimeIndex(df_test['Date']).year
df_test['Month'] = pd.DatetimeIndex(df_test['Date']).month
df_test['Day'] = pd.DatetimeIndex(df_test['Date']).day
df_test['Week'] = pd.DatetimeIndex(df_test['Date']).week
df_test['WeekDay'] = pd.DatetimeIndex(df_test['Date']).dayofweek


# Lastly, combining to use for building models to predict missing predictors
df_full = pd.concat([df, df_test])
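
The Date strings shown in df.head() look like day-month-year (consecutive rows run 01-09-1990, 02-09-1990, 03-09-1990), so it may be worth being explicit about the format when parsing. The sketch below assumes that format and uses the `.dt` accessor to extract the components. The sample strings and the names `raw_dates`, `parsed`, and `parts` are illustrative only.

```python
import pandas as pd

# Sample strings in the same shape as the Date column shown in df.head()
raw_dates = pd.Series(['01-09-1990', '02-09-1990', '03-09-1990'])

# Explicit format: day first, then month, then four-digit year
parsed = pd.to_datetime(raw_dates, format='%d-%m-%Y')

# Extract components with the .dt accessor
parts = pd.DataFrame({
    'Year': parsed.dt.year,
    'Month': parsed.dt.month,
    'Day': parsed.dt.day,
    'WeekDay': parsed.dt.dayofweek,  # Monday = 0
})
```

With an explicit format, every row is read as a September date. Without it, a string such as 02-09-1990 can be read month-first as 9 February.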

Missing Values

Since this is ultimately a time series problem, we’ll begin by sorting the values. Then we’ll use missingno, a handy package for visualizing missing values.

# Sorting by date and park
df = df.sort_values(['Date', 'Park_ID'], ascending=[1, 1])
df_full = df_full.sort_values(['Date', 'Park_ID'], ascending=[1, 1])

# Visualizing missing values
msno.matrix(df_full)

# Checking which Park IDs missing values occur in
plt.subplot(221)
df_full[df_full['Direction_Of_Wind'].isnull() == True]['Park_ID'].hist()
plt.subplot(222)
df_full[df_full['Average_Breeze_Speed'].isnull() == True]['Park_ID'].hist()
plt.subplot(223)
df_full[df_full['Max_Breeze_Speed'].isnull() == True]['Park_ID'].hist()
plt.subplot(224)
df_full[df_full['Min_Breeze_Speed'].isnull() == True]['Park_ID'].hist()

[Figure: histograms of Park_ID for rows missing wind direction and average, maximum, and minimum breeze speed]

df_full[df_full['Var1'].isnull() == True]['Park_ID'].hist()
plt.title('Var1 Missing Park IDs')

[Figure: histogram of Park_ID for rows missing Var1]

plt.subplot(221)
df_full[df_full['Average_Atmospheric_Pressure'].isnull() == True]['Park_ID'].hist()
plt.subplot(222)
df_full[df_full['Max_Atmospheric_Pressure'].isnull() == True]['Park_ID'].hist()
plt.subplot(223)
df_full[df_full['Min_Atmospheric_Pressure'].isnull() == True]['Park_ID'].hist()

[Figure: histograms of Park_ID for rows missing average, maximum, and minimum atmospheric pressure]

plt.subplot(221)
df_full[df_full['Max_Ambient_Pollution'].isnull() == True]['Park_ID'].hist()
plt.subplot(222)
df_full[df_full['Min_Ambient_Pollution'].isnull() == True]['Park_ID'].hist()

[Figure: histograms of Park_ID for rows missing maximum and minimum ambient pollution]

We can see here that most missing values recur in the same parks, and some parks are missing certain variables entirely. This means we can’t interpolate our missing values, and filling with the mean/median/mode is too coarse, so we should build models to predict them.
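
To see why interpolation falls short here, consider a toy example (hypothetical values, not the competition data) in which one park has a single gap and another has no readings at all. Interpolating within each park fills the first but leaves the second untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: park 12 has one gap, park 13 has no readings at all
toy = pd.DataFrame({
    'Park_ID': [12, 12, 12, 13, 13, 13],
    'Pressure': [8200.0, np.nan, 8300.0, np.nan, np.nan, np.nan],
})

# Interpolate within each park separately
toy['Pressure_interp'] = (
    toy.groupby('Park_ID')['Pressure']
       .transform(lambda s: s.interpolate(method='linear'))
)

# Count the values that are still missing per park
still_missing = toy['Pressure_interp'].isnull().groupby(toy['Park_ID']).sum()
```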

The missingno package has a heatmap that shows the co-occurrence of missing values, which will be helpful in determining how to construct our models.

# Co-occurrence of missing values
msno.heatmap(df)
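
The heatmap is based on nullity correlation, meaning how strongly the absence of one column goes together with the absence of another. A rough pandas-only equivalent, on made-up data with hypothetical column names, might look like this:

```python
import numpy as np
import pandas as pd

# Made-up data: the two pressure columns are always missing together
toy = pd.DataFrame({
    'Avg_Pressure': [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    'Max_Pressure': [1.5, np.nan, 3.5, np.nan, 5.5, 6.5],
    'Var1': [0.1, 0.2, np.nan, 0.4, 0.5, np.nan],
})

# 1 where a value is missing, 0 where it is present
nullity = toy.isnull().astype(int)

# Pairwise correlation of the missingness indicators
nullity_corr = nullity.corr()
```

Columns that are always missing together score 1, which suggests they cannot be used to predict each other's gaps and need to be imputed from other variables.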

Feature Engineering

Daily & Weekly Averages

Before building models to predict the missing values, we’ll begin with calculating daily and weekly averages across all parks to assist with our predictions.

There is a lot of repetition here due to re-running the same code for both the training and testing set.

Training Set

# Gathering the daily averages of predictors

# Wind
avg_daily_breeze = df['Average_Breeze_Speed'].groupby(df['Date']).mean().to_frame().reset_index()  # Group by day
avg_daily_breeze.columns = ['Date', 'Avg_Daily_Breeze']  # Renaming the columns for the join
df = df.merge(avg_daily_breeze, how = 'left')  # Joining onto the original dataframe

max_daily_breeze = df['Max_Breeze_Speed'].groupby(df['Date']).mean().to_frame().reset_index()
max_daily_breeze.columns = ['Date', 'Max_Daily_Breeze']
df = df.merge(max_daily_breeze, how = 'left')

min_daily_breeze = df['Min_Breeze_Speed'].groupby(df['Date']).mean().to_frame().reset_index()
min_daily_breeze.columns = ['Date', 'Min_Daily_Breeze']
df = df.merge(min_daily_breeze, how = 'left')


# Var1
var1_daily = df['Var1'].groupby(df['Date']).mean().to_frame().reset_index()
var1_daily.columns = ['Date', 'Var1_Daily']
df = df.merge(var1_daily, how = 'left')


# Atmosphere & Pollution
avg_daily_atmo = df['Average_Atmospheric_Pressure'].groupby(df['Date']).mean().to_frame().reset_index()
avg_daily_atmo.columns = ['Date', 'Avg_Daily_Atmosphere']
df = df.merge(avg_daily_atmo, how = 'left')

max_daily_atmo = df['Max_Atmospheric_Pressure'].groupby(df['Date']).mean().to_frame().reset_index()
max_daily_atmo.columns = ['Date', 'Max_Daily_Atmosphere']
df = df.merge(max_daily_atmo, how = 'left')

min_daily_atmo = df['Min_Atmospheric_Pressure'].groupby(df['Date']).mean().to_frame().reset_index()
min_daily_atmo.columns = ['Date', 'Min_Daily_Atmosphere']
df = df.merge(min_daily_atmo, how = 'left')

max_daily_pollution = df['Max_Ambient_Pollution'].groupby(df['Date']).mean().to_frame().reset_index()
max_daily_pollution.columns = ['Date', 'Max_Daily_Pollution']
df = df.merge(max_daily_pollution, how = 'left')

min_daily_pollution = df['Min_Ambient_Pollution'].groupby(df['Date']).mean().to_frame().reset_index()
min_daily_pollution.columns = ['Date', 'Min_Daily_Pollution']
df = df.merge(min_daily_pollution, how = 'left')


# Moisture
avg_daily_moisture = df['Average_Moisture_In_Park'].groupby(df['Date']).mean().to_frame().reset_index()
avg_daily_moisture.columns = ['Date', 'Avg_Daily_moisture']
df = df.merge(avg_daily_moisture, how = 'left')

max_daily_moisture = df['Max_Moisture_In_Park'].groupby(df['Date']).mean().to_frame().reset_index()
max_daily_moisture.columns = ['Date', 'Max_Daily_moisture']
df = df.merge(max_daily_moisture, how = 'left')

min_daily_moisture = df['Min_Moisture_In_Park'].groupby(df['Date']).mean().to_frame().reset_index()
min_daily_moisture.columns = ['Date', 'Min_Daily_moisture']
df = df.merge(min_daily_moisture, how = 'left')

# Repeating with weekly averages of predictors

# Wind
avg_weekly_breeze = df['Average_Breeze_Speed'].groupby((df['Year'], df['Week'])).mean().to_frame().reset_index()
avg_weekly_breeze.columns = ['Year', 'Week', 'Avg_Weekly_Breeze']
df = df.merge(avg_weekly_breeze, how = 'left')

max_weekly_breeze = df['Max_Breeze_Speed'].groupby((df['Year'], df['Week'])).mean().to_frame().reset_index()
max_weekly_breeze.columns = ['Year', 'Week', 'Max_Weekly_Breeze']
df = df.merge(max_weekly_breeze, how = 'left')

min_weekly_breeze = df['Min_Breeze_Speed'].groupby((df['Year'], df['Week'])).mean().to_frame().reset_index()
min_weekly_breeze.columns = ['Year', 'Week', 'Min_Weekly_Breeze']
df = df.merge(min_weekly_breeze, how = 'left')


# Var 1
var1_weekly = df['Var1'].groupby((df['Year'], df['Week'])).mean().to_frame().reset_index()
var1_weekly.columns = ['Year', 'Week', 'Var1_Weekly']
df = df.merge(var1_weekly, how = 'left')


# Atmosphere & Pollution
avg_weekly_atmo = df['Average_Atmospheric_Pressure'].groupby((df['Year'], df['Week'])).mean().to_frame().reset_index()
avg_weekly_atmo.columns = ['Year', 'Week', 'Avg_Weekly_Atmosphere']
df = df.merge(avg_weekly_atmo, how = 'left')

max_weekly_atmo = df['Max_Atmospheric_Pressure'].groupby((df['Year'], df['Week'])).mean().to_frame().reset_index()
max_weekly_atmo.columns = ['Year', 'Week', 'Max_Weekly_Atmosphere']
df = df.merge(max_weekly_atmo, how = 'left')

min_weekly_atmo = df['Min_Atmospheric_Pressure'].groupby((df['Year'], df['Week'])).mean().to_frame().reset_index()
min_weekly_atmo.columns = ['Year', 'Week', 'Min_Weekly_Atmosphere']
df = df.merge(min_weekly_atmo, how = 'left')

max_weekly_pollution = df['Max_Ambient_Pollution'].groupby((df['Year'], df['Week'])).mean().to_frame().reset_index()
max_weekly_pollution.columns = ['Year', 'Week', 'Max_Weekly_Pollution']
df = df.merge(max_weekly_pollution, how = 'left')

min_weekly_pollution = df['Min_Ambient_Pollution'].groupby((df['Year'], df['Week'])).mean().to_frame().reset_index()
min_weekly_pollution.columns = ['Year', 'Week', 'Min_Weekly_Pollution']
df = df.merge(min_weekly_pollution, how = 'left')


# Moisture
avg_weekly_moisture = df['Average_Moisture_In_Park'].groupby((df['Year'], df['Week'])).mean().to_frame().reset_index()
avg_weekly_moisture.columns = ['Year', 'Week', 'Avg_Weekly_Moisture']
df = df.merge(avg_weekly_moisture, how = 'left')

max_weekly_moisture = df['Max_Moisture_In_Park'].groupby((df['Year'], df['Week'])).mean().to_frame().reset_index()
max_weekly_moisture.columns = ['Year', 'Week', 'Max_Weekly_Moisture']
df = df.merge(max_weekly_moisture, how = 'left')

min_weekly_moisture = df['Min_Moisture_In_Park'].groupby((df['Year'], df['Week'])).mean().to_frame().reset_index()
min_weekly_moisture.columns = ['Year', 'Week', 'Min_Weekly_Moisture']
df = df.merge(min_weekly_moisture, how = 'left')

Testing Set

# Gathering the daily averages of predictors

# Wind
avg_daily_breeze = df_test['Average_Breeze_Speed'].groupby(df_test['Date']).mean().to_frame().reset_index()
avg_daily_breeze.columns = ['Date', 'Avg_Daily_Breeze']
df_test = df_test.merge(avg_daily_breeze, how = 'left')

max_daily_breeze = df_test['Max_Breeze_Speed'].groupby(df_test['Date']).mean().to_frame().reset_index()
max_daily_breeze.columns = ['Date', 'Max_Daily_Breeze']
df_test = df_test.merge(max_daily_breeze, how = 'left')

min_daily_breeze = df_test['Min_Breeze_Speed'].groupby(df_test['Date']).mean().to_frame().reset_index()
min_daily_breeze.columns = ['Date', 'Min_Daily_Breeze']
df_test = df_test.merge(min_daily_breeze, how = 'left')


# Var1
var1_daily = df_test['Var1'].groupby(df_test['Date']).mean().to_frame().reset_index()
var1_daily.columns = ['Date', 'Var1_Daily']
df_test = df_test.merge(var1_daily, how = 'left')


# Atmosphere & Pollution
avg_daily_atmo = df_test['Average_Atmospheric_Pressure'].groupby(df_test['Date']).mean().to_frame().reset_index()
avg_daily_atmo.columns = ['Date', 'Avg_Daily_Atmosphere']
df_test = df_test.merge(avg_daily_atmo, how = 'left')
                        
max_daily_atmo = df_test['Max_Atmospheric_Pressure'].groupby(df_test['Date']).mean().to_frame().reset_index()
max_daily_atmo.columns = ['Date', 'Max_Daily_Atmosphere']
df_test = df_test.merge(max_daily_atmo, how = 'left')
                        
min_daily_atmo = df_test['Min_Atmospheric_Pressure'].groupby(df_test['Date']).mean().to_frame().reset_index()
min_daily_atmo.columns = ['Date', 'Min_Daily_Atmosphere']
df_test = df_test.merge(min_daily_atmo, how = 'left')
                        
max_daily_pollution = df_test['Max_Ambient_Pollution'].groupby(df_test['Date']).mean().to_frame().reset_index()
max_daily_pollution.columns = ['Date', 'Max_Daily_Pollution']
df_test = df_test.merge(max_daily_pollution, how = 'left')
                        
min_daily_pollution = df_test['Min_Ambient_Pollution'].groupby(df_test['Date']).mean().to_frame().reset_index()
min_daily_pollution.columns = ['Date', 'Min_Daily_Pollution']
df_test = df_test.merge(min_daily_pollution, how = 'left')


# Moisture
avg_daily_moisture = df_test['Average_Moisture_In_Park'].groupby(df_test['Date']).mean().to_frame().reset_index()
avg_daily_moisture.columns = ['Date', 'Avg_Daily_moisture']
df_test = df_test.merge(avg_daily_moisture, how = 'left')
                        
max_daily_moisture = df_test['Max_Moisture_In_Park'].groupby(df_test['Date']).mean().to_frame().reset_index()
max_daily_moisture.columns = ['Date', 'Max_Daily_moisture']
df_test = df_test.merge(max_daily_moisture, how = 'left')
                        
min_daily_moisture = df_test['Min_Moisture_In_Park'].groupby(df_test['Date']).mean().to_frame().reset_index()
min_daily_moisture.columns = ['Date', 'Min_Daily_moisture']
df_test = df_test.merge(min_daily_moisture, how = 'left')

# Repeating with weekly averages of predictors

# Wind
avg_weekly_breeze = df_test['Average_Breeze_Speed'].groupby((df_test['Year'], df_test['Week'])).mean().to_frame().reset_index()
avg_weekly_breeze.columns = ['Year', 'Week', 'Avg_Weekly_Breeze']
df_test = df_test.merge(avg_weekly_breeze, how = 'left')

max_weekly_breeze = df_test['Max_Breeze_Speed'].groupby((df_test['Year'], df_test['Week'])).mean().to_frame().reset_index()
max_weekly_breeze.columns = ['Year', 'Week', 'Max_Weekly_Breeze']
df_test = df_test.merge(max_weekly_breeze, how = 'left')

min_weekly_breeze = df_test['Min_Breeze_Speed'].groupby((df_test['Year'], df_test['Week'])).mean().to_frame().reset_index()
min_weekly_breeze.columns = ['Year', 'Week', 'Min_Weekly_Breeze']
df_test = df_test.merge(min_weekly_breeze, how = 'left')


# Var 1
var1_weekly = df_test['Var1'].groupby((df_test['Year'], df_test['Week'])).mean().to_frame().reset_index()
var1_weekly.columns = ['Year', 'Week', 'Var1_Weekly']
df_test = df_test.merge(var1_weekly, how = 'left')


# Atmosphere & Pollution
avg_weekly_atmo = df_test['Average_Atmospheric_Pressure'].groupby((df_test['Year'], df_test['Week'])).mean().to_frame().reset_index()
avg_weekly_atmo.columns = ['Year', 'Week', 'Avg_Weekly_Atmosphere']
df_test = df_test.merge(avg_weekly_atmo, how = 'left')

max_weekly_atmo = df_test['Max_Atmospheric_Pressure'].groupby((df_test['Year'], df_test['Week'])).mean().to_frame().reset_index()
max_weekly_atmo.columns = ['Year', 'Week', 'Max_Weekly_Atmosphere']
df_test = df_test.merge(max_weekly_atmo, how = 'left')

min_weekly_atmo = df_test['Min_Atmospheric_Pressure'].groupby((df_test['Year'], df_test['Week'])).mean().to_frame().reset_index()
min_weekly_atmo.columns = ['Year', 'Week', 'Min_Weekly_Atmosphere']
df_test = df_test.merge(min_weekly_atmo, how = 'left')

max_weekly_pollution = df_test['Max_Ambient_Pollution'].groupby((df_test['Year'], df_test['Week'])).mean().to_frame().reset_index()
max_weekly_pollution.columns = ['Year', 'Week', 'Max_Weekly_Pollution']
df_test = df_test.merge(max_weekly_pollution, how = 'left')

min_weekly_pollution = df_test['Min_Ambient_Pollution'].groupby((df_test['Year'], df_test['Week'])).mean().to_frame().reset_index()
min_weekly_pollution.columns = ['Year', 'Week', 'Min_Weekly_Pollution']
df_test = df_test.merge(min_weekly_pollution, how = 'left')


# Moisture
avg_weekly_moisture = df_test['Average_Moisture_In_Park'].groupby((df_test['Year'], df_test['Week'])).mean().to_frame().reset_index()
avg_weekly_moisture.columns = ['Year', 'Week', 'Avg_Weekly_Moisture']
df_test = df_test.merge(avg_weekly_moisture, how = 'left')

max_weekly_moisture = df_test['Max_Moisture_In_Park'].groupby((df_test['Year'], df_test['Week'])).mean().to_frame().reset_index()
max_weekly_moisture.columns = ['Year', 'Week', 'Max_Weekly_Moisture']
df_test = df_test.merge(max_weekly_moisture, how = 'left')

min_weekly_moisture = df_test['Min_Moisture_In_Park'].groupby((df_test['Year'], df_test['Week'])).mean().to_frame().reset_index()
min_weekly_moisture.columns = ['Year', 'Week', 'Min_Weekly_Moisture']
df_test = df_test.merge(min_weekly_moisture, how = 'left')

# Rebuilding the combined frame so it includes the new average columns
df_full = pd.concat([df, df_test])
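
The repeated groupby-and-merge blocks above could be collapsed into a small helper. The sketch below is one possible way to do it, not the code used for the submission: `add_group_means` and `daily_map` are hypothetical names, and it relies on `transform('mean')`, which broadcasts each group's mean back onto its rows instead of merging a separate frame.

```python
import numpy as np
import pandas as pd

def add_group_means(frame, keys, column_map):
    """Add one group-mean column per (source, target) pair in column_map."""
    out = frame.copy()
    for source, target in column_map:
        # transform('mean') broadcasts each group's mean back to its rows
        out[target] = out.groupby(keys)[source].transform('mean')
    return out

# Toy data with one missing reading
toy = pd.DataFrame({
    'Date': pd.to_datetime(['1990-09-01', '1990-09-01', '1990-09-02', '1990-09-02']),
    'Park_ID': [12, 13, 12, 13],
    'Average_Breeze_Speed': [30.0, 40.0, 20.0, np.nan],
})

daily_map = [('Average_Breeze_Speed', 'Avg_Daily_Breeze')]
toy = add_group_means(toy, ['Date'], daily_map)
```

The same helper would cover the weekly averages by passing `['Year', 'Week']` as the keys, and the training and test sets by calling it once on each frame.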

Handling Missing Values

We’ll use random forests to predict all of the missing values for imputation.
For variables that have missing average, minimum, and maximum values, we’ll first predict the average, then use that to predict the maximum, then use both to predict the minimum.

This section is relatively lengthy, and I used a lot of copying and pasting. There are better ways to handle this for something that would be used in production, but it got the job done for this application.
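
As one example of a more compact approach, the pattern repeated in the sections below (fit on the rows where a column is present, then predict the rows where it is missing) could be wrapped in a helper. This is a sketch under assumptions rather than the submission code: `impute_with_forest` is a hypothetical name, the data is synthetic, and it uses `SimpleImputer` from current scikit-learn. Passing an explicit `features` list keeps the column order identical between fitting and prediction.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer

def impute_with_forest(frame, target, features, n_estimators=50, random_state=0):
    """Fill missing values of `target` with random forest predictions."""
    out = frame.copy()
    known = out[target].notnull()
    if known.all():
        return out  # nothing to fill

    # Fit the mean-imputer on rows where the target is known, then reuse it
    # (transform only) on the rows to be predicted
    feature_imputer = SimpleImputer(strategy='mean')
    X_known = feature_imputer.fit_transform(out.loc[known, features])
    X_missing = feature_imputer.transform(out.loc[~known, features])

    model = RandomForestRegressor(n_estimators=n_estimators,
                                  random_state=random_state)
    model.fit(X_known, out.loc[known, target])

    out.loc[~known, target] = model.predict(X_missing)
    return out

# Synthetic example (assumption): the target depends on two predictors
rng = np.random.RandomState(0)
toy = pd.DataFrame({'a': rng.rand(200), 'b': rng.rand(200)})
toy['target'] = 3 * toy['a'] + toy['b']
toy.loc[toy.index[:20], 'target'] = np.nan  # values to impute
toy.loc[toy.index[5:10], 'b'] = np.nan      # gaps in a predictor as well
filled = impute_with_forest(toy, 'target', ['a', 'b'])
```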

Average Atmospheric Pressure

X = df_full[df_full['Average_Atmospheric_Pressure'].isnull() == False].drop(['ID', 'Footfall', 'Date', 'Year', 'Average_Atmospheric_Pressure'
                                                              ,'Max_Atmospheric_Pressure'
                                                              ,'Min_Atmospheric_Pressure'
                                                              ,'Min_Ambient_Pollution'
                                                              ,'Max_Ambient_Pollution'], axis = 1)

imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
X = imp.fit_transform(X)

y = df_full['Average_Atmospheric_Pressure'].dropna()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30)

rfr_avg_atmosphere = RandomForestRegressor(n_estimators = 150, n_jobs = 4)
rfr_avg_atmosphere.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=150, n_jobs=4, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

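# cross_val_score clones and refits the model on folds of the held-out 30%,
# so this is the mean cross-validated R^2 on that subset
np.mean(cross_val_score(rfr_avg_atmosphere, X_test, y_test))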

0.99347389326468905

# Predicting the missing values and filling them in the dataframe
# Training data
X_avg_atmosphere = df[df['Average_Atmospheric_Pressure'].isnull() == True].drop(['ID', 'Footfall', 'Date', 'Year', 'Average_Atmospheric_Pressure'
                                                              ,'Max_Atmospheric_Pressure'
                                                              ,'Min_Atmospheric_Pressure'
                                                              ,'Min_Ambient_Pollution'
                                                              ,'Max_Ambient_Pollution'], axis = 1)

X_avg_atmosphere = imp.fit_transform(X_avg_atmosphere)

avg_atmosphere_prediction = rfr_avg_atmosphere.predict(X_avg_atmosphere)
avg_atmosphere_prediction = pd.DataFrame({'ID':df.loc[df['Average_Atmospheric_Pressure'].isnull(), 'ID']
                                          ,'avg_atmo_predict':avg_atmosphere_prediction})

df = df.merge(avg_atmosphere_prediction, how = 'left', on = 'ID')

df.Average_Atmospheric_Pressure.fillna(df.avg_atmo_predict, inplace=True)
del df['avg_atmo_predict']

# Predicting the missing values and filling them in the dataframe
# Test data
X_avg_atmosphere = df_test[df_test['Average_Atmospheric_Pressure'].isnull() == True].drop(['ID', 'Date', 'Average_Atmospheric_Pressure', 'Year'
                                                              ,'Max_Atmospheric_Pressure'
                                                              ,'Min_Atmospheric_Pressure'
                                                              ,'Min_Ambient_Pollution'
                                                              ,'Max_Ambient_Pollution'], axis = 1)

X_avg_atmosphere = imp.fit_transform(X_avg_atmosphere)

avg_atmosphere_prediction = rfr_avg_atmosphere.predict(X_avg_atmosphere)
avg_atmosphere_prediction = pd.DataFrame({'ID':df_test.loc[df_test['Average_Atmospheric_Pressure'].isnull(), 'ID']
                                          ,'avg_atmo_predict':avg_atmosphere_prediction})

df_test = df_test.merge(avg_atmosphere_prediction, how = 'left', on = 'ID')

df_test.Average_Atmospheric_Pressure.fillna(df_test.avg_atmo_predict, inplace=True)
del df_test['avg_atmo_predict']

Max Atmospheric Pressure

X = df_full[df_full['Max_Atmospheric_Pressure'].isnull() == False].drop(['ID', 'Footfall', 'Date', 'Year'
                                                              ,'Max_Atmospheric_Pressure'
                                                              ,'Min_Atmospheric_Pressure'
                                                              ,'Min_Ambient_Pollution'
                                                              ,'Max_Ambient_Pollution'], axis = 1)

X = imp.fit_transform(X)

y = df_full['Max_Atmospheric_Pressure'].dropna()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30)

rfr_max_atmosphere = RandomForestRegressor(n_estimators = 150, n_jobs = 4)
rfr_max_atmosphere.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=150, n_jobs=4, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

np.mean(cross_val_score(rfr_max_atmosphere, X_test, y_test))

0.9954164885643183

# Predicting the missing values and filling them in the dataframe
# Training data
X_max_atmosphere = df[df['Max_Atmospheric_Pressure'].isnull() == True].drop(['ID', 'Footfall', 'Date', 'Year'
                                                              ,'Max_Atmospheric_Pressure'
                                                              ,'Min_Atmospheric_Pressure'
                                                              ,'Min_Ambient_Pollution'
                                                              ,'Max_Ambient_Pollution'], axis = 1)

X_max_atmosphere = imp.fit_transform(X_max_atmosphere)

max_atmosphere_prediction = rfr_max_atmosphere.predict(X_max_atmosphere)
max_atmosphere_prediction = pd.DataFrame({'ID':df.loc[df['Max_Atmospheric_Pressure'].isnull(), 'ID']
                                          ,'max_atmo_predict':max_atmosphere_prediction})

df = df.merge(max_atmosphere_prediction, how = 'left', on = 'ID')

df.Max_Atmospheric_Pressure.fillna(df.max_atmo_predict, inplace=True)
del df['max_atmo_predict']

# Predicting the missing values and filling them in the dataframe
# Test data
X_max_atmosphere = df_test[df_test['Max_Atmospheric_Pressure'].isnull() == True].drop(['ID', 'Date', 'Year'
                                                              ,'Max_Atmospheric_Pressure'
                                                              ,'Min_Atmospheric_Pressure'
                                                              ,'Min_Ambient_Pollution'
                                                              ,'Max_Ambient_Pollution'], axis = 1)

X_max_atmosphere = imp.fit_transform(X_max_atmosphere)

max_atmosphere_prediction = rfr_max_atmosphere.predict(X_max_atmosphere)
max_atmosphere_prediction = pd.DataFrame({'ID':df_test.loc[df_test['Max_Atmospheric_Pressure'].isnull(), 'ID']
                                          ,'max_atmo_predict':max_atmosphere_prediction})

df_test = df_test.merge(max_atmosphere_prediction, how = 'left', on = 'ID')

df_test.Max_Atmospheric_Pressure.fillna(df_test.max_atmo_predict, inplace=True)
del df_test['max_atmo_predict']

Min Atmospheric Pressure

X = df_full[df_full['Min_Atmospheric_Pressure'].isnull() == False].drop(['ID', 'Footfall', 'Date', 'Year'
                                                              ,'Min_Atmospheric_Pressure'
                                                              ,'Max_Ambient_Pollution'
                                                              ,'Min_Ambient_Pollution'], axis = 1)

X = imp.fit_transform(X)

y = df_full['Min_Atmospheric_Pressure'].dropna()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30)

rfr_min_atmosphere = RandomForestRegressor(n_estimators = 150, n_jobs = 4)
rfr_min_atmosphere.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=150, n_jobs=4, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

np.mean(cross_val_score(rfr_min_atmosphere, X_test, y_test))

0.99499363701433363

# Predicting the missing values and filling them in the dataframe
# Training data
X_min_atmosphere = df[df['Min_Atmospheric_Pressure'].isnull() == True].drop(['ID', 'Footfall', 'Date', 'Year'
                                                              ,'Min_Atmospheric_Pressure'
                                                              ,'Min_Ambient_Pollution'
                                                              ,'Max_Ambient_Pollution'], axis = 1)

X_min_atmosphere = imp.fit_transform(X_min_atmosphere)

min_atmosphere_prediction = rfr_min_atmosphere.predict(X_min_atmosphere)
min_atmosphere_prediction = pd.DataFrame({'ID':df.loc[df['Min_Atmospheric_Pressure'].isnull(), 'ID']
                                          ,'min_atmo_predict':min_atmosphere_prediction})

df = df.merge(min_atmosphere_prediction, how = 'left', on = 'ID')

df.Min_Atmospheric_Pressure.fillna(df.min_atmo_predict, inplace=True)
del df['min_atmo_predict']

# Predicting the missing values and filling them in the dataframe
# Test data
X_min_atmosphere = df_test[df_test['Min_Atmospheric_Pressure'].isnull() == True].drop(['ID', 'Date', 'Year'
                                                              ,'Min_Atmospheric_Pressure'
                                                              ,'Min_Ambient_Pollution'
                                                              ,'Max_Ambient_Pollution'], axis = 1)

X_min_atmosphere = imp.fit_transform(X_min_atmosphere)

min_atmosphere_prediction = rfr_min_atmosphere.predict(X_min_atmosphere)
min_atmosphere_prediction = pd.DataFrame({'ID':df_test.loc[df_test['Min_Atmospheric_Pressure'].isnull(), 'ID']
                                          ,'min_atmo_predict':min_atmosphere_prediction})

df_test = df_test.merge(min_atmosphere_prediction, how = 'left', on = 'ID')

df_test.Min_Atmospheric_Pressure.fillna(df_test.min_atmo_predict, inplace=True)
del df_test['min_atmo_predict']
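
The same fit, predict, merge, and fill steps are repeated for every column in this section. As a minimal sketch, they could be folded into one helper. Everything here is an assumption for illustration rather than the notebook's code: impute_with_forest is a hypothetical name, the toy frame stands in for the park data, SimpleImputer from newer scikit-learn releases replaces the older Imputer, and the forest is trained on the same frame it fills instead of on df_full. The imputer is fit once on the rows where the target is known and reused for the rows being predicted, which differs from the repeated fit_transform calls above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer


def impute_with_forest(frame, target, drop_cols, n_estimators=150, random_state=0):
    """Fill missing values of `target` with random forest predictions.

    Hypothetical helper, not part of the original notebook.
    """
    frame = frame.copy()
    missing = frame[target].isnull()
    if not missing.any():
        return frame

    # Features: every column except the target and the ones to ignore
    features = frame.drop(columns=list(drop_cols) + [target], errors='ignore')

    # Fit the median imputer on rows where the target is known,
    # then reuse those medians for the rows being predicted
    imputer = SimpleImputer(strategy='median')
    X_known = imputer.fit_transform(features[~missing])
    X_missing = imputer.transform(features[missing])

    model = RandomForestRegressor(n_estimators=n_estimators, random_state=random_state)
    model.fit(X_known, frame.loc[~missing, target])

    # Write predictions straight into the missing cells, no merge needed
    frame.loc[missing, target] = model.predict(X_missing)
    return frame


# Toy data standing in for the park data
rng = np.random.RandomState(0)
toy = pd.DataFrame({'ID': np.arange(200),
                    'a': rng.normal(size=200),
                    'b': rng.normal(size=200)})
toy['c'] = 2 * toy['a'] - toy['b']
toy.loc[::10, 'c'] = np.nan  # knock out every tenth value

filled = impute_with_forest(toy, 'c', drop_cols=['ID'])
print(filled['c'].isnull().sum())
```

Assigning through .loc with the boolean mask keeps the row order and index intact, so the temporary prediction column and the merge on ID are not needed in this version.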

Max Ambient Pollution

X = df_full[df_full['Max_Ambient_Pollution'].isnull() == False].drop(['ID', 'Footfall', 'Date', 'Year'
                                                              ,'Min_Ambient_Pollution'
                                                              ,'Max_Ambient_Pollution'], axis = 1)

X = imp.fit_transform(X)

y = df_full['Max_Ambient_Pollution'].dropna()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30)

rfr_max_pollution = RandomForestRegressor(n_estimators = 150, n_jobs = 4)
rfr_max_pollution.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=150, n_jobs=4, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

np.mean(cross_val_score(rfr_max_pollution, X_test, y_test))

0.80003476426720599

# Predicting the missing values and filling them in the dataframe
# Training data
X_max_pollution = df[df['Max_Ambient_Pollution'].isnull() == True].drop(['ID', 'Footfall', 'Date', 'Year'
                                                              ,'Min_Ambient_Pollution'
                                                              ,'Max_Ambient_Pollution'], axis = 1)

X_max_pollution = imp.fit_transform(X_max_pollution)

max_pollution_prediction = rfr_max_pollution.predict(X_max_pollution)
max_pollution_prediction = pd.DataFrame({'ID':df.loc[df['Max_Ambient_Pollution'].isnull(), 'ID']
                                          ,'max_pollution_predict':max_pollution_prediction})

df = df.merge(max_pollution_prediction, how = 'left', on = 'ID')

df.Max_Ambient_Pollution.fillna(df.max_pollution_predict, inplace=True)
del df['max_pollution_predict']

# Predicting the missing values and filling them in the dataframe
# Testing data
X_max_pollution = df_test[df_test['Max_Ambient_Pollution'].isnull() == True].drop(['ID', 'Date', 'Year'
                                                              ,'Min_Ambient_Pollution'
                                                              ,'Max_Ambient_Pollution'], axis = 1)

X_max_pollution = imp.fit_transform(X_max_pollution)

max_pollution_prediction = rfr_max_pollution.predict(X_max_pollution)
max_pollution_prediction = pd.DataFrame({'ID':df_test.loc[df_test['Max_Ambient_Pollution'].isnull(), 'ID']
                                          ,'max_pollution_predict':max_pollution_prediction})

df_test = df_test.merge(max_pollution_prediction, how = 'left', on = 'ID')

df_test.Max_Ambient_Pollution.fillna(df_test.max_pollution_predict, inplace=True)
del df_test['max_pollution_predict']

Min Ambient Pollution

X = df_full[df_full['Min_Ambient_Pollution'].isnull() == False].drop(['ID', 'Footfall', 'Date', 'Year'
                                                              ,'Min_Ambient_Pollution'], axis = 1)

X = imp.fit_transform(X)

y = df_full['Min_Ambient_Pollution'].dropna()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30)

rfr_min_pollution = RandomForestRegressor(n_estimators = 150, n_jobs = 4)
rfr_min_pollution.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=150, n_jobs=4, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

np.mean(cross_val_score(rfr_min_pollution, X_test, y_test))

0.75850550015000273

# Predicting the missing values and filling them in the dataframe
# Training data
X_min_pollution = df[df['Min_Ambient_Pollution'].isnull() == True].drop(['ID', 'Footfall', 'Date', 'Year'
                                                              ,'Min_Ambient_Pollution'], axis = 1)

X_min_pollution = imp.fit_transform(X_min_pollution)

min_pollution_prediction = rfr_min_pollution.predict(X_min_pollution)
min_pollution_prediction = pd.DataFrame({'ID':df.loc[df['Min_Ambient_Pollution'].isnull(), 'ID']
                                          ,'min_pollution_predict':min_pollution_prediction})

df = df.merge(min_pollution_prediction, how = 'left', on = 'ID')

df.Min_Ambient_Pollution.fillna(df.min_pollution_predict, inplace=True)
del df['min_pollution_predict']

# Predicting the missing values and filling them in the dataframe
# Testing data
X_min_pollution = df_test[df_test['Min_Ambient_Pollution'].isnull() == True].drop(['ID', 'Date', 'Year'
                                                              ,'Min_Ambient_Pollution'], axis = 1)

X_min_pollution = imp.fit_transform(X_min_pollution)

min_pollution_prediction = rfr_min_pollution.predict(X_min_pollution)
min_pollution_prediction = pd.DataFrame({'ID':df_test.loc[df_test['Min_Ambient_Pollution'].isnull(), 'ID']
                                          ,'min_pollution_predict':min_pollution_prediction})

df_test = df_test.merge(min_pollution_prediction, how = 'left', on = 'ID')

df_test.Min_Ambient_Pollution.fillna(df_test.min_pollution_predict, inplace=True)
del df_test['min_pollution_predict']

Average Breeze Speed

X = df_full[df_full['Average_Breeze_Speed'].isnull() == False].drop(['ID', 'Footfall', 'Date', 'Year'
                                                              ,'Average_Breeze_Speed'
                                                              , 'Max_Breeze_Speed'
                                                              , 'Min_Breeze_Speed'
                                                              , 'Direction_Of_Wind'], axis = 1)

X = imp.fit_transform(X)

y = df_full['Average_Breeze_Speed'].dropna()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30)

rfr_avg_breeze = RandomForestRegressor(n_estimators = 150, n_jobs = 4)
rfr_avg_breeze.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=150, n_jobs=4, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

np.mean(cross_val_score(rfr_avg_breeze, X_test, y_test))

0.91409857146873208

# Predicting the missing values and filling them in the dataframe
# Training data
X_avg_breeze = df[df['Average_Breeze_Speed'].isnull() == True].drop(['ID', 'Footfall', 'Date', 'Year'
                                                              ,'Average_Breeze_Speed'
                                                              , 'Max_Breeze_Speed'
                                                              , 'Min_Breeze_Speed'
                                                              , 'Direction_Of_Wind'], axis = 1)

X_avg_breeze = imp.fit_transform(X_avg_breeze)

avg_breeze_prediction = rfr_avg_breeze.predict(X_avg_breeze)
avg_breeze_prediction = pd.DataFrame({'ID':df.loc[df['Average_Breeze_Speed'].isnull(), 'ID']
                                          ,'avg_breeze_predict':avg_breeze_prediction})

df = df.merge(avg_breeze_prediction, how = 'left', on = 'ID')

df.Average_Breeze_Speed.fillna(df.avg_breeze_predict, inplace=True)
del df['avg_breeze_predict']

# Predicting the missing values and filling them in the dataframe
# Testing data
X_avg_breeze = df_test[df_test['Average_Breeze_Speed'].isnull() == True].drop(['ID', 'Date', 'Year'
                                                              ,'Average_Breeze_Speed'
                                                              , 'Max_Breeze_Speed'
                                                              , 'Min_Breeze_Speed'
                                                              , 'Direction_Of_Wind'], axis = 1)

X_avg_breeze = imp.fit_transform(X_avg_breeze)

avg_breeze_prediction = rfr_avg_breeze.predict(X_avg_breeze)
avg_breeze_prediction = pd.DataFrame({'ID':df_test.loc[df_test['Average_Breeze_Speed'].isnull(), 'ID']
                                          ,'avg_breeze_predict':avg_breeze_prediction})

df_test = df_test.merge(avg_breeze_prediction, how = 'left', on = 'ID')

df_test.Average_Breeze_Speed.fillna(df_test.avg_breeze_predict, inplace=True)
del df_test['avg_breeze_predict']

Max Breeze Speed

X = df_full[df_full['Max_Breeze_Speed'].isnull() == False].drop(['ID', 'Footfall', 'Date', 'Year'
                                                              , 'Max_Breeze_Speed'
                                                              , 'Min_Breeze_Speed'
                                                              , 'Direction_Of_Wind'], axis = 1)

X = imp.fit_transform(X)

y = df_full['Max_Breeze_Speed'].dropna()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30)

rfr_max_breeze = RandomForestRegressor(n_estimators = 150, n_jobs = 4)
rfr_max_breeze.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=150, n_jobs=4, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

np.mean(cross_val_score(rfr_max_breeze, X_test, y_test))

0.92781368606209191

# Predicting the missing values and filling them in the dataframe
# Training data
X_max_breeze = df[df['Max_Breeze_Speed'].isnull() == True].drop(['ID', 'Footfall', 'Date', 'Year'
                                                              , 'Max_Breeze_Speed'
                                                              , 'Min_Breeze_Speed'
                                                              , 'Direction_Of_Wind'], axis = 1)

X_max_breeze = imp.fit_transform(X_max_breeze)

max_breeze_prediction = rfr_max_breeze.predict(X_max_breeze)
max_breeze_prediction = pd.DataFrame({'ID':df.loc[df['Max_Breeze_Speed'].isnull(), 'ID']
                                          ,'max_breeze_predict':max_breeze_prediction})

df = df.merge(max_breeze_prediction, how = 'left', on = 'ID')

df.Max_Breeze_Speed.fillna(df.max_breeze_predict, inplace=True)
del df['max_breeze_predict']

# Predicting the missing values and filling them in the dataframe
# Testing data
X_max_breeze = df_test[df_test['Max_Breeze_Speed'].isnull() == True].drop(['ID', 'Date', 'Year'
                                                              , 'Max_Breeze_Speed'
                                                              , 'Min_Breeze_Speed'
                                                              , 'Direction_Of_Wind'], axis = 1)

X_max_breeze = imp.fit_transform(X_max_breeze)

max_breeze_prediction = rfr_max_breeze.predict(X_max_breeze)
max_breeze_prediction = pd.DataFrame({'ID':df_test.loc[df_test['Max_Breeze_Speed'].isnull(), 'ID']
                                          ,'max_breeze_predict':max_breeze_prediction})

df_test = df_test.merge(max_breeze_prediction, how = 'left', on = 'ID')

df_test.Max_Breeze_Speed.fillna(df_test.max_breeze_predict, inplace=True)
del df_test['max_breeze_predict']

Min Breeze Speed

X = df_full[df_full['Min_Breeze_Speed'].isnull() == False].drop(['ID', 'Footfall', 'Date', 'Year'
                                                              , 'Min_Breeze_Speed'
                                                              , 'Direction_Of_Wind'], axis = 1)

X = imp.fit_transform(X)

y = df_full['Min_Breeze_Speed'].dropna()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30)

rfr_min_breeze = RandomForestRegressor(n_estimators = 150, n_jobs = 4)
rfr_min_breeze.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=150, n_jobs=4, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

np.mean(cross_val_score(rfr_min_breeze, X_test, y_test))

0.88292032860270131

# Predicting the missing values and filling them in the dataframe
# Training data
X_min_breeze = df[df['Min_Breeze_Speed'].isnull() == True].drop(['ID', 'Footfall', 'Date', 'Year'
                                                              , 'Min_Breeze_Speed'
                                                              , 'Direction_Of_Wind'], axis = 1)

X_min_breeze = imp.fit_transform(X_min_breeze)

min_breeze_prediction = rfr_min_breeze.predict(X_min_breeze)
min_breeze_prediction = pd.DataFrame({'ID':df.loc[df['Min_Breeze_Speed'].isnull(), 'ID']
                                          ,'min_breeze_predict':min_breeze_prediction})

df = df.merge(min_breeze_prediction, how = 'left', on = 'ID')

df.Min_Breeze_Speed.fillna(df.min_breeze_predict, inplace=True)
del df['min_breeze_predict']

# Predicting the missing values and filling them in the dataframe
# Testing data
X_min_breeze = df_test[df_test['Min_Breeze_Speed'].isnull() == True].drop(['ID', 'Date', 'Year'
                                                              , 'Min_Breeze_Speed'
                                                              , 'Direction_Of_Wind'], axis = 1)

X_min_breeze = imp.fit_transform(X_min_breeze)

min_breeze_prediction = rfr_min_breeze.predict(X_min_breeze)
min_breeze_prediction = pd.DataFrame({'ID':df_test.loc[df_test['Min_Breeze_Speed'].isnull(), 'ID']
                                          ,'min_breeze_predict':min_breeze_prediction})

df_test = df_test.merge(min_breeze_prediction, how = 'left', on = 'ID')

df_test.Min_Breeze_Speed.fillna(df_test.min_breeze_predict, inplace=True)
del df_test['min_breeze_predict']

Average Moisture

79 missing values, causing high error on test set

X = df_full[df_full['Average_Moisture_In_Park'].isnull() == False].drop(['ID', 'Footfall', 'Date', 'Year'
                                                              , 'Average_Moisture_In_Park'
                                                              , 'Min_Moisture_In_Park'
                                                              , 'Max_Moisture_In_Park'
                                                              , 'Direction_Of_Wind'], axis = 1)

X = imp.fit_transform(X)

y = df_full['Average_Moisture_In_Park'].dropna()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30)

rfr_avg_moisture = RandomForestRegressor(n_estimators = 150, n_jobs = 4)
rfr_avg_moisture.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=150, n_jobs=4, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

np.mean(cross_val_score(rfr_avg_moisture, X_test, y_test))

0.87629324767915051

# Predicting the missing values and filling them in the dataframe
# Training data
X_avg_moisture = df[df['Average_Moisture_In_Park'].isnull() == True].drop(['ID', 'Footfall', 'Date', 'Year'
                                                              , 'Average_Moisture_In_Park'
                                                              , 'Min_Moisture_In_Park'
                                                              , 'Max_Moisture_In_Park'
                                                              , 'Direction_Of_Wind'], axis = 1)

X_avg_moisture = imp.fit_transform(X_avg_moisture)

avg_moisture_prediction = rfr_avg_moisture.predict(X_avg_moisture)
avg_moisture_prediction = pd.DataFrame({'ID':df.loc[df['Average_Moisture_In_Park'].isnull(), 'ID']
                                          ,'avg_moisture_predict':avg_moisture_prediction})

df = df.merge(avg_moisture_prediction, how = 'left', on = 'ID')

df.Average_Moisture_In_Park.fillna(df.avg_moisture_predict, inplace=True)
del df['avg_moisture_predict']

# Predicting the missing values and filling them in the dataframe
# Testing data
X_avg_moisture = df_test[df_test['Average_Moisture_In_Park'].isnull() == True].drop(['ID', 'Date', 'Year'
                                                              , 'Average_Moisture_In_Park'
                                                              , 'Min_Moisture_In_Park'
                                                              , 'Max_Moisture_In_Park'
                                                              , 'Direction_Of_Wind'], axis = 1)

X_avg_moisture = imp.fit_transform(X_avg_moisture)

avg_moisture_prediction = rfr_avg_moisture.predict(X_avg_moisture)
avg_moisture_prediction = pd.DataFrame({'ID':df_test.loc[df_test['Average_Moisture_In_Park'].isnull(), 'ID']
                                          ,'avg_moisture_predict':avg_moisture_prediction})

df_test = df_test.merge(avg_moisture_prediction, how = 'left', on = 'ID')

df_test.Average_Moisture_In_Park.fillna(df_test.avg_moisture_predict, inplace=True)
del df_test['avg_moisture_predict']

Min Moisture

X = df_full[df_full['Min_Moisture_In_Park'].isnull() == False].drop(['ID', 'Footfall', 'Date', 'Year'
                                                              , 'Min_Moisture_In_Park'
                                                              , 'Max_Moisture_In_Park'
                                                              , 'Direction_Of_Wind'], axis = 1)

X = imp.fit_transform(X)

y = df_full['Min_Moisture_In_Park'].dropna()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30)

rfr_min_moisture = RandomForestRegressor(n_estimators = 150, n_jobs = 4)
rfr_min_moisture.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=150, n_jobs=4, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

np.mean(cross_val_score(rfr_min_moisture, X_test, y_test))

0.93591464075660369

# Predicting the missing values and filling them in the dataframe
# Training data
X_min_moisture = df[df['Min_Moisture_In_Park'].isnull() == True].drop(['ID', 'Footfall', 'Date', 'Year'
                                                              , 'Min_Moisture_In_Park'
                                                              , 'Max_Moisture_In_Park'
                                                              , 'Direction_Of_Wind'], axis = 1)

X_min_moisture = imp.fit_transform(X_min_moisture)

min_moisture_prediction = rfr_min_moisture.predict(X_min_moisture)
min_moisture_prediction = pd.DataFrame({'ID':df.loc[df['Min_Moisture_In_Park'].isnull(), 'ID']
                                          ,'min_moisture_predict':min_moisture_prediction})

df = df.merge(min_moisture_prediction, how = 'left', on = 'ID')

df.Min_Moisture_In_Park.fillna(df.min_moisture_predict, inplace=True)
del df['min_moisture_predict']

# Predicting the missing values and filling them in the dataframe
# Testing data
X_min_moisture = df_test[df_test['Min_Moisture_In_Park'].isnull() == True].drop(['ID', 'Date', 'Year'
                                                              , 'Min_Moisture_In_Park'
                                                              , 'Max_Moisture_In_Park'
                                                              , 'Direction_Of_Wind'], axis = 1)

X_min_moisture = imp.fit_transform(X_min_moisture)

min_moisture_prediction = rfr_min_moisture.predict(X_min_moisture)
min_moisture_prediction = pd.DataFrame({'ID':df_test.loc[df_test['Min_Moisture_In_Park'].isnull(), 'ID']
                                          ,'min_moisture_predict':min_moisture_prediction})

df_test = df_test.merge(min_moisture_prediction, how = 'left', on = 'ID')

df_test.Min_Moisture_In_Park.fillna(df_test.min_moisture_predict, inplace=True)
del df_test['min_moisture_predict']

Max Moisture

X = df_full[df_full['Max_Moisture_In_Park'].isnull() == False].drop(['ID', 'Footfall', 'Date', 'Year'
                                                              , 'Max_Moisture_In_Park'
                                                              , 'Direction_Of_Wind'], axis = 1)

X = imp.fit_transform(X)

y = df_full['Max_Moisture_In_Park'].dropna()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30)

rfr_max_moisture = RandomForestRegressor(n_estimators = 150, n_jobs = 4)
rfr_max_moisture.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=150, n_jobs=4, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

np.mean(cross_val_score(rfr_max_moisture, X_test, y_test))

0.8239425796574974

# Predicting the missing values and filling them in the dataframe
# Training data
X_max_moisture = df[df['Max_Moisture_In_Park'].isnull() == True].drop(['ID', 'Footfall', 'Date', 'Year'
                                                              , 'Max_Moisture_In_Park'
                                                              , 'Direction_Of_Wind'], axis = 1)

X_max_moisture = imp.fit_transform(X_max_moisture)

max_moisture_prediction = rfr_max_moisture.predict(X_max_moisture)
max_moisture_prediction = pd.DataFrame({'ID':df.loc[df['Max_Moisture_In_Park'].isnull(), 'ID']
                                          ,'max_moisture_predict':max_moisture_prediction})

df = df.merge(max_moisture_prediction, how = 'left', on = 'ID')

df.Max_Moisture_In_Park.fillna(df.max_moisture_predict, inplace=True)
del df['max_moisture_predict']

# Predicting the missing values and filling them in the dataframe
# Testing data
X_max_moisture = df_test[df_test['Max_Moisture_In_Park'].isnull() == True].drop(['ID', 'Date', 'Year'
                                                              , 'Max_Moisture_In_Park'
                                                              , 'Direction_Of_Wind'], axis = 1)

X_max_moisture = imp.fit_transform(X_max_moisture)

max_moisture_prediction = rfr_max_moisture.predict(X_max_moisture)
max_moisture_prediction = pd.DataFrame({'ID':df_test.loc[df_test['Max_Moisture_In_Park'].isnull(), 'ID']
                                          ,'max_moisture_predict':max_moisture_prediction})

df_test = df_test.merge(max_moisture_prediction, how = 'left', on = 'ID')

df_test.Max_Moisture_In_Park.fillna(df_test.max_moisture_predict, inplace=True)
del df_test['max_moisture_predict']

Var1

X = df_full[df_full['Var1'].isnull() == False].drop(['ID', 'Footfall', 'Date', 'Year'
                                                    , 'Var1'
                                                    ], axis = 1)

X = imp.fit_transform(X)

y = df_full['Var1'].dropna()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30)

rfr_var1 = RandomForestRegressor(n_estimators = 150, n_jobs = 4)
rfr_var1.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=150, n_jobs=4, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

np.mean(cross_val_score(rfr_var1, X_test, y_test))

0.61193989909532365

# Predicting the missing values and filling them in the dataframe
# Training data
X_var1 = df[df['Var1'].isnull() == True].drop(['ID', 'Footfall', 'Date', 'Year'
                                                              , 'Var1'
                                                              ], axis = 1)

X_var1 = imp.fit_transform(X_var1)

var1_prediction = rfr_var1.predict(X_var1)
var1_prediction = pd.DataFrame({'ID':df.loc[df['Var1'].isnull(), 'ID']
                                          ,'var1_predict':var1_prediction})

df = df.merge(var1_prediction, how = 'left', on = 'ID')

df.Var1.fillna(df.var1_predict, inplace=True)
del df['var1_predict']

# Predicting the missing values and filling them in the dataframe
# Testing data
X_var1 = df_test[df_test['Var1'].isnull() == True].drop(['ID', 'Date', 'Year'
                                                              , 'Var1'
                                                              ], axis = 1)

X_var1 = imp.fit_transform(X_var1)

var1_prediction = rfr_var1.predict(X_var1)
var1_prediction = pd.DataFrame({'ID':df_test.loc[df_test['Var1'].isnull(), 'ID']
                                          ,'var1_predict':var1_prediction})

df_test = df_test.merge(var1_prediction, how = 'left', on = 'ID')

df_test.Var1.fillna(df_test.var1_predict, inplace=True)
del df_test['var1_predict']

Direction of Wind

X = df_full[df_full['Direction_Of_Wind'].isnull() == False].drop(['ID', 'Footfall', 'Date', 'Year'
                                                    , 'Direction_Of_Wind'
                                                    ], axis = 1)

X = imp.fit_transform(X)

y = df_full['Direction_Of_Wind'].dropna()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30)

rfr_wind_dir = RandomForestRegressor(n_estimators = 150, n_jobs = 4)
rfr_wind_dir.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=150, n_jobs=4, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

np.mean(cross_val_score(rfr_wind_dir, X_test, y_test))

0.67839725370662685

# Predicting the missing values and filling them in the dataframe
# Training data
X_wind_dir = df[df['Direction_Of_Wind'].isnull() == True].drop(['ID', 'Footfall', 'Date', 'Year'
                                                              , 'Direction_Of_Wind'
                                                              ], axis = 1)

X_wind_dir = imp.fit_transform(X_wind_dir)

wind_dir_prediction = rfr_wind_dir.predict(X_wind_dir)
wind_dir_prediction = pd.DataFrame({'ID':df.loc[df['Direction_Of_Wind'].isnull(), 'ID']
                                          ,'wind_dir_predict':wind_dir_prediction})

df = df.merge(wind_dir_prediction, how = 'left', on = 'ID')

df.Direction_Of_Wind.fillna(df.wind_dir_predict, inplace=True)
del df['wind_dir_predict']

# Predicting the missing values and filling them in the dataframe
# Testing data
X_wind_dir = df_test[df_test['Direction_Of_Wind'].isnull() == True].drop(['ID', 'Date', 'Year'
                                                              , 'Direction_Of_Wind'
                                                              ], axis = 1)

X_wind_dir = imp.fit_transform(X_wind_dir)

wind_dir_prediction = rfr_wind_dir.predict(X_wind_dir)
wind_dir_prediction = pd.DataFrame({'ID':df_test.loc[df_test['Direction_Of_Wind'].isnull(), 'ID']
                                          ,'wind_dir_predict':wind_dir_prediction})

df_test = df_test.merge(wind_dir_prediction, how = 'left', on = 'ID')

df_test.Direction_Of_Wind.fillna(df_test.wind_dir_predict, inplace=True)
del df_test['wind_dir_predict']

Checking that all missing values have been accounted for:

# pd.concat replaces DataFrame.append, which newer pandas releases no longer provide
df_missing_check = pd.concat([df, df_test])
df_missing_check = df_missing_check.sort_values(['Date', 'Park_ID'], ascending=[True, True])
msno.matrix(df_missing_check)
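
The matrix plot gives a visual check. A numeric count per column is a quick complement, sketched below on toy frames. The names df_toy and df_test_toy are made up for illustration and only stand in for the imputed df and df_test. Footfall is dropped before counting because it does not exist for the test rows.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for the imputed df and df_test
df_toy = pd.DataFrame({'ID': [1, 2], 'Var1': [0.5, 1.2], 'Footfall': [900, 1100]})
df_test_toy = pd.DataFrame({'ID': [3, 4], 'Var1': [0.7, np.nan]})

combined = pd.concat([df_toy, df_test_toy], ignore_index=True)

# Count remaining NaNs per feature column and keep only the non-zero counts
missing_counts = combined.drop(columns=['Footfall']).isnull().sum()
remaining = missing_counts[missing_counts > 0]
print(remaining)
```

An empty result would mean every feature column is fully populated. In this toy case one Var1 value is deliberately left missing so the check has something to report.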

Model Building

Gradient Boosted Trees

A max depth of 5 was the most effective setting
Gradient boosting outperformed both random forests and AdaBoost

X = df.drop(['ID', 'Footfall', 'Date', 'Year'
             , 'Location_Type', 'Average_Atmospheric_Pressure', 'Max_Atmospheric_Pressure'
             , 'Var1', 'Max_Ambient_Pollution', 'Min_Atmospheric_Pressure'
             , 'Max_Breeze_Speed', 'Min_Breeze_Speed', 'Min_Ambient_Pollution'
             , 'Max_Moisture_In_Park'
            ], axis = 1)
imp = Imputer(missing_values='NaN', strategy='median', axis=0)
X = imp.fit_transform(X)

y = df['Footfall']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30)

gbr = GradientBoostingRegressor(n_estimators = 300
                               , max_depth = 5
                               )
gbr.fit(X_train, y_train)

GradientBoostingRegressor(alpha=0.9, init=None, learning_rate=0.1, loss='ls',
             max_depth=5, max_features=None, max_leaf_nodes=None,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=300,
             presort='auto', random_state=None, subsample=1.0, verbose=0,
             warm_start=False)

# Default cross validation, run on the 30% hold-out split
np.mean(cross_val_score(gbr, X_test, y_test, n_jobs = 3))

0.95762687239052224
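
One thing to keep in mind when reading this number: cross_val_score clones the estimator and refits it on folds of whatever data it is given, so the score above comes from models trained on parts of the 30% split rather than from the gbr fitted on X_train. A minimal sketch of the two measurements side by side, using synthetic data from make_regression as a stand-in for the park data (the data, the smaller model settings, and the variable names are all assumptions for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic regression data standing in for the park features and Footfall
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

model = GradientBoostingRegressor(n_estimators=50, max_depth=3, random_state=0)
model.fit(X_train, y_train)

# R^2 of the already fitted model on rows it has never seen
holdout_r2 = model.score(X_test, y_test)

# Cross validation refits clones of the model, here on the training split
cv_r2 = cross_val_score(model, X_train, y_train, cv=5).mean()

print(round(holdout_r2, 3), round(cv_r2, 3))
```

Reporting both makes it clear which number describes the fitted model and which describes the modelling recipe.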

# K-fold cross validation

k_fold = KFold(len(y), n_folds=10, shuffle=True, random_state=0)
cross_val_score(gbr, X, y, cv=k_fold, n_jobs=3)

array([ 0.96254909,  0.96338802,  0.96328219,  0.96139247,  0.96212329,
        0.9647377 ,  0.96253464,  0.96374171,  0.96351884,  0.96228093])
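
The note at the top of this section says a max depth of 5 worked best. A comparison like that could be run with the same K-fold idea, sketched here on synthetic data with the newer sklearn.model_selection API, where KFold takes n_splits instead of the sample count and n_folds. The depths tried, the fold count, and the data are assumptions for illustration and not the settings used in the competition.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for the park features and Footfall
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# Shuffled folds with a fixed seed so every depth sees the same splits
k_fold = KFold(n_splits=5, shuffle=True, random_state=0)

depth_scores = {}
for depth in (2, 3, 5):
    model = GradientBoostingRegressor(n_estimators=50, max_depth=depth, random_state=0)
    scores = cross_val_score(model, X, y, cv=k_fold)
    depth_scores[depth] = (scores.mean(), scores.std())

for depth, (mean, std) in depth_scores.items():
    print(depth, round(mean, 3), round(std, 3))
```

Looking at the spread across folds alongside the mean helps judge whether a difference between two depths is larger than the fold-to-fold noise.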

Examining how far the predictions fall from the actual values, and looking for common characteristics among the observations with the largest errors.

# Plot of error over time
y_pred = gbr.predict(X_test)

cv_error = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred, 'Error': (y_test - y_pred)})
cv_error = pd.merge(df, cv_error, left_index=True, right_index=True)

error_plot = cv_error[['Date', 'Error']]
error_plot = error_plot.set_index('Date')
error_plot.plot(figsize = (20,10))

[Plot: prediction error (actual minus predicted) by date]

cv_error.sort_values('Error').head()

	ID	Park_ID	Date	Direction_Of_Wind	Average_Breeze_Speed	Max_Breeze_Speed	Min_Breeze_Speed	Var1	Average_Atmospheric_Pressure	Max_Atmospheric_Pressure	...	Max_Weekly_Atmosphere	Min_Weekly_Atmosphere	Max_Weekly_Pollution	Min_Weekly_Pollution	Avg_Weekly_Moisture	Max_Weekly_Moisture	Min_Weekly_Moisture	Actual	Error	Predicted
95947	3686224	24	2000-02-12	178.000000	16.720000	38.000000	0.000000	1.66	8300.000000	8317.000000	...	8339.769841	8302.928571	318.951724	153.875862	246.730159	284.904762	200.650794	528	-250.086489	778.086489
22191	3388817	17	1992-11-10	138.026667	78.072267	97.077333	58.773333	0.00	8005.033333	8043.626667	...	8285.809524	8245.206349	291.910448	153.671642	254.158163	278.739796	225.091837	1011	-235.739930	1246.739930
18749	3388419	19	1992-07-10	83.000000	31.160000	45.600000	15.200000	0.00	8341.000000	8355.000000	...	8341.722222	8287.119048	324.059701	160.686567	240.734694	281.693878	186.000000	1035	-228.127802	1263.127802
18751	3388421	21	1992-07-10	81.000000	27.360000	45.600000	15.200000	0.00	8348.000000	8358.000000	...	8341.722222	8287.119048	324.059701	160.686567	240.734694	281.693878	186.000000	999	-227.231848	1226.231848
22247	3394917	17	1992-11-12	138.166667	78.330667	97.330667	59.026667	0.00	8005.033333	8043.626667	...	8285.809524	8245.206349	291.910448	153.671642	254.158163	278.739796	225.091837	652	-222.944937	874.944937

5 rows × 50 columns

cv_error.sort_values('Error').tail()

	ID	Park_ID	Date	Direction_Of_Wind	Average_Breeze_Speed	Max_Breeze_Speed	Min_Breeze_Speed	Var1	Average_Atmospheric_Pressure	Max_Atmospheric_Pressure	...	Max_Weekly_Atmosphere	Min_Weekly_Atmosphere	Max_Weekly_Pollution	Min_Weekly_Pollution	Avg_Weekly_Moisture	Max_Weekly_Moisture	Min_Weekly_Moisture	Actual	Error	Predicted
69981	3562439	39	1997-07-13	204.000000	25.080000	45.600000	0.000000	0.000000	8000.260000	8042.993333	...	8361.619048	8293.492063	315.517730	194.496454	250.790816	287.770408	208.040816	1703	182.894997	1520.105003
3429	3335939	39	1991-01-05	20.000000	36.480000	53.200000	22.800000	0.000000	8000.400000	8042.313333	...	8362.590278	8287.125000	299.838926	111.087248	265.745455	290.850000	225.940909	1323	184.154605	1138.845395
26230	3403224	24	1993-04-03	159.000000	26.600000	38.000000	15.200000	0.000000	8341.000000	8358.000000	...	8381.301587	8309.087302	303.657143	147.142857	252.214286	287.311224	201.443878	1230	195.827119	1034.172881
11167	3352417	17	1991-10-13	140.313333	78.072267	97.026667	59.077333	0.000000	8005.033333	8044.040000	...	8408.166667	8359.476190	309.082707	173.593985	252.872449	284.448980	204.811224	1536	203.969139	1332.030861
48807	3486833	33	1995-06-18	67.000000	21.280000	38.000000	0.000000	192.631933	8420.000000	8441.000000	...	8362.158730	8313.317460	317.285714	182.400000	232.301020	282.260204	176.326531	1663	206.940381	1456.059619
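One way to look for common characteristics among the biggest misses is to aggregate the error by park and by month instead of reading the rows one at a time. The sketch below is only an illustration. It rebuilds a tiny frame from the ten rows shown in the two tables above, and the names cv_error_demo, Abs_Error, Month, by_park and by_month are made up for this example. On the real cv_error frame the same groupby calls would run over every row, not just the extremes.

```python
import pandas as pd

# Miniature stand-in for cv_error, copied from the head() and tail() outputs above
cv_error_demo = pd.DataFrame({
    'Park_ID': [24, 17, 19, 21, 17, 39, 39, 24, 17, 33],
    'Date': pd.to_datetime([
        '2000-02-12', '1992-11-10', '1992-07-10', '1992-07-10', '1992-11-12',
        '1997-07-13', '1991-01-05', '1993-04-03', '1991-10-13', '1995-06-18',
    ]),
    'Error': [
        -250.086489, -235.739930, -228.127802, -227.231848, -222.944937,
        182.894997, 184.154605, 195.827119, 203.969139, 206.940381,
    ],
})

cv_error_demo['Abs_Error'] = cv_error_demo['Error'].abs()
cv_error_demo['Month'] = cv_error_demo['Date'].dt.month

# Which parks show up most often among the large misses, and how large are they?
by_park = (cv_error_demo.groupby('Park_ID')['Abs_Error']
           .agg(['count', 'mean'])
           .sort_values('count', ascending=False))

# Signed mean error per month: negative means over-prediction, positive means under-prediction
by_month = cv_error_demo.groupby('Month')['Error'].mean()

print(by_park)
print(by_month)
```

In this small sample, park 17 appears three times and parks 24 and 39 twice each, which hints that a few parks may account for a disproportionate share of the largest errors.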