Introduction

The goal of this project was to identify persons of interest (POI’s) from a dataset of Enron employees. The dataset initially contained 146 observations across 21 features and a POI status label. There were 18 POI’s and 128 non-POI’s. I used a supervised machine learning algorithm to identify patterns in the email and financial data that separated POI’s from non-POI’s. The financial and email variables included:

Financial Variables: bonus, deferral_payments, deferred_income, director_fees, exer_stock_opts_over_tot, exercised_stock_options, expenses, loan_advances, long_term_incentive, other, restricted_stock, restricted_stock_deferred, salary, salary_over_bonus, total_payments, total_stock_value

Email Variables: email_address, from_messages, from_poi_to_this_person, from_this_person_to_poi, shared_receipt_with_poi, to_messages

Three outliers needed to be removed from the dataset: “TOTAL”, “THE TRAVEL AGENCY IN THE PARK”, and “LOCKHART EUGENE E”. The first two were not actual people, so they contributed nothing useful to the model, and Eugene Lockhart was missing data for every one of his features, so he was removed as well. Beyond those three, it was difficult to justify removing records simply because they varied significantly from the rest of the population: the dataset consisted of fewer than 150 people and contained a great deal of missing data, as shown in the following chart:

There were several financial and email outliers, but I ultimately kept them because they carried useful information. People like Ken Lay and Joseph Hirko had the largest exercised stock options (by several orders of magnitude), but both were persons of interest. With so few observations, excluding additional records would have had a significant impact on the model.
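As a minimal sketch of the cleanup, assuming the data is loaded into a dictionary keyed by employee name (the pickle file name below is an assumption), the three records can be dropped like this:

```python
import pickle

# Load the project data; the file name is an assumption, and the data is
# assumed to be a dictionary keyed by employee name.
with open("final_project_dataset.pkl", "rb") as f:
    data_dict = pickle.load(f)

# Remove the three records that are not usable observations.
for outlier in ["TOTAL", "THE TRAVEL AGENCY IN THE PARK", "LOCKHART EUGENE E"]:
    data_dict.pop(outlier, None)
```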

Feature Engineering

I used Exploratory Data Analysis to identify the variables below as the most important features in the dataset. I engineered two of the features (the fraction of emails sent to POI's and the fraction of emails received from POI's) because the raw number of emails to or from a POI was not comparable across employees: the total number of emails each employee sent or received ranged from 10 to over 10,000, so it made more sense to encode this information as a proportion. When I plotted the two features against each other in a scatterplot (further below), they created clear boundaries that improved my classifier. Values of the fraction of emails sent to a POI (x-axis) below 16% or above 68% belonged exclusively to non-POI's, and the same was true for the fraction of emails received from a POI (y-axis) below 2% or above 14%. These clear boundaries helped my classifier distinguish POI's from non-POI's.

| Variable | Definition |
| --- | --- |
| salary | Guaranteed annual salary for each employee |
| bonus | Additional compensation tied to an employee's performance review |
| total_stock_value | Total value of Enron stock an employee received as compensation |
| fract_from_poi | Fraction of total emails received that came from a POI |
| fract_to_poi | Fraction of total emails sent that went to a POI |
| director_fees | Compensation for attendance at board meetings |
| restricted_stock_deferred | Stock that was not fully transferable until certain conditions were met |
| exercised_stock_options | Amount of Enron stock an employee bought or sold |
| expenses | Costs incurred by an employee while conducting business |
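The two fractional email features could be computed roughly as follows. This is only a sketch that assumes the same employee-keyed dictionary as in the earlier snippet, and compute_fraction is a hypothetical helper:

```python
def compute_fraction(poi_messages, all_messages):
    """Return poi_messages / all_messages, treating missing values ("NaN") as 0."""
    if poi_messages in ("NaN", None) or all_messages in ("NaN", None, 0):
        return 0.0
    return float(poi_messages) / float(all_messages)

for name, record in data_dict.items():
    # Fraction of the emails this person received that came from a POI.
    record["fract_from_poi"] = compute_fraction(
        record.get("from_poi_to_this_person"), record.get("to_messages"))
    # Fraction of the emails this person sent that went to a POI.
    record["fract_to_poi"] = compute_fraction(
        record.get("from_this_person_to_poi"), record.get("from_messages"))
```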

I also used SelectKBest to help determine which features were most strongly related to the POI label. Plotting the negative log of each feature's p-value (shown below) highlighted a few features I had initially overlooked when completing my analysis; in particular, bonus and total_stock_value jumped out as important variables to consider.
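A sketch of the SelectKBest scoring, assuming the features have already been extracted into a matrix X with labels y and a parallel list feature_names (all of which are assumptions here):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Score every feature against the POI label using the ANOVA F-test.
selector = SelectKBest(score_func=f_classif, k="all")
selector.fit(X, y)

# The negative log of each p-value: larger values indicate a stronger
# relationship between the feature and the POI label.
neg_log_p = -np.log10(selector.pvalues_)
for feature, score in sorted(zip(feature_names, neg_log_p), key=lambda t: -t[1]):
    print(f"{feature}: {score:.2f}")
```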

Below are the evaluation metrics for my final model with and without the fract_to_poi and fract_from_poi features. After adding the engineered features, precision increased from 27% to 35% and recall increased from 44% to 78%. I also created the variables salary_over_bonus and exer_stock_opts_over_tot; when I tested them in my algorithm, their evaluation metrics were better than those of the original seven features alone but lower than those of fract_from_poi and fract_to_poi, so I omitted them. The results of adding the engineered features to the model are summarized below:

| Engineered features added | Accuracy | Precision | Recall | F1 | F2 |
| --- | --- | --- | --- | --- | --- |
| None (original features only) | 77% | 27% | 44% | 34% | 39% |
| fract_to_poi, fract_from_poi | 78% | 34% | 78% | 48% | 63% |
| salary_over_bonus, exer_stock_opts_over_tot | 76% | 32% | 67% | 43% | 55% |

The following plots helped me further understand the underlying nature of the financial and email variables. The boxplot of the financial variables showed that only non-POI's had values present for director_fees and restricted_stock_deferred. The boxplot of the email variables prompted me to think critically about the fractional email features that I eventually engineered and included in my classifier:

After identifying useful features, I scaled them before trying PCA in my model. I scaled the features because the variables were on vastly different scales, which would have skewed the results: the fractional email variables ranged from 0.0 to 1.0, while exercised stock options ranged from about 1,000 to around 30,000,000. With such disparate ranges, PCA would have let the largest-valued variables dominate the directions of maximum variance. I wrote my own min-max scaler and applied it immediately after creating the new features, which ensured that every variable ranged from 0 to 1. PCA then allowed the data to speak for itself in a way, and I used GridSearchCV to tune the number of components. After testing between 2 and 11 components, I found that my evaluation metrics were better without the transformation, so I excluded PCA from the final model.
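A sketch of how scaling, PCA, and GridSearchCV could be chained together; X and y are assumed to be the prepared feature matrix and labels, and the scoring choice and split settings are illustrative:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

# Scale first so no single feature (e.g. exercised_stock_options) dominates
# the principal components, then let GridSearchCV choose the component count.
pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("pca", PCA()),
    ("clf", SVC(class_weight="balanced")),
])

cv = StratifiedShuffleSplit(n_splits=10, random_state=42)
grid = GridSearchCV(pipe, {"pca__n_components": range(2, 12)},
                    scoring="f1", cv=cv)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```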

Algorithm Selection

I tested several algorithms before settling on a Support Vector Machine. I started with Gaussian Naive Bayes to get a benchmark, then tried a Decision Tree, K-Nearest Neighbors, AdaBoost, and finally a Support Vector Machine. The K-Nearest Neighbors classifier had the worst results overall, at roughly 0.22 for precision, recall, and F1. AdaBoost was slower than the other classifiers and did only marginally better than K-Nearest Neighbors. The Decision Tree classifier initially had the best overall precision, recall, and F1 scores, but as shown further below, another classifier performed slightly better.

The Support Vector Machine had the best overall scores once I set the class_weight parameter to “balanced”. This mattered because the dataset was highly imbalanced: there were only 18 POI's and 128 non-POI's (125 once outliers were removed). With the default class_weight, the algorithm penalized mistakes on both classes equally; setting class_weight to “balanced” remedied this by weighting errors on the minority class more heavily. The results of this initial comparison are summarized below; each model used the same set of features described above:

| Classifier | Accuracy | Precision | Recall | F1 | F2 |
| --- | --- | --- | --- | --- | --- |
| Decision Tree | 83% | 35% | 36% | 35% | 36% |
| Naive Bayes | 33% | 17% | 100% | 29% | 50% |
| K-Nearest Neighbors | 79% | 22% | 23% | 23% | 23% |
| Support Vector Machine | 76% | 33% | 73% | 46% | 59% |
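The comparison could be run along these lines. This is a sketch assuming scaled features X and labels y, and the cross-validation settings are illustrative:

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedShuffleSplit, cross_validate

candidates = {
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "Support Vector Machine": SVC(class_weight="balanced", random_state=42),
}

cv = StratifiedShuffleSplit(n_splits=10, random_state=42)
for name, clf in candidates.items():
    # Average precision, recall, and F1 across the stratified splits.
    scores = cross_validate(clf, X, y, cv=cv,
                            scoring=["accuracy", "precision", "recall", "f1"])
    print(name,
          round(scores["test_precision"].mean(), 2),
          round(scores["test_recall"].mean(), 2),
          round(scores["test_f1"].mean(), 2))
```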

Parameter Tuning

Tuning the parameters of an algorithm involves testing different combinations of input arguments. These arguments, known as hyperparameters, are set before an algorithm is fit to the data and affect the model's performance on an independent data set, including its complexity, learning capacity, and speed. Tuning them is one way to “custom fit” an algorithm to the training data it is modeling, adjusting it at a high level prior to training. For instance, if you are using a Decision Tree classifier and your model is underfitting the data, you can lower the min_samples_split hyperparameter to allow the model to make more granular splits and hence fit the data more closely.
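As a small, hypothetical illustration of what such an adjustment looks like in code (the specific values are arbitrary):

```python
from sklearn.tree import DecisionTreeClassifier

# A conservative tree: a node must contain at least 20 samples before it can
# be split, which keeps the model simpler.
conservative_tree = DecisionTreeClassifier(min_samples_split=20, random_state=42)

# A more flexible tree: nodes with as few as 2 samples can still be split,
# letting the model fit the training data more closely.
flexible_tree = DecisionTreeClassifier(min_samples_split=2, random_state=42)
```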

Parameter tuning applies both to pre-processing steps like PCA and to the algorithm itself. I used GridSearchCV to accomplish this: it let me try many parameter combinations in one shot instead of rerunning the algorithm every time I wanted to make an adjustment. I tried a range of 2 to 11 PCA components and found that PCA did not help my model much. When testing the Decision Tree classifier, I experimented with different values of max_depth, the split criterion, and min_samples_split. Since we calculated entropy for various scenarios in the lessons, I tried it as an alternative criterion, but the results did not change much from the baseline.
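A sketch of the Decision Tree grid search; the specific grid values below are illustrative, and X and y are assumed to be the prepared feature matrix and labels:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [None, 2, 4, 6, 8],
    "min_samples_split": [2, 4, 8, 16],
}

cv = StratifiedShuffleSplit(n_splits=10, random_state=42)
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid,
                    scoring="f1", cv=cv)
grid.fit(X, y)
print(grid.best_params_)
```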

In the end I stuck with the baseline parameters for the SVC. I tested different values of C, gamma, and class_weight, but again they did not perform better than the baseline. I set random_state to 42 (because the answer is always 42). When I did not specify a random_state, the scores kept bouncing around between runs, so fixing it was an important step to ensure that others could reproduce my results.
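So the final classifier amounts to something like the following sketch of the configuration described above:

```python
from sklearn.svm import SVC

# Default kernel and parameters, with balanced class weights to offset the
# 18-vs-125 class imbalance and a fixed random_state for reproducibility.
clf = SVC(class_weight="balanced", random_state=42)
```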

Model Validation

Validation involves ensuring that a model is effective not just on the data it was trained on but also on held-out testing data. One pitfall is accidentally overfitting the model to the training data: the model performs well on the training set but poorly on new testing data. This can happen if you do not set aside a large enough portion of the data for testing, or if the model learns an overly convoluted decision boundary during training.
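A minimal sketch of a hold-out split; the 30% test size and stratification are illustrative choices, and X and y are assumed to be prepared already:

```python
from sklearn.model_selection import train_test_split

# Hold out 30% of the observations so the model is evaluated on data it never
# saw during training; stratify to preserve the POI / non-POI ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
```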

I validated my analysis with cross-validation using StratifiedShuffleSplit. I generated 10 stratified train/test splits, trained the model on each training portion, and computed the accuracy score on each corresponding test portion. Stratification kept the proportion of POI's and non-POI's in each split as close as possible to the proportion in the full dataset. The average accuracy score using this method was 81.2%.
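A sketch of that validation loop, assuming clf is the final classifier and X and y are the prepared features and labels:

```python
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

# Ten stratified splits keep the POI / non-POI proportions of each train/test
# partition close to those of the full dataset.
cv = StratifiedShuffleSplit(n_splits=10, random_state=42)
scores = cross_val_score(clf, X, y, scoring="accuracy", cv=cv)
print("Mean accuracy: {:.1%}".format(scores.mean()))
```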

Conclusion

My final precision and recall scores were 35% and 78%, respectively. Precision is the ratio of true positives to the total number of true positives and false positives; in other words, it asks, “Out of all of the items labeled as positive, how many truly belong to the positive class?” Recall is the ratio of true positives to the total number of true positives and false negatives; it asks, “Out of all of the items that are truly positive, how many were correctly labeled as positive?” Below are the equations for precision and recall as well as the evaluation metrics from tester.py:
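Writing TP, FP, and FN for true positives, false positives, and false negatives, the two metrics are:

$$\text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN}$$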

| Classifier | Accuracy | Precision | Recall | F1 | F2 |
| --- | --- | --- | --- | --- | --- |
| Support Vector Machine | 78% | 35% | 78% | 48% | 63% |

In context, my algorithm had more false positives than false negatives. True positives were cases where a POI was correctly identified as a POI, and true negatives were cases where a non-POI was correctly identified as a non-POI. A false positive was a non-POI incorrectly labeled as a POI, and a false negative was a POI incorrectly labeled as a non-POI. Having relatively many false positives and few false negatives led to a high recall score and a lower precision score. I could have used accuracy as my primary metric, but that would have been misleading: there were only 18 POI's in the entire dataset. Because the dataset was so imbalanced, I could have achieved 87.4% accuracy by simply labeling every observation as a non-POI. That is why precision and recall were better metrics for evaluating my algorithm on this dataset.