2012 and 2016 Presidential Elections

Source: https://www.kaggle.com/joelwilson/2012-2016-presidential-elections

In our class, we were asked to analyze two datasets over the course of the semester with a variety of machine learning techniques, including supervised, unsupervised, and neural network methods. The first dataset I chose contains the 2012 and 2016 general election results. In particular, I hoped to explore which counties switched parties between the two elections, and why. For this task, I found a Kaggle dataset listing every county along with its demographic information and its voting records in the 2012 and 2016 elections.

Out of all counties which voted for Obama in the 2012 general election, almost a third switched to the Republican party to vote for Trump in 2016. Conversely, virtually no counties switched from voting for Romney in 2012 to vote for Hillary Clinton in 2016. To balance the data, I removed the counties which voted for Romney in 2012 before further processing. The intensity of the vote for Obama in 2012, however, did prove significant and was left in as a training feature. I also rescaled the features using a RobustScaler, which centers each feature on its median and scales it by its interquartile range. This prevented any feature from unnecessarily taking precedence over another simply because of its larger values.
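A rough sketch of this preprocessing step is shown below, assuming the Kaggle CSV has been loaded into a pandas DataFrame `df`; the vote-count column names are hypothetical placeholders, not the dataset's actual field names:

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Keep only counties Obama won in 2012; his 2012 vote share stays in as a feature.
obama_counties = df[df["votes_dem_2012"] > df["votes_gop_2012"]].copy()

# Binary target: did the county stay Democratic in 2016?
y = (obama_counties["votes_dem_2016"] > obama_counties["votes_gop_2016"]).astype(int)

# Drop anything that would directly reveal the 2016 outcome.
feature_cols = [c for c in obama_counties.columns if "2016" not in c]
X = obama_counties[feature_cols].select_dtypes("number")

# RobustScaler centers each feature on its median and scales it by its
# interquartile range, so no feature dominates merely because of its magnitude.
X_scaled = RobustScaler().fit_transform(X)
```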

After dividing the data into input and output series (so that nothing fed to the learner would directly reveal the voting outcome) and creating cross-validated sets, the learner was optimized on the binary cross-entropy between its predictions and the binary voting outcome of each batch of counties.
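A minimal sketch of this setup, continuing from the preprocessing above; the stand-in classifier and fold count are illustrative assumptions, not the exact models reported below:

```python
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import log_loss
from sklearn.linear_model import LogisticRegression  # stand-in learner

y_arr = y.to_numpy()
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

for train_idx, test_idx in cv.split(X_scaled, y_arr):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_scaled[train_idx], y_arr[train_idx])
    # log_loss is the binary cross-entropy between the predicted probabilities
    # and the true binary voting outcome for this fold of counties.
    probs = clf.predict_proba(X_scaled[test_idx])[:, 1]
    print(log_loss(y_arr[test_idx], probs))
```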

Figure 1: A pruned decision tree output by R predicting whether or not Obama 2012 counties will vote for Clinton in 2016. "Obama" represents the percentage of votes in 2012 for Obama, "SBO215207" indicates "Asian-owned firms, percent, 2007," "RHI825214" indicates "White alone, not Hispanic or Latino, percent, 2014," "HSG445213" indicates "Homeownership rate, 2009-2013," and "Edu_bachelors" indicates "Bachelor's degree or higher, percent of persons age 25+, 2009-2013." The tree is read from the top, following to the left if the criteria listed are false, and to the right if the criteria are true.

Decision Trees

Decision Trees proved complicated to perform in Python / Sklearn, as no pruning feature has yet been included in the library (although one of my peers did implement pruning for sklearn; I am unsure whether he has pushed it to the repo, but code is available on request). I have therefore separated these from the other learners in this write-up, since they were run in a different programming environment and may have unforeseen training or scoring differences.
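For reference, newer releases of scikit-learn (0.22 and later) do expose minimal cost-complexity pruning through the ccp_alpha parameter, so a pruned tree roughly comparable to the R output could be sketched as below. This is not the code behind Figure 1, which was produced in R, and the ccp_alpha value is an arbitrary placeholder:

```python
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y_arr, test_size=0.3, stratify=y_arr, random_state=0)

# Larger ccp_alpha values prune the tree more aggressively.
tree = DecisionTreeClassifier(ccp_alpha=0.005, random_state=0)
tree.fit(X_train, y_train)

print(tree.score(X_test, y_test))
print(export_text(tree, feature_names=list(X.columns)))
```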

I found very positive results from several pruned and boosted decision tree learners on these data, with accuracy upwards of 90%. Decision trees are also nice because they show their decision-making process clearly, in human-readable terms. The figure to the right is just one of these learners [Figure 1].

The learner in Figure 1 can be a little hard to interpret, so here is how it should be read: if a county gave more than 57% of its 2012 votes to Obama, this learner predicts that county will vote for Clinton 93.9% of the time. For counties which gave Obama less than 57%, however, many factors contribute to their switching to Trump.

Ironically, the edge cases for this learner say that higher education and lower white non-Hispanic populations predict a Trump turnout among high-Obama-vote counties. For low-Obama-vote counties, higher bachelor's-degree attainment again splits the vote toward Trump, though only by 12 counties. More significantly, high rates of "Asian-owned firms, percent, 2007" also predicted Trump's rise. None of these branches is statistically significant on its own compared to the wider trend, and they highlight the fallacy of using machine learning to infer attributes about your data. The curse of dimensionality predicts that with a large enough dataset, even random noise can correlate with improvements in prediction; these small splits fit that description, as they sit farther down the tree and split only a small amount of data. Furthermore, these attributes can be replaced by other, randomly produced learners returning the exact opposite results, simply further up or down the tree. Many correlated, non-mutually-independent features can be used interchangeably by these learners even when no causal relationship exists.

Simple Learners Did Best

Support Vector Machines (SVMs) were by far the best learner, with their best results coming in at 97% accuracy. K-Nearest Neighbors (KNN) also did extremely well, topping out at 94%. Both are relatively simple learners, yet at their best they outperformed boosted decision trees by a few points. This may be due to factors specific to each learner. KNN has the advantage of finding the most similar counties to any given county and simply transferring their outcomes to the output; this requires no regression, which is typically difficult for something as complex as social dynamics. SVMs, on the other hand, imply that the data are mostly linearly separable: while more than one feature must be at play (à la the decision tree's findings), a simple linear combination of features is enough to decide the output 97% of the time. This is surprising, to say the least, but it may show that, on average, voters who changed their voting tendencies in the 2016 election did so under pressures that correlate with demographics (even if it is not precisely clear what those demographic correlates are, or which demographics they affect). The nuance conveyed in recent past elections seems to be enough to bias the learner toward simple yet powerful conclusions.
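A sketch of both learners under the same cross-validation follows; the linear kernel, C, and k values are illustrative assumptions, not the tuned settings behind the 97% and 94% figures:

```python
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

svm = SVC(kernel="linear", C=1.0)           # linear-separability assumption
knn = KNeighborsClassifier(n_neighbors=5)   # vote of the 5 most similar counties

for name, clf in [("SVM", svm), ("KNN", knn)]:
    scores = cross_val_score(clf, X_scaled, y_arr, cv=5, scoring="accuracy")
    print(name, scores.mean())
```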

Neural Network Optimization

Sequential Neural Network Configuration

Figure 4: The node count and activation function of each layer of the NN in sequence. Without at least one ReLU layer, convergence was too slow, yet with all ReLU layers, ReLU dying would take place from the start. As such, a mixture was used, replacing ReLU with Tanh in the early layers.
Figure 2: Accuracy score for my TensorFlow / Keras implementation of the neural network. As you can see, the network quickly flips between the prediction extrema (always True, always False), the mean bias of the dataset being around 32%. Because of this quick convergence, the ReLU layers die from this point on and fail to produce any further improvement.

Figure 3: Here we use Simulated Annealing to improve on the results in Figure 2. Score is percent accuracy, and N is the number of training iterations (log scale). Multiple hyperparameter combinations are shown. Notice that even with only one training step, better-than-average starting values could be attained depending on the starting temperature (T).

We attempted neural network optimization for these data with mixed results. My Keras / TensorFlow implementation would consistently, and quickly, flip between prediction extrema, starting at one minus the mean error (32%) and then quickly converging to the mean (68%), resulting in ReLU death [Figure 2]. This could be solved by further normalizing the dataset, but here we use a different improvement method.
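A minimal sketch of such a network in Keras is shown below; the layer sizes are placeholders rather than the exact configuration in Figure 4, but they follow the same pattern of Tanh in the early layers and ReLU later:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(64, activation="tanh", input_shape=(X_scaled.shape[1],)),
    layers.Dense(32, activation="tanh"),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # probability of staying Democratic
])

# Optimized on the binary cross-entropy against the binary voting outcome.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
history = model.fit(X_scaled, y_arr, epochs=50, batch_size=32, validation_split=0.2)
```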

As an alternative to the usual stochastic gradient descent backpropagation for further training, we instead attempted to train the network with randomized optimization techniques, namely Randomized Hill Climbing with Random Restarts, Simulated Annealing [Figure 3], and Genetic Algorithm optimization of the same network, using a randomized optimization library called ABAGAIL. The results with these methods were comparable to the other solutions, even matching the SVM scores in the best case. Likely this is because, when imbalanced data would occasionally cause a learner to begin learning at one minus the mean error, a randomized restart would fix this inappropriate starting location. Simulated Annealing worked best of all, while Genetic Algorithm optimization had to be stopped early due to the time complexity of the problem.
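The actual runs used ABAGAIL, a Java library, so the sketch below is only an illustration of the idea in Python rather than ABAGAIL's API: treat a flattened weight vector as the state, propose random perturbations, and accept worse states with a temperature-dependent probability. The step size, cooling schedule, and single-layer stand-in model are all assumptions made for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

def predict(w, X):
    # A single-layer stand-in network: linear combination squashed by a sigmoid.
    return 1.0 / (1.0 + np.exp(-(X @ w[:-1] + w[-1])))

def score(w, X, y):
    return np.mean((predict(w, X) > 0.5) == y)

def simulated_annealing(X, y, n_iter=5000, T=1.0, cooling=0.999, step=0.1):
    w = rng.normal(size=X.shape[1] + 1)
    s = best_s = score(w, X, y)
    best_w = w
    for _ in range(n_iter):
        candidate = w + rng.normal(scale=step, size=w.shape)
        cand_s = score(candidate, X, y)
        # Always accept improvements; accept worse moves with probability exp(delta / T).
        if cand_s >= s or rng.random() < np.exp((cand_s - s) / T):
            w, s = candidate, cand_s
            if s > best_s:
                best_w, best_s = w, s
        T *= cooling  # cool the temperature so late moves become increasingly greedy
    return best_w, best_s
```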

Clustering

We also attempted clustering as a form of dimensionality reduction before learning on the data. This also improved neural network results, but not more so than the randomized optimization results. Still, I believe it is worth sharing some of the results I uncovered, along with the attempted methodology, and discussing their implications. These results are not used empirically; since they did not outperform the previous methods, what is most notable is their surprising visual correlation with real-world labels.

I chose to perform a genetic algorithm hyperparameter optimization (using sklearn-deap, to which I am a contributor), choosing between clustering methods and their hyperparameters based on how well the resulting clusters could be used to train a simple learner, such as an SVM or small neural network. The best results were found using Expectation Maximization (EM), looking for 5 clusters. Why Expectation Maximization? When we look at the raw data, we find that demographic voting clusters tend to overlap and are therefore not wholly separable into distinct clusters. Expectation Maximization is one of just a few clustering algorithms that can handle overlapping clusters with probabilities instead of absolute labels.
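In scikit-learn terms, Expectation Maximization over overlapping clusters corresponds to a Gaussian mixture model, so the chosen configuration can be sketched as below; the sklearn-deap search that selected it is omitted, and the covariance type is an assumption:

```python
from sklearn.mixture import GaussianMixture

# A Gaussian mixture fit by Expectation Maximization: each county receives a
# probability of belonging to each of the five overlapping clusters.
em = GaussianMixture(n_components=5, covariance_type="full", random_state=0)
em.fit(X_scaled)

soft_labels = em.predict_proba(X_scaled)   # shape (n_counties, 5)
hard_labels = em.predict(X_scaled)         # most likely cluster per county
```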

Now, why 5 clusters? Truthfully, 5, 6, or 7 clusters would all be reasonable. However, five was the first and most commonly chosen number of components by the hyperparameter search. Following Occam's razor, I believe five is best to prevent overfitting, since nothing below that value produced reasonable results. Once five clusters are reached, the results are visually astounding.

Figure 5a: The real data, with Trump-voting counties in orange and Clinton-voting counties in blue. Notice that in many images one color appears to dominate the other. I have attempted to minimize this, but oftentimes it is merely the first color drawn, rather than the most significant, that dominates the picture. These plots are in the same order as the images in Figure 5b, which correlate with these data. All labels are as described in the following reference table.

Figure 5b: The EM data using 5 clusters, in the same order as those in Figure 5a. These data have been thinned out randomly to speed up computation. Some of these do not seem to correlate well, but some may surprise you. I will let you determine for yourself which color is which candidate, as this is an unsupervised algorithm. All labels are as described in the following reference table.

Dangerous Conclusions

Clustering as Commentary

In many of the above comparisons between clustered data and their real-world classifications, convincing correlations only occur with more than 4 clusters. My first instinct was to conclude that a two-party system is insufficient to classify the American population into voting clusters. However, while this is likely true, it assumes that the learning method is trying to cluster people by party. In fact, as an unsupervised learner, it is not trying to cluster people into parties at all. Rather, it clusters people by their similarity to one another, factoring in party-line voting in a past election as just one of its features. If anything, the minuscule population which is not easily clustered by two of the five components shows the robustness of a two-party clustering while recognizing its imperfection. The noisiness of a country's worth of counties is inevitable, and if we removed past voting data we would likely get an entirely different result. The fact that just one out of so many features could cause people to be clustered into just two groups, with only three small outliers, speaks volumes about the highly binary and opinionated nature of America's population.

It could also speak to an overriding demographic polarization in American politics. Even though no single factor determines voting, there seem to be clusterable groups within America's culture, and we can expect that similar groups will vote similarly. Peer group, ultimately, may be the deciding factor in who votes for which candidate.

Future vs. the Past

Either of these conclusions might be true, but we also ought not to take from these learners that future elections will work the same way, or even that the election we are considering causally behaved this way. What our learners give us is a model, or story, about our data, one which accurately classifies the outcomes of an election by learning from one part of the data and testing on another. Two things should be considered when we learn this:

  1. The subset of the data we learn from can bias our learner to automatically get at least that percentage of the classifications correct.
  2. This bias is a luxury we do not have for predicting future data.

A learner like KNN saves the training data for all future testing and will get 100% accuracy with a k value of 1 on any testing of that same training data. Other learners attempt to regress on the training data in a way that prevents overfitting, but we should still be cautious about over-interpreting the results.
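For instance, this quick check (reusing the earlier train/test split) illustrates why training-set accuracy from a memorizing learner is not meaningful on its own:

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

# Each training point's nearest neighbor is itself, so this is always 1.0
# (barring duplicate points with conflicting labels).
print(knn.score(X_train, y_train))

# Held-out accuracy is the number that actually matters.
print(knn.score(X_test, y_test))
```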

Correlation vs. Causation

Further, we ought to be aware that the existence of any given story about our data which returns an accurate prediction does not preclude the existence of other equally valid stories, just as we saw with the decision tree model, and it does not make our story causally related to the outcome.

For our clustering model, too, we ought to be aware that it does not line up 1:1 with our data, more so in some comparisons than others. We should also remember that the distributions overlap and therefore might easily change year after year. Similarly, we do not know why it chose these clusters the way it did, or what they represent in the real world, if there even is a real-world correlation at all. All we know is that this clustering best helped train a predictive model and looks good to the human eye, both of which carry biases.

Further Reading

  1. Election Results in the 3rd Dimension

  2. The Counties That Flipped From Obama To Trump, In 3 Charts

Thumbnail Attribution: Wikipedia
