Motivation
This post is the fourth in a series that make up my entry in Ari Lamstein’s R Election Analysis Contest.
Earlier posts introduced the nzelect R package, basic usage, how it was built, and an exploratory Shiny web application. Today I follow up on discussion on the StatsChat blog. A post there showed a screenshot from my Shiny app, zooming in on Auckland votes for the Green Party compared to the Labour Party (the two principal left-leaning parties in New Zealand). There was discussion in the comments of whether house prices, income, and education were the predictors of the choice between these two parties (for the subset of voters who chose one of the two).
Combining census meshblock data with voting locations
Making it easier to investigate this issue is one of the primary motivations for the nzelect R package, and incorporating census data that can be matched to voting locations and electorates has been on the plan from the beginning. It's proven a bigger job than I hoped and is still very incomplete, but so far I've incorporated around 35 selected variables from the 2013 Census at the meshblock level.
First look
The plot below shows how some of those variables relate to the variable of interest here – the Green Party's votes as a proportion of all votes that went to either the Green Party or the Labour Party, by voting location.
Full descriptions of the variables are in the helpfile for the Meshblocks2013 object in the nzelect package. People without access to that can still read the source code of the helpfile, which hopefully will make sense. Time and space constraints put me off creating a big table here in this blog post.
The blue lines show outlier-resistant robust linear regressions on one explanatory variable at a time, but the non-robust versions are almost identical. All code has been relegated to the bottom of the post. When the nzelect package has a more complete set of census data, including more variables plus area unit, electorate and territorial authority equivalents, I'll do a future post that focuses more on the code to join these things together.
Two variables that are in the nzelect package have been excluded. These are PropEuropean, the proportion of the meshblock population of European ethnicity; and PropOwnResidence, the proportion of individuals who own their own residence. I excluded these early, as part of my modelling strategy, because they are highly collinear with the other candidate explanatory variables – variance inflation factors over 10. What this means in practice is that once you know the proportions of Maori, Asian and Pacific ethnicity in a meshblock, there's not much to learn from the proportion of people of European ethnicity. And once you know the proportion of households that are owned, there's not much extra information in the proportion of individuals who own their residence.
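As a rough sketch of that check (my illustration rather than the code actually used; the greens data frame is built in the code at the end of the post):

```r
library(car)   # for vif()
# Variance inflation factors from an ordinary linear model containing all
# the candidate explanatory variables; values over 10 flag the kind of
# collinearity described above. The formula is illustrative, not the
# exact one used.
full_lm <- lm(PropGreens ~ ., data = subset(greens, select = -TotalGLVotes))
vif(full_lm)
```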
Spatial elements
Note that I've left in two variables for the latitude and longitude of the voting place, which (probably as expected) don't have any obvious relationship to the proportion of Green v Labour voters. This is so that when I get into statistical modelling, I can take into account expected spatial autocorrelation. Simply put, that means the results from two voting places that are close together shouldn't be treated as completely fresh, independent information, but should be given somewhat less importance because of their co-location. This is just a nuisance aspect of the model with regard to today's question, so I don't report on those results; but in the modelling later I make sure I have a spatial spline lurking in the background to suck up any spatial effects, leaving the estimates of the impact of the other explanatory variables safer.
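For a sense of whether that autocorrelation is really there, one quick check is Moran's I with inverse-distance weights. This is my illustration, not part of the original analysis:

```r
library(ape)   # for Moran.I()
# Inverse-distance weights between voting places. Co-located voting places
# would give infinite weights, so real use needs jittering or aggregation
# first.
d <- as.matrix(dist(greens[, c("WGS84Longitude", "WGS84Latitude")]))
w <- 1 / d
diag(w) <- 0
Moran.I(greens$PropGreens, w)   # low p-value is evidence of spatial structure
```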
Imputation
For confidentialisation, meshblock census results are subject to random rounding (which is why some of the proportions exceed 1 – the numerator has been rounded one way, and the denominator another) and to outright cell suppression. For today’s purposes, the cell suppression is more of a problem. As I’ve potentially got lots of explanatory variables in my model, many meshblocks are missing at least one variable due to the confidentialisation. If I excluded each voting place whose meshblock was missing data, I’d be throwing out 61% of the data.
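That figure comes from a one-liner; complete.cases() flags rows with no missing values:

```r
# Proportion of voting places whose meshblock has at least one suppressed
# (missing) census variable - about 0.61 in this data:
mean(!complete.cases(greens))
```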
This is enough of a problem that I’d consider going up a level and using area units (larger shapes than meshblocks) instead, although many area units have multiple voting locations in them so we would lose a fair bit of granularity in doing so. However, the nzelect data doesn’t yet include its area unit slice and I’m short of time, so I opt for imputation instead.
I use multiple imputation: statistical models predict each missing value from the values that are present, and this is done multiple times for each missing value, keeping the random element of the model so each imputed value differs. The final model of actual interest is then fitted to each of the completed datasets (here, 20 of them), and the calculations of standard errors, confidence intervals, etc. reflect the randomness of the imputation process as well as the usual modelling uncertainty. It's not a fool-proof method – nothing is in the face of missing data – but it's a good one and it does the job in this case (how do I know? I tried other methods, including deleting all incomplete rows, and got similar results).
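Here's a minimal sketch of that workflow with the mice package (the explanatory variables named are illustrative placeholders, and the real model of interest is the GAM in the next section):

```r
library(mice)
# Create 20 completed datasets, each with its own random imputations:
imp <- mice(greens, m = 20, seed = 123, printFlag = FALSE)
# Fit the model of interest to each completed dataset, then pool with
# Rubin's rules, so the standard errors reflect the imputation randomness
# as well as the usual modelling uncertainty:
fits <- with(imp, lm(PropGreens ~ PropBachelor + PropSelfEmployed))
summary(pool(fits))
```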
Modelling
When it comes to the actual modelling I use a generalised additive model with a binomial family response. The 'additive' bit comes from a flexible two-dimensional spline which is just there to deal with the nuisance of spatial autocorrelation. The binomial family is appropriate because the response is a proportion. Here are the end results, showing which variables are good for predicting the proportion of Green/Labour votes that went to the Green Party:
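In code, the model looks something like this sketch with mgcv (the non-spatial explanatory variables shown are a placeholder subset of the full list):

```r
library(mgcv)
# Binomial-family GAM: the response is the proportion of Green + Labour
# votes that went to the Greens, weighted by the total such votes at each
# voting place; s(longitude, latitude) is the two-dimensional spline that
# soaks up spatial autocorrelation.
mod <- gam(PropGreens ~ s(WGS84Longitude, WGS84Latitude) +
             PropSelfEmployed + PropBachelor + MedianIncome,
           family = binomial, weights = TotalGLVotes, data = greens)
summary(mod)
```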
Interpretation
So, just to be really clear, here are the factors in the surrounding meshblock that seem to lead to a voting place returning a high proportion of Green/Labour voters voting Green:
Higher proportion of individuals are self-employed
Higher proportion of individuals have a Bachelor's degree
Higher proportion of individuals have no religion
Higher proportion of individuals were overseas 5 years ago
Higher proportion of households don’t own their residence
And here are the factors that are associated with a high proportion of Green/Labour voters voting Labour:
Higher proportion of individuals state they are of Pacific, Asian or Maori ethnicity
Higher proportion of individuals have no qualifications
Higher proportion of individuals were born in New Zealand
Higher proportion of individuals have a partner or are separated
Higher individual median income
Higher proportion of individuals are aged 20 to 24
There's no discernible impact from median household rent, the best proxy for house prices in the particular data available. So on the data available, the general demographics argument wins over the housing prices one.
The surprising one here was the negative coefficient in front of median income. This differs from the simple two-variable relationship seen in the first graphic in this post. What we're seeing with the more sophisticated modelling is evidence that the apparent association between income and a higher proportion of Green voters over Labour is a proxy for other demographic factors (like education, ethnicity and employment); when those factors are controlled for, the relationship between income and choosing Green over Labour is negative.
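The flip is easy to see in miniature by comparing a single-variable model with an adjusted one (illustrative variable names again; it's the full model described above in which the income coefficient comes out negative):

```r
# Marginal relationship of income with choosing Green over Labour:
coef(glm(PropGreens ~ MedianIncome, family = binomial,
         weights = TotalGLVotes, data = greens))["MedianIncome"]
# The same coefficient after controlling for some of the demographic
# variables that income proxies for:
coef(glm(PropGreens ~ MedianIncome + PropBachelor + PropNoReligion +
           PropSelfEmployed, family = binomial,
         weights = TotalGLVotes, data = greens))["MedianIncome"]
```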
Care needs to be taken in leaping from this analysis to conclusions about individual behaviour, which would risk the ecological fallacy. For example, while areas with higher incomes seem more inclined to choose Labour over Green after controlling for other factors, it's at least possible that it's being surrounded by higher-income people that has the effect, rather than being higher income oneself; and similarly with any other observed effect. Only individual-level data could resolve that, and the New Zealand Election Study (which would provide it) isn't yet available for download; but with caution and caveats you can at least form some tentative conclusions about individuals.
Final warnings
One thing we don't do at this point is knock out all those "non-significant" variables (like the proportion of people on an unemployment benefit, or the proportion who are full-time students) and refit the model. Doing that leads to all the standard errors being biased downwards, the effect sizes being biased upwards, test statistics not having their claimed distributions, and confidence intervals that are too small; in fact pretty much any inference after that point is invalid. This doesn't stop that kind of stepwise model building from being common practice, but don't let me catch you doing it in a context where inference about effect sizes and "which variables have how much explanatory power" is important (for purely predictive models a stepwise approach is sometimes more justifiable, if done carefully).
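If you want to convince yourself, here is a small self-contained simulation (nothing to do with the election data): with pure noise predictors, screening on p-values and then refitting produces "significant" coefficients far more often than the nominal 5% rate.

```r
set.seed(123)
sim_once <- function() {
  d <- as.data.frame(matrix(rnorm(100 * 10), ncol = 10))
  d$y <- rnorm(100)   # the response is unrelated to every predictor
  p_full <- summary(lm(y ~ ., data = d))$coefficients[-1, 4]
  keep <- names(p_full)[p_full < 0.25]   # knock out "non-significant" terms
  if (length(keep) == 0) return(numeric(0))
  # p-values of the surviving variables in the refitted model:
  summary(lm(reformulate(keep, response = "y"), data = d))$coefficients[-1, 4]
}
p_refit <- unlist(replicate(1000, sim_once(), simplify = FALSE))
mean(p_refit < 0.05)   # well above the nominal 0.05
```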
This analysis is incomplete because the nzelect project is incomplete. Most glaringly, two whole sets of individual-level census variables have been excluded simply because I didn't get round to including them in the original data package. This doesn't invalidate the findings, but it does add extra cautions and caveats. I'll fix that eventually, but it might be a few months away.
OK, that’s enough to go along with. Thanks for reading. Here’s the code that did that analysis:
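Here is a sketch of the data-preparation step, assuming the GE2014 results object and its VotingType column from nzelect: the opening lines down to group_by() are reconstructed assumptions, the pipeline from summarise() onwards follows the original, and the variable-name tidying at the end is a guess at the intended line.

```r
library(nzelect)
library(dplyr)

# Proportion of the combined Green + Labour party vote going to the Greens
# at each voting place, joined to location and 2013 census meshblock data.
# NB: the filter and group_by lines below are reconstructed assumptions.
greens <- GE2014 %>%
  filter(VotingType == "Party",
         Party %in% c("Green Party", "Labour Party")) %>%
  group_by(VotingPlace) %>%
  summarise(PropGreens = sum(Votes[Party == "Green Party"]) / sum(Votes),
            TotalGLVotes = sum(Votes)) %>%
  ungroup() %>%
  left_join(Locations2014, by = "VotingPlace") %>%
  left_join(Meshblocks2013, by = c("MB2014" = "MB")) %>%
  select(PropGreens, TotalGLVotes, WGS84Latitude, WGS84Longitude,
         MeanBedrooms2013:PropStudentAllowance2013)

# Tidy up names of variables. All the census data is from 2013 so we don't
# need to say so in each variable (a guess at the truncated original line):
names(greens) <- gsub("2013", "", names(greens))
```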