Red State/Blue State: Geographic Clustering

ArcGIS includes a large suite of geostatistical analysis tools, ranging from simple interpolation and cluster analyses to sophisticated kriging and geographically weighted regression methods. This exercise has you explore several of them using a fairly small dataset (US states; N = 50) and a large dataset (counties in the lower 48 states; N = 3107). ESRI's help files include a lot of useful background information on the various geostatistical tools. I encourage you to take time to explore these resources. All of the geostatistics modules are included in your student version license.


Part 1: For many years the Tax Foundation compared total Federal taxes paid by state (from the IRS's annual Data Book) with total Federal spending on each state (from the Census's Consolidated Federal Funds Report). Their most recent data, from 2005, reveal wide discrepancies across the 50 states. At one extreme, "winners" such as New Mexico, Mississippi, West Virginia and Alaska received about $2 in federal spending for each $1 they sent to Washington in taxes. At the other extreme, "losers" such as New Jersey, Nevada, Connecticut, New Hampshire and Minnesota got less than 75 cents in federal spending for each $1 they sent to Washington. Similar discrepancies exist today, although the enormous growth in deficit spending means that every state now receives more in federal spending than it sends to Washington in Federal taxes.

For at least the last 20 years, states voting "red" (Republican) in presidential elections generally had larger federal spending-to-tax ratios--and contributed more to the federal deficit--than states voting "blue" (Democratic). But is this relationship statistically significant?

First, try a quick and dirty analysis using 2008 election data. Use Excel's Chi-square utility to compare the 2x2 table of actual counts of Winner-McCain, Winner-Obama, Loser-McCain and Loser-Obama states against a second table of expected counts, i.e., the counts you would see if Obama and McCain had each won fiscal winners and losers in the same proportions.
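If you want to check your spreadsheet result outside Excel, the same 2x2 comparison can be sketched in a few lines of Python. This is only an illustration: the counts below are placeholders I made up so the code runs, not the real 2008 tally, and the function name chi_square_2x2 is my own. The expected table is built from the row and column totals, which is the "same proportions" table described above.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count under independence = row total * column total / grand total
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (table[i][j] - expected[i][j]) ** 2 / expected[i][j]
        for i in range(2)
        for j in range(2)
    )
    # A 2x2 table has 1 degree of freedom, so the upper-tail p-value
    # is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value, expected

# PLACEHOLDER counts for illustration only; substitute your own tally.
# Rows: fiscal winners, fiscal losers. Columns: McCain, Obama.
observed = [[18, 14],
            [4, 14]]

chi2, p, expected = chi_square_2x2(observed)
print("expected counts:", expected)
print("chi-square = %.3f, p = %.4f" % (chi2, p))
```

No continuity correction is applied here, so the p-value should be comparable to what Excel reports when you hand it the actual and expected ranges.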



Now download my compilation of Tax Foundation and election data and use Excel's linear regression utility (in the Data Analysis tools) to test this correlation using data from other years. Rather than use a binary red/blue variable, you can use the popular vote data to calculate the natural logarithm of the ratio of Democratic votes to Republican votes: ln(D/R). This variable represents the degree to which a state is red or blue.
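The log-ratio variable and the regression can also be sketched in Python as a cross-check on Excel. The rows below are hypothetical numbers, not values from my spreadsheet, and I have assumed that the spending-to-tax ratio is the dependent variable and ln(D/R) the predictor; swap them if you set up your test the other way. The helper names are my own.

```python
import math

def log_vote_ratio(dem_votes, rep_votes):
    """ln(D/R): positive for bluer states, negative for redder ones."""
    return math.log(dem_votes / rep_votes)

def simple_ols(x, y):
    """Least-squares fit of y = intercept + slope * x, with a t statistic for the slope."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    # Standard error of the slope uses n - 2 residual degrees of freedom
    se_slope = math.sqrt(sse / (n - 2) / sxx)
    return {
        "slope": slope,
        "intercept": intercept,
        "r_squared": 1 - sse / sst,
        "t_slope": slope / se_slope,
    }

# HYPOTHETICAL rows: (Democratic votes, Republican votes, spending-to-tax ratio)
states = [
    (1_200_000, 800_000, 0.85),
    (900_000, 1_000_000, 1.20),
    (500_000, 750_000, 1.60),
    (2_000_000, 1_900_000, 1.05),
    (650_000, 600_000, 0.95),
    (300_000, 450_000, 1.45),
]
x = [log_vote_ratio(d, r) for d, r, _ in states]
y = [ratio for _, _, ratio in states]
fit = simple_ols(x, y)
print(fit)
```

A negative slope would mean that redder states (lower ln(D/R)) tend to have higher spending-to-tax ratios; compare the t statistic against the one in Excel's regression output.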

My original hypothesis was that the incumbent majority in Washington would give relatively more federal money to states that were politically aligned with them (and maybe swing states that might be "bribed" to vote for them), and relatively less federal money to states aligned with the out-of-power party. But these federal spending and tax patterns don't change much in response to changing political majorities in Washington.


Part 2: I obtained more recent data from the IRS and CFFR and calculated similar spending-to-tax ratios by state for 2010. This spreadsheet also contains log ratios of the popular votes by state for the 2004, 2008 and 2012 presidential elections. (Census has not compiled a CFFR since 2010, because Congress decided to stop wasting our tax dollars reporting on how they're, uh...wasting our tax dollars.)

As an advanced variant of the simple red state/blue state exercise, use Arc's geostatistical tools to test for more recent spatial correlations between states' federal spending-to-tax ratios and political orientations. The red state/blue state political divide exhibits significant geographic clustering, which reflects differences in regional media markets, degrees of urbanization, etc. You can also test the significance of other factors, such as differing median educational attainment, rates of church attendance, median household incomes, etc. You are encouraged to incorporate any data you like into this analysis.

  1. First, join the Excel data to your States layer, and try out some of Arc's Spatial Statistics Tools, analyzing the spatial clustering of red and blue states, and winners and losers. Try the High/Low Clustering (Getis-Ord General G), Spatial Autocorrelation (Global Moran's I), Cluster and Outlier Analysis (Anselin Local Moran's I) and Hot Spot Analysis (Getis-Ord Gi*) utilities. (Review the documentation on each of these tools first.)

  2. What other factors might explain the regionalization of American politics, e.g., education levels, income levels, age distribution, degree of urbanization, immigration rates, racial composition, etc.? Find and join some additional explanatory variables to your States layer. Review the documentation on ArcGIS's Ordinary Least Squares (OLS) regression utility and use this procedure to model and test the relationship between political orientation and these other variables in the lower 48 states, omitting DC (not a state), Alaska and Hawaii.

  3. One of the coolest tools in ArcGIS is the Geographically-Weighted Regression utility (see the example in Part 3). While conventional regression procedures yield single coefficient point estimates and significance tests, this utility yields coefficient estimates that vary across space. Estimate a geographically-weighted regression (GWR) model using the same variables that you used in the OLS model. Save and explain some of the more important coefficient maps.

  4. The standard OLS model is based on the assumption that the data are independent and identically-distributed (IID), so factors like location and local densities of data aren't supposed to matter. But "closer things are more closely related" implies spatial autocorrelation, which can be viewed as an information redundancy problem: sampling an additional datapoint in a cluster of similar datapoints inflates your nominal sample size and the reported significance levels of your regression coefficients. One solution to this problem is to use a spatially-weighted regression, where datapoints in densely-sampled regions are given lower weights than datapoints in sparsely-sampled regions. (NOTE: spatial weight matrices should always be calculated from projected, not geographic, data!)

    Review the documentation on ArcGIS's spatial weights matrix utility so that you understand how the weights are created. Then create a spatial weights matrix for the 48 states in the continental US, and re-estimate the GWR model incorporating these weights.

Summarize and compare the results of the OLS, GWR and spatially-weighted GWR models.
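To make the ideas of a spatial weights matrix and spatial autocorrelation concrete, here is a minimal Python sketch of a row-standardized inverse-distance weights matrix and the global Moran's I statistic computed from it. This is not ArcGIS's implementation; the function names, the toy coordinates and the toy values are all assumptions of mine, and the coordinates are taken to be projected (planar) x, y pairs with no two points coincident.

```python
import math

def inverse_distance_weights(points):
    """Row-standardized inverse-distance weights from projected (x, y) coordinates."""
    n = len(points)
    weights = []
    for i in range(n):
        row = [
            0.0 if i == j else 1.0 / math.dist(points[i], points[j])
            for j in range(n)
        ]
        total = sum(row)
        weights.append([w / total for w in row])  # each row sums to 1
    return weights

def morans_i(values, weights):
    """Global Moran's I for a list of values and a square weights matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)
    cross = sum(
        weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n)
    )
    return (n / s0) * cross / sum(d * d for d in dev)

# TOY data: six locations along a line, one unit apart
points = [(float(k), 0.0) for k in range(6)]
w = inverse_distance_weights(points)

clustered = [1, 1, 1, 5, 5, 5]    # similar values sit next to each other
alternating = [1, 5, 1, 5, 1, 5]  # neighbors are dissimilar

print("clustered:   I = %.3f" % morans_i(clustered, w))
print("alternating: I = %.3f" % morans_i(alternating, w))
```

The clustered pattern gives a positive index and the alternating pattern a negative one, which is the same sign convention you will see in the Spatial Autocorrelation tool's report.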


Part 3: Try doing a fine-grained analysis of the red-blue divide at the county level using the same tools. Download the county-level 2008 vote results Excel file and join it to the US counties shapefile. Calculate the same red-blue log-ratio variable for counties that you calculated for the states. Your basic objective here is to use ArcGIS's geostatistics tools to visualize, analyze and hopefully explain the socioeconomic determinants and spatial clustering patterns behind the red-county/blue-county divide.

I crunched the CFFR data for 2009 by object code, agency and program; and at various levels of geography: state, county and Congressional District. Unfortunately, the most recent county-level data on federal tax burdens that I could find is for 2004; it's on the Tax Foundation's website.

The counties attribute table included with the US counties shapefile includes lots of other potential predictors of red-county/blue-county. You will notice that urban counties tend to vote Democratic while rural counties tend to vote Republican. There are two Rural-Urban Continuum Code fields for 1993 and 2003 in the counties shapefile's attribute table. RURURBCC03 may be a good predictor of political orientation.

Here's the output map from a Hot Spot Analysis of the log-ratio of Obama to McCain votes, by county, which represents core areas of red-state and blue-state voting strengths.

The following maps are output from a trial geographically-weighted regression of the log ratio of the 2008 popular vote by county, LNVRATIO = ln[Obama/McCain], against median household income (MEDHHINC) and percentage of population that graduated from college (PCTCOLLG):

LNVRATIO = β0 + β1 × MEDHHINC + β2 × PCTCOLLG + ε
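The core idea behind a geographically-weighted regression is that a separate, distance-weighted least-squares fit is made at every location, so each location gets its own coefficients. The Python sketch below illustrates that idea only; it is not how ArcGIS implements the tool. To keep it short I assume a single predictor instead of the two in the model above, a Gaussian kernel and a fixed bandwidth, and all data are toy values in which the slope is +1 in the "west" and -1 in the "east".

```python
import math

def gwr_single_predictor(coords, x, y, bandwidth):
    """Fit y = b0 + b1 * x at every location, weighting observations by a
    Gaussian kernel of their distance from that location."""
    local_coefficients = []
    for here in coords:
        w = [
            math.exp(-0.5 * (math.dist(here, there) / bandwidth) ** 2)
            for there in coords
        ]
        sw = sum(w)
        mean_x = sum(wi * xi for wi, xi in zip(w, x)) / sw
        mean_y = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
        sxy = sum(
            wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y)
        )
        b1 = sxy / sxx
        b0 = mean_y - b1 * mean_x
        local_coefficients.append((b0, b1))
    return local_coefficients

# TOY data: two well-separated groups of four locations each
coords = [(0, 0), (1, 0), (2, 0), (3, 0), (100, 0), (101, 0), (102, 0), (103, 0)]
x = [1, 2, 3, 4, 1, 2, 3, 4]
y = [1, 2, 3, 4, -1, -2, -3, -4]  # y = x in the west, y = -x in the east

for location, (b0, b1) in zip(coords, gwr_single_predictor(coords, x, y, bandwidth=5.0)):
    print(location, "intercept = %.3f, slope = %.3f" % (b0, b1))
```

A single global regression on these toy data would report a slope near zero, while the local fits recover the two opposite slopes. Mapping the local slopes is what produces coefficient maps like the ones shown in this part.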

The income coefficient map is generally positive, with negative clusters along the upper Mississippi and west coast.

I normalized the income coefficient map by the corresponding coefficient standard error map to obtain a t-test map of coefficient significance with break values of -1.96 and +1.96.
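The arithmetic behind the t-test map is simply each county's coefficient divided by its standard error, classified at the two break values. A minimal Python sketch follows; the function name and the three sample counties are hypothetical.

```python
def classify_coefficient(coefficient, std_error, critical=1.96):
    """Return the t value and a significance class using breaks at -critical and +critical."""
    t = coefficient / std_error
    if t <= -critical:
        label = "significantly negative"
    elif t >= critical:
        label = "significantly positive"
    else:
        label = "not significant"
    return t, label

# HYPOTHETICAL (coefficient, standard error) pairs for three counties
samples = [(0.00004, 0.00001), (-0.00003, 0.00001), (0.00001, 0.00002)]
for coefficient, std_error in samples:
    t, label = classify_coefficient(coefficient, std_error)
    print("t = %6.2f  %s" % (t, label))
```

The value 1.96 is the two-tailed 5% critical value of the standard normal distribution, which is a reasonable approximation with this many counties.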

I then superimposed the coefficient map with 50% transparency on top of the t-test map to identify the clusters of counties with significantly positive or negative income coefficients.

Here are equivalent composite sign/significance maps for the higher education and intercept coefficients and the normalized residuals:



The spatial distribution of residuals appears to exhibit some clustering (positive spatial autocorrelation), which would imply the estimated coefficient significances are overstated. I used the Spatial Autocorrelation (Moran's I) tool to obtain a calculated Moran's index of 0.187319 with a variance of only 0.000029, versus an expected index of -0.000322. The null hypothesis of no spatial autocorrelation is strongly rejected.
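You can see why the rejection is so strong by converting those three numbers into a z-score. The short Python check below uses only the values reported above plus the county count N = 3107; the expected index under the null hypothesis is -1/(N - 1). Because the variance is rounded to six decimals, the z-score is approximate, and the normal approximation for the p-value is an assumption on my part.

```python
import math

n_counties = 3107
observed_i = 0.187319   # Moran's index reported by the tool
variance_i = 0.000029   # reported variance (rounded)

# Expected index under the null hypothesis of no spatial autocorrelation
expected_i = -1.0 / (n_counties - 1)

z_score = (observed_i - expected_i) / math.sqrt(variance_i)
# Two-tailed p-value under a normal approximation
p_value = math.erfc(abs(z_score) / math.sqrt(2))

print("expected I = %.6f" % expected_i)
print("z-score    = %.1f" % z_score)
print("p-value    = %.3g" % p_value)
```

The z-score comes out at roughly 35 standard deviations above the expected index, far beyond the 1.96 threshold.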

The appropriate correction for autocorrelation in the residuals would be to construct a weight matrix based on inverse distances between county centroids, so that nearby (and more highly-correlated) counties carry relatively less weight in the regression procedure.


Part 4:  A cartogram is a map in which the features themselves are scaled according to some relevant variable, with necessary distortions of shape required to fit the features together.  I downloaded a cartogram tool written by Tom Gross from the ESRI website, installed it and used it to create the maps below.

Here's a cartogram of US states scaled by their 2008 Electoral College votes, and thematized to show the 2008 popular votes (the logarithm of the ratio of Obama votes to McCain votes).

This map shows a truer balance of blue and red than the more conventional red-blue maps which exaggerate the red states with large areas but small populations. 

Here's another cartogram of counties scaled by their 2010 populations, and thematized to show the county-level popular vote.  I dissolved the county polygons by state to create the state polygons.  Note the more severe distortions of these state boundaries. 


Again, this map shows a lot more balance between blue and red than the more conventional county vote map shown above, which is dominated by red counties with large land areas but low population densities. 

Since these are basically maps with uniform densities, they may have nicer sampling properties for geostatistical analysis than a conventional map!

Try installing this cartogram tool on your own computer, and create a cartogram of states sized by their 2012 electoral votes, showing the 2012 election results. 

Here's a cluster analysis of the cartogram, created with the Getis-Ord Hot Spot Analysis geostatistical tool: