This post originally appeared on the MIT Election Data and Science Lab (MEDSL) Medium website, and may be found in its original form here.
Last year, the MIT Election Data and Science Lab (MEDSL) released one of its most extensive datasets, the 2016 Precinct-Level Returns. With information on the elections for President, U.S. House of Representatives, U.S. Senate, and State offices, it is one of the most comprehensive election datasets freely available to academics and practitioners, who put it right to work. As they have combed through it, though, these political scientists, geographers, and data scientists have repeatedly asked: Where is the matching shapefile? This is an important question: spatially referenced election results would allow analysts to more easily join and visualize disparate types of data, such as demographic or economic data, making them very useful for a variety of analyses.
It also seems like it should be an easy question to answer. In reality, though, joining precinct-level election results to spatial bounds is quite challenging; the question’s simplicity is deceptive, and analysts should be skeptical of those who claim to have done it single-handedly or very quickly.
Why don’t we just match precinct returns to VTDs?
For those who are unfamiliar with spatially referenced data, a quick primer before we dive into the details:
We’ll be focused on “voting tabulation districts” (VTDs), a set of open-source and fairly reliable shapefiles that can be obtained through the U.S. Census Bureau. VTDs come out of the Voting District Project, which is a program that allows states to send their small-area geography (wards, precincts, etc.) to the Census Bureau to be included in the Census Redistricting Data tabulations every ten years.
State and local shapefiles, unfortunately, are not nearly as readily available as other local information (precinct election returns, for example), and where they are available, they may not be centrally accessible. Further complicating all of this, the files aren’t produced to a uniform standard. The Census Bureau began the VTD program precisely to address this gap, aligning Census geography with local election geography, and providing the necessary links, via shapefiles, for analysis of this data.
Because they roughly correspond to election precincts, matching precinct-level election results directly to VTDs would seem to be an elegant solution to the shapefile question. Unfortunately, there are a number of problems with that approach. These problems arise largely because the purpose of the Voting District Project is to provide accurate demographic and spatial data to the state for redistricting, not to create a precinct map as its own end. While there is a significant amount of overlap between precincts and VTDs in terms of geometry, the devil lurks in the details in two major ways.
First, not all states participate fully in the VTD program. Kentucky, Montana, Oregon, and Rhode Island, for example, do not. California is officially listed as participating, but the state’s VTD files were not available on the Census Bureau website at the time of writing.
Second, and more importantly, matching VTD codes to official election returns presents particular challenges.
VTD names (identifiers) are usually reported by individual counties, often leading to bewildering differences in naming schemes, standardization, and data quality both between and within states. Frequently, these naming schemes do not correspond with the naming schemes used to report election returns. Even taking these challenges into account, though, we wanted to see what would happen if we attempted to match the 2016 precinct election returns with the most recent VTD definitions. Could we create a national shapefile and populate it with election results?
As part of a recent internal “hackathon,” Jacob Coblentz (a Senior Research Associate at MEDSL) and I attempted to create maps for Alabama and Florida using VTD geography and the MEDSL dataset for presidential election returns. We were inspired, in part, by the recent release of prototype VTD shapefiles, which were issued in preparation for the 2020 decennial census.
We chose to focus on these two states for their scale and for the fact that they appeared to have high-quality VTD identifiers. After a sizeable amount of initial data cleaning and address standardization, we employed exact matching as well as fuzzy matching through the RecordLinkage and fastLink packages in R.
The results were dismal. After more than 20 collective hours of work over two days, we had matched approximately 60% of Alabama precinct results to the correct geometry. For Florida, we’d managed to match a paltry 7%. Furthering our disappointment, we quickly realized that the matching results in Alabama appeared to be highly correlated with population density: smaller, denser precincts were more likely to be mismatched or left unmatched altogether.
Above: The results of our work on an Alabama map. We were able to match only about 60% of Alabama precinct results to the correct geometry; most of those matching results seemed to be highly correlated with population density.
The table here presents a broader look at a few other states we tried. While there were some positive outliers (Oklahoma, for instance), in general we found that we could not use this matching process for most states. Either we ended up missing a significant percentage of precincts, or we were unable to match anything at all (for example, if the identifiers didn’t match up).
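To make the exact-then-fuzzy matching idea concrete, here is a minimal sketch in Python. It is not the code we ran: our work used the RecordLinkage and fastLink packages in R, whereas this stand-in uses the standard-library difflib module. The function names, the normalization rules, and the 0.85 similarity cutoff are all assumptions chosen for illustration.

```python
import difflib
import re


def normalize(name: str) -> str:
    """Uppercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^A-Z0-9 ]", " ", name.upper())
    return " ".join(cleaned.split())


def match_precincts(returns_names, vtd_names, cutoff=0.85):
    """Map each returns name to a VTD name: exact match first, then fuzzy.

    Names with no candidate at or above `cutoff` (an assumed threshold)
    map to None.
    """
    vtd_by_norm = {normalize(v): v for v in vtd_names}
    candidates = list(vtd_by_norm)
    matches = {}
    for name in returns_names:
        key = normalize(name)
        if key in vtd_by_norm:
            # Exact match after normalization.
            matches[name] = vtd_by_norm[key]
            continue
        # Fall back to the single closest VTD name by string similarity.
        close = difflib.get_close_matches(key, candidates, n=1, cutoff=cutoff)
        matches[name] = vtd_by_norm[close[0]] if close else None
    return matches


def match_rate(matches):
    """Share of returns names that were assigned some VTD name."""
    if not matches:
        return 0.0
    return sum(v is not None for v in matches.values()) / len(matches)
```

A match rate computed this way only counts names that found a partner; it says nothing about whether the partner is correct. Short, numbered names such as "Precinct 12" and "Precinct 21" are very similar as strings, so fuzzy matches still need to be checked by hand or against a second source.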
What would it take to make a national precinct map?
At a basic level, making a comprehensive map of precinct returns would require a significant amount of time, money, and GIS expertise. The best example of an effort at this scale comes from the Precinct-Level Election Data Project, which was a collaboration between scholars at Harvard and Stanford. Their work to collect election returns and shapefiles took a team of 19 professors and graduate students over a year to complete. The team painstakingly joined election results with VTDs, spending a significant amount of time googling individual place names (e.g. “Addison Community Center” in Alabama) and cross-referencing the address by location within the VTD files. Their notes on Florida, in particular, provide a great example of the scale of this issue, as they detail how some counties had to be cross-referenced and entered by hand.
Making a map using the Harvard-Stanford project as a model is technically feasible. A team that wanted to pursue it would have to carry out three functions within each state:
First, you would have to try to match VTDs directly to election returns. As we found with Alabama, this is a tricky proposition. It gets trickier when we add in the fact that the team would also have to obtain those precinct results (itself an arduous process). Altogether, this step consumes resources quickly and is inefficient.
Next, you would need to find the addresses of missing precincts, and use GIS software to geolocate those addresses within existing VTDs. The availability of these addresses varies widely by state; to track them down would require outreach to individual counties.
Finally, missing spatial areas would have to be drawn by hand, most likely using ArcGIS. To do that, you would need to procure precinct map PDFs or other references from states or counties. Once you had those, an analyst would need to draw and enter the data by hand. While this would only apply in limited circumstances, it is highly resource-intensive and prone to error.
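The second step above, locating an address within an existing VTD, comes down to a point-in-polygon test once the address has been turned into coordinates. The sketch below shows the classic ray-casting version of that test in plain Python. It assumes simple planar coordinates and hypothetical district identifiers; in practice GIS software would handle this, along with map projections, multi-part polygons, and holes, none of which are covered here. Geocoding the address itself is not shown.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if the point (x, y) lies inside `polygon`.

    `polygon` is a list of (x, y) vertices in order; the last vertex is
    implicitly joined back to the first. Points lying exactly on an edge
    are not treated consistently by this simple version.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            # Count crossings strictly to the right of the point.
            if x < x_cross:
                inside = not inside
    return inside


def locate(x, y, districts):
    """Return the id of the first district containing the point, else None.

    `districts` maps a (hypothetical) district id to its vertex list.
    """
    for district_id, polygon in districts.items():
        if point_in_polygon(x, y, polygon):
            return district_id
    return None
```

With two adjacent square districts, `locate` assigns a point to whichever square contains it and returns None for a point outside both. An address that geocodes to the wrong side of a boundary ends up in the wrong district, which is one reason this step still needs manual review.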
But what about that national map that I saw on the internet?
While certainly not impossible, the creation of a nationwide precinct return shapefile would require a significant amount of labor. Each step, from obtaining the results to merging them with shapefiles, is time-consuming and sometimes arduous, even for a group with the staff, time, and other resources that the Harvard-Stanford team had at its disposal. As such, I would be extremely cautious about nationwide precinct maps that you might find floating around on the internet, especially ones that claim to be pieced together with limited manpower or on very short time-frames. Under those circumstances, the creation of these maps would have to be almost entirely automated using largely unofficial results, as official precinct-level results must be obtained from individual counties using (undoubtedly slower) emails and FOIA-type requests.
Moreover, our Alabama example demonstrates how errors in these maps would likely be disproportionately concentrated in highly populated areas, which are much harder to join because of the naming schemes in these counties. In that light, researchers should be cautious when using these sorts of proprietary files, and cross-reference against other data sources as much as possible.
On a more positive note, however, there are a few excellent examples of smaller maps that utilize high-quality spatial and election data. Examples of these include the LA Times map of the 2016 election results in California, and Nathaniel Kelso’s crowd-sourced map project, both of which provide good documentation on their process as well as the difficulties they encountered in creating these maps.
Mapping the future
The creation of a national precinct-level shapefile paired with official results would be both difficult and resource-intensive. One potential development that may make this process easier would be the adoption of the Election Results Common Data Format Specification, a type of common data format (CDF). This would standardize pre-election and post-election data, making it far easier to match precinct identifiers from election-night reporting and shapefiles. More information on the CDF can be found from the National Institute of Standards and Technology.
We know that others are interested in creating national shapefiles, and we wish them luck. For now, MEDSL is focusing on the task of collecting, normalizing, and cleaning precinct election returns. It is certainly not out of the realm of possibility for those returns to be joined and used to create a national shapefile, if an intrepid researcher reading this has the inclination to do so. For some states, combining those returns with shapefiles will be a piece of cake; for others, it will be a significant time sink.
In the meantime, let us all exercise the general good data practice of healthy skepticism, and always cross-reference questionable data with reliable sources such as the Harvard Election Data Archive, MEDSL, or official election websites.