6  Models

By the end of this chapter you should have gained knowledge of how regression frameworks can be used to quantify and explore variation and co-variation in a dataset, and practical skills in specifying, summarising and visually evaluating regression models in R.

6.1 Introduction

So far the analysis presented in this book has been data-driven. Having described data in a consistent way (Chapter 2), we have applied visual analysis approaches to expose structure in datasets, informed by established visualization guidelines. Chapters 4 and 5 involved model building, but these were largely value-free models based on limited prior theory. This chapter is presented as a worked data analysis: we attend to a well-known dataset with a more explicit and theoretically-informed motivation.

The chapter explores variation in voting behaviour in the UK’s 2016 referendum on leaving the EU. You might remember that whilst there was a slight majority for Leave (c. 52%), the vote varied between different parts of the country. There were many theories and explanations offered for why particular places voted the way they did, often related to the demographic composition of those areas. We will explore whether the sorts of compositional demographic factors discussed vary systematically with area-level Leave voting. Using regression frameworks, we will model the relative effect of each of these compositional factors in structuring variation in the vote and construct data graphics that allow these models and parameters to be considered and evaluated in detail.

Note

This chapter assumes some basic familiarity with linear regression modelling. For a clear and detailed overview, with excellent social science examples, you may wish to consult Bartholomew et al. (2008).

6.2 Concepts

6.2.1 Quantifying and exploring variation

In Figure 6.1 are a map and bar chart displaying total variation in vote shares for Leave in Great Britain (GB), estimated at Parliamentary Constituency level (see Hanretty 2017). The values themselves are the difference in estimated vote shares from an expectation that the Leave vote by constituency, y_{i}, our outcome of interest, is uniformly distributed across the country and so equivalent to the overall vote share for Leave of c. 52%. Although a slightly contrived formulation, we could express this as an intercept-only linear regression model, where the estimated slope (\beta_{1}) is ‘turned off’ (takes the value 0) and the intercept (\beta_{0}) is the GB average vote share for Leave (\bar{y}):

\begin{align*} y_{i}= \beta_{0} + \beta_{1}x_{i1} + \varepsilon_{i} \end{align*}

So we estimate the Leave vote in each constituency (y_{i}) as a function of:

  • \beta_{0}, the intercept, the GB average vote share (\bar{y}) +
  • \beta_{1}=0, the slope, ‘turned off’ so that the \beta_{1}x_{i1} term drops out of the model, +
  • \varepsilon_{i}, a statistical error term capturing the difference between y_{i}, the observed Leave vote in a constituency, and the unobservable ‘true population’ value of the Leave vote in each constituency

How does this relate to the idea of characterising variation? The length and colour of each bar in Figure 6.1 are scaled according to the model residuals: the difference between y_{i}, the observed value, and the expected value of the Leave vote under the uniform model. The sum of these bar lengths is therefore the total variation that we later try to account for by updating our regression model to generate new expected values using information on the demographic composition of constituencies.
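
To make this concrete, the uniform model can be fit directly with lm(). A minimal sketch, assuming a cons_data data frame with a leave column recording constituency vote shares for Leave:

# Intercept-only (uniform) model: the fitted value for every
# constituency is simply the GB average Leave share.
uniform_model <- lm(leave ~ 1, data=cons_data)
# Residuals: signed differences between the observed Leave vote and
# this expectation, the quantities encoded by the bars in Figure 6.1.
uniform_resids <- residuals(uniform_model)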

Figure 6.1: Residuals from uniform model comparing constituency Leave vote to GB average.

Figure 6.1 is similar to the maps that were published widely in press reports in the aftermath of the vote, and demonstrates that there is indeed substantial variation in Leave voting between different parts of the country. The intercept-only model consistently overestimates the Leave vote in Scotland and most of London. Outside of these areas, constituencies voting for Leave in smaller proportions than expected are distributed more discretely around the country: the dark red dot with surrounding red area in the east of England is Cambridge and Cambridgeshire; constituencies in Bristol (south west), Manchester and Liverpool (north west) and Brighton (south) also appear in reasonably strong red.

When evaluating the effectiveness of modelled values, there are various checks that can be performed. An obvious one here is whether there is bias in the residuals – whether residuals have structure that suggests they are grouped in a way not captured by the model. Given the motivation behind our analysis it is no surprise that there is a geographic pattern to the residuals in Figure 6.1, but the non-symmetrical shape of the ‘signed’ bars in the left of the graphic is also instructive. There are more constituencies with positive values than negative (the Leave vote is underestimated by the uniform model for 57% of constituencies), and some constituencies have quite large negative values. The strongest vote for Leave was Boston and Skegness with 76% for Leave, but the strongest for Remain was Hackney North and Stoke Newington with 80% for Remain.
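
The 57% figure can be recovered in a single line. A sketch, again assuming cons_data with a leave column, and taking the unweighted constituency mean as the model expectation:

# Proportion of constituencies whose observed Leave vote exceeds the
# expected value under the uniform model.
mean(cons_data$leave > mean(cons_data$leave))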

6.2.2 Quantifying and exploring co-variation

More interesting still is whether the pattern of variation in Figure 6.1 is correlated with compositional factors that we think explain this variation; and also whether bias or structure in residuals exists even after accounting for these compositional factors.

In Table 6.1 is a list of possible explanatory variables describing the demographic composition of constituencies. Each variable is expressed as a proportion of the constituency’s population. So the degree-educated variable describes the proportion of residents in the constituency educated at least to degree-level. Comparison across these variables is challenging because their ranges differ: the EU-born variable ranges from 0.6% to 17%; the white variable from 14% to 98%. Common practice for addressing this sort of range problem is to z-score transform the variables so that each variable is expressed in standard deviation units from its mean.
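
Expressed in equation form, the z-score transform for explanatory variable k measured at constituency i is:

\begin{align*} z_{ik} = \frac{x_{ik} - \bar{x}_{k}}{s_{k}} \end{align*}

where \bar{x}_{k} and s_{k} are the mean and standard deviation of variable k calculated across all constituencies.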

Table 6.1: Breakdown of variable types

Census variable              Constituency %

post-industrial / knowledge economy
  degree-educated            with degrees
  professional occupations   ns-sec manager/professional
  younger adults             adults aged <44
  heavy industry             manufacturing and transport

diversity / values / outcomes
  not good health            reported fair, bad, very bad
  white                      ethnicity white British/Irish
  Christian                  Christian
  EU-born                    EU-born (not UK)

metropolitan / 'big city'
  own home                   own home
  no car                     don't own a car

Figure 6.2 presents scatterplots from which the extent of linear association between these demographics and Leave voting in each constituency can be inferred. Each dot is a constituency, arranged on the x-axis according to the value of each explanatory variable and the y-axis according to the share of Leave vote. The scatterplots are faceted by explanatory variable and ordered left-to-right and top-to-bottom according to correlation coefficient. The variable most strongly correlated with Leave voting is that measuring levels of degree education: as the share of a constituency’s population educated at least to degree-level increases, the share of Leave vote in that constituency decreases. An association in the same direction, but to a lesser extent, is observed for variables representing similar concepts (professional occupations, younger adults, EU-born, no-car); the reverse holds for Christian, not-good health and heavy industry.

Figure 6.2: Scatterplots of constituency Leave vote against selected explanatory variables.

It is of course possible, and likely, that there is co-variation between many of the compositional characteristics of constituencies as well as with Leave voting. Parallel coordinate plots may help visually explore this multivariate space. Whereas in scatterplots observations are represented as points located on orthogonal x- and y-axes, in a parallel coordinate plot axes are parallel and the values of each observation are encoded via a line connecting the multiple parallel axes. In Figure 6.3 each of the thin lines is a constituency coloured according to whether the constituency voted majority Remain (red) or Leave (blue). The first variable encoded is the size of the Leave vote and variables are then ordered according to linear association with Leave. Note that we have reversed the polarity of variables such as degree-educated and professional so that we expect more Leave (blue lines) towards the right locations of the parallel axes. That the blue and red lines are reasonably separated suggests that there is co-variation across many of the demographic characteristics in constituencies voting differently for Leave and Remain.

Figure 6.3: Parallel coordinate plot of constituency Leave vote and selected explanatory variables.
Note

Although parallel coordinate plots enable some aspects of association between multiple variables to be inferred, they have several deficiencies. Association can only really be directly inferred by comparing variables that are immediately adjacent. The order of parallel variables can affect the visual appearance and patterns of plots. And relatedly, visual patterns of the plot that are salient may be incidental to the phenomena being investigated.

6.2.3 Modelling for co-variation

Linear regression provides a framework for systematically describing the associations implied by the scatterplots and parallel coordinate plot, with respect to the constituency-level variation identified in Figure 6.1. Having seen these data – the demographic variables in Figure 6.2 – we can derive new expected values of constituency-level Leave voting.

To express this in equation form, we update the uniform model such that Leave vote is a function of the selected explanatory variables. For single-variable linear regression, we might select the proportion of residents educated at least to degree-level ({d_{i1}}):

\begin{align*} y_{i}&= \beta_{0} + \beta_{1}{d_{i1}} + \varepsilon_{i} \\ \end{align*}

So we now estimate the Leave vote in each constituency (y_{i}) as a function of:

  • \beta_{0}, the intercept, the GB average vote share (\bar{y}) +
  • \beta_{1}, the slope on {d_{i1}}, indicating in which direction and to what extent degree-educated is associated with Leave, +
  • \varepsilon_{i}, the difference between y_{i} (the observed value) and the unobservable ‘true population’ value of the Leave vote in that constituency (statistical error)

It is of course likely that some demographic variables account for different elements of the variation in the Leave vote than others. You will be aware that the linear regression model can be extended to include many explanatory variables:

\begin{align*} y_{i}&= \beta_{0} +\beta_{1}x_{i1} + ... + \beta_{k}x_{ik} + \varepsilon_{i} \\ \end{align*}

So this results in separate \beta_{k} coefficients for separate explanatory variables. These coefficients can be interpreted as the degree of association between the explanatory variable k and the outcome variable, keeping all the other explanatory variables constant – or the distinct correlation between an explanatory variable k and the outcome variable, net of the other variables included in the model.
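
To illustrate this ‘net of’ interpretation, compare the coefficient on degree-educated estimated alone with its coefficient once correlated demographics are added. A sketch, where cons_data_z is a hypothetical staging of cons_data with explanatory variables already z-score transformed:

# Coefficient on degree-educated, ignoring other demographics.
coef(lm(leave ~ degree, data=cons_data_z))["degree"]
# Coefficient on degree-educated net of no-car and white: the
# estimate changes once correlated variables are held constant.
coef(lm(leave ~ degree + no_car + white, data=cons_data_z))["degree"]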

Figure 6.4: Outputs from multiple regression model of Leave vote by demographic composition of constituency

In Figure 6.4 are regression coefficients (\beta_{k}) from a multiple regression model with degree-educated, no car, white, heavy industry, EU-born and not good health selected as explanatory variables. Coefficients are reported as dots with estimates of uncertainty represented as lines encoding 95% confidence intervals. Most variables’ coefficients are in the direction that would be expected given the associations in Figure 6.2. Net of variation in the other compositional factors, increased levels of degree education in a constituency have the effect of reducing the Leave vote. The two exceptions are EU-born and white: after controlling for variation in the other demographic variables, increased proportions of white residents reduce the Leave vote and increased proportions of residents that are EU-born increase the Leave vote. Since the confidence interval for white crosses zero, this coefficient is subject to much uncertainty. Further exploration may allow us to identify whether these counter-intuitive effects are genuine or the result of a poorly-specified model.

6.2.4 Evaluating model bias

Our analysis becomes more interesting when we start to explore and characterise model bias: any underlying structure to the observations that is less well accounted for by the model.

For area-level regression models such as ours, it is usual for residuals to exhibit some spatial autocorrelation structure. For certain parts of a country a model will overestimate an outcome given the relationship implied between explanatory and outcome variables; for other parts the outcome will be underestimated. This might occur due to:

  • Spatial dependence in variable values over space. We know that the geography of GB is quite socially distinctive, so it is reasonable to expect, for example, the range in variables like heavy industry and white to be bounded to economic regions and metropolitan-versus-peripheral regional contexts.
  • Spatial nonstationarity in processes over space. It is possible that associations between variables might be grouped over space – that the associations vary for different parts of the country. For example, high levels of EU-born migration might affect political attitudes, and thus area-level voting, differently in different parts of a country.

We can test for and characterise spatial autocorrelation by performing a graphical inference test, a map line-up (Beecham et al. 2017; Wickham et al. 2010), against a null hypothesis of complete spatial randomness (CSR) in residuals. Here a plot of the real data, the true map of residuals, is hidden amongst a set of decoys, in this case maps with the residual values randomly permuted around constituencies. If the real map can be correctly identified from the decoys, then this lends statistical credibility to the claim that the observed data are not consistent with the null of CSR. Graphical line-up tests have been used in various domains, including to test regression assumptions (Loy, Hofmann, and Cook 2017). The map line-up in Figure 6.5 demonstrates that there is very obvious spatial and regional autocorrelation in residuals, and therefore structure that our regression model misses.

Figure 6.5: Map line-up of residuals in which the ‘real’ dataset is presented alongside 8 decoy plots generated under assumption of CSR.

There are different ways of updating our model according to this geographic context. We have talked about patterning in residuals as being spatial, with values varying smoothly and continuously depending on location. This might be the case, but given the phenomena we are studying, it is also plausible that distinct contexts are linked to regions. The residuals in Figure 6.5 (the real being plot 3) do seem to be grouped by regional boundaries; Scotland in particular looks categorically different. This suggests that geographic context might be usefully represented as a category rather than a continuous variable (location in x,y). We will therefore update our model representing geographic context as a regional grouping and cover approaches both to modelling spatial dependence in values and spatial non-stationarity in processes.

6.2.5 Geographic context as grouped nuisance term

A common approach to treating geographic dependence in the values of variables is to model geographic context as a Fixed Effect (FE). A dummy variable is created for each group (region in our case), and every region receives a constant. Any group-level sources of variation in the outcome are collapsed into the FE variable, which means that regression coefficients are not complicated by this more messy variation – they now capture the association between demographics and Leave after adjusting for systematic differences in the Leave vote due to region. So, for example, we know that Scotland is politically different from the rest of GB and that this appears to drag down the observed Leave vote for its constituencies. The constant term on region adjusts for this and prevents the estimated regression coefficients (inferred associations between variables) from being affected. Also estimated via the constant is the ‘base level’ in the outcome for each element of the group – e.g. net of demographic composition, the expected Leave vote in a particular region.
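
A quick way of inspecting the dummy coding this implies is with model.matrix(), which expands a factor into one indicator column per level. A minimal sketch, assuming cons_data contains a region column:

# Expand region into dummy (indicator) variables, one column per
# region; `-1` keeps a column for every region rather than holding
# one back as a reference.
head(model.matrix(~ factor(region) - 1, data=cons_data))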

The linear regression model, extended with the FE term ({\gamma_{j}}), for a single explanatory variable:

\begin{align*} y_{i}&= {\gamma_{j}} + \beta_{1}x_{i1} + \varepsilon_{i} \\ \end{align*}

So we now estimate the Leave vote in each constituency (y_{i}) as a function of:

  • {\gamma_{j}}, a constant term similar to an intercept for region j, +
  • \beta_{1}, the slope on x_{i1}, indicating in which direction and to what extent some explanatory variable measured at constituency i is associated with Leave, +
  • \varepsilon_{i}, the difference between y_{i} (the observed value) at constituency i and the unobservable true population value of the Leave vote in that constituency (statistical error)

Presented in Figure 6.6 are updated regression coefficients for a multivariate model fit with a FE on region. In the left panel are the FE constants. Together these capture the variance in Leave vote between regions after accounting for demographic composition. These coefficients have useful properties: they are the estimated size of the Leave vote for a constituency in a region net of demographic composition. London is interesting here. When initially analysing variation in the vote, constituencies in Scotland and London were distinctive in voting in much smaller proportions than the rest of the country for Leave. Given the associations we observe between Leave voting and demographic composition, however, if we were to randomly sample two constituencies that contain the same demographic characteristics, one in London and one in another region (say North West), on average we would expect the Leave vote in the London constituency to be higher (~60%) than that sampled from North West (~51%). A separate, and more anticipated, pattern is that Scotland would have a lower Leave vote (~38%): net of demographics there is some additional context in Scotland that means Leave is lower than in other regions.

In the right panel are the regression coefficients net of this between-region variation. Previously the white variable had a slight, counterintuitive negative association with Leave. Now the white variable has a direction of effect that conforms to expectation: net of variation in other demographics, increased proportions of white residents are associated with increased Leave voting. For the other variable, EU-born, the coefficient still unexpectedly suggests a positive association with Leave.

Figure 6.6: Output from multiple regression model of Leave vote by demographic composition of constituency with FE on region

6.2.6 Geographic context as grouped effects

Rather than simply allowing a constant term to vary, we can update the linear regression model with an interaction term ({\beta_{1j}}{x_{i1}}) that allows the coefficient estimates to vary depending on region. This means we get a separate constant term and coefficient estimate of the effect of each variable on Leave for every region.

\begin{align*} y_{i}&= {\gamma_{j}} + {\beta_{1j}}x_{i1} + \varepsilon_{i} \\ \end{align*}

So we now estimate the Leave vote in each constituency (y_{i}) as a function of:

  • {\gamma_{j}}, a constant term similar to an intercept for region j, +
  • {\beta_{1j}}x_{i1}, the region-specific slope, indicating in which direction and to what extent some demographic variable at constituency i and in region j is associated with Leave, +
  • \varepsilon_{i}, the difference between y_{i} (the observed value) at constituency i and the unobservable true ‘population’ value of the Leave vote in that constituency (statistical error)

In Figure 6.7 are region-specific coefficients derived from a multivariate model with an interaction term introduced on region. In each region, degree-educated has a negative coefficient with reasonably tight uncertainty estimates, or at least CIs that do not cross 0. The other variables are subject to more uncertainty. The no-car variable, which we thought may separate metropolitan versus peripheral contexts, is also negatively associated with Leave, but the strength of negative association, after controlling for variation in other demographic factors, does vary by region. The heavy industry variable, previously identified as being strongly associated with Leave, has a clear positive association only for London and to a much lesser extent for North West and Wales (small coefficients). The EU-born variable is again the least consistent, as it flips between positive and negative association when analysed at the regional level: after controlling for variation in other demographic characteristics it is positively associated with Leave for North West, Scotland and South West, but negatively associated with Leave for the North East (though with coefficients that are subject to much variation).

Figure 6.7: Output from multiple regression model of Leave vote by demographic composition of constituency with FE and interaction on region.

6.2.7 Estimate volatility and alternative modelling approaches

Our treatment of regression frameworks in this chapter has been reasonably breezy; there are problems that we have not discussed. Introducing FE and interaction terms without adding data, for example, reduces statistical power as data are heavily partitioned. Given that our data are hierarchically structured (constituencies sit within regions), hierarchical or multi-level modelling may be more appropriate for modelling this sort of regional grouping. Multi-level modelling uses partial pooling, borrowing data to make estimated coefficients more conservative, less locally biased, where there are comparatively few observations in particular groupings (see Gelman and Hill 2006). There are also many ways in which associations between values can be modelled continuously over space. Geographically weighted regression (GWR) (Brunsdon, Fotheringham, and Charlton 2002) estimates local regression coefficients for each spatial unit; applied to our dataset, this would mean separate regression coefficients for each constituency. GW-statistics enable spatial non-stationarity in process to be flexibly explored and characterised. As GWR involves generating many hundreds of parameter estimates, visual approaches are often primarily used in its interpretation (see Dykes and Brunsdon 2007).
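
As an indication of what the multi-level alternative might look like, the sketch below fits random intercepts on region with the lme4 package (an assumption for illustration: lme4 is not used elsewhere in this chapter, and cons_data_z again stands in for the z-score transformed data):

library(lme4)
# Random intercepts on region: region-level constants are partially
# pooled towards the overall mean, rather than estimated
# independently as with FE dummies.
ml_model <- lmer(
  leave ~ degree + eu_born + white + no_car + not_good_health +
    heavy_industry + (1 | region),
  data=cons_data_z
)
# Region-level departures from the overall intercept.
ranef(ml_model)$region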

6.3 Techniques

The technical element to this chapter demonstrates how linear regression models can be specified in R, including approaches to extract model summaries and diagnostics, and of course represent and evaluate them using data graphics. Data recording estimated vote shares for Leave by Parliamentary Constituency, as well as constituency-level 2011 Census demographics, were originally collected from the parlitools package.

6.3.1 Import, transform, explore

  • Download the 06-template.qmd file for this chapter and save it to your vis4sds project.
  • Open your vis4sds project in RStudio and load the template file by clicking File > Open File ... > 06-template.qmd.

The template file lists the required packages: tidyverse, sf and tidymodels for extracting model outputs. The processed data with selected 2011 Census demographics can be loaded from the book’s accompanying data repository. In this folder is also a .geojson file containing a hexagon cartogram of UK parliamentary constituencies, derived from Open-Innovations’ HexJSON format.

Explanatory variables describing the demographic composition of constituencies are proportions. In order to support comparison in the multivariate models they are z-score transformed. Here, the distance between observed values for each 2011 Census variable is expressed in standard deviation units from the mean across constituencies for that variable. Our approach is to perform this transformation to each explanatory variable before piping into the model specification. This is achieved with across(). The first argument is the set of columns to which you would like the same function to be applied; and the second is the function you would like to apply. Remembering that mutate() works over columns of a data frame, and that a single column of a data frame is a vector of values, the notation .x is used to access each element of the columns being worked across.

# z-score transform explanatory variables before model
# specification.
cons_data |>
  mutate(
    across(
      .cols=c(younger:heavy_industry),
      .fns=~(.x-mean(.x))/sd(.x)
      )
  )
  <some-model-specification-code>

In Figure 6.2 and Figure 6.3 associations between candidate explanatory variables and Leave are explored using scatterplots and parallel coordinate plots respectively. To avoid cluttering this section, documented code for reproducing these plots is in the 06-template.qmd file for this chapter and in the dropdown below.

Expose code for parallel coordinate plots
# Data staging and ggplot2 code for PCPs --------------------------

# Pull out and order variable names on their correlation with Leave.
order_vars <- cons_data |>
  mutate(across(c(younger:heavy_industry), ~(.x-mean(.x))/sd(.x))) |> 
  pivot_longer(cols=younger:heavy_industry, names_to="expl_var", values_to="prop") |> 
  group_by(expl_var) |>  
  summarise(cor=cor(leave,prop)) |> ungroup() |> arrange(cor) |>  
  pull(expl_var)
# Create staged dataset for plotting.
plot_data <- cons_data |> 
  mutate(
    majority=if_else(leave>.5, "Leave", "Remain"),
    across(c(leave, younger:heavy_industry), ~(.x-mean(.x))/sd(.x)),
    decile=ntile(leave, 10),
    is_extreme = decile > 9 | decile < 2
  )  |> 
  # Select out variables needed for plot.
  select(
    majority, is_extreme, constituency_name, leave, 
    degree, professional, younger, eu_born, no_car, white, own_home, 
    christian, not_good_health, heavy_industry
    ) |>  
  # Change polarity in selected variables.
  mutate(
    degree=-degree, professional=-professional, younger=-younger, 
    eu_born=-eu_born, no_car=-no_car
  ) |>  
  # Gather explanatory variables along rows.
  pivot_longer(cols=c(leave:heavy_industry), names_to="var", values_to="z_score") |> 
  # Recode new explanatory variable as factor ordered according to known
  # assocs. Reverse order here as coord_flip() used in plot.
  mutate(
    var=factor(var, levels=c("leave", order_vars)),
    var=fct_rev(var)
  ) 
# Plot PCP.
plot_data |>  
  ggplot(aes(x=var, y=z_score, group=c(constituency_name), colour=majority)) +
  geom_path( alpha=0.15, linewidth=.2) +
  scale_colour_manual(values=c("#2166ac", "#b2182b")) +
  coord_flip()
Task

Whilst the template provides code for reproducing the faceted scatterplots and parallel coordinate plots, there are some omissions. You will notice that in Figure 6.3 two very high Leave and Remain constituencies are highlighted, using thicker red and blue lines and labelling.

Can you update the ggplot2 spec to create a similar effect? There are different ways of doing this, but you may want to add a separate geom_segment() layer filter()-ed on these selected constituencies. The text annotations may be generated manually using annotate(), or derived from data using geom_text().

6.3.2 Model tidily

The most straightforward way of specifying a linear regression model is with the lm() function and summary() to extract regression coefficients.

model <- cons_data |>
  mutate(
    across(
      .cols=c(younger:heavy_industry),
      .fns=~(.x-mean(.x))/sd(.x)
      )
  ) %>%
  lm(leave ~ degree, data=.)

summary(model)
# Call:
# lm(formula = leave ~ degree, data = .)
#
# Residuals:
#      Min       1Q   Median       3Q      Max
# -0.25521 -0.02548  0.01957  0.05143  0.11237
#
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)  0.520583   0.002896  179.78   <2e-16 ***
# degree      -0.088276   0.002898  -30.46   <2e-16 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.07279 on 630 degrees of freedom
# Multiple R-squared:  0.5956,  Adjusted R-squared:  0.595
# F-statistic: 927.9 on 1 and 630 DF,  p-value: < 2.2e-16

With tidymodels, specifically the broom package, we can extract model outputs in a format that adheres to tidy data (Wickham 2014). An example use of functions from broom for extracting model outputs:

  • tidy() returns estimated coefficients as a data frame.
tidy(model)
# # A tibble: 2 × 5
# term        estimate std.error statistic   p.value
# <chr>          <dbl>     <dbl>     <dbl>     <dbl>
#   1 (Intercept)   0.521    0.00290     180.  0
# 2 degree       -0.0883   0.00290     -30.5 5.67e-126
  • glance() returns a single row containing summaries of model fit.
glance(model)
# # A tibble: 1 × 12
# r.squared adj.r.squared  sigma statistic   p.value    df logLik    AIC    BIC
# <dbl>         <dbl>  <dbl>     <dbl>     <dbl> <dbl>  <dbl>  <dbl>  <dbl>
#   1     0.596         0.595 0.0728      928. 5.67e-126     1   760. -1514. -1501.
# # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
  • augment() returns a data frame of residuals and predictions (fitted values) for the model realisation.
augment(model)
# # A tibble: 632 × 8
# leave  degree .fitted   .resid    .hat .sigma   .cooksd .std.resid
# <dbl>   <dbl>   <dbl>    <dbl>   <dbl>  <dbl>     <dbl>      <dbl>
#   1 0.579 -0.211    0.539  0.0398  0.00165 0.0728 0.000247       0.547
# 2 0.678 -0.748    0.587  0.0914  0.00247 0.0728 0.00195        1.26
# 3 0.386  1.63     0.376  0.00957 0.00582 0.0729 0.0000509      0.132
# 4 0.653 -0.964    0.606  0.0473  0.00306 0.0728 0.000648       0.650
# ...

The advantage of generating model diagnostics and outputs that are tidy is that it eases the process of working with many model realisations. This is a common requirement for modern data analysis, where statistical inferences are made empirically from resampling. For example, we may wish to generate single-variable linear regression models separately for each selected explanatory variable. We could then use these outputs to annotate the scatterplots in Figure 6.2 with their regression lines and colour observations according to their residual values, distance from the regression line (a sketch follows the code description below). These models can be generated with reasonably little code by making use of broom and a style of functional programming in R, supported by the purrr package.

Example code:

single_model_fits <- cons_data |>
  mutate(across(c(younger:heavy_industry), ~(.x-mean(.x))/sd(.x))) |>
  pivot_longer(
    cols=younger:heavy_industry,
    names_to="expl_var", values_to="z_score"
    ) |>
  # Nest to generate list-column by expl_var.
  nest(data=-expl_var) |>
  mutate(
    # Use map() to iterate over the list of datasets.
    model = map(data, ~lm(leave ~ z_score, data = .x)),
    # glance() for each model fit.
    fits = map(model, glance),
    # tidy() for coefficients.
    coefs = map(model, tidy),
    # augment() for predictions/residuals.
    values=map(model, augment),
  )

single_model_fits |>
  # Unnest output from glance.
  unnest(cols = fits) |>
  # Remove other list-columns.
  select(-c(data, model))

# # A tibble: 10 × 15
# expl_var  r.squared adj.r.squared  sigma statistic   p.value    df logLik    AIC
# <chr>         <dbl>         <dbl>  <dbl>     <dbl>     <dbl> <dbl>  <dbl>  <dbl>
# 1 younger       0.289         0.288 0.0965      257. 1.05e- 48     1   582. -1158.
# 2 own_home      0.185         0.184 0.103       143. 7.42e- 30     1   539. -1071.
# 3 no_car        0.157         0.155 0.105       117. 3.81e- 25     1   528. -1050.
# 4 white         0.169         0.168 0.104       128. 3.79e- 27     1   532. -1059.
# 5 eu_born       0.233         0.232 0.100       191. 3.42e- 38     1   558. -1110.
# 6 christian     0.238         0.236 0.100       196. 4.95e- 39     1   560. -1114.
# 7 professi…     0.320         0.319 0.0944      296. 1.08e- 54     1   596. -1186.
# 8 degree        0.596         0.595 0.0728      928. 5.67e-126     1   760. -1514.
# 9 not_good…     0.316         0.315 0.0947      291. 5.93e- 54     1   594. -1182.
# 10 heavy_in…     0.504         0.503 0.0806      640. 5.43e- 98     1   696. -1385.
# # ℹ 6 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>, nobs <int>,
# #   coefs <list>, values <list>

Code description:

  1. Setup: In order to generate separate models for separate explanatory variables we need to generate nested data frames. These are data frames stored in a special type of column (a list-column), in which each value of the column is itself a data frame – in this case one for each explanatory variable over which we would like to compute a model. You can think of parameterising nest() in a similar way to group_by(). We first pivot_longer() to generate a data frame where each observation contains the recorded Leave vote for a constituency and its corresponding z_score value for each explanatory variable. There are 10 explanatory variables and so nest() returns a data frame with the dimensions 10x2 – a variable identifying the explanatory variable on which the model is to be built (expl_var) and a list-column, each element containing a data frame with the dimensions 632x13.
  2. Build model: In mutate(), purrr’s map() function is used to iterate over the list of datasets and fit a model to each nested dataset. The new column model is a list-column this time containing a list of model objects.
  3. Generate outputs: Next, the different model outputs can be generated using glance(), tidy(), augment(), with map() to iterate over the list of model objects. The new columns are now list-columns of data frames containing model outputs.
  4. Extract outputs: Finally we want to extract the values from these nested data. This can be achieved using unnest() and supplying to the cols argument the names of the list-columns from which we want to extract values.
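
Below is a sketch of the annotated scatterplots suggested earlier, assuming single_model_fits from the block above: each single-variable fit is unnested and observations are coloured on their residual values.

single_model_fits |>
  # Keep only the augment() output and unnest to one row per
  # constituency-variable pair.
  select(expl_var, values) |>
  unnest(cols=values) |>
  ggplot(aes(x=z_score, y=leave)) +
  # Colour observations on their residual values.
  geom_point(aes(colour=.resid), alpha=.5, size=.6) +
  # Annotate each facet with its regression line (fitted values).
  geom_line(aes(y=.fitted), colour="#252525") +
  facet_wrap(~expl_var, nrow=2) +
  scale_colour_distiller(palette="RdBu", direction=1)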

6.3.3 Plot models tidily

In Figure 6.4 estimated regression coefficients from a multivariate model are plotted, annotated with 95% confidence intervals. The ggplot specification is reasonably straightforward.

The code for Figure 6.4:

model <- cons_data |>
  mutate(across(c(younger:heavy_industry), ~(.x-mean(.x))/sd(.x))) %>%
  lm(leave ~
    degree + eu_born + white  + no_car + not_good_health + heavy_industry,
     data=.)

tidy(model) |>
  filter(term != "(Intercept)") |>
  ggplot(
    aes(x=reorder(term, -estimate),
        y=estimate, ymin=estimate-1.96*std.error,
        ymax=estimate+1.96*std.error)
        ) +
  geom_pointrange() +
  coord_flip()

The plot specification:

  1. Data: A data frame of model coefficients extracted from the multivariate model object (model) using tidy().
  2. Encoding: y-position varies according to the size of the coefficient estimate and the 95% confidence intervals, derived from std.error and encoded using ymin and ymax parameters.
  3. Marks: geom_pointrange(), which understands ymin and ymax, for the dots with confidence intervals.
  4. Setting: coord_flip() to make variable names easier to read.

6.3.4 Extend model terms

To include a Fixed Effect (FE) term on region, the region variable is simply added as a variable to lm(). However, we must convert it to a factor variable; this has the effect of creating dummies on each value of region. Default behaviour within lm() is to hold back a reference value of region, with FE regression coefficients describing the effect on the outcome of a constituency being located in a given region relative to that reference region. So the reference region (intercept) in the model below is East Midlands – the first in the factor to appear alphabetically. The signed coefficient estimates for regions identify whether, after controlling for variation in demographics, the Leave vote for a particular region is expected to be higher or lower than this.

cons_data |>
  mutate(
    across(c(younger:heavy_industry), ~(.x-mean(.x))/sd(.x)),
    region=factor(region)) %>%
  lm(leave ~
       region + degree + eu_born + white  + no_car + not_good_health + heavy_industry,
     data=.) |>
  tidy()

# # A tibble: 17 × 5
# term                           estimate std.error statistic  p.value
# <chr>                             <dbl>     <dbl>     <dbl>    <dbl>
#   1 (Intercept)                     0.530     0.00581    91.3   0
# 2 regionEast of England           0.00363   0.00787     0.462 6.45e- 1
# 3 regionLondon                    0.0654    0.00948     6.90  1.30e-11
# 4 regionNorth East                0.00482   0.00945     0.510 6.10e- 1
# 5 regionNorth West               -0.0200    0.00728    -2.75  6.12e- 3
# 6 regionScotland                 -0.145     0.00843   -17.2   1.28e-54
# 7 regionSouth East                0.00377   0.00752     0.502 6.16e- 1
# 8 regionSouth West               -0.0233    0.00789    -2.95  3.26e- 3
# 9 regionWales                    -0.0547    0.00860    -6.36  3.87e-10
# 10 regionWest Midlands             0.0236    0.00745     3.17  1.59e- 3
# 11 regionYorkshire and The Humber  0.0112    0.00762     1.47  1.41e- 1
# 12 degree                         -0.0772    0.00339   -22.8   1.30e-83
# 13 eu_born                         0.0163    0.00308     5.29  1.72e- 7
# 14 white                           0.0303    0.00314     9.66  1.18e-20
# 15 no_car                         -0.0336    0.00292   -11.5   6.50e-28
# 16 not_good_health                 0.0102    0.00331     3.07  2.24e- 3
# 17 heavy_industry                  0.0132    0.00266     4.96  9.23e- 7

We want our model to represent a dummy for every region and so we add -1 to the specification. Doing this removes the intercept (reference region), making R^2 no longer meaningful.

cons_data |>
  mutate(
    across(c(younger:heavy_industry), ~(.x-mean(.x))/sd(.x)),
    region=factor(region)) %>%
  lm(leave ~
       region + degree + eu_born + white  + no_car + not_good_health + heavy_industry -1,
     data=.) |>
 glance()

# # A tibble: 1 × 12
# r.squared adj.r.squared  sigma statistic p.value    df logLik    AIC    BIC
# <dbl>         <dbl>  <dbl>     <dbl>   <dbl> <dbl>  <dbl>  <dbl>  <dbl>
#   1     0.995         0.995 0.0371     7625.       0    17  1193. -2351. -2270.
# # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

To include an Interaction on region, we need to set a variable that will be used to represent these regional constants (cons), and the Interaction is added with the notation :.

model <- cons_data |>
  mutate(
    across(c(younger:heavy_industry), ~(.x-mean(.x))/sd(.x)),
    region=as.factor(region), cons=1) %>%
  lm(leave ~ 0 +
       (cons + degree  + eu_born + white  + no_car + not_good_health +
       heavy_industry):(region),
     data=.
     )

The model updated with the regional Interaction term results in many more coefficients that are, as discussed, somewhat unstable. To plot them as in Figure 6.7 we minimally update the code used to generate the previous model outputs.

tidy(model) |>
  separate(term, into= c("term", "region"), sep=":") |>
  mutate(region=str_remove(region,"region")) |>
  filter(term!="cons") |>
  ggplot() +
  geom_col(aes(x=reorder(term, -estimate), y=estimate), alpha=.3)+
  geom_pointrange(aes(
    x=reorder(term, -estimate),y=estimate,
    ymin=estimate-1.96*std.error, ymax=estimate+1.96*std.error
  )) +
  geom_hline(yintercept = 0, linewidth=.2)+
  facet_wrap(~region) +
  coord_flip()

The plot specification:

  1. Data: A data frame of model coefficients extracted from the multivariate model object using tidy(). To make clean plot labels we need to remove unnecessary text in the term variable (e.g. “cons:regionEast Midlands”). separate() allows us to split this column on : and str_remove() then strips the “region” prefix. We do not wish to plot the FE constants and so filter() them out.
  2. Encoding: y-position varies according to the size of the coefficient estimate and the 95% confidence intervals, in exactly the same way as for Figure 6.4.
  3. Marks: geom_pointrange(), encoded as in Figure 6.4. The only addition is light bars in the background (geom_col()). This seems to aid interpretation of the direction and size of the coefficients.
  4. Facets: facet_wrap() on region in order to display coefficients estimated separately for each region.

6.3.5 Evaluate models with lineups

In Figure 6.8 is a map line-up of the residuals from the FE-updated model – our expectation is that these residuals should no longer be spatially autocorrelated, since we collapse regional variation into our FE term.

Figure 6.8: Map line-up of residuals from model with Fixed Effect on region. The ‘real’ dataset is presented alongside 8 decoy plots generated by randomly permuting the observed residuals around constituencies. Adding the FE term has addressed some of the systematic over- and under- estimation of the vote between regions (compare for example Figure 6.5). There is nevertheless obvious spatial autocorrelation, plot 4 being the real data. Further analysis, for example of the constituencies for which Leave is particularly over- and under- represented, may be instructive.

Using functional-style programming, and tidymodels, plot line-ups can be generated with surprisingly pared-back code. First generate a model object and extract residuals from it, again making use of nest(), map() and augment():

model <- cons_data |>
  select(-c(population, population_density)) |>
  mutate(
    across(c(younger:heavy_industry), ~(.x-mean(.x))/sd(.x)),
    type="full_dataset", region=as.factor(region)
    ) |>
  nest(data=-type) |>
  mutate(
    # Include `-1` to eliminate the constant term and include a dummy for every area
    model=map(data,
      ~lm(leave ~ region +  degree  + eu_born + white  + no_car + not_good_health + heavy_industry -1, data=.x)
      ),
    # augment() for predictions / residuals
    values=map(model, augment)
  )

Next, generate permuted data by randomly shuffling residual values around constituencies. To do this requires some knowledge of the rsample package and its functions. We extract residuals from the list-column named values, remove redundant list-columns (with select()) and then unnest() on the original data and the new resids field to return to a dataset where each row is a constituency, but now containing residual values for the multivariate model. From here, we use the permutations() function from rsample to shuffle the constituency ID column (pcon19cd) randomly around, generating eight permuted datasets and appending the real data (apparent=TRUE). This results in a new data frame where each row contains a permuted dataset, stored in a list-column named splits and labelled via an id column. We need to map() over splits to convert each split object into a data frame, using rsample’s analysis() function. From here, we unnest() to generate a dataset where each row is a constituency and its corresponding residual value (real or shuffled) for a given permutation id.

permuted_data <- model |>
  mutate(
    resids=map(values, ~.x |>  select(.resid))
  ) |>
  select(-c(model, values)) |>
  unnest(cols=c(data,resids)) |>
  select(pcon19cd, .resid) |>
  permutations(permute=c(pcon19cd), times=8, apparent=TRUE) |>
  mutate(data=map(splits, ~rsample::analysis(.))) |>
  select(id, data) |>
  unnest(cols=data)

Now we have the permuted dataset, the lineup can be generated straightforwardly with standard ggplot2 code:

# Store max value of residuals for setting limits in map colour scheme.
max_resid <- max(abs(permuted_data$.resid))
# Store vector of permutation IDs for shuffling facets in the plots.
ids <- permuted_data |> pull(id) |> unique()

cons_hex |>
  select(cons_code, region) |>
  inner_join(permuted_data, by=c("cons_code"="pcon19cd")) |>
  mutate(id=factor(id, levels=sample(ids))) |>
  ggplot() +
  geom_sf(aes(fill=.resid), colour="#636363", linewidth=0.05)+
  geom_sf(
    data=. %>% group_by(region) %>% summarise(),
    colour="#636363", linewidth=0.2, fill="transparent"
    )+
  facet_wrap(~id, ncol=3) +
  scale_fill_distiller(palette="RdBu", direction=1,
                       limits=c(-max_resid, max_resid), guide="none")

The plot specification:

  1. Data: inner_join the permuted data on the simple features file containing the hexagon cartogram boundaries (cons_hex). To generate the lineup we facet_wrap() on the permutation id. By default ggplot2 will draw facets in a particular order – determined either by the numeric or alphabetical order of the facet variable’s values, or by an order determined by a factor variable. Each time we plot the lineups we want the order in which the real and decoy plots are drawn to vary. Therefore we convert id to a factor variable and shuffle the levels (the ordering) around, using the sample() function on a vector of permutation IDs (ids) before piping to ggplot(). Note that we also record the maximum absolute value of the residuals to ensure that they are coloured symmetrically on 0 (max_resid). Finally, you may notice there are two geom_sf() calls in the plot specification. The second draws regional boundary outlines across each plot. This is achieved by collapsing the hex data on region (using group_by() and summarise()).
  2. Encoding: hexagons are filled according to the residual values (fill=.resid).
  3. Marks: geom_sf() for drawing the hexagon outlines. The first geom_sf colours each constituency on its residual value. The second does not encode any data values – notice there is no aes() – and is simply used to draw the region outlines.
  4. Facets: facet_wrap() on the permutation id so that the real plot and the decoys are displayed together as a line-up.
  5. Scale: scale_fill_distiller() for ColorBrewer (Harrower and Brewer 2003) scheme, using the RdBu palette and with limits set to max_resid.
  6. Setting: The linewidth parameter of the hexagon outlines is varied so that the regional outlines in the second call to geom_sf() appear more salient. Also here, a transparent fill to ensure that the regional outlines do not occlude the encoded residuals.

6.4 Conclusion

This chapter demonstrated how visual and computational approaches can be used together in a somewhat more ‘traditional’ area-level regression analysis. Associations between constituency-level Leave voting in the UK’s 2016 EU Referendum and selected variables describing the demographic and socio-economic composition of constituencies were explored, with data graphics used to characterise bias in the generated models – to identify geographic and regional groupings that our early models ignore. Two classes of model update for addressing this geographic grouping were covered: those that treat geographic dependence in the values of variables as a nuisance term that is to be quantified and controlled away; and those that explicitly try to model for geographic grouping in processes. We introduced some initial techniques for dealing with both, treating geography as a categorical variable: introducing a Fixed Effect term to assess regional dependence and Interaction term to assess regional non-stationarity. Importantly, the chapter reused some of the dplyr and functional programming code templates instrumental for working over models. There was a step-up in code complexity. Hopefully you will see in the next chapter that this sort of functional programming style (Wickham, Çetinkaya-Rundel, and Grolemund 2023) greatly aids the process of performing and working with resampled datasets, a key feature of modern computational data analysis.

6.5 Further Reading

For an area-level analysis of the Brexit vote:

  • Beecham, R., Williams, N. and Comber, L. 2020. “Regionally-structured explanations behind area-level populism: An update to recent ecological analyses.” PLOS One, 15(3): e0229974. doi: 10.1371/journal.pone.0229974.

On modelling for geographic dependence and non-stationarity:

  • Comber, A., Brunsdon, C., Charlton, M. et al. 2023. “A route map for successful applications of Geographically Weighted Regression.” Geographical Analysis, 55 (1): 155–178. doi: 10.1111/gean.12316.

  • Wolf, L. J. et al. 2023. “On Spatial and Platial Dependence: Examining Shrinkage in Spatially Dependent Multilevel Models.” Annals of the American Association of Geographers, 55(1): 1–13.

On graphical inference:

  • Buja, A., Cook, D., Hofmann, H., Lawrence, M., Lee, E.K., Swayne, D. F. and Wickham, H. 2010. “Statistical Inference for Exploratory Data Analysis and Model Diagnostics.” Royal Society Philosophical Transactions A, 367:4361– 83. doi: 10.1098/rsta.2009.0120.

On model building in the tidyverse:

  • Ismay, C. and Kim, A. 2020. “Statistical Inference via Data Science: A ModernDive into R and the Tidyverse.” CRC Press. doi: 10.1201/9780367409913.

  • Kuhn, M. and Silge, J. 2023. “Tidy Modeling with R.” O’Reilly.

On functional programming in R and tidyverse:

  • Wickham, H., Çetinkaya-Rundel, M. and Grolemund, G. 2023. “R for Data Science, 2nd Edition.” O’Reilly.
    • Chapter 27.