2 Data Fundamentals

By the end of this chapter you should gain the following knowledge and practical skills.

Knowledge outcomes

Learn the vocabulary and concepts used to describe data.
Appreciate the characteristics and importance of tidy data (Wickham 2014).

Skills outcomes

Read-in large external files as data frames.
Calculate descriptive summaries over datasets using dplyr.
Learn how to structure, join and reshape data using dplyr and tidyr.
Create statistical graphics for initial data description and exploration.

2.1 Introduction

This chapter covers the basics of how to describe and organise data. While this might sound prosaic, there are several reasons why being able to consistently describe a dataset is important. First, it is the initial step in any analysis and helps delimit the analytical procedures that can be deployed. This is especially relevant to modern data analysis, where it is common to apply the same analysis templates over many different datasets. Describing data using a consistent vocabulary enables you to identify which analysis templates to reuse. Second, relates to the point in Chapter 1, that social data science projects usually involve repurposing datasets for the first time. It is often not obvious whether a dataset contains sufficient detail and structure to characterise the behaviours being researched and the target populations it is assumed to represent. This leads to additional levels of uncertainty and places greater importance on the initial steps of data processing, description and exploration.

Through the chapter we will develop vocabulary for describing and thinking about data, as well as some of the most important data processing and organisation techniques in R. We will do so using data from New York’s Citibike system.

Data vocabulary

A consistent vocabulary for describing data is especially useful when learning modern visualization toolkits like ggplot2, Tableau and vega-lite. We will expand upon this in some detail in Chapter 3 as we introduce the fundamentals of visualization design and the Grammar of Graphics (Wilkinson 1999).

2.2 Concepts

2.2.1 Data frames

Throughout this book we will work with data frames. These are spreadsheet-like representations where rows are observations and columns are variables. In an R data frame, variables are vectors that must be of equal length. Where observations have missing values, for certain variables the missing values must be substituted with something, usually with NA or similar. This constraint can cause difficulties. For example, when working with variables that contain many values of different length for an observation, we create a special class of column, a list-column. Organising data according to this simple structure – rows as observations, columns as variables – is useful for developing analysis templates that work with the tidyverse package ecosystem.

2.2.2 Types of variable

A familiar classification for describing data is that developed by Stevens (1946) when considering the level of measurement of a variable. Stevens (1946) organises variables into two classes: variables that describe categories of things and variables that describe measurements of things.

Categories include attributes like gender, customer segments, ranked orders (1st, 2nd, 3rd largest etc.). Measurements include quantities like distance, age and travel time. Categories can be further subdivided into those that are unordered (nominal) and those that are ordered (ordinal). Measurements can also be subdivided. Interval measurements are quantities where the computed difference between two values is meaningful. Ratio measurements have this property, but also have a meaningful 0, where 0 means the absence of something, and the ratio of two values can be computed.

Why is this useful? The measurement level of a variable determines the types of data analysis operations that can be performed and therefore allows us to make quick decisions when working with a dataset for the first time (Table 2.1).

Table 2.1: Breakdown of variable types and corresponding mathematical operations.

Measurement	Example	Operators	Midpoint	Spread
Categories
Nominal	Political parties; street names	= ≠	mode	entropy
Ordinal	Terrorism threat levels	= ≠ <>	median	percentile
Measures
Interval	Temperatures; years	= ≠ <> + -	mean	variance
Ratio	Distances; prices	= ≠ <> + - \| × ÷	mean	variance

Table 2.2: Breakdown of variable types and corresponding mathematical operations.

Task 1

Complete the data description table below identifying the measurement level of each variable in the New York bikeshare stations dataset.

Variable name	Variable value	Measurement level
`name`	“Central Park”
`capacity`	80
`rank_capacity`	45
`date_opened`	“2014-05-23”
`longitude`	-74.00149746
`latitude`	40.74177603

2.2.3 Types of observation

Observations form an entire population or a sample that we expect is representative of a target population. In social data science applications we often work with datasets that are so-called population-level. The Citibike dataset is a complete, population-level dataset in that every Citibike journey is recorded. Whether or not this is truly a population-level dataset, however, depends on the analysis purpose. When analysing trips made by Citibike users, are we interested only in those cyclists? Or are we taking the patterns observed through our analysis to make inferences about New York cycling more generally? If the latter, then there are problems as the level of detail we have on our sample is pretty trivial compared to traditional, actively-collected datasets, where data collection activities are designed with a target population in mind. It may therefore be difficult to gauge how representative Citibike users and Citibike cycling is of New York’s general cycling population. The flipside is that so-called passively-collected data may not suffer from the same problems of non-response bias and social-desirability bias as traditional, actively-collected data.

2.2.4 Tidy data

We will work with data frames organised such that columns always and only refer to variables and rows always and only refer to observations. This arrangement, called tidy (Wickham 2014), has two key advantages. First, if data are arranged in this tidy form, then it is easier to apply and re-use tools for wrangling them as they have the same underlying structure. Second, placing variables into columns with each column containing a vector of values, means that we can take advantage of R’s vectorised functions for transforming data. This is demonstrated in the technical element of the chapter.

The three rules for tidy data (Wickham 2014):

Each variable forms a column.
Each observation forms a row.
Each type of observation unit forms a table.

The concept of tidy data, and its usefulness for tidyverse-style operations, is best explained through example. The technical element to this chapter is therefore comparatively lengthy and demonstrates key coding templates for organising and re-organising data frames for analysis.

2.3 Techniques

In this section we import, describe, transform and tidy data from New York’s Citibike bikeshare system.

Download the 02-template.qmd¹ file, and save it to your vis4sds project, created in Chapter 1.
Open your vis4sds project in RStudio, and load the template file by clicking File > Open File ... > 02-template.qmd.

2.3.1 Import

In the template file there is documentation on how to set up your R session with key packages – tidyverse , fst, lubridate, sf. The data were collected using the bikedata R package. A subset of data from New York’s bikeshare system, Citibike, were collected for this chapter and can be downloaded from the book’s accompanying data repository².

The code for reading in these data may be familiar to most readers. The here package, which reliably creates paths relative to a project’s root, is used to pass the locations at which the New York trips and stations data are stored as a parameter to read_csv() and read_fst(). Notice that we use assignment (<-), so data are loaded as objects and appear in the Environment pane of RStudio.

# Read in local copies of the trips and stations data.
ny_trips <- read_fst(here("data", "ny_trips.fst"))
ny_stations <- read_csv(here("data", "ny_stations.csv"))

fst for in-memory analysis

fst is a special class of file that implements in the background various operations to speed-up reading and writing of data. This makes it possible to work with large datasets in-memory in R rather than connecting to a database and returning data summaries/subsets.

Inspecting the layout of the stations data with View(ny_stations) you will notice that the top line is the header and contains column (variable) names. The glimpse() function can be used to quickly describe a data frame’s dimensions. We have 500,000 trip observations in ny_trips and 11 variables; the 500,000 represents a random sample of c.1.9 million trips recorded in June 2020. The function also prints out the object type for each of these variables, with the variables either of type int, chr or dbl.

Figure 2.1: `ny_trips` and `ny_stations` as they appear when calling `View()`.

In this case the assignment needs correcting. start_time and stop_time may be better represented in date-time format; the station identifier variables (e.g. start_station_id) are more efficient when converted to int types; and the geographic coordinates, currently stored as text strings (chr), are better converted as floating points (dbl) or POINT geometry types (Pebesma 2018). In the 02-template.qmd file are code chunks for doing these conversions. There are some slightly more involved data transform procedures in this code, which you may wish to ignore at this stage.

glimpse(ny_trips)
## Rows: 500,000
## Columns: 11
## $ id               <int> 1, 2, 3, 4, 5, 6, 7…
## $ city             <chr> "ny", "ny", "ny", "n.."
## $ trip_duration    <dbl> 1062, 3810, 1017, 226..
## $ start_time       <chr> "2020-06-01 00:00:03", ...
## $ stop_time        <chr> "2020-06-01 00:17:46", ...
## $ start_station_id <chr> "ny3419", "ny366", "ny389", ...
## $ end_station_id   <chr> "ny3419", "ny336", "ny3562", ...
## $ bike_id          <chr> "39852", "37558", "37512", ...
## $ user_type        <chr> "Customer", "Subscriber", ...
## $ birth_year       <chr> "1997", "1969", "1988", ...
## $ gender           <dbl> 2, 0, 2, 0, 2, 1, 2, 2, ...

glimpse(ny_stations)
## Rows: 1,010
## Columns: 6
## $ id        <int> 1, 2, 3, 4, 5, 6, 7, ...
## $ city      <chr> "ny", "ny", "ny", "ny", "ny", "n..."
## $ stn_id    <chr> "ny116", "ny119", "ny120", "ny127", "n..."
## $ name      <chr> "W 17 St & 8 Ave", "Park Ave", "B…."
## $ longitude <chr> "-74.00149746", "-73.97803415", "..."
## $ latitude  <chr> "40.74177603", "40.69608941", "..."

Table 2.3: A breakdown of data types in R.

2.3.2 Manipulate

Manipulate with `dplyr` and pipes (`|>`)

dplyr is the foundational package of the tidyverse. It provides a grammar of data manipulation, with access to functions that can be variously combined to support most data processing and manipulation tasks. Once familiar with dplyr functions, you will find yourself generating analysis templates to re-use whenever working on a dataset.

All dplyr functions operate in a consistent way:

Start with a data frame.
Pass arguments to a function performing some updates to the data frame.
Return the updated data frame.

So every dplyr function expects a data frame and will always return a data frame.

Table 2.4: dplyr functions (verbs) for manipulating data frames.

Table 2.5: dplyr functions (verbs) for manipulating data frames.

function()	Description
filter()	Picks rows (observations) if their values match a specified criteria
arrange()	Reorders rows (observations) based on their values
select()	Picks a subset of columns (variables) by name (or name characteristics)
rename()	Changes the name of columns in the data frame
mutate()	Adds new columns
group_by()	Chunks the dataset into groups for grouped operations
summarise()	Calculates single-row (non-grouped) or multiple-row (if grouped) summary values
...

dplyr functions are designed to be chained together, and this chaining of functions can be achieved using the pipe operator (|>). Pipes are mechanisms for passing information in a program. They take the output of a set of code (a dplyr specification) and make it the input of the next set (another dplyr specification). Pipes can be easily applied to dplyr functions and the functions of all packages that form the tidyverse. We mentioned in Chapter 1 that ggplot2 provides a framework for specifying a layered grammar of graphics (more on this in Chapter 3). Together with the pipe operator, dplyr supports a layered grammar of data manipulation. You will see this throughout the book as we develop and re-use code templates for performing some data manipulation that is then piped to a ggplot2 specification for visual analysis.

`count()` and `summarise()` over rows

Let’s combine some dplyr functions to generate statistical summaries of the New York bikeshare data. First we’ll count the number of trips made in June 2020 by user_type, a variable distinguishing casual users from those formally registered to use the system (Customer vs. Subscriber cyclists). dplyr has a convenience function for counting, so we could run the code below, also in the 02-template.qmd for this chapter.

# Take the ny_trips data frame.
ny_trips |>
  # Run the count function and sort the result.
  count(user_type, sort=TRUE)
##    user_type      n
## 1 Subscriber 347204
## 2   Customer 152796

There are a few things happening in the count() function. It takes the usr_type variable from ny_trips, organises or groups the rows in the data frame according to its values (Customer | Subscriber), counts the rows and then orders the summarised output descending on the counts.

Often you will want to do more than simply counting, and you may also want to be more explicit in the way the data frame is grouped for computation. A common workflow is to combine group_by() and summarise(), and in this case arrange() to replicate the count() example.

# Take the ny_trips data frame.
ny_trips |>
  # Group by user_type.
  group_by(user_type) |>
    # Count the number of observations per group.
    summarise(count=n()) |>
    # Arrange the grouped and summarised (collapsed) rows in count.
    arrange(desc(count))
## # A tibble: 2 × 2
##   user_type   count
##   <chr>       <int>
## 1 Subscriber 347204
## 2 Customer   152796

In ny_trips there is a variable measuring trip duration in seconds (trip_duration). It may be instructive to calculate some summary statistics to see how trip duration varies between these groups. The code below uses group_by(), summarise() and arrange() in exactly the same way, but with the addition of other aggregate functions summarises the trip_duration variable according to central tendency (mean and standard deviation) and by user_type.

# Take the ny_trips data frame.
ny_trips |>
  mutate(trip_duration=trip_duration/60) |>
  # Group by user type.
  group_by(user_type) |>
  # Summarise over the grouped rows,
  # generate a new variable for each type of summary.
  summarise(
    count=n(),
    avg_duration=mean(trip_duration),
    median_duration=median(trip_duration),
    sd_duration=sd(trip_duration),
    min_duration=min(trip_duration),
    max_duration=max(trip_duration)
    ) |>
  # Arrange on the count variable.
  arrange(desc(count))

## # A tibble: 2 × 7
##   user_type   count avg_dur med_dur sd_dur min_dur max_dur
##   <chr>       <int> <dbl>   <dbl>   <dbl>  <dbl>   <dbl>
## 1 Subscriber 347204  20.3   14.4    116.   1.02    33090.
## 2 Customer   152796  43.3   23.1    383.   1.02    46982.
## # … with abbreviated variable names

As each line is commented you hopefully get a sense of what is happening in the code above. Since dplyr functions read like verbs, and code is executed sequentially, it greatly helps to organise dplyr code such that each new verb (function call) occupies a single line, separated with a pipe (|>). Once you are familiar with dplyr, it becomes very easy to read, write, re-use and share code in this way.

Table 2.6: A breakdown of aggregate functions commonly used with summarise().

Table 2.7: A breakdown of aggregate functions commonly used with summarise().

Function	Description
n()	Counts the number of observations
n_distinct(var)	Counts the number of unique observations
sum(var)	Sums the values of observations
max(var)\|min(var)	Finds the min\|max values of observations
mean(var)\|median(var)\| ...	Calculates central tendency of observations
...	Many more

On pipes |>

Remembering that pipes take the output of a set of code and make it the input of the next set, separate lines are used for each call to the pipe operator. This is good practice for supporting readability of your code, and for debugging and learning how your data is affected by each line. Especially if dplyr is new to you, we recommend you run each code line separated by a pipe (|>) in the Console and observe how the dataset is changed.

Manipulate dates with `lubridate`

Let’s continue this investigation of trips by user type by profiling how usage varies over time. To do this we will need to work with date-time variables. The lubridate package provides various convenience functions for this.

In the code block below we extract the day of week and hour of day from the start_time variable using lubridate’s day accessor functions. Documentation on these can be accessed in the usual way (?<function-name>). Next we count the number of trips made by hour of day, day of week and user type. The summarised data frame will be re-used several times in our analysis, so we store it as an object with a suitable name (ny_temporal) using the assignment operator.

# Create an hour of day and day of week summary by user type
# and assign it the name "ny_temporal".
ny_temporal <- ny_trips |>
  mutate(
    # Create a new column to identify dow.
    day=wday(start_time, label=TRUE),
    # Create a new column to identify hod.
    hour=hour(start_time)) |>
  # Group by day, hour, user_type.
  group_by(user_type, day, hour) |>
  # Count the grouped rows.
  summarise(count=n()) |>
  ungroup()

Keeping track of derived data

Whether or not to store derived data frames, like the newly assigned ny_temporal, in a session is not an easy decision. You should avoid cluttering the Environment pane with many data objects. Often when generating charts it is necessary to create these sorts of derived tables as input data to ggplot2. Adding derived data frames to the RStudio Environment pane each time an exploratory plot is created risks an unhelpfully large number of such tables. A general rule: if the derived data frame is to be used more than three times in a data analysis or is computationally intensive, create and assign it (<-) as a named object.

In Figure 2.2 the newly derived data are plotted. Code for creating this data graphic is below. The plot demonstrates a familiar weekday-weekend pattern of usage. Trip frequencies peak in the morning and evening rush hours during weekdays and mid/late-morning and afternoon during weekends, with the weekday afternoon peak much larger than the morning peak. There are obvious differences in the type of trips made by subscribers versus customers – the temporal signature for subscribers appears to match more closely what one would expect of commuting behaviour. That this pattern exists is notable when remembering the data are from June 2020, a time when New York residents were emerging from lockdown. It would be instructive to compare with data from a non-Covid year. The fact that bikeshare is recorded continuously, in contrast to actively-collected survey data, makes this sort of behavioural change analysis possible.

Figure 2.2: Citibike trips made by hour of day and day and week, differentiated by customer type.

Code

# Calculate trips by hour of day, day and customer type.
ny_trips |> 
  mutate(
    day=wday(start_time, label=TRUE),
    hour=hour(start_time)) |> 
  group_by(user_type, day, hour) |> 
  summarise(count=n()) |> 
  ungroup() |> 
  # Pipe to ggplot2 for plotting.
  ggplot(aes(x=hour, y=count)) +
  geom_line(aes(colour=user_type), size=1) +
  scale_colour_manual(values=c("#e31a1c", "#1f78b4")) +
  facet_wrap(~day, nrow=1)+
  labs(x="hour of day", y="trip counts", colour="user type")

Relate tables with `join()`

Trip distance is not recorded directly in the ny_trips table, but may be important for profiling usage behaviour. Since ny_stations contains coordinates corresponding to station locations, distances can be calculated by linking these station coordinates to the origin and destination stations recorded in ny_trips. To relate the two tables we need to specify a join between them.

A sensible approach is to:

Select all uniquely cycled trip pairs (origin-destination pairs) that appear in the ny_trips table.
Bring in the corresponding coordinate pairs representing the origin and destination stations by joining on the ny_stations table.
Calculate the distance between the coordinate pairs representing the origin and destination.

The code below is one way of achieving this.

# Take the ny_trips data frame.
od_pairs <- ny_trips |>
  # Select trip origin and destination (OD) station columns
  # and extract unique OD pairs.
  select(start_station_id, end_station_id) |>  unique() |>
  # Select lon, lat columns from ny_stations and join on origin column.
  left_join(
    ny_stations |> select(stn_id, longitude, latitude),
    by=c("start_station_id"="stn_id")
    ) |>
  # Rename new lon, lat columns and associate with origin station.
  rename(o_lon=longitude, o_lat=latitude) |>
  # Select lon, lat columns from ny_stations and join
  # on destination column.
  left_join(
    ny_stations |> select(stn_id, longitude, latitude),
    by=c("end_station_id"="stn_id")
    ) |>
  # Rename new lon, lat columns and associate with destination station.
  rename(d_lon=longitude, d_lat=latitude) |>
  # Compute distance calculation on each row (od_pair).
  rowwise() |>
  # Calculate distance and express in kms.
  mutate(
    dist=
    geosphere::distHaversine(c(o_lat, o_lon), c(d_lat, d_lon))/1000
    ) |>
  ungroup()

Some new functions are introduced: select() to pick or drop variables, rename() to rename variables and a convenience function for calculating straight line distances from polar coordinates, distHaversine(). The key function to emphasise is the left_join(). If you’ve worked with relational databases, dplyr’s join functions will be familiar to you. In a left_join all the values from the first (left-most) table are retained, ny_trips in this case, and variables from the table on the right , ny_stations, are added. We specify the variable on which tables should be joined with the by= parameter, station_id in this case. If there is a station_id in ny_trips that doesn’t exist in ny_stations then corresponding cells are filled out with NA.

Table 2.8: A breakdown of dplyr join functions.

Table 2.9: A breakdown of dplyr join functions.

*_join(x, y) ...
left_join()	all rows from table x
right_join()	all rows from table y
full_join()	all rows from both table x and y
`semi_join()`	all rows from table x where there are matching values in table y, keeping just columns from table x
inner_join()	all rows from table x where there are matching values in table y, returning all combinations where there are multiple matches
anti_join	all rows from table x where there are not matching values in table y, never duplicating rows of table x

From the newly created distance variable we can calculate the average (mean) trip distance for the 500,000 sampled trips – 1.6km. This might seem very short, but remember that these are straight-line distances between pairs of docking stations. Ideally we would calculate distances derived from cycle trajectories. A separate reason, discovered when generating a histogram on the dist variable, is that there are a large number of trips that start and end at the same docking station. These might be unsuccessful hires – people failing to undock a bike for example. We could investigate this further by paying attention to the docking stations at which same origin-destination trips occur, as in the code block below.

ny_trips |>
  filter(start_station_id==end_station_id) |>
  group_by(start_station_id) |>  summarise(count=n()) |>
  left_join(
    ny_stations |>   select(stn_id, name),
    by=c("start_station_id"="stn_id")
    ) |>
  arrange(desc(count))
## # A tibble: 958 x 3
##    start_station_id count name
##    <chr>            <int> <chr>
##  1 ny3423            2017 West Drive & Prospect Park West
##  2 ny3881            1263 12 Ave & W 125 St
##  3 ny514             1024 12 Ave & W 40 St
##  4 ny3349             978 Grand Army Plaza & Plaza St West
##  5 ny3992             964 W 169 St & Fort Washington Ave
##  6 ny3374             860 Central Park North & Adam Clayton Powell 
##  7 ny3782             837 Brooklyn Bridge Park - Pier 2
##  8 ny3599             829 Franklin Ave & Empire Blvd
##  9 ny3521             793 Lenox Ave & W 111 St
## 10 ny2006             782 Central Park S & 6 Ave
## # … with 948 more rows

The top 10 docking stations are either in parks, near parks or located along the river. This, coupled with the fact that same origin-destination trips occur in much greater relative number for casual users (Customer), associated with discretionary leisure-oriented cycling, than regular users (Subscriber) is further evidence that these are valid trips. Note also the different shapes in the distribution of distances for trips cycled by subscribers and customers (Figure 2.3), again suggesting these groups may use Citibike in different ways. Code for creating Figure 2.3 appears below the graphic.

Figure 2.3: Citibike trip distances (straight-line km) for Subscriber and Customer cyclists.

Code

# Plot faceted histograms.
ny_trips |> 
  # Cast as factor variable to draw Subscriber plot facet first.
  mutate(
    user_type=factor(user_type, levels=c("Subscriber", "Customer"))
    ) |> 
  ggplot(aes(dist)) +
  geom_histogram() +
  facet_wrap(~user_type)+
  labs(x="distance = km", y="frequency")

Write functions of your own

There may be times when you need to create functions of your own. Most often this is when you find yourself copy-pasting the same chunks of code with minimal adaptation.

Functions have three key characteristics:

They are (usually) named – the name should be expressive and communicate what the function does.
They have brackets <function-name()> usually containing arguments – inputs, which determine what the function does and returns.
Immediately followed by <function-name()> are curly brackets ({}) used to contain the body – code that performs a distinct task, described by the function’s name.

Effective functions are short and perform single, discrete operations.

You will recall that in the ny_trips table there is a variable called birth_year. From this we can derive cyclists’ approximate age in years. Below is a function called get_age(). The function expects two arguments: yob – a year of birth as type chr; and yref – a reference year. In the body, lubridate’s as.period() function is used to calculate the time in years that elapsed between these dates.

# get_age() depends on lubridate.
library(lubridate)

# Calculate time elapsed between two dates in years (age).
# yob : datetime object recording birth year.
# yref : datetime object recording reference year.
get_age <- function(yob, yref) {
    period <- as.period(interval(yob, yref),unit = "year")
    return(period$year)
}

ny_trips <- ny_trips |>
  # Calculate age from birth_date.
  mutate(
    age=get_age(
      as.POSIXct(birth_year, format="%Y"),
      as.POSIXct("2020", format="%Y")
    )
  )

We can use the two new derived variables – distance travelled and age – in our analysis. In Figure 2.4, we explore how approximate travel speeds vary by age, trip distance and customer type. Again the average speed calculation should be treated cautiously as it is based on straight line distances, and it is likely that this will vary depending on whether the trip is made for ‘utilitarian’ or ‘leisure’ purposes. Additionally, due to the heavy subsetting data become a little volatile for certain age groups, and so the age variable is aggregated into 5-year bands.

There are some notable patterns in Figure 2.4. Subscribers make faster trips than do customers, although this gap narrows as trip distance increases. Trips with a straight-line distance of 4.5km are non-trivial and so may be classed as utilitarian even for non-regular customers. There is a very slight effect of decreasing trip speed by age cycled for the longer trips. The volatility in the older age groups for trips >4.5km suggests more data, and a more involved analysis, is required to confidently establish this. For example, it may be that the comparatively rare occurrence of trips in the 65-70 age group is made by only a small subset of cyclists; with a larger dataset we may expect a regression to the mean effect that negates noise caused by outlier individuals.

Figure 2.4: Citibike average trip speeds (approximate) by age, customer type and straight-line trip distance.

Code

# Generate staged dataset for plotting. 
# Filter weekday-only trips, less than 1hr in duration, 
# exclude trips that start and end at same docking station, 
# and generate binned age and distance variables.  
temp_data <- ny_trips |> 
  mutate(
    day=wday(start_time, label=TRUE), 
    is_weekday=as.numeric(!day %in% c("Sat", "Sun"))
    ) |>
  filter(
    is_weekday==1,
    start_station_id!=end_station_id,
    duration_minutes<=60,
    between(age, 16, 74),
    dist>.5
    ) |> 
  mutate(
    dist_bands=case_when(
      dist < 1.5 ~ "<1.5km",
      dist < 3 ~ ">1.5-3km",
      dist < 4.5 ~ ">3-4.5km",
      TRUE ~ ">4.5km"),
    age_band=if_else(age %% 10 > 4, ceiling(age/5)*5, 
      floor(age/5)*5),
    speed=dist/(duration_minutes/60)
  )  |> 
  group_by(user_type, age_band, dist_bands) |> 
    summarise(avg_speed=mean(speed), sample_size=n(), 
    std=sd(speed)) |> 
  ungroup() 

# Plot.
temp_data |> 
  ggplot(aes(x=age_band, y=avg_speed))+
  geom_line(aes(colour=user_type)) +
  scale_colour_manual(values=c("#e31a1c", "#1f78b4")) +
  scale_fill_manual(values=c("#e31a1c", "#1f78b4")) +
  facet_wrap(~dist_bands, nrow=1) +
  labs(
    x="age", y="speed - km/h ", fill="user type", 
    colour="user type"
  )

# Remove staging dataset.
rm(temp_data)

2.3.3 Tidy

The ny_trips and ny_stations data already comply with the rules for tidy data (Wickham 2014). Each row in ny_trips is a distinct trip and each row in ny_stations a distinct station. However, it is common to encounter datasets that are untidy and must be reshaped. In the book’s data repository are two examples: ny_spread_rows and ny_spread_columns. ny_spread_rows is so-called because the variable summary_type is spread across the rows (observations); ny_spread_columns because multiple variables are stored in single columns – the dist_weekday, duration_weekday columns.

ny_spread_rows
## # A tibble: 411,032 × 6
##    o_station d_station wkday   count summary_type  value
##        <dbl>     <dbl> <chr>   <dbl> <chr>         <dbl>
##  1        72       116 weekend     1 dist           1.15
##  2        72       116 weekend     1 duration      18.2
##  3        72       127 weekend     4 dist           7.18
##  4        72       127 weekend     4 duration     122.
##  5        72       146 weekend     4 dist           9.21
##  6        72       146 weekend     4 duration     122.
##  7        72       164 weekend     1 dist           2.66
##  8        72       164 weekend     1 duration      12.5
##  9        72       173 weekend     2 dist           2.13
## 10        72       173 weekend     2 duration      43.6
## # … with 411,022 more rows

ny_spread_columns
## # A tibble: 156,449 × 8
##    o_stat d_stat ct_wend ct_wdy dst_wnd dst_wdy dur_wnd dur_wdy
##     <dbl> <dbl>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>   <dbl>
##  1    72   116       1      3    1.15   3.45    18.2     49.9
##  2    72   127       4      4    7.18   7.18    122.     101.
##  3    72   146       4      2    9.21   4.61    122.     64.1
##  4    72   164       1      1    2.66   2.66    12.5     43.2
##  5    72   173       2     13    2.13   13.9    43.6    189.
##  6    72   195       1      4    2.56   10.2    24.7     98.3
##  7    72   212       3      3    4.83   4.83    40.3     54.0
##  8    72   223       1     NA    1.13     NA    21.1     NA
##  9    72   228       2      1    4.97   2.49    30.2     13.6
## 10    72   229       1     NA    1.22     NA    39.2     NA
## # … with 156,439 more rows, and abbreviated variable names

To re-organise the table in tidy form, we should identify what constitutes a distinct observation – an origin-destination pair summarising counts, distances and durations of trips that occur during the weekday or weekend. From here, the table’s variables are:

o_station: station id of the origin station
d_station: station id of the destination station
wkday: trip occurs on weekday or weekend
count: count of trips for observation type
dist: total straight-line distance in km (cumulative) of trips for observation type
duration: total duration in minutes (cumulative) of trips for observation type

There are two functions for reshaping untidy data, from the tidyr package: pivot_longer() and pivot_wider(). pivot_longer() is used to tidy data in which observations are spread across columns; pivot_wider() to tidy data in which variables are spread across rows. The functions are especially useful in visual data analysis to fix messy data, but also to flexibly reshape data supplied to ggplot2 specifications (more on this in Chapters 3 and 4).

To fix ny_spread_rows, we use pivot_wider() and pass to the function’s arguments the name of the problematic column and the column containing values used to populate the newly created columns.

ny_spread_rows |>
   pivot_wider(names_from=summary_type, values_from=value)

## # A tibble: 205,516 × 6
##    o_station d_station wkday   count  dist duration
##        <dbl>     <dbl> <chr>   <dbl> <dbl>    <dbl>
##  1        72       116 weekend     1  1.15     18.2
##  2        72       127 weekend     4  7.18    122.
##  3        72       146 weekend     4  9.21    122.
##  4        72       164 weekend     1  2.66     12.5
##  5        72       173 weekend     2  2.13     43.6
##  6        72       195 weekend     1  2.56     24.7
##  7        72       212 weekend     3  4.83     40.3
##  8        72       223 weekend     1  1.13     21.1
##  9        72       228 weekend     2  4.97     30.2
## 10        72       229 weekend     1  1.22     39.2
## # … with 205,506 more rows

To fix ny_spread_columns requires a little more thought. First we pivot_longer() on columns that are muddled with multiple variables. This results in a long and thin dataset similar to ny_spread_rows – each row is the origin-destination pair with either a count, distance or duration recorded for trips occurring on weekends or weekdays. The muddled variables, for example dist_weekend duration_weekday, now appear in the rows of a new column with the default title name. This column is separated on the _ mark to create two new columns, summary_type and wkday, used in pivot_wider().

ny_spread_columns |>
  pivot_longer(cols = count_weekend:duration_weekday) |>
  separate(col = name, into = c("summary_type", "wkday"), sep = "_") |>
  pivot_wider(names_from = summary_type, values_from = value)

## # A tibble: 312,898 × 6
##    o_station d_station wkday   count  dist duration
##        <dbl>     <dbl> <chr>   <dbl> <dbl>    <dbl>
##  1        72       116 weekend     1  1.15     18.2
##  2        72       116 weekday     3  3.45     49.9
##  3        72       127 weekend     4  7.18    122.
##  4        72       127 weekday     4  7.18    101.
##  5        72       146 weekend     4  9.21    122.
##  6        72       146 weekday     2  4.61     64.1
##  7        72       164 weekend     1  2.66     12.5
##  8        72       164 weekday     1  2.66     43.2
##  9        72       173 weekend     2  2.13     43.6
## 10        72       173 weekday    13 13.9     189.
## # … with 312,888 more rows

Task 2

Figure 2.2 uses a derived dataset that summarises trip counts by user_type and day_of_week. This dataset is created in the template file for the chapter and is named ny_temporal. Each observation is a trip count on a day of week, hour of day and for a given user type (Customer or Subscriber).

ny_temporal <- ny_trips |>
  mutate(
    day=wday(start_time, label=TRUE),
    hour=hour(start_time)) |>
  group_by(user_type, day, hour) |>
  summarise(count=n()) |>
  ungroup()

To explore whether customers and subscribers have different usage behaviours, we calculate the proportion of trips made by day of week for these two user groups. Customers, as expected, contribute a greater relative number of trips on weekends than do subscribers.

# # A tibble: 7 × 3
#   day   Customer Subscriber
#   <ord>    <dbl>      <dbl>
# 1 Sun     0.198       0.144
# 2 Mon     0.137       0.163
# 3 Tue     0.144       0.172
# 4 Wed     0.104       0.125
# 5 Thu     0.0973      0.122
# 6 Fri     0.135       0.138
# 7 Sat     0.185       0.136

Can you write some dplyr code to generate such a summary? There are several possible approaches, but you will need to think about which variables to group_by() and summarise() over, and you may need to pivot_wider() your dataset in order to compare the user types side-by-side.

2.4 Conclusions

Developing the vocabulary and technical skills to systematically describe and organise data is crucial to modern data analysis. This chapter has covered the fundamentals: that data consist of observations and variables of different types (Stevens 1946) and that in order to work effectively with datasets, these data should be organised such that they are tidy (Wickham 2014). Most of the chapter content was dedicated to the techniques that enable these concepts to be operationalised. We covered how to download, transform and reshape a reasonably large dataset from New York’s Citibike system. In doing so, we generated insights that might inform further data collection and analysis activity. In the next chapter we will extend this conceptual and technical knowledge as we introduce the fundamentals of visual data analysis and ggplot2’s grammar of graphics.

2.5 Further Reading

There are many accessible introductions to the tidyverse for modern data analysis. Two excellent resources:

Wickham, H., Çetinkaya-Rundel, M., Grolemund, G. 2023, “R for Data Science, 2nd Edition”, Sebastopol, CA: O’Reilly.
- Chapter 3.
Ismay, C. and Kim, A. 2020. “Statistical Inference via Data Science: A ModernDive into R and the Tidyverse”, New York, NY: CRC Press. doi: 10.1201/9780367409913.
- Chapters 3, 4.

Hadley Wickham’s original paper on Tidy Data:

Wickham, H. 2010. “Tidy Data” Journal of Statistical Software, 59(10): 1–23. doi: 10.18637/jss.v059.i10.

BBC Visual and Data Journalism Team. 2019. “BBC Visual and Data Journalism Cookbook for R Graphics.” https://github.com/bbc/rcookbook.

Beecham, R., N. Williams, and L. Comber. 2020. “Regionally-Structured Explanations Behind Area-Level Populism: An Update to Recent Ecological Analyses.” PLOS One 15 (3): e0229974. https://doi.org/10.1371/journal.pone.0229974.

Buja, A., D. Cook, H. Hofmann, M. Lawrence, E-K Lee, D. Swayne, and H. Wickham. 2009. “Statistical Inference for Exploratory Data Analysis and Model Diagnostics.” Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 367 (1906): 4361–83. https://doi.org/10.1098/rsta.2009.0120.

Burn-Murdoch, J. 2023. “Making Charts That Make an Impact.” Invited talk, Data Visualization Society’s Outlier Conference. https://www.youtube.com/watch?v=tIbaQUo6H9g&ab_channel=DataVisualizationSociety.

Comber, A., C. Brunsdon, M. Charlton, G. Dong, R. Harris, B. Lu, Y. Lü, et al. 2023. “A Route Map for Successful Applications of Geographically Weighted Regression.” Geographical Analysis 55 (1): 155–78. https://doi.org/10.1111/gean.12316.

Franconeri, S. L., L. M. Padilla, P. Shah, J. M. Zacks, and J. Hullman. 2021. “The Science of Visual Data Communication: What Works.” Psychological Science in the Public Interest 22 (3): 110–61. https://doi.org/10.1177/15291006211051956 .

Healy, K. 2019. Data Visualization: A Practical Introduction. Princeton, NJ: Princeton University Press. https://socviz.co.

Hullman, J., and A. Gelman. 2021. “Designing for Interactive Exploratory Data Analysis Requires Theories of Graphical Inference.” Harvard Data Science Review 3 (3).

Ismay, C., and A. Kim. 2020. Statistical Inference via Data Science: A ModernDive into R and the Tidyverse. New York, NY: CRC Press. https://doi.org/10.1201/9780367409913.

Kale, A., F. Nguyen, M. Kay, and J. Hullman. 2019. “Hypothetical Outcome Plots Help Untrained Observers Judge Trends in Ambiguous Data.” IEEE Transactions on Visualization and Computer Graphics 25 (1): 892–902. https://doi.org/10.1109/TVCG.2018.2864909.

Kay, M. 2021. “Uncertainty Visualization as a Moral Imperative.” Invited talk, BostonCHI meeting. https://www.youtube.com/watch?v=mfQ3QVyw4N0&ab_channel=BostonCHI.

Kosara, R. 2023. “Lesson 4: Presentation, Uncertainty, ISOTYPE.” ObservableHQ Notebook; ObservableHQ. https://observablehq.com/@observablehq/lesson-4-presentation-uncertainty-isotype?collection=@observablehq/advanced-data-vis-course.

Pebesma, E. 2018. “Simple Features for R: Standardized Support for Spatial Vector Data.” The R Journal 10 (1): 439–46. https://doi.org/10.32614/RJ-2018-009.

Scherer, C. 2023. “Designing Data Visualizations to Successfully Tell a Story.” Workshop at Posit::conf(2023), Chicago, IL. https://posit-conf-2023.github.io/dataviz-storytelling/.

Stevens, S. 1946. “On the Theory of Scales of Measurement.” Science 103 (2684): 677–80. https://doi.org/10.1126/science.103.2684.677.

The Turing Way Community. 2025. “The Turing Way: A Handbook for Reproducible, Ethical and Collaborative Research.” Zenodo. https://doi.org/10.5281/zenodo.15213042.

Wickham, H. 2014. “Tidy Data.” Journal of Statistical Software 59 (10): 1–23. https://doi.org/10.18637/jss.v059.i10.

Wilkinson, L. 1999. The Grammar of Graphics. New York, NY: Springer.

Wolf, L. J., L. Anselin, D. Arribas-Bel, and L. Rivers Mobley. 2021. “On Spatial and Platial Dependence: Examining Shrinkage in Spatially Dependent Multilevel Models.” Annals of the American Association of Geographers 111 (6): 1679–91. https://doi.org/10.1080/24694452.2020.1841602.

Yang, F., M. Cau, C. Mortenson, H. Fakhari, A. D. Lokmanoglu, J. Hullman, S. Franconeri, N. Diakopoulos, E. C. Nisbet, and M. Kay. 2024. “Swaying the Public? Impacts of Election Forecast Visualizations on Emotion, Trust, and Intention in the 2022 U.S. Midterms.” IEEE Transactions on Visualization and Computer Graphics 30 (1): 23–33. https://doi.org/10.1109/TVCG.2023.3327356.

https://vis4sds.github.io/vis4sds/files/02-template.qmd↩︎
https://github.com/vis4sds/data↩︎

2.1 Introduction

2.2 Concepts

2.2.1 Data frames

2.2.2 Types of variable

2.2.3 Types of observation

2.2.4 Tidy data

2.3 Techniques

2.3.1 Import

2.3.2 Manipulate

Manipulate with dplyr and pipes (|>)

count() and summarise() over rows

Manipulate dates with lubridate

Relate tables with join()

Write functions of your own

2.3.3 Tidy

2.4 Conclusions

2.5 Further Reading

Manipulate with `dplyr` and pipes (`|>`)

`count()` and `summarise()` over rows

Manipulate dates with `lubridate`

Relate tables with `join()`