32 The data-plot-pipeline

As Chapter 2.1 first introduced, we can express multi-layer plotly graphs as a sequence (or, more specifically, a directed acyclic graph) of dplyr data manipulations and mappings to visuals. For example, to create Figure 32.1, we could group txhousing by city to ensure the first layer of add_lines() draws a different line for each city, then filter() down to Houston so that the second call to add_lines() draws only Houston.

allCities <- txhousing %>%
  group_by(city) %>%
  plot_ly(x = ~date, y = ~median) %>%
  add_lines(alpha = 0.2, name = "Texan Cities", hoverinfo = "none")
  
allCities %>%
  filter(city == "Houston") %>%
  add_lines(name = "Houston")

FIGURE 32.1: Monthly median house prices over time for 46 Texan cities (in blue). Houston is highlighted in orange.

Sometimes the directed acyclic graph property of a magrittr pipeline can be too restrictive for certain types of plots. In this example, after filtering the data down to Houston, there is no way to recover the original data inside the pipeline. The add_fun() function helps to work-around this restriction⁴⁰ – it works by applying a function to the plotly object, but does not affect the data associated with the plotly object. This effectively provides a way to isolate data transformations within the pipeline⁴¹. Figure 32.2 uses this idea to highlight both Houston and San Antonio.

allCities %>%
  add_fun(function(plot) {
    plot %>% filter(city == "Houston") %>% add_lines(name = "Houston")
  }) %>%
  add_fun(function(plot) {
    plot %>% filter(city == "San Antonio") %>% 
      add_lines(name = "San Antonio")
  })

FIGURE 32.2: Monthly median house price in Houston and San Antonio in comparison to other Texan cities.

It is useful to think of the function supplied to add_fun() as a “layer” function – a function that accepts a plot object as input, possibly applies a transformation to the data, and maps that data to visual objects. To make layering functions more modular, flexible, and expressive, the add_fun() allows you to pass additional arguments to a layer function. Figure 32.3 makes use of this pattern, by creating a reusable function for layering both a particular city as well as the first, second, and third quartile of median monthly house sales (by city).

# reusable function for highlighting a particular city
layer_city <- function(plot, name) {
  plot %>% filter(city == name) %>% add_lines(name = name)
}

# reusable function for plotting overall median & IQR
layer_iqr <- function(plot) {
  plot %>%
    group_by(date) %>% 
    summarise(
      q1 = quantile(median, 0.25, na.rm = TRUE),
      m = median(median, na.rm = TRUE),
      q3 = quantile(median, 0.75, na.rm = TRUE)
    ) %>%
    add_lines(y = ~m, name = "median", color = I("black")) %>%
    add_ribbons(ymin = ~q1, ymax = ~q3, name = "IQR", color = I("black"))
}

allCities %>%
  add_fun(layer_iqr) %>%
  add_fun(layer_city, "Houston") %>%
  add_fun(layer_city, "San Antonio")

FIGURE 32.3: First, second, and third quartile of median monthly house price in Texas.

A layering function does not have to be a data-plot-pipeline itself. Its only requirement on a layering function is that the first argument is a plot object and it returns a plot object. This provides an opportunity to say, fit a model to the plot data, extract the model components you desire, and map those components to visuals. Furthermore, since plotly’s add_*() functions don’t require a data.frame, you can supply those components directly to attributes (as long as they are well-defined), as done in Figure 32.4 via the forecast package (Hyndman 2018).

library(forecast)
layer_forecast <- function(plot) {
  d <- plotly_data(plot)
  series <- with(d, 
    ts(median, frequency = 12, start = c(2000, 1), end = c(2015, 7))
  )
  fore <- forecast(ets(series), h = 48, level = c(80, 95))
  plot %>%
    add_ribbons(x = time(fore$mean), ymin = fore$lower[, 2],
                ymax = fore$upper[, 2], color = I("gray95"), 
                name = "95% confidence", inherit = FALSE) %>%
    add_ribbons(x = time(fore$mean), ymin = fore$lower[, 1],
                ymax = fore$upper[, 1], color = I("gray80"), 
                name = "80% confidence", inherit = FALSE) %>%
    add_lines(x = time(fore$mean), y = fore$mean, color = I("blue"), 
              name = "prediction")
}

txhousing %>%
  group_by(city) %>%
  plot_ly(x = ~date, y = ~median) %>%
  add_lines(alpha = 0.2, name = "Texan Cities", hoverinfo = "none") %>%
  add_fun(layer_iqr) %>%
  add_fun(layer_forecast)

FIGURE 32.4: Layering on a 4-year forecast from a exponential smoothing state space model.

In summary, the “data-plot-pipeline” is desirable for a number of reasons: (1) makes your code easier to read and understand, (2) encourages you to think of both your data and plots using a single, uniform data structure, which (3) makes it easy to combine and reuse transformations.

References

Hyndman, Rob J. 2018. Forecast: Forecasting Functions for Time Series and Linear Models. http://github.com/robjhyndman/forecast.

Credit to Winston Chang and Hadley Wickham for this idea. The add_fun() is very much like layer_f() function in ggvis.↩
Also, effectively putting a pipeline inside a pipeline ↩