32 The data-plot-pipeline
As Chapter 2.1 first introduced, we can express multi-layer plotly graphs as a sequence (or, more specifically, a directed acyclic graph) of dplyr data manipulations and mappings to visuals. For example, to create Figure 32.1, we could group txhousing by city to ensure the first layer of add_lines() draws a different line for each city, then filter() down to Houston so that the second call to add_lines() draws only Houston.
library(plotly)  # also attaches ggplot2, which provides txhousing
library(dplyr)

allCities <- txhousing %>%
  group_by(city) %>%
  plot_ly(x = ~date, y = ~median) %>%
  add_lines(alpha = 0.2, name = "Texan Cities", hoverinfo = "none")

allCities %>%
  filter(city == "Houston") %>%
  add_lines(name = "Houston")
Sometimes the directed acyclic graph property of a magrittr pipeline can be too restrictive for certain types of plots. In this example, after filtering the data down to Houston, there is no way to recover the original data inside the pipeline. The add_fun() function helps to work around this restriction [40]: it applies a function to the plotly object, but does not affect the data associated with the plotly object. This effectively provides a way to isolate data transformations within the pipeline [41]. Figure 32.2 uses this idea to highlight both Houston and San Antonio.
allCities %>%
  add_fun(function(plot) {
    plot %>% filter(city == "Houston") %>% add_lines(name = "Houston")
  }) %>%
  add_fun(function(plot) {
    plot %>% filter(city == "San Antonio") %>%
      add_lines(name = "San Antonio")
  })
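One way to check that add_fun() leaves the plot's data untouched is to compare what plotly_data() returns before and after each step. The following is a minimal sketch, assuming the plotly and dplyr packages are installed; the object names filtered and isolated are illustrative only.

```r
library(plotly)  # also attaches ggplot2, which provides txhousing
library(dplyr)

allCities <- txhousing %>%
  group_by(city) %>%
  plot_ly(x = ~date, y = ~median) %>%
  add_lines(alpha = 0.2, name = "Texan Cities", hoverinfo = "none")

# Filtering directly in the pipeline replaces the plot's current data
filtered <- allCities %>%
  filter(city == "Houston") %>%
  add_lines(name = "Houston")

# Wrapping the same steps in add_fun() keeps the filter local to the layer
isolated <- allCities %>%
  add_fun(function(plot) {
    plot %>% filter(city == "Houston") %>% add_lines(name = "Houston")
  })

# Number of rows in the data attached to each plot object
n_all      <- nrow(plotly_data(allCities))
n_filtered <- nrow(plotly_data(filtered))
n_isolated <- nrow(plotly_data(isolated))
```

If add_fun() behaves as described, n_filtered should be smaller than n_all, while n_isolated should equal n_all, so any layer added after the add_fun() call still sees every city.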
It is useful to think of the function supplied to add_fun() as a “layer” function: a function that accepts a plot object as input, possibly applies a transformation to the data, and maps that data to visual objects. To make layering functions more modular, flexible, and expressive, add_fun() allows you to pass additional arguments to a layer function. Figure 32.3 makes use of this pattern by creating reusable functions for layering a particular city as well as the first, second, and third quartiles (across cities) of monthly median house prices.
# reusable function for highlighting a particular city
layer_city <- function(plot, name) {
  plot %>% filter(city == name) %>% add_lines(name = name)
}

# reusable function for plotting overall median & IQR
layer_iqr <- function(plot) {
  plot %>%
    group_by(date) %>%
    summarise(
      q1 = quantile(median, 0.25, na.rm = TRUE),
      m = median(median, na.rm = TRUE),
      q3 = quantile(median, 0.75, na.rm = TRUE)
    ) %>%
    add_lines(y = ~m, name = "median", color = I("black")) %>%
    add_ribbons(ymin = ~q1, ymax = ~q3, name = "IQR", color = I("black"))
}

allCities %>%
  add_fun(layer_iqr) %>%
  add_fun(layer_city, "Houston") %>%
  add_fun(layer_city, "San Antonio")
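Because a layer function is an ordinary R function, the extra arguments forwarded by add_fun() can carry more than a single value. As a sketch (layer_cities and its cities argument are hypothetical names introduced here, not part of plotly), the function below reuses layer_city to highlight several cities with one call:

```r
library(plotly)  # also attaches ggplot2, which provides txhousing
library(dplyr)

# same layer function as above: highlight one city
layer_city <- function(plot, name) {
  plot %>% filter(city == name) %>% add_lines(name = name)
}

# hypothetical helper: apply layer_city once per requested city
layer_cities <- function(plot, cities) {
  for (nm in cities) {
    # each add_fun() call isolates its own filter(), so the next
    # iteration starts again from the full data
    plot <- add_fun(plot, layer_city, nm)
  }
  plot
}

allCities <- txhousing %>%
  group_by(city) %>%
  plot_ly(x = ~date, y = ~median) %>%
  add_lines(alpha = 0.2, name = "Texan Cities", hoverinfo = "none")

highlighted <- allCities %>%
  add_fun(layer_cities, c("Houston", "San Antonio", "Austin"))
```

Looping over add_fun() rather than calling layer_city directly matters here: without it, the second city would be filtered from data already reduced to the first.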
A layering function does not have to be a data-plot-pipeline itself. The only requirement on a layering function is that its first argument is a plot object and that it returns a plot object. This provides an opportunity to, say, fit a model to the plot data, extract the model components you desire, and map those components to visuals. Furthermore, since plotly’s add_*() functions don’t require a data.frame, you can supply those components directly to attributes (as long as they are well-defined), as done in Figure 32.4 via the forecast package (Hyndman 2018).
library(forecast)

layer_forecast <- function(plot) {
  d <- plotly_data(plot)
  series <- with(d,
    ts(median, frequency = 12, start = c(2000, 1), end = c(2015, 7))
  )
  fore <- forecast(ets(series), h = 48, level = c(80, 95))
  plot %>%
    add_ribbons(x = time(fore$mean), ymin = fore$lower[, 2],
                ymax = fore$upper[, 2], color = I("gray95"),
                name = "95% confidence", inherit = FALSE) %>%
    add_ribbons(x = time(fore$mean), ymin = fore$lower[, 1],
                ymax = fore$upper[, 1], color = I("gray80"),
                name = "80% confidence", inherit = FALSE) %>%
    add_lines(x = time(fore$mean), y = fore$mean, color = I("blue"),
              name = "prediction")
}

txhousing %>%
  group_by(city) %>%
  plot_ly(x = ~date, y = ~median) %>%
  add_lines(alpha = 0.2, name = "Texan Cities", hoverinfo = "none") %>%
  add_fun(layer_iqr) %>%
  add_fun(layer_forecast)
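The same pattern should work with any model whose output can be reduced to plain vectors. As a lighter-weight sketch that needs only base R's lm(), the hypothetical layer_trend() function below fits a straight line to the pooled median prices and passes the fitted values directly to add_lines(). The linear fit is chosen for brevity and is not the forecasting approach used for Figure 32.4.

```r
library(plotly)  # also attaches ggplot2, which provides txhousing
library(dplyr)

# hypothetical layer: overall linear trend of median price over time
layer_trend <- function(plot) {
  d <- plotly_data(plot)
  d <- d[!is.na(d$median), ]            # drop months with no reported median
  fit <- lm(median ~ date, data = d)
  grid <- data.frame(
    date = seq(min(d$date), max(d$date), length.out = 50)
  )
  # plain vectors are supplied directly; inherit = FALSE keeps the layer
  # from picking up the attributes defined in plot_ly()
  plot %>%
    add_lines(x = grid$date, y = predict(fit, newdata = grid),
              name = "linear trend", color = I("red"), inherit = FALSE)
}

withTrend <- txhousing %>%
  group_by(city) %>%
  plot_ly(x = ~date, y = ~median) %>%
  add_lines(alpha = 0.2, name = "Texan Cities", hoverinfo = "none") %>%
  add_fun(layer_trend)
```

Since layer_trend() takes a plot as its first argument and returns a plot, it satisfies the layering-function requirement even though it contains no dplyr verbs.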
In summary, the “data-plot-pipeline” is desirable for a number of reasons: it (1) makes your code easier to read and understand, and (2) encourages you to think of both your data and plots using a single, uniform data structure, which (3) makes it easy to combine and reuse transformations.
References
Hyndman, Rob J. 2018. Forecast: Forecasting Functions for Time Series and Linear Models. http://github.com/robjhyndman/forecast.
[40] Credit to Winston Chang and Hadley Wickham for this idea. The add_fun() function is very much like the layer_f() function in ggvis.
[41] Also, effectively putting a pipeline inside a pipeline.