16 Client-side linking

16.1 Graphical queries

This section focuses on a particular approach to linking views known as graphical (database) queries using the R package plotly. With plotly, one can write R code to pose graphical queries that operate entirely client-side in a web browser (i.e., no special web server or callback to R is required). In addition to teaching you how to pose queries with the highlight_key() function, this section shows you how to control how queries are triggered and visually rendered via the highlight() function.

Figure 16.1 shows a scatterplot of the relationship between weight and miles per gallon of 32 cars. It also uses highlight_key() to assign the number of cylinders to each point so that when a particular point is ‘queried’ all points with the same number of cylinders are highlighted (the number of cylinders is displayed with text just for demonstration purposes). By default, a mouse click triggers a query, and a double-click clears the query, but both of these events can be customized through the highlight() function. By typing help(highlight) in your R console, you can learn more about what events are supported for turning graphical queries on and off.

library(plotly)
mtcars %>%
  highlight_key(~cyl) %>%
  plot_ly(
    x = ~wt, y = ~mpg, text = ~cyl, mode = "markers+text", 
    textposition = "top", hoverinfo = "x+y"
  ) %>%
  highlight(on = "plotly_hover", off = "plotly_doubleclick")

FIGURE 16.1: A visual depiction of how highlight_key() attaches metadata to graphical elements to enable graphical database queries. Each point represents a different car and the number of cylinders (cyl) is assigned as metadata so that when a particular point is queried all points with the same number of cylinders are highlighted.

Generally speaking, highlight_key() assigns data values to graphical marks so that when graphical mark(s) are directly manipulated through the on event, it uses the corresponding data values (call it $SELECTION_VALUE) to perform an SQL query of the following form.

SELECT * FROM mtcars WHERE cyl IN $SELECTION_VALUE
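The SQL form above can be mimicked in plain R to make the semantics concrete. This is only a sketch of the idea, not how plotly evaluates the query in the browser; `selection_value` is a hypothetical stand-in for whatever value(s) the `on` event would supply.

```r
# Hypothetical selection: the user queried a point belonging to a 4-cylinder car
selection_value <- 4

# R analogue of: SELECT * FROM mtcars WHERE cyl IN $SELECTION_VALUE
queried <- subset(mtcars, cyl %in% selection_value)

# every returned row shares the queried number of cylinders
nrow(queried)
unique(queried$cyl)
```

Because the condition uses `%in%`, the same expression also handles a selection containing several values, e.g. `selection_value <- c(4, 6)`.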

For a more useful example, let’s use graphical querying to pose interactive queries of the txhousing dataset. This data contains monthly housing sales in Texan cities acquired from the TAMU real estate center and made available via the ggplot2 package. Figure 16.2 shows the median house price in each city over time, which produces a rather busy (spaghetti) plot. To help combat the overplotting, we could add the ability to click a given point on a line to highlight that particular city. This interactive ability is enabled by simply using highlight_key() to declare that the city variable be used as the querying criteria within the graphical querying framework.

One subtlety to be aware of in terms of what makes Figure 16.2 possible is that every point along a line may have a different data value assigned to it. In this case, since the city column is used as both the visual grouping and querying variable, we effectively get the ability to highlight a group by clicking on any point along that line. Section 16.4.1 has examples of using different grouping and querying variables to query multiple related groups of visual geometries at once, which can be a powerful technique.

# load the `txhousing` dataset
data(txhousing, package = "ggplot2")

# declare `city` as the SQL 'query by' column
tx <- highlight_key(txhousing, ~city)

# initiate a plotly object
base <- plot_ly(tx, color = I("black")) %>% 
  group_by(city)

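# create a time series of median house price and store it as `time_series`,
# since later examples reuse it (`base` is already grouped by city)
time_series <- base %>%
  add_lines(x = ~date, y = ~median)
time_series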

FIGURE 16.2: Graphical query of housing prices in various Texas cities. The query in this particular example must be triggered through clicking directly on a time series.
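To see the point-level keys behind Figure 16.2, one can inspect the object returned by highlight_key(). The sketch below assumes the `key()` method of crosstalk's SharedData class, which returns one key per row of the input data; it is meant as a quick check, not as part of the plotting code.

```r
library(plotly)
data(txhousing, package = "ggplot2")

tx <- highlight_key(txhousing, ~city)

# one key per row (i.e., per point along a line), taken from the `city` column
keys <- tx$key()
length(keys) == nrow(txhousing)

# the distinct keys are the distinct cities, so any point on a line
# carries the name of the city that line represents
length(unique(keys)) == length(unique(txhousing$city))
```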

Querying a city via direct manipulation is somewhat helpful for focusing on a particular time series, but it’s not so helpful for querying a city by name and/or comparing multiple cities at once. As it turns out, plotly makes it easy to add a selectize.js powered dropdown widget for querying by name (aka indirect manipulation) by setting selectize = TRUE. When it comes to comparing multiple cities, we want to be able to both retain previous selections (persistent = TRUE) as well as control the highlighting color (dynamic = TRUE). The video linked in the caption of Figure 16.3 explains how to use these features to compare pricing across different cities.

highlight(time_series, on = "plotly_click", selectize = TRUE, dynamic = TRUE, persistent = TRUE)

FIGURE 16.3: Using a selectize dropdown widget to search for cities by name and comparing multiple cities through persistent selection with a dynamic highlighting color. For a visual and audio explanation, see https://vimeo.com/202647310.

By querying a few different cities in Figure 16.3, one obvious thing we can learn is that not every city has complete pricing information (e.g., South Padre Island, San Marcos, etc.). To learn more about which cities are missing information as well as how that missingness is structured, Figure 16.4 links a view of the raw time series to a dot-plot of the corresponding number of missing values per city. In addition to making it easy to see how cities rank in terms of missing house prices, it also provides a way to query the corresponding time series (i.e., reveal the structure of those missing values) by brushing cities in the dot-plot. This general pattern of linking aggregated views of the data to more detailed views fits the famous and practical information visualization advice from Shneiderman (1996): “Overview first, zoom and filter, then details on demand”.

# remember, `base` is a plotly object, but we can use dplyr verbs to
# manipulate the input data 
# (`txhousing` with `city` as a grouping and querying variable)
dot_plot <- base %>%
  summarise(miss = sum(is.na(median))) %>%
  filter(miss > 0) %>%
  add_markers(x = ~miss, y = ~forcats::fct_reorder(city, miss), hoverinfo = "x+y") %>%
  layout(
    xaxis = list(title = "Number of months missing"),
    yaxis = list(title = "")
  ) 

subplot(dot_plot, time_series, widths = c(0.2, 0.8), titleX = TRUE) %>%
  layout(showlegend = FALSE) %>%
  highlight(on = "plotly_selected", dynamic = TRUE, selectize = TRUE)

FIGURE 16.4: Linking a dot-plot of the number of missing housing prices with the raw time series. By brushing markers on the dot-plot, their raw time series is highlighted on the right hand side.

How does plotly know to highlight the time series when markers in the dot-plot are selected? The answer lies in what data values are embedded in the graphical markers via highlight_key(). When ‘South Padre Island’ is selected, like in Figure 16.5, it seems as though the logic says to simply change the color of any graphical elements that match that value, but the logic behind plotly’s graphical queries is a bit more subtle and powerful. Another, more accurate, framing of the logic is to first imagine a linked database query being performed behind the scenes (as in Figure 16.5). When ‘South Padre Island’ is selected, it first filters the aggregated dot-plot data down to just that one row, then it filters the raw time-series data down to every row with ‘South Padre Island’ as a city. The drawing logic will then call Plotly.addTraces() with the newly filtered data, which adds a new graphical layer representing the selection, allowing us to have finely-tuned control over the visual encoding of the data query.

FIGURE 16.5: A diagram of the graphical querying framework underlying Figure 16.4.
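The two filtering steps in this framing can be written out with dplyr. This is a conceptual sketch only: plotly performs the equivalent work in JavaScript in the browser, and the names `selection`, `dot_plot_rows`, and `time_series_rows` are made up for illustration.

```r
library(dplyr)
data(txhousing, package = "ggplot2")

# hypothetical value emitted by a selection event in the dot-plot
selection <- "South Padre Island"

# the aggregated data behind the dot-plot: missing months per city
missing_by_city <- txhousing %>%
  group_by(city) %>%
  summarise(miss = sum(is.na(median))) %>%
  filter(miss > 0)

# step 1: filter the aggregated view down to the queried key
dot_plot_rows <- filter(missing_by_city, city %in% selection)

# step 2: filter the raw time series down to every row with the same key;
# these rows are what the new highlighting layer would be drawn from
time_series_rows <- filter(txhousing, city %in% selection)
```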

The biggest advantage of drawing an entirely new graphical layer with the filtered data is that it becomes easy to leverage statistical trace types for producing summaries that are conditional on the query. Figure 16.6 leverages this functionality to dynamically produce probability densities of house price in response to query events. Section 16.4.2 has more examples of leveraging statistical trace types with graphical queries.

hist <- base %>% add_histogram(x = ~median, histnorm = "probability density")
subplot(time_series, hist, nrows = 2) %>%
  layout(barmode = "overlay", showlegend = FALSE) %>%
  highlight(dynamic = TRUE, selectize = TRUE, selected = attrs_selected(opacity = 0.3))

FIGURE 16.6: Linking house prices as a function of time with their probability density estimates.

Another neat consequence of drawing a completely new layer is that we can control the plotly.js attributes in that layer through the selected argument of the highlight() function. In Figure 16.6 we use it to ensure the new highlighting layer has some transparency, which makes it easier to compare the city-specific distribution to the overall distribution.

This section is designed to help give you a foundation for leveraging graphical queries in your own work. Hopefully by now you have a rough idea what graphical queries are, how they can be useful, and how to create them with highlight_key() and highlight(). Understanding the basic idea is one thing, but applying it effectively to new problems is another thing entirely. To help spark your imagination and demonstrate what’s possible, Section 16.4 has numerous subsections each with numerous examples of graphical queries in action.

16.2 Highlight versus filter events

Section 16.1 provides an overview of plotly’s framework for highlight events, but it also supports filter events. These events trigger slightly different logic:

  • A highlight event dims the opacity of existing marks, then adds an additional graphical layer representing the selection.
  • A filter event completely removes existing marks and rescales axes to the remaining data.

Figure 16.7 provides a quick visual depiction of the difference between filter and highlight events. At least currently, filter events must be fired from filter widgets from the crosstalk package, and these widgets expect an object of class SharedData as input. As it turns out, the highlight_key() function, introduced in Section 16.1, creates a SharedData instance and is essentially a wrapper for crosstalk::SharedData$new().

class(highlight_key(mtcars))
#> [1] "SharedData" "R6"
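As a quick check of that equivalence, the sketch below builds the same object both ways and compares them. It assumes the `key()` method of the SharedData class; the two objects still differ in their (randomly generated) group names, so they are equivalent in kind rather than identical.

```r
library(plotly)

via_plotly <- highlight_key(mtcars)
via_crosstalk <- crosstalk::SharedData$new(mtcars)

# both are SharedData (R6) instances
identical(class(via_plotly), class(via_crosstalk))

# with no key supplied, both fall back to one unique key per row
identical(via_plotly$key(), via_crosstalk$key())
anyDuplicated(via_plotly$key()) == 0
```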

Figure 16.7 demonstrates the main difference in logic between filter and highlight events. Notice how, in the code implementation, the ‘querying variable’ definition for filter events is part of the filter widget. That is, city is defined as the variable of interest in filter_select(), not in the creation of tx. That is (intentionally) different from the approach for highlight events, where the ‘querying variable’ is a property of the dataset behind the graphical elements.

library(crosstalk)

# generally speaking, use a "unique" key for filter, 
# especially when you have multiple filters!
tx <- highlight_key(txhousing)
gg <- ggplot(tx) + geom_line(aes(date, median, group = city))
filter <- bscols(
  filter_select("id", "Select a city", tx, ~city),
  ggplotly(gg, dynamicTicks = TRUE),
  widths = c(12, 12)
)

tx2 <- highlight_key(txhousing, ~city, "Select a city")
gg <- ggplot(tx2) + geom_line(aes(date, median, group = city))
select <- highlight(
  ggplotly(gg, tooltip = "city"), 
  selectize = TRUE, persistent = TRUE
)

bscols(filter, select)

FIGURE 16.7: Comparing filter to highlight events. Filter events completely remove existing marks and rescale axes to the remaining data.

When using multiple filter widgets to filter the same dataset, as done in Figure 16.8, you should avoid referencing a non-unique querying variable (i.e., key-column) in the SharedData object used to populate the filter widgets. Remember that the default behavior of highlight_key() and SharedData$new() is to use the row-index (which is unique). This ensures the intersection of multiple filtering widgets queries the correct subset of data.

library(crosstalk)
tx <- highlight_key(txhousing)
widgets <- bscols(
  widths = c(12, 12, 12),
  filter_select("city", "Cities", tx, ~city),
  filter_slider("sales", "Sales", tx, ~sales),
  filter_checkbox("year", "Years", tx, ~year, inline = TRUE)
)
bscols(
  widths = c(4, 8), widgets, 
  plot_ly(tx, x = ~date, y = ~median, showlegend = FALSE) %>% 
    add_lines(color = ~city, colors = "black")
)

FIGURE 16.8: Filtering on multiple variables.
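Conceptually, the widgets in Figure 16.8 combine into a single row-level intersection, which is why a unique (row-index) key matters. The dplyr sketch below shows the subset such a combination would describe; the chosen cities, sales range, and years are hypothetical widget states, and the actual filtering happens client-side via crosstalk.

```r
library(dplyr)
data(txhousing, package = "ggplot2")

# hypothetical widget states
chosen_cities <- c("Austin", "Dallas")
sales_range <- c(500, 2000)
chosen_years <- c(2010, 2011)

# a row survives only if it passes every filter (an intersection of conditions)
filtered <- txhousing %>%
  filter(
    city %in% chosen_cities,
    between(sales, sales_range[1], sales_range[2]),
    year %in% chosen_years
  )
```

With a non-unique key such as `city`, a row-level condition like the sales range could not be expressed, since every row of a city would share one key.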

As Figure 16.9 demonstrates, filter and highlight events can work in conjunction with various htmlwidgets. In fact, since the semantics of filter are more well-defined than highlight, linking filter events across htmlwidgets via crosstalk should generally be more well-supported.

library(leaflet)

eqs <- highlight_key(quakes)
stations <- filter_slider("station", "Number of Stations", eqs, ~stations)

p <- plot_ly(eqs, x = ~depth, y = ~mag) %>% 
  add_markers(alpha = 0.5) %>% 
  highlight("plotly_selected")

map <- leaflet(eqs) %>% 
  addTiles() %>% 
  addCircles()

bscols(
  widths = c(6, 6, 3), 
  p, map, stations
)

FIGURE 16.9: Linking plotly and leaflet through both filter and highlight events.

When combining filter and highlight events, one (current) limitation to be aware of is that the highlighting variable has to be nested inside the filter variable(s). For example, in Figure 16.10, we can filter by continent and highlight by country, but there is currently no way to highlight by continent and filter by country.

library(gapminder)
g <- highlight_key(gapminder, ~country)
continent_filter <- filter_select("filter", "Select a continent", g, ~continent)

p <- plot_ly(g) %>%
  group_by(country) %>%
  add_lines(x = ~year, y = ~lifeExp, color = ~continent) %>%
  layout(xaxis = list(title = "")) %>%
  highlight(selected = attrs_selected(showlegend = FALSE))

bscols(continent_filter, p, widths = 12)

FIGURE 16.10: Combining filtering and highlighting with non-unique querying variables.
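The nesting requirement can be verified on the data before building the widgets. The following sketch checks that each country (the highlighting variable) falls in exactly one continent (the filter variable); the name `continents_per_country` is made up for this illustration.

```r
library(gapminder)
library(dplyr)

# count how many distinct continents each country appears in
continents_per_country <- gapminder %>%
  distinct(country, continent) %>%
  count(country, name = "n_continents")

# TRUE means country is nested inside continent
all(continents_per_country$n_continents == 1)
```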

16.3 Linking animated views

The graphical querying framework (Section 16.1) works in tandem with key-frame animations (Section 14). Figure 16.11 extends Figure 14.1 by layering on linear models specific to each frame and specifying continent as a key variable. As a result, one may interactively highlight any continent they wish, and track the relationship through the animation. In the animated version of Figure 16.11, the user highlights the Americas, which makes it much easier to see that the relationship between GDP per capita and life expectancy was very strong starting in the 1950s, but progressively weakened throughout the years.

g <- highlight_key(gapminder, ~continent)
gg <- ggplot(g, aes(gdpPercap, lifeExp, color = continent, frame = year)) +
  geom_point(aes(size = pop, ids = country)) +
  geom_smooth(se = FALSE, method = "lm") +
  scale_x_log10()
highlight(ggplotly(gg), "plotly_hover")

FIGURE 16.11: Highlighting the relationship between GDP per capita and life expectancy in the Americas and tracking that relationship through several decades.
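To look at the same per-frame relationship numerically, one could fit the linear model for the Americas in each year outside of the plot. This is a sketch under the assumption that the smoother in Figure 16.11 amounts to regressing life expectancy on log10 GDP per capita within each year and continent; the object names are made up.

```r
library(gapminder)

americas <- subset(gapminder, continent == "Americas")

# one linear model per animation frame (year)
fits <- lapply(
  split(americas, americas$year),
  function(d) lm(lifeExp ~ log10(gdpPercap), data = d)
)

# slope and R-squared of each yearly fit
slopes <- vapply(fits, function(f) coef(f)[[2]], numeric(1))
r_squared <- vapply(fits, function(f) summary(f)$r.squared, numeric(1))

data.frame(year = as.integer(names(fits)), slope = slopes, r_squared = r_squared)
```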

In addition to highlighting objects within an animation, objects may also be linked between animations. Figure 16.12 links two animated views: on the left-hand side is population density by country and on the right-hand side is GDP per capita versus life expectancy. By default, all of the years are shown in black and the current year is shown in red. By pressing play to animate through the years, we can see that all three of these variables have increased (on average) fairly consistently over time. By linking the animated layers, we may condition on an interesting region of this data space to make comparisons in the overall relationship over time.

For example, in Figure 16.12, countries below the 50th percentile in terms of population density are highlighted in blue, then the animation is played again to reveal a fairly interesting difference in these groups. From 1952 to 1977, countries with a low population density seem to enjoy large increases in GDP per capita and moderate increases in life expectancy, then in the early 80s, their GDP seems to decrease while the life expectancy greatly increases. In comparison, the high density countries seem to enjoy a more consistent and steady increase in both GDP and life expectancy. Of course, there are a handful of exceptions to the overall trend, such as the noticeable drop in life expectancy for a handful of countries during the nineties, which are mostly African countries feeling the effects of war.

The gapminder data does not include a measure of population density, but the gap dataset (included with the plotlyBook R package) adds a column containing the population per square kilometer (popDen), which helps implement Figure 16.12. In order to link the animated layers (i.e., red points), we need another version of gap that marks the country variable as the link between the plots (gapKey).

data(gap, package = "plotlyBook")

gapKey <- highlight_key(gap, ~country)

p1 <- plot_ly(gap, y = ~country, x = ~popDen, hoverinfo = "x") %>%
  add_markers(alpha = 0.1, color = I("black")) %>%
  add_markers(data = gapKey, frame = ~year, ids = ~country, color = I("red")) %>%
  layout(xaxis = list(type = "log"))

p2 <- plot_ly(gap, x = ~gdpPercap, y = ~lifeExp, size = ~popDen, 
              text = ~country, hoverinfo = "text") %>%
  add_markers(color = I("black"), alpha = 0.1) %>%
  add_markers(data = gapKey, frame = ~year, ids = ~country, color = I("red")) %>%
  layout(xaxis = list(type = "log"))

subplot(p1, p2, nrows = 1, widths = c(0.3, 0.7), titleX = TRUE) %>%
  hide_legend() %>%
  animation_opts(1000, redraw = FALSE) %>%
  layout(hovermode = "y", margin = list(l = 100)) %>%
  highlight("plotly_selected", color = "blue", opacityDim = 1, hoverinfo = "none")

FIGURE 16.12: Comparing the evolution in the relationship between per capita GDP and life expectancy in countries with high population density (red) and low population density (blue).

16.4 Examples

16.4.1 Querying facetted charts

A facetted chart, also known as a trellis or small multiples display, is an effective way to observe how a certain relationship or visual pattern changes with a discrete variable (Richard A. Becker 1996) (Tufte 2001b). The implementation of a facetted chart partitions a data set into groups, then produces a graphical panel for each group using a fixed visual encoding (e.g. a scatterplot). When these groups are related in some way, it can be useful to consider linking the panels through graphical queries to reveal greater insight, especially when it comes to making comparisons both within and across multiple groups.

Figure 16.13 is an example of making comparisons both within and across panels via graphical querying in a facetted chart. Each panel represents one year of English Premier League standings across time and each line represents a team (the querying variable). Since the x-axis represents the number of games within season and y-axis tracks cumulative points relative to the league average, lines with a positive slope represent above average performance and a negative slope represents below average performance. This design makes it easy to query a good (or bad) team for a particular year (via direct manipulation) to see who the team is as well as how they’ve compared to the competition in other years. In addition, the dynamic and persistent color brush allows us to query other teams to compare both within and across years. This example is shipped as a demo with the plotly package and uses data from the engsoccerdata package (Curley 2016). Thanks to Antony Unwin for providing the initial idea and inspiration for Figure 16.13 (Unwin 2016).

# Entering this demo in your R console prints the source code
# needed to recreate the graphic.
# Also, `demo(package = "plotly")` lists all demos shipped with plotly
demo("crosstalk-highlight-epl-2", package = "plotly")

FIGURE 16.13: Graphical querying in a facetted worm chart of English Premier League football teams between 2007 and 2015. The combination of direct and indirect manipulation with the dynamic color brush makes it easy to make comparisons between good and/or bad teams relative to their known rivals. This particular comparison of Man U vs Arsenal demonstrates that, for the most part, Man U performed better from 2007 to 2015, except in 2013.

The demo above requires some fairly advanced data pre-processing, so to learn how to implement graphical queries in trellis displays, let’s work with more minimal examples. Figure 16.14 gives us yet another look at the txhousing dataset. This time we focus on just four cities and give each city its own panel in the trellis display by leveraging facet_wrap() from ggplot2. Within each panel, we’ll wrap the house price time series by year by putting the month on the x-axis and grouping by year. Then, to link these panels, we can utilize year as a querying variable. As a result, not only do we have the ability to analyze annual trends within city, but we can also query specific years to compare unusual or interesting years both within and across cities.

library(dplyr)
txsmall <- txhousing %>%
  select(city, year, month, median) %>%
  filter(city %in% c("Galveston", "Midland", "Odessa", "South Padre Island"))

txsmall %>%
  highlight_key(~year) %>% {
    ggplot(., aes(month, median, group = year)) + geom_line() +
      facet_wrap(~city, ncol = 2)
  } %>%
  ggplotly(tooltip = "year")

FIGURE 16.14: Monthly median house prices in four Texan cities. Querying by year allows one to compare unusual or interesting years both within and across cities.

Figure 16.15 displays the same information as Figure 16.14, but shows a way to implement a linked trellis display via plot_ly() instead of ggplotly(). This approach leverages dplyr::do() to create a plotly object for each city/panel, then routes that list of plots into subplot(). One nuance here is that the querying variable has to be defined within the do() statement, but every time highlight_key() is called, it creates a crosstalk::SharedData object belonging to a new unique group, so to link these panels together the group must be set to a constant value (here we’ve set group = "txhousing-trellis").

txsmall %>%
  group_by(city) %>%
  do(
    p = highlight_key(., ~year, group = "txhousing-trellis") %>% 
      plot_ly(showlegend = FALSE) %>% 
      group_by(year) %>%
      add_lines(
        x = ~month, y = ~median, text = ~year, 
        hoverinfo = "text"
      ) %>%
      add_annotations(
        text = ~unique(city), 
        x = 0.5, y = 1, 
        xref = "paper", yref = "paper", 
        xanchor = "center", yanchor = "bottom", 
        showarrow = FALSE
      )
  ) %>%
  subplot(nrows = 2, margin = 0.05, shareY = TRUE, shareX = TRUE, titleY = FALSE)

FIGURE 16.15: Using plot_ly() instead of ggplotly() to implement a linked trellis display.
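The group nuance is easy to observe directly. The sketch below assumes the `groupName()` method of the SharedData class and uses a made-up group name, "shared-demo-group", to show that two separate highlight_key() calls only end up linked when the group is fixed explicitly.

```r
library(plotly)

# two independent calls get two different (randomly generated) groups,
# so plots built from them would not be linked
first <- highlight_key(mtcars, ~cyl)
second <- highlight_key(mtcars, ~cyl)
first$groupName() == second$groupName()

# fixing the group makes both objects part of the same linking group
linked_a <- highlight_key(mtcars, ~cyl, group = "shared-demo-group")
linked_b <- highlight_key(mtcars, ~cyl, group = "shared-demo-group")
identical(linked_a$groupName(), linked_b$groupName())
```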

16.4.2 Statistical queries

16.4.2.1 Statistical queries with plot_ly()

Figure 16.6 introduced the concept of leveraging statistical trace types inside the graphical querying framework. This section gives some more examples of leveraging these trace types to dynamically produce statistical summaries of graphical queries. But first, to help understand what makes a trace “statistical”, consider the difference between add_bars() and add_histogram() (described in detail in Section 5). The important difference here is that add_bars() requires the bar heights to be pre-specified, whereas plotly.js does the relevant computations in add_histogram(). More generally, with a statistical trace, you provide a collection of “raw” values and plotly.js performs the statistical summaries necessary to render the graphic. As Figure 16.23 shows, sometimes you’ll want to fix certain parameters of the summary (e.g., number of bins in a histogram) to ensure the selection layer is comparable to the original layer.
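A small side-by-side sketch may help make the distinction concrete. It uses the mpg data from ggplot2; the object names are made up for illustration. Both plots should show the same bar heights, but only the second leaves the counting to plotly.js, which is what allows the summary to be recomputed for a selection.

```r
library(plotly)
library(dplyr)

# add_bars(): the heights are computed in R before plotting
class_counts <- count(ggplot2::mpg, class)
bars <- plot_ly(class_counts, x = ~class, y = ~n) %>%
  add_bars()

# add_histogram(): raw values are sent and plotly.js does the counting
histogram <- plot_ly(ggplot2::mpg, x = ~class) %>%
  add_histogram()
```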

Figure 16.16 demonstrates routing of a scatterplot brushing event to two different statistical trace types: add_boxplot() and add_histogram(). Here we’ve selected all cars with 4 cylinders to show that the number of cylinders appears to have a significant impact on miles per gallon for pickups and sport utility vehicles, but the interactive graphic allows us to query any subset of cars. Often, with scatterplot brushing, it’s desirable to have the row index inform the SQL query (i.e., have a 1-to-1 mapping between a row of data and the marker encoding that row). This happens to be the default behavior of highlight_key() – if no data variable is specified, then it automatically uses the row index as the querying variable.

demo("crosstalk-highlight-binned-target-a", package = "plotly")

FIGURE 16.16: Linking a (jittered) dotplot of engine displacement by number of cylinders with boxplots of miles per gallon split by class and a bar chart of Dynamic 2-way ANOVA.

When using a statistical trace type with graphical queries, it’s often desirable to set the querying variable as the row index. That’s because, with a statistical trace, numerous data values are attached to each graphical mark; and in that case, it’s most intuitive if each value queries just one observation. Figure 16.17 gives a simple example of linking a (dynamic) bar chart with a scatterplot in this way to allow us to query interesting regions of the data space defined by engine displacement (displ), miles per gallon highway (hwy), and the class of car (class). Notice how selections can derive from either view, and since we’ve specified "plotly_selected" as the on event, either rectangular or lasso selections can be used to trigger the query.

d <- highlight_key(mpg)
base <- plot_ly(d, color = I("black"), showlegend = FALSE)

subplot(
  add_histogram(base, x = ~class),
  add_markers(base, x = ~displ, y = ~hwy)
) %>%
  # Selections are actually additional traces, and, by default, 
  # plotly.js will try to dodge bars placed under the same category
  layout(barmode = "overlay", dragmode = "lasso") %>%
  highlight("plotly_selected")

FIGURE 16.17: Linking a bar chart with a scatterplot to query interesting regions of the data space defined by engine displacement (displ), miles per gallon highway (hwy), and the class of car (class). Notice how, by using add_histogram(), the number of cars within each class is dynamically computed by plotly.js.

Figure 16.18 adds two more statistical trace types to Figure 16.17 to further explore how miles per gallon highway is related to fuel type (fl) and front/rear/4 wheel drive (drv). In particular, one can effectively condition on these discrete variables to see how the other distributions respond by brushing and dragging over markers. For example, in Figure 16.18, front-wheel drive cars are highlighted in red, then 4-wheel drive cars in blue, and as a result, we can see that, in addition to the main effect of going from 4-wheel to front-wheel drive, there also appear to be large interaction effects with the regular and diesel fuel types.

d <- highlight_key(mpg)
base <- plot_ly(d, color = I("black"), showlegend = FALSE)

subplot(
  add_markers(base, x = ~displ, y = ~hwy),
  add_boxplot(base, x = ~fl, y = ~hwy) %>%
    add_markers(x = ~fl, y = ~hwy, alpha = 0.1),
  add_trace(base, x = ~drv, y = ~hwy, type = "violin") %>%
    add_markers(x = ~drv, y = ~hwy, alpha = 0.1),
  shareY = TRUE
) %>%
  subplot(add_histogram(base, x = ~class), nrows = 2) %>%
  # Selections are actually additional traces, and, by default, 
  # plotly.js will try to dodge bars placed under the same category
  layout(barmode = "overlay") %>%
  highlight("plotly_selected", dynamic = TRUE)

FIGURE 16.18: Using statistical queries to perform a 2-way ANOVA on the mpg dataset. Cars with front-wheel drive are highlighted in red and 4-wheel drive highlighted in blue. The dynamically rendered boxplots by fuel type indicate significant interaction effects.

16.4.3 Statistical queries with ggplotly()

Compared to plot_ly(), statistical queries (client-side) with ggplotly() are fundamentally limited. That’s because the statistical R functions that ggplot2 relies on to generate the graphical layers can’t necessarily be recomputed with different input data in your web browser. That being said, this is really only an issue when attempting to target a ggplot2 layer with a non-identity statistic (e.g., geom_smooth(), stat_summary(), etc). In that case, one should consider linking views server-side, as covered in Section 17.

As Figure 16.19 demonstrates, you can still have a ggplot2 layer with a non-identity statistic serve as the source of a selection. In that case, ggplotly() will automatically attach all the input values of the querying variable into the creation of the relevant graphical object (e.g. the fitted line). That is why, in the example below, when a fitted line is hovered upon, all the points belonging to that particular group are highlighted, even when the querying variable is the row index.

m <- highlight_key(mpg)
p <- ggplot(m, aes(displ, hwy, colour = class)) +
    geom_point() +
    geom_smooth(se = FALSE, method = "lm")
ggplotly(p) %>% highlight("plotly_hover")

FIGURE 16.19: Engine displacement versus highway miles per gallon by class of car. The linear model for each class, as well as the individual observations, can be selected by hovering over the line of fitted values. An individual observation can also be selected by hovering over the relevant point.

Figure 16.19 demonstrates highlighting in a single view when the querying variable is the row index, but the linking could also be done by matching the querying variable with the ggplot2 group of interest, as is done in Figure 16.20. This way, when a user highlights an individual point, it highlights the entire group instead of just that point.

m <- highlight_key(mpg, ~class)
p1 <- ggplot(m, aes(displ, fill = class)) + geom_density()
p2 <- ggplot(m, aes(displ, hwy, fill = class)) + geom_point()
subplot(p1, p2) %>% hide_legend() %>% highlight("plotly_hover")

FIGURE 16.20: Hovering over a density estimate to highlight all the raw observations that went into that estimate.

In summary, we’ve learned numerous things about statistical queries:

  • A statistical trace (e.g., add_histogram(), add_boxplot(), etc) can be used as both the source and target of a graphical query.
  • When a statistical trace is the target of a graphical query, it’s often desirable to have the row index assigned as the querying variable.
  • A ggplot2 layer can be used as the source of a graphical query, but when it is the target, non-trivial statistical functions cannot be recomputed client-side. In that case, one should consider linking views server-side, as covered in Section 17.

16.4.4 Geo-spatial queries

Section 4 covers several different approaches for rendering geo-spatial information, and each approach supports graphical querying. One clever approach is to render a 3D globe as a surface, then layer geo-spatial data on top of that globe with a scatter3d trace. Not only is 3D a nice way to visualize geospatial data that has altitude (in addition to latitude and longitude), but it also grants the ability to interpolate color along a path. Figure 16.21 renders tropical storm paths on a 3D globe and uses color to encode the altitude of the storm at that point. Below the 3D view is a 2D view of altitude versus distance traveled. These views are linked by a graphical query where the querying variable is the storm ID.

demo("sf-plotly-3D-globe", package = "plotly")

FIGURE 16.21: Linking a 3D globe with tropical storm paths to a 2D view of the storm altitude versus distance traveled.

A more widely used approach to geo-spatial data visualization is to render lat/lon data on a basemap layer that updates in response to zoom events. The plot_mapbox() function from plotly does this via integration with mapbox. Figure 16.22 uses plot_mapbox() to highlight earthquakes west of Fiji in order to compare the relative frequency of their magnitude and number of reporting stations (to the overall relative frequency).

eqs <- highlight_key(quakes)
 
# you need a mapbox API key to use plot_mapbox()
# https://www.mapbox.com/signup/?route-to=https://www.mapbox.com/studio/account/tokens/
map <- plot_mapbox(eqs, x = ~long, y = ~lat) %>%
  add_markers(color = ~depth) %>%
  layout(
    mapbox = list(
      zoom = 2,
      center = list(lon = ~mean(long), lat = ~mean(lat))
    )
  ) %>%
  highlight("plotly_selected")
 
# shared properties of the two histograms
hist_base <- plot_ly(eqs, color = I("black"), histnorm = "probability density") %>%
  layout(barmode = "overlay", showlegend = FALSE) %>%
  highlight(selected = attrs_selected(opacity = 0.5))
 
histograms <- subplot(
  add_histogram(hist_base, x = ~mag),
  add_histogram(hist_base, x = ~stations),
  nrows = 2, titleX = TRUE
)
 
crosstalk::bscols(histograms, map)

FIGURE 16.22: Querying earthquakes by location and displaying histograms of their magnitude and number of stations.

Every 2D mapping approach in plotly (e.g., plot_mapbox(), plot_ly(), geom_sf()) has a special understanding of the simple features data structure provided by the sf package. Sievert (2018b) and Sievert (2018c) go into more depth about simple features support in plotly and provide more examples of graphical queries and animation with simple features, but Figure 16.23 demonstrates a clever ‘trick’ to get bi-directional brushing between polygon centroids and a histogram showing a numerical summary of the polygons. The main idea is to leverage the st_centroid() function from sf to get the polygons' centroids, then link those points to the histogram via highlight_key().

library(sf)
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
nc_query <- highlight_key(nc, group = "sf-rocks")
nc_centroid <- highlight_key(st_centroid(nc), group = "sf-rocks")

map <- plot_mapbox(color = I("black"), height = 250) %>%
  add_sf(data = nc) %>%
  add_sf(data = nc_centroid) %>%
  layout(showlegend = FALSE) %>%
  highlight("plotly_selected", dynamic = TRUE)

hist <- plot_ly(color = I("black"), height = 250) %>% 
  add_histogram(
    data = nc_query, x = ~AREA,
    xbins = list(start = 0, end = 0.3, size = 0.01)
  ) %>%
  layout(barmode = "overlay") %>% 
  highlight("plotly_selected")

crosstalk::bscols(widths = 12, map, hist)

FIGURE 16.23: Graphically querying North Carolina by location and area.

16.4.5 Linking with other htmlwidgets

The plotly package is able to share graphical queries with a limited set of other R packages that build upon the htmlwidgets standard. At the moment, graphical queries work best with leaflet and DT. Figure 16.24 links plotly with DT, and since the data set linked between the two is an sf data frame, each row of the table is linked to a polygon on the map through the row index of the same dataset.

demo("sf-dt", package = "plotly")

FIGURE 16.24: Linking a plot_ly()-based map with a datatable() from the DT package.

As already shown in section 16.2, plotly can share graphical queries with leaflet. Some of the more advanced features (e.g., persistent selection with a dynamic color brush) are not yet officially supported, but you can still leverage them by installing the experimental versions of leaflet referenced in the code below. For example, in Figure 16.25, persistent selection with dynamic colors allows one to first highlight earthquakes with a magnitude of 5 or higher in red, then earthquakes with a magnitude of 4.5 or lower, while the corresponding earthquakes are highlighted in the leaflet map. This reveals an interesting relationship between magnitude and geographic location, and leaflet provides the ability to zoom and pan on the map to investigate regions that have a high density of quakes.

# requires an experimental version of leaflet
# devtools::install_github("rstudio/leaflet#346")
library(leaflet)

qquery <- highlight_key(quakes)

p <- plot_ly(qquery, x = ~depth, y = ~mag) %>% 
  add_markers(alpha = 0.5) %>%
  highlight("plotly_selected", dynamic = TRUE)

map <- leaflet(qquery) %>% 
  addTiles() %>% 
  addCircles()

# persistent selection can be specified via options()
withr::with_options(
  list(persistent = TRUE), 
  crosstalk::bscols(widths = c(6, 6), p, map)
)

FIGURE 16.25: Linking views between plotly and leaflet to explore the relation between magnitude and geographic location of earthquakes around Fiji.

Figure 16.26 uses another experimental feature: querying leaflet polygons in response to direct manipulation of a plotly graph.

# requires an experimental version of leaflet
# devtools::install_github("rstudio/leaflet#391")
library(leaflet)
library(sf)

nc <- system.file("shape/nc.shp", package = "sf") %>%
  st_read() %>% 
  st_transform(4326) %>%
  highlight_key()

map <- leaflet(nc) %>%
  addTiles() %>%
  addPolygons(
    opacity = 1,
    color = 'white',
    weight = .25,
    fillOpacity = .5,
    fillColor = 'blue',
    smoothFactor = 0
  )

p <- plot_ly(nc) %>% 
  add_markers(x = ~BIR74, y = ~SID79) %>%
  layout(dragmode = "lasso") %>%
  highlight("plotly_selected")

crosstalk::bscols(map, p)

FIGURE 16.26: Querying polygons on a leaflet map in response to direct manipulation of a plotly graph.

16.4.6 Generalized pairs plots

Section 13.1.2.1 introduced the generalized pairs plot made via GGally::ggpairs() which, like ggplot(), partially supports graphical queries. The brushing in Figure 16.27 demonstrates how the scatterplots can respond to graphical queries (allowing us to see how these relationships behave in specific subsections of the data space), but for the same reasons outlined in section 16.4.3, the statistical summaries (e.g., the density plots and correlations) don’t respond to the graphical query.

highlight_key(iris) %>%
  GGally::ggpairs(aes(color = Species), columns = 1:4) %>%
  ggplotly() %>%
  highlight("plotly_selected")

FIGURE 16.27: Brushing a scatterplot matrix via the ggpairs() function in the GGally package. A video demonstrating the graphical queries can be viewed here https://vimeo.com/307788027

16.4.7 Querying diagnostic plots

In addition to the ggpairs() function for generalized pairs plots, the GGally package also has a ggnostic() function which generates a matrix of diagnostic plots from a model object using ggplot2. Each column of this matrix represents a different explanatory variable and each row represents a different diagnostic measure. Figure 16.28 shows the default display for a linear model, which includes residuals (resid), estimates of residual standard deviation when a particular observation is excluded (sigma), diagonals from the projection matrix (hat), and Cook's distance (cooksd).

library(dplyr)
library(GGally)

mtcars %>%
  # for better tick labels
  mutate(am = recode(am, `0` = "automatic", `1` = "manual")) %>%
  lm(mpg ~ wt + qsec + am, data = .) %>%
  ggnostic(mapping = aes(color = am)) %>%
  ggplotly()

FIGURE 16.28: Graphical queries applied to multiple diagnostic plots of a linear model. The ggplotly() function has a special method for ggnostic() that adds graphical queries automatically with support for both individual observations (e.g., points) as well as meaningful groups (e.g., automatic vs manual).

Injecting interactivity into ggnostic() via ggplotly() enhances the diagnostic plot in at least two ways. Coloring by a factor variable in the model allows us to highlight that region of the design matrix by selecting a relevant statistical summary, which can help avoid overplotting when dealing with numerous factor levels. For example, in Figure 16.28, the user first highlights diagnostics for cars with manual transmission (in blue), then cars with automatic transmission (in red). Perhaps more widely useful is the ability to highlight individual observations since most of these diagnostics are designed to identify highly influential or unusual observations.

In Figure 16.28, there is one observation with a noticeably high value of cooksd, which suggests the observation has a large influence on the fitted model. Clicking on that point highlights its corresponding diagnostic measures, plotted against each explanatory variable. Doing so makes it obvious that this observation is influential since it has an unusually high response/residual in a fairly sparse region of the design space (i.e., it has a pretty high value of wt) and removing it would significantly reduce the estimated standard deviation (sigma). By comparison, the other two observations with similar values of wt have a response value very close to the overall mean, so even though their value of hat is high, their value of sigma is low.
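To make the four diagnostic measures more concrete, here is a minimal sketch that computes them with base R for the same model. It assumes the columns shown by ggnostic() correspond to residuals(), influence()$sigma, hatvalues(), and cooks.distance(); the names cars, fit, and diagnostics are only for illustration.

```r
# Same model as above, recoded with base R so the sketch has no dependencies
cars <- transform(mtcars, am = ifelse(am == 0, "automatic", "manual"))
fit <- lm(mpg ~ wt + qsec + am, data = cars)

# One row per car, one column per diagnostic measure
# (assumed to match the resid, sigma, hat, and cooksd rows of ggnostic())
diagnostics <- data.frame(
  resid  = residuals(fit),
  sigma  = influence(fit)$sigma,  # residual sd estimate with the observation excluded
  hat    = hatvalues(fit),        # diagonal of the projection matrix
  cooksd = cooks.distance(fit),
  row.names = rownames(cars)
)

# Rank observations by Cook's distance to find the most influential ones
head(diagnostics[order(-diagnostics$cooksd), ], 3)
```

Sorting by cooksd is a static stand-in for clicking on the most extreme point in the interactive display: the top row holds the same measures that the graphical query would highlight across panels.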

16.4.7.1 Subset queries via list-columns

All the graphical querying examples thus far use highlight_key() to attach values from an atomic vector of a data frame to graphical marker(s), but what about non-atomic vectors (i.e., list-columns)? When it comes to emitting events, there is no real difference: plotly will “inform the world” of a set of selection values, which is the union of all data values in the graphical query. However, as Figure 16.29 demonstrates, when plotly receives a list-column query, it will highlight graphical markers with data value(s) that are a subset of the selected values. For example, when the point (3, 3) is queried, plotly will highlight all markers that represent a subset of {A, B, C}, which is why both (1, 1) (representing the set {A}) and (2, 2) (representing the set {A, B}) are highlighted.

d <- tibble::tibble(
  x = 1:4, 
  y = 1:4,
  key = lapply(1:4, function(x) LETTERS[seq_len(x)]),
  txt = sapply(key, function(x) sprintf("{%s}", paste(x, collapse = ", ")))
)
highlight_key(d, ~key) %>%
  plot_ly(x = ~x, y = ~y, text = ~txt, hoverinfo = "text") %>%
  highlight("plotly_selected", color = "red") %>%
  layout(dragmode = "lasso")

FIGURE 16.29: A simple example of subset queries via a list-column.
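The subset rule can be mimicked in plain R. The following is only a sketch of the matching logic described above, not plotly's client-side implementation; keys, selected, and is_highlighted are illustrative names.

```r
# The same list-column keys as in the example above
keys <- lapply(1:4, function(i) LETTERS[seq_len(i)])

# Selection values emitted when the point at (3, 3) is queried
selected <- keys[[3]]  # "A" "B" "C"

# A marker is highlighted when all of its key values are in the selection
is_highlighted <- vapply(keys, function(k) all(k %in% selected), logical(1))
which(is_highlighted)
# 1 2 3
```

The marker at (4, 4) is left out because its key {A, B, C, D} contains D, which is not among the selected values.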

One compelling use case for subset queries is dendrograms. In fact, plotly provides a plot_dendro() function for making dendrograms with support for subset queries. Figure 16.30 gives an example of brushing a branch of a dendrogram to query leaves that are similar in some sense. Any dendrogram object can be provided to plot_dendro(), but this particular example visualizes the similarity of US states in terms of their arrest statistics via a hierarchical clustering model on the USArrests dataset.

hc <- hclust(dist(USArrests), "ave")
dend1 <- as.dendrogram(hc)
plot_dendro(dend1, height = 600) %>% 
  hide_legend() %>% 
  highlight("plotly_selected", persistent = TRUE, dynamic = TRUE)

FIGURE 16.30: Leveraging hierarchical selection and persistent brushing to paint branches of a dendrogram.

Figure 16.31 links the dendrogram from Figure 16.30 to a map of the US and a grand tour of the arrest statistics to better understand and diagnose a hierarchical clustering methodology. By highlighting branches of the dendrogram, we can effectively choose a partitioning of the states into similar groups, and see how that model choice projects to the data space[27] through a grand tour. The grand tour is a special kind of animation that interpolates between random 2D projections of numeric data allowing the viewer to perceive the shape of a high-dimensional point cloud (Asimov 1985). Note how the grouping portrayed in Figure 16.31 does a fairly good job of staying separated in the grand tour.

demo("animation-tour-USArrests", package = "plotly")

FIGURE 16.31: Linking a dendrogram to a grand tour and map of the USArrests data to visualize a classification in 5 dimensions.
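As a rough static analogue of one frame of the tour, the sketch below projects the arrest statistics onto a single random 2D plane and colors states by a partition of the tree. It is not the code behind the demo: the choice of k = 3 groups, the standardization via scale(), and the seed are all assumptions made for illustration, and a real grand tour would interpolate smoothly between many such planes.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Partition the states by cutting the tree into k groups (k = 3 is arbitrary)
hc <- hclust(dist(USArrests), "ave")
groups <- cutree(hc, k = 3)

# A random 2D projection basis: orthonormalize a random p x 2 matrix via QR
X <- scale(USArrests)
p <- ncol(X)
basis <- qr.Q(qr(matrix(rnorm(p * 2), nrow = p, ncol = 2)))

# Project the standardized data and color each state by its group
proj <- X %*% basis
plot(proj, col = groups, pch = 19, xlab = "projection 1", ylab = "projection 2")
```

Re-running the projection step with different random bases gives a crude sense of whether the groups stay separated from multiple viewing angles, which is what the animated tour shows continuously.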

16.5 Limitations

The graphical querying framework presented here is for posing database queries between multiple graphs via direct manipulation. For serious statistical analysis, one often needs to link other data views (e.g., text-based summaries, tables, etc.) in other arbitrary ways. For these use cases, the R package shiny makes it very easy to build on concepts we’ve already covered to build more powerful client-server applications entirely in R, without having to learn any HTML, CSS, or JavaScript. Chapter 17 gives a brief introduction to shiny, then dives right into concepts related to linking plotly graphics to other arbitrary views.

References

Shneiderman, Ben. 1996. “The Eyes Have It: A Task by Data Type Taxonomy for Information Visualizations.” VL Proceedings of the IEEE Symposium on Visual Languages, January, 1–9.

Becker, Richard A., Ming-Jen Shyu, and William S. Cleveland. 1996. “The Visual Design and Control of Trellis Display.” Journal of Computational and Graphical Statistics 5 (2): 123–55. http://www.jstor.org/stable/1390777.

Tufte, Edward. 2001b. The Visual Display of Quantitative Information. Cheshire, Conn: Graphics Press.

Curley, James. 2016. Engsoccerdata: English and European Soccer Results 1871-2016. https://CRAN.R-project.org/package=engsoccerdata.

Unwin, A. R. 2016. “GDA of England (from Engsoccerdata).” Blog. http://www.gradaanwr.net/wp-content/uploads/2016/06/dataApr16.pdf.

Sievert, Carson. 2018b. “Learning from and Improving Upon Ggplotly Conversions.” Blog. https://blog.cpsievert.me/2018/01/30/learning-improving-ggplotly-geom-sf/.

Sievert, Carson. 2018c. “Visualizing Geo-Spatial Data with Sf and Plotly.” Blog. https://blog.cpsievert.me/2018/03/30/visualizing-geo-spatial-data-with-sf-and-plotly/.

Asimov, Daniel. 1985. “The Grand Tour: A Tool for Viewing Multidimensional Data.” SIAM J. Sci. Stat. Comput. 6 (1): 128–43. https://doi.org/10.1137/0906011.


  22. This sort of idea relates closely to the notion of generalized selections as described in Heer, Agrawala, and Willett (2008).

  23. The title that appears in the dropdown can be controlled via the group argument in the highlight_key() function. The primary purpose of the group argument is to isolate one group of linked plots from others.

  24. When using ggplotly(), you need to specify dynamicTicks = TRUE.

  25. All R packages with crosstalk support are currently listed here: https://rstudio.github.io/crosstalk/widgets.html

  26. Sievert (2018c) outlines the relative strengths and weaknesses of each approach.

  27. Typically, statistical models are diagnosed by visualizing data in the model space rather than model(s) in the data space. As Wickham, Cook, and Hofmann (2015) point out, the latter approach can be a very effective way to better understand and diagnose statistical models.