A previous study has shown that moving 3D data visualizations to the physical world can improve users’ efficiency at information retrieval tasks. Here, we re-analyze a subset of the experimental data using a multiverse analysis approach. Results from this multiverse analysis are presented as explorable explanations, and can be interactively explored in this paper. The study’s findings appear to be robust to choices in statistical analysis.
Author Keywords
Physical visualization; multiverse analysis.
ACM Classification Keywords
H5.2 User Interfaces: Evaluation/Methodology
General Terms
Human Factors; Design; Experimentation; Measurement.
Introduction
Whereas traditional visualizations map data to pixels or ink, physical visualizations (or “data physicalizations”) map data to physical form. While physical visualizations are compelling as an art form, it is unclear whether they can help users carry out actual information visualization tasks.
Five years ago, a study was published comparing physical to on-screen visualizations in their ability to support basic information retrieval tasks [1]. Interactive 2D representations were clearly the fastest, but a gain in speed was observed when transitioning from on-screen to physical 3D representations. Overall, the study suggested that features unique to physical objects – such as their ability to be directly touched – can facilitate information retrieval.
These results, however, only hold for the particular data analysis that was conducted. A group of statisticians and methodologists recently argued that results from a single analysis can be unreliable [2]. They recommended that researchers instead conduct multiverse analyses, i.e., perform all reasonable analyses of their data and report a summary of their outcomes. While Steegen et al. show how to summarize all outcomes using p-values, here we use an interactive approach based on Bret Victor’s concept of explorable explanations [3].
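To make the idea concrete, here is a minimal sketch of what enumerating a multiverse of analyses can look like. The data and option names are hypothetical and chosen only for illustration; the actual options used in this article are described in the Results section.

```python
# Minimal multiverse-analysis sketch on hypothetical data and option names.
# Every combination of analysis options is one "universe"; all outcomes are
# collected so their robustness can be inspected side by side.
from itertools import product

import numpy as np

rng = np.random.default_rng(0)
times = rng.lognormal(mean=2.5, sigma=0.4, size=(16, 4))  # fake completion times

options = {
    "transform": ["none", "log"],          # analyze raw or log-transformed times
    "ci_method": ["t", "bootstrap"],       # interval-estimation method
    "correction": ["none", "bonferroni"],  # multiplicity handling
}

multiverse = []
for transform, ci_method, correction in product(*options.values()):
    data = np.log(times) if transform == "log" else times
    # ... compute effect sizes and interval estimates for this combination ...
    multiverse.append({
        "transform": transform,
        "ci_method": ci_method,
        "correction": correction,
        "condition_means": data.mean(axis=0),
    })

print(f"{len(multiverse)} analyses in this toy multiverse")
```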
Figure 1: 3D bar chart, on-screen and physical.
Study
The study consisted of two experiments. In the first experiment, participants were presented with 3D bar charts showing country indicator data, and were asked simple questions about the data. The 3D bar charts were presented both on a screen and in physical form (see Figure 1). The on-screen bar chart could be rotated in all directions with the mouse. Both a regular and a stereoscopic display were tested. An interactive 2D bar chart was also used as a control condition. Accuracy was high across all conditions, but average completion time was lower with physical 3D bar charts than with on-screen 3D bar charts.
Here we only re-analyze the second experiment, whose goal was to better understand why physical visualizations appear to be superior. The experiment involved an “enhanced” version of the on-screen 3D chart and an “impoverished” version of the physical 3D chart. The enhanced on-screen chart was rotated using a 3D-tracked tangible prop instead of a mouse. The impoverished physical chart consisted of the same physical object but participants were instructed not to use their fingers for marking. There were 4 conditions:
physical touch: physical 3D bar charts where touch was explicitly encouraged in the instructions.
physical no touch: same charts as above except subjects were told not to use their fingers to mark points of interest (labels and bars).
virtual prop: on-screen 3D bar charts with a tangible prop for controlling 3D rotation.
virtual mouse: same charts as above, but 3D rotation was mouse-controlled.
These manipulations were meant to answer three questions: 1) how important is direct touch in the physical condition? 2) how important is rotation by direct manipulation? 3) how important is visual realism? Visual realism referred to the higher perceptual richness of physical objects compared to on-screen objects, especially concerning depth cues. Figure 2 summarizes the three effects of interest.
Figure 2: Effects of interest.
Sixteen participants were recruited, all of whom saw the four conditions in counterbalanced order. For more details about the experiment, please refer to [1].
Results
Like the original paper, we use an estimation approach, meaning that we report and interpret all results based on (unstandardized) effect sizes and their interval estimates [4]. We explain how to translate the results into statistical significance language to provide a point of reference, but we warn the reader against the pitfalls of dichotomous interpretations [5].
Figure 3: Average task completion time (arithmetic or geometric mean, depending on the selected analysis options) for each condition. Error bars are 95% CIs (t-based or BCa bootstrap).
We focus our analysis on task completion times, reported in Figures 3 and 4. Dots indicate sample means, while error bars are 95% confidence intervals computed on untransformed or log-transformed data [6], using either the t-distribution or the BCa bootstrap method [7] (these analysis options can be switched in the interactive version of this article). Strictly speaking, all we can assert about each interval is that it comes from a procedure designed to capture the population mean 95% of the time across replications, under some assumptions [8]. In practice, if we assume we have very little prior knowledge about population means, each interval can be informally interpreted as a range of plausible values for the population mean, with the midpoint being about 7 times more likely than the endpoints [9].
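As an illustration of these two interval-estimation choices, the sketch below computes a 95% CI for a geometric-mean completion time on hypothetical data, once with the t-distribution and once with the BCa bootstrap. It assumes SciPy’s stats.bootstrap (available since SciPy 1.7) and is not the code behind the figures.

```python
# Two ways to get a 95% CI for a (geometric) mean completion time,
# on hypothetical per-participant data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
times = rng.lognormal(mean=2.5, sigma=0.4, size=16)  # fake completion times (s)
log_times = np.log(times)
n = len(log_times)

# t-based CI on the log scale
m, se = log_times.mean(), log_times.std(ddof=1) / np.sqrt(n)
t_ci = stats.t.interval(0.95, df=n - 1, loc=m, scale=se)

# BCa bootstrap CI on the log scale
boot = stats.bootstrap((log_times,), np.mean, confidence_level=0.95, method="BCa")

print("geometric mean:", np.exp(m))
print("t-based 95% CI:", np.exp(t_ci))
print("BCa 95% CI:    ", np.exp([boot.confidence_interval.low,
                                 boot.confidence_interval.high]))

# Under a normal model, the likelihood at the CI midpoint is exp(1.96**2 / 2),
# i.e. about 6.8 times the likelihood at its endpoints -- the "about 7 times"
# figure quoted above.
print(np.exp(1.96 ** 2 / 2))
```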
Figure 3 shows the mean (arithmetic or geometric, depending on the selected option) completion time for each condition. At first sight, physical touch appears to be faster than the other conditions.
However, since condition is a within-subject factor, it is preferable to examine within-subject differences [9], shown in Figure 4.
Figure 4 shows the pairwise differences between mean completion times (or their ratios, when geometric means are selected). A value lower than 0 (or 1 for ratios), i.e., on the left side of the dark line, means the condition on the left is faster than the condition on the right. The confidence intervals are not corrected for multiplicity (a Bonferroni correction can be applied in the interactive version), meaning they are effectively 95% CIs [10]. Thus, an interval that does not contain the value 0 (or 1 for ratios) indicates a statistically significant difference at the α=.05 level. The probability of getting at least one such interval if all 3 population mean differences were zero (i.e., the familywise error rate) is α=.14. Likewise, the simultaneous confidence level is 86%, meaning that if we replicate our experiment over and over, we should expect the 3 confidence intervals to capture all 3 population mean differences 86% of the time.
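The sketch below illustrates, on hypothetical data, how such within-subject comparisons and the multiplicity arithmetic work out; it is not the analysis code behind Figure 4.

```python
# Within-subject pairwise comparisons on hypothetical log-transformed times,
# plus the familywise error rate and simultaneous confidence level for
# 3 uncorrected 95% CIs.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
log_times = np.log(rng.lognormal(2.5, 0.4, size=(16, 4)))  # 16 subjects x 4 conditions
conditions = ["physical touch", "physical no touch", "virtual prop", "virtual mouse"]
pairs = [(0, 1), (1, 2), (2, 3)]  # the 3 effects of interest (Figure 2)

alpha = 0.05
for a, b in pairs:
    diff = log_times[:, a] - log_times[:, b]  # within-subject log differences
    m, se = diff.mean(), diff.std(ddof=1) / np.sqrt(len(diff))
    lo, hi = stats.t.interval(1 - alpha, df=len(diff) - 1, loc=m, scale=se)
    print(f"{conditions[a]} / {conditions[b]}: geometric-mean ratio "
          f"{np.exp(m):.2f}, 95% CI [{np.exp(lo):.2f}, {np.exp(hi):.2f}]")

k = len(pairs)
print("familywise error rate:   ", 1 - (1 - alpha) ** k)  # ~0.14
print("simultaneous confidence: ", (1 - alpha) ** k)       # ~0.86
# A Bonferroni correction would instead use alpha / k for each interval.
```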
Figure 4: Pairwise differences (arithmetic means) or ratios (geometric means) between average task completion times across conditions. Error bars are 95% CIs (t-based or BCa bootstrap, optionally Bonferroni-corrected).
Figure 4 provides good evidence that i) physical touch is faster on average than physical no touch, and that ii) physical no touch is faster than virtual prop. This suggests that both visual realism (e.g., rich depth cues) and physical touch can facilitate basic information retrieval. Importantly, these two properties are unique to physical objects and are hard to faithfully reproduce in virtual setups. In contrast, we could not find evidence that physical object rotation (as opposed to mouse-operated rotation) provides a performance advantage for information retrieval.
Discussion and Conclusion
Our findings for experiment 2 are in line with the previously published study [1]. In the present article, the default analysis options reflect the choices made in the previously published analysis – thus, the figures are by default identical. On top of this, we consider alternative choices in statistical analysis and presentation, which together yield a total of 56 unique analyses and results. The conclusions are largely robust to these choices. Results are less clean with untransformed data, likely because abnormally high completion times are given as much weight as other observations. The use of a log transformation addresses this issue without the need for outlier removal [6].
Meanwhile, the use of bootstrap CIs makes the results slightly stronger, although this is likely because bootstrap CIs are slightly too liberal for small sample sizes [7].
We did not re-analyze experiment 1 to keep this article simple. Since experiment 1 used four conditions and the reported analysis included a figure with seven comparisons [1], it is possible that some of the effects would become much less conclusive after correcting for multiplicity. Multiplicity correction is, however, a contested practice [10], so it is generally best to consider both corrected and uncorrected interval estimates.
The goal of this article was to illustrate how the ideas of multiverse analysis [2] and explorable explanations [3] can be combined to produce more transparent and more compelling statistical reports. We only provided a few analysis options, and many more could have been included. In addition, our choice of analysis options was highly personal and subjective. Steegen et al. have argued that multiverse analyses are necessarily incomplete and subjective, but are nonetheless far more transparent than conventional analyses, where no information is provided about the robustness or fragility of researchers’ findings [2].
References
[1] Evaluating the Efficiency of Physical Visualizations [link] Jansen, Y., Dragicevic, P. and Fekete, J., 2013. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 2593–2602. ACM. DOI: 10.1145/2470654.2481359
[2] Increasing Transparency Through a Multiverse Analysis [link] Steegen, S., Tuerlinckx, F., Gelman, A. and Vanpaemel, W., 2016. Perspectives on Psychological Science, Vol 11(5), pp. 702–712. DOI: 10.1177/1745691616658637
[3] Explorable Explanations. Victor, B., 2011.
[4] Fair Statistical Communication in HCI [link] Dragicevic, P., 2016. Modern Statistical Methods for HCI, pp. 291–330. Springer International Publishing. DOI: 10.1007/978-3-319-26633-6_13
[5] The earth is flat (p > 0.05): significance thresholds and the crisis of unreplicable research [link] Amrhein, V., Korner-Nievergelt, F. and Roth, T., 2017. PeerJ, Vol 5, pp. e3544. PeerJ Inc.
[6] Average task times in usability tests: what to report? [link] Sauro, J. and Lewis, J.R., 2010. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 2347–2350. ACM. DOI: 10.1145/1753326.1753679
[7] BootES: an R package for bootstrap confidence intervals on effect sizes [link] Kirby, K.N. and Gerlanc, D., 2013. Behavior Research Methods, Vol 45(4), pp. 905–927. Springer. DOI: 10.3758/s13428-013-0330-5
[8] The fallacy of placing confidence in confidence intervals [link] Morey, R.D., Hoekstra, R., Rouder, J.N., Lee, M.D. and Wagenmakers, E., 2016. Psychonomic Bulletin & Review, Vol 23(1), pp. 103–123. DOI: 10.3758/s13423-015-0947-8
[9] Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-analysis [link] Cumming, G., 2012. Routledge.
[10] Serious Stats: A Guide to Advanced Statistics for the Behavioral Sciences. Baguley, T., 2012. Palgrave Macmillan.