Teaching modeling in introductory statistics: A comparison of formula and tidyverse syntaxes (2201.12960v5)
Abstract: There are many pedagogical considerations for incorporating programming into a statistics course. When using the programming language R, one consideration is the particular R syntax that will be used. This paper reports on a head-to-head comparison run in a pair of introductory statistics labs, one conducted fully in the formula syntax, the other in tidyverse. Analysis of pre- and post-survey data shows minimal differences between the two labs, with students reporting a positive experience regardless of section. Analysis of data from YouTube and RStudio Cloud shows interesting distinctions. The formula section appeared to watch a larger proportion of pre-lab YouTube videos, but spend less time computing on RStudio Cloud. Conversely, the tidyverse section watched a smaller proportion of the videos and spent more time computing. Analysis of lab materials showed that tidyverse labs tended to be slightly longer, both in lines of the provided RMarkdown materials and in minutes of the associated YouTube videos. The tidyverse labs exposed students to more distinct R functions, but reused functions more frequently. Both labs relied on a relatively small vocabulary of consistent functions, which can provide a starting point for instructors interested in teaching introductory statistics in R. The instructor experience of teaching in the two syntaxes diverged primarily when discussing relationships between categorical variables, as well as when working with summary statistics for numeric variables. This work provides additional evidence for instructors looking to choose between syntaxes for introductory statistics teaching.