
For about a decade now, I’ve been teaching some form of introductory statistics, on and off, to a wide variety of students across different institutions. This is an immensely popular course at the undergraduate level, which should come as no surprise: the ability to analyze data and draw basic inferences and conclusions is incredibly important across numerous disciplines. Students arrive with a wide variety of interests, goals, and prior experience, and not everyone comes to this course with a love of, or experience with, technical mathematics.

Those who have taught some form of introductory statistics will recognize that the mathematical aspects of this course pose one of the greatest barriers to effective instruction. It’s worth taking a moment to acknowledge that, conceptually, the ideas in intro stats are hard! The central idea that distinguishes statistics from other branches of mathematics is the notion of randomness and uncertainty, and this can be a difficult concept to illustrate. In other math courses like Calculus, one may present or have students work through an example to elucidate some concept. However, when dealing with uncertainty and probabilities, any single example or instance of a random variable is insufficient to highlight the essential behavior. Much of the relevant behavior only manifests after dozens, or hundreds, or thousands of occurrences. More to the point, any specific example is, by its very nature, not random, which in many ways defeats the whole point! (This is a pet peeve I have with every intro stats textbook I’ve ever read, including very good ones!)

Luckily, technology gives us a way out: we can use randomization and simulation via languages like R. As mentioned in my previous post, this is also an opportunity to introduce active or inquiry elements into the course!

For example, in this activity from my IBL-based statistics course, we are easing ourselves into discussing the Central Limit Theorem. The CLT is one of those results that only makes sense at scale. Even restricting to the binomial case, one typically needs n = 30+ trials of the binomial variable for the CLT to reasonably apply. But to see that the distribution of this variable is approximately normal, one needs many, many instances of these 30+ trials, far more coins than can be flipped in class! With R, however, this is no issue.
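A minimal sketch of such a cell might look like the following; the `coinflips` name and the `prob = c(0.5, 0.5)` vector come from the activity itself, while the `trials` name and the plotting details are my own filler here:

```r
# Flip `coinflips` fair coins, record the number of heads, and
# repeat the whole experiment 1000 times.
coinflips <- 10
trials <- 1000  # name assumed; the activity only fixes coinflips

heads <- replicate(trials,
  sum(sample(c(0, 1), size = coinflips, replace = TRUE, prob = c(0.5, 0.5))))

# Plot the distribution of outcomes as a histogram, one bin per count.
hist(heads, breaks = seq(-0.5, coinflips + 0.5, by = 1),
     main = "Number of heads in 10 coin flips", xlab = "Heads")
```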

In the above R cell, we flip coinflips = 10 coins, record the number of heads, repeat this 1000 times, and plot the distribution of the outcomes in a histogram. The students are then prompted to repeat this for increasing values of coinflips (20, 30, 50, 100, 200, etc.). One can then observe that the histogram conforms to a normal-curve-like shape. By editing the prob = c(0.5, 0.5) vector, one can adapt this experiment to any binomial variable, and regardless, the normal curve manifests itself. We can observe all of this without ever stating the CLT, and students may even conjecture as to the general principle at play. We may then further illustrate the idea by superimposing the curve on the histogram:
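A sketch of that overlay might look like this, assuming the standard normal approximation to a Binomial(n, p) count, which has mean np and standard deviation sqrt(np(1 − p)):

```r
# A Binomial(n, p) count has mean n * p and standard deviation
# sqrt(n * p * (1 - p)); overlay that normal curve on the histogram.
coinflips <- 100
p <- 0.5
heads <- replicate(1000,
  sum(sample(c(0, 1), size = coinflips, replace = TRUE, prob = c(1 - p, p))))

# freq = FALSE plots densities so the curve and histogram share a scale.
hist(heads, freq = FALSE, xlab = "Heads",
     main = "Heads in 100 flips vs. the normal curve")
curve(dnorm(x, mean = coinflips * p, sd = sqrt(coinflips * p * (1 - p))),
      add = TRUE, lwd = 2, col = "red")
```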

The students can confirm their conjecture, and by adjusting coinflips and p, they can verify the result for other binomial variables. A proof of the CLT is far beyond the scope of an intro stats class, but with these simulations, students can still engage in the act of inquiry, conjecture, and at least experimental verification.

Another concept which may be difficult to describe without demonstration is the confidence interval. Given some sample proportion, one generates an interval which has a given probability (also expressed as a proportion) of containing yet another proportion: the population proportion. This is an intricate definition, and students can find the task of unpacking what a confidence interval actually is daunting. In this activity, however, the students assume that 23% of Americans like their steaks medium rare, and then simulate the samples, and corresponding confidence intervals, that random samples of the population may produce. Simulating 100 confidence intervals at random, they can see how much variation there may be across samples of size n = 50, and how much the corresponding confidence intervals vary as well. They can also see that about 95% of the confidence intervals contain the actual true proportion of 0.23:
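A minimal sketch of such a simulation, assuming the usual normal-approximation interval p̂ ± 1.96·SE (the activity may build its intervals differently), might look like:

```r
# Assume the true population proportion is p = 0.23, draw 100 random
# samples of size n = 50, and build a 95% confidence interval from each.
p <- 0.23
n <- 50
sims <- 100  # name assumed: number of simulated intervals

phat  <- replicate(sims, mean(sample(c(0, 1), size = n,
                                     replace = TRUE, prob = c(1 - p, p))))
se    <- sqrt(phat * (1 - phat) / n)
lower <- phat - 1.96 * se
upper <- phat + 1.96 * se

# Fraction of intervals that capture the true proportion; in the long
# run this should hover around 0.95.
mean(lower <= p & p <= upper)
```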

These are just a couple of the statistical concepts which lend themselves well to discovery and inquiry via simulation and experimentation. Having these cells loaded and ready in a PreTeXt workbook lowers the barrier to entry for students to benefit from these sorts of activities. The structure of the activities is highly informed by Team Based Inquiry Learning pedagogy, and the content follows the excellent OpenIntro OER. To talk more about tech in math education, you can find me and many others on the MathTech.org Discord!
