Sixth Sense
Last month, we sketched – almost literally – a possible route through all of the material with which students need to be fluent if they are to understand conceptually the Binomial Distribution; this month we’re going to look at Binomial Hypothesis Testing. For some students of the current specifications, this idea is not covered until the S2 module, but the new specifications will include this in AS Mathematics so all students of the subject, even those not intending to sit an A Level, will be expected to carry out and interpret hypothesis tests.
From the statistician’s point of view, a hypothesis test has five stages:
• propose hypotheses
• set significance level
• calculate critical region
• carry out experiment
• conclude.
With a carefully designed initial experiment (practical, ideally, but a thought experiment if necessary) it should be possible to elicit the key ideas from your students without giving them this list.
For example, you could tell your students that you are going to give them a die, which you may or may not have tampered with. If you have tampered with it, you have done so in such a way as to make “5” come up less frequently than they would think it should. How might they find out whether or not you have tampered with it? Alternatively, you could propose implied hypotheses which require the collection of data from, or by, the students. For example, “birth rates are lower in the first 8 months of the year” could lead to hypotheses testing the value of p where p = P(a person is born between 1 January and 31 August). Whatever you decide, it is important to try to start with a situation that will require a “one-tailed test” at the lower end – we don’t want to make life too complicated right from the start!
Let’s explore the first example in a bit more detail. The conversation never goes quite like this, but with a bit of steering by you, the key ideas can be elicited from your students!
Teacher: 

I’m going to give you a hypothetical die. 
Student 1: 

What’s a die? 
Teacher: 

I’m being pedantic – we often call it a dice, but that’s technically the plural. Anyway, either it “behaves fairly” or I’ve tampered with it so that “5” comes up less frequently than I think you think it should. How will you find out? 
Student 2: 

Roll it lots of times and count the 5’s. 
Teacher: 

Good: how many times? 
Student 3: 

If we roll it sixty times, we’d expect “5” to be rolled ten times. 
Teacher: 

What do you mean by that? Will we get exactly ten 5’s? 
Student 3: 

No. Well, we might. But I mean that rolling ten 5’s is the most likely result, but that there are plenty of other things that could happen.
Student 4: 

So if we roll it sixty times and get “5” nine times, will we say “you have tampered with this dice?” I don’t think we should, because this could actually happen. Then again, anything could happen, but if we’re way off getting ten 5’s then we’re going to be suspicious! 
Teacher: 

Interesting – so at what point are you prepared to stand by an accusation of tampering? What’s “way off?” 
Student 5:

I’d say if we get “5” six times. 
Student 6: 

And if we get “5” fewer times than that, surely? 
Teacher: 

OK, so we're saying that our decision process is:
1. Roll the die sixty times.
2. Count the number of 5’s.
3. And if we get six or fewer 5’s, then you’ll accuse me of die-tampering.
Let’s think about this happening. If you roll a fair die sixty times, what is the probability that we roll “5” six times or fewer? 
Student 7: 

Let me consult my tables. Oh no, they only go up to n = 20. 
Teacher: 

Fine, let’s build a 60-table on a spreadsheet. [This doesn’t take long and is an easy formula for students to see. Use two columns: one contains the integers from 0 (in cell A1) to 60 (in cell A61), then type “=binom.dist(A1,60,1/6,1)” into cell B1 and drag this down column B.]
Student 8: 

Under the modelling assumption that “no. of 5’s rolled” ~ B(60, 1/6), we can see from the spreadsheet that P(no. of 5’s rolled ≤ 6) = 0.1081 (4 d.p.).
Teacher: 

OK. So if we roll the die and get only four 5’s, you’re going to accuse me of die-tampering. Will you be certain that I’ve done so?
Student 9: 

No, because we might just be unlucky. Fair coins do come up “Heads” 5 times in a row, it’s not impossible. Fair dice can be rolled and not give the scores you’re expecting. Even if we get no 5’s at all that’s not PROOF that you’ve tampered with the die. 
Teacher: 

Correct, so this means that if you accuse me of tampering, you know you might be making a false accusation – in fact the probability of you making a false accusation is… 
Student 10: 

…0.1081, because I’ll falsely accuse you if it is a FAIR dice, modelled with P(roll a 5) = 1/6, and genuinely I roll “5” six or fewer times out of sixty, and the probability of that is what we just worked out. 
Student 11: 

But that’s way too high for the risk of a FALSE accusation like this – I’m not comfortable with that. I’d rather risk a false accusation with a probability closer to 1% than 10%. 
Student 12: 

In which case, looking at our table, we’d have to make an accusation if “three or fewer” 5’s are rolled – according to the table, the probability of this happening is 0.0063 which is under 1%, but if we decided to accuse if “four or fewer” the probability is just over 2%.
Teacher: 

Great, lots of wriggle room: you’ve just made it easier for me to get away with giving you a tampered die! So clearly there’s a bit of a balancing act here between “making a false accusation too frequently” and “making it too easy for someone to get away with tampering”. Clear the table, let’s start rolling! 
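If a spreadsheet is not to hand, the same 60-table can be built in a few lines of code. The following is only a sketch in Python: the helper name binom_cdf is made up for illustration, and it plays the role of the spreadsheet's cumulative binom.dist formula under the modelling assumption B(60, 1/6).

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ B(n, p), summed term by term."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

n, p = 60, 1 / 6

# One entry per possible count of 5's, mirroring columns A and B of the spreadsheet.
table = {k: binom_cdf(k, n, p) for k in range(n + 1)}

# Print the part of the table the class actually discusses.
for k in range(11):
    print(f"{k:2d}  {table[k]:.4f}")
```

The rows for 3, 5 and 6 should agree with the values read off in the dialogue (0.0063, 0.0512 and 0.1081).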
INTERLUDE 


Student 13: 

So we rolled the die sixty times, and rolled a 5 (drum roll please) five times. We said we’d accuse you of tampering if we got three or fewer 5’s, which we haven't, so we won’t accuse you. 
Teacher: 

Hooray, I’m a free man. But do you KNOW that I haven’t tampered with the die? 
Student 14: 

No. Maybe you did, and were lucky and got away with it, like a “false negative” when you have a test for a nasty disease. 
Teacher: 

If we’d only rolled a 5 twice, would you KNOW that I had tampered with the die? 
Student 15: 

No. Maybe you hadn’t, but you were unlucky, like a “false positive” in a medical test. 
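The “false positive” and “false negative” risks that the students describe can be set side by side for the thresholds they considered. This is a rough sketch only, and it needs one assumption that appears nowhere in the dialogue: a tampered die has to be given some probability of rolling a 5, and the value 0.1 used here is purely hypothetical.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ B(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

n = 60
fair_p = 1 / 6
tampered_p = 0.1  # hypothetical value, chosen only to illustrate the trade-off

# For each rule "accuse if the number of 5's is <= threshold", store
# (probability of a false accusation, probability a tampered die escapes).
results = {}
for threshold in (6, 4, 3):
    false_accusation = binom_cdf(threshold, n, fair_p)
    missed_tampering = 1 - binom_cdf(threshold, n, tampered_p)
    results[threshold] = (false_accusation, missed_tampering)
    print(f"accuse if <= {threshold}: "
          f"false accusation {false_accusation:.4f}, "
          f"tampering missed {missed_tampering:.4f}")
```

Lowering the threshold makes false accusations rarer but lets more tampered dice slip through, which is the balancing act the teacher points out.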
At this point, your students are ready to use some notation and terminology, and to formalise what they have learned so far.
• A statistical “hypothesis test” starts with two hypotheses. In our example, they are
The null hypothesis – H₀: p = 1/6
The alternative hypothesis – H₁: p < 1/6
where p represents what we think is the probability of rolling a 5.
• Let’s decide that in this instance, we accept the risk of making a false accusation (of the type “you’ve tampered with the die when actually (unknown to us) you haven’t”) with probability 0.1. We call this the significance level. So, here we’re setting a “10% significance level”. This seems acceptable (but not to everyone in the dialogue above) but clearly this will vary from experiment to experiment. Statisticians have to make the decision based on context; students are likely to be told what to use in the exam question.
• The critical region is the set of outcomes that result in our rejecting H₀: that is, making an accusation (which may or may not be a correct accusation). It is, therefore, the set of outcomes that lead us to make a FALSE accusation if H₀ is, in fact and unknown to us, true.
So, in our experiment, assuming that “no. of 5’s rolled” ~ B(60, 1/6), we can see from our table (as discussed earlier) that “P(no. of 5’s rolled ≤ 6) = 0.1081”.
This is a potential sticking point, but a good example – having picked a significance level of 10%, this figure is actually too high. We could change our level to 11%, but we ought not to be influenced by the figures involved. Having set the level to 10%, we revisit the table to see that “P(no. of 5’s rolled ≤ 5) = 0.0512”. This means that our proposed significance level of 10% is actually a significance level of 5.12% in this experiment, because the probabilities accumulate “chunkily”.
Thus our critical region is “no. of 5’s rolled is 5 or fewer”, which we can write in set notation as {X ≤ 5}, defining X to be (the random variable) the number of 5’s rolled when the die is rolled sixty times.
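The search for the critical region can also be written out as a short routine, which may help students see that it is a mechanical step once the significance level is fixed. This is a minimal sketch in Python; the function names are invented for illustration and the model B(60, 1/6) is the one assumed above.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ B(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def lower_critical_value(n, p, alpha):
    """Largest c with P(X <= c) <= alpha, or None if even c = 0 is too likely."""
    c = None
    for k in range(n + 1):
        if binom_cdf(k, n, p) <= alpha:
            c = k          # k is still inside the allowed risk
        else:
            break          # cumulative probabilities only grow from here
    return c

n, p = 60, 1 / 6
c = lower_critical_value(n, p, 0.10)
print("critical region: X <=", c)
print("actual significance level:", round(binom_cdf(c, n, p), 4))
```

With a 10% level this should reproduce the region {X ≤ 5} and the 5.12% figure; changing 0.10 to 0.01 gives the stricter rule the students argued for in the dialogue.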
• Now, and only now, are we ready to “do the experiment”. All this thinking should happen beforehand, so that the result cannot influence our decision at the end. If we do the experiment first, we could be tempted to tweak our critical region to fit the experimental outcome.
So we roll the die sixty times and get…
(a) …7 “5’s” and 53 “not 5’s”.
Conclusion: 7 IS NOT in the critical region. There is insufficient evidence, at the 10% significance level, to suggest that the die has been tampered with. Therefore, we are not comfortable making an accusation. We say that “we accept H₀”. This doesn’t mean we are sure that the die is fair, but we haven’t got enough evidence to be comfortable suggesting otherwise.
(b) …5 “5’s” and 55 “not 5’s”.
Conclusion: 5 IS in the critical region. There is sufficient evidence, at the 10% significance level, to suggest that the die has been tampered with. Therefore, we are comfortable making this accusation. We say that “we reject H₀”.
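Students who doubt that a fair die really does land in the critical region about 5% of the time can be shown a simulation. The sketch below, in Python, repeats the sixty-roll experiment many times with a fair die and counts how often the rule “accuse if 5 or fewer 5’s” fires; the number of repetitions and the seed are arbitrary choices, not part of the original activity.

```python
import random

random.seed(0)  # fixed seed so that a rerun gives the same figures

def count_fives(rolls=60):
    """Roll a fair die `rolls` times and count how many 5's appear."""
    return sum(1 for _ in range(rolls) if random.randint(1, 6) == 5)

trials = 10_000
# A false accusation happens whenever a FAIR die gives 5 or fewer 5's.
false_accusations = sum(1 for _ in range(trials) if count_fives() <= 5)
rate = false_accusations / trials

print(f"false accusation rate over {trials} experiments: {rate:.4f}")
```

The observed rate should sit close to the 0.0512 found from the table, though it will not match it exactly.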
Students should now be ready to tackle some questions. It is a good idea to set, at least initially, questions that follow the above structure – exam questions will necessarily contain the experimental result but we want students to set up the test first, before being influenced by this. Consequently, a bit of editing may be useful, so that questions look something like:
It is believed that three-quarters of people living in Manchester use the trams at least once a week. A researcher wishes to find out whether or not a campaign to encourage more people to walk, cycle and use the bus has reduced the proportion of Manchester inhabitants using the trams. As such, she decides to carry out a survey of 20 such people.
(a) What are her hypotheses?
(b) If she chooses a 5% significance level, what is her critical region?
12 of the 20 people surveyed say they use the tram at least once a week.
(c) What should she conclude?
One of these 12 people contacts the researcher to say that he was mistaken, and he should have said “no, I do not use the tram at least once a week”.
(d) What should the researcher conclude now?
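For anyone wanting to check a set of answers to this question, the five steps can be run through in code. This is a sketch only, in Python, and it rests on a reading of the question that should be stated plainly: H₀: p = 0.75 against H₁: p < 0.75, with the person who changes his answer being one of the 12 tram users, so that the count drops to 11.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ B(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

n, p0, alpha = 20, 0.75, 0.05  # survey size, H0 proportion, significance level

# (b) Lower-tail critical region: largest c with P(X <= c) <= alpha.
c = max(k for k in range(n + 1) if binom_cdf(k, n, p0) <= alpha)
print("critical region: X <=", c)
print("actual significance level:", round(binom_cdf(c, n, p0), 4))

# (c) and (d): compare each observed count with the critical region.
decisions = {}
for observed in (12, 11):
    decisions[observed] = "reject H0" if observed <= c else "do not reject H0"
    print(observed, "tram users:", decisions[observed])
```

The question is nicely poised: the two observed counts fall on opposite sides of the boundary, so the single corrected response changes the conclusion.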
Only once students have grasped the procedural fluency of the “five step” structure is it worth altering the questions so that they have to cope with the irritating exam question format that gives the experimental result in the first sentence!
Further lessons on upper-tail tests and two-tail tests are, of course, necessary, but it isn’t worth rushing into these. Even students who are happy with the process, and who have previously mastered the art of calculating all sorts of probabilities from the tables, find the “upper tail” argument difficult:
For hypotheses H₀: p = 0.4 and H₁: p > 0.4,
and an experiment involving 16 trials, we find ourselves looking at the following table:

 x    P(X ≤ x) when p = 0.4
 8    0.857730282
 9    0.941681055
10    0.980858082
11    0.995104274
12    0.99906155
13    0.999873298
14    0.999989263
15    0.999999571
16    1
Clearly the table only gives lower-tail probabilities P(X ≤ x), so we have to consider the “opposite” of this – we deduce that, because P(X ≤ 9) = 94.17%, it must be the case that P(X ≥ 10) = 5.83%, which is too big given our significance level.
Instead, students must argue that P(X ≤ 10) = 98.09%
so P(X ≥ 11) = 1.91%
and thus the critical region here is X ≥ 11: X, the thing being counted, needs to happen 11 or more times in 16 trials – i.e. {X ≥ 11} is the critical region.
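The “one minus the previous row” step is the part students stumble over, so it may help to see it spelled out. The following Python sketch assumes X ~ B(16, 0.4) under H₀ and a 5% significance level; the 5% figure is an assumption made for the example (it is consistent with the percentages above but is not stated explicitly), and the function names are invented.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ B(n, p); an empty sum gives 0 when k < 0."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def upper_tail(c, n, p):
    """P(X >= c), found as 1 - P(X <= c - 1)."""
    return 1 - binom_cdf(c - 1, n, p)

def upper_critical_value(n, p, alpha):
    """Smallest c with P(X >= c) <= alpha, or None if no such c exists."""
    for c in range(n + 1):
        if upper_tail(c, n, p) <= alpha:
            return c
    return None

n, p0, alpha = 16, 0.4, 0.05  # alpha = 5% is assumed for illustration
c = upper_critical_value(n, p0, alpha)
print("P(X >= 10) =", round(upper_tail(10, n, p0), 4))  # too big at 5%
print("P(X >= 11) =", round(upper_tail(11, n, p0), 4))
print("critical region: X >=", c)
```

Note that the tail for “10 or more” uses the table row for 9, and the tail for “11 or more” uses the row for 10, which is exactly the off-by-one that catches students out.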
Not straightforward, and certainly it would be good to teach and embed lower tails, then have a few lessons on something else, and then come back to upper and two-tails later!
