P=0.05: Is There No Alternative?

Written by Connor Zahler:

The standard for statistical significance that we're consistently taught in higher education is p < 0.05. Most exam questions won't even state a significance level because you are meant to assume .05 is the cutoff. But where did this seemingly arbitrary number come from, and can it really be used in every situation? In this article, Connor explores how .05 came to be and whether we should continue using it today.

Introduction:

If you've taken a QMSS class or any intro statistics class, you hopefully remember the importance of 0.05. Yeah, we hear about .01 or .10 now and then. But in the social sciences, it's .05 that generally gives us bragging rights to that ever-so-satisfying claim of 'statistical significance'. Basically, a p-value below 0.05 means that, if there were truly no effect, you would see results at least this extreme less than 5% of the time, so we feel comfortable treating the effect as real. This is an over-simplification, but it serves the purposes of this article. The .05 cutoff is ubiquitous, but why is that? Why not 0.1 or 0.01? Does a non-significant result mean that no effect was observed? What is p-hacking? In this article, we'll go through a history and explanation of p = 0.05 and the possible issues with it.
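To make that concrete, here is a minimal sketch in Python of the usual workflow: run a test, get a p-value, compare it to .05. The data, group labels, and effect size are all invented for illustration, and the two-sample t-test is just one common choice of test, not the only one.

```python
# Minimal sketch: compute a p-value and compare it to the 0.05 cutoff.
# The data are simulated; the "treatment effect" of 0.4 is made up.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control = rng.normal(loc=0.0, scale=1.0, size=50)  # group with no effect
treated = rng.normal(loc=0.4, scale=1.0, size=50)  # group with a small true effect

t_stat, p_value = stats.ttest_ind(treated, control)
print(f"p = {p_value:.3f}")
print("significant at .05" if p_value < 0.05 else "not significant at .05")
```

Notice that the whole verdict collapses into a single yes/no comparison against 0.05, which is exactly the convention this article is questioning.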

The history of .05 is certainly a long one, and its power has reigned supreme for some time. A 1967 article in The American Sociologist discussed this point, saying, "Casual examination of the literature discloses the common, arbitrary, and virtually sacred levels of .05, .01, and .001 are almost universally selected regardless of the nature of the problem. Of these three, .05 is perhaps most sacred… The current obsession with .05, it would seem, has the consequence of differentiating significant research findings and those best forgotten, published studies from unpublished ones, and renewal of grants from termination" (Skipper, Guenther, and Nass, 1967). Our friend .05 has real-world consequences. Let's take a further look.

Why 0.05?:

There is no real reason to use 0.05 rather than any other significance threshold. The number was first proposed around 1925 by R.A. Fisher, father of the p-value as we understand it today, and his reasoning was self-admittedly arbitrary. This is often a hard pill to swallow: the cornerstone of research in many fields is perhaps based on little other than "well, looks good." It would be misleading, however, to reduce the value to nothing more than that. It has shown its worth over decades of research, and it's not like the number is wholly useless. Still, an important piece of this conversation is understanding that 0.05 is more historical convention than anything else.

Other Thresholds:

0.05 is hardly the only game in town. For one, there have been repeated proposals to change it, such as this 2018 article, which proposes the lower threshold of 0.005 based on its closer correspondence to Bayes factors, as well as the simple fact that a lower p-value means a lower chance of a false positive. Other scientists, in contrast, have proposed disavowing the whole concept in favor of a more holistic approach to evaluating evidence. While it is rare, other values (such as 0.1 and 0.01) are used in some articles, especially in cases where a false positive could pose greater risks. Certain fields have already adopted much stricter thresholds for significance; genome-wide association studies in genetics, for example, commonly require p < 5 × 10⁻⁸.

P-Hacking:

Chances are you've heard the phrase p-hacking in the context of social science research, but you might not know much about it beyond hearing that it's bad science and it can make people angry. Basically, p-hacking means designing and redesigning your analysis in pursuit of a lower p-value, rather than obtaining one organically through a sound experimental and analytic design.

Sometimes with p-hacking, rather than testing for evidence of a predetermined, theoretically grounded hypothesis, an analyst may look at relationship after relationship until they find one that meets the conventional .05 standard, and then back into an argument for that relationship's relevance. (Recall that a .05 level means you have a 1 in 20 chance of a false positive in any given test. If you run hundreds of analyses, you will likely find some that meet .05 significance even if they are false. Are you trying to investigate an idea, or are you just trying to 'find something' you can report?)
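To see how quickly chance alone produces 'significant' results, here is a small simulation sketch in Python. Every variable in it is pure noise, so any hit is by definition a false positive; the choice of 200 tests is arbitrary, purely for illustration.

```python
# Sketch of why running many tests inflates false positives: both groups
# in every test are drawn from the same distribution, so there is no
# real effect anywhere, yet some comparisons still clear the .05 bar.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_tests, n_obs = 200, 50

false_positives = 0
for _ in range(n_tests):
    a = rng.normal(size=n_obs)  # pure noise
    b = rng.normal(size=n_obs)  # pure noise
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1

print(f"{false_positives} of {n_tests} null tests were 'significant'")
# Expect roughly 5%, about 10 of 200, to come up significant by chance alone.
```

An analyst who reports only the handful of hits, without mentioning the nearly 200 misses, has manufactured a finding out of noise.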

Having an arbitrary cut-off like 0.05 will potentially encourage some kind of gaming of the results. If you obtain a result of something like 0.051, and you find that you can hit 0.049 by switching to a different sort of test, why wouldn't you make the switch, especially when significant results can be the difference between success in a field and languishing in obscurity? This isn't said to justify p-hacking; it can have incredibly negative real-world consequences. Acting like p-hacking is solely the work of entirely bad actors, however, ignores structural issues that can encourage it.

Bayesian Methods:

One major alternative to the 0.05 value is not a replacement, but an addition. Recently, there has been a drive to incorporate Bayesian analysis into hypothesis testing. While these methods are well known in statistics classes (many students here will probably have at least a passing familiarity), there hasn't been a corresponding adoption in social science research. Proponents argue that Bayesian analysis would shift the paradigm from "effect/no effect" to "magnitude of effect." A lower-key example of this magnitude-focused mindset is the odds ratio, which students encounter in QMSS courses. It isn't Bayesian in itself, but by showing how strongly two events are associated, the odds ratio adds important information to help researchers understand how large the effect they're seeing is, regardless of significance. Bayesian statistics is highly complex, but the rewards to research may be more than worthwhile. Interested students can read more here.
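As a rough illustration of the magnitude-of-effect mindset, here is a small Python sketch. The 2x2 counts are made up, and the Beta(1, 1) priors are a textbook default rather than a recommendation; the point is simply to contrast an odds ratio and a basic Bayesian comparison with a binary significance verdict.

```python
# Two magnitude-style summaries of a hypothetical 2x2 table:
#   rows: exposed / unexposed, columns: outcome yes / outcome no.
import numpy as np

table = np.array([[30, 70],    # exposed:   30 yes, 70 no
                  [15, 85]])   # unexposed: 15 yes, 85 no

# Odds ratio: how strongly exposure and outcome are associated,
# independent of any p-value.
odds_ratio = (table[0, 0] * table[1, 1]) / (table[0, 1] * table[1, 0])
print(f"odds ratio = {odds_ratio:.2f}")

# A minimal Bayesian step: Beta(1, 1) priors on each group's outcome
# rate yield Beta posteriors, which we compare by simulation to get
# the probability that the exposed rate exceeds the unexposed rate.
rng = np.random.default_rng(1)
exposed_rate = rng.beta(1 + 30, 1 + 70, size=100_000)
unexposed_rate = rng.beta(1 + 15, 1 + 85, size=100_000)
print(f"P(exposed rate > unexposed rate) = {(exposed_rate > unexposed_rate).mean():.3f}")
```

Rather than a single yes/no answer, this kind of output tells you how big the association is and how confident you can be in its direction.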

What is to be Done?:

Abolishing or replacing the 0.05 value could never be done overnight. It would require a lot of work: consulting with scientists, research on the possible effects, and scholarly debate. In the end, though, such a change may be the best solution to the problem of setting a standard for verifiable knowledge. Whatever the case, it's important to understand p < 0.05, its history, and its issues. It's a cornerstone of social science, and anyone working in the field will encounter it again and again.
