I am a co-author of the Lakens et al. paper, Justify Your Alpha. In it, we argued that it was critical for scientists to justify their choice of significance level. In doing so, we called for a return to the original practice of significance testing, where the goal is not (and should never have become) to choose which results are noteworthy or publishable. Instead, “without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behaviour with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong” (Neyman and Pearson, 1933). Given the goal of finding methods with an upper bound on the long-run probability of being wrong in our conclusions, our paper clearly shows that the threshold for statistical significance as currently used should not be uniform. But I now think this is partly a red herring.
It is critical to note that what Neyman and Pearson originally arrived at was a set of “criteria suitable for testing any given statistical hypothesis.” The criteria they found were only one part of the goal, which was to allow statistical evidence to be used to converge on beliefs that were (with high probability) correct. In 1933, when the users of statistical methods were restricted to those who could calculate the answers manually, and the discussions in a given field were among only a few sophisticated users of the methods, p-values were a reasonable choice for such a criterion.
But as noted by Crane and Martin, the current question of interest is how the statistical methods that are demanded by journals and peer review affect the ability of science as a discipline to converge to correct answers in the long run. The principles that they suggest, along with those of open science, are more likely to help achieve Neyman and Pearson’s initial goals than redefining the p-value, especially given that all involved in the discussion agree that it currently fails to achieve its original purpose of keeping the error rate below 1 in 20. Preregistration, public availability of data and code, and justification of study design choices (like which alpha is used) are clearly attempts to further promote the sociological goals of science.
If we hope to find rules that fulfill Neyman and Pearson’s original goals, focusing narrowly on study design and statistical methods seems doomed to failure. Instead, we need to focus on the principles, techniques, and practices that each discipline demands of contributors. This conversation has started, and continues, but perhaps the goals we are looking to achieve need to be more clearly stated (and justified) before further discussion of statistical methods and goalposts is useful.