Wait. 100 "true things", 5% error rate...

Ah! Where's the rigor? Shouldn't we be testing more than once, if we have a known error rate?

Something is just not right about the middle part of that figure. It needs some Bayes-jutsu.

This is what it is saying: there are 1000 test cases. 10% are "yes". There are 100 true "yeses" and 900 true "nos". A power of .8 means 80% of the true "yeses" will be captured by the test. That means there will be 20 apparent "nos" that are really "yeses". There is also a .05 false positive rate. That means that out of 900 TRUE "nos", 45 will appear to be "yeses". False positives look exactly like true positives.
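For anyone who wants to check that arithmetic, here is a minimal sketch in Python. The function name `expected_outcomes` and its parameter names are made up for illustration; the numbers (1000 tests, 10% real effects, power .8, false positive rate .05) are the ones from the figure.

```python
def expected_outcomes(n_tests=1000, base_rate=0.10, power=0.80, alpha=0.05):
    """Expected count in each cell of the outcome table (hypothetical helper)."""
    true_yes = n_tests * base_rate   # hypotheses where the effect is real
    true_no = n_tests - true_yes     # hypotheses where the null is true
    return {
        "true_positives": power * true_yes,         # real effects detected
        "false_negatives": (1 - power) * true_yes,  # real effects missed
        "false_positives": alpha * true_no,         # nulls wrongly rejected
        "true_negatives": (1 - alpha) * true_no,    # nulls correctly kept
    }


counts = expected_outcomes()
for name, value in counts.items():
    print(f"{name}: {value:.0f}")
```

These are expected values, so a real batch of 1000 tests would scatter around them.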

However, although their math works out just fine if they are talking about a 5% false positive rate, they seem to have confused significance level with false positive rate. That is not what a .05 significance level is. A .05 significance level is a measure of how likely the test is to have produced data this far or farther from a no-change mean by chance alone. *It is not an error rate.* Therefore all their numbers are hopelessly borked and meaningless.

Actually, they're right about that. Alpha is not only the p-value threshold needed for significance, it is also the chance of a Type I error. That is, if the null hypothesis is true, it is the fraction of replications in which you would expect, by chance alone, the estimate of the parameter (usually the mean) to fall outside that interval and be considered significant. And...I think I just explained it.
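If that sounds abstract, a quick simulation makes the point. This is only a sketch under assumptions I picked for convenience: a one-sample z-test with a known standard deviation of 1, samples of 30, 5000 replications, a fixed seed, and a function name (`type_one_error_rate`) I invented. The null hypothesis is true in every replication, so every rejection is a Type I error.

```python
import random
from statistics import NormalDist, fmean


def type_one_error_rate(alpha=0.05, n=30, replications=5000, seed=1):
    """Fraction of replications that reject a TRUE null (hypothetical helper)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided cutoff
    rejections = 0
    for _ in range(replications):
        # Draw from the null distribution: mean 0, known sd 1.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = fmean(sample) * n ** 0.5  # z statistic when sd is known to be 1
        if abs(z) > z_crit:
            rejections += 1
    return rejections / replications


rate = type_one_error_rate()
print(f"observed Type I error rate: {rate:.3f}")
```

The observed rate should land close to alpha, which is the whole point: when the null is true, alpha is the false positive rate.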

The problem is, the article words the situation weirdly. Instead, it should be like this.

1. Of 1000 hypotheses, perhaps 100 of these should reject the null hypothesis in favor of the alternative, because the effect is real.

2. Given an alpha of 0.05, we expect 1/20th of those tests that should not have rejected the null hypothesis to actually do so. 1/20th of 900 is 45. These are the false positives. Given a power of 0.8 [I'm not sure how they got a power of 0.8; I just have to take their word for it, since calculating power by hand is complicated, since calculating beta is complicated, and power is 1 minus beta], our beta is 0.2, which means 20% of the time when we /should have/ rejected the null hypothesis, we will not. This is the Type II error, and means that 0.2*100 = 20 negative tests are actually false negatives: they should have rejected the null hypothesis.

3. If researchers only publish positive test results, that means that by chance alone there will be 9 false positives published for every 16 true positives (45 to 80), a ratio of more than half. Put another way, 45 of the 125 published positives, or 36%, are false. The ratio of false negatives to true negatives is 20 to 855, or about .02, which means the random chance of a false negative is much lower than that of a false positive, if all tests are equally published.

Now, I have a few problems with this. The first is that people are generally not interested in detecting sameness, they are interested in detecting differences. At least in hypothesis testing. And furthermore, those alpha and beta values? They are /tailored/ to a high standard of detecting differences. We could easily design a test where the rate of false negatives is higher, and all we have to do is decrease the alpha value. Make it tiny. Make it small enough, and the false negative level will skyrocket. (ETA: Really, 0.05 as our alpha is a convention. The tests behind it lean on something called the central limit theorem, which has to do with central tendencies of variability and the rareness of extreme values. They assume data are normally distributed. They aren't always.)
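To put rough numbers on that alpha/beta trade-off, here is a sketch using a two-sided, two-sample z-test with known variance. The standardized effect size of 0.5, the 64 subjects per group, and the function name `false_negative_rate` are my assumptions, chosen only because they give a power near 0.8 at alpha = 0.05.

```python
from statistics import NormalDist


def false_negative_rate(alpha, effect_size=0.5, n_per_group=64):
    """Beta for a two-sided two-sample z-test (hypothetical helper)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)       # rejection cutoff
    shift = effect_size * (n_per_group / 2) ** 0.5   # mean of z under the alternative
    # Probability of landing in either rejection tail when the effect is real.
    power = std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)
    return 1 - power


betas = {}
for alpha in (0.05, 0.01, 0.001, 0.0001):
    betas[alpha] = false_negative_rate(alpha)
    print(f"alpha={alpha:<7} beta={betas[alpha]:.3f}")
```

With the sample size and effect size held fixed, every cut in alpha buys fewer false positives at the price of more false negatives.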

But here's the main problem, and that's the premise of assuming a very uneven ratio of negative to positive results. People do not run around testing hypotheses at random. It is, frankly, a waste of time. When the article assumes a 9 to 1 ratio of negatives to positives, it is exactly that, an assumption. What if we make it 50:50? Well then, 1/20th of 500 is 25 false positives, and 0.2 times 500 is 100 false negatives. Which makes the ratio of false positives to true positives 25:400 or 0.06, and the ratio of false negatives to true negatives 100 to 475 or 0.2. That false positive ratio is a HELL of a lot lower than more than half.
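You can see how much the answer hangs on that assumed ratio by sweeping it. A minimal sketch, with a made-up helper name (`error_ratios`) and the article's power of 0.8 and alpha of 0.05 held fixed; the 25% base rate is just an extra point I added for comparison.

```python
def error_ratios(base_rate, n_tests=1000, power=0.80, alpha=0.05):
    """Error ratios for a given share of real effects (hypothetical helper)."""
    true_yes = n_tests * base_rate
    true_no = n_tests - true_yes
    tp = power * true_yes          # true positives
    fn = (1 - power) * true_yes    # false negatives
    fp = alpha * true_no           # false positives
    tn = (1 - alpha) * true_no     # true negatives
    return {
        "fp_per_tp": fp / tp,                    # false positives per true positive
        "fp_share_of_positives": fp / (fp + tp), # share of all positives that are false
        "fn_per_tn": fn / tn,                    # false negatives per true negative
    }


for base_rate in (0.10, 0.25, 0.50):
    r = error_ratios(base_rate)
    print(
        f"base rate {base_rate:.2f}: "
        f"FP/TP={r['fp_per_tp']:.3f}  "
        f"FP share={r['fp_share_of_positives']:.3f}  "
        f"FN/TN={r['fn_per_tn']:.3f}"
    )
```

The false positive ratio falls quickly as the share of real effects rises, while the false negative ratio climbs.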

This means the whole figure is nonsensical, because it is based on an untested assumption: the ratio of negatives to positives in hypothesis testing. It does the /math/ right, but it starts from a flimsy premise. These post-hoc power tests have been looked down upon for years; this is not how you use power.

What you use power for is to decide on an appropriate sample size for the effect size you are looking for. In other words, if you are going to test a fertilizer, and you only care if the tree growth difference is larger than a foot (this is the effect size), power calculations can help tell you what an appropriate sample size would be to detect that difference between your control and treatment, given the natural variability and the desired alpha (again, the probability of a Type I error, usually 0.05). You can then rest assured that, if you have properly estimated the inherent variability, the appropriate sample size will give you a good chance (your chosen power) of a significant p-value when the effect size is at least as large as the one you care about.
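As a rough illustration of that kind of calculation, here is the textbook normal-approximation formula for comparing two group means, n = 2 * ((z_alpha + z_power) * sd / difference)^2 per group. The standard deviation of 2 feet is a number I invented for the example, and `sample_size_per_group` is a hypothetical helper; a proper t-based calculation would give a slightly larger n.

```python
import math
from statistics import NormalDist


def sample_size_per_group(min_difference, sd, alpha=0.05, power=0.80):
    """Trees needed per group, normal approximation (hypothetical helper)."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # two-sided significance cutoff
    z_power = std_normal.inv_cdf(power)          # quantile for the desired power
    n = 2 * ((z_alpha + z_power) * sd / min_difference) ** 2
    return math.ceil(n)                          # round up to whole trees


# Assumed numbers: care about a 1-foot difference, growth sd of 2 feet.
n_trees = sample_size_per_group(min_difference=1.0, sd=2.0)
print(f"trees per group: {n_trees}")
```

Halving the difference you care about roughly quadruples the sample size, which is why you decide on the effect size before planting anything.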

This is turning into a tangent, but it must be said: a significant p-value is /meaningless/ without knowing the effect size. You could say that the difference between those two tree fertilizers is significant, but if the actual difference is only a change in inches, who gives a shit? When you see a significant p-value in a paper, always always always check what the actual difference is, what the units are, and if the difference even matters.
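Here is a small sketch of why. It holds a tiny difference fixed and only grows the sample, using a two-sample z-test with known variance. The 0.1-foot difference, the 2-foot standard deviation, and the function name `two_sided_p_value` are all assumptions for illustration.

```python
from statistics import NormalDist


def two_sided_p_value(difference, sd, n_per_group):
    """p-value for an observed mean difference, z-test (hypothetical helper)."""
    standard_error = sd * (2 / n_per_group) ** 0.5  # SE of a difference in means
    z = difference / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


p_values = {}
for n in (10, 100, 1000, 10000):
    # Same tiny observed difference every time; only the sample size changes.
    p_values[n] = two_sided_p_value(difference=0.1, sd=2.0, n_per_group=n)
    print(f"n per group={n:<6} p={p_values[n]:.4f}")
```

The difference never changes, yet with enough trees it becomes "significant". The p-value tells you the difference is probably not zero, not that it matters.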

And I think that's all for now. This message has been brought to you by the statistics software program R and the number 0.05.

ETA2: Oh, I think I just repeated what you said, except more complicated, and with more flailing at the end.