Journalist submits fake paper, passes peer review.

Mesozoic Mister Nigel · October 25, 2013, 12:23:32 AM

Quote from: Kai on October 24, 2013, 06:29:14 PM
Okay. I'm done with the butthurt. Here's some food for Germans.

http://www.economist.com/news/briefing/21588057-scientists-think-science-self-correcting-alarming-degree-it-not-trouble

Quote
Academic scientists readily acknowledge that they often get things wrong. But they also hold fast to the idea that these errors get corrected over time as other scientists try to take the work further. Evidence that many more dodgy results are published than are subsequently corrected or withdrawn calls that much-vaunted capacity for self-correction into question. There are errors in a lot more of the scientific papers being published, written about and acted on than anyone would normally suppose, or like to think.

Various factors contribute to the problem. Statistical mistakes are widespread. The peer reviewers who evaluate papers before journals commit to publishing them are much worse at spotting mistakes than they or others appreciate. Professional pressure, competition and ambition push scientists to publish more quickly than would be wise. A career structure which lays great stress on publishing copious papers exacerbates all these problems. "There is no cost to getting things wrong," says Brian Nosek, a psychologist at the University of Virginia who has taken an interest in his discipline's persistent errors. "The cost is not getting them published."

The whole article is on mistakes and falsehoods in scientific publishing, and why replications (which are a kind of post publication peer review) are absolutely necessary and not happening. And you know what? This DOES upset me. I accept completely that peer reviewed journals are going to slip up sometimes, that peer reviewers are going to fail, that mistakes and falsehoods are going to be published. It happens, it's going to continue to happen, there's not a damn thing anyone can do to eliminate it completely. Which is why follow ups are so damn important.

Maybe Science really /is/ broken/short circuit, and if it IS, then the broken part is that it's become like media. The entire point is to pour out stories, with not a bit of thought to questioning whether the stories that just got poured out were any good. THAT'S the supposed self correcting, and since we've been letting the journalists do it FOR us, the letters are still PR but pronounced "public relations" and not "peer review". This is disturbing. And I don't know fuck all I can do about it.

Also, I've been wondering who the hell that guy in the picture is.

Fantastic article, Kai! One of the questions that's been brought up somewhere around here is why negative findings are so rarely published, even though negative findings can stand to tell us more, more definitively, about a question than positive findings. They aren't sexy, they aren't speculative, but sometimes a solid "Nope!" (if you'll forgive me for the expression) can be more meaningful than a bright and shiny "Maybe".

I find it a bit troublesome that apparently not all scientists are required to take statistics. I admit that I hated statistics; that's no secret. I was bored to tears. But as time goes on I am finding that I am really really glad that I took them because it makes what I'm looking at make so much more sense when I'm trying to interpret and understand the results of a paper, including being able to look at powers and levels of significance and say "hmm, that is far too high an error rate with far too low an n for me to take these findings very seriously without a great deal of further investigation".

Mesozoic Mister Nigel · October 25, 2013, 12:35:45 AM

Quote from: Kai on October 24, 2013, 11:18:04 PM
Quote from: LMNO, PhD (life continues) on October 24, 2013, 10:28:05 PM
Wait. 100 "true things", 5% error rate...

Ah! Where's the rigor? Shouldn't we be testing more than once, if we have a known error rate?

Something is just not right about the middle part of that figure. It needs some Bayes-jutsu.

This is what it is saying: there are 1000 test cases. 10% are "yes". There are 100 true "yeses" and 900 true "nos". A power of .8 means 80% of the true "yeses" will be captured by the test. That means there will be 20 apparent "nos" that are really "yeses". There is also a .05 false positive rate. That means that out of 900 TRUE "nos", 45 will appear to be "yeses". False positives look exactly like true positives.

However, although their math works out just fine if they are talking about a 5% false positive rate, they seem to have confused confidence level with false positive rate. That is not what a .05 confidence level is. A .05 confidence level is a measure of how likely the test is to have produced data this far or farther from a no-change mean by chance alone. It is not an error rate. Therefore all their numbers are hopelessly borked and meaningless.

Kai · October 25, 2013, 01:54:40 AM

Quote from: Mrs. Nigelson on October 25, 2013, 12:35:45 AM
Quote from: Kai on October 24, 2013, 11:18:04 PM
Quote from: LMNO, PhD (life continues) on October 24, 2013, 10:28:05 PM
Wait. 100 "true things", 5% error rate...

Ah! Where's the rigor? Shouldn't we be testing more than once, if we have a known error rate?

Something is just not right about the middle part of that figure. It needs some Bayes-jutsu.

This is what it is saying: there are 1000 test cases. 10% are "yes". There are 100 true "yeses" and 900 true "nos". A power of .8 means 80% of the true "yeses" will be captured by the test. That means there will be 20 apparent "nos" that are really "yeses". There is also a .05 false positive rate. That means that out of 900 TRUE "nos", 45 will appear to be "yeses". False positives look exactly like true positives.

However, although their math works out just fine if they are talking about a 5% false positive rate, they seem to have confused confidence level with false positive rate. That is not what a .05 confidence level is. A .05 confidence level is a measure of how likely the test is to have produced data this far or farther from a no-change mean by chance alone. It is not an error rate. Therefore all their numbers are hopelessly borked and meaningless.

Actually, they're right about that. Alpha is not only the p-value threshold needed for significance, it is also the chance of a Type I error. That is, if the null hypothesis is true, it is the fraction of replications in which you would expect, by chance alone, the estimate of the parameter (usually the mean) to fall outside that interval and be considered significant. And...I think I just explained it.
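For anyone who wants to see the "fraction of replications" idea in action, here is a minimal simulation sketch. It assumes a simple two-sided z-test on a normally distributed sample with a known standard deviation of 1, which is an illustrative setup and not anything taken from the article. The function name, sample size, replication count, and seed are all made up for the example.

```python
import random
from statistics import NormalDist, mean


def type_one_error_rate(alpha=0.05, n=20, replications=4000, seed=1):
    """Fraction of replications that reject a TRUE null (mean 0, sd 1)."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    std_error = 1 / n ** 0.5  # sd is assumed known (1), so this is a z-test
    rejections = 0
    for _ in range(replications):
        # Every sample is drawn with the null hypothesis actually true.
        sample = [rng.gauss(0, 1) for _ in range(n)]
        z = mean(sample) / std_error
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
        if p_value < alpha:
            rejections += 1  # a false positive, by construction
    return rejections / replications


rate = type_one_error_rate()
print(f"rejected a true null in {rate:.1%} of replications")
```

The printed rate should land close to alpha, since every rejection here is a Type I error by construction.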

The problem is, the article words the situation weirdly. Instead, it should be like this.

1. Of 1000 hypotheses, perhaps 100 are ones where the null hypothesis really should be rejected in favor of the alternative.

2. Given an alpha of 0.05, we expect 1/20th of those tests that should not have rejected the null hypothesis to actually do so. 1/20th of 900 is 45. These are the false positives. Given a power of 0.8 [I'm not sure how they got a Power of 0.8, I just have to take their word for it, since calculating Power by hand is complicated, since calculating beta is complicated, and Power is 1 minus beta.], our beta is 0.2, which means 20% of the time when we /should have/ rejected the null hypothesis, we will not. This is the Type II error, and means that 0.2*100 = 20 negative tests are actually false negatives; they should have rejected the null hypothesis.

3. If researchers only publish positive test results, that means that by chance alone there will be 9 false positives published for every 16 true positives (45 to 80), a ratio of more than half. The ratio of false negatives to negatives is .02, which means the random chance of a false negative is much lower than that of a false positive, if all tests are equally published.
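The three steps above can be tallied in a few lines. This is only a sketch of the arithmetic, using the article's numbers (1000 hypotheses, 100 real effects, alpha 0.05, power 0.8). The function and key names are invented for illustration.

```python
def outcome_counts(n_real=100, n_null=900, alpha=0.05, power=0.80):
    """Expected counts of each outcome for a batch of hypothesis tests."""
    return {
        "true_positives": power * n_real,         # real effects detected
        "false_negatives": (1 - power) * n_real,  # real effects missed (Type II)
        "false_positives": alpha * n_null,        # true nulls rejected (Type I)
        "true_negatives": (1 - alpha) * n_null,   # true nulls kept
    }


counts = outcome_counts()
for name, value in counts.items():
    print(f"{name}: {value:.0f}")

fp_to_tp = counts["false_positives"] / counts["true_positives"]
fn_to_all_negatives = counts["false_negatives"] / (
    counts["false_negatives"] + counts["true_negatives"]
)
print(f"false positives per true positive: {fp_to_tp:.2f}")
print(f"false negatives among all negative results: {fn_to_all_negatives:.2f}")
```

The first ratio is the 9-to-16 figure from step 3, and the second is the roughly .02 figure for negatives.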

Now, I have a few problems with this. The first is that people are generally not interested in detecting sameness, they are interested in detecting differences. At least in hypothesis testing. And furthermore, those alpha and beta values? They are /tailored/ to a high standard of detecting differences. We could easily design a test where the rate of false negatives is higher, and all we have to do is decrease the alpha value. Make it tiny. Make it small enough, and the false negative level will skyrocket. (ETA: The 0.05 alpha itself is mostly a convention. The tests it usually gets applied with lean on something called the central limit theorem, which has to do with central tendencies of variability and the rareness of extreme values, and they assume data are normally distributed. They aren't always.)

But here's the main problem, and that's the premise of assuming a very uneven ratio of negative to positive results. People do not run around testing hypotheses at random. It is, frankly, a waste of time. When the article assumes a 9 to 1 ratio of negatives to positives, it is exactly that, an assumption. What if we make it 50:50? Well then, 1/20th of 500 is 25 false positives, and 0.2 times 500 is 100 false negatives. Which makes the ratio of false positives to true positives 25:400 or 0.06, and the ratio of false negatives to true negatives 100:475 or 0.2. That 0.06 is a HELL of a lot lower than more than half.
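To see how much the answer hangs on that starting assumption, here is a small sketch that holds alpha at 0.05 and power at 0.8 and only varies the share of tested hypotheses that are real effects. The function name and the 0.25 middle value are illustrative choices, not figures from the article.

```python
def false_to_true_positive_ratio(real_effect_share, alpha=0.05, power=0.80):
    """False positives per true positive for a given share of real effects."""
    false_positives = alpha * (1 - real_effect_share)
    true_positives = power * real_effect_share
    return false_positives / true_positives


for share in (0.10, 0.25, 0.50):
    ratio = false_to_true_positive_ratio(share)
    print(f"{share:.0%} real effects: {ratio:.2f} false positives per true positive")
```

With 10% real effects the ratio is the 9:16 case, and at 50:50 it drops to 25:400.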

This means the whole figure is nonsensical, because it is based on an untested assumption, namely the ratio of negatives to positives in hypothesis testing. It does the /math/ right, but it starts from a flimsy premise. These post-hoc power tests have been looked down upon for years; this is not how you use power.

What you use power for is to decide on an appropriate sample size for the effect size you are looking for. In other words, if you are going to test a fertilizer, and you only care if the tree growth difference is larger than a foot (this is the effect size), power calculations can help tell you what an appropriate sample size would be to detect that difference between your control and treatment, given the natural variability and the desired alpha (again, the probability of a Type I error, usually 0.05). You can then rest assured that, if you have properly estimated the inherent variability, the appropriate sample size will give you a good chance (your chosen power) of a significant p-value when the effect size is as large as you would want it.
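As a rough sketch of what such a calculation looks like, the snippet below uses the standard normal-approximation formula for comparing two group means, n per group = 2 * ((z_alpha + z_power) * sd / effect)^2. The 1.5-foot standard deviation is a made-up number for the fertilizer example, and real software (R's power.t.test, for instance) uses the t distribution and will give slightly larger answers.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(effect_size, sd, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample comparison of means."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_power = std_normal.inv_cdf(power)          # quantile for the desired power
    return ceil(2 * ((z_alpha + z_power) * sd / effect_size) ** 2)


# Hypothetical inputs: detect a 1 ft growth difference, assumed sd of 1.5 ft.
trees_needed = sample_size_per_group(effect_size=1.0, sd=1.5)
print(f"trees per group: {trees_needed}")
```

Halving the effect size you care about roughly quadruples the sample you need, which is why the effect size has to be decided before the experiment and not after.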

This is turning into a tangent, but it must be said: a significant p-value is /meaningless/ without knowing the effect size. You could say that the difference between those two tree fertilizers is significant, but if the actual difference is only a change in inches, who gives a shit? When you see a significant p-value in a paper, always always always check what the actual difference is, what the units are, and if the difference even matters.
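A quick way to convince yourself: the sketch below runs a simple two-sample z-test on the same tiny difference at two sample sizes. All the numbers are invented for illustration (a 0.1 ft difference, an assumed sd of 1.5 ft), and the z-test stands in for whatever test a real paper would use.

```python
from statistics import NormalDist


def two_sample_z_p_value(mean_difference, sd, n_per_group):
    """Two-sided p-value for a difference in means, sd assumed known and equal."""
    std_error = sd * (2 / n_per_group) ** 0.5
    z = mean_difference / std_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


difference_ft = 0.1  # hypothetical growth difference, barely over an inch
sd_ft = 1.5          # hypothetical natural variability

p_small_study = two_sample_z_p_value(difference_ft, sd_ft, n_per_group=50)
p_huge_study = two_sample_z_p_value(difference_ft, sd_ft, n_per_group=10_000)
standardized_effect = difference_ft / sd_ft

print(f"n = 50 per group:     p = {p_small_study:.3f}")
print(f"n = 10000 per group:  p = {p_huge_study:.2g}")
print(f"standardized effect:  {standardized_effect:.3f}")
```

The difference is the same in both runs. Only the sample size changed, and with enough trees the p-value becomes tiny even though the effect is still about an inch.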

And I think that's all for now. This message has been brought to you by the statistics software program R and the number 0.05.

ETA2: Oh, I think I just repeated what you said, except more complicated, and with more flailing at the end.

Mesozoic Mister Nigel · October 25, 2013, 02:20:52 AM

Never mind, I am properly dizzied!

Mesozoic Mister Nigel · October 25, 2013, 02:21:22 AM

Quote from: Kai on October 25, 2013, 01:54:40 AM
ETA2: Oh, I think I just repeated what you said, except more complicated, and with more flailing at the end.

Ahhhh OK thanks, my head was spinning a bit there!

Kai · October 25, 2013, 02:35:11 AM

Quote from: Mrs. Nigelson on October 25, 2013, 02:21:22 AM
Quote from: Kai on October 25, 2013, 01:54:40 AM
ETA2: Oh, I think I just repeated what you said, except more complicated, and with more flailing at the end.

Ahhhh OK thanks, my head was spinning a bit there!

Sorry! My head was spinning too, trying to figure out the math. But you have the right of it; these proportions of error are not meant for determining after the fact what the possibility of statistical error is. Alpha, beta, and power are supposed to be used for individual hypothesis tests, not for judging the error rate of a large number of different tests, and are supposed to be computed before the test, not after.

The Good Reverend Roger · October 25, 2013, 02:36:30 AM

I have been gibbering monkey noises for the last 3 posts.

And I was once a math/physics major.

Kai · October 25, 2013, 02:42:27 AM

Quote from: Mrs. Nigelson on October 25, 2013, 12:23:32 AM
Quote from: Kai on October 24, 2013, 06:29:14 PM
Okay. I'm done with the butthurt. Here's some food for Germans.

http://www.economist.com/news/briefing/21588057-scientists-think-science-self-correcting-alarming-degree-it-not-trouble

Quote
Academic scientists readily acknowledge that they often get things wrong. But they also hold fast to the idea that these errors get corrected over time as other scientists try to take the work further. Evidence that many more dodgy results are published than are subsequently corrected or withdrawn calls that much-vaunted capacity for self-correction into question. There are errors in a lot more of the scientific papers being published, written about and acted on than anyone would normally suppose, or like to think.

Various factors contribute to the problem. Statistical mistakes are widespread. The peer reviewers who evaluate papers before journals commit to publishing them are much worse at spotting mistakes than they or others appreciate. Professional pressure, competition and ambition push scientists to publish more quickly than would be wise. A career structure which lays great stress on publishing copious papers exacerbates all these problems. "There is no cost to getting things wrong," says Brian Nosek, a psychologist at the University of Virginia who has taken an interest in his discipline's persistent errors. "The cost is not getting them published."

The whole article is on mistakes and falsehoods in scientific publishing, and why replications (which are a kind of post publication peer review) are absolutely necessary and not happening. And you know what? This DOES upset me. I accept completely that peer reviewed journals are going to slip up sometimes, that peer reviewers are going to fail, that mistakes and falsehoods are going to be published. It happens, it's going to continue to happen, there's not a damn thing anyone can do to eliminate it completely. Which is why follow ups are so damn important.

Maybe Science really /is/ broken/short circuit, and if it IS, then the broken part is that it's become like media. The entire point is to pour out stories, with not a bit of thought to questioning whether the stories that just got poured out were any good. THAT'S the supposed self correcting, and since we've been letting the journalists do it FOR us, the letters are still PR but pronounced "public relations" and not "peer review". This is disturbing. And I don't know fuck all I can do about it.

Also, I've been wondering who the hell that guy in the picture is.

Fantastic article, Kai! One of the questions that's been brought up somewhere around here is why negative findings are so rarely published, even though negative findings can stand to tell us more, more definitively, about a question than positive findings. They aren't sexy, they aren't speculative, but sometimes a solid "Nope!" (if you'll forgive me for the expression) can be more meaningful than a bright and shiny "Maybe".

I find it a bit troublesome that apparently not all scientists are required to take statistics. I admit that I hated statistics; that's no secret. I was bored to tears. But as time goes on I am finding that I am really really glad that I took them because it makes what I'm looking at make so much more sense when I'm trying to interpret and understand the results of a paper, including being able to look at powers and levels of significance and say "hmm, that is far too high an error rate with far too low an n for me to take these findings very seriously without a great deal of further investigation".

To get back to this: yes, negative findings can tell us things. The really important thing is to follow up on both positive and negative results, repeat experiments, and question the authority of the literature. It takes time, but it must be done.

As for statistics...the necessity of statistics is determined by how little variability your data have, and how large your effect size is. If your effect size is huge, and your variability is low, then statistics is pretty much unnecessary. You just /look/ at the thing. A lot of the time physicists don't use statistics. But biology, for example, is messy. There's a great deal of variability in biological systems, and the effect sizes are often small and still meaningful. Therefore, statistics is standard. In our PhD program, everyone is required to take at least one statistics course, sometimes multiple.

Mesozoic Mister Nigel · October 25, 2013, 03:27:18 AM

Quote from: Kai on October 25, 2013, 02:35:11 AM
Quote from: Mrs. Nigelson on October 25, 2013, 02:21:22 AM
Quote from: Kai on October 25, 2013, 01:54:40 AM
ETA2: Oh, I think I just repeated what you said, except more complicated, and with more flailing at the end.

Ahhhh OK thanks, my head was spinning a bit there!

Sorry! My head was spinning too, trying to figure out the math. But you have the right of it; these proportions of error are not meant for determining after the fact what the possibility of statistical error is. Alpha, beta, and power are supposed to be used for individual hypothesis tests, not for judging the error rate of a large number of different tests, and are supposed to be computed before the test, not after.

Cool, we're on the same page then. I actually didn't get to what was wrong with it until I started walking through it. Their math is right only if their logic is right, and their logic is wrong so it's all fucked.

Mesozoic Mister Nigel · October 25, 2013, 03:46:38 AM

Posting to remind me to post ITT tomorrow, when my brain decides to come online again.

Principia Discordia
