This is an overview page with metadata for this scholarly article. The full article is available from the publisher.
Widening the search for suspect data – is the flood of retractions about to become a tsunami?
Citations: 21
Authors: 2
Year: 2017
Abstract
'Unfortunately the statistical war against fraud is now over and the cheaters have won' 1. This pessimistic assertion is from a 1988 textbook by T. W. Körner, in which he discussed the statistical methods suggested by J. B. S. Haldane 2 for recognising possibly fraudulent data. He continued his remarks thus: "The kind of tests proposed by Haldane depended on the fact that 'higher order faking' required a great deal of computational work. The invention and accessibility of the computer means that the computational work involved has ceased to be a problem for the dishonest scientist" 1. However, three decades on, we are seeing a number of high-profile cases in which dishonest scientists, apparently unaware of Körner's 'advice', have been caught out faking results, exposed by the aberrant statistical distributions of their fraudulent data. As previously noted in an editorial 3 accompanying the exposure of one author's prolific body of fraudulent papers 4, the methods being used now to detect fraud are similar to Philip and Haldane's 1939 analysis of the subsequently discredited genetic experiments of Franz Moewus 5.

As a result of the recent cases, editors and other interested parties are now becoming far more aware of the potential for dishonest authors to submit fraudulent data. This follows a similar increase in awareness of the problem of plagiarism, and many editors are now taking a closer look at aspects such as data distributions as well as textual similarity. It could be argued that journals, editors and other bodies charged with the oversight of research have been slow to learn the lessons of history and to apply newer statistical methods to detect and analyse spurious or suspicious data, but this deficiency is now being addressed.

Anaesthesia as a specialty, and particularly the journal Anaesthesia, can rightly claim with vicarious pride that one of its own, John Carlisle, is at the forefront of this effort. Carlisle's first statistical exposé, involving data from the randomised controlled trials (RCTs) of Yoshitaka Fujii, made the research world stand up and take notice 4. After further refinement of the method 6, it was similarly applied to the RCTs of one of Fujii's regular collaborators, Yuhji Saitoh 7. Carlisle has now completed a further project of remarkable scale, with arguably even more important implications: an analysis of 5087 RCTs spanning eight journals and 16 years, published in this issue of Anaesthesia 8.

The method of Carlisle's analysis has been published 6 and explained in detail elsewhere 3. Briefly, in a properly conducted and accurately reported RCT, differences in baseline characteristics between groups are, by definition, due to chance. For this reason, reporting p values for demographic and other baseline data is usually discouraged: the p value is the probability of random sampling resulting in a difference as large as or larger than the observed difference, so, because we already know that differences in baseline characteristics occurred by chance, it is uninformative to calculate one. Carlisle, however, has developed and refined a novel use for the statistical analysis of baseline data: identifying instances where sampling in clinical trials may not have been random, suggesting the trial was either not properly conducted or was inaccurately reported. Essentially, Carlisle's method identifies papers in which the baseline characteristics (e.g. age, weight) exhibit a distribution either narrower or wider than would be expected by chance, resulting in an excess of p values close to either one or zero.

Possible causes for an anomalous distribution of baseline data include honest mistakes in the published information and fraudulently concocted data. A particularly extreme p value for the baseline data distribution in a single study brings the accuracy or veracity of that study into question. Less extreme outlying p values for individual studies might or might not indicate a problem, but if such 'moderately aberrant' p values are found across a number of RCTs by the same author(s), then an overall p value for all those studies combined can be calculated; if that combined value is particularly extreme, a serious problem with the data as a whole becomes likely. Using the Carlisle method, particularly extreme p values for baseline distributions, both in single published RCTs and across entire bodies of work, have been found for a number of authors involved in RCTs that have been retracted for fraud: Reuben; Boldt; Fujii; Saitoh (all anaesthetists); and Sato (an osteoporosis researcher) 4, 6, 9, 10. In the case of Fujii, publication of this statistical information became crucial evidence in bringing about the investigation of what is now one of the most infamous cases of scientific fraud in history.

Carlisle's latest analysis of RCTs across six anaesthetic and two general medical journals has found a much higher than expected proportion of outliers. Carlisle has calculated a p value for this maldistribution of p values, and it is vanishingly small, meaning that the body of RCTs in these journals is, essentially without question, corrupted by studies in which the differences in baseline data are not due to random sampling, as claimed. For some individual studies in the survey, the p values are also vanishingly small, meaning there is almost certainly a very serious problem with those particular studies' data, either through error or misconduct. Importantly, perhaps, for anaesthesia as a specialty, which until now might have been seen as having a particularly bad problem with fraudsters 9, 11, 12, Carlisle found the same statistical smoking gun in RCTs published by two major non-anaesthesia journals; we are therefore probably no worse than other medical specialties in this respect. We do not yet know, however, whether the problem of data fabrication is better or worse in medical compared with non-medical scientific research; in fields that do not involve random sampling or randomised trials, the Carlisle method will not give us an answer.

The Carlisle method can be used to screen newly submitted RCTs before publication, as already implemented by Anaesthesia from the beginning of this year. In his editorial 9 accompanying publication of the other recent Carlisle analysis 7, Klein recommended that all journals follow suit in an effort to prevent publication of fraudulent material. This strategy is effective at detecting aberrant data within an individual paper when the distribution is particularly extreme, and in at least 1–2% of the RCTs included in Carlisle's latest analysis this was indeed the case. For individual RCTs, however, there is a problem of both sensitivity and specificity, illustrated well, as it happens, by Carlisle's own data 8. For example, his arbitrary threshold of p < 0.0001 would have flagged only seven of 18 Fujii trials, all now retracted for fraud.
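To illustrate the distributional principle underlying the method: under genuinely random sampling, the p value for any baseline comparison is uniformly distributed between 0 and 1, so a collection of baseline p values that piles up near 0 or 1 is a red flag. The toy simulation below illustrates only this principle; it is not Carlisle's published implementation, which works from the summary statistics reported in each paper rather than raw data, and every parameter here is invented.

```python
# A minimal simulation (not Carlisle's published code) of the idea that,
# under genuinely random sampling, baseline-comparison p values are
# uniformly distributed on [0, 1]. Fabricated arms that are 'too similar'
# pile up near p = 1; arms 'too different' would pile up near p = 0.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def baseline_p(n=40, fabricated=False):
    """p value for comparing a baseline variable (e.g. age) between two arms."""
    a = rng.normal(50, 10, n)
    if fabricated:
        # A crude forger copies arm A with tiny noise, making the arms
        # far more alike than chance allows.
        b = a + rng.normal(0, 0.5, n)
    else:
        b = rng.normal(50, 10, n)
    return stats.ttest_ind(a, b).pvalue

honest = [baseline_p() for _ in range(200)]
faked = [baseline_p(fabricated=True) for _ in range(200)]

# Kolmogorov-Smirnov test against the uniform distribution: the honest
# collection is typically consistent with uniformity, while the fabricated
# one departs from it decisively.
print(stats.kstest(honest, "uniform"))
print(stats.kstest(faked, "uniform"))
```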
Using a less extreme threshold will substantially compromise specificity: a threshold of p < 0.01, for example, would incorrectly flag 1% of legitimate submissions. However, such a threshold could help detect fraud if an author submits more than one mildly outlying RCT to a single journal, assuming the journal keeps a record (13 of 18 Fujii trials might have been identified by increasing the sensitivity this way). A strategy of comparing an author's baseline data distributions across multiple submissions would not be of much help in the scheme of things unless all journals could share this information prospectively, something which is not at the moment considered appropriate 13. Carlisle himself has noted that, using his method, corrupted data 'might only be revealed on pooling of data from multiple trials by the same author' 8.

There is almost certainly a much bigger problem with the concept of screening all future RCTs using the Carlisle method. The reason baseline data in fabricated RCTs deviate from the expected distribution is probably that humans are very poor at simulating randomness, but if Carlisle's method is adopted routinely for screening, dishonest authors could employ techniques to produce data that would avoid detection. We believe this would be quite easy to achieve although, for obvious reasons, we prefer not to describe the likely methodology here. It already appears that people are using software to reword text in order to circumvent plagiarism detection tools 14, and there is no reason to think miscreants would not employ similar avoidance techniques when it comes to data. A stochastic arms race could quickly render the Carlisle method ineffective. Körner was of the same opinion in 1988: 'I believe that the determined cheat can now produce fraudulent data which, from the statistical point of view, are indistinguishable from the real thing' 15.

It is worth noting that fraudulent data might also be detected if they are not biologically plausible. Expanding on Haldane, Pandit pointed out that faking data so that distributions resemble random sampling is only 'lower order' fraud 3. The data also have to match what would be expected from certain laws of nature, such as the relative distributions of alleles or other characteristics, and this 'higher order' fraud is much more difficult to achieve. However, higher order data problems may be more difficult to detect unless one is looking very carefully indeed, and simulation of random sampling is probably all that would be required to avoid detection by the Carlisle method.

The implications of Carlisle's latest findings are far-reaching. He has produced very strong evidence for non-random sampling and/or significant errors in papers published by every journal surveyed. These will all need to be investigated. Some will involve simple enough error corrections, but where fraud is suspected, the investigation of individuals or groups will not be a minor undertaking and will potentially take several years of considerable effort, mostly by the editors concerned, for each and every case. The Committee on Publication Ethics provides useful guidelines in this regard 16. The consistency of the findings across journals suggests that all other journals are likely to be similarly affected.
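To make the pooling Carlisle describes above concrete: a few individually unremarkable baseline p values from trials by the same author can combine into a very extreme one. The sketch below uses Stouffer's method, one standard way of combining p values and in the spirit of, though not necessarily identical to, Carlisle's exact procedure; the per-trial values are invented.

```python
# A minimal sketch of pooling 'moderately aberrant' baseline p values
# from several RCTs by the same author. Individually, p = 0.02 or 0.04
# proves nothing; combined, they can become damning. Stouffer's method
# is one standard choice, not necessarily Carlisle's exact procedure.
from scipy.stats import combine_pvalues

per_trial = [0.02, 0.04, 0.03, 0.05, 0.01]  # hypothetical values
stat, pooled = combine_pvalues(per_trial, method="stouffer")
print(f"pooled p = {pooled:.2e}")  # far smaller than any single trial's p
```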
A strong argument could be made that every journal in the world now needs to apply Carlisle's method to all the RCTs they have ever published, and that all the data from these analyses should then be pooled somehow to cross-reference results from individual authors. Carlisle analysed 16 years of publications. Should there be a retrospective time limit, or should we keep going all the way back to the first RCTs? The amount of work involved is likely to become prohibitive very quickly but, with nearly 2% of papers in this analysis breaching Carlisle's arbitrary 1 in 10,000 probability threshold, the need clearly exists. Things tend to move slowly in publication, however, with some of Fujii's papers flagged as fraudulent remaining unretracted five years after that investigation was concluded. It could, therefore, be a long time before Carlislian analysis of journal back-catalogues is undertaken on a wider scale, particularly, perhaps, outside the specialty of anaesthesia. We note, however, that at least one non-anaesthesia researcher has now been successfully investigated in this way, further validating the method 10.

As already noted, we are concerned that the Carlisle method lacks the sensitivity and specificity required for prospective screening of newly submitted RCTs, and the possibility of circumvention remains, but if one in every 50 or so new papers continues to fall into the more extreme category then perhaps Klein is justified in exhorting all journals to follow suit. In order to implement this strategy, however, a very large number of people will need to familiarise themselves with the code and method for conducting a Carlisle analysis. Even if it can somehow be automated, the educated eye of a human will still be required to confirm any suspicions so raised, as is currently the case when screening for plagiarism and duplication.

In the first instance, the editors of each of the journals included in Carlisle's study urgently need to follow up the RCTs identified as most likely problematic, whether due to error or less innocent reasons, and publish corrections or retractions where necessary. Appendix 1 of Carlisle's paper contains the information required for the purpose 8. Carlisle has helpfully separated the data into individual worksheets for each journal, ordered such that the most potentially problematic papers appear at the top; each editor only has to work his/her way down the list. We cannot say at what point in the list editors should desist, and the journals will need to exercise their own discretion. One of the more nebulous aspects of all this, as discussed by Carlisle 8, is where to draw the line in the sand, but we have provided the number of papers for each journal breaching various one-sided Carlisle p thresholds (Table 1). While some papers clearly stand out, other papers that would not be flagged by Carlisle's arbitrary p < 0.0001 nevertheless have problems that clearly indicate a need for examination. We note one example, with a Carlisle p of 0.999 (Anesthesiology 2004, Vol. 101, 576–582), that reported numerically impossible baseline data: even with rounding, the proportion of women in two groups of 20 cannot be 28% and 31%, and a total cohort of 40 cannot comprise 29% women. Relevant stakeholders need to start considering the wider implications of this new analysis, as already discussed.
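The impossibility in that last example can be checked mechanically. The sketch below, which is in the spirit of the GRIM consistency test rather than part of Carlisle's method, checks whether a reported whole-number percentage is achievable for a given group size: with n participants, the only achievable percentages are 100k/n for integer counts k, so a reported figure is legitimate only if some k rounds to it.

```python
# Check whether a reported percentage is achievable from an integer count.
# Note: Python's round() uses banker's rounding, whereas a paper might
# round half up; for the three cases below the conclusion is the same.
def percent_possible(reported_pct, n, decimals=0):
    """True if some integer count k out of n rounds to the reported percentage."""
    return any(round(100 * k / n, decimals) == reported_pct for k in range(n + 1))

for pct, n in [(28, 20), (31, 20), (29, 40)]:
    verdict = "possible" if percent_possible(pct, n) else "impossible"
    print(f"{pct}% of n={n}: {verdict}")
# All three reported figures are impossible: n = 20 only yields multiples
# of 5%, and 29% of 40 would require a non-integer 11.6 women.
```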
In particular, if journals do put in place a prospective screening system, there will need to be a debate about the value of, and justification for, sharing the information so derived with other journals: for example, whether or not authors of studies falling outside an arbitrary probability threshold can be placed on some sort of shared watch-list. This, however, would require a change in position by authorities such as the Committee on Publication Ethics, and we encourage ongoing debate on the topic.

Finally, there should of course be some further scrutiny of those authors and groups Carlisle has already identified with more than one RCT outside the threshold of p < 0.01. As this threshold is also arbitrary and non-specific, public revelation of the information is probably not appropriate at this stage, but given that Reuben, Boldt, Fujii and Saitoh all met this particular criterion, further analyses of other studies by authors so identified are no doubt already in progress and will presumably be published if, or more probably when, additional concerns come to light. It is worth noting that Carlisle, using this criterion, has identified a further 21 authors of 65 potentially problematic trials, not including Reuben, Boldt and Fujii, by matching only the first or corresponding authors. A search within his database for additional authors (not first or corresponding) with more than one RCT falling outside p < 0.01 should be undertaken, since the possibility of fraud networks has recently been raised 7 (see the sketch below). Combining p values from all the trials of each included author also needs to be considered; Carlisle notes this would be more laborious 8, which is no doubt an understatement, but it appears that this method has already raised valid suspicion about at least one such individual 17.

It remains to be seen whether or not prospective application of Carlisle's method will prove useful in the longer term in helping to prevent publication of fraudulent material but, for the time being at least, we have the benefit of a new tool. Given the proven utility of applying the method to previously published studies, we have no doubt that more authors of already published RCTs will eventually be getting their tap on the shoulder. We have not yet heard the last word from John Carlisle!

JL is an editor of Anaesthesia and Intensive Care. No external funding or other competing interests declared.
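A minimal sketch of the suggested search across all author positions, as flagged above. The table layout, column names and values are hypothetical and do not reflect the format of Carlisle's appendix; both tails are flagged, since p values near 0 and near 1 are equally suspicious in this context.

```python
# Hypothetical sketch: find ANY author name (not just first or
# corresponding) attached to more than one RCT whose Carlisle p value
# falls outside the 0.01 thresholds at either tail.
import pandas as pd

trials = pd.DataFrame({
    "trial_id": [1, 2, 3, 4],
    "authors": [["A", "B"], ["B", "C"], ["C", "D"], ["B", "D"]],
    "carlisle_p": [0.004, 0.996, 0.500, 0.002],  # invented values
})

# Flag trials whose baseline p value is extreme at either tail.
aberrant = trials[(trials.carlisle_p < 0.01) | (trials.carlisle_p > 0.99)]

# Expand the author lists so every name is matched, then count how many
# distinct aberrant trials each name appears on.
counts = aberrant.explode("authors").groupby("authors").trial_id.nunique()
print(counts[counts > 1])  # here, author "B" appears on three aberrant RCTs
```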
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,200 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,051 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,416 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,410 citations