OpenAlex · Updated hourly · Last updated: 24.04.2026, 20:59

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

The “State of the Art” in MR Image Reconstruction? Knowledge, Culture, and What We Leave Behind in An Era of Big Data and Machine Learning

2026 · 0 Citations · Magnetic Resonance in Medicine
Open full text at the publisher

0 Citations
1 Author
Year: 2026

Abstract

This is an abridged version of a longer article. The unabridged version, which is available as Supporting Information, is recommended to readers who prefer more depth. That version also describes and discusses viewpoints offered by other experts in response to an earlier manuscript draft. The main impetus for this editorial is that, for the past several years, I have noticed that many MR image reconstruction researchers have started to define “state of the art” methods differently than I would. This may seem like a minor issue at first, but I think it actually represents something much deeper – an epistemological difference reflecting real (although potentially subconscious) divergences in values, priorities, and other aspects of culture. Even if it were purely a matter of semantics, I think the words we use can have larger consequences. Healthcare policymakers, consumers of MRI technology, and MRI trainees will generally not have the context to interpret “state of the art” in a nuanced way – they instead place faith in the experts from our community to provide them with clear, trustworthy, and balanced information. I believe that we have an important obligation to them and to ourselves to make the nuances transparent and help avoid downstream misinterpretations. Before delving into specifics, let me first step back to note some broader context. While I am presenting this discussion through the lens of a narrow subfield of MRI (i.e., image reconstruction), the core issue appears to be a symptom of larger cultural trends. As such, I suspect that my comments may resonate far beyond just MR image reconstruction. Specifically, many fields of science currently appear to be in a state of rapid metamorphosis, spurred by recent developments in data science and machine learning, including the unprecedented availability of large-scale datasets and major growth in open-source code sharing. Various influential voices [1, 2] have likened this moment in time to a “singularity,” a “precipice,” a “Cambrian explosion,” or a “Gutenberg revolution.” The shifts I've observed in the definition of “state of the art” within our field may reverberate from these larger developments. While rapid changes are exciting, they can also create certain tensions. For example, David Donoho [2] has eloquently said that: “We all get old and will be forgotten. This is also true of former science practices. As new empirical science is done using [these emerging new] principles, it will soon enough eclipse older science done by previous rules – even science that had survived for a very long time.” With that context in mind, another aim of this editorial is to identify (and hopefully facilitate preservation of) certain scientific, philosophical, and ethical principles that are dear to me, but which I fear may be at risk of being left behind – a casualty of the collision between longstanding scientific culture and the excitement of awe-inspiring new technologies. Let's quickly jump into the heart of my concern. If you look at recent issues of technical imaging journals, you will see many new methods that are each claimed to be the new “state of the art” in MR image reconstruction. Such statements are generally based on leaderboard results, where methods are ranked by quantitative performance via black-box comparisons on large public databases of raw k-space data – in alignment with the modern approaches to empirical science that Donoho was referring to. 
I have the impression that many segments of the community have begun to feel that black-box leaderboard rank is paramount, with the significance of a new method proportionate to its rank. I largely do not question leaderboard standings (my group has independently replicated similar quantitative performance results many times). However, in spite of (or perhaps because of) my extensive work on highly accelerated MRI over the past couple decades, I have a hard time viewing a method's leaderboard rank as the only thing that matters. Indeed, my personal reality is that in any medical situation, I would not want one of the current top-ranked highly-accelerated approaches from the leaderboard used on me or anyone I love. Instead, I would strongly prefer a minimally-accelerated acquisition with images reconstructed using well-characterized techniques, which often rank poorly on leaderboards. Why do I feel this way? It's not that I deny the potential of new approaches or the recent progress that has been achieved. Rather, it is because I remain unconvinced that the current leaderboard rankings will reliably translate to real-world applications when it matters. I am especially concerned about the reliability of modern leaderboard methods (which are often trained and evaluated to achieve good average performance measures across a collection of “typical” subjects) for people towards the edges of the statistical distribution – a group I suspect I belong to for multiple reasons (including a prior radiological false positive that nearly led to an unnecessary surgery). At the same time, I question whether we're still truly doing “imaging” when using leaderboard methods that rely heavily on opaque, population-derived priors – methods that may be prone to inserting definitive-looking features of “normalcy” when the measurements are inconclusive, and which may struggle with special cases or new situations (in contrast to methods grounded in transparent, measurement-driven inversion principles with more easily predicted capabilities, limitations, and failure modes). Indeed, although modern leaderboard methods frequently use strong, data-derived priors that are framed in Bayesian terms, classical Bayesian philosophy emphasizes that it is often better to adopt “uninformative” priors when prior knowledge is uncertain. For example, the late physicist and Bayesian thought-leader Edwin Jaynes expressed that a well-constructed prior “agrees with what is known, but expresses a ‘maximum uncertainty’ with respect to all other matters, and thus leaves a maximum possible freedom for our final decisions to be influenced by the subsequent sample data.” [3] Current leaderboard methods also often seem to diverge from another key Bayesian attitude (which is also related to the non-Bayesian principle of Occam's Razor) that “there must be evidence in the data for any structure which is seen.” [4] Heavy reliance on population-derived priors, especially when the priors are strong enough to produce hallucinations, could be of particular concern when we cannot know whether an individual case is an outlier – and this can be even more worrisome after observing that even some of the simplest MRI reconstruction problems have a discouragingly high degree of inherent ambiguity [5]. In essence, the concern is that some of these approaches may leave us imaging with eyes that are sometimes blind, but with the illusory faith that we can always see. 
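For readers less familiar with the terminology, the contrast drawn here maps onto the standard regularized (MAP-style) reconstruction formulation; the notation below is generic rather than taken from the article, with A the encoding operator (coil sensitivities and Fourier sampling), y the measured k-space data, and R(x) a regularizer playing the role of the negative log-prior:

$$ \hat{x} \;=\; \arg\min_{x}\; \frac{1}{2\sigma^{2}}\,\lVert A x - y \rVert_{2}^{2} \;+\; \lambda\, R(x), \qquad R(x) \approx -\log p(x). $$

When R is weak and largely uninformative (e.g., a small Tikhonov penalty $\lVert x \rVert_{2}^{2}$), the solution is driven almost entirely by the measured data, in the spirit of Jaynes' maximum-uncertainty priors; when R encodes a strong population-derived prior, the reconstruction in poorly measured regions can instead be dominated by what is typical of the training distribution, which is exactly the hallucination risk described above.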
Notably, popular modern leaderboard rankings generally only capture the first of these facets, while neglecting the others. This is concerning, because it means that current leaderboards largely ignore aspects of performance that I believe are essential for many imaging tasks. While focusing on a single dimension of performance may be convenient, it is also somewhat contradictory to conventional engineering practice. Engineers are trained to understand that achieving a good balance between multiple competing performance goals generally requires compromises and trade-offs. This is formalized in theoretical results such as “no free lunch” theorems [23], which remind us that no method can perform well universally across all possible scenarios – good performance in one regime generally implies poor performance in other regimes. Unfortunately, by placing so much attention on leaderboard performance and neglecting other facets, it is natural that we may lose sight of the sacrifices that are made in the pursuit of a high leaderboard rank – we know (hopefully) that our lunch was not free, but we may not understand the price that we paid for it. Specifically, I worry that we're putting ourselves in the situation illustrated in Figure 1, where a “state of the art” method may excel in one facet of imaging performance but is relatively poor in others. Personally, I'd much prefer the “jack-of-all-trades” in this example. Indeed, it feels like a bit of a step backwards if we call a method “state of the art” when its true resolving power – particularly with respect to previously unseen aspects of ourselves and our world – may actually be substantially blunted relative to earlier approaches. I, for one, want us to still be able to put arbitrary objects in the scanner – from humans, plants, and animals to alien lifeforms, ancient artifacts, and manufactured constructs – and have confidence that our state of the art imaging systems will produce images that depict reality as it actually exists, with a warning label when reconstructed features are derived more from population statistics than from the subject data. Importantly, these concerns are not completely unfounded, and there is evidence of methods from the NRMSE and SSIM leaderboards exhibiting undesirable behaviors. For example, leaderboard methods can have relatively poor performance on tests of image resolution [24, 25], can have a relatively blurry visual appearance despite good image quality metrics [26, 27], and can be prone to hallucinating false but convincing image structure [26, 28]. None of these behaviors should be surprising – indeed, they are a simple manifestation of an old adage known as Goodhart's law, which is commonly stated as “when a measure becomes a target, it ceases to be a good measure.” The practical interpretation of this law is that tuning a method's parameters to optimize a single performance metric can have unintended (and often undesirable) consequences. This issue is also fundamental to the leaderboard approach – the result of treating multi-objective optimization as scalar optimization – and is not easily addressed by simply searching for a better metric. In other words (and with apologies to the late statistician George Box whose scaffold I'm borrowing): all metrics are flawed, but some are useful.
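Because the leaderboard rankings discussed above typically reduce to scalar scores such as NRMSE and SSIM, a minimal sketch of how those scores are computed may be useful; the images and parameters below are placeholders for illustration only (a random array standing in for a reference image, with a blurred and a noisy variant standing in for two hypothetical reconstructions), not results from any leaderboard:

import numpy as np
from scipy.ndimage import gaussian_filter
from skimage.metrics import structural_similarity

rng = np.random.default_rng(0)
reference = rng.random((128, 128))                               # placeholder "ground truth"
blurred = gaussian_filter(reference, sigma=1.0)                  # loses fine structure
noisy = reference + 0.05 * rng.standard_normal(reference.shape)  # keeps structure, adds noise

def nrmse(ref, est):
    # Normalized root-mean-square error: lower is better.
    return np.linalg.norm(est - ref) / np.linalg.norm(ref)

for name, estimate in [("blurred", blurred), ("noisy", noisy)]:
    ssim = structural_similarity(reference, estimate,
                                 data_range=reference.max() - reference.min())
    print(f"{name}: NRMSE = {nrmse(reference, estimate):.3f}, SSIM = {ssim:.3f}")

Which variant scores better depends entirely on the images and the metric chosen, which is part of the point: collapsing a reconstruction's behavior into one scalar is the multi-objective-to-scalar reduction, and the Goodhart's-law exposure, that the passage above warns about.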
If we focus our attention on understanding how well a method performs in “easy/typical” scenarios, we may become insensitive to the fact that all methods will fail if we push them hard enough – and we won't necessarily know how far we can trust a method unless we actively look for its failure points (see the sketch following this paragraph). An illustration of the problematic behavior that can be exhibited by “state-of-the-art” methods when idealized assumptions do not hold, reproduced from Reference [29], is shown in Figure 2. The need for skepticism about theoretical and experimental results, especially self-skepticism, is one of the keystones of traditional science. The late Nobel-laureate physicist Richard Feynman describes self-skepticism as a moral imperative: “It's a kind of scientific integrity, a principle of scientific thought that corresponds to a kind of utter honesty – a kind of leaning over backwards. For example, if you're doing an experiment, you should report everything that you think might make it invalid […] Details that could throw doubt on your interpretation must be given, if you know them. You must do the best you can – if you know anything at all wrong, or possibly wrong – to explain it. If you make a theory, for example, and advertise it, or put it out, then you must also put down all the facts that disagree with it” [30]. There is tension between these conventional scientific attitudes and modern leaderboard culture. Traditional science and engineering place a premium on theories and methods that have survived persistent attempts to make them fail in different ways, encouraging a creative exploration into how to break things so we can either feel reassured when our tests are met or learn from the wreckage when they are not. Leaderboard culture does just the opposite, rewarding a sandboxed approach in which a stagnant performance test is treated as infallible, regardless of the test's real-world limitations, with limited incentive to think outside the sandbox. The late chemistry professor and popular author Isaac Asimov expresses another traditional scientific ideal that appears to be somewhat waning in modern MRI reconstruction research – the need for independent testing. He writes that “No scientific observation is really allowed to enter the account books of science until it has been independently confirmed. The reason is that every observer and every instrument has built-in imperfections and biases so that, even assuming perfect integrity, the observation may be flawed. If another observer, with another instrument, and with other imperfections and biases, makes the same observation, then that observation has a reasonable chance of possessing objective truth” [31]. Importantly, Asimov is not describing the common modern approach to “reproducible research,” which involves code-sharing and data-sharing so that it is easy for any reader to exactly reproduce the results in a published paper. Data- and code-sharing can play a role in ensuring transparency and reducing the amount of expertise required for others to implement and build on existing work, and are integral to common leaderboard practices. However, applying the original code to the original data is not an “independent” test of the underlying concepts and principles, since it leaves all the original imperfections and biases intact.
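One concrete way to act on the “actively look for its failure points” attitude above is a simple stress sweep over increasingly difficult conditions. The sketch below is a generic illustration under assumed interfaces: reconstruct and make_undersampled_kspace are hypothetical placeholders for whatever method and sampling scheme are under evaluation, and the tolerance is arbitrary; it is not a procedure taken from the editorial.

import numpy as np

def nrmse(ref, est):
    # Normalized root-mean-square error of an estimate against a reference.
    return np.linalg.norm(est - ref) / np.linalg.norm(ref)

def stress_sweep(reconstruct, make_undersampled_kspace, reference,
                 accelerations=(2, 4, 6, 8, 12, 16), tolerance=0.10):
    # Push the method through increasingly aggressive undersampling and record
    # where its error first exceeds the tolerance, rather than reporting only
    # its score under a single "typical" condition.
    errors = {}
    for R in accelerations:
        kspace = make_undersampled_kspace(reference, acceleration=R)
        estimate = reconstruct(kspace)
        errors[R] = nrmse(reference, estimate)
        if errors[R] > tolerance:
            print(f"First failure at acceleration {R}: NRMSE = {errors[R]:.3f} > {tolerance}")
            break
    return errors

The same loop can be repeated with perturbed inputs (unusual anatomy, different contrasts, hardware imperfections) to map out where trust in a method should end.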
As such, modern leaderboard culture and corresponding sharing practices can implicitly discourage traditional independent testing, thereby weakening our understanding of principles, by instead encouraging the community to use the same standardized testing methods, the same standardized testing data, and even the same code implementations. (Note that I do not intend to discourage resource sharing, which I value and have done myself for many years – the concerns are about the downstream practices that can emerge in this ecosystem.) It is natural and appropriate that we celebrate apparent progress when it occurs, such as when a new method achieves a new high score. However, I think it's also important for us to be honest with ourselves about the limits of our knowledge, and be mindful that there can be an unfortunate human cost if we overvalue a method before we fully understand what it can and cannot do – this is not a video game where it's fine to single-mindedly pursue high scores without regard to the lives that are lost along the way. The danger in too easily anointing a method as the “state of the art” (particularly without qualifiers) is that such strong framing can encourage excessive confidence and unwarranted faith, and can dull our collective ability to discern the gap between fledgling optimism and sober reality. This can have severe consequences if these methods end up being deployed in high-stakes applications by people who take the superlatives at face value and do not recognize the potential pitfalls. The late astronomy professor and science communicator Carl Sagan expresses a similar view, writing that “It is the particular task of scientists, I believe, to alert the public to possible dangers, especially those emanating from science or foreseeable through the use of science. […] Clearly the warnings need to be judicious and not more flamboyant than the dangers require; but if we must make errors, given the stakes, they should be on the side of safety” [32]. An important cautionary example is the space shuttle Challenger disaster, in which the shuttle broke apart shortly after its launch on January 28, 1986 – killing all seven members of its crew, with the world watching live on TV. Richard Feynman was a member of the Commission tasked with investigating the disaster, and his account [33] provides valuable lessons and is worth reading for anyone interested in developing methods for life-altering applications. As Feynman describes, the disaster resulted from a confluence of major lapses. On one hand, the administrators and decision makers had unwarranted faith that the launch would be successful. On the other hand, while the shuttle engineers had specific concerns about the day of the launch (with its lower temperatures than previous launches) as well as persistent general concerns that certain components were not operating as desired, these concerns were not adequately communicated or taken seriously by higher-level decision makers. The consequence of this was the tragic and unnecessary loss of life and a major setback for the US space program. From a modern perspective, one can imagine that a data-driven machine learning model, equipped with information on past shuttle launch statistics but lacking an understanding of lower-level principles, may have been just as overconfident as the human decision-makers.
After all, the 24 previous shuttle launches had a 100% success rate before the disaster occurred. But the circumstances of that launch were unprecedented, and it would be a mistake to approach such decisions based purely on statistics or data while ignoring the underlying principles, especially when the principles point to the risk of a catastrophic failure. I can understand that it is not always easy to be open and critical about the limitations of an approach, particularly one that is obtained after a great deal of time and effort, and whose reception can have a major impact on its future. But it is an obligation that we all share, and it is important that methods are presented with appropriate caveats when their developers have concerns. If we fail to speak up when it matters, then we become complicit in the consequences. As Feynman wrote, “For a successful technology, reality must take precedence over public relations, for nature cannot be fooled.” I think it is important that our community reflects on our values and priorities, including how we approach research problems and how we communicate results to each other and to the broader public. I also want to emphasize that imaging performance is multifaceted, and that the very definition of the “state of the art” is not so straightforward. It is perhaps impossible to make a purely objective comparison of different methods, especially when each approach necessarily represents a compromise between competing objectives. As such, I would like to encourage the use of descriptions that are more measured and nuanced, with an emphasis on understanding both capabilities and limitations. I also want to leave room for methods that rank poorly with respect to common leaderboard metrics but may be valuable in other respects. This editorial is an experiment, and I am curious to see the outcome. I have been discussing these issues behind closed doors with many people for some time, but have started to recognize that it is perhaps better for the conversation to be public – an open discussion that the community can engage with, and a record that will remain available over time. Whether you agree or disagree with the views expressed here, I welcome and look forward to your responses, and hope that the pages of the journal may be a good forum for the discussion.

Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study. The publisher is not responsible for the content or functionality of any supporting information supplied by the authors. Any queries (other than missing content) should be directed to the corresponding author for the article.

Related works

Authors

Institutions

Topics

Advanced MRI Techniques and Applications · Artificial Intelligence in Healthcare and Education · Radiomics and Machine Learning in Medical Imaging
Open full text at the publisher