This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Medical Vision-Language Pre-training with Multimodal Variational Masked Autoencoder for Robust Medical VQA
Citations: 0
Authors: 8
Year: 2025
Abstract
Medical Visual Question Answering (Medical VQA) plays an important role in medical informatics. However, the robustness of existing medical VQA models is severely challenged by adversarial attacks. Current methods (e.g. adversarial training and noise-based reasoning) heavily rely on additional data or complex procedures and often ignore model-level robustness. To address these issues, we propose Multimodal Variational Masked Autoencoder (MVMAE), a novel pre-training framework designed to enhance the robustness of the medical VQA task. MVMAE leverages masked modeling and variational inference to extract robust multimodal features. The framework introduces a low-cost multimodal bottleneck fusion module and employs reparameterization to sample robust latent representations, ensuring effective feature fusion and reconstruction. Extensive experiments on public medical VQA datasets demonstrate that MVMAE significantly improves resistance to various adversarial attacks and outperforms other medical multimodal pre-training methods.
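The abstract mentions that MVMAE "employs reparameterization to sample robust latent representations." The paper's actual implementation is not shown here; as an illustrative sketch only, the standard reparameterization trick from variational inference draws a latent sample as z = μ + σ·ε with ε ~ N(0, I), so the randomness is isolated in ε and gradients can flow through μ and σ. The function and variable names below are hypothetical and not taken from the paper:

```python
import numpy as np

def reparameterize(mu, log_var, rng=None):
    """Illustrative reparameterization trick: z = mu + sigma * eps.

    The stochastic draw is confined to eps ~ N(0, I), so during
    training gradients can propagate through mu and log_var.
    (Sketch only; not the MVMAE authors' code.)
    """
    if rng is None:
        rng = np.random.default_rng(0)
    sigma = np.exp(0.5 * log_var)      # log-variance -> standard deviation
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

# Toy example: a 2-dimensional latent drawn from a hypothetical
# fused multimodal posterior with mean mu and unit variance.
mu = np.array([0.5, -1.0])
log_var = np.zeros(2)                  # sigma = 1
z = reparameterize(mu, log_var)
```

In a VAE-style framework such as the one described, μ and log σ² would be produced by the fusion encoder, and the sampled z would be fed to the reconstruction decoder.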
Similar works
Epidemiological and clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in Wuhan, China: a descriptive study
2020 · 22,609 citations
La certeza de lo impredecible: Cultura Educación y Sociedad en tiempos de COVID19
2020 · 19,271 citations
A Multi-Modal Distributed Real-Time IoT System for Urban Traffic Control (Invited Paper)
2024 · 14,253 citations
UNet++: A Nested U-Net Architecture for Medical Image Segmentation
2018 · 8,498 citations
Review of deep learning: concepts, CNN architectures, challenges, applications, future directions
2021 · 7,114 citations