NLM DIR Seminar Schedule

Scheduled Seminars on April 28, 2026

Speaker: Niccolo Marini
PI/Lab: Sameer Antani
Time: 11 a.m.
Presentation Title: From Unimodal Datasets to Multimodal Foundation Models: Synthetic Clinical Notes for Dermatology AI
Location: Hybrid
In-person: Building 38A/B2N14 NCBI Library or Meeting Link

Contact NLMDIRSeminarScheduling@mail.nih.gov with questions about this seminar.

Abstract:

Foundation models and Large Language Models (LLMs) have recently driven significant advances in biomedical AI, achieving strong performance on clinical tasks such as image classification, report generation, and decision support. In dermatology, multimodal (MM) systems that jointly process images and clinical text are particularly promising, as they enable richer representations and support flexible applications like zero-shot classification and cross-modal retrieval.
However, building such systems requires large paired image-text datasets, which are scarce in dermatology. Most publicly available datasets are unimodal, pairing images with structured labels or sparse metadata rather than descriptive clinical notes. The few existing large-scale image-text collections are mostly scraped from the internet and contain noisy, unreliable content. LLMs offer a potential solution, but their tendency to hallucinate clinically inaccurate information makes naive text synthesis unreliable for MM training.
This line of research aims to close this gap by introducing a framework that converts unimodal dermatology datasets into multimodal image-text pairs without requiring manual annotation or pairing. The framework synthesizes clinical notes to pair with real dermatology images by combining existing unimodal datasets, structured metadata, and LLMs, accelerating the development of multimodal dermatology models in a scalable way.
The framework involves two stages: 1) synthetic note generation, focusing on methods that reduce hallucinations and yield a controlled set of image-text pairs; 2) application of those pairs to develop foundation models, focusing on two possible applications (i.e., diffusion models for image synthesis and a teacher-student architecture that exploits unpaired samples).
Models trained within the framework consistently outperform state-of-the-art medical foundation models on cross-modal retrieval and zero-shot classification when evaluated across fifteen dermatology datasets, including nine external benchmarks and over 37,000 samples. These results demonstrate that carefully controlled synthetic text is an effective bridge across modalities, offering a practical path toward robust dermatology foundation models even in the absence of large real paired datasets.