Medical AI models are only as good as the data they learn from. Whether you’re building a breast cancer detection system, a clinical NLP pipeline, or a radiology report generator, choosing the right training corpus saves months of data work and produces models that actually generalize in clinical settings. Here’s our curated list of the 12 leading medical AI training datasets going into 2026, including size, license, and what each is best suited for.
| # | Dataset | Size/Scale | License | Use Case |
|---|---------|------------|---------|----------|
| 1 | 3D Mammogram Dataset | Curated DBT scans | Commercial | Breast cancer detection (DBT AI) |
| 2 | NIH Chest X-ray14 | 112,120 images | CC0 | Chest pathology classification |
| 3 | MIMIC-CXR | 227,827 studies | PhysioNet | Radiology report generation |
| 4 | The Cancer Imaging Archive (TCIA) | 30M+ images | CC BY 3.0 | Oncology imaging research |
| 5 | CBIS-DDSM (Mammography) | 2,620 cases | CC BY 3.0 | 2D mammogram classification |
| 6 | VinDr-Mammo | 5,000 exams | PhysioNet | Breast finding detection |
| 7 | MIMIC-IV | ~300K patients | PhysioNet | Clinical NLP & predictive models |
| 8 | BraTS (Brain Tumor Segmentation) | 2,000+ MRI scans | CC BY 4.0 | Brain tumor segmentation |
| 9 | Retinal Fundus (DRIVE/STARE) | ~450 images | Research | Retinal vessel segmentation |
| 10 | MedQA (USMLE) | 12,723 questions | MIT | Medical Q&A / LLM benchmarking |
| 11 | PubMedQA | 211,269 QA pairs | MIT | Biomedical NLP fine-tuning |
| 12 | HAM10000 (Skin Lesion) | 10,015 images | CC BY-NC 4.0 | Dermatology classification |
1. 3D Mammogram (DBT) Dataset
A collaboration between Advocate Health, iMerit, and Segmed, this 3D Mammogram Dataset consists of Digital Breast Tomosynthesis (DBT) scans. DBT is the gold-standard imaging modality that captures volumetric slice data, enabling AI models to detect subtle breast cancer lesions that 2D mammograms can miss entirely. This commercially available DBT dataset is intended for breast cancer AI development.
Each study is fully annotated by iMerit’s clinical team with findings including masses, calcifications, architectural distortions, and asymmetries. The data is structured for HIPAA and FDA compliance and is immediately usable in CADe/CADx pipelines, so teams avoid building a de-identification pipeline from scratch.
2. NIH Chest X-ray14
Released by the National Institutes of Health, this dataset contains over 112,000 frontal-view chest X-rays from more than 30,000 unique patients, labeled across 14 thoracic disease categories including pneumonia, atelectasis, effusion, and cardiomegaly. Labels were generated via NLP mining of radiology reports. It remains one of the most widely cited benchmarks for chest pathology classification and is freely available in the public domain. Teams use it for pretraining image encoders, multi-label classification tasks, and evaluating zero-shot medical vision models.
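The NIH labels arrive as pipe-separated strings per image, so multi-label training starts by converting them to multi-hot vectors. A minimal sketch of that step, assuming the standard `Data_Entry_2017.csv` column names and using toy rows in place of the real file:

```python
import csv, io

# The 14 thoracic disease labels used in NIH Chest X-ray14.
LABELS = ["Atelectasis", "Cardiomegaly", "Consolidation", "Edema", "Effusion",
          "Emphysema", "Fibrosis", "Hernia", "Infiltration", "Mass",
          "Nodule", "Pleural_Thickening", "Pneumonia", "Pneumothorax"]
IDX = {name: i for i, name in enumerate(LABELS)}

def to_multi_hot(finding_labels: str) -> list:
    """Convert a pipe-separated 'Finding Labels' string to a 14-dim multi-hot
    vector. 'No Finding' maps to the all-zero vector."""
    vec = [0] * len(LABELS)
    for name in finding_labels.split("|"):
        if name in IDX:  # skips 'No Finding' and any unknown label
            vec[IDX[name]] = 1
    return vec

# Toy rows mimicking the CSV layout (hypothetical image names).
sample = ("Image Index,Finding Labels\n"
          "00000001_000.png,Cardiomegaly|Effusion\n"
          "00000002_000.png,No Finding\n")
rows = list(csv.DictReader(io.StringIO(sample)))
vecs = [to_multi_hot(r["Finding Labels"]) for r in rows]
```

These vectors feed directly into a multi-label loss such as per-class binary cross-entropy.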
3. MIMIC-CXR
MIMIC-CXR, maintained by MIT’s Laboratory for Computational Physiology via PhysioNet, pairs chest X-ray images with free-text radiology reports, making it the premier resource for training report generation and clinical NLP models. With 227,000+ radiograph studies spanning multiple views per patient, it supports multi-modal learning at scale. The MIMIC-CXR-JPG subset adds structured labels derived from the reports. Access requires completing PhysioNet’s credentialing process, limiting casual use but ensuring responsible data stewardship.
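Report-generation training needs (images, report) pairs grouped by study. A sketch of that grouping, assuming MIMIC-CXR’s published layout where images sit under a per-study `s…` directory and each report is a sibling `s….txt` file (the paths below are illustrative, not real records):

```python
from pathlib import PurePosixPath
from collections import defaultdict

def pair_studies(image_paths, report_paths):
    """Group image files by study ID (the 's…' directory name) and attach the
    matching free-text report, if one exists. Layout assumption: images under
    files/pXX/pXXXXXXXX/sXXXXXXXX/ and reports named sXXXXXXXX.txt."""
    reports = {PurePosixPath(r).stem: r for r in report_paths}
    studies = defaultdict(list)
    for p in image_paths:
        study_id = PurePosixPath(p).parent.name  # e.g. 's50414267'
        studies[study_id].append(p)
    return {sid: {"images": imgs, "report": reports.get(sid)}
            for sid, imgs in studies.items()}

# Hypothetical paths illustrating the expected structure.
imgs = ["files/p10/p10000032/s50414267/view1.jpg",
        "files/p10/p10000032/s50414267/view2.jpg"]
reps = ["files/p10/p10000032/s50414267.txt"]
pairs = pair_studies(imgs, reps)
```

Grouping by study (rather than by image) matters because a single report describes all views in the study.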
4. The Cancer Imaging Archive (TCIA)
TCIA is an NCI-funded repository housing over 30 million medical images across dozens of cancer types and modalities: CT, MRI, PET, digital pathology, and more. Each collection comes with associated clinical data and metadata, enabling multi-modal oncology AI research. Notable collections include TCGA-LUAD (lung adenocarcinoma), LIDC-IDRI (lung nodule detection), and NLST (lung cancer screening CT). TCIA’s breadth makes it indispensable for teams developing cancer screening, staging, and treatment response models. Its permissive CC BY license allows broad academic and commercial use with attribution.
5. CBIS-DDSM (Curated Breast Imaging Subset)
CBIS-DDSM is a curated, standardized update to the classic DDSM dataset, the original benchmark for mammography AI research. It includes 2,620 scanned film mammography studies with expert-verified ROI annotations for masses and calcifications, including pathology labels (benign/malignant) confirmed by biopsy. CBIS-DDSM is the standard starting point for 2D mammography classification research and remains a key baseline benchmark even as 3D (DBT) datasets like iMerit’s grow in importance. Teams often train on CBIS-DDSM first, then fine-tune on higher-quality 3D data for production deployments.
6. VinDr-Mammo
Released in 2022 by VinBigData and available via PhysioNet, VinDr-Mammo contains 5,000 full-field digital mammography examinations sourced from two major hospitals in Vietnam. Each study is annotated by a panel of experienced radiologists with findings including masses, calcifications, asymmetries, and architectural distortions, along with BI-RADS assessment categories. VinDr-Mammo is particularly valuable for detection model development and radiologist-level benchmark comparison. Its geographic diversity (Southeast Asian population cohort) makes it useful for evaluating generalization across ethnicities and scanner types.

7. MIMIC-IV (Clinical EHR)
MIMIC-IV is the fourth iteration of the Medical Information Mart for Intensive Care, covering ~300,000 ICU and ED patients at Beth Israel Deaconess Medical Center between 2008 and 2019. It includes structured EHR data (vitals, labs, medications, diagnoses, procedures) and unstructured clinical notes. MIMIC-IV powers a wide range of clinical AI tasks: mortality prediction, length-of-stay estimation, sepsis onset detection, and clinical NLP pretraining. Teams building healthcare LLMs routinely use it to ground models in real clinical language and reasoning patterns.
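Many of the prediction tasks above start from simple temporal features. As one concrete example, length of stay can be derived directly from admission and discharge timestamps; a minimal sketch, assuming MIMIC-style `YYYY-MM-DD HH:MM:SS` timestamps as found in the admissions table (the dates below are synthetic, matching MIMIC’s deliberate date shifting):

```python
from datetime import datetime

def length_of_stay_hours(admittime: str, dischtime: str) -> float:
    """Length of stay in hours from MIMIC-style timestamp strings.
    Format assumption: 'YYYY-MM-DD HH:MM:SS'."""
    fmt = "%Y-%m-%d %H:%M:%S"
    delta = datetime.strptime(dischtime, fmt) - datetime.strptime(admittime, fmt)
    return delta.total_seconds() / 3600.0

# Synthetic, date-shifted timestamps (MIMIC shifts all dates into the future).
los = length_of_stay_hours("2150-03-01 08:00:00", "2150-03-03 20:30:00")  # 60.5
```

Features like this, joined with vitals and labs, form the input to length-of-stay and mortality models.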
8. BraTS (Brain Tumor Segmentation Challenge)
BraTS is an annual challenge and growing dataset for brain tumor MRI segmentation, hosted under the RSNA/ASNR/MICCAI umbrella. It provides multi-parametric MRI scans (T1, T1ce, T2, FLAIR) with expert voxel-level annotations of tumor sub-regions: enhancing tumor, tumor core, and whole tumor. BraTS has become the canonical benchmark for medical image segmentation model evaluation; nearly every major architecture (U-Net, nnU-Net, Swin UNETR) reports BraTS results. Its permissive CC BY 4.0 license and consistent annotation protocol make it a reliable foundation for neuro-oncology AI development.
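BraTS results are reported with the Dice similarity coefficient per tumor sub-region. A minimal sketch of the metric on a toy 2D mask (real BraTS masks are 3D voxel volumes):

```python
import numpy as np

def dice(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Dice similarity coefficient between two binary masks:
    2 * |A ∩ B| / (|A| + |B|), with eps guarding the empty-mask case."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return float((2.0 * inter + eps) / (pred.sum() + target.sum() + eps))

# Toy 2D example; overlap is 2 pixels, each mask has 3 positives.
a = np.array([[1, 1, 0], [0, 1, 0]])
b = np.array([[1, 0, 0], [0, 1, 1]])
score = dice(a, b)  # 2*2 / (3 + 3) ≈ 0.667
```

In practice the metric is computed separately for enhancing tumor, tumor core, and whole tumor, then averaged across cases.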
9. Retinal Fundus Datasets (DRIVE & STARE)
DRIVE (Digital Retinal Images for Vessel Extraction) and STARE (Structured Analysis of the Retina) are the foundational benchmarks for retinal image analysis. Though relatively small by modern standards, their high-quality manual vessel annotations by multiple graders set the bar for segmentation model evaluation in ophthalmology AI. Both datasets are used to train and evaluate models for diabetic retinopathy screening, glaucoma detection, and vascular disease assessment. Modern teams often use them as fine-tuning targets after pretraining on larger, unlabeled fundus image corpora.
10. MedQA (USMLE-Style Q&A)
MedQA is a multiple-choice question dataset drawn from the United States Medical Licensing Examination (USMLE), covering clinical reasoning, pathophysiology, pharmacology, and diagnosis. It is the standard benchmark for evaluating medical language models; GPT-4, Med-PaLM 2, and nearly every clinical LLM have been evaluated against it. Teams use MedQA for both instruction tuning (teaching models to answer clinical questions in USMLE format) and capability evaluation. The English, Simplified Chinese, and Traditional Chinese variants also support multilingual medical AI research.
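Capability evaluation on MedQA typically reduces to exact-match accuracy over answer letters. A minimal scoring sketch (the answer lists are toy data, not real MedQA items):

```python
def accuracy(preds, golds):
    """Exact-match accuracy over multiple-choice answer letters (A-E),
    case-insensitive, as commonly used for MedQA evaluation."""
    assert len(preds) == len(golds), "prediction/gold lists must align"
    correct = sum(p.strip().upper() == g.strip().upper()
                  for p, g in zip(preds, golds))
    return correct / len(golds)

# Toy predictions vs. gold answers: 3 of 4 correct.
acc = accuracy(["A", "c", "B", "D"], ["A", "C", "E", "D"])  # 0.75
```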
11. PubMedQA
PubMedQA is a biomedical question-answering dataset constructed from PubMed abstracts, where questions are derived from research paper titles and answers are grounded in the abstract text. It includes 1,000 expert-labeled and 211,000 artificially generated QA pairs with yes/no/maybe answers and long-form reasoning. PubMedQA is a key fine-tuning resource for building LLMs that can reason over scientific literature, a critical capability for clinical decision support, systematic review automation, and drug discovery AI. Its MIT license makes it freely usable for both research and commercial fine-tuning.
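Because PubMedQA’s label space is just yes/no/maybe, evaluating a free-form LLM requires mapping generated text onto those three labels first. A heuristic sketch of that normalization (a simplifying assumption, not PubMedQA’s official scoring script):

```python
import re

def normalize_answer(text: str) -> str:
    """Map free-form model output onto PubMedQA's yes/no/maybe label space.
    Heuristic: take the first label word that appears; fall back to the
    uncertain class 'maybe' when none is found."""
    m = re.search(r"\b(yes|no|maybe)\b", text.lower())
    return m.group(1) if m else "maybe"

a1 = normalize_answer("Yes, the abstract supports this conclusion.")
a2 = normalize_answer("The answer is no.")
a3 = normalize_answer("The evidence is unclear.")
```

Word boundaries (`\b`) keep substrings like the “no” in “normal” from matching; a production harness would also constrain decoding or use structured output instead.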
12. HAM10000 (Human Against Machine – Skin Lesions)
HAM10000 (“Human Against Machine with 10,000 training images”) is a large collection of multi-source dermoscopic images of common pigmented skin lesions, released as part of the ISIC 2018 challenge. It covers seven diagnostic categories including melanoma, basal cell carcinoma, and benign keratosis. HAM10000 includes challenging lesions captured with varied dermoscopy techniques, skin tones, and body locations, making it a strong generalization benchmark. It remains the go-to dataset for dermatology AI classification models and has been used to demonstrate AI performance matching or exceeding board-certified dermatologists on specific tasks.
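HAM10000 is heavily imbalanced (benign nevi, `nv`, dominate the 10,015 images), so classification training usually applies class weighting. A minimal inverse-frequency sketch, using the dataset’s seven `dx` category codes and a toy label list in place of the real metadata:

```python
from collections import Counter

# HAM10000's seven diagnostic category codes (the 'dx' column).
CLASSES = ["akiec", "bcc", "bkl", "df", "mel", "nv", "vasc"]

def class_weights(labels):
    """Inverse-frequency weights: rare classes get proportionally larger
    weights, counteracting the dominance of benign nevi ('nv')."""
    counts = Counter(labels)
    total = len(labels)
    return {c: total / (len(CLASSES) * counts[c])
            for c in CLASSES if counts[c]}

# Toy label distribution mimicking the imbalance (not real frequencies).
toy = ["nv"] * 70 + ["mel"] * 20 + ["bcc"] * 10
w = class_weights(toy)
```

The resulting dictionary plugs into a weighted loss so the model is not rewarded for always predicting the majority class.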
Ready to Build With the Best Medical AI Training Datasets?
These 12 datasets cover the full spectrum of medical AI development, from imaging and radiology to clinical NLP and dermatology. For teams building breast cancer detection systems, iMerit’s 3D Mammogram Dataset provides the clinically annotated, commercially licensable DBT data that production-grade AI demands.
Request access: https://www.segmed.ai/resources/ai-for-impact-on-breast-cancer-dataset-form
