Evaluating generic and domain-specific large visual models for T staging of esophageal cancer using CT: a study of zero-shot performance and the impact of prompt engineering
Chinese Journal of Medical Physics (《中国医学物理学杂志》) [ISSN: 1005-202X / CN: 44-1351/R]
- Issue:
- 2025, Issue 11
- Page:
- 1532-1540
- Research Field:
- Medical artificial intelligence
- Publishing date:
Info
- Title:
- Evaluating generic and domain-specific large visual models for T staging of esophageal cancer using CT: a study of zero-shot performance and the impact of prompt engineering
- Author(s):
- ZHU Dabing1; GAO Wei2; LIN Yanghao1; LAI Wuhao3; LIANG Zhichao3; ZENG Xianyi2; DENG Xikai2; AN Jun3,4
- 1. Department of Radiology, Yuedong Hospital, the Third Affiliated Hospital of Sun Yat-sen University, Meizhou 514700, China; 2. China Unicom Digital Intelligent Medical Technology Co., Ltd., Guangzhou 511457, China; 3. Department of Cardiothoracic Surgery, Yuedong Hospital, the Third Affiliated Hospital of Sun Yat-sen University, Meizhou 514700, China; 4. Department of Cardiothoracic Surgery, the Third Affiliated Hospital of Sun Yat-sen University, Guangzhou 510630, China
- Keywords:
- esophageal cancer; large vision model; computed tomography; staging; prompt engineering; domain-specific model; zero-shot learning; diagnostic accuracy
- CLC number:
- R318; R735.1
- DOI:
- 10.3969/j.issn.1005-202X.2025.11.019
- Abstract:
- Background: Accurate T staging is critical for esophageal cancer therapy, but CT-based assessment has significant limitations. Large vision models (LVMs) hold promise, yet their zero-shot clinical diagnostic capability, without fine-tuning, remains unvalidated. Methods: Chest CT images from 98 patients with esophageal cancer and 50 normal controls were analyzed retrospectively. With radiologist consensus as the gold standard, the zero-shot T-staging performance of three LVMs (GPT-5, Gemini, and MedGemma) was evaluated using prompts of varying complexity. Results: GPT-5 exhibited the highest accuracy and stability. Significant biases were observed among the models: Gemini tended to over-stage, whereas MedGemma tended to under-stage. All models had difficulty identifying early-stage tumors, but structured prompts improved diagnostic performance for mid-to-late-stage lesions. Conclusion: LVMs have potential for zero-shot T staging, but their performance depends heavily on model choice and prompt design. The generic model GPT-5 shows superior zero-shot generalization. However, current model performance is not yet clinically viable, especially for early diagnosis. Future work should focus on fine-tuning with high-quality clinical data and developing standardized prompt frameworks.
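The over-staging and under-staging tendencies reported in the abstract can be quantified by comparing each predicted T stage with the reference stage on an ordinal scale. The following is a minimal sketch, not the authors' analysis code: the function name `staging_bias`, the `STAGE_ORDER` mapping, the use of "T0" to represent a normal control, and the sample labels are all assumptions made for illustration.

```python
from collections import Counter

# Ordinal encoding of T stages; "T0" stands in for a normal control (assumption).
STAGE_ORDER = {"T0": 0, "T1": 1, "T2": 2, "T3": 3, "T4": 4}


def staging_bias(reference, predicted):
    """Return accuracy plus over- and under-staging rates for paired labels."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    if not reference:
        raise ValueError("at least one case is required")

    counts = Counter()
    for ref, pred in zip(reference, predicted):
        # Positive difference: model staged higher than the reference.
        diff = STAGE_ORDER[pred] - STAGE_ORDER[ref]
        if diff > 0:
            counts["over"] += 1
        elif diff < 0:
            counts["under"] += 1
        else:
            counts["exact"] += 1

    n = len(reference)
    return {
        "accuracy": counts["exact"] / n,
        "over_staging_rate": counts["over"] / n,
        "under_staging_rate": counts["under"] / n,
    }


# Made-up labels for illustration only; these are not data from the study.
reference = ["T1", "T2", "T3", "T3", "T0"]
predicted = ["T2", "T2", "T4", "T3", "T0"]
result = staging_bias(reference, predicted)
print(result)
```

A model whose over-staging rate clearly exceeds its under-staging rate would match the pattern the abstract describes for Gemini, and the reverse would match the pattern described for MedGemma.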
Last Update: 2025-12-01