Abstract
Purpose
This study aimed to evaluate the diagnostic accuracy of three multimodal large language models (LLMs) in radiological image interpretation and to assess the impact of prompt engineering strategies and input conditions.
Methods
This study analyzed 67 radiological quiz cases from the Korean Society of Ultrasound in Medicine. Three multimodal LLMs (Claude 3.5 Sonnet, GPT-4o, and Gemini-1.5-Pro-002) were evaluated using six types of prompts (basic [without system prompt], original [specific instructions], chain-of-thought, reflection, multiagent, and artificial intelligence [AI]–generated). Performance was assessed across various factors, including tumor versus non-tumor status, case rarity, difficulty, and knowledge cutoff dates. A subgroup analysis compared diagnostic accuracy between imaging-only inputs and combined imaging-descriptive text inputs.
Results
With imaging-only inputs, Claude 3.5 Sonnet achieved the highest overall accuracy (46.3%, 186/402), followed by GPT-4o (43.5%, 175/402) and Gemini-1.5-Pro-002 (39.8%, 160/402). AI-generated prompts yielded superior combined accuracy across all three models, with improvements over the basic (5.5%, P=0.035), chain-of-thought (4.0%, P=0.169), and multiagent prompts (3.5%, P=0.248). The integration of descriptive text significantly enhanced diagnostic accuracy for Claude 3.5 Sonnet (46.3% to 66.2%, P<0.001), GPT-4o (43.5% to 57.5%, P<0.001), and Gemini-1.5-Pro-002 (39.8% to 60.4%, P<0.001). Model performance was significantly influenced by case rarity for GPT-4o (rare: 6.7% vs. non-rare: 53.9%, P=0.001) and by knowledge cutoff dates for Claude 3.5 Sonnet (post-cutoff: 23.5% vs. pre-cutoff: 64.0%, P=0.005).
Introduction
The emergence of large language models (LLMs) has marked a significant advancement in artificial intelligence and generated considerable interest due to their potential to transform medical practice [1,2]. LLMs, such as generative pretrained transformers (GPT) like ChatGPT (OpenAI), exhibit exceptional capabilities in natural language processing tasks, including clinical question answering, text summarization, and contextual analysis [3-9]. These models are trained on comprehensive datasets that incorporate scientific literature, medical publications, and diverse digital resources across multiple disciplines and languages [2,10,11].
Recent developments have introduced multimodal LLMs that extend beyond text analysis to include the interpretation of audio, visual, and video data [12,13]. These advanced systems have shown promising results in various medical applications, including diagnostic assessment, clinical documentation, and disease identification, achieving these outcomes without additional medical-specific training [12-16].
In radiology, LLMs have been successfully applied to data mining and structured reporting [17-19]. Recent technological progress has led to enhanced multimodal LLMs, including OpenAI's GPT, Anthropic's Claude, and Google's Gemini [20]. These updated versions demonstrate improved diagnostic accuracy compared to their previous iterations, underscoring the importance of continuous model development [21,22]. However, as new systems continue to emerge, a systematic evaluation of their performance and clinical utility remains essential to ensure proper implementation and to minimize potential risks of misuse. Notably, a previous study using New England Journal of Medicine Image Challenge cases suggested that LLMs could provide correct answers even without image inputs, and that their performance was influenced more by text input length than by image interpretation [21].
Therefore, a study was designed to examine the radiological imaging interpretation capabilities of three widely used multimodal LLMs using cases with multiple imaging inputs while minimizing clinical text information. The study utilized image-based diagnostic challenges with multiple-choice questions from the publicly available educational repository of the Korean Society of Ultrasound in Medicine (KSUM), which provides bi-monthly content to subscribers. Additionally, this study examined various factors affecting model performance, including the effects of prompt engineering, question types, case rarity, difficulty levels, and the knowledge cutoff dates of the LLMs.
Materials and Methods
Compliance with Ethical Standards
This study utilized publicly accessible educational datasets, for which institutional review board approval and informed consent were not required. The study was conducted in accordance with the MI-CLEAR-LLM guidelines for reporting research involving LLMs in medical imaging [23].
Data Collection
A total of 303 case discussion quizzes from the KSUM digital platform (https://www.ultrasound.or.kr/) between July 28, 2000 and November 25, 2024 were initially considered. To ensure a standardized assessment of diagnostic accuracy, 236 cases that did not utilize a multiple-choice format were excluded, leaving 67 quiz cases for final inclusion in this study (Fig. 1). A radiologist (T.H. with 3 years of experience in radiology) systematically extracted data, including imaging data, question content with multiple-choice options, imaging information, and reference answer descriptions. To focus on image analysis, clinical information for all cases was standardized to include only patient demographics (age and sex) and the chief complaint (e.g., "52-year-old female patient with left breast swelling"). In 42 cases that lacked image descriptions in the reference data, a radiologist (W.K.J. with 26 years of experience in radiology) composed case image descriptions while remaining blinded to the reference answers. Cases were classified by subspecialty, diagnostic category, and rarity based on the KSUM digital platform case discussion interface. Human performance metrics were established using KSUM subscriber response statistics, with difficulty levels stratified into quartiles based on correct response rates.
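As a minimal sketch of the quartile-based difficulty stratification described above (the accuracy values and column names below are hypothetical placeholders, not the actual KSUM subscriber data):

```python
# Sketch of quartile-based difficulty stratification; data are simulated for illustration.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
cases = pd.DataFrame({
    "case_id": np.arange(1, 68),                          # 67 included quiz cases
    "human_accuracy": rng.uniform(0.06, 0.88, size=67),   # subscriber correct-response rate per case
})
# Quartile 1 = lowest correct-response rate (most difficult); quartile 4 = easiest.
cases["difficulty_quartile"] = pd.qcut(cases["human_accuracy"], q=4, labels=[1, 2, 3, 4])
print(cases["difficulty_quartile"].value_counts().sort_index())
```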
Multimodal LLM Analysis
Three multimodal LLMs were evaluated: (1) GPT-4o (alias: 2024-11-20) (knowledge cutoff: October 2023; OpenAI, San Francisco, CA, USA), (2) Claude 3.5 Sonnet (alias: 2024-06-20) (knowledge cutoff: April 2024; Anthropic, San Francisco, CA, USA), and (3) Gemini-1.5-Pro-002 (alias: 2024-09-24) (knowledge cutoff: September 2024; Google, Mountain View, CA, USA). Application programming interfaces were used to access each model between December 1 and 24, 2024. Generation parameters were standardized with a temperature setting of 1.0, which previously demonstrated the highest accuracy [24]. Independent sessions were conducted for each case to avoid sequential bias. Performance evaluation included comparisons based on pre- and post-knowledge cutoff dates and assessments of accuracy across various factors (tumor versus non-tumor, rare versus non-rare cases, and difficulty levels). Accuracy was measured using responses from the first attempt, with JSON-formatted textual outputs obtained for analysis. To evaluate repeatability, the answering process was repeated across five distinct sessions.
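For illustration only, the sketch below shows how one case might be submitted through the OpenAI API with a temperature of 1.0 and a JSON-formatted response in a fresh session per call; the function names, JSON keys, question text, and file names are assumptions rather than the authors' code, and the other two models would be queried analogously through their own APIs.

```python
# Illustrative sketch (not the study's actual code): one independent GPT-4o call per case,
# temperature 1.0, JSON-formatted output, repeated across five sessions for repeatability.
import base64
import json
from openai import OpenAI  # assumes the official OpenAI Python SDK is installed

client = OpenAI()  # API key read from the OPENAI_API_KEY environment variable

def encode_image(path: str) -> dict:
    """Package a local image file as a base64 data-URL content part."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}

def query_case(question_text: str, image_paths: list[str], system_prompt: str | None) -> dict:
    """Send one quiz case in a fresh session and parse the JSON answer."""
    messages = []
    if system_prompt is not None:  # the basic condition omits the system prompt entirely
        messages.append({"role": "system", "content": system_prompt})
    messages.append({
        "role": "user",
        "content": [{"type": "text", "text": question_text}]
                   + [encode_image(p) for p in image_paths],
    })
    response = client.chat.completions.create(
        model="gpt-4o-2024-11-20",                # alias reported in the study
        temperature=1.0,                          # standardized generation parameter
        response_format={"type": "json_object"},  # JSON mode expects the word "JSON" in the prompt
        messages=messages,
    )
    return json.loads(response.choices[0].message.content)

# Five distinct sessions per case to assess repeatability (hypothetical question text and file name).
runs = [query_case("52-year-old female with left breast swelling ... "
                   "Answer in JSON with keys 'choice' and 'reason'.",
                   ["case01_image1.png"], None) for _ in range(5)]
```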
Prompt Engineering Protocol
The experimental protocol incorporated user prompts that consisted of structured question text (including the primary question and five multiple-choice options), imaging details (modality, plane, and acquisition parameters), and radiological images extracted from the KSUM case discussion database, without any supplementary instructions. To assess the influence of prompt engineering on diagnostic performance across the three multimodal LLMs, six distinct zero-shot system prompts were implemented based on previous studies [25,26]: (1) Basic prompt: the control condition without a system prompt. (2) Original prompt: contained specific instructions for radiological interpretation and diagnostic assessment. (3) Chain-of-thought prompt: included the instruction, "…Must use a chain-of-thought approach: clearly outline your reasoning step by step…". (4) Reflection prompt: contained the directive, "…Self-Reflection Process: To ensure accuracy and comprehensiveness, engage in a self-reflection process after generating the initial answer…". (5) Multiagent prompt: employed a multiagent workflow with instructions such as, "…MULTIPLE AGENT WORKFLOW ROLE: …Role 1: Clinical Context Analysis… Role 2: Radiologic Image Analysis…Role 3: Reflection and Chain-of-Thought Final Answer…". (6) Artificial intelligence (AI)–generated prompt: utilized Claude's prompt generation tool to create optimized prompt templates for specialized diagnostic tasks (https://console.anthropic.com/dashboard). Comprehensive details for all six system prompts are provided in Supplementary Table 1.
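A minimal sketch of how these six zero-shot conditions could be organized is shown below; the prompt strings are abridged paraphrases of the excerpts above (the full texts are in Supplementary Table 1), and query_case reuses the illustrative function from the previous sketch.

```python
# Sketch of the six zero-shot system-prompt conditions; strings are abridged paraphrases,
# not the exact prompts used in the study (see Supplementary Table 1).
SYSTEM_PROMPTS = {
    "basic": None,  # control condition: no system prompt
    "original": "You are an expert radiologist. Interpret the images and select the single best answer ...",
    "chain_of_thought": "... Must use a chain-of-thought approach: clearly outline your reasoning step by step ...",
    "reflection": "... Self-Reflection Process: after generating the initial answer, review it for accuracy ...",
    "multiagent": ("... MULTIPLE AGENT WORKFLOW ROLE: Role 1: Clinical Context Analysis ... "
                   "Role 2: Radiologic Image Analysis ... Role 3: Reflection and Chain-of-Thought Final Answer ..."),
    "ai_generated": "<optimized template produced by Anthropic's prompt generation tool>",  # placeholder
}

def run_case(case: dict, prompt_key: str) -> dict:
    """Evaluate one case under one prompt condition in an independent session."""
    return query_case(case["question_text"], case["image_paths"], SYSTEM_PROMPTS[prompt_key])
```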
Subgroup Analysis of Image-Only vs. Combined Imaging–Descriptive Text Input
To evaluate the impact of supplementary descriptive text input, cases were analyzed under two distinct conditions: (1) Imaging-Only protocol, which included radiological images with text input containing the question elements and imaging information, and (2) Combined protocol, which incorporated radiological images with text input containing imaging information, question elements, and comprehensive radiologic image descriptions drawn from both the KSUM case discussion quiz reference answer section (25 cases) and radiologist-written descriptions (42 cases) (Fig. 2). The assessment employed all six prompt engineering strategies (basic, original, chain-of-thought, reflection, multiagent, and AI-generated prompts). Model responses were obtained in standardized JSON format for each case.
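The two input conditions differ only in whether a descriptive text block is appended to the user prompt; a minimal sketch of this assembly is shown below (the dictionary keys are hypothetical placeholders for the extracted case fields).

```python
# Sketch of assembling the text portion of the user prompt under the two input conditions.
def build_user_text(case: dict, include_description: bool) -> str:
    parts = [
        case["clinical_summary"],      # e.g., "52-year-old female patient with left breast swelling"
        case["imaging_information"],   # modality, plane, acquisition parameters
        case["question"],              # primary question text
        case["choices"],               # the five multiple-choice options
    ]
    if include_description:            # combined imaging-descriptive text protocol
        parts.append(case["image_description"])  # reference-answer or radiologist-written description
    return "\n\n".join(parts)
```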
Statistical Analysis
Statistical comparisons of diagnostic accuracy among the three LLMs across six system prompts were performed using the Cochran Q test. For significant findings (P<0.1) of the Cochran Q test [27], subsequent post hoc analyses were performed using the McNemar test. For multiple comparisons, P-values were adjusted using the Bonferroni correction. The association between LLM diagnostic performance and categorical variables (tumor vs. non-tumor, rare vs. non-rare cases, difficulty levels, and knowledge cutoff date) was evaluated using the chi-square test or the Fisher exact test. To identify determinants of diagnostic accuracy across the three multimodal LLMs, multivariable logistic regression analysis was conducted using the original and AI-generated prompts, which demonstrated the highest performance. Results were expressed as odds ratios and 95% confidence intervals. Statistical significance was established at P<0.05, except for the Cochran Q test. Repeatability was evaluated with the Fleiss κ statistic. All analyses were performed using SPSS statistical software (version 27.0 for Windows, IBM Corp., Armonk, NY, USA) and MedCalc version 22.02 (MedCalc Software, Ostend, Belgium).
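The analyses were run in SPSS and MedCalc; purely for illustration, the sketch below mirrors the described tests (Cochran Q screening, Bonferroni-adjusted McNemar post hoc comparisons, Fleiss kappa for repeatability, and odds ratios from logistic regression) on simulated data using statsmodels.

```python
# Illustration of the described tests on simulated data (the study itself used SPSS and MedCalc).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.contingency_tables import cochrans_q, mcnemar
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)
correct = rng.integers(0, 2, size=(67, 6))   # hypothetical 0/1 outcomes: 67 cases x 6 prompts

# Cochran Q across the six prompt conditions (screening threshold P < 0.1).
q = cochrans_q(correct, return_object=True)
print(f"Cochran Q = {q.statistic:.2f}, P = {q.pvalue:.3f}")

# Post hoc McNemar test for one prompt pair, Bonferroni-adjusted for 15 pairwise comparisons.
a, b = correct[:, 0], correct[:, 5]
table = [[np.sum((a == 1) & (b == 1)), np.sum((a == 1) & (b == 0))],
         [np.sum((a == 0) & (b == 1)), np.sum((a == 0) & (b == 0))]]
p_adj = min(1.0, mcnemar(table, exact=True).pvalue * 15)
print(f"McNemar P (Bonferroni-adjusted) = {p_adj:.3f}")

# Repeatability of answer choices (A-E coded 0-4) across five sessions with the Fleiss kappa.
choices = rng.integers(0, 5, size=(67, 5))
counts, _ = aggregate_raters(choices)
print(f"Fleiss kappa = {fleiss_kappa(counts, method='fleiss'):.2f}")

# Multivariable logistic regression: odds ratios and 95% CIs for hypothetical case-level predictors.
X = sm.add_constant(rng.integers(0, 2, size=(67, 3)).astype(float))  # e.g., tumor, rarity, cutoff flags
fit = sm.Logit(correct[:, 1].astype(float), X).fit(disp=0)
odds_ratios, ci = np.exp(fit.params), np.exp(fit.conf_int())
```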
Results
Characteristics of the KSUM Case Discussion Dataset
After applying exclusion criteria, 67 quiz cases with radiological images were selected from an initial pool of 303 cases, with 236 cases excluded due to the absence of a multiple-choice format (Fig. 1). Each case included multiple imaging inputs (mean±standard deviation, 3.6±1.1; median [range], 4 [1-5] images per case) across various modalities. The cases spanned diverse radiological subspecialties: breast (n=9), cardiovascular (n=6), gastrointestinal (n=9), genitourinary (n=9), hepatobiliary (n=8), head and neck (n=6), musculoskeletal (n=9), pediatric (n=9), and thyroid (n=2). The mean accuracy rate from KSUM subscriber responses was 55.2%, with a median of 57.0% (range, 6.0% to 88.0%). The distribution of correct answers showed a predominance of option D (n=22), followed by option B (n=13), option C (n=11), option A (n=11), and option E (n=10). Among the study cohort, 37 cases (55.2%) were classified as tumor-related, and 15 cases (22.4%) were categorized as rare conditions. Detailed case characteristics are provided in Table 1.
Diagnostic Performance of Multimodal LLMs Using Various System Prompts
Fig. 2 illustrates a representative user prompt that includes both textual and imaging inputs for the three multimodal LLMs. The text component comprised imaging information, clinical queries, multiple-choice options, and radiologic image descriptions used for the subgroup analysis.
Diagnostic accuracy varied among the three multimodal LLMs under different prompt conditions. GPT-4o achieved 43.3% accuracy with both the original and chain-of-thought prompts, and its highest accuracy (46.3%) with the AI-generated prompt, although the differences were not statistically significant (P=0.765). Gemini-1.5-Pro-002 performed best with the original and reflection prompts, but this did not reach statistical significance (P=0.635). Claude 3.5 Sonnet showed the lowest accuracy (41.8%) with the reflection prompt and the highest accuracy (53.7%) with the AI-generated prompt, a significant difference of 11.9% (P=0.039). In the combined analysis of all three models, the AI-generated prompt achieved the highest overall accuracy (46.3%), exceeding the basic (40.8%; difference, 5.5%; P=0.035), chain-of-thought (42.3%; difference, 4.0%; P=0.169), and multiagent prompts (42.8%; difference, 3.5%; P=0.248). None of the three LLMs exceeded the human accuracy benchmark of 55.2% (Fig. 3). Repeatability was high across all LLMs, with Fleiss kappa values of 0.79 for GPT-4o, 0.82 for Gemini-1.5-Pro-002, and 0.82 for Claude 3.5 Sonnet (Supplementary Table 2).
Comparative analysis across system prompts revealed inter-model differences in diagnostic accuracy with AI-generated prompts that met the prespecified screening threshold (P=0.096). Post hoc analysis demonstrated that Claude 3.5 Sonnet outperformed Gemini-1.5-Pro-002 (53.7%, 36/67 vs. 38.8%, 26/67; P=0.041), although no significant differences were observed in the other pairwise comparisons. In the combined analysis of all six prompts, Claude 3.5 Sonnet achieved significantly higher accuracy (46.3%, 186/402) than Gemini-1.5-Pro-002 (39.8%, 160/402; P=0.014) (Table 2).
Performance Analysis under Different Input Factors
Table 3 presents the diagnostic accuracy of all three LLMs using the original and AI-generated prompts, stratified by tumor versus non-tumor status, human accuracy rate, case rarity, and knowledge cutoff dates.
In the classification analysis (tumor vs. non-tumor), no significant performance differences were observed across the LLMs with either prompt type. All models demonstrated higher accuracy in cases with higher human accuracy rates. Regarding case rarity, GPT-4o with the original prompt showed significantly higher accuracy in non-rare cases (53.9%, 28/52) than in rare cases (6.7%, 1/15; P=0.001). This disparity was mitigated with AI-generated prompts (rare: 40.0%, 6/15; P=0.796). Claude 3.5 Sonnet maintained significantly higher accuracy in non-rare cases (61.5%, 32/52) than in rare cases (26.7%, 4/15) when using AI-generated prompts (P=0.021).
Regarding knowledge cutoff dates, Claude 3.5 Sonnet demonstrated significantly higher accuracy for pre-cutoff cases (original prompt: 56.0%, 28/50; AI-generated prompt: 64.0%, 32/50) compared to post-cutoff cases (23.5%, 4/17 for both prompts; P=0.026 and P=0.005, respectively). GPT-4o and Gemini-1.5-Pro-002 exhibited no significant temporal variations in performance.
Multivariable logistic regression analysis (Supplementary Table 3) identified the human accuracy rate as a consistent predictor of diagnostic performance across all LLMs. With original prompts, higher human accuracy rates were significantly associated with correct responses for GPT-4o (odds ratio [OR], 5.8 [95% confidence interval (CI), 1.5 to 22.4]), Gemini-1.5-Pro-002 (OR, 6.4 [95% CI, 1.8 to 22.6]), and Claude 3.5 Sonnet (OR, 5.9 [95% CI, 1.6 to 22.2]). Similar patterns were observed with AI-generated prompts: GPT-4o (OR, 4.3 [95% CI, 1.4 to 13.5]), Gemini-1.5-Pro-002 (OR, 3.7 [95% CI, 1.1 to 12.2]), and Claude 3.5 Sonnet (OR, 6.4 [95% CI, 1.6 to 26.6]).
Additional significant determinants included case rarity, which notably reduced diagnostic accuracy for GPT-4o with the original prompt (OR, 0.1 [95% CI, 0.0 to 0.6]). The knowledge cutoff date emerged as a significant factor specifically for Claude 3.5 Sonnet when using the AI-generated prompt (OR, 0.2 [95% CI, 0.1 to 0.9]).
Subgroup Analysis of Image-Only vs. Combined Imaging–Descriptive Text Input
Fig. 4 displays the performance variations between imaging-only and combined imaging-descriptive text inputs. The human accuracy benchmark was 55.2%. GPT-4o showed significant improvements with combined inputs using the basic prompt (58.2% vs. 40.3%, P=0.002), chain-of-thought prompt (61.2% vs. 43.3%, P=0.004), multiagent prompt (56.7% vs. 43.3%, P=0.031), and AI-generated prompt (58.2% vs. 46.3%, P=0.049). Gemini-1.5-Pro-002 demonstrated significant improvements across all prompt types, with an overall accuracy of 60.4% compared to 39.8% (P<0.001), and the most substantial improvement was observed with the AI-generated prompt (62.7% vs. 38.8%, P<0.001). Claude 3.5 Sonnet achieved the highest overall performance with combined inputs (66.2% vs. 46.3%, P<0.001), particularly with the AI-generated prompt (71.6% vs. 53.7%, P=0.004), and showed significant improvements across all prompt types. Overall, the addition of descriptive text significantly improved diagnostic accuracy across all models and prompt types (61.4% vs. 43.2%, P<0.001).
Furthermore, the improvement in diagnostic accuracy with descriptive text was significant across all prompt strategies: basic (20.9% improvement, P<0.001), original (16.9%, P<0.001), chain-of-thought (18.9%, P<0.001), reflection (16.4%, P<0.001), multiagent (17.9%, P<0.001), and AI-generated prompts (17.9%, P<0.001) (Table 4). Detailed performance metrics for descriptive text inputs are provided in Supplementary Table 4.
Discussion
This study evaluated the diagnostic performance of three multimodal LLMs (Claude 3.5 Sonnet, GPT-4o, and Gemini-1.5-Pro-002) in interpreting radiological cases from the KSUM case discussion repository. Among all tested models and prompt engineering strategies, Claude 3.5 Sonnet with AI-generated prompts achieved the highest numerical accuracy. Nevertheless, the performance of all models remained below human diagnostic benchmarks, highlighting the current limitations of autonomous radiological image interpretation by LLMs. The integration of descriptive text inputs led to significant improvements in diagnostic accuracy across most scenarios. These findings underscore the critical role of contextual information in enhancing interpretative accuracy [28,29].
Multiple factors influenced diagnostic accuracy, including the prompt engineering methodology, various input factors, and model pretraining characteristics [29-31]. Original prompts yielded superior performance compared to basic prompts, aligning with previous research [30,32]. However, chain-of-thought prompts produced similar accuracy to original prompts. GPT-4o, known for its capabilities in specialized scientific reasoning [33], along with the other evaluated LLMs, naturally employed a step-by-step approach in their diagnostic processes during the conversational testing. This suggests that explicit chain-of-thought instructions in system prompts may be redundant [26], as these advanced models have already integrated analytical reasoning into their base responses [34].
Analysis of reflection prompts revealed distinct behavioral patterns among models. While GPT-4o and Gemini-1.5-Pro-002 exhibited minimal answer revision, Claude 3.5 Sonnet demonstrated a 17.9% revision rate. However, its revised responses showed decreased accuracy (41.8%, 28/67) compared to the initial answers (49.3%, 33/67). Only two cases improved from incorrect to correct answers, whereas seven cases changed from correct to incorrect following revisions, typically due to recalibration between general and atypical scenario interpretations. Notably, with the integration of descriptive text, the revised accuracy remained essentially unchanged (64.2%, 43/67), with two cases improving and two cases worsening. This pattern suggests that diagnostic uncertainty in imaging analysis may stem from insufficient pretraining on radiological images relative to textual data [35]. Moreover, the implementation of multiagent prompts did not replicate previously reported performance enhancements, possibly due to lengthy prompt structures (293 words) and limitations imposed by single-session chat and single-system prompts, which may have constrained effective agent collaboration [25,26,32]. This finding indicates opportunities for future research and development in optimizing multiagent architectures.
The AI-generated prompt, developed as an enhancement of the original prompt structure, consistently demonstrated superior performance despite its substantial length (295 words), indicating a reduced dependence on human-engineered prompts. This effect was particularly pronounced with Claude 3.5 Sonnet, which exhibited enhanced performance with Anthropic-developed technology. This suggests that prompts generated using the same underlying architecture may be more effective for that specific model. Furthermore, as shown in Supplementary Table 5, in the analysis excluding AI-generated prompts, performance differences between models were not statistically significant for imaging-only input, although Claude maintained numerically higher accuracy (44.8%). This finding implies that the superior performance initially observed may be partially attributed to the alignment between Claude’s architecture and its AI-generated prompts, highlighting the importance of platform-specific prompt optimization strategies.
Regarding case rarity, GPT-4o initially performed poorly on rare cases with the original prompt (6.7%, 1/15) compared to non-rare cases (53.9%, 28/52; P=0.001). However, the implementation of AI-generated prompts substantially improved rare case performance (40.0%, 6/15), making it comparable to non-rare cases (48.1%, 25/52; P=0.796). Additionally, the multivariable analysis revealed that while the human accuracy rate was consistently positively associated with diagnostic accuracy (OR, 4.3 to 6.4; P<0.05), the negative impact of case rarity (OR, 0.1; P=0.014) was effectively mitigated through AI-generated prompts (OR, 1.0; P=0.984). This suggests that although the model's pretraining and supervised fine-tuning data likely emphasized typical cases, optimized prompt engineering can help overcome this limitation. For future development, post-training strategies should deliberately incorporate more atypical and rare cases to improve model performance across a broader spectrum of radiologic presentations.
Temporal analysis revealed varying patterns in model performance relative to knowledge cutoff dates. GPT-4o and Gemini-1.5-Pro-002 showed no significant differences between pre- and post-cutoff cases. In contrast, Claude 3.5 Sonnet demonstrated significantly higher accuracy for pre-cutoff cases (64.0%, 32/50) compared to post-cutoff cases (23.5%, 4/17; P=0.005). However, this finding should be interpreted cautiously due to the inherent complexity in determining effective knowledge cutoffs. Recent research has suggested that reported cutoff dates may not accurately reflect the temporal alignment of various resources within LLM training data, implying that the actual temporal boundaries affecting model performance may differ from the reported dates [36]. These limitations, combined with proprietary restrictions on accessing detailed pretraining data, underscore the need for more comprehensive investigations into the temporal effects on LLM performance in medical imaging applications.
The present study demonstrated that adding descriptive text inputs improved model performance, in line with previous studies of radiological image analysis [37]. These findings confirm that radiologic image descriptions and medical history are strong contributors to LLM performance in imaging analysis. The quality and specificity of these text descriptions may also influence model performance. Future studies should use standardized methods to investigate how different levels of text input quality (imaging descriptions, medical history, and structure) affect model accuracy, providing insights for optimizing multimodal LLM applications in radiology.
This study had several limitations. First, the relatively small sample size may limit the generalizability of the findings and may have affected the ability to detect statistically significant differences between individual models after multiple comparison corrections. Future studies with larger datasets across various radiological conditions would be valuable to validate the findings and potentially identify additional patterns in model performance. Second, the analysis did not include a detailed evaluation of the LLMs’ reasoning processes behind their answer selections, making it difficult to determine whether correct responses resulted from genuine understanding or mere pattern recognition. Third, the study’s reliance on multiple-choice questions for performance evaluation may not fully represent real-world clinical scenarios that typically require free-text responses. Future studies should incorporate free-text radiological interpretations and reporting that more closely reflect clinical practice. Fourth, methodological constraints prevented the evaluation of text-only inputs (description-only) due to variations in user and system prompts, which would have compromised direct comparisons. Fifth, human performance metrics from the KSUM website may be subject to reporting bias, as they only reflect responses from users who voluntarily submitted answers online, potentially affecting the representativeness of the comparative analysis. Lastly, due to the lack of open technologies for prompt generation across different platforms, it was not possible to test how various model architectures might respond to platform-specific prompt optimization techniques. Future studies should explore this aspect to better understand the relationship between model architecture and prompt engineering effectiveness.
Claude 3.5 Sonnet, when utilizing AI-generated prompts, demonstrated the highest diagnostic accuracy among the evaluated multimodal LLMs, although it did not reach human performance benchmarks. Consequently, autonomous radiological image interpretation by LLMs remains limited for direct clinical implementation. Given the significant enhancement in performance achieved through the integration of descriptive text inputs, combining radiologist-generated descriptive content with LLM analysis holds potential as a supportive diagnostic tool.
Notes
Author Contributions
Conceptualization: Han T, Jeong WK, Shin J. Data acquisition: Han T, Jeong WK. Data analysis or interpretation: Han T, Jeong WK. Drafting of the manuscript: Han T. Critical revision of the manuscript: Han T, Jeong WK, Shin J. Approval of the final version of the manuscript: all authors.
Supplementary Material
Supplementary Table 1. System prompts used in the study (https://doi.org/10.14366/usg.25012).
Supplementary Table 2. Analysis of model response repeatability using Fleiss' kappa statistics across three large language models (https://doi.org/10.14366/usg.25012).
Supplementary Table 3. Multivariable logistic regression analysis of factors affecting diagnostic accuracy across different models using original and AI-generated system prompts (https://doi.org/10.14366/usg.25012).
Supplementary Table 4. Comparison of diagnostic accuracy among multimodal LLMs using different prompt types with combined imaging-descriptive text input (https://doi.org/10.14366/usg.25012).
Supplementary Table 5. Comparison of diagnostic accuracy among multimodal LLMs using all prompts excluding the AI-generated prompt with imaging-only and combined imaging-descriptive text input (https://doi.org/10.14366/usg.25012).
References
1. Tian S, Jin Q, Yeganova L, Lai PT, Zhu Q, Chen X, et al. Opportunities and challenges for ChatGPT and large language models in biomedicine and health. Brief Bioinform 2023;25:bbad493.
2. Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I. Language models are unsupervised multitask learners [Internet]. OpenAI Blog, 2019 [cited 2024 Dec 10]. Available from: https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.
3. Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, Aleman FL, et al. GPT-4 technical report. Preprint arXiv at: https://doi.org/10.48550/arXiv.2303.08774 (2023).
4. Tang L, Sun Z, Idnay B, Nestor JG, Soroush A, Elias PA, et al. Evaluating large language models on medical evidence summarization. NPJ Digit Med 2023;6:158.
5. Jin Q, Leaman R, Lu Z. Retrieve, summarize, and verify: how will ChatGPT affect information seeking from the medical literature? J Am Soc Nephrol 2023;34:1302–1304.
6. Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, et al. Large language models encode clinical knowledge. Nature 2023;620:172–180.
7. Nori H, King N, McKinney SM, Carignan D, Horvitz E. Capabilities of GPT-4 on medical challenge problems. Preprint arXiv at: https://doi.org/10.48550/arXiv.2303.13375 (2023).
8. Lievin V, Hother CE, Motzfeldt AG, Winther O. Can large language models reason about medical questions? Patterns (N Y) 2024;5:100943.
9. Nori H, Lee YT, Zhang S, Carignan D, Edgar R, Fusi N, et al. Can generalist foundation models outcompete special-purpose tuning? Case study in medicine. Preprint arXiv at: https://doi.org/10.48550/arXiv.2311.16452 (2023).
10. Lake BM, Ullman TD, Tenenbaum JB, Gershman SJ. Building machines that learn and think like people. Behav Brain Sci 2017;40:e253.
11. Wu T, He S, Liu J, Sun S, Liu K, Han QL, et al. A brief overview of ChatGPT: the history, status quo and potential future development. IEEE/CAA J Automat Sin 2023;10:1122–1136.
12. Zhou Y, Ong H, Kennedy P, Wu CC, Kazam J, Hentel K, et al. Evaluating GPT-V4 (GPT-4 with vision) on detection of radiologic findings on chest radiographs. Radiology 2024;311:e233270.
13. Sun Z, Ong H, Kennedy P, Tang L, Chen S, Elias J, et al. Evaluating GPT4 on impressions generation in radiology reports. Radiology 2023;307:e231259.
14. Yan Z, Zhang K, Zhou R, He L, Li X, Sun L. Multimodal ChatGPT for medical applications: an experimental study of GPT-4V. Preprint arXiv at: https://doi.org/10.48550/arXiv.2310.19061 (2023).
15. Su X, Wang Y, Gao S, Liu X, Giunchiglia V, Clevert DA, et al. KGARevion: an AI agent for knowledge-intensive biomedical QA. Preprint arXiv at: https://doi.org/10.48550/arXiv.2410.04660 (2024).
16. Kitamura FC, Topol EJ. The initial steps of multimodal AI in radiology. Radiology 2023;309:e232372.
17. Fink MA, Bischoff A, Fink CA, Moll M, Kroschke J, Dulz L, et al. Potential of ChatGPT and GPT-4 for data mining of free-text CT reports on lung cancer. Radiology 2023;308:e231362.
18. Adams LC, Truhn D, Busch F, Kader A, Niehues SM, Makowski MR, et al. Leveraging GPT-4 for post hoc transformation of free-text radiology reports into structured reporting: a multilingual feasibility study. Radiology 2023;307:e230725.
19. Lehnen NC, Dorn F, Wiest IC, Zimmermann H, Radbruch A, Kather JN, et al. Data extraction from free-text reports on mechanical thrombectomy in acute ischemic stroke using ChatGPT: a retrospective analysis. Radiology 2024;311:e232741.
20. Chen D, Huang RS, Jomy J, Wong P, Yan M, Croke J, et al. Performance of multimodal artificial intelligence chatbots evaluated on clinical oncology cases. JAMA Netw Open 2024;7:e2437711.
21. Suh PS, Shim WH, Suh CH, Heo H, Park KJ, Kim PH, et al. Comparing large language model and human reader accuracy with New England Journal of Medicine image challenge case image inputs. Radiology 2024;313:e241668.
22. Morishita M, Fukuda H, Yamaguchi S, Muraoka K, Nakamura T, Hayashi M, et al. An exploratory assessment of GPT-4o and GPT-4 performance on the Japanese National Dental Examination. Saudi Dent J 2024;36:1577–1581.
23. Park SH, Suh CH, Lee JH, Kahn CE, Moy L. Minimum reporting items for clear evaluation of accuracy reports of large language models in healthcare (MI-CLEAR-LLM). Korean J Radiol 2024;25:865–868.
24. Suh PS, Shim WH, Suh CH, Heo H, Park CR, Eom HJ, et al. Comparing diagnostic accuracy of radiologists versus GPT-4V and Gemini Pro Vision using image inputs from Diagnosis Please cases. Radiology 2024;312:e240273.
25. Li Y, Zhang S, Wu R, Huang X, Chen Y, Xu W, et al. MATEval: a multi-agent discussion framework for advancing open-ended text evaluation. In: Database systems for advanced applications. DASFAA 2024. Lecture notes in computer science, Vol. 14856. Singapore: Springer, 2024;415-426.
26. Lee JH, Shin J. How to optimize prompting for large language models in clinical research. Korean J Radiol 2024;25:869–873.
27. Higgins JP, Green S. Cochrane handbook for systematic reviews of interventions. Chichester: Cochrane Collaboration and John Wiley & Sons Ltd., 2008.
28. Gunes YC, Cesur T, Camur E, Gunbey Karabekmez L. Evaluating text and visual diagnostic capabilities of large language models on questions related to the Breast Imaging Reporting and Data System Atlas 5th edition. Diagn Interv Radiol 2025;31:111–129.
29. Mukherjee P, Hou B, Suri A, Zhuang Y, Parnell C, Lee N, et al. Evaluation of GPT large language model performance on RSNA 2023 case of the day questions. Radiology 2024;313:e240609.
30. Cesur T, Gunes YC. Optimizing diagnostic performance of ChatGPT: the impact of prompt engineering on thoracic radiology cases. Cureus 2024;16:e60009.
31. Schramm S, Preis S, Metz MC, Jung K, Schmitz-Koep B, Zimmer C, et al. Impact of multimodal prompt elements on diagnostic performance of GPT-4(V) in challenging brain MRI cases. Preprint medRxiv at: https://doi.org/10.1101/2024.03.05.24303767 (2024).
32. Hayden N, Gilbert S, Poisson LM, Griffith B, Klochko C. Performance of GPT-4 with vision on text- and image-based ACR diagnostic radiology in-training examination questions. Radiology 2024;312:e240153.
33. Hurst A, Lerer A, Goucher AP, Perelman A, Ramesh A, Clark A, et al. GPT-4o system card. Preprint arXiv at: https://doi.org/10.48550/arXiv.2410.21276 (2024).
34. Shahriar S, Lund BD, Mannuru NR, Arshad MA, Hayawi K, Bevara RV, et al. Putting GPT-4o to the sword: a comprehensive evaluation of language, vision, speech, and multimodal proficiency. Appl Sci 2024;14:7782.
35. Agbareia R, Omar M, Soffer S, Glicksberg BS, Nadkarni GN, Klang E. Visual-textual integration in LLMs for medical diagnosis: a quantitative analysis. Preprint medRxiv at: https://doi.org/10.1101/2024.08.31.24312878 (2024).
36. Cheng J, Marone M, Weller O, Lawrie D, Khashabi D, Van Durme B. Dated data: tracing knowledge cutoffs in large language models. Preprint arXiv at: https://doi.org/10.48550/arXiv.2403.12958 (2024).
Fig. 1. Workflow of the selection and analysis of study cases. Flowchart shows the selection process of quiz cases from the Korean Society of Ultrasound in Medicine database and subsequent analysis methodology. AI, artificial intelligence.
Fig. 2. Representative example of multimodal input format. The standardized input format combining radiological images with structured text components (question, imaging information, and descriptive information) used for model evaluation is demonstrated.
Fig. 3. Diagnostic accuracy across different prompt engineering strategies. These graphs show the diagnostic accuracy of three multimodal large language models (A, GPT-4o; B, Gemini-1.5-Pro; C, Claude 3.5 Sonnet; D, total) using six distinct prompt engineering approaches. The horizontal line indicates the human performance benchmark (55.2%). AI, artificial intelligence. *P<0.05.
Fig. 4. Impact of descriptive text integration on model performance. These graphs show a side-by-side comparison of diagnostic accuracy between imaging-only input and combined imaging and descriptive text inputs across different prompt engineering strategies for each multimodal large language model (A, GPT-4o; B, Gemini-1.5-Pro; C, Claude 3.5 Sonnet; D, total). The horizontal line indicates the human performance benchmark (55.2%). AI, artificial intelligence. *P<0.05, **P<0.01.
Table 1. Characteristics of the case discussion quiz from the Korean Society of Ultrasound in Medicine
Table 2. Comparison of diagnostic accuracy among multimodal LLMs across different system prompt types
Values are presented as percentage (number/total). LLM, large language model; AI, artificial intelligence.
Table 3. Diagnostic accuracy of multimodal LLMs under different input factors
Table 4. Comparison of diagnostic accuracy between image-only and combined imaging-descriptive text input across different system prompt types