Han, Jeong, and Shin: Diagnostic performance of multimodal large language models in radiological quiz cases: the effects of prompt engineering and input conditions

Abstract

Purpose

This study aimed to evaluate the diagnostic accuracy of three multimodal large language models (LLMs) in radiological image interpretation and to assess the impact of prompt engineering strategies and input conditions.

Methods

This study analyzed 67 radiological quiz cases from the Korean Society of Ultrasound in Medicine. Three multimodal LLMs (Claude 3.5 Sonnet, GPT-4o, and Gemini-1.5-Pro-002) were evaluated using six types of prompts (basic [without system prompt], original [specific instructions], chain-of-thought, reflection, multiagent, and artificial intelligence [AI]–generated). Performance was assessed across various factors, including tumor versus non-tumor status, case rarity, difficulty, and knowledge cutoff dates. A subgroup analysis compared diagnostic accuracy between imaging-only inputs and combined imaging-descriptive text inputs.

Results

With imaging-only inputs, Claude 3.5 Sonnet achieved the highest overall accuracy (46.3%, 186/402), followed by GPT-4o (43.5%, 175/402) and Gemini-1.5-Pro-002 (39.8%, 160/402). AI-generated prompts yielded superior combined accuracy across all three models, with improvements over the basic (5.5%, P=0.035), chain-of-thought (4.0%, P=0.169), and multiagent prompts (3.5%, P=0.248). The integration of descriptive text significantly enhanced diagnostic accuracy for Claude 3.5 Sonnet (46.3% to 66.2%, P<0.001), GPT-4o (43.5% to 57.5%, P<0.001), and Gemini-1.5-Pro-002 (39.8% to 60.4%, P<0.001). Model performance was significantly influenced by case rarity for GPT-4o (rare: 6.7% vs. non-rare: 53.9%, P=0.001) and by knowledge cutoff dates for Claude 3.5 Sonnet (post-cutoff: 23.5% vs. pre-cutoff: 64.0%, P=0.005).

Conclusion

Claude 3.5 Sonnet achieved the highest diagnostic accuracy in radiological quiz cases, followed by GPT-4o and Gemini-1.5-Pro-002. The use of AI-generated prompts and the integration of descriptive text inputs enhanced model performance.

Introduction

The emergence of large language models (LLMs) has marked a significant advancement in artificial intelligence and generated considerable interest due to their potential to transform medical practice [1,2]. LLMs, such as generative pretrained transformers (GPT) like ChatGPT (OpenAI), exhibit exceptional capabilities in natural language processing tasks, including clinical question answering, text summarization, and contextual analysis [3-9]. These models are trained on comprehensive datasets that incorporate scientific literature, medical publications, and diverse digital resources across multiple disciplines and languages [2,10,11].
Recent developments have introduced multimodal LLMs that extend beyond text analysis to include the interpretation of audio, visual, and video data [12,13]. These advanced systems have shown promising results in various medical applications, including diagnostic assessment, clinical documentation, and disease identification, achieving these outcomes without additional medical-specific training [12-16].
In radiology, LLMs have been successfully applied to data mining and structured reporting [17-19]. Recent technological progress has led to enhanced multimodal LLMs, including OpenAI's GPT, Anthropic's Claude, and Google's Gemini [20]. These updated versions demonstrate improved diagnostic accuracy compared to their previous iterations, underscoring the importance of continuous model development [21,22]. However, as new systems continue to emerge, a systematic evaluation of their performance and clinical utility remains essential to ensure proper implementation and to minimize potential risks of misuse. Notably, a previous study using New England Journal of Medicine Image Challenge cases suggested that LLMs could provide correct answers even without image inputs, and that their performance was influenced more by text input length than by image interpretation [21].
Therefore, a study was designed to examine the radiological imaging interpretation capabilities of three widely used multimodal LLMs using cases with multiple imaging inputs while minimizing clinical text information. The study utilized image-based diagnostic challenges with multiple-choice questions from the publicly available educational repository of the Korean Society of Ultrasound in Medicine (KSUM), which provides bi-monthly content to subscribers. Additionally, this study examined various factors affecting model performance, including the effects of prompt engineering, question types, case rarity, difficulty levels, and the knowledge cutoff dates of the LLMs.

Materials and Methods

Compliance with Ethical Standards

This study utilized publicly accessible educational datasets, for which institutional review board approval and informed consent were not required. The study was conducted in accordance with the MI-CLEAR-LLM guidelines for reporting research involving LLMs in medical imaging [23].

Data Collection

A total of 303 case discussion quizzes posted on the KSUM digital platform (https://www.ultrasound.or.kr/) between July 28, 2000, and November 25, 2024, were initially considered. To ensure a standardized assessment of diagnostic accuracy, 236 cases that did not use a multiple-choice format were excluded, leaving 67 quiz cases for final inclusion in this study (Fig. 1). A radiologist (T.H., with 3 years of experience in radiology) systematically extracted data, including imaging data, question content with multiple-choice options, imaging information, and reference answer descriptions. To focus on image analysis, the clinical information for all cases was standardized to include only patient demographics (age and sex) and the chief complaint (e.g., "52-year-old female patient with left breast swelling"). For the 42 cases that lacked image descriptions in the reference data, a radiologist (W.K.J., with 26 years of experience in radiology) composed case image descriptions while remaining blinded to the reference answers. Cases were classified by subspecialty, diagnostic category, and rarity based on the KSUM digital platform case discussion interface. Human performance metrics were established using KSUM subscriber response statistics, with difficulty levels stratified into quartiles based on correct response rates.
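For illustration only, the difficulty stratification described above could be reproduced with a short Python sketch such as the one below (the pandas library is assumed; the column names and example values are hypothetical, not the study data).

import pandas as pd

# Hypothetical case table; "correct_rate" is the KSUM subscriber correct-response rate (%).
cases = pd.DataFrame({"case_id": [101, 102, 103, 104],
                      "correct_rate": [18.0, 42.0, 57.0, 81.0]})

# Bin the rates at the 25/50/75 cut points, matching the levels reported in Table 1.
cases["difficulty_level"] = pd.cut(
    cases["correct_rate"],
    bins=[0, 25, 50, 75, 100],
    labels=["<25", "25-49", "50-75", ">75"],
    right=False,  # left-inclusive bins, so 25.0 falls into "25-49"
)
print(cases)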

Multimodal LLM Analysis

Three multimodal LLMs were evaluated: (1) GPT-4o (Alias: 2024-11-20) (knowledge cutoff: October 2023; OpenAI, San Francisco, CA, USA), (2) Claude 3.5 Sonnet (Alias: 2024-06-20) (knowledge cutoff: April 2024; Anthropic, San Francisco, CA, USA), and (3) Gemini-1.5-Pro-002 (Alias: 2024-09-24) (knowledge cutoff: September 2024; Google, Mountain View, CA, USA). Application programming interfaces were used to access each model between December 1 and 24, 2024. Generation parameters were standardized with a temperature setting of 1.0, which previously demonstrated the highest accuracy [24]. Independent sessions were conducted for each case to avoid sequential bias. Performance evaluation included comparisons based on pre- and post-knowledge cutoff dates and assessments of accuracy across various factors (tumor versus non-tumor, rare versus non-rare cases, and difficulty levels). Accuracy was measured using responses from the first attempt, with JSON-formatted textual outputs obtained for analysis. To evaluate repeatability, the answering process was repeated across five distinct sessions.
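As an illustration only, the following minimal Python sketch shows how a single case might be submitted through the OpenAI Python SDK with the settings described above (temperature 1.0, JSON-formatted output, and a fresh message list per case as an independent session); the file path, prompt wording, and JSON schema are hypothetical, and analogous calls would be made to the Anthropic and Google APIs.

import base64
import json
from openai import OpenAI

client = OpenAI()  # API key read from the OPENAI_API_KEY environment variable

def encode_image(path: str) -> str:
    # Base64-encode a case image so it can be passed inline as a data URL.
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

question_text = (
    "52-year-old female patient with left breast swelling. "
    "Imaging information: ultrasonography, transverse plane. "
    "Question: What is the most likely diagnosis? Options: A) ... E) ..."
)  # hypothetical example mirroring the standardized user prompt

response = client.chat.completions.create(
    model="gpt-4o-2024-11-20",                 # alias used in this study
    temperature=1.0,                           # generation setting reported above
    response_format={"type": "json_object"},   # request JSON-formatted output
    messages=[                                 # fresh message list per case = independent session
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": question_text + ' Respond in JSON as {"answer": "<A-E>"}.'},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{encode_image('case_image_1.jpg')}"}},
            ],
        }
    ],
)
answer = json.loads(response.choices[0].message.content)["answer"]
print(answer)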

Prompt Engineering Protocol

The experimental protocol incorporated user prompts that consisted of structured question text (including the primary question and five multiple-choice options), imaging details (modality, plane, and acquisition parameters), and radiological images extracted from the KSUM case discussion database, without any supplementary instructions. To assess the influence of prompt engineering on diagnostic performance across the three multimodal LLMs, six distinct zero-shot system prompts were implemented based on previous studies [25,26]: (1) Basic prompt: The control condition without a system prompt. (2) Original prompt: Contained specific instructions for radiological interpretation and diagnostic assessment. (3) Chain-of-thought prompt: Included the instruction, "…Must use a chain-of-thought approach: clearly outline your reasoning step by step…". (4) Reflection prompt: Contained the directive, "…Self-Reflection Process: To ensure accuracy and comprehensiveness, engage in a self-reflection process after generating the initial answer…". (5) Multiagent prompt: Employed a multiagent workflow with instructions such as, "…MULTIPLE AGENT WORKFLOW ROLE: …Role 1: Clinical Context Analysis… Role 2: Radiologic Image Analysis…Role 3: Reflection and Chain-of-Thought Final Answer…". (6) Artificial intelligence (AI)–generated prompt: Utilized Claude’s prompt generation tool to create optimized prompt templates for specialized diagnostic tasks (https://console.anthropic.com/dashboard). Comprehensive details for all six system prompts are provided in Supplementary Table 1.
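The way the six system-prompt conditions were applied can be sketched as follows (the prompt texts are abbreviated here; the full versions are in Supplementary Table 1). The send_case() helper stands in for the vendor-specific API call illustrated earlier and is hypothetical.

# Abbreviated stand-ins for the six zero-shot system prompts (full texts: Supplementary Table 1).
SYSTEM_PROMPTS = {
    "basic": None,  # control condition: no system prompt
    "original": "You are a radiologist. Interpret the images and select the single best answer...",
    "chain_of_thought": "...Must use a chain-of-thought approach: clearly outline your reasoning step by step...",
    "reflection": "...Self-Reflection Process: after generating the initial answer, review and revise it...",
    "multiagent": "...MULTIPLE AGENT WORKFLOW ROLE: Role 1: Clinical Context Analysis; "
                  "Role 2: Radiologic Image Analysis; Role 3: Reflection and Chain-of-Thought Final Answer...",
    "ai_generated": "<template produced by Claude's prompt generation tool>",
}

def run_all_prompts(case, send_case):
    # Submit the same user prompt (question, imaging information, images) under each
    # system-prompt condition; each call is treated as an independent session.
    results = {}
    for name, system_prompt in SYSTEM_PROMPTS.items():
        results[name] = send_case(case, system_prompt=system_prompt)
    return results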

Subgroup Analysis of Image-Only vs. Combined Imaging–Descriptive Text Input

To evaluate the impact of supplementary descriptive text input, cases were analyzed under two distinct conditions: (1) Imaging-Only protocol, which included radiological images with text input containing the question elements and imaging information, and (2) Combined protocol, which incorporated radiological images with text input containing imaging information, question elements, and comprehensive radiologic image descriptions drawn from both the KSUM case discussion quiz reference answer section (25 cases) and radiologist-written descriptions (42 cases) (Fig. 2). The assessment employed all six prompt engineering strategies (basic, original, chain-of-thought, reflection, multiagent, and AI-generated prompts). Model responses were obtained in standardized JSON format for each case.
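A simple way to express the two input conditions is sketched below; the field names are hypothetical, and the description field corresponds to the reference-answer or radiologist-written image descriptions used in the Combined protocol.

def build_user_text(case: dict, include_description: bool) -> str:
    # Imaging-Only protocol: question elements plus imaging information only.
    parts = [
        case["clinical_info"],   # e.g., "52-year-old female patient with left breast swelling"
        case["imaging_info"],    # modality, plane, and acquisition parameters
        case["question"],        # question stem with five multiple-choice options
    ]
    # Combined protocol: additionally append the radiologic image description.
    if include_description:
        parts.append(case["image_description"])
    return "\n".join(parts)

# Usage: the same case is evaluated twice, once per condition.
# imaging_only_text = build_user_text(case, include_description=False)
# combined_text = build_user_text(case, include_description=True)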

Statistical Analysis

Statistical comparisons of diagnostic accuracy among the three LLMs across the six system prompts were performed using Cochran's Q test. For significant findings of Cochran's Q test (P<0.1) [27], subsequent post hoc analyses were performed using the McNemar test, with P-values adjusted for multiple comparisons using the Bonferroni correction. The association between LLM diagnostic performance and categorical variables (tumor vs. non-tumor, rare vs. non-rare cases, difficulty levels, and knowledge cutoff date) was evaluated using the chi-square test or the Fisher exact test. To identify determinants of diagnostic accuracy across the three multimodal LLMs, multivariable logistic regression analysis was conducted using the original and AI-generated prompts, which demonstrated the highest performance. Results were expressed as odds ratios and 95% confidence intervals. Statistical significance was established at P<0.05, except for Cochran's Q test. Repeatability was evaluated with the Fleiss κ statistic. All analyses were performed using SPSS statistical software (version 27.0 for Windows, IBM Corp., Armonk, NY, USA) and MedCalc version 22.02 (MedCalc Software, Ostend, Belgium).
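The analyses were run in SPSS and MedCalc; purely as a point of reference, an equivalent open-source workflow could look like the following Python sketch using statsmodels (the accuracy arrays are hypothetical 0/1 vectors, one element per case, not the study data).

import numpy as np
from statsmodels.stats.contingency_tables import cochrans_q, mcnemar
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
# Hypothetical per-case correctness (1 = correct) for the three models under one prompt.
gpt4o, gemini, claude = (rng.integers(0, 2, 67) for _ in range(3))

# Cochran's Q test across the three paired binary outcomes (significance threshold P<0.1).
q_res = cochrans_q(np.column_stack([gpt4o, gemini, claude]))
print("Cochran's Q:", q_res.statistic, "P =", q_res.pvalue)

# Post hoc pairwise McNemar tests with Bonferroni correction.
def mcnemar_p(a, b):
    table = [[np.sum((a == 1) & (b == 1)), np.sum((a == 1) & (b == 0))],
             [np.sum((a == 0) & (b == 1)), np.sum((a == 0) & (b == 0))]]
    return mcnemar(table, exact=True).pvalue

raw_p = [mcnemar_p(gpt4o, gemini), mcnemar_p(gpt4o, claude), mcnemar_p(gemini, claude)]
print("Bonferroni-adjusted P:", multipletests(raw_p, method="bonferroni")[1])

# Repeatability across five repeated sessions for one model (Fleiss kappa);
# answers are coded here as correct/incorrect for brevity.
five_sessions = rng.integers(0, 2, size=(67, 5))
counts, _ = aggregate_raters(five_sessions)
print("Fleiss kappa:", fleiss_kappa(counts))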

Results

Characteristics of KSUM Case Discussion Dataset

After applying exclusion criteria, 67 quiz cases with radiological images were selected from an initial pool of 303 cases, with 236 cases excluded due to the absence of a multiple-choice format (Fig. 1). Each case included multiple imaging inputs (mean±standard deviation, 3.6±1.1; median [range], 4 [1-5] images per case) across various modalities. The cases spanned diverse radiological subspecialties: breast (n=9), cardiovascular (n=6), gastrointestinal (n=9), genitourinary (n=9), hepatobiliary (n=8), head and neck (n=6), musculoskeletal (n=9), pediatric (n=9), and thyroid (n=2). The mean accuracy rate from KSUM subscriber responses was 55.2%, with a median of 57.0% (range, 6.0% to 88.0%). The distribution of correct answers showed a predominance of option D (n=22), followed by option B (n=13), option A (n=11), option C (n=11), and option E (n=10). Among the study cohort, 37 cases (55.2%) were classified as tumor-related, and 15 cases (22.4%) were categorized as rare conditions. Detailed case characteristics are provided in Table 1.

Diagnostic Performance of Multimodal LLMs Using Various System Prompts

Fig. 2 illustrates a representative user prompt that includes both textual and imaging inputs for the three multimodal LLMs. The text component comprised imaging information, clinical queries, multiple-choice options, and radiologic image descriptions used for the subgroup analysis.
Diagnostic accuracy varied among the three multimodal LLMs under different prompt conditions. GPT-4o achieved 43.3% accuracy with both the original and chain-of-thought prompts, while the AI-generated prompt yielded the highest accuracy at 46.3%, although the differences were not statistically significant (P=0.765). Gemini-1.5-Pro-002 demonstrated its best performance with the original and reflection prompts (41.8% each), but this did not reach statistical significance (P=0.635). Claude 3.5 Sonnet showed the lowest accuracy (41.8%) with the reflection prompt and the highest accuracy (53.7%) with the AI-generated prompt, representing a significant difference of 11.9% (P=0.039). When the results of all three models were combined, the AI-generated prompt achieved the highest overall accuracy (46.3%), with improvements over the basic prompt (40.8%; difference, 5.5%; P=0.035), the chain-of-thought prompt (42.3%; difference, 4.0%; P=0.169), and the multiagent prompt (42.8%; difference, 3.5%; P=0.248). None of the three LLMs exceeded the human accuracy benchmark of 55.2% (Fig. 3). Repeatability was high for all LLMs, with Fleiss κ values of 0.79 for GPT-4o, 0.82 for Gemini-1.5-Pro-002, and 0.82 for Claude 3.5 Sonnet (Supplementary Table 2).
Comparative analysis across system prompts revealed significant inter-model differences in diagnostic accuracy when using AI-generated prompts (P=0.096). Post hoc analysis demonstrated that Claude 3.5 Sonnet outperformed Gemini-1.5-Pro (53.7%, 36/67 vs. 38.8%, 26/67; P=0.041), although no significant differences were observed in other multiple comparisons. In the combined analysis of all six prompts, Claude 3.5 Sonnet achieved significantly higher accuracy (46.3%, 186/402) compared to Gemini-1.5-Pro-002 (39.8%, 160/402; P=0.014) (Table 2).

Performance Analysis under Different Input Factors

Table 3 presents the diagnostic accuracy of all three LLMs using the original and AI-generated prompts, stratified by tumor versus nontumor status, human accuracy rate, case rarity, and knowledge cutoff dates.
In the classification analysis (tumor vs. non-tumor), no significant performance differences were observed across the LLMs with either prompt type. All models demonstrated enhanced accuracy in cases with higher human accuracy rates. Regarding case rarity, GPT-4o with the original prompt showed significantly higher accuracy in non-rare cases (53.9%, 28/52) than in rare cases (6.7%, 1/15; P=0.001). This disparity was mitigated with the AI-generated prompt (rare: 40.0%, 6/15 vs. non-rare: 48.1%, 25/52; P=0.796). Claude 3.5 Sonnet maintained significantly higher accuracy in non-rare cases (61.5%, 32/52) versus rare cases (26.7%, 4/15) when using the AI-generated prompt (P=0.021).
Regarding knowledge cutoff dates, Claude 3.5 Sonnet demonstrated significantly higher accuracy for pre-cutoff cases (original prompt: 56.0%, 28/50; AI-generated prompt: 64.0%, 32/50) compared to post-cutoff cases (23.5%, 4/17 for both prompts; P=0.026 and P=0.005, respectively). GPT-4o and Gemini-1.5-Pro-002 exhibited no significant temporal variations in performance.
Multivariable logistic regression analysis (Supplementary Table 3) identified the human accuracy rate as a consistent predictor of diagnostic performance across all LLMs. With original prompts, the odds ratios were significantly associated with performance: GPT-4o (odds ratio [OR], 5.8 [95% confidence interval (CI), 1.5 to 22.4]), Gemini-1.5-Pro-002 (OR, 6.4 [95% CI, 1.8 to 22.6]), and Claude 3.5 Sonnet (OR, 5.9 [95% CI, 1.6 to 22.2]). Similar patterns were observed with AI-generated prompts: GPT-4o (OR, 4.3 [95% CI, 1.4 to 13.5]), Gemini-1.5-Pro-002 (OR, 3.7 [95% CI, 1.1 to 12.2]), and Claude 3.5 Sonnet (OR, 6.4 [95% CI, 1.6 to 26.6]).
Additional significant determinants included case rarity, which notably reduced diagnostic accuracy for GPT-4o with the original prompt (OR, 0.1 [95% CI, 0.0 to 0.6]). The knowledge cutoff date emerged as a significant factor specifically for Claude 3.5 Sonnet when using the AI-generated prompt (OR, 0.2 [95% CI, 0.1 to 0.9]).

Subgroup Analysis of Image-Only vs. Combined Imaging–Descriptive Text Input

Fig. 4 displays the performance variations between imaging-only and combined imaging-descriptive text inputs. The human accuracy benchmark was 55.2%. GPT-4o showed significant improvements with combined inputs using the basic prompt (58.2% vs. 40.3%, P=0.002), chain-of-thought prompt (61.2% vs. 43.3%, P=0.004), multiagent prompt (56.7% vs. 43.3%, P=0.031), and AI-generated prompt (58.2% vs. 46.3%, P=0.049). Gemini-1.5-Pro-002 demonstrated significant improvements across all prompt types, with an overall accuracy of 60.4% compared to 39.8% (P<0.001), and the most substantial improvement was observed with the AI-generated prompt (62.7% vs. 38.8%, P<0.001). Claude 3.5 Sonnet achieved the highest overall performance with combined inputs (66.2% vs. 46.3%, P<0.001), particularly with the AI-generated prompt (71.6% vs. 53.7%, P=0.004), and showed significant improvements across all prompt types. Overall, the addition of descriptive text significantly improved diagnostic accuracy across all models and prompt types (61.4% vs. 43.2%, P<0.001).
Furthermore, the improvement in diagnostic accuracy with descriptive text was significant across all prompt strategies: basic (20.9% improvement, P<0.001), original (16.9%, P<0.001), chain-of-thought (18.9%, P<0.001), reflection (16.4%, P<0.001), multiagent (17.9%, P<0.001), and AI-generated prompts (17.9%, P<0.001) (Table 4). Detailed performance metrics for descriptive text inputs are provided in Supplementary Table 4.

Discussion

This study evaluated the diagnostic performance of three multimodal LLMs—Claude 3.5 Sonnet, GPT-4o, and Gemini-1.5-Pro-002—in interpreting radiological cases from the KSUM case discussion repository. Among all tested models and prompt engineering strategies, Claude 3.5 Sonnet with AI-generated prompts achieved the highest numerical accuracy. Nevertheless, the performance of all models remained below human diagnostic benchmarks, highlighting the current limitations of autonomous radiological image interpretation by LLMs. The integration of descriptive text inputs led to significant improvements in diagnostic accuracy across most scenarios. These findings underscore the critical role of contextual information in enhancing interpretative accuracy [28,29].
Multiple factors influenced diagnostic accuracy, including the prompt engineering methodology, various input factors, and model pretraining characteristics [29-31]. Original prompts yielded superior performance compared to basic prompts, aligning with previous research [30,32]. However, chain-of-thought prompts produced similar accuracy to original prompts. GPT-4o, known for its capabilities in specialized scientific reasoning [33], along with the other evaluated LLMs, naturally employed a step-by-step approach in their diagnostic processes during the conversational testing. This suggests that explicit chain-of-thought instructions in system prompts may be redundant [26], as these advanced models have already integrated analytical reasoning into their base responses [34].
Analysis of reflection prompts revealed distinct behavioral patterns among models. While GPT-4o and Gemini-1.5-Pro-002 exhibited minimal answer revision, Claude 3.5 Sonnet demonstrated a 17.9% revision rate. However, its revised responses showed decreased accuracy (41.8%, 28/67) compared to the initial answers (49.3%, 33/67). Only two cases improved from incorrect to correct answers, whereas seven cases changed from correct to incorrect following revisions, typically due to recalibration between general and atypical scenario interpretations. Notably, with the integration of descriptive text, the revised accuracy remained essentially unchanged (64.2%, 43/67), with two cases improving and two cases worsening. This pattern suggests that diagnostic uncertainty in imaging analysis may stem from insufficient pretraining on radiological images relative to textual data [35]. Moreover, the implementation of multiagent prompts did not replicate previously reported performance enhancements, possibly due to lengthy prompt structures (293 words) and limitations imposed by single-session chat and single-system prompts, which may have constrained effective agent collaboration [25,26,32]. This finding indicates opportunities for future research and development in optimizing multiagent architectures.
The AI-generated prompt, developed as an enhancement of the original prompt structure, consistently demonstrated superior performance despite its substantial length (295 words), indicating a reduced dependence on human-engineered prompts. This effect was particularly pronounced with Claude 3.5 Sonnet, which exhibited enhanced performance with Anthropic-developed technology. This suggests that prompts generated using the same underlying architecture may be more effective for that specific model. Furthermore, as shown in Supplementary Table 5, in the analysis excluding AI-generated prompts, performance differences between models were not statistically significant for imaging-only input, although Claude maintained numerically higher accuracy (44.8%). This finding implies that the superior performance initially observed may be partially attributed to the alignment between Claude’s architecture and its AI-generated prompts, highlighting the importance of platform-specific prompt optimization strategies.
Regarding case rarity, GPT-4o initially performed poorly on rare cases with the original prompt (6.7%, 1/15) compared to non-rare cases (53.9%, 28/52; P=0.001). However, the implementation of AI-generated prompts substantially improved rare case performance (40.0%, 6/15), making it comparable to non-rare cases (48.1%, 25/52; P=0.796). Additionally, the multivariable analysis revealed that while the correction rate was consistently positively associated with diagnostic accuracy (OR, 4.3 to 6.4; P<0.05), the negative impact of case rarity (OR, 0.1; P=0.014) was effectively mitigated through AI-generated prompts (OR, 1.0; P=0.984). This suggests that although the model’s pretraining and supervised fine-tuning data likely emphasized typical cases, optimized prompt engineering can help overcome this limitation. For future development, post-training strategies should deliberately incorporate more atypical and rare cases to improve model performance across a broader spectrum of radiologic presentations.
Temporal analysis revealed varying patterns in model performance relative to knowledge cutoff dates. GPT-4o and Gemini-1.5-Pro-002 showed no significant differences between pre- and post-cutoff cases. In contrast, Claude 3.5 Sonnet demonstrated significantly higher accuracy for pre-cutoff cases (64.0%, 32/50) compared to post-cutoff cases (23.5%, 4/17; P=0.005). However, this finding should be interpreted cautiously due to the inherent complexity in determining effective knowledge cutoffs. Recent research has suggested that reported cutoff dates may not accurately reflect the temporal alignment of various resources within LLM training data, implying that the actual temporal boundaries affecting model performance may differ from the reported dates [36]. These limitations, combined with proprietary restrictions on accessing detailed pretraining data, underscore the need for more comprehensive investigations into the temporal effects on LLM performance in medical imaging applications.
The present study demonstrated that adding descriptive text inputs improved model performance, in line with previous studies of radiological image analysis [37]. These findings confirm that radiologic image descriptions and medical history are strong contributors to LLM performance in imaging analysis. It is acknowledged that the quality and specificity of these text descriptions may influence model performance. Future studies should use standardized methods to investigate how different levels of text input quality (regarding imaging descriptions, medical history, and structure) affect model accuracy, providing valuable insights for optimizing multimodal LLM applications in radiology.
This study had several limitations. First, the relatively small sample size may limit the generalizability of the findings and may have affected the ability to detect statistically significant differences between individual models after multiple comparison corrections. Future studies with larger datasets across various radiological conditions would be valuable to validate the findings and potentially identify additional patterns in model performance. Second, the analysis did not include a detailed evaluation of the LLMs’ reasoning processes behind their answer selections, making it difficult to determine whether correct responses resulted from genuine understanding or mere pattern recognition. Third, the study’s reliance on multiple-choice questions for performance evaluation may not fully represent real-world clinical scenarios that typically require free-text responses. Future studies should incorporate free-text radiological interpretations and reporting that more closely reflect clinical practice. Fourth, methodological constraints prevented the evaluation of text-only inputs (description-only) due to variations in user and system prompts, which would have compromised direct comparisons. Fifth, human performance metrics from the KSUM website may be subject to reporting bias, as they only reflect responses from users who voluntarily submitted answers online, potentially affecting the representativeness of the comparative analysis. Lastly, due to the lack of open technologies for prompt generation across different platforms, it was not possible to test how various model architectures might respond to platform-specific prompt optimization techniques. Future studies should explore this aspect to better understand the relationship between model architecture and prompt engineering effectiveness.
Claude 3.5 Sonnet, when utilizing AI-generated prompts, demonstrated the highest diagnostic accuracy among the evaluated multimodal LLMs, although it did not reach human performance benchmarks. Consequently, autonomous radiological image interpretation by LLMs remains limited for direct clinical implementation. Given the significant enhancement in performance achieved through the integration of descriptive text inputs, combining radiologist-generated descriptive content with LLM analysis holds potential as a supportive diagnostic tool.

Notes

Author Contributions

Conceptualization: Han T, Jeong WK, Shin J. Data acquisition: Han T, Jeong WK. Data analysis or interpretation: Han T, Jeong WK. Drafting of the manuscript: Han T. Critical revision of the manuscript: Han T, Jeong WK, Shin J. Approval of the final version of the manuscript: all authors.

Woo Kyoung Jeong serves as an Editor for Ultrasonography but had no role in the decision to publish this article. All remaining authors have declared no conflicts of interest.

Supplementary Material

Supplementary Table 1. System prompts used in the study (https://doi.org/10.14366/usg.25012).
Supplementary Table 2. Analysis of model response repeatability using the Fleiss kappa statistic across the three large language models (https://doi.org/10.14366/usg.25012).
Supplementary Table 3. Multivariable logistic regression analysis of factors affecting diagnostic accuracy across different models using the original and AI-generated system prompts (https://doi.org/10.14366/usg.25012).
Supplementary Table 4. Comparison of diagnostic accuracy among multimodal LLMs using different prompt types with combined imaging-descriptive text input (https://doi.org/10.14366/usg.25012).
Supplementary Table 5. Comparison of diagnostic accuracy among multimodal LLMs using all prompts excluding the AI-generated prompt, with imaging-only and combined imaging-descriptive text input (https://doi.org/10.14366/usg.25012).

References

1. Tian S, Jin Q, Yeganova L, Lai PT, Zhu Q, Chen X, et al. Opportunities and challenges for ChatGPT and large language models in biomedicine and health. Brief Bioinform 2023;25:bbad493.
2. Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I. Language models are unsupervised multitask learners [Internet]. OpenAI Blog, 2019 [cited 2024 Dec 10]. Available from: https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.

3. Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, Aleman FL, et al. GPT-4 technical report. Preprint arXiv at: https://doi.org/10.48550/arXiv.2303.08774 (2023).
4. Tang L, Sun Z, Idnay B, Nestor JG, Soroush A, Elias PA, et al. Evaluating large language models on medical evidence summarization. NPJ Digit Med 2023;6:158.
5. Jin Q, Leaman R, Lu Z. Retrieve, summarize, and verify: how will ChatGPT affect information seeking from the medical literature? J Am Soc Nephrol 2023;34:1302–1304.
6. Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, et al. Large language models encode clinical knowledge. Nature 2023;620:172–180.
7. Nori H, King N, McKinney SM, Carignan D, Horvitz E. Capabilities of GPT-4 on medical challenge problems. Preprint arXiv at: https://doi.org/10.48550/arXiv.2303.13375 (2023).
8. Lievin V, Hother CE, Motzfeldt AG, Winther O. Can large language models reason about medical questions? Patterns (N Y) 2024;5:100943.
9. Nori H, Lee YT, Zhang S, Carignan D, Edgar R, Fusi N, et al. Can generalist foundation models outcompete special-purpose tuning? case study in medicine. Preprint arXiv at: https://doi.org/10.48550/arXiv.2311.16452 (2023).
10. Lake BM, Ullman TD, Tenenbaum JB, Gershman SJ. Building machines that learn and think like people. Behav Brain Sci 2017;40:e253.
11. Wu T, He S, Liu J, Sun S, Liu K, Han QL, et al. A brief overview of ChatGPT: the history, status quo and potential future development. IEEE/CAA J Automat Sin 2023;10:1122–1136.
12. Zhou Y, Ong H, Kennedy P, Wu CC, Kazam J, Hentel K, et al. Evaluating GPT-V4 (GPT-4 with vision) on detection of radiologic findings on chest radiographs. Radiology 2024;311:e233270.
13. Sun Z, Ong H, Kennedy P, Tang L, Chen S, Elias J, et al. Evaluating GPT4 on impressions generation in radiology reports. Radiology 2023;307:e231259.
14. Yan Z, Zhang K, Zhou R, He L, Li X, Sun L. Multimodal ChatGPT for medical applications: an experimental study of GPT-4V. Preprint arXiv at: https://doi.org/10.48550/arXiv.2310.19061 (2023).
15. Su X, Wang Y, Gao S, Liu X, Giunchiglia V, Clevert DA, et al. KGARevion: an AI agent for knowledge-intensive biomedical QA. Preprint arXiv at: https://doi.org/10.48550/arXiv.2410.04660 (2024).
16. Kitamura FC, Topol EJ. The initial steps of multimodal AI in radiology. Radiology 2023;309:e232372.
17. Fink MA, Bischoff A, Fink CA, Moll M, Kroschke J, Dulz L, et al. Potential of ChatGPT and GPT-4 for data mining of free-text CT reports on lung cancer. Radiology 2023;308:e231362.
18. Adams LC, Truhn D, Busch F, Kader A, Niehues SM, Makowski MR, et al. Leveraging GPT-4 for post hoc transformation of free-text radiology reports into structured reporting: a multilingual feasibility study. Radiology 2023;307:e230725.
19. Lehnen NC, Dorn F, Wiest IC, Zimmermann H, Radbruch A, Kather JN, et al. Data extraction from free-text reports on mechanical thrombectomy in acute ischemic stroke using ChatGPT: a retrospective analysis. Radiology 2024;311:e232741.
20. Chen D, Huang RS, Jomy J, Wong P, Yan M, Croke J, et al. Performance of multimodal artificial intelligence chatbots evaluated on clinical oncology cases. JAMA Netw Open 2024;7:e2437711.
21. Suh PS, Shim WH, Suh CH, Heo H, Park KJ, Kim PH, et al. Comparing large language model and human reader accuracy with New England Journal of Medicine image challenge case image inputs. Radiology 2024;313:e241668.
22. Morishita M, Fukuda H, Yamaguchi S, Muraoka K, Nakamura T, Hayashi M, et al. An exploratory assessment of GPT-4o and GPT-4 performance on the Japanese National Dental Examination. Saudi Dent J 2024;36:1577–1581.
23. Park SH, Suh CH, Lee JH, Kahn CE, Moy L. Minimum reporting items for clear evaluation of accuracy reports of large language models in healthcare (MI-CLEAR-LLM). Korean J Radiol 2024;25:865–868.
24. Suh PS, Shim WH, Suh CH, Heo H, Park CR, Eom HJ, et al. Comparing diagnostic accuracy of radiologists versus GPT-4V and Gemini Pro vision using image inputs from diagnosis please cases. Radiology 2024;312:e240273.
25. Li Y, Zhang S, Wu R, Huang X, Chen Y, Xu W, et al. MATEval: a multi-agent discussion framework for advancing open-ended text evaluation. In: Database systems for advanced applications. DASFAA 2024. Lecture notes in computer science, Vol. 14856. Singapore: Springer, 2024;415-426.

26. Lee JH, Shin J. How to optimize prompting for large language models in clinical research. Korean J Radiol 2024;25:869–873.
27. Higgins JP, Green S. Cochrane handbook for systematic reviews of interventions. Chichester: Cochrane Collaboration and John Wiley & Sons Ltd., 2008.

28. Gunes YC, Cesur T, Camur E, Gunbey Karabekmez L. Evaluating text and visual diagnostic capabilities of large language models on questions related to the Breast Imaging Reporting and Data System Atlas 5(th) edition. Diagn Interv Radiol 2025;31:111–129.
29. Mukherjee P, Hou B, Suri A, Zhuang Y, Parnell C, Lee N, et al. Evaluation of GPT large language model performance on RSNA 2023 case of the day questions. Radiology 2024;313:e240609.
30. Cesur T, Gunes YC. Optimizing diagnostic performance of ChatGPT: the impact of prompt engineering on thoracic radiology cases. Cureus 2024;16:e60009.
31. Schramm S, Preis S, Metz MC, Jung K, Schmitz-Koep B, Zimmer C, et al. Impact of multimodal prompt elements on diagnostic performance of GPT-4(V) in challenging brain MRI cases. Preprint medRxiv at: https://doi.org/10.1101/2024.03.05.24303767 (2024).
32. Hayden N, Gilbert S, Poisson LM, Griffith B, Klochko C. Performance of GPT-4 with vision on text- and image-based ACR diagnostic radiology in-training examination questions. Radiology 2024;312:e240153.
33. Hurst A, Lerer A, Goucher AP, Perelman A, Ramesh A, Clark A, et al. GPT-4o system card. Preprint arXiv at: https://doi.org/10.48550/arXiv.2410.21276 (2024).
34. Shahriar S, Lund BD, Mannuru NR, Arshad MA, Hayawi K, Bevara RV, et al. Putting GPT-4o to the sword: a comprehensive evaluation of language, vision, speech, and multimodal proficiency. Appl Sci 2024;14:7782.
35. Agbareia R, Omar M, Soffer S, Glicksberg BS, Nadkarni GN, Klang E. Visual-textual integration in LLMs for medical diagnosis: a quantitative analysis. Preprint medRxiv at: https://doi.org/10.1101/2024.08.31.24312878 (2024).
36. Cheng J, Marone M, Weller O, Lawrie D, Khashabi D, Van Durme B. Dated data: tracing knowledge cutoffs in large language models. Preprint arXiv at: https://doi.org/10.48550/arXiv.2403.12958 (2024).
37. Schramm S, Preis S, Metz MC, Jung K, Schmitz-Koep B, Zimmer C, et al. Impact of multimodal prompt elements on diagnostic performance of GPT-4V in challenging brain MRI cases. Radiology 2025;314:e240689.

Fig. 1. Workflow of the selection and analysis of study cases.
Flowchart shows the selection process of quiz cases from the Korean Society of Ultrasound in Medicine database and the subsequent analysis methodology. AI, artificial intelligence.

Fig. 2. Representative example of the multimodal input format.
The standardized input format combining radiological images with structured text components (question, imaging information, and descriptive information) used for model evaluation is demonstrated.

Fig. 3. Diagnostic accuracy across different prompt engineering strategies.
These graphs show the diagnostic accuracy of the three multimodal large language models (A, GPT-4o; B, Gemini-1.5-Pro; C, Claude 3.5 Sonnet; D, total) using six distinct prompt engineering approaches. The horizontal line indicates the human performance benchmark (55.2%). AI, artificial intelligence. *P<0.05.

Fig. 4. Impact of descriptive text integration on model performance.
These graphs show a side-by-side comparison of diagnostic accuracy between imaging-only input and combined imaging and descriptive text input across different prompt engineering strategies for each multimodal large language model (A, GPT-4o; B, Gemini-1.5-Pro; C, Claude 3.5 Sonnet; D, total). The horizontal line indicates the human performance benchmark (55.2%). AI, artificial intelligence. *P<0.05, **P<0.01.
Table 1.
Characteristics of the case discussion quiz from the Korean Society of Ultrasound in Medicine
Characteristic Value
Correction rate (%) 57.0 (6.0-88.0)
 <25 5 (7.5)
 25-49 23 (34.3)
 50-75 24 (35.8)
 >75 15 (22.4)
Image number per case 3.6±1.1
 1-3 29 (43.3)
 >3 38 (56.7)
Modality
 US 67
 Radiography 6
 CT 11
 MRI 20
 Nuclear imaging 5
 Othersa) 4
Answer distribution
 A 11 (16.4)
 B 13 (19.4)
 C 11 (16.4)
 D 22 (32.8)
 E 10 (14.9)
Subspecialty
 Breast 9 (13.4)
 Cardiovascular 6 (9.0)
 Gastrointestinal 9 (13.4)
 Genitourinary 9 (13.4)
 Hepatobiliary 8 (11.9)
 Head and neck 6 (9.0)
 Musculoskeletal 9 (13.4)
 Pediatrics 9 (13.4)
 Thyroid 2 (3.0)
Classification
 Tumor 37 (55.2)
 Non-tumor 30 (44.8)
Rarity
 Rare case 15 (22.4)
 Non-rare case 52 (77.6)

Values are presented as median (range), number (%), or mean±standard deviation.

a) Others include imaging for aspiration fluid (n=2) and endoscopy (n=2).

Table 2.
Comparison of diagnostic accuracy among multimodal LLMs across different system prompt types
System prompt type GPT-4o Gemini-1.5-Pro Claude 3.5 Sonnet Cochran's Q testa) Post hoc analysisb)
Basic prompt 40.3 (27/67) 40.3 (27/67) 41.8 (28/67) 0.961 -
Original prompt 43.3 (29/67) 41.8 (28/67) 47.8 (32/67) 0.648 -
Chain-of-thought prompt 43.3 (29/67) 37.3 (25/67) 46.3 (31/67) 0.311 -
Reflection promptc) 44.8 (30/67) 41.8 (28/67) 41.8 (28/67) 0.867 -
 Initial answer 44.8 (30/67) 41.8 (28/67) 49.3 (33/67) - -
 Reflection rate 0.0 (0/67) 1.5 (1/67) 17.9 (12/67) - -
Multiagent prompt 43.3 (29/67) 38.8 (26/67) 46.3 (31/67) 0.495 -
AI-generated prompt 46.3 (31/67) 38.8 (26/67) 53.7 (36/67) 0.096 Ge vs. Cl (P=0.041), Ge vs. Gp (P=0.359), Gp vs. Cl (P=0.424)
Total prompt 43.5 (175/402) 39.8 (160/402) 46.3 (186/402) 0.046 Ge vs. Cl (P=0.014), Ge vs. Gp (P=0.195), Gp vs. Cl (P=0.343)

Values are presented as percentage (number/total).

LLM, large language model; AI, artificial intelligence.

a) Cochran's Q test; significance level set at P<0.1.

b) P<0.017 was considered statistically significant following Bonferroni correction for multiple comparisons. Ge vs. Cl, Gemini-1.5-Pro vs. Claude 3.5 Sonnet; Ge vs. Gp, Gemini-1.5-Pro vs. GPT-4o; Gp vs. Cl, GPT-4o vs. Claude 3.5 Sonnet.

c) Results shown are for revised answers.

Table 3.
Diagnostic accuracy of multimodal LLMs under different input factors
Model Classification (Tumor / Non-tumor / P-value) Correction rate (≤50% / >50% / P-value) Rarity (Rare / Non-rare / P-value) Knowledge cutoff date (Before / After / P-value)
Original prompt
 GPT-4o 40.5 (15/37) 46.7 (14/30) 0.798 21.4 (6/28) 59.0 (23/39) 0.005* 6.7 (1/15) 53.9 (28/52) 0.001* 48.7 (19/39) 35.7 (10/28) 0.418
 Gemini-1.5-Pro 43.2 (16/37) 40.0 (12/30) 0.985 21.4 (6/28) 56.4 (22/39) 0.009* 33.3 (5/15) 44.2 (23/52) 0.558 43.3 (26/60) 28.6 (2/7) 0.690
 Claude 3.5 Sonnet 51.4 (19/37) 43.3 (13/30) 0.684 28.6 (8/28) 61.5 (24/39) 0.016* 26.7 (4/15) 53.9 (28/52) 0.082 56.0 (28/50) 23.5 (4/17) 0.026*
 Cochran's Q testa) 0.465 0.794 0.670 0.861 0.039 0.382 - -
AI-generated prompt
 GPT-4o 37.8 (14/37) 56.7 (17/30) 0.197 25.0 (7/28) 61.5 (24/39) 0.007* 40.0 (6/15) 48.1 (25/52) 0.796 48.7 (19/39) 42.9 (12/28) 0.821
 Gemini-1.5-Pro 35.1 (13/37) 43.3 (13/30) 0.665 21.4 (6/28) 51.3 (20/39) 0.027* 26.7 (4/15) 42.3 (22/52) 0.372 40.0 (24/60) 28.6 (2/7) 0.697
 Claude 3.5 Sonnet 59.5 (22/37) 46.7 (14/30) 0.425 35.7 (10/28) 66.7 (26/39) 0.024* 26.7 (4/15) 61.5 (32/52) 0.021* 64.0 (32/50) 23.5 (4/17) 0.005*
 Cochran's Q testa) 0.014 0.420 0.368 0.229 0.368 0.060 - -

Values are presented as percentage (number/total).

LLM, large language model; AI, artificial intelligence.

a) Cochran's Q test; significance level set at P<0.1.

* P<0.05.

Table 4.
Comparison of diagnostic accuracy between image-only and combined imaging-descriptive text input across different system prompt types
Prompt type GPT-4o (Image only / Combined description / P-value) Gemini-1.5-Pro (Image only / Combined description / P-value) Claude 3.5 Sonnet (Image only / Combined description / P-value) Total multimodal LLMs (Image only / Combined description / P-value)
Basic prompt 40.3 (27/67) 58.2 (39/67) 0.002 40.3 (27/67) 59.7 (40/67) 0.003 41.8 (28/67) 67.2 (45/67) <0.001 40.8 (82/201) 61.7 (124/201) <0.001
Original prompt 43.3 (29/67) 55.2 (37/67) 0.063 41.8 (28/67) 59.7 (40/67) 0.007 47.8 (32/67) 68.7 (46/67) 0.003 44.3 (89/201) 61.2 (123/201) <0.001
Chain-of-thought prompt 43.3 (29/67) 61.2 (41/67) 0.004 37.3 (25/67) 61.2 (41/67) 0.001 46.3 (31/67) 61.2 (41/67) 0.013 42.3 (85/201) 61.2 (123/201) <0.001
Reflection prompta) 44.8 (30/67) 55.2 (37/67) 0.115 41.8 (28/67) 58.2 (39/67) 0.008 41.8 (28/67) 64.2 (43/67) <0.001 42.8 (86/201) 59.2 (119/201) <0.001
Multiagent prompt 43.3 (29/67) 56.7 (38/67) 0.031 38.8 (26/67) 61.2 (41/67) 0.001 46.3 (31/67) 64.2 (43/67) 0.004 42.8 (86/201) 60.7 (122/201) <0.001
AI-generated prompt 46.3 (31/67) 58.2 (39/67) 0.049 38.8 (26/67) 62.7 (42/67) <0.001 53.7 (36/67) 71.6 (48/67) 0.004 46.3 (93/201) 64.2 (129/201) <0.001
Total prompt 43.5 (175/402) 57.5 (231/402) <0.001 39.8 (160/402) 60.4 (243/402) <0.001 46.3 (186/402) 66.2 (266/402) <0.001 43.2 (521/1206) 61.4 (741/1206) <0.001

Values are presented as percentage (number/total).

a) Reflection prompt shows revised answer.
