The application of artificial intelligence (AI) technology to medical imaging has recently brought about tremendous excitement, and AI is making its way into clinical practice, thanks to the technical prowess of current deep learning technology compared with the machine learning methods of the past, the wide availability of digital medical images, and the increased capabilities of computing hardware [1-4]. AI has been tried for ultrasonography in various organs and systems, such as the thyroid, musculoskeletal system, breast, and abdomen, as discussed in detail in the focused review articles of this special issue [5-8], albeit not as extensively as for some other radiological imaging modalities such as chest X-rays [9]. AI is anticipated to enhance the quality of ultrasonographic images, to provide various forms of diagnostic support (e.g., automated characterization of findings on ultrasonographic images; extraction of quantitative or predictive information from ultrasonographic images, which is difficult for a human examiner to do based on visual observations; and automated detection or segmentation of various structures on ultrasonographic images), and to improve workflow efficiency [10]. The list of specific examples of AI applications to ultrasonography is expected to grow in the future.
AI algorithms may augment the diagnostic accuracy and capability of ultrasonography examiners and are hoped to be particularly helpful for less-experienced examiners [11-15]. Ultrasonography is more widely used in clinical practice than computed tomography (CT) or magnetic resonance imaging (MRI), and it is performed by a more diverse range of medical professionals with varying levels of expertise. Typically, a single examiner interprets the findings and makes decisions on the fly while performing the examination. As a result, the greater operator-dependency and subjectivity of ultrasonography compared with CT or MRI are well-known issues. Therefore, one of the most eagerly anticipated benefits of applying AI to ultrasonography is reduced variability between examiners, which may offer a unique opportunity to improve the performance of ultrasonography. Nonetheless, it should be noted that the very nature of ultrasonography also poses challenges in the development and clinical implementation of AI for ultrasonography.
First, the operator-dependency and subjectivity of ultrasonography introduce additional variability in the acquisition of imaging data. These factors could exacerbate the limited generalizability of current AI systems built with deep learning [16]. The ultrasonographic images that are ultimately obtained are determined by how the examiner captures them. Thus, the results of AI depend on how the target structure is represented and defined by the examiner in the captured image [17] and, furthermore, on whether the target object is correctly identified and captured at all, unless an entire 3-dimensional volume scan is used, such as those obtained using automated breast ultrasound systems. For the same reason, considerable discrepancies may exist between the dataset collected to train an AI algorithm and the imaging data generated in real-world practice to be fed into the AI system. Therefore, even for a highly sophisticated AI system to work correctly, some degree of competency of the human examiner, at least sufficing to scan the patient properly, still matters [17]. Moreover, standardization of scanning and image acquisition, tailored to the diagnostic task, would be critical for the successful application of AI to ultrasonography, and such standardization itself requires human expertise. In some sense, the successful application of AI to ultrasonography creates an impetus for standardizing and ensuring the quality of examinations performed by humans.
Second, the more widespread use of ultrasonography in clinical practice and its relatively easy accessibility require extra caution when interpreting the results of AI used with ultrasonography. The results given by AI, which capitalizes on the associations between input features and outcome states, are probabilistic. Therefore, unlike the results provided by tests based on cause-effect relationships, the results of AI algorithms should generally not be regarded as fixed results. A positive result from a test that finds a clear causal determinant to make the diagnosis can be accepted as a fixed result regardless of other factors. An illustrative example is the reverse transcription polymerase chain reaction (RT-PCR) test for severe acute respiratory syndrome coronavirus 2. Because this test detects the RNA of the virus, a positive RT-PCR result is definitive proof of the presence of the virus, as long as extraordinary cases of residual RNA being detected in convalescent patients are excluded. In contrast, the interpretation of AI results is affected substantially by the pretest probability and the relevant spectrum of disease manifestation [18]. An AI algorithm typically applies a threshold to a probability-like internal raw algorithm output to generate the final categorical result shown to the user (e.g., cancer vs. benign) or may present the raw output in the form of a probability (e.g., a 65% probability of cancer). Both the accuracy of the probability scale and the optimal threshold are profoundly affected by the pretest probability and disease manifestation spectrum, which are, in turn, determined by the baseline characteristics of the patient and the clinical setting.
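The dependence on pretest probability can be made concrete with Bayes' theorem. The following is a minimal sketch, not a description of any particular AI system: it assumes a hypothetical algorithm whose categorical output has a sensitivity and specificity of 90% each, and it further assumes that these two values stay constant across settings, which in practice may not hold because the disease manifestation spectrum also changes. The function name and all numerical values are illustrative assumptions.

```python
def positive_predictive_value(sensitivity: float,
                              specificity: float,
                              pretest_probability: float) -> float:
    """Probability of disease given a positive result (Bayes' theorem)."""
    true_positive = sensitivity * pretest_probability
    false_positive = (1.0 - specificity) * (1.0 - pretest_probability)
    return true_positive / (true_positive + false_positive)


# Hypothetical operating point; not taken from any study.
SENSITIVITY = 0.90
SPECIFICITY = 0.90

# The same positive AI result evaluated in three hypothetical settings,
# e.g., screening, general practice, and a referral center.
for pretest_probability in (0.01, 0.10, 0.50):
    ppv = positive_predictive_value(SENSITIVITY, SPECIFICITY, pretest_probability)
    print(f"pretest probability {pretest_probability:.0%}: "
          f"probability of disease after a positive result = {ppv:.1%}")
```

Under these assumptions, an identical positive result corresponds to a probability of disease of roughly 8% when the pretest probability is 1%, 50% when it is 10%, and 90% when it is 50%, which is why the same output cannot be read as a fixed result across patients and clinical settings.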
It is critical for AI users to understand that the same AI result could be correct for one patient but not for another, right in one hospital but not in another hospital, and so on, depending on patients’ baseline characteristics and the clinical setting. The limited generalizability of AI algorithms for medical diagnosis and prediction (i.e., the substantial variability in AI accuracy across patients and hospitals) is a well-known phenomenon, often described as "overfitting" in a broad sense [2,18-23]. This problem is primarily due to epidemiological factors, as mentioned above (pretest probability and disease manifestation spectrum), or, more simply, a disparity between training data and real-world data, rather than technical/mathematical overfitting [2,18-20]. This pitfall may be especially pronounced for AI algorithms for ultrasonography, as ultrasonography examinations are often used in a wide range of clinical settings and patients, and are performed by a diverse range of medical professionals with varying expertise. Ultrasonography systems are also more diverse, with more vendors and versions, than CT or MRI. While one might expect AI to be more helpful for less-experienced examiners, ironically, less-experienced examiners may be more likely to have difficulties in appraising AI results and more vulnerable to developing a complacent attitude of merely accepting the AI results without the necessary appraisal. Such complacency would ultimately compromise the accuracy of ultrasonography examinations. The fact that ultrasonography is typically performed and interpreted on the fly by a single practitioner may further increase the risk. Consequently, the human expertise of the examiner, including adequate knowledge and experience in ultrasonography examinations, sound clinical and epidemiological knowledge, and ideally some knowledge about AI as well, would be crucial for maximizing the benefits that AI may provide.
The issue of overfitting underscores the importance of adequate external validation of an AI algorithm in the various real-world clinical settings where it is intended to be used [16,18,24-34]. For all the reasons explained above, the importance of sufficient external validation should perhaps be even more strongly emphasized for AI applications to ultrasonography. A recent systematic review of studies that evaluated AI algorithms for the diagnostic analysis of medical imaging found that only 6% of such studies published in peer-reviewed journals performed some form of external validation (whether they were otherwise methodologically adequate or not) [35]. Future research on AI for ultrasonography should emphasize the external validation of developed algorithms, in addition to the development of novel algorithms. Rigorous external validation helps to clarify the boundaries of when an AI algorithm maintains its anticipated accuracy and when it fails, and can thus inform users of the conditions under which the AI system can be used safely and effectively. Furthermore, establishing a mechanism to deliver such information to the end-users of AI more effectively and explicitly would also be an important next step [36].
Third, the operator-dependency of ultrasonography makes prospective research studies to validate AI even more essential. The effect of a computerized decision support system such as AI depends not only on its technical analytic capability, but also on how the computerized results are presented to and acted upon by human practitioners. Considering the expected operator-dependency and variability in generating the ultrasonography image data and in acting upon AI results during on-the-fly decision-making in real-time examinations, there could be meaningful differences between an analysis of retrospectively collected images and actual clinical practice. Studies on AI for ultrasonography have so far mostly been retrospective. More prospective studies that involve actual interactions between human examiners and AI systems should be performed.
AI research in healthcare is accelerating rapidly, with numerous potential applications being demonstrated. However, there are currently limited examples of such techniques being successfully deployed in clinical practice [1,16]. The introduction of AI into medicine is just beginning, and there remain multitudes of challenges to overcome, including difficulties in obtaining sufficiently large, curated, high-quality, representative datasets, deficiencies in robust clinical validation, and technical limitations such as the "black box" nature of AI algorithms, to name just a few [1,16,37]. These challenges are all relevant to AI for ultrasonography. This article highlighted a few additional points that are unique to AI as applied to ultrasonography and need to be addressed for the successful development and clinical implementation of AI for ultrasonography. In summary, the nature of how ultrasonography examinations are performed and utilized demands extra attention to the following issues regarding AI for ultrasonography. It is crucial to maintain the human expertise of examiners, in terms of both ultrasonography itself and the related clinical and epidemiological knowledge. Standardization of scanning and image acquisition, depending on the diagnostic tasks that AI is used to perform, is also critical. The importance of sufficient external validation of AI algorithms is especially significant for AI used with ultrasonography. Prospective research studies that involve actual interactions between human examiners and AI systems, rather than analyses of retrospectively collected images, should also be conducted.