The learning curve and difficult points of the O-RADS ultrasound risk stratification system in 54 trainees
Abstract
Purpose
This study aimed to evaluate the learning curve and explore the difficult points of the Ovarian-Adnexal Reporting and Data System (O-RADS) ultrasound risk stratification system.
Methods
One hundred adnexal masses (AMs) were randomly selected and divided into five tests of 20 cases each as training data. Two experienced trainers had an inter-rater agreement of 0.95 for the O-RADS scores. Fifty-four trainees (26 level I practitioners [group 1], 17 level II practitioners [group 2], and 11 experienced level II practitioners [group 3]) attended the training. Every trainee received assessment and feedback after each set of 20 scored cases. The outcomes of the five tests were compared among the three groups using repeated-measures analysis of variance.
Results
Of the 100 AMs, 52 were pathologically benign and 48 were malignant; the O-RADS scores were 2, 3, 4, and 5 in 22, 11, 48, and 19 AMs, respectively. The between-subjects effects test showed no significant differences between groups 1, 2, and 3 for the five tests (P=0.501). For each group, the differences among the five tests were significant (P<0.001, P=0.006, and P=0.044 for groups 1, 2, and 3, respectively). Scores were lowest on test 2. In 23 cases, more than 40% of trainees gave incorrect answers; these errors mainly related to classic benign lesions, the color flow score, and solid-appearing masses.
Conclusion
After training, junior doctors at different levels can reach consistent O-RADS ultrasound risk stratification. The difficulties primarily related to subjective judgments of classic benign lesions, the color flow score, and solid-appearing masses. More experience is needed to improve the applicability of the system.
Introduction
Ultrasound (US) imaging is the first choice to describe ovarian adnexal masses (AMs) and estimate their malignancy risk [1]. US is low-cost and easily accessible, but highly operator-dependent. To improve the malignancy risk estimate and the management of AMs, many guidelines and structured reporting systems have been established, using subjective assessments, simple scoring, or statistically derived scoring [2-10].
The Ovarian-Adnexal Reporting and Data System (O-RADS) ultrasound risk stratification and management system is the only lexicon and classification system encompassing all risk categories of AMs, with a management recommendation for each risk category [10]. It may be the most complex AM diagnosis system, including six categories (O-RADS 0-5) and at least 21 detailed combined lexicon descriptors for scoring. Meanwhile, it is the most effective US system, as it improved the accuracy of assessments of the malignancy risk of AMs by providing a standardized reporting tool describing masses in terms of echogenicity, size, cystic wall, internal septum, boundary, shape, and blood flow [1,9,11].
A learning curve can reveal the difficult and important steps of clinical diagnostic and treatment methods and thereby help clinicians build cumulative experience. Analyses of the learning curve are widely used in various fields of medical imaging diagnosis [12-15].
The aim of the present study was to conduct O-RADS US system training for junior doctors, draw the corresponding learning curve, and explore the difficult points of this system to provide a reference for clinical training.
Materials and Methods
Compliance with Ethical Standards
This study was approved by the Human Research Ethics Committee of Second Xiangya Hospital with a waiver of informed consent (No. 2021-038).
Patients
The diagnostic US images, clinical records, and pathological information of 642 women who underwent adnexal tumor resection at the Second Xiangya Hospital between June 2018 and June 2020 were collected.
The inclusion criteria were (1) a clear pathological diagnosis; (2) an intact clinical, ultrasonographic, and surgical record; (3) US images showing enough diagnostic signs without artifacts; and (4) an interval of less than 1 month between ultrasonography and surgery.
In total, 100 AMs were randomly selected and divided evenly into five groups of 20 cases as the training data. The authors Wen and Zhao, two senior doctors with more than 10 years of gynecological US experience, read all the images blinded to pathological information. The intraclass correlation coefficient for inter-rater agreement was 0.95 (95% confidence interval, 0.93 to 0.96) for the O-RADS US score. The two authors determined all the O-RADS US scores together with lexicon descriptors.
Fifty-four doctors from 18 hospitals participated in the training in May 2021. Of the 54 doctors, 26 who had finished their second-year training for residents were included in group 1, 17 who had completed 1 year of attending doctor training in gynecological US were included in group 2, and 11 experienced attending doctors comprised group 3. The doctors in group 1 can be regarded as level I practitioners, those in group 2 as level II practitioners, and those in group 3 as experienced level II practitioners, according to the European standard training requirements for gynecological US practice published by the European Federation of Societies for Ultrasound in Medicine and Biology, including standards for theoretical knowledge and practical skills [16,17]. The two trainers (the authors Wen and Zhao) were equivalent to level III practitioners (experts). All trainees consented to the use of their data for this research.
The author Wen conducted the training. First, the definition of all terms was explained in detail, including (1) normal ovaries; (2) simple cysts, unilocular cysts, and multilocular cysts; (3) typical benign lesions; (4) smooth or irregular inner margins or walls; (5) papillary projections, solid components, and solid-appearing masses; (6) ascites and peritoneal nodules; and (7) color scores of 1-4 [9,11]. After receiving the feedback that all doctors understood the above terms, the specific rules of O-RADS US scoring and classification were further explained with corresponding legends. None of the legends used in the explanation of theoretical knowledge appeared in the subsequent assessment.
In the subsequent image reading test and training, every trainee independently read the diagnostic images of each set of 20 cases. All O-RADS US scores with lexicon descriptors were listed on the answer sheet. The trainee only needed to tick the correct answer for each case. After the test, the trainee received feedback from the trainer, had sufficient communication with the trainer, and continued to the next 20 cases. All tests and training were finished within 1 week. All the answers were reviewed. One point was assigned for a correct answer and 0 for a wrong answer. The maximum possible score for each test was 20.
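The scoring rule described above (one point per correct answer, maximum of 20 per test) can be illustrated with a minimal Python sketch. The function name `score_test` and the answer labels are hypothetical and chosen only for illustration; the study's actual answer sheets combined O-RADS scores with lexicon descriptors and are not reproduced here.

```python
from typing import Sequence


def score_test(answers: Sequence[str], reference: Sequence[str]) -> int:
    """Return the number of answers that match the reference (1 point each)."""
    if len(answers) != len(reference):
        raise ValueError("answers and reference must have the same length")
    return sum(1 for given, correct in zip(answers, reference) if given == correct)


# Hypothetical 20-case answer key; the labels are illustrative, not study data.
reference = ["O-RADS 2"] * 5 + ["O-RADS 3"] * 5 + ["O-RADS 4"] * 5 + ["O-RADS 5"] * 5

# A hypothetical trainee who answers two cases incorrectly.
answers = list(reference)
answers[3] = "O-RADS 3"
answers[12] = "O-RADS 5"

total = score_test(answers, reference)
print(total)  # 18
```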
Statistical Analysis
Statistical analysis was performed using SPSS version 26.0 (IBM Corp., Armonk, NY, USA). Repeated-measures analysis of variance was used to test the differences among the five tests and three comparison groups and to produce the learning curve. The Mauchly test was used to assess the sphericity assumption, followed by the between-subjects effects test. The within-subjects effects test was used to analyze the interaction between the two factors, followed by the simple-effect test if the interaction was significant. A P-value <0.05 was considered to indicate statistical significance.
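For readers unfamiliar with the within-subjects part of this analysis, the following Python sketch computes the F statistic of a one-way repeated-measures analysis of variance for a single group. It is a simplified illustration, not the SPSS procedure used in the study: it omits the between-subjects factor, the Mauchly test, and the P-value, and the function name `rm_anova_f` and the toy scores are assumptions introduced here.

```python
from typing import List, Tuple


def rm_anova_f(scores: List[List[float]]) -> Tuple[float, int, int]:
    """One-way repeated-measures ANOVA for a single group.

    scores[s][t] is the score of subject s on test t.
    Returns (F, df_tests, df_error).
    """
    n = len(scores)     # number of subjects
    k = len(scores[0])  # number of repeated tests
    grand = sum(sum(row) for row in scores) / (n * k)

    subject_means = [sum(row) / k for row in scores]
    test_means = [sum(row[t] for row in scores) / n for t in range(k)]

    # Partition the total sum of squares into subject, test, and error parts.
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_subjects = k * sum((m - grand) ** 2 for m in subject_means)
    ss_tests = n * sum((m - grand) ** 2 for m in test_means)
    ss_error = ss_total - ss_subjects - ss_tests

    df_tests = k - 1
    df_error = (k - 1) * (n - 1)
    f_stat = (ss_tests / df_tests) / (ss_error / df_error)
    return f_stat, df_tests, df_error


# Synthetic scores for 3 subjects on 3 tests (not data from the study).
toy_scores = [[1, 2, 4], [2, 3, 4], [3, 5, 6]]
f_stat, df_tests, df_error = rm_anova_f(toy_scores)
print(f"F({df_tests}, {df_error}) = {f_stat:.2f}")  # F(2, 4) = 32.00
```

Removing the between-subject variability before forming the error term is what distinguishes this design from an ordinary one-way analysis of variance.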
Results
Of the 100 AMs, the pathological findings showed that 52 were benign and 48 were malignant. Twenty-two had an O-RADS score of 2, 11 had an O-RADS score of 3, 48 had an O-RADS score of 4, and 19 had an O-RADS score of 5.
The outcomes of the five tests for groups 1 to 3 are shown in Table 1. The Mauchly test for sphericity yielded a P-value of 0.219 (F=11.921). The between-subjects effects test showed there was no significant difference between groups 1, 2, and 3 at the five test times (F=0.708, P=0.501), as shown on the learning curve (Fig. 1). The subsequent within-subjects effects test demonstrated significant differences among the five tests (F=10.849, P<0.001). No significant interaction was found between the test times and the comparison groups (F=1.944, P=0.059).
The simple-effect test showed no significant differences among the three groups for all five tests (P=0.060-0.910) (Table 1). For each test, no significant difference was found in the pairwise comparison between groups. For each group, the differences among the five tests were significant (P=0.001, P=0.006, and P=0.044 for groups 1, 2, and 3, respectively) (Table 2). The outcome of test 2 was the worst and was significantly poorer than test 5 for all groups.
More than 40% of the trainees failed to give a correct answer in 23 cases, which were reviewed to identify the difficult points of the O-RADS US system (Table 3, Fig. 2). The main difficult points were as follows: (1) classic benign lesions were wrongly read as unilocular or multilocular cystic masses in eight cases; (2) unilocular and multilocular cysts were not differentiated in five cases; (3) solid-appearing masses, which had solid components of more than 80%, were confused with multilocular/unilocular cysts with a solid component in four cases; and (4) the color scores of five multilocular cysts with solid components and one solid mass were incorrect.
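The criterion used to select cases for review (more than 40% of trainees answering incorrectly) can be sketched in Python as follows. The function name `difficult_cases`, the data layout, and the toy matrix are hypothetical; the study's per-trainee answers are not available in this article.

```python
from typing import List


def difficult_cases(correct: List[List[bool]], threshold_percent: int = 40) -> List[int]:
    """Return indices of cases missed by more than threshold_percent of trainees.

    correct[t][c] is True if trainee t answered case c correctly.
    """
    n_trainees = len(correct)
    n_cases = len(correct[0])
    flagged = []
    for c in range(n_cases):
        wrong = sum(1 for t in range(n_trainees) if not correct[t][c])
        # Integer comparison avoids floating-point issues at the exact threshold.
        if wrong * 100 > threshold_percent * n_trainees:
            flagged.append(c)
    return flagged


# Toy data: 5 trainees, 3 cases (illustrative only).
toy = [
    [False, False, True],
    [False, False, True],
    [False, True, True],
    [True, True, True],
    [True, True, True],
]
flagged = difficult_cases(toy)
print(flagged)  # [0]
```

In the toy data, case 0 is missed by 60% of trainees and is flagged, whereas case 1 is missed by exactly 40% and is not, because the criterion is "more than 40%".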
An incorrect color score was a common mistake in tests 1 and 2. The cases in test 2 had the most difficult points. In test 5, the main points of difficulty were distinguishing classic benign lesions from unilocular/multilocular cysts.
Discussion
The O-RADS US system is the only lexicon and classification system that encompasses six risk categories (O-RADS 0-5), incorporating the range of normal to high risk of malignancy [1,10]. The system provides the necessary lexicon descriptors for AM malignancy risk stratification. Understanding the lexicon descriptors is the key to reaching an accurate and consistent interpretation for doctors at different levels [9,10]. The application needs to be tested through extensive clinical practice [11]. In this study, for the first time, the authors explored the difficult points for junior doctors in practice by drawing a learning curve.
The most common difficulty in the training was the subjective color flow grading in the system. The color flow score appears to provide a quantitative assessment of the blood flow of AMs, but it is a subjective parameter, with grades of minimal, moderate, and strong flow [9,10]. In addition, "0" is generally used to represent "nothing," whereas the color score runs from 1 to 4, so the absence of flow is scored as 1 rather than 0. More practice is needed to overcome this ingrained habit when using the color score system. In this study, this difficult point disappeared after three tests involving practice with 60 cases.
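The off-by-one habit described here can be made concrete with a small Python sketch of the 1-4 color score. The enumeration and member names are illustrative assumptions, and the mapping of a score of 1 to the absence of flow reflects the O-RADS lexicon as commonly described rather than a definition given in this article.

```python
from enum import IntEnum


class ColorScore(IntEnum):
    """Color score on the 1-4 scale; member names are illustrative."""
    NO_FLOW = 1
    MINIMAL_FLOW = 2
    MODERATE_FLOW = 3
    VERY_STRONG_FLOW = 4


def parse_color_score(value: int) -> ColorScore:
    """Convert an integer to a ColorScore, rejecting values outside 1-4."""
    try:
        return ColorScore(value)
    except ValueError:
        raise ValueError(
            f"color score must be 1-4 (absence of flow is 1, not 0); got {value}"
        ) from None


print(parse_color_score(1).name)  # NO_FLOW
```

A reader who enters 0 for "no flow" gets an explicit error, which mirrors the kind of mistake that trainees had to unlearn.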
In the O-RADS US system, there are many detailed lexicon descriptors of cystic masses and their walls [10]. These terms were not easy to apply in the initial practice. After four tests (practice with 80 cases), wrong answers became rare for distinguishing between a simple cyst and a non-simple unilocular cyst or multilocular cyst.
The definition of a solid-appearing mass is that "the lesion should be at least 80% solid when assessed subjectively in perpendicular two-dimensional planes" [9]. This lexicon descriptor was poorly understood and applied in training. Incorrect answers were present in all five tests.
In the final test, some trainees still could not correctly distinguish classic benign lesions from other unilocular and multilocular cystic tumors. The lexicon term "classic benign lesions" covers multiple kinds of lesions [10]. Each kind of lesion has varied shapes and echo patterns, which are difficult to cover fully with the images in the references. An endometriotic cyst or hydrosalpinx may resemble a unilocular or multilocular cyst with a smooth inner wall. For less experienced doctors, the content of "classic benign lesions" was not as clear as that of other lexicon items. Improving the recognition of classic benign lesions may require extensive practice over a long period, and a marked rise in the learning curve for this task may take considerable time.
In summary, when applying a lexicon descriptor required subjective judgment, it was a difficult point for junior doctors. More experience was needed to improve the understanding of these lexicon descriptors, such as the color score, solid-appearing masses, and classic benign lesions.
This study has several limitations. First, a test of 20 cases was not enough to cover all the lexicon descriptors in the O-RADS US system, and because the cases were randomly selected, the difficult points varied from test to test. Second, no remarkable improvement was observed in the learning curve with five tests. More training data or a more effective training modality is needed for future studies. Third, the effect of experience could not be fully evaluated because no senior doctors attended the training. A large study of interobserver variability would be needed to validate the use of the system by experts as well as less experienced observers [11,18-20].
In conclusion, after training, junior doctors at different levels can reach consistent O-RADS US risk stratification. The difficulties focused on the subjective judgment of classic benign lesions, the color flow score, and solid-appearing masses. More experience is needed to improve doctors’ understanding of the system and ability to apply it in real-world circumstances.
Notes
Author Contributions
Conceptualization: Wen L, Zhao B, Liu M. Data acquisition: Zhou S, Guo Y, Wen L, Zhao B. Data analysis or interpretation: Zhou S. Drafting of the manuscript: Zhou S, Guo Y. Critical revision of the manuscript: Wen L, Zhao B, Liu M. Approval of the final version of the manuscript: all authors.
No potential conflict of interest relevant to this article was reported.
References
Key points
After training, junior doctors at different levels can reach consistent O-RADS ultrasound risk stratification. The difficult points focused on the subjective judgment of classic benign lesions, the color flow score, and solid-appearing masses. More experience is needed to apply the system more effectively.