A comparison of the diagnostic performance of the O-RADS, RMI4, IOTA LR2, and IOTA SR systems by senior and junior doctors
Abstract
Purpose
This study compared the diagnostic performance of the Ovarian-Adnexal Reporting and Data System (O-RADS), the Risk of Malignancy Index 4 (RMI4), the International Ovarian Tumor Analysis Logistic Regression Model 2 (IOTA LR2), and the IOTA Simple Rules (IOTA SR) in predicting the malignancy of adnexal masses (AMs).
Methods
This retrospective study included 575 women with AMs between 2017 and 2020. Clinical data, ultrasound images, and pathological findings were collected for all patients. Two senior doctors (group I) and two junior doctors (group II) used the four systems to classify the AMs. The postoperative pathological diagnosis served as the gold standard for evaluating diagnostic efficiency. Receiver operating characteristic curves were used to assess diagnostic performance. Agreement between the two groups was tested using kappa values.
Results
Of all 592 AMs, 447 (75.5%) were benign, 123 (20.8%) were malignant, and 22 (3.7%) were borderline. The intergroup consistency test yielded kappa values of 0.71, 0.92, 0.68, and 0.77 for the O-RADS, RMI4, IOTA LR2, and IOTA SR, respectively. To predict malignant lesions, the areas under the curve of the O-RADS, RMI4, IOTA LR2, and IOTA SR systems were 0.90, 0.89, 0.90, and 0.86 for group I and 0.89, 0.87, 0.88, and 0.84 for group II, respectively. The O-RADS had the highest sensitivity (91.0% in group I and 84.8% in group II).
Conclusion
The four diagnostic systems could compensate for junior doctors’ inexperience in predicting malignant adnexal lesions. The O-RADS performed best and showed the highest sensitivity.
Introduction
Pelvic ultrasonography is the most widely recognized noninvasive examination for adnexal masses. It is a low-cost and accessible modality, but one that depends heavily on the examiner's experience. To improve clinical management and surgical strategy through accurate prediction of the malignancy of adnexal masses, many guidelines, grading systems, and prediction models have been developed [1,2].
The International Ovarian Tumor Analysis (IOTA) working group is an ongoing, large-sample, multicenter study team on adnexal lesions. The IOTA Logistic Regression Model 2 (IOTA LR2) and the IOTA Simple Rules (IOTA SR), proposed between 2002 and 2007, are both products of prospective IOTA studies based on large samples [3]. The IOTA SR can compensate for junior doctors’ lack of experience, reduce false negatives, and increase true positives. The high sensitivity of the IOTA LR2 in particular ensures that as many patients with malignant lesions as possible are identified [4-11]. The Risk of Malignancy Index (RMI) system evolved through the RMI1 to RMI3 versions, and the RMI4 system was proposed by Yamamoto et al. in 2009 [12-16]. It combines menopausal status, ultrasound results, and serum cancer antigen 125 (CA125) levels to provide a simple standard for assessing adnexal masses. The RMI4 score is calculated using the formula RMI = U × M × S × CA125, where U is the ultrasound score, M is the menopausal score, S is the tumor size score, and CA125 is the absolute serum CA125 level. The Ovarian-Adnexal Reporting and Data System (O-RADS) was published by the American College of Radiology in 2020. This system includes the O-RADS ultrasound lexicon, risk categories, clinical management, and malignancy risk to reduce ambiguity in the description of lesions in ultrasound reports and to provide corresponding management approaches for patients with different risk grades [17,18].
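As a minimal sketch, the RMI4 formula above can be expressed as a short Python function. The component scores U, M, and S are taken as inputs because their scoring tables are not given in this article; the function name and the example values are hypothetical and serve only as an illustration.

```python
def rmi4_score(u: float, m: float, s: float, ca125: float) -> float:
    """Return RMI4 = U x M x S x CA125.

    u, m, and s are the ultrasound, menopausal, and tumor size scores;
    ca125 is the absolute serum CA125 level. How u, m, and s are
    assigned is not specified here and must come from the RMI4
    definition itself.
    """
    if min(u, m, s) <= 0 or ca125 < 0:
        raise ValueError("scores must be positive and CA125 non-negative")
    return u * m * s * ca125


# Hypothetical illustrative inputs, not patient data from this study.
example_score = rmi4_score(u=1, m=1, s=2, ca125=35.0)
print(example_score)
```

Because the score is a simple product, a rise in any single component scales the result proportionally, which is why the CA125 level can dominate the index.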
All these prediction systems were established and tested based on data from European populations. These systems need to be more widely validated in practice with various ethnic populations [1,2]. This study aimed to compare the diagnostic efficiency and reliability among the O-RADS, RMI4, IOTA LR2, and IOTA SR for predicting benign and malignant ovarian tumors by senior and junior doctors.
Materials and Methods
Compliance with Ethical Standards
The study was approved by the Human Research Ethics Committee of the Second Xiangya Hospital (No. 2021-038) and performed in accordance with the principles of the Declaration of Helsinki. The requirement for written informed consent was waived.
Study Sample
Data from 575 women who underwent gynecological surgery with preoperative ultrasound examinations and postoperative histological diagnoses of adnexal masses in the Second Xiangya Hospital between January 2017 and October 2020 were retrospectively analyzed. The complete medical records of all patients were obtained, including age, menopausal status, gynecological examination, tumor markers, operation methods, postoperative pathology, and follow-up. Women over the age of 50 who had undergone hysterectomy or lacked records related to menopause were considered postmenopausal.
The inclusion criteria were (1) an interval of less than 30 days between ultrasonography and gynecological surgery and (2) a definite pathological diagnosis.
The exclusion criteria were (1) pregnancy, (2) images of poor quality or without diagnostic signs, and (3) unclear menopausal status and no test for CA125 levels.
Instruments and Image Analysis
Ultrasound diagnostic systems with 9-15 MHz intracavitary transducers were used for the ultrasound examinations, which were performed by doctors at the attending level and above. Transabdominal ultrasonography was performed if the mass was too large to be observed using transvaginal ultrasonography. All images were stored in and retrieved from the ultrasound workstation. The ultrasound characteristics of each mass were assessed. The recorded features included laterality (unilateral or bilateral), cystic component, morphology and margins, cyst wall thickness, acoustic shadowing, maximum diameter (of the tumor and of the solid part), solid papillary protrusions, septations, ascites, peritoneal nodules, and color Doppler score.
Senior doctors (L.W. and B.Z.; group I), with more than 10 years of experience in ultrasound diagnosis, and junior doctors (Y.G. and S.Z.; group II), with 1 year of experience in ultrasound diagnosis and 300 adnexal tumors diagnosed in practice, received theoretical and practical training on the four systems. After training, the readers showed good to excellent agreement when applying the O-RADS, RMI4, IOTA LR2, and IOTA SR systems. A series of 40 adnexal masses was randomly selected for a test-retest analysis. In group I, for the O-RADS, RMI4, IOTA LR2, and IOTA SR systems, the intra-reader agreement tests yielded kappa values of 0.92 (95% confidence interval [CI], 0.78 to 1.00), 0.94 (95% CI, 0.83 to 1.00), 0.80 (95% CI, 0.59 to 1.00), and 0.71 (95% CI, 0.48 to 0.95), respectively; for group II, the kappa values were 0.93 (95% CI, 0.81 to 1.00), 0.82 (95% CI, 0.63 to 1.00), 0.87 (95% CI, 0.70 to 1.00), and 0.88 (95% CI, 0.72 to 1.00), respectively.
Blinded to the clinical information and pathological results, the two groups analyzed the images and evaluated each mass using the four systems. In each group, the two doctors worked together to analyze the images. Lesions with O-RADS grades of 1-3 were classified as benign tumors, and lesions with O-RADS grades of 4-5 were classified as malignant tumors. The cutoff value of the RMI4 system was a score of 450. The cutoff value for the IOTA LR2 was a malignancy risk of 10%. For the IOTA SR model, a mass with at least one malignant feature and no benign features was considered a malignant tumor, and a mass with only benign features was considered a benign lesion [3,13,17]. For inconclusive cases in the IOTA SR, in which benign and malignant features were either both present or both absent, the doctors' subjective judgments were used as the outcomes [3,12,13]. All results were compared with the histological diagnosis, which was classified according to the International Federation of Gynecology and Obstetrics criteria [19], and borderline masses were classified as malignant (Fig. 1).
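The dichotomization rules described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the procedure the readers followed: the function names are hypothetical, and the text does not say how a value exactly at a cutoff was handled, so the sketch assumes that such values are called malignant.

```python
from typing import Optional


def orads_call(grade: int) -> str:
    """O-RADS grades 1-3 are called benign and grades 4-5 malignant."""
    if grade not in (1, 2, 3, 4, 5):
        raise ValueError("expected an O-RADS grade from 1 to 5")
    return "malignant" if grade >= 4 else "benign"


def rmi4_call(score: float, cutoff: float = 450.0) -> str:
    # Assumption: a score exactly at the cutoff is called malignant.
    return "malignant" if score >= cutoff else "benign"


def lr2_call(risk: float, cutoff: float = 0.10) -> str:
    # risk is the predicted probability of malignancy, from 0 to 1.
    # Assumption: a risk exactly at the cutoff is called malignant.
    if not 0.0 <= risk <= 1.0:
        raise ValueError("risk must lie between 0 and 1")
    return "malignant" if risk >= cutoff else "benign"


def iota_sr_call(n_malignant: int, n_benign: int,
                 reader_call: Optional[str] = None) -> str:
    """Apply the IOTA SR decision logic to counts of observed features."""
    if n_malignant >= 1 and n_benign == 0:
        return "malignant"
    if n_benign >= 1 and n_malignant == 0:
        return "benign"
    # Both or neither feature type present: the rules are inconclusive,
    # so fall back on the reader's subjective judgment if one is given.
    return reader_call if reader_call is not None else "inconclusive"


print(orads_call(4), rmi4_call(120.0), lr2_call(0.25), iota_sr_call(1, 1))
```

Keeping the inconclusive branch explicit makes it easy to count how many masses each group had to resolve by subjective judgment.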
Statistical Analysis
Statistical analysis was performed using SPSS ver. 26.0 (IBM Corp., Armonk, NY, USA) and GraphPad Prism 6.0 (GraphPad Software Inc., San Diego, CA, USA). Categorical variables were compared using the chi-square test or the Fisher exact test. Continuous variables were compared using the independent-sample t-test or the rank-sum test. Receiver operating characteristic (ROC) curves were drawn to test the diagnostic performance of the four ultrasound classification systems in the two groups. The sensitivity, specificity, negative predictive value, positive predictive value, and Youden index were calculated. The kappa coefficient was used to assess intergroup agreement: κ≥0.75 was considered to indicate high repeatability, 0.40≤κ<0.75 medium repeatability, and κ<0.40 low repeatability. A P-value <0.05 was considered statistically significant.
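For readers who want to see how the agreement statistic works, the following is a minimal Python sketch of Cohen's kappa for two sets of categorical calls, together with the repeatability bands defined above. The study itself used SPSS; the function names and the ten example calls below are hypothetical and are not study data.

```python
from typing import Sequence


def cohen_kappa(calls_a: Sequence[int], calls_b: Sequence[int]) -> float:
    """Cohen's kappa: (observed - expected agreement) / (1 - expected)."""
    a, b = list(calls_a), list(calls_b)
    if len(a) != len(b) or not a:
        raise ValueError("both raters need the same non-zero number of calls")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum((a.count(lab) / n) * (b.count(lab) / n)
                   for lab in set(a) | set(b))
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def repeatability(kappa: float) -> str:
    """Map kappa to the bands used in this study."""
    if kappa >= 0.75:
        return "high"
    if kappa >= 0.40:
        return "medium"
    return "low"


# Hypothetical calls (1 = malignant, 0 = benign) for ten masses.
group_one = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
group_two = [1, 1, 0, 0, 1, 0, 0, 0, 0, 1]
k = cohen_kappa(group_one, group_two)
print(round(k, 3), repeatability(k))
```

Kappa corrects raw percentage agreement for the agreement expected by chance, which matters here because most masses are benign and two readers would often agree on a benign call even at random.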
Results
The mean age of the 575 women was 39.0±14.6 years (range, 6 to 81 years). Of the entire sample, 446 women (77.6%) were premenopausal, and 129 (22.4%) were postmenopausal. One hundred and eighteen (30.5%) women had bilateral lesions. One hundred and eighty-five (32.2%) women had elevated CA125 levels, including 135 (23.5%) premenopausal women and 50 (8.7%) postmenopausal women (Table 1). Eight women (1.4%) had undergone a hysterectomy.
With the inclusion of 17 (3.0%) bilateral adnexal lesions, data on a total of 592 adnexal masses were collected. These numbers included 447 (75.5%) benign, 123 (20.8%) malignant, and 22 (3.7%) borderline tumors, confirmed by postoperative pathologic diagnoses. An analysis of the histological findings showed that the most frequent benign tumor was mature teratoma, while the most common malignant tumor was serous adenocarcinoma. The details are shown in Table 2.
The detailed outcomes of the two groups regarding the classification of the 592 adnexal masses using the four systems are shown in Fig. 2. Using the O-RADS system, groups I and II classified 379 (64.0%) and 388 (65.5%) cases as benign tumors and 213 (36.0%) and 204 (34.5%) cases as malignant tumors, respectively. The malignancy rates of O-RADS grades 1 to 5 among the 592 adnexal masses were 0% (0/1), 2.0% (6/299), 8.9% (7/79), 52.3% (80/153), and 86.7% (52/60) in group I and 0% (0/4), 2.8% (8/283), 13.9% (14/101), 42.9% (48/112), and 81.5% (75/92) in group II, respectively. Using the RMI4 system, groups I and II classified 483 (81.6%) and 491 (82.9%) cases as benign masses and 109 (18.4%) and 101 (17.1%) cases as malignant masses, respectively. Using the IOTA LR2 system, groups I and II classified 426 (72.0%) and 397 (67.1%) cases as benign and 166 (28.0%) and 195 (32.9%) cases as malignant, respectively. Using the IOTA SR, groups I and II classified 442 (74.7%) and 410 (69.3%) cases as benign tumors and 150 (25.3%) and 182 (30.7%) cases as malignant tumors, respectively. In addition, under the IOTA SR system, groups I and II classified 48 (8.1%) and 79 (13.3%) cases as indeterminate lesions, respectively. The malignancy rates of the IOTA SR benign, malignant, and indeterminate groups were 5.8% (24/417), 78.0% (99/127), and 45.8% (22/48) in group I and 5.6% (20/355), 66.5% (105/158), and 25.3% (20/79) in group II, respectively.
The ROC curves for each system in the two groups are shown in Fig. 3. The O-RADS had the highest area under the curve (AUC), with 0.90 in group I and 0.89 in group II. The IOTA SR had the lowest AUC, with 0.86 in group I and 0.84 in group II (Table 3). The sensitivity, specificity, Youden index, positive predictive value (PPV), and negative predictive value for the four systems are presented in Table 3. Of the four systems, the O-RADS had the highest Youden index (0.73 in group I) and the highest sensitivity (0.91 in group I and 0.85 in group II).
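As a worked check, the O-RADS sensitivity and Youden index can be recomputed from the per-grade counts reported above, with grades 4-5 called malignant and borderline tumors counted as malignant. The Python sketch below uses hypothetical function names; only the counts come from this article.

```python
def counts_from_orads(malignant_by_grade, total_by_grade):
    """Build TP, FP, FN, TN from per-grade counts (index 0 = grade 1)."""
    tp = sum(malignant_by_grade[3:])       # malignant masses graded 4-5
    fn = sum(malignant_by_grade[:3])       # malignant masses graded 1-3
    fp = sum(total_by_grade[3:]) - tp      # benign masses graded 4-5
    tn = sum(total_by_grade[:3]) - fn      # benign masses graded 1-3
    return tp, fp, fn, tn


def diagnostic_metrics(tp, fp, fn, tn):
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "youden": sensitivity + specificity - 1.0,
    }


# Per-grade counts for O-RADS grades 1 to 5 as reported in the Results.
group_one = diagnostic_metrics(
    *counts_from_orads([0, 6, 7, 80, 52], [1, 299, 79, 153, 60]))
group_two = diagnostic_metrics(
    *counts_from_orads([0, 8, 14, 48, 75], [4, 283, 101, 112, 92]))

for name, m in (("group I", group_one), ("group II", group_two)):
    print(name, {key: round(value, 3) for key, value in m.items()})
```

For group I this gives 132 true positives among 145 malignant or borderline masses, which matches the reported sensitivity of 91.0% and Youden index of 0.73; for group II it gives 123 of 145, matching the reported 84.8%.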
The two groups had moderate agreement (κ=0.71 and κ=0.68, respectively) in using the O-RADS and IOTA LR2 systems and high agreement (κ=0.92 and κ=0.77, respectively) for the RMI4 and IOTA SR systems (Table 3).
Discussion
Many ultrasound systems for diagnosing adnexal masses have been introduced internationally, some of which have undergone prospective or retrospective external validation [20-25]. This study focused on comparing the recently proposed O-RADS with other validated classification systems. In the present study, the good to excellent inter-reader agreement in each group ensured that the systems were understood and applied consistently and controlled for confounding caused by different interpretations of the terms. The large sample of 592 adnexal masses supports the reliability of the results.
The results showed that all four systems had excellent AUCs for the diagnosis of adnexal masses. The RMI4 system had the simplest ultrasound rules, with excellent intergroup agreement (κ=0.92). It had a high AUC, but the lowest Youden index. The CA125 level is one of the strongest indicators of malignancy in the RMI4 system, but it may also be elevated in women with ovarian endometriotic cysts or pelvic inflammatory disease. An ovarian endometriotic cyst or inflammatory mass can therefore have a high RMI score and be misdiagnosed as a malignant mass. Hence, at the proposed cutoff score of 450, the diagnostic efficiency is likely to be affected by the composition of the sample. Since ovarian endometriotic cysts and inflammatory masses accounted for 19.4% of the sample, it was not surprising that the RMI4 system had the lowest sensitivity but the highest specificity in this study.
The malignancy rates of the IOTA SR benign, malignant, and indeterminate groups were consistent with the recommended values in the previous literature [2]. However, the malignancy rate in the indeterminate group was as high as 45.8% in group I and 25.3% in group II, suggesting that inexperienced doctors might not correctly diagnose these tumors without assistance from other diagnostic models or a consultation. The indeterminate group is the most obvious shortcoming of the IOTA SR. Although the IOTA LR2 could be applied to classify all adnexal masses, its sensitivity in detecting malignant masses was not greatly improved.
The O-RADS system had the highest AUC and Youden index. At the cost of decreased specificity, its detailed explanations of characteristics and descriptions of benign and malignant lesions ensured the highest sensitivity in detecting malignant masses. In contrast, the simple diagnostic indices used in the RMI4, IOTA SR, and IOTA LR2 systems tended to misclassify tumors without typical malignant features. The O-RADS could identify as many truly malignant lesions as possible, reducing the severe consequences of missed diagnoses. This advantage corresponds to an important capability of a model for predicting malignant tumors, because detecting a possibly malignant lesion is the first step in managing patients with adnexal masses. The O-RADS also provides management recommendations, advising subsequent examinations and clinical measures for patients with suspected malignant lesions [17]. A magnetic resonance imaging examination or ultrasound expert consultation should be arranged for these patients, and patients with a high suspicion of malignancy should be referred to a gynecologic oncologist and treated in a timely manner. In contrast, the other three diagnostic models do not provide corresponding management measures to identify false-negative patients.
The finding of good intergroup agreement showed that the four diagnostic systems could compensate for junior doctors’ inexperience to some extent. However, the intergroup agreement values for the O-RADS and IOTA systems were much lower than those for the RMI4 system, which included the simplest image parameters. Experience is needed to ensure a better understanding and application of the detailed definitions of diagnostic signs in practice. Artificial intelligence may help to resolve the issue of a long learning curve.
This study has several limitations. First, because the study was retrospective, dynamic images were not available to evaluate each adnexal mass sufficiently, which may have led to misjudgments of certain ultrasound features. Second, the low malignancy rate (24.5%) in the present study sample may account for the lower specificity and PPV of the O-RADS. In previous studies, the malignancy rate was 27.5% to 28.8% [2,26]. Third, prospective studies are needed to further test the performance of the management recommendations.
In conclusion, to a certain extent, all four diagnostic systems could compensate for junior doctors’ inexperience in the diagnosis of adnexal masses. The O-RADS performed best and had the highest sensitivity for detecting malignant lesions. It may make sense to use the O-RADS for clinical diagnosis and therapy.
Notes
Author Contributions
Conceptualization: Zhao B, Wen L, Liu M. Data acquisition: Guo Y, Zhao B, Zhou S, Liu J, Fu Y, Xu F. Data analysis or interpretation: Guo Y, Zhao B, Zhou S, Wen L. Drafting of the manuscript: Guo Y, Zhao B, Zhou S, Liu J, Fu Y, Xu F. Critical revision of the manuscript: Guo Y, Zhao B, Wen L, Liu M. Approval of the final version of the manuscript: all authors.
No potential conflict of interest relevant to this article was reported.
Acknowledgements
The authors thank Professor Jiang Ouyang from the Department of Public Health, Changsha Medical College, for statistical guidance.
References
Key point
This is the first comparison of the diagnostic performance of the Ovarian-Adnexal Reporting and Data System (O-RADS), Risk of Malignancy Index 4 (RMI4), International Ovarian Tumor Analysis Logistic Regression Model 2 (IOTA LR2), and IOTA Simple Rules (IOTA SR) systems in a large sample from an Asian population. The diagnostic efficiency and reliability of the four systems could compensate for junior doctors’ inexperience in predicting the malignancy of adnexal masses. It may make more sense to evaluate and improve these ultrasound prediction models for clinical management and surgical strategy.