Palazuelos, Valencia, and Romero: More than interobserver agreement is required for comparisons of categorization systems
We read with interest the article by Choi et al. [1], titled "Interobserver agreement in breast ultrasound categorization in the Mammography and Ultrasonography Study for Breast Cancer Screening Effectiveness (MUST-BE) trial: results of a preliminary study," in the latest issue of Ultrasonography. The article evaluated the interobserver agreement of the modified categorization system established by the Alliance for Breast Cancer Screening in Korea (ABCS-K) and compared the results with the Breast Imaging Reporting and Data System (BI-RADS) categorization. Because the data presented are preliminary, we would like to clarify some points.
The authors used the kappa statistic to evaluate interobserver concordance, but they did not present a frequency table by category for each categorization system. The kappa statistic is sensitive to the prevalence of the condition being assessed: when one category strongly predominates, kappa can be low even though observed agreement is high. This is known as the kappa paradox, and when its presence is suspected, other statistics, such as raw percent agreement, can be used to determine levels of concordance [2].
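As a minimal illustration of this behavior (using invented counts, not data from the cited studies), the following sketch computes Cohen's kappa and raw percent agreement for a hypothetical 2x2 agreement table in which one category strongly predominates:

```python
# Minimal sketch of the kappa paradox, using an invented 2x2 agreement table
# with highly imbalanced category prevalence (not data from the cited studies).

def cohen_kappa(table):
    """Cohen's kappa and observed agreement for a square inter-rater table."""
    n = sum(sum(row) for row in table)
    observed = sum(table[i][i] for i in range(len(table))) / n
    expected = sum(
        (sum(table[i]) / n) * (sum(row[i] for row in table) / n)
        for i in range(len(table))
    )
    return (observed - expected) / (1 - expected), observed

# Hypothetical readings of two observers: rows = reader A, columns = reader B,
# categories = (benign, suspicious); suspicious lesions are rare.
skewed = [[94, 2],
          [3, 1]]
kappa, agreement = cohen_kappa(skewed)
print(f"percent agreement = {agreement:.2f}, kappa = {kappa:.2f}")
# Prints 0.95 agreement but kappa of only 0.26: high raw agreement can coexist
# with a low kappa when one category dominates, hence the case for reporting
# complementary statistics [2].
```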
It is interesting to see the good interobserver concordance of the re-modified ABCS-K categorization, but the interobserver concordance of the BI-RADS categorization differs from previous reports (κ-value of 0.495 vs. 0.51-0.53) [3,4], especially for BI-RADS category 5 (κ-value of 0.45 vs. 0.71) [1,4]. The authors should determine why these differences in BI-RADS concordance occurred and should consider the possibility that the discrepancies were due to the radiologists' expertise. It would also be interesting to see a table comparing the κ-values of the BI-RADS categorization by the radiologists' years of experience.
Another point worth discussing is the methodology of the ABCS-K system, which assigns categories according to major and minor findings, in contrast to BI-RADS, which relies on the positive predictive value (PPV) of each finding; this difference can be meaningful, especially for subcategories 4a, 4b, and 4c. Some minor findings in ABCS-K have previously been shown to have a high PPV, such as calcification within the mass (PPV, 84.6%-100%), an echogenic halo (PPV, 66.7%), and an angular margin (PPV, 60%) [5]. For this reason, it is essential to compare the diagnostic performance of the ABCS-K categorization with that of BI-RADS. Although concordance is important when selecting a categorization system, diagnostic performance is an essential factor affecting its suitability for clinical use. For example, the BI-RADS categorization has shown good diagnostic performance, with an area under the receiver operating characteristic curve of 0.708 for the fourth edition and 0.690 for the fifth edition [5].
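As a rough illustration of how these two kinds of evidence are obtained, the sketch below (using invented lesion-level data, not values from the cited studies, and assuming scikit-learn is available) computes the PPV of a single finding and the area under the curve of an ordinal categorization against a pathology reference standard:

```python
# Minimal sketch with invented data: PPV of one US finding and AUC of an
# ordinal BI-RADS-style categorization. Values do not come from the cited studies.
from sklearn.metrics import roc_auc_score

# Hypothetical lesions: ordinal assessment score, presence of one "minor"
# finding, and pathology outcome (1 = cancer).
categories = [3, 3, 4, 4, 4, 5, 5, 3, 4, 5]      # ordinal assessment scores
calcification = [0, 0, 1, 1, 0, 1, 1, 0, 0, 1]   # finding present / absent
cancer = [0, 0, 1, 0, 1, 1, 1, 0, 1, 1]          # reference standard

# PPV of the finding: proportion of cancers among lesions showing the finding.
positives = [c for f, c in zip(calcification, cancer) if f == 1]
ppv = sum(positives) / len(positives)

# AUC of the ordinal categorization against the reference standard.
auc = roc_auc_score(cancer, categories)
print(f"PPV of calcification in mass = {ppv:.2f}, AUC of categorization = {auc:.2f}")
```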

Conflict of Interest

No potential conflict of interest relevant to this article was reported.

References

1. Choi EJ, Lee EH, Kim YM, Chang YW, Lee JH, Park YM, et al. Interobserver agreement in breast ultrasound categorization in the Mammography and Ultrasonography Study for Breast Cancer Screening Effectiveness (MUST-BE) trial: results of a preliminary study. Ultrasonography 2019;38:172–180.
2. Shankar V, Bangdiwala SI. Observer agreement paradoxes in 2x2 tables: comparison of agreement measures. BMC Med Res Methodol 2014;14:100.
3. Berg WA, Blume JD, Cormack JB, Mendelson EB. Operator dependence of physician-performed whole-breast US: lesion detection and characterization. Radiology 2006;241:355–365.
4. Lee HJ, Kim EK, Kim MJ, Youk JH, Lee JY, Kang DR, et al. Observer variability of Breast Imaging Reporting and Data System (BI-RADS) for breast ultrasound. Eur J Radiol 2008;65:293–298.
5. Yoon JH, Kim MJ, Lee HS, Kim SH, Youk JH, Jeong SH, et al. Validation of the fifth edition BI-RADS ultrasound lexicon with comparison of fourth and fifth edition diagnostic performance using video clips. Ultrasonography 2016;35:318–326.
We thank you for your interest in and comments on our article, titled "Interobserver agreement in breast ultrasound categorization in the Mammography and Ultrasonography Study for Breast Cancer Screening Effectiveness (MUST-BE) trial: results of a preliminary study."
First, using the initially modified categorization, there were 63 benign and 62 suspicious lesions on ultrasonography (US), and 81 benign lesions and 44 breast cancers in the final results. In contrast, using the re-modified categorization, there were 43 benign and 57 suspicious lesions on US, and 54 benign lesions and 46 breast cancers in the final results.
As you mentioned, the kappa statistic is subject to limitations based on the prevalence of a condition [1]. As stated in the Materials and Methods, the proportion of breast cancers in the test series of this article was not low: 35.2% (44 of 125) using the initially modified categorization and 46.0% (46 of 100) using the re-modified categorization. Therefore, applying the kappa statistic to evaluate interobserver agreement for ultrasound screening in this article is acceptable. In contrast, the prevalence of breast cancers in the test series for screening mammography in the MUST-BE trial was low (1.2%) [2]. Therefore, to avoid the kappa paradox, we applied percent agreement as well as the kappa statistic when evaluating interobserver agreement for mammography, which was done as part of a quality control program in the trial.
Although most radiologists participating in the MUST-BE trial were experienced in breast imaging (mean, 10.1 years) in an academic setting, the kappa values reported in this article were lower than those of other studies [3,4]. Our results might have been influenced by the larger number of cases and observers than in other studies [3,4], because the kappa statistic depends on the number of categories and observers, and its value is generally higher when there are fewer of both [1]. Despite the lower interobserver agreement using the Breast Imaging Reporting and Data System (BI-RADS) categorization in this article, we believe it is acceptable for real-world clinical practice because the interobserver agreement for dichotomous categories (whether or not to biopsy) was moderate and similar to that of other studies (Table 6 in the manuscript).
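As an illustration of how agreement typically behaves when the number of categories is reduced, the following sketch (with invented ratings, not trial data, and assuming numpy and statsmodels are available) computes Fleiss' kappa across several readers for a full category scale and again after collapsing to a dichotomous biopsy decision:

```python
# Minimal sketch with invented ratings: multi-reader agreement with Fleiss'
# kappa on a full category scale vs. a dichotomized "biopsy or not" decision.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical ratings: rows = lesions, columns = readers; categories coded
# 2, 3, 4, 5 (BI-RADS-style final assessments).
ratings = np.array([
    [3, 3, 4, 3],
    [4, 4, 4, 5],
    [2, 3, 3, 3],
    [5, 5, 4, 5],
    [4, 3, 4, 4],
    [2, 2, 3, 2],
])

full_table, _ = aggregate_raters(ratings)
kappa_full = fleiss_kappa(full_table, method='fleiss')

# Dichotomize: category >= 4 is treated as "recommend biopsy".
biopsy = (ratings >= 4).astype(int)
dich_table, _ = aggregate_raters(biopsy)
kappa_dich = fleiss_kappa(dich_table, method='fleiss')

print(f"Fleiss' kappa, full scale = {kappa_full:.2f}; dichotomous = {kappa_dich:.2f}")
# With these invented ratings, agreement rises from about 0.31 to 0.67 after
# dichotomization, consistent with the general point that kappa tends to be
# higher when fewer categories are used.
```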
Regarding suspicious findings, some minor findings, including calcification within the mass and an angular margin, are known to have high positive predictive values. Based on previous studies [5], we segregated the suspicious findings into major and minor findings to distinguish category 4 from category 5 lesions, with the goal of achieving both high reproducibility and convenience. However, we did not achieve acceptable interobserver agreement for category 4 subcategorization using the modified categorization system. Therefore, we decided not to apply these criteria for the subcategorization of category 4 in the MUST-BE trial. Instead, we will perform a further analysis to classify the major and minor findings for the subcategorization of category 4 after completion of a research database that includes information about patients' breast cancer diagnoses.
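For readers unfamiliar with rule-based subcategorization, the sketch below illustrates the general idea of assigning a category from counts of major and minor findings. The thresholds are invented for illustration only and do not represent the ABCS-K criteria, which are not specified here:

```python
# Hypothetical sketch only: the thresholds below are invented to illustrate a
# rule-based major/minor-finding approach and are not the ABCS-K criteria.
def assign_category(n_major: int, n_minor: int) -> str:
    """Assign a US assessment category from counts of suspicious findings."""
    if n_major >= 2:
        return "5"        # multiple major findings: highly suggestive
    if n_major == 1:
        return "4c" if n_minor >= 1 else "4b"
    if n_minor >= 2:
        return "4b"
    if n_minor == 1:
        return "4a"
    return "3"            # no suspicious findings: probably benign

# Example under this invented rule: one major finding plus one minor finding
# (e.g., calcification in the mass) maps to category 4c.
print(assign_category(n_major=1, n_minor=1))
```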

References

1. Watson PF, Petrie A. Method agreement analysis: a review of correct methodology. Theriogenology 2010;73:1167–1179.
2. Kim SH, Lee EH, Jun JK, Kim YM, Chang YW, Lee JH, et al. Interpretive performance and inter-observer agreement on digital mammography test sets. Korean J Radiol 2019;20:218–224.
3. Berg WA, Blume JD, Cormack JB, Mendelson EB. Operator dependence of physician-performed whole-breast US: lesion detection and characterization. Radiology 2006;241:355–365.
4. Lee HJ, Kim EK, Kim MJ, Youk JH, Lee JY, Kang DR, et al. Observer variability of Breast Imaging Reporting and Data System (BI-RADS) for breast ultrasound. Eur J Radiol 2008;65:293–298.
5. Kim EK, Ko KH, Oh KK, Kwak JY, You JK, Kim MJ, et al. Clinical application of the BI-RADS final assessment to breast sonography in conjunction with mammography. AJR Am J Roentgenol 2008;190:1209–1215.