Comparative analysis of proficiencies of various textures and geometric features in breast mass classification using k-nearest neighbor

Singh, Harmandeep; Sharma, Vipul; Singh, Damanpreet

doi:10.1186/s42492-021-00100-1

Original Article
Open access
Published: 12 January 2022

Comparative analysis of proficiencies of various textures and geometric features in breast mass classification using k-nearest neighbor

Visual Computing for Industry, Biomedicine, and Art volume 5, Article number: 3 (2022) Cite this article

4119 Accesses
22 Citations
1 Altmetric
Metrics details

Abstract

This paper introduces a comparative analysis of the proficiencies of various textures and geometric features in the diagnosis of breast masses on mammograms. An improved machine learning-based framework was developed for this study. The proposed system was tested using 106 full field digital mammography images from the INbreast dataset, containing a total of 115 breast mass lesions. The proficiencies of individual and various combinations of computed textures and geometric features were investigated by evaluating their contributions towards attaining higher classification accuracies. Four state-of-the-art filter-based feature selection algorithms (Relief-F, Pearson correlation coefficient, neighborhood component analysis, and term variance) were employed to select the top 20 most discriminative features. The Relief-F algorithm outperformed other feature selection algorithms in terms of classification results by reporting 85.2% accuracy, 82.0% sensitivity, and 88.0% specificity. A set of nine most discriminative features were then selected, out of the earlier mentioned 20 features obtained using Relief-F, as a result of further simulations. The classification performances of six state-of-the-art machine learning classifiers, namely k-nearest neighbor (k-NN), support vector machine, decision tree, Naive Bayes, random forest, and ensemble tree, were investigated, and the obtained results revealed that the best classification results (accuracy = 90.4%, sensitivity = 92.0%, specificity = 88.0%) were obtained for the k-NN classifier with the number of neighbors having k = 5 and squared inverse distance weight. The key findings include the identification of the nine most discriminative features, that is, FD26 (Fourier Descriptor), Euler number, solidity, mean, FD14, FD13, periodicity, skewness, and contrast out of a pool of 125 texture and geometric features. The proposed results revealed that the selected nine features can be used for the classification of breast masses in mammograms.

Introduction

Breast cancer continues to be one of the deadliest diseases. It is caused by the invasion of abnormal cells across the usual boundaries due to uncontrolled growth and division [1]. According to the latest statistics, female breast cancer remains a significant hurdle, with an estimated 2.26 million new cancer cases, accounting for nearly 24.5% of the 9.22 million new cancer cases diagnosed among women in 2020. Breast cancer has surpassed lung cancer in terms of the cause of mortality among women, accounting for 15.5% of the total 4.43 million deaths in women of all age groups due to cancer [2]. Early detection of breast cancer is the only entity that can help reduce the death rate [3]. Screening using mammogram images is still considered the best, most reliable, and economical method for the detection of early signs of breast cancer. Radiologists must carefully examine mammogram images to detect abnormalities [4]. However, the success and widespread adoption of mammography has drastically increased the workload of radiologists. Due to this increased workload, even expert radiologists can miss a considerable number of abnormalities or can misinterpret abnormalities that may increase the number of false-positive and false-negative reports. To resolve these issues, computer-aided diagnosis (CAD) systems are used by radiologists as secondary readers [5]. Generally, radiologists look for four different types of abnormalities in mammogram images, namely masses, microcalcifications, architectural distortions, and asymmetric breast tissues, as early signs of breast cancer [6]. Among these, masses and microcalcifications are the most frequently occurring types of abnormalities, any other types of abnormalities are usually found in rare cases. It should also be noted that the diagnosis of masses is a more challenging task than that of micro-calcifications [3]. Moreover, successful CAD systems have already been clinically approved for the diagnosis of microcalcifications. As a result of this, CAD systems for breast masses are attracting considerable research interest.

Generally, the diagnosis of breast masses involves the detection and classification of breast masses. Masses are generally characterized by their shape, margin, and texture. Benign masses possess round and oval shapes with well-circumscribed and smooth boundaries as opposed to malignant masses, which usually possess irregular shapes with rough, ill-defined, and speculated boundaries [7]. Significant differences can also be seen between the texture of benign and malignant masses, with the former being mostly smooth and homogeneous and the latter having a heterogeneous and rough texture [8].

So far, numerous researchers in the literature have made significant contributions to the analysis of texture and geometry-based features. For instance, Mudigonda et al. [9] compared the effectiveness of two sets of features, gradient-based and texture-based, for the classification of breast masses. The best classification accuracy of 82.1% with 0.85 as an area under the receiver operating characteristics curve has been reported by using gray-level co-occurrence matrix (GLCM)-based texture features with a posterior probability-based classifier for the Mammographic Image Analysis Society (MIAS) database. Yang et al. [10] developed a two-stage CAD system for the detection and classification of breast masses. In the first stage, the statistical gray-level difference matrix and fractal dimension-based five texture features were used for the detection and extraction of breast masses using a probabilistic neural network (PNN). In the second stage, four shape features were further coupled with the previously used five texture features for classification using a PNN and achieved an accuracy of 84.1% for the mammograms taken from Taichung Veteran General Hospital. Kegelmeyer et al. [11] proposed a CAD system for the detection of speculated mass lesions by using four Laws’ texture features with a new feature responsive to stellate patterns. A sensitivity of 97.0% with 0.28 FP per image has been reported. Nandi et al. [12] employed a set of 22 features related to shape, texture, and edge sharpness for the classification of breast mass lesions using genetic programming classification techniques that implicitly possess feature selection capability. A shape-based feature called fractal concavity was the most discriminative feature among all, and the proposed system showed classification accuracies above 99.5% and 98.0% for the training and testing sets, respectively. Delogu et al. [13] developed a CAD system for the segmentation and classification of breast masses in mammograms. To extract the exact mass lesions, the first region of interest (ROI) containing the mass lesions was located by expert radiologists, and then the wavelet transform-based segmentation technique was used to separate the mass lesions from the normal tissues in the ROI. Repeated experiments were performed with various combinations of 16 shape-, size-, and intensity-based features using a multi-layered perceptron neural network classifier. The best classification results were obtained using the 12 most powerful features out of a total of 16 computed features. Domínguez and Nandi [14] conducted various experiments to explore the usefulness of a set of six mass margin characterization features extracted from simplified versions of contours. The performance of each of these features and their various combinations were evaluated using three different classifiers on a set of mammographic images taken from mini-MIAS and Digital Database for Screening Mammography (DDSM) datasets. It was found that out of all the possible sets of features, spiculation features performed the best, and most of the systems formed by using different combinations of features, datasets, and classifiers were more efficient in identifying benign masses than malignant masses. Ganesan et al. [15] presented a classification pipeline for studying the textural changes that occurred in mammogram images of cancerous breasts and further improved the classification accuracy. Features based on higher-order spectra, local binary patterns, Law’s texture energy, and discrete wavelet transform were extracted from the manually segmented mass lesions. Out of the six classifiers used, the decision tree (DT) classifier showed promising results. Sharma and Khanna [16] showed that the Zernike moment of order 20 performed better than the other texture descriptors, spatial grey-level co-occurrence matrices (SGLCM), and discrete cosine transform, with a support vector machine (SVM) classifier. The proposed system attained 99.0% sensitivity and 99.0% specificity with the image retrieval in medical applications dataset and 97.0% sensitivity and 96.0% specificity with the DDSM dataset. Liu and Tang [17] investigated the classification performance of a CAD system by employing several feature selection algorithms with an SVM classifier. A new feature selection algorithm called SVM-based recursive feature elimination with normalized mutual information feature selection has been proposed for the selection of an optimal set of features out of a total of 31 features (12-geometry and 19-texture). Experiments were carried out with 826 ROIs (408 m and 418 b) taken from the DDSM dataset. The best area under curve (AUC) values of 0.9439 and 0.9615 were achieved with the proposed feature selection technique and SVM classifier with ten-fold cross-validation and leave-one-out scheme, respectively. Kashyap et al. [18] proposed a CAD system for the diagnosis of breast masses in mammograms and their shape analysis. The fast fuzzy C-means clustering algorithm was employed for the extraction of mass lesions from pre-processed mammograms. SVM was used to classify segmented ROIs as mass or non-mass using the texture features. The proposed system was evaluated on two datasets, mini-MIAS and DDSM, and achieved the highest sensitivity, specificity, accuracy, and AUC values of 91.7%, 96.2%, 95.4%, 96.2%, and 94.6%, 92.7%, 92.0%, 95.3% respectively. Finally, shape analysis was performed by employing radon transform-based features. Lbachir et al. [19] proposed a complete CAD system for breast masses. A histogram region analysis-based k-means algorithm has been proposed for the segmentation of breast mass lesions from enhanced mammogram images. Texture and shape features were then used for false-positive reduction with the bagged trees classifier. Finally, SVM was employed for the classification of breast masses. The proposed system achieved 93.1% and 90.8% detection accuracies and 94.2% and 90.4% classification accuracies for the MIAS and the curated breast imaging subset of DDSM datasets, respectively. Hosni et al. [20] used a systematic map to examine the state-of-the-art ensemble classification methods when applied to breast cancer in terms of nine factors: publication venues, medical tasks addressed, empirical and research types used, types of ensembles proposed, single techniques used to construct the ensembles, validation framework used to evaluate the proposed ensembles, and the tools used. The goal of this study is to conduct a systematic mapping investigation of single approaches. Al-Antari et al. [21] used a You Only Look Once detector for the detection of breast lesions and deep learning convolutional models for retrieving deep features. Classification accuracies of 94.5%, 95.8%, and 97.5%, respectively, for the DDSM dataset and 88.7%, 92.5%, and 95.3%, respectively, for the INbreast dataset were achieved by employing three modified deep learning classifiers, namely regular feedforward convolutional neural network, ResNet-50, and InceptionResNet-V2.

Most of the aforementioned studies dealt with either mass classification or feature selection techniques. Textures and geometric features are being used by most researchers for the characterization of breast mass lesions so they can be classified into benign and malignant categories. It is a well-established fact that usually, all extracted features do not contribute equally to the classification of masses, and some features perform better in combination with other features. Therefore, it is interesting to identify the significant contributing features from the pool of total extracted features. Feature selection algorithms are generally used for the selection of an optimal and relevant subset of extracted features; however, these algorithms cannot be used for performing a comparative analysis of the discriminative capabilities of individual features. There are very few studies in the literature that show a comparative analysis of discriminative capabilities of an individual or a group of features. As a primary contribution, this work is intended to analyze the discriminative capabilities of various texture and geometric (shape and margin) features in CAD systems by incorporating various combinations of texture and geometric features and pattern classification methods. As the key finding, this research investigation revealed the nine most discriminative features out of a pool of 125 texture and geometric features.

The remainder of this paper proceeds as follows. The methods are presented in Section 2. The results are reported and discussed in Section 3, and Section 4 presents the conclusions.

Methods

In this study, a CAD system was proposed for carrying out a comparative analysis of the proficiencies of various texture and geometric features for the classification of breast masses into benign and malignant categories. The schematic diagram of the proposed CAD system is shown in Fig. 1, which consists of five main stages: arrangement of the mammographic dataset, exact mass lesion extraction, feature extraction, feature selection, and classification. A brief description of each step is provided in the following subsections.

Mammographic dataset

Fully field digital mammographic (FFDM) images taken from the INbreast dataset were included in this study for carrying out the experiments. All images in the INbreast dataset have a Digital Imaging and Communications in Medicine format and have been acquired at two different resolutions, 3328 × 4084 and 2500 × 3328, using MammoNovation Siemens FFDM equipment at Centro de Mama - Hospital de S. João (CHSJ), breast center Porto. Boundary points in the form of pixel coordinates inscribing various types of abnormalities in mammographic images of breasts are provided with the dataset [22]. A detailed description of the images included in this study for carrying out the experiments is presented in Table 1.

Table 1 Detailed description of mammographic image dataset [22]

Comparative analysis of proficiencies of various textures and geometric features in breast mass classification using k-nearest neighbor

Abstract

Introduction

Methods

Mammographic dataset

Methodology

Exact mass lesion extraction

Feature extraction

Texture features

Geometry features

Feature selection

Relief-F

Classification

Results and discussion

Comparison with previous works

Conclusions

Availability of data and materials

Abbreviations

References

Acknowledgments

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Competing interests

Additional information

Publisher’s Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords