Potentially suspicious breast neoplasms could be masked by high tissue density, thus increasing the probability of a false-negative diagnosis. Furthermore, differentiating breast tissue type enables patient pre-screening stratification and risk assessment. In this study, we propose and evaluate advanced machine learning methodologies aiming at an objective and reliable method for breast density scoring from routine mammographic images. The proposed image analysis pipeline incorporates texture [Gabor filters and local binary pattern (LBP)] and gradient-based features [histogram of oriented gradients (HOG) as well as speeded-up robust features (SURF)]. Additionally, transfer learning approaches with ImageNet trained weights were also used for comparison, as well as a convolutional neural network (CNN). The proposed CNN model was fully trained on two open mammography datasets and was found to be the optimal performing methodology (AUC up to 87.3%). Thus, the findings of this study indicate that automated density scoring in mammograms can aid clinical diagnosis by introducing artificial intelligence-powered decision-support systems and contribute to the ‘democratization’ of healthcare by overcoming limitations, such as the geographic location of patients or the lack of expert radiologists.
In a variety of recent publications, a strong independent predictor of breast cancer is reported to be mammographic density (
This classification problem is usually handled by machine and deep learning techniques. Published studies concerning feature-based methods have incorporated several approaches. In particular, Bovis (
On the other hand, a deep learning framework was previously applied by Fonseca
This study constitutes an extensive analysis of MBD classification using two publicly available datasets incorporating various feature extraction methods, such as histogram of oriented gradients (HOG) (
The main aim of this study was to present and discuss the results of modern machine learning techniques that combine the aforementioned feature extraction methods, alongside more robust CNN schemes reported in the literature. In the following section, the mammographic datasets and the proposed workflow are presented.
The patient population used in this study is based on two publicly available datasets, the Mammographic Image Analysis Society Digital Mammogram (mini-MIAS) and the Digital Database for Screening Mammography (DDSM). Further information regarding each dataset is presented below:
The mini-MIAS (
The DDSM (
Medical imaging databases, such as the aforementioned datasets, usually consist of limited patient cohorts, which restricts the learning capacity of deep models. Therefore, to avoid biases related to the low sample number, k-fold cross-validation was performed to split each dataset into multiple convergence and testing sets. Additionally, each convergence set was split into a training and a validation set by a shuffled hold-out process as presented in
In order to ensure reliable image quality without artefacts, and limit background noise that may potentially affect the feature extraction analysis, both mini-MIAS and DDSM images were pre-processed as follows. Initially, the threshold selection method described in the study by Otsu (
Distinguishing key points in imaging structures is a crucial step for capturing the essential abstractions, pixel intensity variations and local dependencies that differentiate tissue classes. Mammography images usually include different structures, including muscle, breast tissue and benign/malignant lesions. To address this variability in tissue content, a number of algorithms with diverse mathematical backgrounds were employed for extracting discriminative compact representations, including gradient-based features, such as HOG and SURF, and texture features, such as LBP and Gabor filters combined with LBP.
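To make the texture descriptor concrete, the following is a minimal sketch of a basic LBP histogram, assuming the simplest 8-neighbour, radius-1 variant computed over the whole image. The function name `lbp_histogram` and these parameter choices are illustrative assumptions and are not taken from the study, which does not specify its LBP configuration here.

```python
import numpy as np


def lbp_histogram(image: np.ndarray) -> np.ndarray:
    """Basic 8-neighbour LBP (radius 1) summarised as a normalised 256-bin histogram.

    Assumption: this is the simplest LBP variant, not necessarily the one used in the study.
    """
    img = np.asarray(image, dtype=np.float64)
    h, w = img.shape
    center = img[1:-1, 1:-1]  # border pixels are skipped
    # Offsets of the 8 neighbours, clockwise from the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros(center.shape, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = img[1 + dy : h - 1 + dy, 1 + dx : w - 1 + dx]
        # Set this bit wherever the neighbour is at least as bright as the centre.
        codes |= (neighbour >= center).astype(np.uint8) << bit
    hist = np.bincount(codes.ravel(), minlength=256).astype(np.float64)
    return hist / hist.sum()
```

The resulting 256-dimensional vector is one possible compact representation that could be passed to a feature selection step or a classifier.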
This step was included in the proposed methodology to reduce the high-dimensional raw features to their most significant components, resulting in lower computational complexity and improved performance. The selection was performed using neighborhood component analysis (NCA) (
Linear discriminant analysis (
Deep learning analytics introduce a fully automated analysis pipeline with data-driven learnable parameters, providing a domain-specific modelling methodology. The main objective of deep learning architectures, such as CNNs, is to learn hierarchical representations of the examined domain across several layers by convolving and propagating feature maps of the initial input in an end-to-end and automatic manner. This is formulated as an optimization problem, generally non-convex, in which the model adapts its weights through backpropagation. To address the clinical question of this study, several pre-trained deep learning models were evaluated for feature extraction, such as inception networks, VGG19 (
This process represents an artificial method of increasing the training set that simultaneously promotes the generalization ability of the models by offering alternative variants of the original image. The added noise of image transformations, including rotation, flipping, elastic deformation and mirroring, amplifies model properties, such as translation, rotational and scale invariance.
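The augmentation idea can be illustrated with a minimal sketch, assuming only shape-preserving transformations (mirroring, vertical flipping and a 180-degree rotation). Elastic deformation and arbitrary-angle rotation, which the text also mentions, are deliberately omitted, and the helper name `augment` is hypothetical.

```python
from typing import List

import numpy as np


def augment(image: np.ndarray) -> List[np.ndarray]:
    """Return the original 2D image plus three shape-preserving variants.

    Assumption: a simplified subset of the transformations named in the text.
    """
    return [
        image,
        np.fliplr(image),    # mirroring (left-right)
        np.flipud(image),    # flipping (up-down)
        np.rot90(image, 2),  # rotation by 180 degrees keeps the input shape
    ]
```

Because each variant keeps the original shape, the augmented images can be fed to a network with a fixed input size without further resizing.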
The fully trained 2D CNN architecture consists of 15 layers, including the image input of shape 725×234×1, 6 convolutional layers each followed by a batch-normalization layer, 2 fully-connected layers with 100 neurons each and finally a softmax classification layer as depicted in
The fitting process of a deep architecture poses the challenging task of searching for the optimal parameters in order to discover the best performing model. The validation set, as part of the convergence set, was used to perform this task in a transparent and unbiased way on a limited database, as depicted in
Transfer learning is a powerful research methodology used in the data science community, particularly for overcoming the limitations arising in highly-specialized but small datasets. In particular, the contribution of an ‘off-the-shelf’ model, used as a feature extraction component, was evaluated in terms of performance by an external classifier. The selected pre-trained models compute different types of deep features since they integrate diverse architecture elements, such as residual connections, connectivity between successive layers, number of layers and number of parameters.
The pre-trained models with ImageNet weights were employed for this purpose from the open source Keras library (
SVMs are popular classifiers widely used in a variety of image analysis problems demonstrating robust performance. Taking into account the different types of feature-based and deep features calculated by the corresponding proposed methodologies, SVM was selected for the evaluation of the classification performance among the feature extraction processes in a meaningful and direct manner. In particular, the selected kernels for the studied SVMs were the following: Radial basis function for the multiclass and linear for the binary MBD classification. Finally, input feature vectors were generated from the original 5-fold splits of the corresponding model to guarantee a fair evaluation.
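The classifier setup described above can be sketched as follows, assuming scikit-learn's `SVC` and synthetic feature vectors in place of real mammographic features. The standardization step, the random seed and the helper name `cross_validated_accuracy` are assumptions made for illustration; the study's actual hyperparameters are not given in this section.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def cross_validated_accuracy(features, labels, kernel, n_splits=5, seed=0):
    """Mean accuracy of an SVM over stratified folds (illustrative only)."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in skf.split(features, labels):
        # Assumption: features are standardized before the SVM.
        model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
        model.fit(features[train_idx], labels[train_idx])
        scores.append(model.score(features[test_idx], labels[test_idx]))
    return float(np.mean(scores))


# Synthetic, well-separated feature vectors standing in for extracted features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(40, 8)) for c in (0.0, 3.0, 6.0)])
y = np.repeat([0, 1, 2], 40)

multiclass_acc = cross_validated_accuracy(X, y, kernel="rbf")  # RBF for multi-class
binary_acc = cross_validated_accuracy(X, (y == 2).astype(int), kernel="linear")  # linear for binary
```

Reusing the same fold indices for every feature type is what would make the comparison between feature extraction methods direct.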
The studied binary classification models were evaluated mainly in terms of the area under the curve (AUC) score, which is a widely used performance metric of class separability. The multi-class analyses were evaluated with the following accuracy (ACC) metric:
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$
where TP, TN, FP, and FN stand for true-positive, true-negative, false-positive and false-negative, respectively.
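As a minimal sketch, assuming accuracy is defined in the usual way as the ratio of correct predictions to all predictions, the metric can be computed from the four counts as follows. The function name `accuracy` is illustrative.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of correct predictions among all predictions."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one sample is required")
    return (tp + tn) / total
```

For example, 40 true-positives, 45 true-negatives, 5 false-positives and 10 false-negatives give 85 correct predictions out of 100.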
All studied models were fitted on the same stratified hold-out convergence (training/validation) set and evaluated on identical testing folds of cross-validation to ensure a fair and transparent comparison. This resulted in 64.6% training, 15.4% validation and 20% testing mammography images from the mini-MIAS and 63.9% training, 16.1% validation and 20% testing from the DDSM database, respectively.
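The splitting scheme can be sketched as follows, assuming scikit-learn's `StratifiedKFold` for the outer convergence/testing split and a stratified shuffled hold-out for the inner training/validation split. The validation fraction of 0.2 is an assumption; the reported percentages imply that roughly a fifth of each convergence set was held out for validation (15.4/80 and 16.1/80). The helper name `make_splits` is hypothetical.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split


def make_splits(labels, n_splits=5, val_fraction=0.2, seed=0):
    """Yield (train, validation, test) index arrays for each cross-validation fold."""
    labels = np.asarray(labels)
    placeholder = np.zeros((len(labels), 1))  # only the labels drive the stratification
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for convergence_idx, test_idx in skf.split(placeholder, labels):
        # Shuffled hold-out inside the convergence set, stratified by class.
        train_idx, val_idx = train_test_split(
            convergence_idx,
            test_size=val_fraction,
            stratify=labels[convergence_idx],
            shuffle=True,
            random_state=seed,
        )
        yield train_idx, val_idx, test_idx


# Synthetic three-class labels standing in for density annotations.
labels = np.repeat([0, 1, 2], 100)
splits = list(make_splits(labels))
```

Generating the indices once and reusing them for every model is one way to guarantee that all methods see identical training, validation and testing images.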
Different algorithms and annotation strategies were applied to the two studied datasets to identify the optimal feature space representation. Accuracies in the mini-MIAS dataset ranged from 50.9% (GABOR + LBP selected features) to 74.2% (LBP) for three-class classification, while for binary classification, the AUC scores varied from 48.7% (HOG selected features) to 78.0% (LBP). Similarly, the previously described methodology was applied to the full DDSM dataset for predicting the MBD scoring in a binary and multi-class annotation scheme. The optimal score (ACC 79.3% and AUC 84.2%) for the feature-based techniques was observed in binary (non-dense versus dense mammograms) analysis with the SURF method. The full performance metrics of the proposed machine learning analyses are presented in
The proposed 2D CNN, as depicted in
In the present study, modern machine and deep learning techniques for MBD classification were developed and evaluated on two open datasets. A variety of texture and gradient-based features were investigated in the context of breast density scoring classification. Additionally, end-to-end image analysis architectures including fully trained CNN and ‘off-the-shelf’ deep learning models were also employed with the goal to increase accuracy in the automated breast tissue density classification.
The examined clinical classification tasks were selected based on the current literature regarding mammography image analysis and classification. The majority of similar published works incorporate binary tissue type analysis (non-dense versus dense). This was achieved by merging the fatty and glandular classes into the ‘non-dense’ class for binary classification in the mini-MIAS dataset. The DDSM can also be examined as a two-class set by merging fatty-glandular and dense-extremely dense, as a three-class problem with fatty, glandular and a unified dense-extremely dense class, or as a four-class analysis based on the BI-RADS criteria for tissue characterization.
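The class-merging schemes described above for the DDSM can be written as simple label mappings. The sketch below assumes string labels for the four density categories; the exact label names and the helper `relabel` are hypothetical.

```python
# Assumption: illustrative string names for the four DDSM density categories.
DDSM_FOUR_CLASS = ("fatty", "glandular", "dense", "extremely dense")

# Three-class scheme: dense and extremely dense are unified.
TO_THREE_CLASS = {
    "fatty": "fatty",
    "glandular": "glandular",
    "dense": "dense",
    "extremely dense": "dense",
}

# Two-class scheme: fatty and glandular become non-dense.
TO_TWO_CLASS = {
    "fatty": "non-dense",
    "glandular": "non-dense",
    "dense": "dense",
    "extremely dense": "dense",
}


def relabel(labels, mapping):
    """Map each original label to its merged class."""
    return [mapping[label] for label in labels]
```

Keeping the four-class annotation as the single source of truth and deriving the coarser schemes from it avoids inconsistencies between the binary and multi-class experiments.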
The reported results in
This study mainly focused on the medium-size database setting. As regards the presented deep-learning framework, other published studies using databases of similar sizes have reported an AUC from 59 to 73% for binary MBD classification. In particular, Kallenberg
Recently, deep learning architectures for MBD classification reported in the literature have achieved higher performance than this study; however, these used databases that are not publicly available, with sample sizes in the order of tens of thousands of images. In particular, Lehman
The high dimensionality of extracted features and the challenging feature selection process can be a limiting factor in feature-based methods. Selecting the optimal extraction algorithm and reduction strategy requires expertise in both the clinical field and statistics. In particular, the extraction of HOG features could not be completed due to the high demand in computational time and memory resources for the analysis of a large dataset like DDSM. By contrast, deep architectures converge to better models with large databases, but require specialized high throughput computing (HTC) and a complex hyperparameter search to ensure a generalizable analysis. This can be partially resolved by utilizing pre-trained models, with the only drawback being the potential need for fine-tuning for domain adaptation.
A dense breast could possibly mask suspicious neoplasms that are difficult to differentiate in routine mammographic images; thus, computer-based decision support can add valuable, objective information in support of the clinicians' assessment. To this end, MBD classification is a challenging and important task and the results of this study call for further research in this field, as well as for the further testing of new methodologies, particularly in the smaller dataset setting.
To facilitate future research, a meta-model analysis on multiple feature extraction methods, sophisticated selection algorithms and machine learning classifiers fused by a higher-level decision component, such as logistic regression, AdaBoost, weighted average or even voting could provide richer compact representations of the mammographic data improving the inference confidence. As regards the pre-trained models, fine-tuning could introduce domain-specific analysis improvements allowing an end-to-end fully automated inference and consequently offering substantial performance gains. Finally, the integration of both cranio-caudal and medio-lateral mammographic images either in a unified model or by combining different methods, features and models derived from both views, may further enhance the prediction power of such automated breast density classification systems.
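One of the fusion options mentioned above, a weighted average of model outputs, can be sketched as follows. This assumes each model produces class probabilities for every sample; the function name `weighted_average_fusion` and the example weights are hypothetical and do not correspond to any experiment in this study.

```python
import numpy as np


def weighted_average_fusion(probabilities, weights):
    """Fuse per-model class probabilities by a weighted average.

    probabilities: array-like of shape (n_models, n_samples, n_classes)
    weights: array-like of shape (n_models,), normalised internally
    Returns the predicted class per sample and the fused probabilities.
    """
    probs = np.asarray(probabilities, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    fused = np.tensordot(w, probs, axes=1)  # shape (n_samples, n_classes)
    return fused.argmax(axis=1), fused
```

With equal weights this reduces to soft voting; unequal weights allow a stronger model to dominate the decision.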
Not applicable.
GSI acknowledges the support by the Hellenic Foundation for Research and Innovation (HFRI) and the General Secretariat for Research and Technology (GSRT), under the HFRI PhD Fellowship grant (GA. no. 31430). ET was financially supported by the Stavros Niarchos Foundation within the framework of the project ARCHERS (‘Advancing Young Researchers’ Human Capital in Cutting Edge Technologies in the Preservation of Cultural Heritage and the Tackling of Societal Challenges’).
All data generated or analyzed within this study are included in this published article.
ET, GSI, VDM and KM conceived and designed the study. GZP, AT and DAS researched the literature, performed the analysis of the data and contributed to the drafting of the manuscript. ET, GSI, VDM, GZP, AT, DAS and KM critically revised the article for important intellectual content. All authors have read and approved the final manuscript.
All patient data were obtained from publicly available datasets. Thus, no approval was required.
Not applicable.
DAS is the Editor-in-Chief for the journal, but had no personal involvement in the reviewing process, or any influence in terms of adjudicating on the final decision, for this article. All the authors declare that they have no competing interests.
The data stratification methodology for model fitting, hyperparameter optimization and transparent performance evaluation across every examined image analysis process.
Graphical representation of the machine learning workflow illustrating the three-class breast density mammogram classification case.
Overview of the proposed architecture (fully-trained CNN), including the network layout and layer parameters, such as the receptive field, number of filters, convolutional stride, activation function, number of neurons, dropout and classifier. CNN, convolutional neural network.
The examined pipeline for feature extraction and classification using ‘off-the-shelf’ pre-trained methods.
Breast density scoring.
Methodology or study, authors (Refs.) | mini-MIAS (2-class) ACC/AUC (%) | mini-MIAS (3-class) ACC (%) | DDSM (2-class) ACC/AUC (%) | DDSM (3-class) ACC (%) | DDSM (4-class) ACC (%) | No. of images |
---|---|---|---|---|---|---|
Machine learning | ||||||
HOG | 71.8/52.3 | 53.1 | – | – | – | Full |
LBP | 83.3/78.0 | 74.2 | 67.1/71.4 | 55.1 | 36.6 | Full |
SURF | 82.6/77.6 | 68.3 | 79.3/84.2 | 67.5 | 46.8 | Full |
Gabor + LBP | 76.7/68.4 | 61.7 | 62.8/67.1 | 52.1 | 35.8 | Full |
Selected HOG | 69.0/48.7 | 53.1 | – | – | – | Full |
Selected LBP | 77.9/71.1 | 70.2 | 73.7/79.2 | 64.5 | 40.7 | Full |
Selected SURF | 83.8/77.6 | 73.6 | 75.6/81.5 | 62.9 | 46.8 | Full |
Selected Gabor + LBP | 64.9/60.9 | 50.9 | 62.1/67.7 | 55.4 | 37.7 | Full |
Bovis | – | – | 96.6/– | – | 71.4 | 377 |
Tzikopoulos | – | 70.3 | – | – | – | Full |
Oliver | – | – | – | – | 40.3–47 | 300 |
Deep learning | ||||||
Proposed architecture | | | | | | Full |
Inception 3 | 73.6/75.7 | 70.8 | 72.7/79.1 | 49.5 | 48.8 | Full |
VGG19 | 68.6/67.8 | 72.4 | 72.1/79.3 | 62 | 36.8 | Full |
InceptionResNetV2 | 69.9/63.7 | 73.1 | 72.7/79.2 | 55.6 | 37.3 | Full |
DenseNet201 | 75.5/79.6 | 77.9 | 73.1/80.5 | 61.7 | 36.5 | Full |
NASNetLarge | 66.5/66.3 | 72.8 | 72.3/78.7 | 61.4 | 37.8 | Full |
The table presents 5-fold cross-validation averages for the examined methodologies. HOG, histogram of oriented gradients; LBP, local binary pattern; SURF, speeded-up robust features. Values in bold font indicate the optimal performing methodologies.