Preoperative prediction of lymph node metastasis using deep learning-based features

Lymph node involvement increases the risk of breast cancer recurrence. An accurate non-invasive assessment of nodal involvement is valuable in cancer staging, surgical risk, and cost savings. Radiomics has been proposed to pre-operatively predict sentinel lymph node (SLN) status; however, radiomic models are known to be sensitive to acquisition parameters. The purpose of this study was to develop a prediction model for preoperative prediction of SLN metastasis using deep learning-based (DLB) features and compare its predictive performance to state-of-the-art radiomics. Specifically, this study aimed to compare the generalizability of radiomics vs DLB features in an independent test set with dissimilar resolution. Dynamic contrast-enhancement images from 198 patients (67 positive SLNs) were used in this study. Of these subjects, 163 had an in-plane resolution of 0.7 × 0.7 mm2, which were randomly divided into a training set (approximately 67%) and a validation set (approximately 33%). The remaining 35 subjects with a different in-plane resolution (0.78 × 0.78 mm2) were treated as independent testing set for generalizability. Two methods were employed: (1) conventional radiomics (CR), and (2) DLB features which replaced hand-curated features with pre-trained VGG-16 features. The threshold determined using the training set was applied to the independent validation and testing dataset. Same feature reduction, feature selection, model creation procedures were used for both approaches. In the validation set (same resolution as training), the DLB model outperformed the CR model (accuracy 83% vs 80%). Furthermore, in the independent testing set of the dissimilar resolution, the DLB model performed markedly better than the CR model (accuracy 77% vs 71%). The predictive performance of the DLB model outperformed the CR model for this task. More interestingly, these improvements were seen particularly in the independent testing set of dissimilar resolution. This could indicate that DLB features can ultimately result in a more generalizable model. Supplementary Information The online version contains supplementary material available at 10.1186/s42492-022-00104-5.


Introduction
Breast cancer increases in stage and severity as it metastasizes to axillary lymph nodes [1]. Lymph node involvement increases the risk of recurrence and acts as a prognostic indicator, with the survival rate of node-positive patients being up to 40% lower than node-negative patients [2][3][4][5][6]. As a result, lymph node status is critical for diagnosis, prognosis, and monitoring of treatments [7].
Although lymph node management has become less invasive with the use of sentinel lymph node (SLN) biopsy as opposed to full axillary lymph node dissection, significant side effects including shoulder dysfunction, lymphedema, and nerve damage were still observed in as much as one-fourth of patients [8,9]. Moreover, studies have reported > 70% of biopsied SLNs are negative [8], indicating that such procedure is unbeneficial and potentially harmful to a significant amount of breast cancer patients. Accurate non-invasive assessment of nodal involvement therefore is valuable in cancer staging, surgical risk, and financial cost reduction.
Breast cancer is an area of peaked interest for the combination of radiomics and artificial intelligence, with clinical impact possible as both a diagnostic and prognostic tool [10]. One such task is the development of a predictive model for non-invasive staging of the axillary lymph nodes as an alternative to SLN biopsy. Nomograms and radiomic pipelines have been used to predict SLN status with promising results [9,[11][12][13][14][15][16][17][18][19]. However, conventional radiomics (CR) has several disadvantages. For instance, the robustness of the conventional handcrafted radiomic features is variable based on changing parameters, including pixel size, region-of-interest (ROI) delineation, and signal-to-noise ratio [20]. Deep learning has the potential to serve as a more powerful tool to overcome these issues as shown in several studies [21][22][23][24][25][26]. Moreover, deep learning is capable of learning high-level and task-adaptive image features [27]. It enables direct feature extraction from multiple levels without explicit definition and can provide a higher level of feature abstraction [28]. However, deep learning requires a large training data size to obtain a generalizable and functional classification model. Fortunately, studies have demonstrated that initial features extracted by deep learning network are largely similar to CR, since they both detect edges, ripples, and various other textures prior to observing more complex features [29][30][31]. Thus, it is possible to use features identified by a pre-trained deep learning network as an alternative to hand-crafted features used in CR.
The purpose of this study was to develop a DLB feature prediction model for preoperative prediction of SLN metastasis and compare its predictive performance to state-of-the-art CR. Specifically, this study aimed to compare the generalizability of CR vs DLB features in an independent testing set of dissimilar resolution. Figure 1 shows the general pipeline used in this work.

Study population
The dataset used in this study is an expansion of that described in previous publication [13]. Briefly, data for this institutional review board-approved retrospective study collected images from June 2013 to June 2017. Inclusion criteria were patients that had (1) preoperative dynamic contrast enhanced (DCE)-magnetic resonance imaging (MRI), (2) diagnosis of invasive breast cancer by histopathology, (3) SLN biopsy result, and (4) no neoadjuvant chemotherapy. Exclusion criteria were patients that had (1) no SLN biopsy result, (2) very small tumor ROI (less than 64 voxels), or (3) MRI after neoadjuvant chemotherapy. After inclusion/exclusion criteria, a sample of 198 patients (67 positive SLNs and 131 negative SLNs) was used in this study. Of those 198 subjects, 163 had an in-plane resolution of 0.7 × 0.7 mm 2 ; that 163 subject cohort was randomly divided into two independent subsets: a training set (approximately 67%, 109 patients with 37 positive SLNs) and a validation set (approximately 33%, 54 patients with 18 positive SLNs).
The remaining 35 subjects (35 patients with 12 positive SLNs) with a different in-plane resolution (0.78 × 0.78 mm 2 ) were treated as an independent testing set with dissimilar resolution to test the generalizability of Fig. 1 Schematic representation of pipeline for feature extraction, reduction, and model creation. The CR pipeline and the pipeline using deep learning-based (DLB) features are only different in their feature extraction step. All other steps remain identical. LASSO: Least absolute shrinkage selection operator; ROC: Receiver operating characteristic the predictive models for imaging data acquired with slightly different resolution. Given that radiomics has been shown to have limited generalizability, an independent testing set of dissimilar resolution will more rigorously assess this potential of the predictive model.
Clinical data collected for this study included whether the tumor was confined to the upper inner quadrant, multifocality, age, pathological type, tumor grade, molecular subtype, and lymphovascular invasion.

MRI examination
The MRI examinations were all performed using a dedicated 8-channel breast coil on 1.5 T GE Signa (GE Healthcare, Wauwatosa). The sequence of interest in this study was the DCE series; sagittal VI-BRANT multiphase sequence was acquired with the following parameters: repetition time (TR) = 4.46-7.80 ms; echo time (TE) = 1.54-4.20 ms; flip angle = 10°; matrix = 256 × 256; slice thickness = 2 mm. I.V. contrast agent was Magnevist (Schering, Berlin), injected at a dose of 0.2 mL/kg at a rate of 2 mL/s, followed by 20 mL saline flush. Five phases were acquired: one pre-contrast and four post-contrast images. Patients with pixel sizes of 0.7 × 0.7 mm 2 were split into training and validation cohorts. Patients with pixel sizes of 0.78 × 0.78 mm 2 were separately analyzed in an independent testing set. This analysis allows for the clinically practical reality that it is knowingly difficult to standardize pixel size, which may need to be adjusted based patient specific variable (e.g., size of the patient).

Map calculation
Reducing the effect of varying TR and TE, three ratio maps were used: wash-in maps ((S 1 -S 0 )/S 0 ) × 100%, wash-out maps ((S 1 -S 4 )/S 1 )) × 100%, and signal enhancement ratio (SER) maps ((S 1 -S 0 )/(S 4 -S 0 )) × 100%, where S 0 , S 1 , and S 4 are the pre-contrast, first postcontrast, and fourth (the last) post-contrast images, respectively. These maps are independent of the original MR signal intensity and capture the behavior of contrast enhancement in the tissue. Representative image and calculated kinetic maps are shown in Fig. 2.

Segmentation
ROIs of the tumor were manually drawn on the first post-contrast image by a radiologist with 11 years of experience. We noted that although manually drawn ROIs can be subjective, an automated convolutional neural network (CNN)-based segmentation was shown to be comparable in radiomics task-based assessment within this cohort [32]. The original ROI was dilated by 4 mm using Matlab v2017b (MathWorks, Natick). This resulted in two regions of interest: one intratumoral ROI and one peritumoral region (0-4 mm). These regions are shown in Fig. 3.

DLB features
VGG-16 [36], a pre-trained CNN architecture that is 16 layers deep, was utilized for DLB feature extraction. A schematic representation of the network is shown in Fig. 4. The image was multiplied by the binary mask of either the intratumoral ROI or peritumoral ROI such that the regions outside of the RIOs were set to zero. Absolute resampling, similar to above, was performed (for intratumoral ROIs, wash-in map: 0 ∼ 640%; wash-out map: − 156 ∼ 100%; SER map: − 1280 ∼ 1280%; for peritumoral ROIs, wash-in map: 0 ∼ 640%, wash-out map: − 540 ∼ 100%, SER map: − 1280 ∼ 1280%). Data were then normalized on a scale from 0 to 1, with 0 being the lowest value and 1 being the highest value referenced above. Then, the resultant image was multiplied by 255 to match the 0-255 range expected by VGG-16.
VGG-16 has a predefined input structure of 224 × 224 × 3. Each 2D slice of our dataset was cropped to 224 × 224. A 3D volume comprised of the Wash-In, Wash-Out, and SER maps for each slice was inputted.
Matlab was used to import VGG-16. The model was not retrained; instead, all layers remained frozen (weights remained the same), and only activations from the last fully connected layer (fc8) were extracted. These were exported as rows, in which each 2D slice had a single row of 1000 features. Given that each ROI has multiple slices, the value in each column was averaged across all slices for that subject.

Feature reduction
From this point forward, the pipelines for the CR and DLB features are separate and identical.
Standard z-score normalization was used on the training set; z-score is value minus training set mean divided by training set standard deviation. The validation and testing sets were also normalized using the training set mean and standard deviation. Training set was rebalanced using an adaptive synthetic sampling approach; this improves class balance by the creation of new samples from the minority group [37]. Given the high dimensionality of all the extracted features, several steps were performed to remove redundant or non-informative features. Firstly, Mann-Whitney U-test was used to find significantly different features between SLN positive and SLN negative groups; a range of p-value thresholds were tested (0.001, 0.005, 0.01, 0.05). Secondly, groups of highly correlated radiomic features were identified (Spearman ρ) and only one representative feature was selected from each correlated group; similarly, several ρ-value thresholds were tested (0.75, 0.80, 0.85, 0.90, 0.95). Finally, an optional step of principal component analysis (PCA) was performed to further reduce the feature space; several number of PCA components were tested (20,40,60,80,100). The optimal thresholds for feature reduction were chosen as those that resulted in the highest validation accuracy of the average of 100 random seeds.

Feature selection and model creation
The remaining features from the reduction process combined with the clinical features were the input for feature selection process. A logistic regression model was used for the prediction task. The selection of important predictors was performed in the training set using the Least Absolute Shrinkage Selection Operator Regression (LASSO) [38] with 3fold cross-validation. The selected model was that of minimum cross-validation error plus one standard deviation. To avoid overfitting, the maximum number of the selected features was restricted to 10. These features were then used to establish logistic regression models to predict SLN metastasis. The optimal threshold of the receiver operating characteristic analysis was determined by maximizing the Youden index (YI) in the training set, where the YI is defined as sensitivity + specificity -1. This threshold was applied to the independent validation and testing datasets. Predictive performance measures tabulated included area under the curve (AUC), sensitivity, specificity, negative predictive value (NPV), positive predictive value (PPV), and accuracy. To avoid the model optimization becoming stuck in a local minimum, the LASSO procedure was repeated 100 times with different seeds. The cross-validation results across all folds were averaged; the model that achieved the highest accuracy in the training set was selected as the prediction model. Additionally, the training set was shuffled each iteration to randomize the cross-validation within the training set, while the independent validation and testing set remained the same.

Model incorporating peritumoral region
The primary analysis for this study was the model incorporating intratumoral plus peritumoral (4 mm) features, given that it has been shown to outperform intratumoral features alone in a previous publication [13]. For CR model, a total of 157 features (146 radiomic and 11 clinical) were included in the 3-fold crossvalidation LASSO feature selection process. The optimal feature reduction parameters were a ρ-value threshold of 0.95, a p-value threshold of 0.05, and no PCA. We noted that PCA was also performed for the CR feature reduction pipeline but did not improve the predictive performance of the model. Eight features, including 1 clinical, 2 shape, and 5 texture features, were optimized for this model (Table 1).
For DLB model, the optimal feature reduction parameters were a ρ-value threshold of 0.85, a p-value threshold of 0.001, and a PCA value of 80. After feature reduction, there were 91 features (80 DLB and 11 clinical) inputted into the feature selection process. For this model, 5 features were finally selected, including 2 clinical (tumor grade and lymphovascular invasion) and 3 DLB features.
Predictive performance metrics are shown in Table 2  . It is noted that we included the performance of the training set for the completeness of the paper; however, it should not be used for comparison due to overfitting concerns.

Model excluding peritumoral region
As a secondary analysis, models created utilizing clinical features and intratumoral features alone were analyzed.
For CR model, a total of 104 features (93 radiomic and 11 clinical) were fed into the 3-fold cross-validation LASSO feature selection process. The optimal feature reduction parameters were a ρ-value threshold of 0.95, a p-value threshold of 0.05, and no PCA. In logistic model creation, a maximum of 10 features were included. As discussed in Methods, the random seed with the highest training set accuracy was reported. Finally, there were 10 features chosen, including 2 clinical features, 2 shape features, and 6 texture features (Table 3).
For DLB model, the optimal feature reduction parameters were a ρ-value threshold of 0.75, a p-value threshold of 0.005, and no PCA. After feature reduction, there were 48 features (37 DLB features, 11 clinical) inputted into the 3-fold cross-validation LASSO feature selection process. For this model, 9 features, including 2 clinical features (Tumor grade and Lymphovascular Invasion) and 7 DLB features, were selected.
Predictive performance metrics are shown in Table 4 and Fig. 6 for the CR and DLB models. In the validation set, the DLB pipeline performed similarly compared to the CR pipeline

Discussion and conclusions
The results of our study showed that the predictive performance of the DLB model outperformed the CR model in several metrics. More interestingly, these improvements were seen particularly in the independent testing set with dissimilar resolution. This could indicate that DLB features are less sensitive to varying conditions (e.g., pixel size changes) and ultimately result in a more generalizable model. SLN status prediction has been explored using nomograms; examples include those developed by Memorial Sloan Kettering Cancer Center and MD Anderson, which include age, tumor characteristics (size, grade, type, focality, location), lymphovascular invasion and hormone receptors to predict the likelihood of a diseased, positive SLN. These nomograms have been shown to have a moderate predictive performance [11,12]. Imaging studies have investigated the use of non-invasive quantitative imaging radiomic biomarkers along with clinical data for prediction of SLN status, with promising results (AUC > 0.8) [9,[13][14][15][16][17][18][19]. Specifically, our previously published work has validated a radiomic pipeline of peritumoral region in combination with intratumoral region for the prediction of SLN metastasis [13]. The peritumoral region is of interest because tissue surrounding the tumor may contain valuable information such as angiogenic-lymphangiogenic factors and tumor infiltrating lymphocytes, which have been shown to be related to treatment response [39]. Predictive models based on CR features can be disadvantageous because radiomic features have been shown to be sensitive to changing parameters, such as pixel size alteration [20]. Numerous studies have shown DLB models to outperform CR pipelines. Using MRI, CNN performance has been compared and shown to outperform radiomics for the purposes of breast lesion classification [23] and gene mutations in low grade gliomas [24]. Additionally, the combination of the CR and DLB features was superior in the survival and classification prediction of high-grade gliomas [25,26]. Our study took one step further, creating an independent testing set of dissimilar resolution, in attempt to identify a particular condition in which DLB features may outperform CR features. Note that the goal of this work is to compare DLB and CR features. Also, given that the radiomic model has shown to outperform the model using clinical characteristics alone Table 2 Predictive performance results for intratumoral plus 4 mm peritumoral region for CR and DLB models. Values shown are from the random seed with highest training set accuracy. The DLB pipeline slightly outperformed CR in the validation set of the same resolution as the training set. A larger improvement is seen in the testing set of dissimilar resolution. This indicates the DLB pipeline might be more generalizable and less sensitive to pixel size differences in previous study [13], the models presented here are not compared with the clinical only model. Deep learning can utilize CNNs to extract features by applying convolution layers composed of small-sized fields or kernels, followed by pooling layers to reduce the size of the resultant feature space. Optimizing the kernel to be applied to the image to extract meaningful features is the purpose of training the network. Consequently, the performance of a CNN is dependent on the amount of data it has available to train on. This results in the CNN being more 'data hungry' and typically needing a very large sample size, on the order of millions of images, to have optimal performance [40]. Even if collaborating with multiple institutions, acquiring millions of medical images is frequently not feasible, especially in the instances of rare diseases. One way to address this limitation is the utilization of transfer learning. Transfer learning involves the implementation of a pre-trained network that is further fine-tuned with samples specific to a desired task. Transfer learning has been explored for prediction of lymph node metastasis in patients with cervical cancer using MRI (AUC: 0.9) and breast cancer using CT (AUC: 0.8) with promising results [41,42]. In this work, we used a pretrained VGG model that was originally optimized to classify 1000 types of objects. Thus, the features extracted from the VGG model are expected to be robust as the features were used to differentiate similar shaped objects such as tow trucks and cars, and animals such as ladybugs and crabs. Transfer learning models that further train VGG with new medical imaging data can further manipulate the way features are extracted; whereas our proposed method doesn't rely on refining the original VGG network and directly utilizes the robust features that have been trained in its original task. Our proposed model uses the last fully connected layer in VGG prior to classification, allowing our DCE images to fully propagate throughout the network for feature extraction. This approach is more readily available compared to transferred learning approach and can be directly incorporated into more traditional radiomics pipeline. Particularly that for studies with relatively small sample size, it is expected to be less prone to overfitting.
Future directions of this study include fine-tuning of the network directly for classification instead of feature extraction [40]. Additionally, incorporation of additional features such as CoLlAGe and wavelet features could be performed for the radiomics pipeline; specifically, it has been shown that these directional-based features have correlated with tumor microenvironment as seen on  pathology, such as orientation and densely packed tumor infiltrating lymphocytes [39,43]. Moreover, other evaluations using different parameters other than in-plane resolution could be performed to further evaluate the generalizability. In general, there also exists the possibility to look specifically at the lymph nodes. This study analyzed the in-breast tumor to predict metastasis to the lymph node. To date, efforts to analyze axillary lymph nodes on MRI has been limited by small axillary lymph node sizes, breast coil sensitivity in regions in the axilla, and exclusion of a majority of the axilla from field of view. Although nodal morphological features on MRI are predictive of malignancy [2,44], predictive value of contrast enhancement and morphological criteria, such as size and shape of lymph nodes, was found to be controversial [45,46]. Furthermore, application of this model to more advanced stage breast cancer (e.g., patients undergoing neoadjuvant chemotherapy) or other cancer types (e.g., head and neck cancer) is another potential avenue of study.
Additional file 1: Supplemental Table 1. Summary of radiomic features extracted. Fig. 6 Predictive performance of features including intratumoral region with CR and DLB models. The DLB model performed similarly to CR in the validation set. In the testing set, the DLB model outperformed CR in numerous metrics, including NPV, accuracy and YI. This indicates the DLB model might be more generalizable and less sensitive to pixel size differences. Note that the performance of the training set is not used for comparison due to overfitting concerns. AUC: area under the curve; Sens: sensitivity; Spec: specificity; PPV: positive predictive value; NPV: negative predictive value; Acc: accuracy; YI: Youden index