The research methodology comprises three stages: (1) preprocessing and data augmentation, (2) feature extraction, and (3) classification and prediction. Dermoscopic image data were acquired from a well-known South Korean university hospital [10]. Various preprocessing methods were applied to remove dermoscopic artifacts, and the preprocessed dataset was used to train the deep learning models. A flowchart of the methodology is shown in Fig. 1.
Dataset preparation
In this study, 724 dermoscopic images were collected at Severance Hospital in the Yonsei University Health System, Seoul, South Korea. Of these 724 images, 350 were from AM patients and 374 were from benign nevus (BN) patients. All diagnoses were confirmed histopathologically. The dataset was divided into training, validation, and testing sets: 70% of the images were used to train the algorithm, 20% were used for validation, and the remaining 10% were used to evaluate the performance of the algorithm on unseen images.
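As an illustration of this split, the sketch below uses scikit-learn's train_test_split; the file paths, random seed, and the use of stratification are assumptions for the example, not details reported in the study.

```python
from sklearn.model_selection import train_test_split

# Hypothetical example data: file paths and class labels ("AM" or "BN").
image_paths = [f"images/lesion_{i:03d}.jpg" for i in range(724)]
labels = ["AM"] * 350 + ["BN"] * 374

# First split off the 70% training set, stratified by class.
train_x, rest_x, train_y, rest_y = train_test_split(
    image_paths, labels, train_size=0.70, stratify=labels, random_state=42)

# Split the remaining 30% into validation (20% overall) and test (10% overall).
val_x, test_x, val_y, test_y = train_test_split(
    rest_x, rest_y, train_size=2 / 3, stratify=rest_y, random_state=42)
```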
Dataset preprocessing
The raw dermoscopic images are high resolution and therefore computationally expensive to process. In addition, they contain dermoscopic artifacts that hinder automated classification. These challenges were addressed using the following preprocessing techniques.
Artifacts removal
The raw dermoscopic images contain several artifacts that may degrade the performance of a deep learning model, such as dark-corner frames, ruler markings, and hairs. These were removed using cropping and image processing techniques, as described in ref. [26].
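Ref. [26] specifies the exact artifact-removal procedure used. As an illustrative stand-in, the following OpenCV sketch implements one common hair-removal approach, blackhat morphological filtering followed by inpainting; the kernel size and threshold here are assumptions.

```python
import cv2
import numpy as np

def remove_hairs(image: np.ndarray) -> np.ndarray:
    """Remove hair-like artifacts via blackhat filtering and inpainting."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Blackhat highlights thin dark structures (hairs) against the skin.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (17, 17))
    blackhat = cv2.morphologyEx(gray, cv2.MORPH_BLACKHAT, kernel)
    # Threshold the hair response into a mask and inpaint the masked pixels.
    _, mask = cv2.threshold(blackhat, 10, 255, cv2.THRESH_BINARY)
    return cv2.inpaint(image, mask, inpaintRadius=3, flags=cv2.INPAINT_TELEA)
```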
ROI cropping and resizing
The original dataset contains 724 2D RGB skin lesion images at high resolution (2560 × 1920). Processing such high-resolution images incurs a high computational cost, and the images must be resized to fit the input size of the CNN. Because directly resizing the full images may distort the shape of the skin lesions, the ROI of each lesion was first cropped, following the authors of ref. [23], and then resized to a lower resolution to preserve the lesion's features and shape. The crop size was set to 0.7 of the height of the original image, and each image was automatically cropped about its center. As illustrated in Fig. 2, this approach enlarges the lesion area for feature extraction while preserving the original lesion shape.
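A minimal sketch of this cropping step is given below, assuming a square center crop of 0.7 × the original height followed by resizing to 224 × 224; the exact ROI procedure of ref. [23] may differ.

```python
import cv2
import numpy as np

def center_crop_and_resize(image: np.ndarray, crop_frac: float = 0.7,
                           out_size: int = 224) -> np.ndarray:
    """Crop a centered square ROI (crop_frac x image height) and resize it."""
    h, w = image.shape[:2]
    side = int(h * crop_frac)          # crop size: 0.7 of the original height
    top = (h - side) // 2
    left = (w - side) // 2
    roi = image[top:top + side, left:left + side]
    return cv2.resize(roi, (out_size, out_size), interpolation=cv2.INTER_AREA)
```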
Data augmentation
High-performance deep networks require large datasets. However, in the medical domain, obtaining large amounts of data is a significant challenge owing to privacy concerns [22]. The original dataset contains only 724 images in two classes, AM and BN, which is not sufficient to train a high-performance deep learning model. To overcome this limitation, several image augmentation techniques were applied, as described by Perez et al. [27]. Each image was rotated by 90°, 180°, and 270° and flipped both upside down and left to right, yielding six samples from every training sample, i.e., a six-fold increase in the data. Table 2 summarizes the original dataset and the preprocessed, augmented dataset.
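The six-fold augmentation can be sketched with NumPy as follows; this illustrates the described transformations rather than reproducing the authors' exact code.

```python
import numpy as np

def augment(image: np.ndarray) -> list[np.ndarray]:
    """Generate six samples per image: the original, three rotations, two flips."""
    return [
        image,                  # original
        np.rot90(image, k=1),   # 90 degree rotation
        np.rot90(image, k=2),   # 180 degree rotation
        np.rot90(image, k=3),   # 270 degree rotation
        np.flipud(image),       # upside-down flip
        np.fliplr(image),       # left-right flip
    ]
```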
Deep learning models
This section describes the deep learning networks employed for the feature extraction and classification tasks.
Model 1 deep CNN
The proposed model is a 7-layer deep convolutional neural network. The input to this deep ConvNet is an RGB image of size 224 × 224 × 3. The model consists of five convolutional layers and two fully connected layers. Max pooling was applied after each convolutional layer, and the outputs of the convolutional layers were normalized using batch normalization, which also acts as a form of regularization and speeds up training. The nonlinear ReLU function was adopted as the activation function. The first layer applies eight convolutional filters of size 3 × 3, followed by batch normalization, ReLU, and 2 × 2 max pooling. This block is repeated four more times, with the number of convolutional filters increased to 16, 32, 64, and 128 in successive blocks. The output of the final convolutional block is flattened and fed into a 256-neuron fully connected layer, followed by a 2-neuron fully connected layer. In the final layer, SoftMax is applied to classify the inputs into the predicted labels. Dropout with a rate of 0.3 was applied to reduce overfitting, and data augmentation was also used to counter overfitting. The network contains 919,346 parameters. SGD was used as the optimization algorithm, and categorical cross-entropy was used as the loss function. The cross-entropy loss is defined as
$$CE = -\sum_{i=1}^{C} T_i \log\left(S_i\right) \qquad (1)$$
where $T_i$ and $S_i$ are the ground-truth and predicted labels for each class $i$ in $C$.
In our case, we have two classes, that is, C = 2 (AM and BN), so the cross-entropy loss can be described as follows:
$$CE = -\sum_{i=1}^{C=2} T_i \log\left(S_i\right) = -t_1 \log\left(s_1\right) - \left(1 - t_1\right)\log\left(1 - s_1\right) \qquad (2)$$
where $t_1$ is the ground-truth label and $s_1$ is the model prediction.
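As a worked example, for a ground-truth AM sample ($t_1 = 1$) predicted with $s_1 = 0.8$, Eq. (2) gives $CE = -\log(0.8) \approx 0.22$, whereas a misclassified prediction of $s_1 = 0.2$ gives $CE = -\log(0.2) \approx 1.61$; confident errors are thus penalized much more heavily.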
Figure 3 presents the convolutional network architecture utilized in this study.
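Since the text does not include a reference implementation, the following PyTorch sketch reconstructs the described architecture; the padding scheme, layer ordering, and classifier input size are assumptions, so the parameter count of this sketch will not exactly match the reported 919,346.

```python
import torch
import torch.nn as nn

class DeepConvNet(nn.Module):
    """Sketch of the described 7-layer CNN (padding and layer order assumed)."""

    def __init__(self, num_classes: int = 2):
        super().__init__()
        blocks = []
        in_ch = 3
        for out_ch in (8, 16, 32, 64, 128):    # five convolutional blocks
            blocks += [
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                nn.BatchNorm2d(out_ch),        # normalize conv outputs
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2),               # 2x2 max pooling
            ]
            in_ch = out_ch
        self.features = nn.Sequential(*blocks)
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 7 * 7, 256),       # 224 / 2**5 = 7 after five pools
            nn.ReLU(inplace=True),
            nn.Dropout(0.3),                   # dropout to reduce overfitting
            nn.Linear(256, num_classes),       # SoftMax is applied by the loss
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

model = DeepConvNet()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()  # categorical cross-entropy, includes SoftMax
```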
Model 2 transfer learning with AlexNet
AlexNet is a CNN proposed by Krizhevsky et al. [28], who used it to compete in the ImageNet Large Scale Visual Recognition Challenge in 2012. The network achieved a top-5 error of 15.3%, 10.8 percentage points lower than that of the second-place entry. These results showed that network depth is important for good image classification performance. Although training such a deep network was computationally expensive, the use of GPUs made it feasible.
The network architecture contains a total of eight layers: five convolutional layers and three fully connected layers, followed by a SoftMax layer for classification. The architecture uses several features that help improve the performance of deep ConvNets: it applies max pooling after the convolutional layers and utilizes dropout regularization. In addition, to train the deep network effectively, the nonlinear rectified linear unit was used as the activation function instead of tanh. The architecture has a total of 60 million parameters.
A pre-trained AlexNet model, originally trained on the ImageNet database, was utilized. Transfer learning was applied by replacing the last dense layer of the original network, which consisted of 1000 neurons, with a 2-neuron layer. A new fully connected layer with 256 neurons was added, followed by a ReLU activation layer. To counter overfitting, a dropout layer with a rate of 0.4 was added. All convolutional layers were used to extract features from the dermoscopic images, and only the final classifier was trained on our dataset. The configuration of the adopted architecture is shown in Fig. 4.
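A minimal torchvision sketch of this modification is shown below. The indexing of AlexNet's classifier (the final 4096 → 1000 layer at classifier[6]) follows torchvision's implementation; the exact placement of the new layers relative to the authors' setup is an assumption.

```python
import torch.nn as nn
from torchvision import models

# Load AlexNet pre-trained on ImageNet and freeze its convolutional extractor.
alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
for param in alexnet.features.parameters():
    param.requires_grad = False

# Replace the final 1000-neuron layer with a 256-neuron FC layer,
# ReLU, 0.4 dropout, and a 2-neuron output layer (AM vs. BN).
alexnet.classifier[6] = nn.Sequential(
    nn.Linear(4096, 256),
    nn.ReLU(inplace=True),
    nn.Dropout(0.4),
    nn.Linear(256, 2),
)
```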
Model 3 fine-tuning deep residual neural networks
ResNet was proposed by He et al. [29], who introduced a residual learning method for training deeper networks. Deeper networks are difficult to train because of the vanishing and exploding gradient problems, and residual learning was proposed to address this. According to their research, deeper networks based on residual learning optimize better and achieve higher accuracy owing to the added depth. When deep plain networks start converging, their accuracy saturates; this problem was solved by introducing a residual function. The original network was trained on the ImageNet database, and ResNet took first place in the 2015 ILSVRC competition with a top-5 error rate of 3.57%.
In plain neural networks, layers are stacked to learn the desired mapping directly. In residual networks, however, the stacked layers learn a residual mapping. Let the desired mapping be denoted H(x), where x is the input to the stacked layers. The mapping is given by the following equation:
$$H\left(x\right) = F\left(x\right) + x \qquad (3)$$
and the residual mapping function is given by:
$$F\left(x\right) = H\left(x\right) - x \qquad (4)$$
Instead of learning H(x) directly, the stacked layers explicitly learn the residual function F(x). The original mapping is recovered after approximating the residual function, as H(x) = F(x) + x. In a feedforward neural network, the mapping F(x) + x is realized as a shortcut connection that performs element-wise addition.
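The residual computation of Eqs. (3) and (4) can be illustrated with a minimal PyTorch basic block; the two-convolution form of F(x), and the assumption that input and output dimensions match so the identity shortcut applies, follow the common basic-block design rather than any detail given in the text.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: output = F(x) + x via an identity shortcut."""

    def __init__(self, channels: int):
        super().__init__()
        # F(x): two 3x3 convolutions with batch normalization.
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        return self.relu(residual + x)  # element-wise addition via the shortcut
```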
A pre-trained ResNet-18 model, trained on the ImageNet database, was utilized. The network contains 18 layers: 17 convolutional layers and one fully connected layer. Transfer learning was performed by replacing the last dense layer of the network, which consisted of 1000 neurons, with a 2-neuron layer. All convolutional layers were frozen except the last four; these four convolutional layers and the final fully connected layer were trained on our dataset. Dropout regularization was also applied to prevent overfitting. SGD with momentum was used as the optimizer, and cross-entropy loss was used to train the network. Figure 5 shows the configuration of the modified ResNet architecture.
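A minimal torchvision sketch of this fine-tuning setup follows, assuming that the last four convolutional layers correspond to torchvision's layer4 stage (two basic blocks with two convolutions each); the learning rate and momentum values are hypothetical.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load ResNet-18 pre-trained on ImageNet.
resnet = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Freeze everything, then unfreeze the final residual stage ("layer4",
# which contains the last four convolutional layers).
for param in resnet.parameters():
    param.requires_grad = False
for param in resnet.layer4.parameters():
    param.requires_grad = True

# Replace the 1000-neuron classifier with a 2-neuron layer (AM vs. BN).
resnet.fc = nn.Linear(resnet.fc.in_features, 2)

# SGD with momentum and cross-entropy loss, as described above.
optimizer = torch.optim.SGD(
    filter(lambda p: p.requires_grad, resnet.parameters()),
    lr=0.001, momentum=0.9)
criterion = nn.CrossEntropyLoss()
```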