 Original Article
 Open access
Typicality- and instance-dependent label noise-combating: a novel framework for simulating and combating real-world noisy labels for endoscopic polyp classification
Visual Computing for Industry, Biomedicine, and Art volume 7, Article number: 10 (2024)
Abstract
Learning with noisy labels aims to train neural networks with noisy labels. Current models handle instance-independent label noise (IIN) well; however, they fall short with real-world noise. In medical image classification, atypical samples frequently receive incorrect labels, rendering instance-dependent label noise (IDN) an accurate representation of real-world scenarios. However, the current IDN approaches fail to consider the typicality of samples, which hampers their ability to address real-world label noise effectively. To alleviate the issues, we introduce typicality- and instance-dependent label noise (TIDN) to simulate real-world noise and establish a TIDN-combating framework to combat label noise. Specifically, we use the sample’s distance to decision boundaries in the feature space to represent typicality. The TIDN is then generated according to typicality. We establish a TIDN-attention module to combat label noise and learn the transition matrix from latent ground truth to the observed noisy labels. A recursive algorithm that enables the network to make correct predictions with corrections from the learned transition matrix is proposed. Our experiments demonstrate that the TIDN simulates real-world noise more closely than the existing IIN and IDN. Furthermore, the TIDN-combating framework demonstrates superior classification performance when training with simulated TIDN and actual real-world noise.
Introduction
Deep learning neural networks have achieved remarkable performance [1] owing to the availability of large amounts of labeled data. Unfortunately, labeling for medical image classification is often time-consuming and expert-demanding, which can lead to incorrect annotations. Noisy labels refer to incorrect annotations, which can originate from inexperienced experts or mistakes made by annotators [2], particularly in endoscopic polyp classification with indistinct features. Noisy labels can mislead deep neural networks due to their strong ability to fit images and labels [3]. Consequently, learning with noisy labels (LNL) methods have been developed. These techniques aim to train neural networks effectively using noisy labels while achieving high accuracy (ACC) on well-annotated test sets. Previous studies [4,5,6,7,8] developed models that handle simulated instance-independent label noise (IIN) [9]. However, their effectiveness is limited in dealing with real-world label noise [10]. Under the IIN paradigm, human-generated noisy labels \(\widetilde{Y}\) are related only to the original true labels \(Y\), i.e., the noisy transition probability is \(P(\widetilde{Y}|Y)\). However, in actual scenarios, label noise is often related to the samples; for example, atypical samples are more likely to be mislabeled. This leads to the concept of instance-dependent label noise (IDN), where the transition probability becomes \(P(\widetilde{Y}|Y,X)\) and \(X\) denotes the input images. The IDN models the real-world scenario better, resulting in improved handling of real-world label noise compared with the IIN. Therefore, to address the challenge of learning with real-world label noise, it is crucial to simulate and combat it.
Methods for simulating label noise can be divided into IIN and IDN. The simulated IIN flips the original labels using a noise transition probability matrix [11,12,13]. This process depends only on the class of the original label. Classic IIN includes random flipping and pair flipping noise. In the IDN paradigm, the simulated label noise described in ref. [14] converts the pixel values into the probability of flipping labels. This approach links instances to the probability of flipping; however, it lacks a sound rationale and ignores the typicality of the samples. Cheng et al. [15] presented a boundary noise model confined to two-dimensional feature spaces. This approach is overly simplistic for complex, multidimensional spaces and falls short of accurately representing real-world label noise. The current IDN fails to consider the critical factor of typicality, particularly in medical tasks. In practical scenarios, the mislabeling of data often correlates with the typicality of the instance features. Figure 1 demonstrates how beginners might find it challenging to correctly identify small atypical lesions, as shown in the second column. Similarly, both experts and novices may misclassify a blurred adenoma polyp, as shown in the third column.
Methods for combating label noise can be categorized into model-based or model-free approaches based on whether they model the noisy transition distribution from the ground truth to noisy labels. Model-free approaches do not model the noise paradigm (i.e., IIN or IDN). They mainly rely on the “small loss trick” [16], which suggests that the training loss for samples with noisy labels tends to be larger than for those with ground truth. This category includes methods such as MentorNet [17], co-teaching [4], and co-teaching+ [16]. Sample selection methodologies for identifying labels likely to be valid for network training have emerged. Double-branch networks [4, 10, 16] have enhanced the selection precision. However, the “small loss trick” is ineffective for the IDN paradigm [14], as neural networks may overfit complex decision boundaries. Semi-supervised learning methods [5, 18, 19] have also been adapted for the LNL problem. These methods leverage the information within the images of noisy samples to assist in selecting and correcting noisy labels. However, these methods do not fully utilize the information in noisy labels, and the correction error for noisy labels remains uncontrolled.
In comparison, model-based methods are deemed more reliable because they theoretically guarantee an optimal classifier for modeling the distribution of true labels. These methods introduce a noisy transition matrix \(T\left({\varvec{X}}\right)\), where \({\varvec{X}}\) denotes the raw instances. This matrix represents the transition probability from the latent ground truth to the observed noisy labels. Given oracle \({T}^{*}\), a statistically consistent model can be learned by minimizing the cross-entropy loss reweighted by \({T}^{*}\) [20]. However, the existing model-based studies rely on strong assumptions. Under the IIN assumption, which implies \(T\left({\varvec{X}}\right)={T}_{c\times c}\), ref. [6] established a Softmax layer representing the IIN transition channel, which is optimized in an expectation-maximization manner. Anchor-point methods [21], which assume that the samples predicted most confidently by the neural network are correct and treat them as anchor points, estimate and fill the simple \({T}_{c\times c}\). Unfortunately, the estimated \({T}_{c\times c}\) of IIN does not help with real-world noisy labels. Under the complex IDN assumption, Xia et al. [14] assumed that the noisy transition matrix depends only on the parts [22] of the instances rather than the raw images. Part-dependent methods are ineffective for medical images, which have more complex features and are difficult to decompose into parts. Cheng et al. [15] introduced a method designed to be robust to binary boundary noise and validated it in a two-dimensional feature space, which is inapplicable to complex medical image classification tasks. CSIDN [23] estimated \(T\left({\varvec{X}}\right)\) according to the confidence of each sample but did not consider overconfidence from neural networks. In addition to relying on the strong assumptions mentioned above, these model-based IDN methods overlook the relationship between typicality and the noisy transition matrix, a relationship that holds in real-world settings.
We introduce typicality- and instance-dependent label noise (TIDN) to simulate real-world label noise and develop a TIDN-combating framework to combat the label noise. A TIDN is generated by disturbing the original labels according to the typicality of the samples. We propose using the distance between the samples and decision boundaries to represent typicality, calculated using a support vector machine (SVM) [24]. In the TIDN-combating framework, we establish a TIDN-attention module to link the features and the noisy transition matrix. A recursive algorithm is proposed to enable the framework to learn the noisy transition matrix, following the spirit of the expectation-maximization (EM) algorithm. The classification network makes correct predictions with corrections from the learned noisy transition matrix. Moreover, we propose using an instance-independent noisy transition matrix to initialize the instance-dependent matrix in the recursive algorithm.
Our main contributions are as follows:

We introduce a TIDN to simulate real-world label noise closely. In the TIDN paradigm, atypical samples are more likely to be mislabeled. We propose using the distance between the samples and decision boundaries to represent typicality, calculated using an SVM.

We propose the TIDN-combating framework to combat label noise. This method establishes a TIDN-attention module that maps features to a per-sample noisy transition matrix. A recursive algorithm is introduced to enable the framework to learn the transition matrix following the EM algorithm. The network can generate accurate predictions by learning the transition relationship instead of overfitting noisy labels.

Experiments were conducted to demonstrate that the TIDN mirrors real-world label noise more closely than existing simulation paradigms. The TIDN-combating framework exhibits superior performance for both simulated and real-world label noise. This is evidenced by the higher test ACC when training with simulated and real-world label noise.
The remainder of this paper is organized as follows. Methods and experimental setups are described in detail in the Methods section. The experimental results are reported in the Results section to demonstrate the effectiveness of the proposed method. In the Discussion section, we provide an extended discussion.
Methods
The workflow of the proposed methods is depicted in Fig. 2. To address the problem of combating real-world label noise, we first seek a simulated label noise to approximate the real world. After that, we design a TIDN-combating framework to combat the well-simulated label noise. With the success in combating well-simulated noise, this framework can also address real-world label noise.
Preliminaries
In a \(C\)-class classification task, we are provided with \(N\) training pairs \({\left\{\left({x}_{n}, {\widetilde{y}}_{n}\right)\right\}}_{n=1}^{N}\) and \(M\) testing pairs \({\left\{\left({x}_{n},{y}_{n}\right)\right\}}_{n=1}^{M}\), where \({x}_{n}\) represents the input medical images and \({\widetilde{y}}_{n},{y}_{n}\in \left\{1,\dots ,C\right\}\) are the corresponding real-world noisy labels and ground truth, respectively.
The simulation objective is to generate instance-dependent noisy labels \({y}_{n}^{\prime}\) that are closely aligned with the real-world noise \({\widetilde{y}}_{n}\). Under the IIN paradigm, \({y}_{n}^{\prime}\) depends solely on the original true label, \({y}_{n}\). The probability that the generated noisy label belongs to a certain class \(j\) is \(P({y}_{n}^{\prime}=j|{y}_{n}=i)\). Under the IDN paradigm, \({y}_{n}^{\prime}\) depends on \({y}_{n}\) and the input image \({x}_{n}\). The corresponding probability of flipping is \(P\left({y}_{n}^{\prime}=j|{y}_{n}=i,{x}_{n}\right)\).
The objective of combating label noise is to train a deep neural network classifier using the pairs \({\left\{\left({x}_{n}, {y}_{n}^{\prime}\right)\right\}}_{n=1}^{N}\) so that it performs well on the test set \({\left\{\left({x}_{n},{y}_{n}\right)\right\}}_{n=1}^{M}\).
Simulating the TIDN
Given a dataset \({\left\{\left({x}_{n},{\widetilde{y}}_{n},{y}_{n}\right)\right\}}_{n=1}^{N}\), we generated a simulated \({y}_{n}^{\prime}\) that could be in close proximity to the real-world noise \({\widetilde{y}}_{n}\) under the IDN paradigm. In actual medical labeling scenarios, instances with typical characteristics are less likely to be mislabeled than those with atypical characteristics. Based on this observation, we propose a method that converts the per-sample distance from the classification boundary into the probability of label disturbance. Figure 3 presents a simplified illustration of the proposed TIDN model. This highlights that samples located at the classification boundaries are susceptible to mislabeling. However, it is important to note that the feature space often has a higher dimensionality in image classification tasks.
An SVM was used to calculate the classification boundary within the feature space explicitly. The boundary hyperplanes, as defined by the “one-versus-rest” SVM approach [24], are denoted as \({H}_{i}\), where \(i\in \{1,\dots,C\}\) represents the classes. The Euclidean distance to \({H}_{i}\) of each instance is denoted as \({d}_{ti}\), where \(t\in \{1,\dots,N\}\) denotes the instances. The probability of an instance label being disturbed was then established using the following equation:
where \(j=\underset{i\in\{1,\dots,C\}}{\operatorname{argmax}}\;d_{ti}\). The maximum distance from the \(C\) channels is translated into the probability of label flipping for the \(t\)th sample. Equation (1) ensures that the smaller the distance of sample \(t\) from the hyperplane, the higher the likelihood of label flipping owing to its lower typicality. \(\lambda\) is a hyperparameter for controlling the noise ratio of the simulated noisy dataset. After identifying the samples whose labels are flipped using Eq. (1), we determine the specific class to which these labels are flipped. This process involves
where \(i\) represents the class of the original true label \(y\), and \(j\) represents the class of the noisy label \({y}^{\prime}\) after flipping. The condition \(i\ne j\) ensures that labels do not flip to their original class. The Softmax function transforms a \(C\)-dimensional distance vector into a probability distribution of length \(C\) that sums to one.
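The two-stage procedure described above (first decide whether a label flips, then decide where it flips to) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' Eq. (1) or Eq. (2): the exponential decay `exp(-lam * |d|)` is a stand-in for the flipping probability, the target class is drawn from a Softmax over the per-class distances with the original class masked out, and the function name `simulate_tidn` and its arguments are hypothetical. The distances are taken as an input, for example the per-class decision values of a one-versus-rest SVM.

```python
import numpy as np

def simulate_tidn(distances, labels, lam=1.0, seed=0):
    """Flip labels with a probability that decays with distance to the boundary.

    distances: (N, C) array of per-class distances d_ti (assumed given).
    labels:    (N,) array of true class indices in [0, C).
    lam:       scale that controls the overall noise ratio (role of lambda).
    """
    rng = np.random.default_rng(seed)
    distances = np.asarray(distances, dtype=float)
    labels = np.asarray(labels)
    n, c = distances.shape

    # Typicality proxy: the largest per-class distance of each sample.
    d_max = distances.max(axis=1)
    # ASSUMPTION: exponential decay stands in for the paper's Eq. (1).
    p_flip = np.exp(-lam * np.abs(d_max))
    flip = rng.random(n) < p_flip

    # ASSUMPTION: the target class is drawn from a Softmax over the distances,
    # with the original class masked out so that i != j.
    logits = distances.copy()
    logits[np.arange(n), labels] = -np.inf
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)

    noisy = labels.copy()
    for t in np.flatnonzero(flip):
        noisy[t] = rng.choice(c, p=probs[t])
    return noisy, p_flip
```

In this sketch, a larger `lam` lowers the flipping probability everywhere and therefore the noise ratio; matching a target ratio such as that of a human annotator would require tuning `lam` on the data.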
Combating label noise
TIDN-combating
Having successfully simulated a TIDN that closely mirrors real-world scenarios, we introduce the TIDN-combating framework. Let \({\varvec{X}}\in {\mathbb{R}}^{h\times w}\) denote the input image, and let \(Y,\widetilde{Y}\in {\left\{0,1\right\}}^{C}\) represent the one-hot latent ground truth and observed labels, respectively. Let \({\ell}\) represent the cross-entropy loss for classification, and let \(\theta\) denote the parameters of the classification network. Directly minimizing \({\mathbb{E}}_{{\varvec{X}},\widetilde{Y}}\left[{\ell}\left({f}_{\theta }\left({\varvec{X}}\right),\widetilde{Y}\right)\right]\) leads deep networks to memorize the noisy labels. To learn the correct distribution guided by ground truth \(Y\), the oracle noisy transition matrix \({T}^{*}\left({\varvec{X}}\right)=P\left(\widetilde{Y}|Y,{\varvec{X}}\right)\) is introduced, as minimizing \({\mathbb{E}}_{{\varvec{X}},\widetilde{Y}}\left[{\ell}\left({T}^{*}{f}_{\theta }\left({\varvec{X}}\right),\widetilde{Y}\right)\right]\) has the same effect as minimizing \({\mathbb{E}}_{{\varvec{X}},Y}\left[{\ell}\left({f}_{\theta }\left({\varvec{X}}\right),Y\right)\right]\). Here, we introduce the structure of the TIDN-combating framework and its corresponding recursive algorithm, illustrating the construction of \({T}^{*}\left({\varvec{X}}\right)\). With the modeling of \({T}^{*}\left({\varvec{X}}\right)\), the fitting of the observed \(\widetilde{Y}\) leads to the fitting of the latent \(Y\).
An overview of the TIDN-combating framework is presented in Fig. 4. During the training stage, the feature extraction backbone \({\omega }_{1}\) outputs the embedded features \(F\). The classification head \({\omega }_{2}\) is expected to predict the ground truth \(Y\), and the noise modeling phase is expected to construct the mapping from the embedded features to the instance-dependent noisy transition matrix \(T({\varvec{X}})\), which is an intermediate product rather than a given parameter [6]. The observed \(\widetilde{Y}\) is calculated by multiplying \(T({\varvec{X}})\) with \(Y\). In the testing phase, the predictions are output through \({\omega }_{1}\) and \({\omega }_{2}\).
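The multiplicative relationship between \(T({\varvec{X}})\), \(Y\), and \(\widetilde{Y}\) can be illustrated with a small numerical sketch. It assumes the column convention used in this paper, where entry \((j,i)\) of the matrix is the probability of observing noisy class \(j\) given true class \(i\); the function names `noisy_posterior` and `forward_corrected_loss` are hypothetical and the values are illustrative only.

```python
import numpy as np

def noisy_posterior(T, y_clean):
    """Map clean-class probabilities to noisy-label probabilities.

    T:       (C, C) matrix with T[j, i] = P(noisy = j | true = i, x);
             every column sums to one.
    y_clean: (C,) predicted distribution over the true label.
    """
    T = np.asarray(T, dtype=float)
    y_clean = np.asarray(y_clean, dtype=float)
    assert np.allclose(T.sum(axis=0), 1.0), "columns of T must sum to one"
    return T @ y_clean

def forward_corrected_loss(T, y_clean, noisy_label):
    """Cross-entropy between T @ y_clean and the observed noisy class index."""
    p_noisy = noisy_posterior(T, y_clean)
    return -np.log(p_noisy[noisy_label] + 1e-12)
```

For example, with `T = [[0.8, 0.3], [0.2, 0.7]]` and a clean prediction `[1.0, 0.0]`, the noisy-label distribution is the first column of `T`, `[0.8, 0.2]`. Fitting the noisy label through `T` therefore leaves the classification head free to keep predicting the clean class.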
Structure of the TIDN-attention
To build the learning pathway from the features to a per-sample noisy transition matrix, the TIDN-attention includes a \(1\times 1\) convolutional layer [26] and a fully connected layer, as depicted in Fig. 5. This architecture is aptly termed ‘attention,’ as it extracts a set of optimizable coefficients from the features, which are then applied multiplicatively to \(Y\). Notably, \(Y\) is also obtained through features using classification head \({\omega }_{2}\).
Specifically, a convolutional layer was used to downsample the features. The kernel size of the \(1\times 1\) convolutional layer is set according to \(k=\psi \left(F\right)={\left|\frac{{\log}_{2}\left(F\right)}{\gamma }+\frac{b}{\gamma }\right|}_{odd}\), where \({\left|t\right|}_{odd}\) indicates the nearest odd number to \(t\). In this study, we set \(\gamma =2,b=1\), in accordance with the default setting outlined in ref. [27], to capture local cross-feature interaction. The downsampled features were then activated by the ReLU function and mapped to a \({C}^{2}\times 1\) vector through fully connected layers, where the activation function is a Sigmoid function. The \({C}^{2}\times 1\) vector was then reshaped to match the dimensions of the noisy transition matrix, and a column-wise Softmax operation was applied so that each column is a probability distribution, in line with the definition of the noise transition matrix, whose columns represent \(P\left(\left.\widetilde{Y}\right|Y,{\varvec{X}}\right)\).
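As a small worked example of the two mechanical steps above, the sketch below computes the adaptive kernel size and turns a \({C}^{2}\) vector into a column-stochastic matrix. The rounding rule (truncate, then move even values up by one) is an assumption borrowed from common implementations of this kind of adaptive kernel; the text only says "nearest odd number". The function names `adaptive_kernel_size` and `to_transition_matrix` are hypothetical, and the learned convolutional and fully connected layers themselves are not reproduced here.

```python
import math
import numpy as np

def adaptive_kernel_size(num_features, gamma=2, b=1):
    """Kernel size k = |log2(F)/gamma + b/gamma|_odd.

    ASSUMPTION: truncate to an integer, then bump even values up by one.
    Other 'nearest odd' conventions are possible.
    """
    t = int(abs(math.log2(num_features) / gamma + b / gamma))
    return t if t % 2 == 1 else t + 1

def to_transition_matrix(vec, num_classes):
    """Reshape a C^2 vector to (C, C) and apply a column-wise Softmax."""
    m = np.asarray(vec, dtype=float).reshape(num_classes, num_classes)
    m = np.exp(m - m.max(axis=0, keepdims=True))  # stabilized exponentials
    return m / m.sum(axis=0, keepdims=True)       # each column sums to one
```

With the 768-dimensional embedded features used in this study and \(\gamma =2,b=1\), the expression inside the bars is about 5.29, so this rule gives \(k=5\).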
Recursive algorithm for noise modeling
With the proposed framework, designing a recursive method to estimate \(T({\varvec{X}})\) is feasible, following the spirit of expectation maximization. Because the EM algorithm cannot be used directly in deep networks, the proposed algorithm instead optimizes the likelihood with respect to \(T\) and \(Y\) alternately. In the training phase, the log-likelihood is
When the latent variable \(Y\), which represents the latent distribution of the ground truth, is introduced, the new log-likelihood becomes
where \(C\) is the total number of classes and \({\omega }_{3}\) represents the TIDN-attention parameters. Based on the training data, we aim to find the neural network parameters \({\omega }_{1},{\omega }_{2},{\omega }_{3}\) that maximize the likelihood function. We then introduce \({\omega }^{k-1}\), representing the parameters in the last turn, to perform an expectation-maximization process that recursively optimizes \({\omega }^{k}\). According to the EM algorithm, the evidence lower bound of the likelihood function can be derived from Jensen’s inequality
We denote \({c}_{ti}^{k-1}\), which is an \(N\times C\) matrix, as the posterior distribution of the hidden true label, given the parameters in the last iteration as
As \({c}_{ti}^{k-1}\) is the posterior distribution of the hidden true label, it can be specifically expressed using the parameters in the last turn
where \({Y}^{k-1}=f({{\varvec{X}}}_{{\varvec{t}}};{\omega }_{1}^{k-1},{\omega }_{2}^{k-1})\) and \({T}^{k-1}\left({{\varvec{X}}}_{{\varvec{t}}}\right)=f({{\varvec{X}}}_{{\varvec{t}}};{\omega }_{1}^{k-1},{\omega }_{3}^{k-1})\); \(j\) refers to the row number where 1 is located in the one-hot label \({\widetilde{Y}}_{t}\). Note that the calculation of \({c}_{ti}^{k-1}\) generates no gradients in the network. As \(T({\varvec{X}})\) is generated by \({\omega }_{3}\) and \(Y\) is predicted by \({\omega }_{2}\), the second term in Eq. (5) can be divided into two alternating terms:
The final loss function to be optimized in the neural networks can then be written as the negative of the log-likelihood function
where \(j\) refers to the row number where 1 is located in the one-hot label \({\widetilde{Y}}_{t}\).
\({c}_{ti}^{k-1}\) is obtained using the noisy labels and the parameters in the last turn; the first term in Eq. (9) is directly calculated from \(f({{\varvec{X}}}_{{\varvec{t}}};{\omega }_{1}^{k},{\omega }_{3}^{k})\), which equals the \(i\)th column of \(T({\varvec{X}})\). The last term in Eq. (9) is the prediction result of \(f({{\varvec{X}}}_{{\varvec{t}}};{\omega }_{1}^{k},{\omega }_{2}^{k})\). The first term in Eq. (9) also represents the expected log-likelihood function \({\mathbb{E}}_{y}(\log P(\widetilde{Y}|y,{\varvec{X}}))\), and \({\omega }_{1}^{k},{\omega }_{3}^{k}\) are optimized through gradient descent with \({\omega }_{2}^{k}\) fixed. The latter term in Eq. (9) also denotes the Kullback-Leibler divergence between the prior and posterior distributions of the latent true labels, and it is optimized with \({\omega }_{3}\) fixed. The pseudocode is presented in Algorithm 1.
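One iteration of the recursion described above can be sketched for a single sample as follows. This is a hedged illustration, not a reproduction of the authors' Eqs. (6) to (9): it assumes the usual Bayes-rule posterior, in which the weight of true class \(i\) is proportional to the product of the \((j,i)\) entry of the previous transition matrix and the previous prediction for class \(i\), and it assumes both loss terms take a cross-entropy form weighted by that posterior. The function names and the use of plain arrays instead of network outputs are hypothetical.

```python
import numpy as np

def posterior_true_label(T_prev, y_prev, noisy_label):
    """E-step: c_i is proportional to T_prev[j, i] * y_prev[i], j = noisy class.

    Computed from the previous iteration's outputs and then treated as a
    constant, so no gradient would flow through it.
    """
    row = np.asarray(T_prev, dtype=float)[noisy_label, :]
    joint = row * np.asarray(y_prev, dtype=float)
    return joint / joint.sum()

def m_step_losses(c, T_cur, y_cur, noisy_label, eps=1e-12):
    """Two alternately minimized terms (ASSUMED cross-entropy forms).

    loss_T would update the transition branch with the classifier head fixed;
    loss_Y would update the classifier head with the transition branch fixed.
    """
    T_row = np.asarray(T_cur, dtype=float)[noisy_label, :]
    loss_T = -np.sum(c * np.log(T_row + eps))
    loss_Y = -np.sum(c * np.log(np.asarray(y_cur, dtype=float) + eps))
    return loss_T, loss_Y
```

For instance, with `T_prev = [[0.9, 0.2], [0.1, 0.8]]`, a uniform previous prediction, and observed noisy class 0, the posterior puts weight 9/11 on true class 0 and 2/11 on true class 1. In a real training loop these quantities would be tensors and the two losses would be backpropagated through the corresponding branches in alternation.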
Initialization of parameters
The successful convergence of the network training hinges on a careful and precise initialization of its parameters for both \({\omega }_{2}\) and \({\omega }_{3}\). We initialized \(T\left({\varvec{X}}\right)\) with \(T\) using the IIN method [7]. Because \(T\left({\varvec{X}}\right)\) in our method is an intermediate product of the network and not a directly adjustable parameter, it necessitates the use of a learning approach to initialize \(T\left({\varvec{X}}\right)\) using \(T\). Under the IIN paradigm, ref. [7] outputs recognized noisy samples using \(T\). We utilized the recognized noisy samples to train \({\omega }_{3}\) while fixing \({\omega }_{2}\).
In the proposed framework, \(T\left({\varvec{X}}\right)\) is obtained through the propagation paths of \({\omega }_{1}\) and \({\omega }_{3}\), whereas \(Y\) is acquired via the pathways of \({\omega }_{1}\) and \({\omega }_{2}\). Therefore, based on the multiplicative relationship \(T\left({\varvec{X}}\right)Y=\widetilde{Y}\), the network can learn the noise transition matrix of the IIN method as an initialization by fixing \({\omega }_{2}\) and optimizing \({\omega }_{3}\).
In addition, a warm-up stage is required to learn the initial distribution of \(Y\). We set warm-up epochs to optimize \({\omega }_{1}\) and \({\omega }_{2}\) while freezing \({\omega }_{3}\), as the samples with noisy labels still benefit neural networks in the early training stage [28].
Final prediction at test phase
Because \(T({\varvec{X}})\) could model the transition distribution from the ground truth to the observed noisy labels, the network shown in Fig. 4 fits both the observed noisy label \(\widetilde{Y}\) and the latent ground truth. During the training stage, the feature extraction backbone and classification head could be fed with correct supervision; the fitting of \(\widetilde{Y}\) leads to the simultaneous fitting of the ground truth. Thus, the noise modeling phase was removed during the test phase, and the remaining feature extraction backbone and classification head output the final classification predictions.
Dataset
We selected two datasets for colonoscopy image polyp classification: Kvasir V2 [29] and a colonoscopy video classification dataset [30]. The public dataset, Kvasir V2, contains 8000 images across eight categories, with 1000 images per category. These categories included dyed resection, esophagitis, ulcerative colitis, and five other classes relevant to polyp characterization. The labels sourced from clinical institutions and experts were considered accurate. The public dataset [30] comprised 152 colonoscopy videos, including 80 adenoma, 30 serrated, and 42 hyperplastic videos, amounting to three lesion types. The video lengths varied from 6 s to 76 s, with an average of approximately 30 s. The labels were derived from the histopathology results and diagnoses by expert doctors or beginners. Histopathology results provided accurate annotations, whereas diagnoses by experts and beginners were considered noisy, with noise ratios of 35.52% and 50.00%, respectively.
We employed a video classification dataset [30] with real-world label noise to validate the proposed method for simulating label noise. In this dataset, the histopathology results were considered the ground truth. The annotations made by the experts and beginners were treated as label noise with noise ratios of 35.52% and 50.00%, respectively. The effectiveness of the noise simulation methods was validated by comparing the similarity between the simulated noise and actual real-world noise.
To verify the ability of the model to combat label noise, we trained it on both simulated and real label noise data. The datasets were divided into training, validation, and test sets at an 8:1:1 ratio. The training set labels were noisy, whereas the validation and test sets contained accurate labels.
Baselines and metrics
The IIN and IDN were compared with the proposed TIDN. The IIN contains symmetric and pair-flip label noise [11,12,13]. For symmetric label noise, the labels of randomly selected instances were uniformly flipped to other classes. For pair-flip noise, labels were flipped to neighboring classes. For the simulated IDN proposed in ref. [14], the probability of flipping is related to the pixels of the images, thereby generating IDN.
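The two IIN baselines can be written as fixed transition matrices, which makes their instance-independence explicit. The sketch below is illustrative only: it assumes that "neighboring class" means the next class index, wrapping around at the end, and the function names are hypothetical.

```python
import numpy as np

def symmetric_matrix(num_classes, noise_ratio):
    """Keep a label with prob 1 - r; spread r uniformly over the other classes."""
    T = np.full((num_classes, num_classes), noise_ratio / (num_classes - 1))
    np.fill_diagonal(T, 1.0 - noise_ratio)
    return T

def pair_flip_matrix(num_classes, noise_ratio):
    """Keep a label with prob 1 - r; move r to the next class (ASSUMED cyclic)."""
    T = np.eye(num_classes) * (1.0 - noise_ratio)
    for i in range(num_classes):
        T[(i + 1) % num_classes, i] = noise_ratio
    return T

def flip_labels(labels, T, seed=0):
    """Sample noisy labels; column i of T is the distribution for true class i."""
    rng = np.random.default_rng(seed)
    return np.array([rng.choice(T.shape[0], p=T[:, y]) for y in labels])
```

Because the same matrix is applied to every sample of a class, neither baseline can concentrate the flips on atypical samples, which is the property the TIDN is designed to add.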
The comparison methods for combating label noise include the IIN and IDN methods. For the IIN methods, we selected co-teaching+ [16], which selects clean samples; DivideMix [5], which uses semi-supervised learning; and noise layer [6], which uses IIN layers similar to our work. For the IDN methods, we selected the part-decomposing method, part-dependent [14], and the confidence-score-based method, CSIDN [23]. The baseline was set as a ViT trained directly on the noisy labels.
The mean total distance is a metric [31] used to measure the difference between the distributions of a real-world and a simulated noisy dataset. Let \({D}_{1}={\left\{{x}_{i},{y}_{i}^{1}\right\}}_{i=1}^{N}\) and \({D}_{2}={\left\{{x}_{i},{y}_{i}^{2}\right\}}_{i=1}^{N}\) be the same dataset with two types of noisy labels. The mean total distance between datasets \({D}_{1}\) and \({D}_{2}\) is defined as
where \({y}_{i}^{1}\) and \({y}_{i}^{2}\) are soft labels representing probability distributions over \(\left\{1,...,C\right\}\).
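A metric of this kind can be sketched as the average total-variation distance between the two soft labels of each sample. The normalization below (half the sum of absolute differences per sample) is an assumption, since the exact definition is not reproduced in this text, and the function name `mean_total_distance` is hypothetical.

```python
import numpy as np

def mean_total_distance(y1, y2):
    """Average total-variation distance between two (N, C) soft-label arrays.

    ASSUMPTION: the per-sample distance is 0.5 * sum_c |y1 - y2|, which lies
    in [0, 1]; the paper's exact normalization may differ.
    """
    y1 = np.asarray(y1, dtype=float)
    y2 = np.asarray(y2, dtype=float)
    return float(np.mean(0.5 * np.abs(y1 - y2).sum(axis=1)))
```

Under this assumed form, identical label sets give a distance of 0 and one-hot labels that disagree on every sample give a distance of 1, so lower values mean the simulated noise is closer to the real-world noise.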
Test ACC was chosen as the metric for combating label noise. The annotations in the test and validation sets are the ground truth to prove the robustness of LNL. The test and validation sets were blinded during training.
Implementation details
The ViT [25] was chosen as the feature extraction backbone of our methods for the image classification task, and the video transformer network [32] was chosen as the backbone for the video classification task. During training, the resolution of all input images was adjusted to \(224\times 224\), and the pixel values were normalized channel-wise. The dimensions of the embedded features were \(B\times 768\), where \(B\) is the batch size. Data augmentation was performed by random cropping and vertical flipping.
The network was based on the PyTorch (version 1.9.1) framework and trained on two 12 GB NVIDIA TITAN Xp GPUs. The ViT was optimized using the stochastic gradient descent (SGD) optimizer, whereas the TIDN-attention structure was optimized using the Adam optimizer. The SGD optimizer applied an initial learning rate of 0.003, decayed by a factor of 0.2 every 10 epochs. The Adam optimizer used a fixed learning rate of 0.003. The batch size was set to eight for the image classification task and one for the video classification task.
The training set contained noisy labels, and the validation and test sets contained the ground truth. Notably, the output epoch was chosen based on the top training ACC in the last five epochs, and the validation set was blind during training, as we were studying LNL.
Results
In this section, we describe the experiments conducted on the image classification dataset with simulated label noise and the video classification dataset with real-world label noise. The Validation of the simulated TIDN subsection demonstrates that the proposed simulated noise is closer to the real-world noise. The Results for combating the TIDN subsection presents the classification performance of the TIDN-attention method in countering simulated noise. The Results for combating real-world label noise subsection demonstrates the classification performance of the TIDN-attention method when trained with real-world label noise. The Ablation study of TIDN-combating subsection presents an ablation study of the TIDN-attention module and the initialization process.
Validation of the simulated TIDN
Different approaches for simulating label noise have been applied to colonoscopy video classification datasets containing real-world label noise. The simulated label noise was compared with real-world label noise to evaluate the simulation methods. Table 1 shows the mean total distances between the existing simulated label noise and real-world label noise from a human expert (low noise level with a noise ratio of 35.52%) and a human beginner (high noise level with a noise ratio of 50.00%). The noise ratio of the human annotators was calculated based on the ACC between the ground truth and their annotations. Our simulated TIDN had the lowest mean total distance to real-world noisy labels for both the low noise ratio (0.3440) and high noise ratio (0.3581) scenarios. Notably, all the simulated label noises were aligned with the noise ratio of the real-world label noise.
The t-SNE map depicting the distribution of instances from different classes is shown in Figs. 6 and 7. In Fig. 6, the two-component t-SNE map shows the distribution of labels in the feature space. The three classes are shown in three different colors. Different simulated label noises with the same noise ratio (50.00%, aligned with that of a human expert) and ground truth are presented. The human expert label noise was mainly distributed on the edge of the feature map, and the proposed TIDN was the closest to it in the visualization. The simulation results for the colonoscopy classification for the eight classes are presented in Fig. 7, where there is no real-world label noise. The red circle area shows that the disturbed spaces are usually at the edge of the classification boundaries, indicating that atypical samples are more easily disturbed.
Results for combating the TIDN
Methods for combating label noise were evaluated through test ACC when training with simulated and real-world label noise. The test ACC (top 5) of the different methods used for comparison is summarized in Table 2.
Notably, the training set contained only simulated noise, whereas the labels were the ground truths in the test set. The baseline indicates that the ViT is trained directly with the simulated TIDN without any methods to combat label noise. Co-teaching+, DivideMix, and noise layer ignore the dependence of label noise on instances. Part-dependent and CSIDN methods consider the instance dependence of label noise. The TIDN-attention achieves the greatest improvement, from 87.81% to 92.44% and from 67.82% to 86.23% for the 15% and 40% noise ratios, respectively. Under a 70% noise ratio, DivideMix achieved the highest test ACC of 56.41%, whereas our method achieved 52.31%, compared with the baseline of 34.82%.
Figure 8 illustrates the training process of the proposed method, including the curves for ACC and loss during training. The labels in the validation set were accurate, the training set labels were noisy, and the validation set data remained unseen during training. Baseline refers to the classification network being trained directly on noisy data without using methods to counter label noise. TIDN-attention represents the proposed classification network combating label noise. Figure 8a and c show that when trained with noisy data, the classification network gradually overfitted noisy labels as the number of epochs increased. This was evidenced by the continuously decreasing loss of the training set, whereas the loss of the validation set initially decreased and then increased. The TIDN-attention method proposed in this study enables the network to fit noisy and accurate labels simultaneously. This is shown in Fig. 8b and d, where both the training and validation sets show increased ACC.
Results for combating real-world label noise
Results for the real-world label noise are presented in Fig. 9. The test set contained 15 unique videos with ground-truth labels from histopathology. The baseline denotes that the network is trained directly on noisy labels without any method for combating the label noise. Training on the ground truth also serves as the upper bound because clean labels guide the network. Our proposed method achieved 86.67%, the same performance as the upper bound, when combating real-world label noise from human beginners. It also achieved the highest improvement, from 40.00% to 80.00%, for label noise from human experts. Among the other methods, only CSIDN, which is designed for IDN, yielded an effective improvement, from 40.00% to 66.66%.
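Because the test set has only 15 videos, each reported ACC corresponds to a whole number of correctly classified videos. The sketch below assumes that ACC is simply correct videos divided by total videos. The counts are inferred from the percentages and are not stated in the paper. Under that assumption, 10/15 rounds to 66.67%, so the reported 66.66% is consistent with truncation rather than rounding.

```python
N_VIDEOS = 15  # size of the real-world test set stated in the text


def acc_percent(correct, total=N_VIDEOS):
    """Accuracy in percent, rounded to two decimals."""
    return round(100 * correct / total, 2)


def implied_correct(acc, total=N_VIDEOS):
    """Nearest whole number of correct videos implied by a reported ACC."""
    return round(acc / 100 * total)


# ACC values quoted in the text (labels on the left are descriptive only).
reported = {
    "baseline": 40.00,
    "CSIDN": 66.66,
    "TIDN-attention, expert noise": 80.00,
    "upper bound / TIDN-attention, beginner noise": 86.67,
}
counts = {name: implied_correct(acc) for name, acc in reported.items()}

for name, n in counts.items():
    print(f"{name}: {n}/{N_VIDEOS} videos = {acc_percent(n):.2f}%")
```

One video therefore changes the ACC by about 6.67 points, which is worth keeping in mind when comparing methods on this test set.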
Ablation study of TIDN-combating
Figure 10 presents the results of the ablation experiments using the TIDN-attention algorithm. The blue solid and red dashed lines represent the results of the proposed TIDN-attention module with and without initialization, respectively. Specifically, "without initialization" refers to random initialization of \({\omega }_{3}\), and "with initialization" refers to the method described in the Initialization of parameters subsection. The green dashed line represents the scenario in which the noise transition matrix degenerates to the IIN case [6], assuming T(X) = T. Figure 10a presents the results for simulated noise with noise rates ranging from 15% to 70%, whereas Fig. 10b shows the outcomes for real noise at rates from 35.52% to 50%. Under all noise settings, the proposed method consistently outperformed the ablated methods in test set ACC.
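To make the degenerate case T(X) = T concrete, the sketch below contrasts a single shared transition matrix with a matrix that varies per instance. It illustrates the general transition-matrix idea and is not the paper's TIDN-attention module. The symmetric matrix form, the `typicality` score in [0, 1], and the `max_flip` parameter are all assumptions made for this example.

```python
def noisy_probs(clean_probs, T):
    """P(noisy = j | x) = sum_i P(clean = i | x) * T[i][j]."""
    n = len(clean_probs)
    return [sum(clean_probs[i] * T[i][j] for i in range(n)) for j in range(n)]


def symmetric_T(flip, n):
    """Row-stochastic matrix: keep a label with prob. 1 - flip, else flip uniformly."""
    off = flip / (n - 1)
    return [[1 - flip if i == j else off for j in range(n)] for i in range(n)]


def instance_T(typicality, n, max_flip=0.6):
    """Hypothetical T(X): the less typical the sample, the larger the flip rate."""
    return symmetric_T(max_flip * (1 - typicality), n)


clean = [0.7, 0.2, 0.1]  # hypothetical clean-class posterior for one image

shared = noisy_probs(clean, symmetric_T(0.2, 3))   # IIN: same T for every image
typical = noisy_probs(clean, instance_T(1.0, 3))   # typical image: T(X) is the identity
atypical = noisy_probs(clean, instance_T(0.0, 3))  # atypical image: heavy flipping

print("shared T :", [round(p, 2) for p in shared])
print("typical  :", [round(p, 2) for p in typical])
print("atypical :", [round(p, 2) for p in atypical])
```

With a shared T, every image is corrected identically; with T(X), the correction is strong only where labels are likely to be wrong, which is the distinction the ablation isolates.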
Discussion
We introduced the TIDN to simulate real-world label noise and validated this approach by comparing its mean total distance to real-world noise against that of existing simulated noises. Subsequently, we proposed the TIDN-combating framework to combat real-world label noise. Its performance in combating label noise was validated using simulated and real-world noisy datasets.
We first discuss the TIDN simulation. Figure 6 illustrates that the simulated TIDN closely resembles real-world label noise. In Fig. 7c, the area marked by the red circle indicates that the samples near the decision boundaries were prone to disturbances. As the t-SNE map represents an abstract feature space, instances on the classification boundaries were effectively identified as atypical. The mean square distances in Table 1 indicate that, among the simulated noises, the proposed label noise is the closest to real-world noise. Because the proposed TIDN closely mimics real-world label noise, it can be used to validate LNL methods when real-world label noise and ground truth data are unavailable.
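The idea that samples near decision boundaries are the ones disturbed can be illustrated with a deliberately simplified simulation. This is not the paper's TIDN generation procedure, which is defined in its Methods section. Here it is merely assumed that each sample has a margin (distance to the nearest decision boundary) and a runner-up class, and that the samples with the smallest margins are flipped until the target noise ratio is reached.

```python
def simulate_typicality_noise(labels, margins, runner_up, noise_ratio):
    """Flip the least typical samples (smallest margin) to their runner-up class.

    labels      : clean class index per sample
    margins     : assumed distance to the nearest decision boundary per sample
    runner_up   : assumed second most likely class per sample
    noise_ratio : fraction of samples to corrupt
    """
    n_flip = round(noise_ratio * len(labels))
    least_typical_first = sorted(range(len(labels)), key=lambda i: margins[i])
    noisy = list(labels)
    for i in least_typical_first[:n_flip]:
        noisy[i] = runner_up[i]
    return noisy


# Hypothetical toy data: six samples, three classes.
labels = [0, 0, 1, 1, 2, 2]
margins = [0.90, 0.10, 0.80, 0.20, 0.70, 0.05]
runner_up = [1, 1, 2, 0, 0, 1]

noisy = simulate_typicality_noise(labels, margins, runner_up, noise_ratio=0.5)
actual_ratio = sum(a != b for a, b in zip(labels, noisy)) / len(labels)
print(noisy, actual_ratio)
```

In this toy setting only the three low-margin samples change label, mirroring the observation that disturbed labels concentrate at the class boundaries.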
The noise resistance of the TIDN-combating framework was demonstrated for real and simulated noise. Table 2 shows that the proposed method combats the TIDN better than the other methods in most settings, and Fig. 9 shows that it also effectively combats real-world noise. Co-teaching+, DivideMix, and the noise layer ignore the instance dependence of label noise. Co-teaching+ is ineffective because the small-loss trick does not apply to IDN. DivideMix has the best performance under 70% simulated label noise; however, it performs poorly in the other settings. The noise layer is limited because its theory rests on the instance-independence assumption. Among the methods that consider instance dependence, the part-dependent method does not perform as well because part decomposition does not suit complex colonoscopy images for medical use. CSIDN offers some improvement over the baseline; however, it remains limited because the confidence score easily drives networks into overconfidence.
Figure 8 indicates that our method fits both the noisy labels (high training ACC) and the latent ground truth (high validation ACC). For the baseline method, the performance on the validation set first increases and then declines as the training ACC increases to the point of overfitting. In contrast, during the training of TIDN-attention, the validation ACC continues to increase even when the training ACC rises above 90%. The loss curve shows convergence after a sudden rise in the warm-up and initialization epochs. The training ACC and loss were calculated using noisy labels, whereas the validation ACC and loss were calculated using the ground truth. Training and validation ACC increase simultaneously because our recursive algorithm optimizes the likelihood of T(X) and the latent ground truth. The structure fits the observed noisy labels while also fitting the ground truth distribution with the assistance of T(X). Note that although the validation set contains accurate labels, it remains unseen during training, as would be the case in actual LNL scenarios. Despite this, the experiments demonstrate that the proposed method can learn the distribution of label noise and that of the true labels simultaneously. Therefore, the convergence of the training loss signifies that a neural network robust to label noise has been obtained.
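The simultaneous fit can be written compactly. The following is a sketch under assumed notation, not the paper's own derivation: \(T_{ij}(X)\) denotes \(P(\widetilde{Y}=j \mid Y=i, X)\), \(f_i(X)\) denotes the network's estimate of \(P(Y=i \mid X)\), and \(C\) is the number of classes.

```latex
\[
P(\widetilde{Y}=j \mid X) = \sum_{i=1}^{C} T_{ij}(X)\, f_i(X),
\qquad
P(Y=i \mid X, \widetilde{Y}=j)
  = \frac{T_{ij}(X)\, f_i(X)}{\sum_{k=1}^{C} T_{kj}(X)\, f_k(X)}
\]
```

Training on noisy labels raises the first quantity for the observed label, whereas the second is the posterior over the latent ground truth. When T(X) is well estimated, \(f(X)\) is free to track the true labels, which is consistent with the training ACC (noisy labels) and validation ACC (ground truth) rising together.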
Figure 10 shows the results of the ablation study. Comparisons between the noise layer and TIDN-attention highlight the benefits of modeling an instance-dependent T(X) rather than an instance-independent T. The baseline approach with no modeling of T performed poorly. The proposed initialization of T(X) is necessary, as it outperforms random initialization. This is because initialization restricts the degrees of freedom of T(X), thereby enhancing performance.
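A short count shows why the degrees of freedom matter. The sketch assumes that every row of a transition matrix is a probability distribution, so a C-class matrix has C(C - 1) free entries. The figure of 1000 training images is hypothetical.

```python
def free_parameters(num_classes, num_instances=1):
    """Free entries of row-stochastic transition matrices.

    Each row sums to 1, leaving C - 1 free entries per row,
    hence C * (C - 1) per matrix and that many again for every instance.
    """
    return num_instances * num_classes * (num_classes - 1)


shared_T = free_parameters(3)               # one matrix shared by all images (IIN)
per_instance_T = free_parameters(3, 1000)   # one matrix per image, 1000 images (hypothetical)

print(shared_T, per_instance_T)
```

An unconstrained T(X) therefore has far more unknowns than observed labels, which is why some form of restriction, such as initialization, is needed.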
The limitation of our work lies in the need for better initialization to limit the degrees of freedom of T(X), or for a theoretical treatment of the degrees-of-freedom problem of an instance-dependent noisy transition matrix. In addition, our method can be applied to the latest classification methods, such as those based on diffusion models [33, 34], to mitigate the impact of incorrect labels.
Conclusions
We introduced a novel simulated TIDN that closely approximates real-world label noise. Because the TIDN aligns well with real-world scenarios, effectively combating the TIDN leads to effectively combating real-world label noise. Therefore, we developed the TIDN-combating framework, which includes the TIDN-attention block and a corresponding recursive algorithm. This framework simultaneously fits the observed noisy labels and the latent ground truth by modeling a noisy transition matrix, ultimately leading to accurate classification predictions. Our experiments demonstrate that the TIDN closely mimics real-world noise. Furthermore, the TIDN-combating framework achieves superior ACC on the test set annotated with ground truth, whether trained on datasets with simulated or real-world noisy labels.
Availability of data and materials
The datasets used during the current study are available from the corresponding author upon reasonable request.
Abbreviations
LNL: Learning with noisy labels
IIN: Instance-independent label noise
IDN: Instance-dependent label noise
TIDN: Typicality- and instance-dependent label noise
SVM: Support vector machine
EM: Expectation-maximization
ViT: Vision transformer
ReLU: Rectified linear unit
SGD: Stochastic gradient descent
ACC: Accuracy
References
Krizhevsky A, Sutskever I, Hinton GE (2017) ImageNet classification with deep convolutional neural networks. Commun ACM 60(6):84–90. https://doi.org/10.1145/3065386
Karimi D, Dou HR, Warfield SK, Gholipour A (2020) Deep learning with noisy labels: exploring techniques and remedies in medical image analysis. Med Image Anal 65:101759. https://doi.org/10.1016/j.media.2020.101759
Arpit D, Jastrzębski S, Ballas N, Krueger D, Bengio E, Kanwal MS et al (2017) A closer look at memorization in deep networks. In: Proceedings of the 34th international conference on machine learning, JMLR.org, Sydney, 6-11 August 2017
Han B, Yao QM, Yu XR, Niu G, Xu M, Hu WH et al (2018) Co-teaching: robust training of deep neural networks with extremely noisy labels. In: Proceedings of the 32nd international conference on neural information processing systems, Curran Associates Inc., Montréal, 2-8 December 2018
Li JN, Socher R, Hoi SCH (2020) DivideMix: learning with noisy labels as semi-supervised learning. In: Proceedings of the 8th international conference on learning representations, OpenReview.net, Addis Ababa, 26-30 April 2020
Goldberger J, Ben-Reuven E (2017) Training deep neural-networks using a noise adaptation layer. In: Proceedings of the 5th international conference on learning representations, OpenReview.net, Toulon, 24-26 April 2017
Northcutt C, Jiang L, Chuang I (2021) Confident learning: estimating uncertainty in dataset labels. J Artif Intell Res 70:1373–1411. https://doi.org/10.1613/jair.1.12125
Yao JC, Han B, Zhou ZH, Zhang Y, Tsang IW (2023) Latent class-conditional noise model. IEEE Trans Pattern Anal Mach Intell 45(8):9964–9980. https://doi.org/10.1109/TPAMI.2023.3247629
Natarajan N, Dhillon IS, Ravikumar P, Tewari A (2013) Learning with noisy labels. In: Proceedings of the 26th international conference on neural information processing systems, Curran Associates Inc., Lake Tahoe, 5-10 December 2013
Jiang L, Huang D, Liu M, Yang WL (2020) Beyond synthetic noise: deep learning on controlled noisy labels. In: Proceedings of the 37th international conference on machine learning, ICML, Virtual Event, 13-18 July 2020
Rolnick D, Veit A, Belongie S, Shavit N (2017) Deep learning is robust to massive label noise. arXiv preprint arXiv: 1705.10694
Zhang CY, Bengio S, Hardt M, Recht B, Vinyals O (2021) Understanding deep learning (still) requires rethinking generalization. Commun ACM 64(3):107–115. https://doi.org/10.1145/3446776
Zhang HY, Cissé M, Dauphin YN, Lopez-Paz D (2018) mixup: Beyond empirical risk minimization. In: Proceedings of the 6th international conference on learning representations, OpenReview.net, Vancouver, 30 April-3 May 2018
Xia XB, Liu TL, Han B, Wang NN, Gong MM, Liu HF et al (2020) Part-dependent label noise: Towards instance-dependent label noise. In: Proceedings of the 34th international conference on neural information processing systems, Curran Associates Inc., Vancouver, 6-12 December 2020
Cheng JC, Liu TL, Ramamohanarao K, Tao DC (2020) Learning with bounded instance- and label-dependent label noise. In: Proceedings of the 37th international conference on machine learning, ICML, Virtual Event, 13-18 July 2020
Yu XR, Han B, Yao JC, Niu G, Tsang I, Sugiyama M (2019) How does disagreement help generalization against label corruption? In: Proceedings of the 36th international conference on machine learning, PMLR, Long Beach, 9-15 June 2019
Jiang L, Zhou ZY, Leung T, Li LJ, Fei-Fei L (2018) MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In: Proceedings of the 35th international conference on machine learning, PMLR, Stockholm, 10-15 July 2018
Patrini G, Rozza A, Krishna Menon A, Nock R, Qu LZ (2017) Making deep neural networks robust to label noise: a loss correction approach. In: Proceedings of 2017 IEEE conference on computer vision and pattern recognition, IEEE, Honolulu, 21-26 July 2017. https://doi.org/10.1109/CVPR.2017.240
Xu Z, Lu DH, Luo J, Wang YX, Yan JP, Ma K et al (2022) Anti-interference from noisy labels: mean-teacher-assisted confident learning for medical image segmentation. IEEE Trans Med Imaging 41(11):3062–3073. https://doi.org/10.1109/TMI.2022.3176915
Yong L, Pi RJ, Zhang WZ, Xia XB, Gao JH, Zhou X et al (2023) A holistic view of label noise transition matrix in deep learning and beyond. In: Proceedings of the 11th international conference on learning representations, OpenReview.net, Kigali, 1-5 May 2023
Zhang Y, Niu G, Sugiyama M (2021) Learning noise transition matrix from only noisy labels via total variation regularization. In: Proceedings of the 38th international conference on machine learning, ICML, Virtual Event, 18-24 July 2021
Agarwal S, Awan A, Roth D (2004) Learning to detect objects in images via a sparse, part-based representation. IEEE Trans Pattern Anal Mach Intell 26(11):1475–1490. https://doi.org/10.1109/TPAMI.2004.108
Berthon A, Han B, Niu G, Liu TL, Sugiyama M (2021) Confidence scores make instance-dependent label-noise learning possible. In: Proceedings of the 38th international conference on machine learning, ICML, Virtual Event, 18-24 July 2021
Hong JH, Cho SB (2008) A probabilistic multi-class strategy of one-vs.-rest support vector machines for cancer classification. Neurocomputing 71(16-18):3275–3281. https://doi.org/10.1016/j.neucom.2008.04.033
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai XH, Unterthiner T et al (2021) An image is worth 16x16 words: transformers for image recognition at scale. In: Proceedings of the 9th international conference on learning representations, ICLR, Online, 3-7 May 2021
Woo S, Park J, Lee JY, Kweon IS (2018) CBAM: convolutional block attention module. In: Ferrari V, Hebert M, Sminchisescu C, Weiss Y (eds) Computer vision - ECCV 2018. 15th European conference, Munich, September 2018. Lecture notes in computer science, vol 11211. Springer, Heidelberg, pp 3–19. https://doi.org/10.1007/978-3-030-01234-2_1
Wang QL, Wu BG, Zhu PF, Li PH, Zuo WM, Hu QH (2020) ECA-Net: efficient channel attention for deep convolutional neural networks. In: Proceedings of the 2020 IEEE/CVF conference on computer vision and pattern recognition, IEEE, Seattle, 13-19 June 2020. https://doi.org/10.1109/CVPR42600.2020.01155
Liu S, Liu KN, Zhu WC, Shen YQ, Fernandez-Granda C (2022) Adaptive early-learning correction for segmentation from noisy annotations. In: Proceedings of the 2022 IEEE/CVF conference on computer vision and pattern recognition, IEEE, New Orleans, 18-24 June 2022. https://doi.org/10.1109/CVPR52688.2022.00263
Pogorelov K, Randel KR, Griwodz C, Eskeland SL, de Lange T, Johansen D et al (2017) KVASIR: a multi-class image dataset for computer aided gastrointestinal disease detection. In: Proceedings of the 8th ACM on multimedia systems conference, ACM, Taipei, China, 20-23 June 2017. https://doi.org/10.1145/3193289
Mesejo P, Pizarro D, Abergel A, Rouquette O, Beorchia S, Poincloux L et al (2016) Computer-aided classification of gastrointestinal lesions in regular colonoscopy. IEEE Trans Med Imaging 35(9):2051–2063. https://doi.org/10.1109/TMI.2016.2547947
Gu KR, Masotto X, Bachani V, Lakshminarayanan B, Nikodem J, Yin D (2023) An instance-dependent simulation framework for learning with label noise. Mach Learn 112(6):1871–1896. https://doi.org/10.1007/s10994-022-06207-7
Neimark D, Bar O, Zohar M, Asselmann D (2021) Video transformer network. In: Proceedings of 2021 IEEE/CVF international conference on computer vision workshops, IEEE, Montreal, 11-17 October 2021. https://doi.org/10.1109/ICCVW54120.2021.00355
Kazerouni A, Aghdam EK, Heidari M, Azad R, Fayyaz M, Hacihaliloglu I et al (2023) Diffusion models in medical imaging: a comprehensive survey. Med Image Anal 88:102846. https://doi.org/10.1016/j.media.2023.102846
Packhäuser K, Folle L, Thamm F, Maier A (2023) Generation of anonymous chest radiographs using latent diffusion models for training thoracic abnormality classification systems. In: Proceedings of the IEEE 20th international symposium on biomedical imaging, IEEE, Cartagena, 18-21 April 2023. https://doi.org/10.1109/ISBI53787.2023.10230346
Acknowledgements
Not applicable.
Funding
This research was funded by the National Natural Science Foundation of China, No. 62371139; and the Science and Technology Commission of Shanghai Municipality, Nos. 22ZR1404800 and 22DZ1100101.
Author information
Contributions
YG (Yun Gao) was responsible for designing and conducting the experiments and writing the paper; JF provided critical experimental data and conducted an analysis of the feasibility of the technical approach; YW reviewed and edited the paper; YG (Yi Guo) contributed to the revisions, supervision, and conceptualization of this paper.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Gao, Y., Fu, J., Wang, Y. et al. Typicality- and instance-dependent label noise-combating: a novel framework for simulating and combating real-world noisy labels for endoscopic polyp classification. Vis. Comput. Ind. Biomed. Art 7, 10 (2024). https://doi.org/10.1186/s42492-024-00162-x
DOI: https://doi.org/10.1186/s42492-024-00162-x