Flipover outperforms dropout in deep learning

Flipover, an enhanced dropout technique, is introduced to improve the robustness of artificial neural networks. In contrast to dropout, which randomly removes certain neurons and their connections, flipover randomly selects neurons and inverts their outputs using a negative multiplier during training. This approach offers stronger regularization than conventional dropout, refining model performance by (1) mitigating overfitting, matching or even exceeding the efficacy of dropout; (2) amplifying robustness to noise; and (3) enhancing resilience against adversarial attacks. Extensive experiments across various neural networks affirm the effectiveness of flipover in deep learning.


Introduction
In recent years, deep learning has demonstrated significant success across diverse fields, spanning computer vision, natural language processing, medical imaging, and drug design. Properly designed and trained artificial neural networks can adeptly model intricate patterns and nuances derived from extensive data. However, as model complexity grows, challenges emerge, notably ensuring model robustness, particularly under noisy or adversarial conditions.
In deep learning models, robustness is defined as the ability to produce consistent and reliable outputs amidst shifts and perturbations in the input data. Such variations move the distribution of the input data away from that of the training data [1], with the most prevalent case being the shift from the training dataset to the testing dataset. Models lacking robustness, as exemplified by overfitting, may excel on training data but often fail on unseen data, resulting in sub-optimal real-world performance [2].
Several methods have been proposed to address overfitting and other issues, such as those in refs. [3,4]. An established measure of model robustness is the ability to handle noisy input data. For instance, an image classifier should identify an input image even with added noise [5]. Furthermore, robustness indicates resilience against adversarial attacks. With the growing use of deep learning models in critical applications, their susceptibility to adversarial attacks has been investigated. Adversarial attacks use input data crafted to mislead the model into making incorrect predictions while remaining almost indistinguishable from the original data [6]. Adversarial defense has emerged as a prominent topic in deep learning, with multiple strategies proposed [7][8][9]; however, these defenses are often computationally expensive.
Among techniques developed to improve model robustness, dropout, introduced by Hinton et al. [10], stands out as an effective yet simple method. Dropout introduces randomness into the model by randomly setting a fraction of input units to zero during training. This strategic noise incorporation deters neuron co-adaptation, improving generalization and preventing overfitting. Initially applied to fully connected models, dropout has since been extended to various deep neural networks, including convolutional neural networks (CNNs) and transformers. Modifications to the original dropout method include DropConnect [11], DropBlock [12], MaxDropout [13], and spectral dropout [14].
Deep neural networks are increasingly applied to complex tasks, necessitating stronger regularization strategies. While dropout is effective in preventing overfitting, it offers only a limited enhancement of noise and adversarial robustness. Specialized adversarial defense algorithms are effective but often incur computational overhead [15]. Thus, algorithms that bolster overall robustness without incurring high computational costs are required. Existing research demonstrates that multiplying the model weights by Gaussian noise of the form N(1, σ), which has a positive mean, can outperform standard dropout [16].
Here, this study proposes an upgraded version of dropout, named 'flipover', which can improve the model's robustness from new angles. Unlike dropout, flipover does not merely zero out certain features. Instead, it employs a bolder approach, multiplying a selection of the original features by a negative factor, for instance, -1. This approach does not merely remove features; it introduces opposite features as perturbations, challenging the model to learn from an altered feature representation. The preliminary work indicates that flipover can (1) prevent overfitting as effectively as standard dropout; (2) improve noise robustness, which is not the primary focus of dropout; and (3) facilitate adversarial defense, because flipover efficiently generates adversarial perturbations. Although dropout was not initially designed for adversarial defense, Wang et al. [17] reported that applying dropout during testing enhances model performance under adversarial attacks. However, incorporating a large dropout factor during testing can substantially diminish the model's effectiveness on the original dataset. Conversely, the proposed method, when employed with appropriate techniques, minimally impacts the original model's performance.

Algorithm description
Figure 1 illustrates a standard fully connected network and its modifications with either dropout or flipover. Dropout randomly removes some neurons so that they take no part in the feed-forward or back-propagation processes. By contrast, flipover multiplies a certain proportion of neurons by a negative multiplier, in the hope that their negated contributions help robustify the network. This section presents the flipover formulation, following the formulation of dropout, and illustrates flipover as a stronger regularization strategy. The subsequent sections experimentally establish the advantages of flipover over dropout in several significant cases.
First, the dropout technique is formulated using a standard fully connected network with $L$ hidden layers. Let $y^{(l)}$ denote the output of the $l$-th layer ($l \in \{1, \ldots, L\}$, with $y^{(0)}$ being the inputs). The feed-forward operation can then be computed as

$$y^{(l+1)} = f\left(w^{(l+1)} y^{(l)} + b^{(l+1)}\right), \tag{1}$$

where $f$ is the activation function and $w^{(l)}$ and $b^{(l)}$ are the weights and biases of the $l$-th layer. With dropout, Eq. (1) becomes

$$y^{(l+1)} = f\left(w^{(l+1)} \hat{y}^{(l)} + b^{(l+1)}\right),$$

where $\hat{y}^{(l)} = r^{(l)} * y^{(l)}$ and $r^{(l)}$ is a vector of independent Bernoulli random variables, each of which has probability $p$ of being 1, i.e., $r^{(l)}_j \sim \mathrm{Bernoulli}(p)$ [16]. The forward operation of the flipover method is the same as the dropout operation except that $\hat{y}^{(l)}_j = -\alpha\, y^{(l)}_j$ when $r^{(l)}_j = 1$ and $\hat{y}^{(l)}_j = y^{(l)}_j$ otherwise, where $\alpha$ is a factor that controls the amplitude of the flipped variables. Hence, each element of $\hat{y}^{(l)}$ has a probability of $(1 - p)$ of remaining at its original value and a probability of $p$ of being flipped.
In deep learning, the cross-entropy loss and the least-squares (LS) loss are two commonly used loss functions. In the following two subsections, these two losses are used to show that flipover is a regularization mechanism.
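The forward operations described in this section can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names `dropout` and `flipover`, the use of NumPy, and the reading of the flipover rule as "multiply by -α with probability p, otherwise leave unchanged" are assumptions based on the description above. No rescaling of the surviving activations is applied, in keeping with the formulation in the text.

```python
import numpy as np

def dropout(y, p, rng):
    """Dropout as formulated above: r_j ~ Bernoulli(p), a unit is kept when r_j = 1."""
    r = (rng.random(y.shape) < p).astype(y.dtype)
    return r * y

def flipover(y, p, alpha, rng):
    """Flipover sketch (assumed reading): a unit is multiplied by -alpha
    with probability p and left unchanged with probability 1 - p."""
    flip = rng.random(y.shape) < p            # True marks a flipped unit
    multiplier = np.where(flip, -alpha, 1.0)
    return multiplier * y

rng = np.random.default_rng(0)
y = np.ones(10_000)
flipped = flipover(y, p=0.2, alpha=1.0, rng=rng)
print("fraction flipped:", (flipped < 0).mean())
```

With `alpha=0.0` the selected units are zeroed, so the sketch behaves like dropout that removes units with probability p.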

Derivation for cross-entropy loss
First, the cross-entropy loss is considered, with $t_i$ denoting the ground truth for the $i$-th output. If the activation function is the sigmoid function, the gradient of the standard network's loss $L_N$ with respect to $w^{(l+1)}_i$ can be computed in closed form. After applying dropout, the gradient of the dropout network's loss $L_D$ takes the same form, with the dropout mask applied through the product $\odot$. Therefore, the effect of dropout is equivalent to applying the mask $r^{(L)}$ to the gradient of the standard network. Given this dropout mask, the gradients during back-propagation are scaled. This helps to prevent the weights from receiving large gradient updates that would make them over-reliant on specific patterns or features in the training data. The effect is similar to that of weight decay (such as L2 regularization), where the magnitudes of the weight updates are constrained, although the mechanism is different.
Similarly, the gradient involving flipover is equivalent to applying a mask $r^{(L)}$ that both prevents large gradient updates and adds perturbations to the direction of the gradient. It has been shown that gradient noise can be regarded as a smoothing factor, contributing to global convergence [18]. Compared with random noise, flipover turns the selected gradient components to the opposite direction, which has a stronger and more targeted effect.
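As a sketch of the kind of expressions this argument relies on, consider a sigmoid output layer $y_i = \sigma(z_i)$ with $z_i = w^{(L)}_i \cdot \hat{y}^{(L-1)} + b^{(L)}_i$ and the binary form of the cross-entropy loss. The layer indexing, the binary form of the loss, the name $L_F$ for the flipover network's loss, and the symbol $m$ for the flipover mask are assumptions made here for illustration and may differ from the authors' notation.

```latex
\begin{align}
L &= -\sum_i \left[ t_i \log y_i + (1 - t_i) \log (1 - y_i) \right], \\
\frac{\partial L_N}{\partial w^{(L)}_i} &= (y_i - t_i)\, y^{(L-1)}, \\
\frac{\partial L_D}{\partial w^{(L)}_i} &= (y_i - t_i)\, \bigl( r \odot y^{(L-1)} \bigr),
  \qquad r_j \in \{0, 1\}, \\
\frac{\partial L_F}{\partial w^{(L)}_i} &= (y_i - t_i)\, \bigl( m \odot y^{(L-1)} \bigr),
  \qquad m_j \in \{1, -\alpha\}.
\end{align}
```

In each line, $y_i$ is the output computed from the correspondingly masked input. Under these assumptions, a dropped unit contributes no gradient to its weight, whereas a flipped unit contributes a gradient component of opposite sign scaled by $\alpha$.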

Derivation for LS loss
With simplifications, the LS loss function can express the regularization term introduced by flipover in its exact form. The losses for the normal network $L_N$ and the dropout network $L_D$ are written in terms of $I_i$, the inputs to a certain layer, and $\delta_i \sim \mathrm{Bernoulli}(p)$, where $\delta_i$ is equal to 1 with probability $p$ and 0 otherwise. In this calculation, only a linear model without activation functions is considered. The gradient of the dropout network is calculated first and, for simplicity, $w' = pw$ is assumed. Taking the expectation of the gradient of the dropout network then shows that dropout can be treated as a regularization term for the original loss function. Following this, the loss of the flipover network is expressed in the same way, with $\alpha$ set to 1 for convenient computation. The resulting gradient shows that flipover can be treated as a stronger regularization strategy than dropout. When $\alpha$ is set to zero, flipover reduces to dropout. Overall, two hyperparameters exist: the flipover rate, which indicates the proportion of neurons that will be flipped, and the flipping amplitude, which sets the negative multiplier applied to the flipped neurons.
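The expectation argument can be worked through explicitly. The following is a sketch under assumptions made here for illustration and not taken from the text: a single linear output with target $t$, independent masks, a generic multiplier $m_i$ with loss $L_M$, and rescaled weights $w' = \mu w$ with $\mu = \mathbb{E}[m_i]$. Dropout corresponds to $m_i = \delta_i$, and flipover to $m_i = 1 - (1 + \alpha)\delta_i$, which equals $-\alpha$ when $\delta_i = 1$ and 1 otherwise.

```latex
\begin{align}
L_N &= \tfrac{1}{2} \Bigl( t - \sum_i w'_i I_i \Bigr)^2, \qquad
L_M = \tfrac{1}{2} \Bigl( t - \sum_i m_i w_i I_i \Bigr)^2, \\
\frac{\partial L_M}{\partial w_i}
  &= -t\, m_i I_i + w_i m_i^2 I_i^2 + \sum_{j \neq i} w_j m_i m_j I_i I_j, \\
\mathbb{E}\!\left[ \frac{\partial L_M}{\partial w_i} \right]
  &= -t\, \mu I_i + w_i\, \mathbb{E}[m_i^2]\, I_i^2 + \sum_{j \neq i} w_j \mu^2 I_i I_j
   = \frac{\partial L_N}{\partial w_i} + \operatorname{Var}(m_i)\, w_i I_i^2, \\
\text{dropout: } \operatorname{Var}(\delta_i) &= p(1 - p), \qquad
\text{flipover: } \operatorname{Var}\bigl( 1 - (1 + \alpha)\delta_i \bigr) = (1 + \alpha)^2\, p(1 - p).
\end{align}
```

Under these assumptions, the extra term acts like a weighted L2 penalty on $w_i$. For $\alpha = 1$ its coefficient is $4p(1 - p)$, four times that of dropout at the same rate, and for $\alpha = 0$ it equals the dropout coefficient $p(1 - p)$.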

Experimental settings
Dataset and network structure. The proposed method was applied to two different neural network models and yielded promising results on both, demonstrating the effectiveness of the flipover concept across networks of varying scales. The first model is a small CNN consisting of four convolutional layers and two fully connected layers, which has been used as a standard model in many previous works [19][20][21]. The parameter settings proposed by Wang et al. [17] were used, and the model was trained on the Modified National Institute of Standards and Technology (MNIST) dataset [21]. The second model was ResNet18 [4], which was trained on the CIFAR10 dataset [22].
Implementation details. For the small CNN, a single flipover was applied between the two fully connected layers, with α set to 1. For ResNet18, since the network is deeper, adding flipover to a single layer is insufficient. In the official PyTorch documentation, ResNet18 is divided into four blocks, each containing two basic blocks, which consist of convolutional layers, batch normalization layers, and down-sampling layers. Flipover was applied between the two basic blocks of the fourth block. Further, the original single fully connected layer was replaced with two layers, and flipover was applied between them. In this case, α was set to 0.5. For a fair comparison, dropout was applied at the same positions as flipover for all networks. First, the small CNN was trained to demonstrate the effect of flipover on preventing overfitting. Subsequently, both networks were trained, and random noise was added to the test set to demonstrate the improvement in noise robustness. Finally, adversarial attacks were performed on both networks, and the accuracy under attack was compared among models without regularization and with either dropout or flipover.

Results
Overfitting prevention. Flipover was found to be effective in preventing overfitting, as evidenced by the training and test losses of the small CNN trained on the MNIST dataset (Fig. 2). In the absence of any regularization, a clear pattern of overfitting emerged: the training loss consistently declined, converging to zero, whereas the test loss initially decreased and then plateaued at a significant level. When flipover was applied, the test loss was effectively controlled, and the effect grew with the flipover proportion used. A flipover proportion of 0.2 outperformed its counterpart with a dropout rate of 0.5. Essentially, the incorporation of flipover serves as a robust measure to counteract overfitting, enhancing the model's generalizability and supporting consistent performance on unseen data. However, when the flipover rate was high, the training loss converged to a relatively large value, which may cause a performance drop. Table 1 shows the test accuracy (ACC) for different regularization methods, where the dropout/flipover rate represents the probability that a neuron is dropped or flipped. With a flipover rate of 0.2, the model achieved the highest accuracy among all the settings.
Noise suppression. Flipover was applied to the small CNN and ResNet18 models to evaluate its effect on noise suppression. The models were trained on the original datasets and tested on noisy datasets generated by adding different types of noise to the original test sets. Three common types of image noise were applied: Gaussian, Poisson, and salt-and-pepper. For the MNIST dataset, because Poisson noise did not significantly change the digital images, only Gaussian noise and salt-and-pepper noise were applied. The standard deviation of the Gaussian noise was set to 1.0, and the salt-and-pepper noise ratio was set to 0.4. For the CIFAR10 dataset, all three types of noise were applied. The standard deviation of the Gaussian noise was set to 0.1, the salt-and-pepper noise ratio was 0.05, and the scaling factor of the Poisson noise was 50 [23]. Figure 3 shows examples of the noisy data on the CIFAR10 dataset. Table 2 summarizes the experimental results. The models with flipover significantly outperformed both the original models without regularization and the models with dropout. The results confirm the efficacy of flipover in enhancing the model's resilience to noise, underscoring its potential as a tool for enhancing model reliability under noisy conditions.
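The three noise types used for the test sets can be sketched as follows. The text does not specify the exact noise models, so several details here are assumptions for illustration: images are taken to be scaled to [0, 1], results are clipped back to that range, salt and pepper are applied per array element in equal shares, and the Poisson scaling factor is read as the multiplier that converts intensities to expected counts. The function names are hypothetical.

```python
import numpy as np

def add_gaussian_noise(img, std, rng):
    """Additive zero-mean Gaussian noise, clipped to [0, 1]."""
    return np.clip(img + rng.normal(0.0, std, size=img.shape), 0.0, 1.0)

def add_salt_and_pepper(img, ratio, rng):
    """Set a fraction `ratio` of elements to 0 (pepper) or 1 (salt), half each."""
    out = img.copy()
    u = rng.random(img.shape)
    out[u < ratio / 2] = 0.0
    out[u > 1.0 - ratio / 2] = 1.0
    return out

def add_poisson_noise(img, scale, rng):
    """Poisson noise: sample counts with mean img * scale, then rescale."""
    return np.clip(rng.poisson(img * scale) / scale, 0.0, 1.0)

rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))                 # stand-in for a CIFAR10 image
noisy = {
    "gaussian": add_gaussian_noise(img, 0.1, rng),
    "salt_and_pepper": add_salt_and_pepper(img, 0.05, rng),
    "poisson": add_poisson_noise(img, 50, rng),
}
```

The parameter values in the usage lines are the CIFAR10 settings quoted above (standard deviation 0.1, ratio 0.05, scaling factor 50).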
Adversarial defense. The efficacy of the proposed method for adversarial defense was further assessed. The fast gradient sign method (FGSM) [2] was applied to attack the CNN and ResNet18 networks, with flipover and dropout each used for defense. A constant attack strength of ε = 0.25 was maintained throughout the testing phase. In ref. [2], the authors separately set the training and test dropout rates from 0 to 0.9 to find the best combination of (training rate, test rate). This setting was followed here, with the flipover rate ranging from 0 to 0.4. Figure 4 shows the accuracy of the small CNN model under adversarial attacks with different combinations of training and test parameters. Table 3 lists the best results for flipover and dropout on both networks. On both the small CNN and ResNet18 networks, flipover achieved a much higher performance under attack than dropout. Furthermore, the flipover method has several advantages. Unlike dropout, which requires a high dropout rate at test time for defense, flipover can achieve a decent defense effect when applied only during training. On ResNet18, dropout provided almost no benefit under attack, whereas flipover still greatly improved the defense ability. Generally, the optimal combination of the flipover parameters was (0.3, 0.0), and with that setting, the accuracy on the original data was close to that of the vanilla models. By contrast, with the best combination of dropout parameters (0.7, 0.9), there was a substantial decrease in the original accuracy. Collectively, these findings underscore flipover's advantage in boosting adversarial robustness and enhancing the defense ability of a network without the loss of accuracy associated with dropout.
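For readers unfamiliar with FGSM, the attack perturbs each input by ε along the sign of the loss gradient with respect to that input. The sketch below illustrates this on a logistic-regression model so that the gradient can be written analytically. It is not the experimental setup of this study: the model, its random weights, the helper names, and the clipping of inputs to [0, 1] are assumptions made for illustration; only ε = 0.25 is taken from the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(x, t, w, b):
    """Per-sample binary cross-entropy of a logistic-regression model."""
    y = np.clip(sigmoid(x @ w + b), 1e-12, 1.0 - 1e-12)
    return -(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

def fgsm(x, t, w, b, eps):
    """Single FGSM step: move each input by eps along sign(dLoss/dx)."""
    y = sigmoid(x @ w + b)
    grad_x = (y - t)[:, None] * w[None, :]    # analytic input gradient
    return np.clip(x + eps * np.sign(grad_x), 0.0, 1.0)

rng = np.random.default_rng(0)
w, b = rng.normal(size=16), 0.0               # hypothetical fixed model
x = rng.random((8, 16))                       # inputs scaled to [0, 1]
t = (sigmoid(x @ w + b) > 0.5).astype(float)  # labels the model predicts correctly
x_adv = fgsm(x, t, w, b, eps=0.25)
print("mean loss clean:", bce_loss(x, t, w, b).mean())
print("mean loss under attack:", bce_loss(x_adv, t, w, b).mean())
```

For a deep network the input gradient would come from automatic differentiation rather than a closed-form expression, but the update rule is the same.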

Discussion
Positions for flipover and parameter settings. Because flipover is a stronger regularization strategy than dropout, the positions at which flipover is applied and its parameters need to be carefully determined for optimal model performance, especially for deep neural networks. For small networks, flipover can simply be applied before the last few fully connected layers, whereas for large networks more flipover operations can be added to ensure effectiveness. In general, flipover should be applied to the deeper layers of the network to avoid early information loss. The flipover operation has two parameters, the flipover rate and the flip amplitude, both of which affect the strength of the regularization. According to the experiments, for the selected network architectures, a flipover rate of around 0.3 improves robustness without harming accuracy. However, the flip amplitude can vary significantly between models, and empirically, larger models generally prefer smaller amplitudes.
Effectiveness on transformer architecture. Recently, the application of transformer-based networks has become widespread, surpassing that of traditional deep neural networks for various tasks, including computer vision, natural language processing, and medical imaging. Given the effectiveness of dropout in these networks, the potential benefits of implementing flipover were explored. A small vision transformer (ViT) model [24] was selected, which contains six transformer blocks with an embedding size of 512 and four heads. The network was trained on the CIFAR10 dataset and subjected to FGSM attacks. Since ViT originally contained dropout layers, the initial attempt involved a straightforward replacement of the dropout layers with their flipover counterparts within the transformer blocks. However, this approach did not yield satisfactory results. Different strategies for incorporating flipover were explored, and the findings are compiled in Table 4. Although there was a notable improvement in the model's accuracy under attack, the enhancement was not as significant as that observed for the small CNN and ResNet architectures. A possible explanation is that the current integration of flipover in the transformer is sub-optimal. The foundational elements of transformers, namely the attention mechanism [25] and the embedding process [26], could be pivotal in this context. A more targeted approach that introduces perturbations within these core components could magnify the efficacy of flipover in defending against adversarial attacks.
Combination with other regularization strategies. Flipover was found to combine easily with other regularization methods. When a batch normalization layer was added to the small CNN model together with flipover, the accuracy under attack reached 67%. As an ablation study, applying batch normalization alone produced results similar to those of the vanilla model (about 20%). This demonstrates the compatibility of flipover with other regularization methods.
Limitations. The proposed flipover method showed promising results in the experiments, surpassing the performance of the same models using standard dropout, yet it has limitations. First, deciding which layers to apply flipover to requires deliberation. While larger networks generally benefit from more flipped layers, identifying the optimal strategy requires empirical adjustment, which can be time-consuming. Second, the flipover parameters, including the rate and amplitude, were manually determined. These fixed settings may not be optimal for other training scenarios, and further efforts are needed for optimal performance and generalizability. Notably, flipover, being a more intense form of regularization than dropout, can significantly degrade model performance if applied too strongly. Therefore, caution must be exercised when setting a high flipover rate. Finally, the use of flipover in large models is another interesting topic. For example, to regularize or stabilize a transformer architecture, the study hypothesizes that the attention mechanism should be the best target to perturb. This is currently being pursued as a follow-up project.

Fig. 1 Illustration of flipover operations. (a) A standard feedforward network; (b) The network with dropouts; (c) The network with flipovers

Fig. 3 Examples of test data before and after adding noise for the CIFAR10 dataset. (a) The original data without noise; (b) The data with Gaussian noise; (c) The data with Poisson noise; (d) The data with salt-and-pepper noise

Fig. 4 Adversarial defense capabilities of different methods. (a) Original test accuracies with different training and test dropout rates; (b) Test accuracies under attack with different training and test dropout rates; (c) Original test accuracies with different training and test flipover rates; (d) Test accuracies under attack with different training and test flipover rates

Table 1 Test accuracy of the small CNN model on the MNIST dataset

Table 2 Noise suppression using the flipover technique. Note: ACC Org stands for the accuracy on the original test set; ACC Gaussian stands for the accuracy on the dataset with Gaussian noise; ACC Poisson stands for the accuracy on the dataset with Poisson noise; ACC salt stands for the accuracy on the dataset with salt-and-pepper noise

Table 3 Adversarial robustness measures using no regularization, dropout, and flipover, respectively. Note: ACC Org stands for the accuracy on the original test set; ACC Att stands for the accuracy under adversarial attack; Para Comb stands for the optimal combination of training and test rates