 Original Article
 Open access
Superiority of quadratic over conventional neural networks for classification of gaussian mixture data
Visual Computing for Industry, Biomedicine, and Art volume 5, Article number: 23 (2022)
Abstract
To enrich the diversity of artificial neurons, a type of quadratic neuron was proposed previously, in which the inner product of inputs and weights is replaced by a quadratic operation. In this paper, we demonstrate the superiority of such quadratic neurons over their conventional counterparts. For this purpose, we train quadratic neural networks using an adapted backpropagation algorithm and perform a systematic comparison between quadratic and conventional neural networks for classification of Gaussian mixture data, which is one of the most important machine learning tasks. Our results show that quadratic neural networks enjoy remarkably better efficacy and efficiency than conventional neural networks in this context, and are potentially extendable to other relevant applications.
Introduction
In machine learning, the mainstream approach is now artificial neural networks (ANNs), especially deep neural networks. Usually, a neural network consists of several layers of neurons, each of which consists of a linear compartment in the form of the inner product of inputs and weights and a nonlinear unit known as an activation function that turns a signal on (activated) or off (attenuated). Deep neural networks have recently achieved remarkable successes in various applications such as natural language processing [1, 2], auto-driving [3, 4, 5], game-playing [6], image analysis [7, 8], and image reconstruction [9].
Classification/clustering is one of the essential pattern recognition techniques in machine learning, and has a wide range of applications such as bioinformatics [10, 11] and medical imaging [12, 13]. It is well known that the Gaussian mixture model (GMM) is the most popular data model. Since the prior probability of each Gaussian component is typically not given and the component assignments are latent variables, the parameters of a GMM are usually estimated using the expectation-maximization (EM) algorithm. Alternatively, a neural network approach can be used to classify GMM data. Clearly, the decision boundary for the classification can be viewed as a complicated function, which a network with a large number of neurons can approximate. After the classification network is trained, inference by the trained network is more efficient than the EM algorithm, which is iterative and time-consuming.
In our previous study [14,15,16,17,18,19], a new type of neurons, referred to as quadratic neurons, was introduced, where the inner product inside a conventional neuron is upgraded to a quadratic function. The initial motivation is to enrich the diversity of artificial neurons, inspired by the fact that the biological diversity exists at the cellular level, and such diversity enables efficiency, flexibility, functionality, and other benefits. Hence, it is hypothesized that a quadratic neural network would be advantageous similarly, which can, for example, approximate a given function with a lighter structure than a conventional neural network.
The main purpose of this paper is to highlight the superiority of quadratic over conventional neural networks with the classification task as an illustrative example. The rest of the paper is organized as follows. In the next section, we review the theoretical minimum error in the GMM classification and the EM algorithm that is traditionally used to reach that error bound. In the third section, we present our procedure for initializing and training the conventional and quadratic networks with an adapted backpropagation (BP) algorithm. In the fourth section, we perform numerical experiments systematically and establish the superiority of quadratic networks over the conventional counterparts in the GMM classification. Finally, in the last section we discuss relevant issues and conclude the paper.
Methods
GMM-based classification error
In statistical classification, the Bayes error rate is theoretically optimal. In practice, without knowing latent GMM parameters, the Bayes error rate cannot be directly calculated. To close the gap, the classic EM algorithm can be used to approximate the optimal error rate, which is the benchmark to evaluate the performance of classification neural networks.
Bayes error
Given the mean \(\mu\), covariance \(\mathbf {C}\), and prior probability \(\pi\) of each Gaussian component of GMM \(\mathcal {N}\), the posterior probability \(p\left( z_{k}=1 \mid x_n \right)\) is calculated by
\[ p\left( z_{k}=1 \mid x_{n}\right) = \frac{\pi _{k}\, \mathcal {N}\left( x_{n} \mid \mu _{k}, \mathbf {C}_{k}\right) }{\sum \nolimits ^{K}_{i=1} \pi _{i}\, \mathcal {N}\left( x_{n} \mid \mu _{i}, \mathbf {C}_{i}\right) }, \tag{1} \]
which means a D-dimensional sample vector \(x_n\), \(n\in \left\{ 1,\dots ,N\right\}\), should be assigned to the \(\hat{\mathbf {y}} \mid x_{n}\)-th Gaussian component,
\[ \hat{\mathbf {y}} \mid x_{n} = \mathop {\arg \max }\limits _{k\in \left\{ 1,\dots ,K\right\} } p\left( z_{k}=1 \mid x_{n}\right) , \tag{2} \]
where D, N, and K represent the dimensionality of the sample vector, the sample size, and the number of Gaussian components respectively. We can obtain the Bayesian inference results, a size-N vector \(\hat{\mathbf {y}}\), by applying Eq. 2 to the entire sample pool \(\mathbf {x} =\left[ x_{1},\dots ,x_{N}\right] ^{T}\) and compare it with the ground truth labels \(\mathbf {t}\). However, in most real cases these GMM parameters are not directly known. Fortunately, we can use the EM algorithm to estimate them, as described in the following subsection.
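As a minimal NumPy sketch of this Bayesian inference step, the snippet below evaluates the posterior of every component for every sample and takes the most probable one. The function names `gaussian_pdf` and `bayes_predict` and the toy parameters are illustrative assumptions, not taken from the authors' code.

```python
import numpy as np

def gaussian_pdf(x, mu, cov):
    """Density of N(mu, cov) at each row of x (shape N x D)."""
    d = x.shape[1]
    diff = x - mu
    # Squared Mahalanobis distance (x_n - mu)^T C^{-1} (x_n - mu) for every row.
    maha = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * maha) / norm

def bayes_predict(x, mus, covs, pis):
    """Return the N x K posterior matrix and the most probable component per sample."""
    weighted = np.stack(
        [pi * gaussian_pdf(x, mu, cov) for mu, cov, pi in zip(mus, covs, pis)],
        axis=1,
    )
    posterior = weighted / weighted.sum(axis=1, keepdims=True)
    return posterior, posterior.argmax(axis=1)

# Toy check (hypothetical parameters): two well-separated unit-covariance components.
rng = np.random.default_rng(0)
mus = np.array([[-3.0, 0.0], [3.0, 0.0]])
covs = [np.eye(2), np.eye(2)]
pis = [0.5, 0.5]
t = rng.integers(0, 2, size=2000)              # ground-truth component of each sample
x = mus[t] + rng.standard_normal((2000, 2))    # draw each x_n from its own component
posterior, y_hat = bayes_predict(x, mus, covs, pis)
bayes_error = np.mean(y_hat != t)              # empirical error of the Bayes rule
```

With known parameters, `bayes_error` is the empirical counterpart of the Bayes error rate; with estimated parameters the same function can be reused once the component order has been corrected.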
Note that when the parameters are estimated by the EM algorithm, the inference \(\hat{\mathbf {y}}\) cannot be directly used as the predicted label of each sample, since EM performs clustering rather than classification and returns the components in an arbitrary order. For example, while the ground truth parameters are \(\theta = \{\theta _{1}, \theta _{2}, \theta _{3}, \theta _{4}\}\), the results from the EM algorithm can be \(\hat{\theta } = \{\hat{\theta }_{2}, \hat{\theta }_{4}, \hat{\theta }_{1}, \hat{\theta }_{3}\}\), with \(\hat{\theta }_{k}\approx \theta _{k}\) for \(k\in \{1, 2, 3, 4\}\). Hence, we have to perform an order correction, i.e., rearranging \(\hat{\theta }\) as \(\{ \hat{\theta }_{1} ,\hat{\theta }_{2} ,\hat{\theta }_{3} ,\hat{\theta }_{4} \}\).
A solution to this problem is to perform an exhaustive search over all orderings so that the accuracy or loss can be optimized. By doing so, the best match will be found as our final result. More efficiently, the alternating variables method, a common derivative-free method for numerical optimization, can be used as described in Algorithm 1, with the idea of maximizing the accuracy by exchanging two coordinates each time while fixing all the remaining ones. We set MaxCycle according to the number of Gaussian components K; in our experiment \(MaxCycle = 20\), which is sufficiently large for \(K = 8\).
Then, we compute Eq. 2 using the parameters obtained after the order correction presented above to get the Bayesian inference results \(\hat{\mathbf {y}}\), and take the resulting Bayes error as the benchmark for the performance of the neural networks.
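Algorithm 1 itself is not reproduced here, so the following is only a sketch of the pairwise-exchange idea described above, under the assumption that a swap is kept whenever it strictly improves accuracy. The name `order_correction` and the toy labels are hypothetical.

```python
import numpy as np

def order_correction(y_hat, t, K, max_cycle=20):
    """Greedy pairwise swaps of the component-to-label map to maximize accuracy."""
    perm = np.arange(K)                      # perm[e] = label given to estimated component e
    best = np.mean(perm[y_hat] == t)
    for _ in range(max_cycle):
        improved = False
        for i in range(K - 1):
            for j in range(i + 1, K):
                perm[[i, j]] = perm[[j, i]]  # exchange two coordinates, keep the rest fixed
                acc = np.mean(perm[y_hat] == t)
                if acc > best:
                    best, improved = acc, True
                else:
                    perm[[i, j]] = perm[[j, i]]  # undo a swap that does not help
        if not improved:
            break                            # a full cycle without gain: stop early
    return perm, best

# Toy example mirroring the text: EM returns the components in the order 2, 4, 1, 3.
true_of = np.array([1, 3, 0, 2])             # 0-based: estimated component e is class true_of[e]
t = np.tile(np.arange(4), 25)                # ground-truth labels
y_hat = np.argsort(true_of)[t]               # what a perfect but permuted clustering returns
perm, acc = order_correction(y_hat, t, K=4)
```

Each cycle evaluates K(K-1)/2 swaps instead of the K! orderings of an exhaustive search, which is the efficiency gain the paragraph refers to.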
EM algorithm
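As a classic iterative method, the EM algorithm consists of the following two steps: expectation (E) and maximization (M). The E step evaluates the expectation function based on the currently available intermediate parameters, and the M step updates the intermediate parameters to maximize the expectation function. To estimate all the \(\mu , \mathbf {C}, \pi\) parameters, the expectation function in the E step is the posterior probability \(p\left( z_{k}=1 \mid \mathbf {x}\right)\) for \(k = 1, \dots , K\).
To start the EM procedure, for \(k = 1, \dots , K\), we initialize \(\mu ^{\left( 0\right) }_{k}\) with a size-D vector filled with values drawn from the standard normal distribution, \(\mathbf {C}^{\left( 0\right) }_{k}\) with the D by D identity matrix, and \(\pi ^{\left( 0\right) }_{k} =1/K\). Then, for the j-th iteration, \(j\ge 0\), the posterior probability in the E step is computed as
\[ p\left( z_{k}=1 \mid x_{n}\right) = \frac{\pi ^{\left( j\right) }_{k}\, \mathcal {N}\left( x_{n} \mid \mu ^{\left( j\right) }_{k}, \mathbf {C}^{\left( j\right) }_{k}\right) }{\sum \nolimits ^{K}_{i=1} \pi ^{\left( j\right) }_{i}\, \mathcal {N}\left( x_{n} \mid \mu ^{\left( j\right) }_{i}, \mathbf {C}^{\left( j\right) }_{i}\right) } \tag{3} \]
in terms of the current parameters \(\mu ^{\left( j\right) }_{k} ,\mathbf {C}^{\left( j\right) }_{k} ,\pi ^{\left( j\right) }_{k}\), \(k= 1, \dots , K\). After this E step, the M step goes as follows:
\[ \begin{aligned} \mu ^{\left( j+1\right) }_{k} &= \frac{\sum \nolimits ^{N}_{n=1} p\left( z_{k}=1 \mid x_{n}\right) x_{n}}{\sum \nolimits ^{N}_{n=1} p\left( z_{k}=1 \mid x_{n}\right) }, \\ \mathbf {C}^{\left( j+1\right) }_{k} &= \frac{\sum \nolimits ^{N}_{n=1} p\left( z_{k}=1 \mid x_{n}\right) \left( x_{n}-\mu ^{\left( j+1\right) }_{k}\right) \left( x_{n}-\mu ^{\left( j+1\right) }_{k}\right) ^{T}}{\sum \nolimits ^{N}_{n=1} p\left( z_{k}=1 \mid x_{n}\right) }, \\ \pi ^{\left( j+1\right) }_{k} &= \frac{1}{N}\sum \nolimits ^{N}_{n=1} p\left( z_{k}=1 \mid x_{n}\right) . \end{aligned} \tag{4} \]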
The E and M steps are repeated until the parameters being estimated converge within a pre-specified range or a maximum number of iterations is reached. With these estimated GMM parameters, Eq. 2 and Algorithm 1 can be used for GMM-oriented classification.
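The EM procedure described in this subsection can be sketched in NumPy as follows, using the stated initialization (standard normal means, identity covariances, uniform priors). The convergence test on the means, the small diagonal regularizer `reg`, and all names are assumptions made for this illustration.

```python
import numpy as np

def gaussian_pdf(x, mu, cov):
    """Density of N(mu, cov) at each row of x (shape N x D)."""
    d = x.shape[1]
    diff = x - mu
    maha = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    return np.exp(-0.5 * maha) / np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))

def em_gmm(x, K, max_iter=200, tol=1e-6, reg=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    N, D = x.shape
    mu = rng.standard_normal((K, D))               # mu_k^(0): standard normal entries
    cov = np.stack([np.eye(D) for _ in range(K)])  # C_k^(0): identity
    pi = np.full(K, 1.0 / K)                       # pi_k^(0) = 1/K
    for _ in range(max_iter):
        # E step: posterior of each component for each sample.
        weighted = np.stack(
            [pi[k] * gaussian_pdf(x, mu[k], cov[k]) for k in range(K)], axis=1
        )
        post = weighted / weighted.sum(axis=1, keepdims=True)
        # M step: re-estimate means, covariances, and priors.
        n_k = post.sum(axis=0)
        new_mu = (post.T @ x) / n_k[:, None]
        for k in range(K):
            diff = x - new_mu[k]
            # reg * I is an assumption added here to keep C_k well conditioned.
            cov[k] = (post[:, k, None] * diff).T @ diff / n_k[k] + reg * np.eye(D)
        pi = n_k / N
        converged = np.max(np.abs(new_mu - mu)) < tol
        mu = new_mu
        if converged:
            break
    return mu, cov, pi

# Toy data (hypothetical): two unit-covariance clusters centred at (-3, 0) and (3, 0).
rng = np.random.default_rng(1)
centers = np.array([[-3.0, 0.0], [3.0, 0.0]])
labels = rng.integers(0, 2, size=2000)
x = centers[labels] + rng.standard_normal((2000, 2))
mu_hat, cov_hat, pi_hat = em_gmm(x, K=2)
```

The returned components are in arbitrary order, which is why an order correction is needed before the estimates are compared with ground truth labels.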
Neural network training
Training a neural network involves two steps: initialization, which sets up network parameters appropriately, and optimization, which adjusts the network parameters iteratively. An optimizer used in the second step is illustrated in Fig. 1. The key idea is to perform computational optimization using the well-known BP algorithm with respect to an objective or loss function.
While the conventional and quadratic neural networks can be trained based on the same idea of computational optimization, they differ in specific steps, since the chain rule must be applied to different functions that summarize data (i.e., inner product versus quadratic operation). Specifically, let us formulate the forward and BP processes in the following two subsections respectively, and then describe the whole process in the third subsection.
Forward computation
An exemplary feedforward neural network is shown in Fig. 2, including input, hidden, and output layers. There are L layers in total, in each of which there is a number of neurons. A typical layer first implements affine transforms for conventional neurons and quadratic operations for quadratic neurons, and then nonlinear activations \(\sigma ^{(l)}\) are performed, which are common for conventional and quadratic neurons.
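An illustration of the affine layer of conventional and quadratic neurons is shown in Fig. 3. For a conventional neural network, the affine transform can be expressed in terms of an input matrix \(\mathbf {a}^{(l)}\) and a weight matrix \(\mathbf {w}^{(l)}\) plus a bias row vector \(\mathbf {b}^{(l)}\) as follows:
\[ \mathbf {z}^{\left( l\right) } = \mathbf {a}^{\left( l\right) } \mathbf {w}^{\left( l\right) } + \mathbf {b}^{\left( l\right) }. \tag{5} \]
For a quadratic neural network, the quadratic transform can be expressed as
\[ \mathbf {z}^{\left( l\right) } = \left( \mathbf {a}^{\left( l\right) } \mathbf {w}_r^{\left( l\right) } + \mathbf {b}_r^{\left( l\right) }\right) \circ \left( \mathbf {a}^{\left( l\right) } \mathbf {w}_g^{\left( l\right) } + \mathbf {b}_g^{\left( l\right) }\right) + \left( \mathbf {a}^{\left( l\right) } \circ \mathbf {a}^{\left( l\right) }\right) \mathbf {w}_b^{\left( l\right) } + \mathbf {c}^{\left( l\right) }, \tag{6} \]
where \(\mathbf {a}^{\left( l\right) } \mathbf {w}^{\left( l\right) }\) stands for matrix multiplication and \(\circ\) means an element-wise product, so that \(\mathbf {a}^{\left( l\right) } \circ \mathbf {a}^{\left( l\right) }\) is the element-wise square of the input. In this study, the ReLU function is used as the activation function, but if the l-th layer is the last layer of the network, i.e., \(l = L\), the softmax function is computed instead. Therefore, the output of each layer is computed as follows:
\[ \mathbf {a}^{\left( l+1\right) } = \sigma ^{\left( l\right) }\left( \mathbf {z}^{\left( l\right) }\right) = {\left\{ \begin{array}{ll} \mathrm {ReLU}\left( \mathbf {z}^{\left( l\right) }\right) , &{} l < L, \\ \mathrm {softmax}\left( \mathbf {z}^{\left( l\right) }\right) , &{} l = L. \end{array}\right. } \]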
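In other words, the input to the forward process is an N by D sample matrix \(\mathbf {a}^{(1)} = \mathbf {x}\), and the output is an N by K matrix \(\mathbf {a}^{(L+1)}\). The prediction of each sample vector \(x_n\) is quantified by
\[ \mathbf {y}_{n} = \mathbf {a}^{\left( L+1\right) }_{n}, \]
the n-th row of \(\mathbf {a}^{(L+1)}\), which contains the softmax outputs over the K classes.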
The loss or error is produced when the prediction differs from the ground truth. Note that in the forward computation we compute and store the output of each affine transform, which are subsequently used for the gradient descent search in the BP process described in the following subsection.
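The forward computation can be sketched in NumPy as below. The quadratic transform is written in the form used in the authors' earlier work on quadratic neurons [14], with parameter names following the \(\mathbf {w}_r, \mathbf {w}_g, \mathbf {w}_b, \mathbf {b}_r, \mathbf {b}_g, \mathbf {c}\) notation of this paper; the function names and the toy values are assumptions for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))   # shift rows for numerical stability
    return e / e.sum(axis=1, keepdims=True)

def conventional_transform(a, w, b):
    """Affine transform of a conventional layer: inner products plus bias."""
    return a @ w + b

def quadratic_transform(a, w_r, b_r, w_g, b_g, w_b, c):
    """Product of two affine terms plus a weighted element-wise square of the input."""
    return (a @ w_r + b_r) * (a @ w_g + b_g) + (a * a) @ w_b + c

def forward(x, layers, transform):
    """Run all layers; keep every pre-activation z for the later backward pass."""
    a, stored_z = x, []
    for index, params in enumerate(layers):
        z = transform(a, *params)
        stored_z.append(z)
        is_last = index == len(layers) - 1
        a = softmax(z) if is_last else relu(z)     # softmax only on the output layer
    return a, stored_z

# A single quadratic neuron with pre-activation 1 - x1^2 - x2^2: its zero level set
# is the unit circle, a boundary that needs many ReLU neurons to approximate.
zeros_w, zeros_b = np.zeros((2, 1)), np.zeros((1, 1))
circle = (zeros_w, zeros_b, zeros_w, zeros_b, -np.ones((2, 1)), np.ones((1, 1)))
points = np.array([[0.0, 0.0], [2.0, 0.0]])
z_circle = quadratic_transform(points, *circle)    # positive inside, negative outside

# A one-layer quadratic network mapping D = 2 inputs to K = 3 class probabilities.
rng = np.random.default_rng(0)
layer = tuple(rng.standard_normal(s) for s in [(2, 3), (1, 3), (2, 3), (1, 3), (2, 3), (1, 3)])
y, stored = forward(rng.standard_normal((4, 2)), [layer], quadratic_transform)
```

The circle example illustrates why elliptical GMM decision boundaries are a natural fit for quadratic neurons.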
BP formulation
To optimize a neural network, we perform numerical optimization. Specifically, we first find the partial derivatives of the loss with respect to each of the parameters and then update the parameters via gradient descent search at a suitable step size (learning rate). Using the chain rule, this process was formulated as the well-known BP algorithm, which is widely used to train a neural network. As its name indicates, the BP process computes the partial derivatives layer-wise from the output layer to the input layer. A brief BP diagram is shown in Fig. 4.
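Let Q stand for the cross-entropy loss value, averaged over the samples and defined as
\[ Q = -\frac{1}{N}\sum \nolimits ^{N}_{n=1}\sum \nolimits ^{K}_{k=1} t_{n,k}\log y_{n,k}, \]
where N is the number of sample vectors, \(\mathbf {y}_n\) is the predicted result, and \(\mathbf {t}_n\) is the ground truth label for each of the samples \(x_n\), with \(y_{n,k}\) and \(t_{n,k}\) denoting their k-th entries. Recall that the activation function of the output layer is the softmax function, hence the gradient of the output layer can be computed as
\[ \frac{\partial Q}{\partial \mathbf {z}^{\left( L\right) }} = \frac{1}{N}\left( \mathbf {y} - \mathbf {t}\right) . \tag{9} \]
If \(l\ne L\), the activation function is the ReLU function, and we have
\[ \frac{\partial Q}{\partial \mathbf {z}^{\left( l\right) }} = \frac{\partial Q}{\partial \mathbf {a}^{\left( l+1\right) }} \circ \mathbb {1}\left[ \mathbf {z}^{\left( l\right) } > 0\right] , \tag{10} \]
where \(\circ\) denotes the element-wise product and the indicator \(\mathbb {1}\left[ \cdot \right]\) equals 1 for the positive entries of \(\mathbf {z}^{\left( l\right) }\) and 0 otherwise.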
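For a conventional neural network, we know that
\[ \frac{\partial Q}{\partial \mathbf {w}^{\left( l\right) }} = \left( \mathbf {a}^{\left( l\right) }\right) ^{T} \frac{\partial Q}{\partial \mathbf {z}^{\left( l\right) }}, \qquad \frac{\partial Q}{\partial \mathbf {b}^{\left( l\right) }} = \mathbf {1}^{T} \frac{\partial Q}{\partial \mathbf {z}^{\left( l\right) }}, \tag{11} \]
where \(\mathbf {1}\) is an N by 1 vector of ones, and the gradient passed on to the preceding layer is
\[ \frac{\partial Q}{\partial \mathbf {a}^{\left( l\right) }} = \frac{\partial Q}{\partial \mathbf {z}^{\left( l\right) }} \left( \mathbf {w}^{\left( l\right) }\right) ^{T}. \tag{12} \]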
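The same chain rule can be applied to optimize a quadratic neural network layer-wise. Specifically, let us consider Eq. 6 in the following three parts:
\[ \mathbf {z}_r^{\left( l\right) } = \mathbf {a}^{\left( l\right) } \mathbf {w}_r^{\left( l\right) } + \mathbf {b}_r^{\left( l\right) }, \qquad \mathbf {z}_g^{\left( l\right) } = \mathbf {a}^{\left( l\right) } \mathbf {w}_g^{\left( l\right) } + \mathbf {b}_g^{\left( l\right) }, \qquad \mathbf {z}_b^{\left( l\right) } = \left( \mathbf {a}^{\left( l\right) } \circ \mathbf {a}^{\left( l\right) }\right) \mathbf {w}_b^{\left( l\right) } + \mathbf {c}^{\left( l\right) }, \]
and we have
\[ \mathbf {z}^{\left( l\right) } = \mathbf {z}_r^{\left( l\right) } \circ \mathbf {z}_g^{\left( l\right) } + \mathbf {z}_b^{\left( l\right) }. \]
Then, the gradients with respect to the parameters in the three parts can be respectively found as follows:
\[ \begin{aligned} \frac{\partial Q}{\partial \mathbf {w}_r^{\left( l\right) }} &= \left( \mathbf {a}^{\left( l\right) }\right) ^{T} \left( \frac{\partial Q}{\partial \mathbf {z}^{\left( l\right) }} \circ \mathbf {z}_g^{\left( l\right) }\right) , &\frac{\partial Q}{\partial \mathbf {b}_r^{\left( l\right) }} &= \mathbf {1}^{T} \left( \frac{\partial Q}{\partial \mathbf {z}^{\left( l\right) }} \circ \mathbf {z}_g^{\left( l\right) }\right) , \\ \frac{\partial Q}{\partial \mathbf {w}_g^{\left( l\right) }} &= \left( \mathbf {a}^{\left( l\right) }\right) ^{T} \left( \frac{\partial Q}{\partial \mathbf {z}^{\left( l\right) }} \circ \mathbf {z}_r^{\left( l\right) }\right) , &\frac{\partial Q}{\partial \mathbf {b}_g^{\left( l\right) }} &= \mathbf {1}^{T} \left( \frac{\partial Q}{\partial \mathbf {z}^{\left( l\right) }} \circ \mathbf {z}_r^{\left( l\right) }\right) , \\ \frac{\partial Q}{\partial \mathbf {w}_b^{\left( l\right) }} &= \left( \mathbf {a}^{\left( l\right) } \circ \mathbf {a}^{\left( l\right) }\right) ^{T} \frac{\partial Q}{\partial \mathbf {z}^{\left( l\right) }}, &\frac{\partial Q}{\partial \mathbf {c}^{\left( l\right) }} &= \mathbf {1}^{T} \frac{\partial Q}{\partial \mathbf {z}^{\left( l\right) }}, \end{aligned} \tag{13} \]
and
\[ \frac{\partial Q}{\partial \mathbf {a}^{\left( l\right) }} = \left( \frac{\partial Q}{\partial \mathbf {z}^{\left( l\right) }} \circ \mathbf {z}_g^{\left( l\right) }\right) \left( \mathbf {w}_r^{\left( l\right) }\right) ^{T} + \left( \frac{\partial Q}{\partial \mathbf {z}^{\left( l\right) }} \circ \mathbf {z}_r^{\left( l\right) }\right) \left( \mathbf {w}_g^{\left( l\right) }\right) ^{T} + 2\, \mathbf {a}^{\left( l\right) } \circ \left( \frac{\partial Q}{\partial \mathbf {z}^{\left( l\right) }} \left( \mathbf {w}_b^{\left( l\right) }\right) ^{T}\right) , \tag{14} \]
where \(\circ\) denotes the element-wise product and \(\mathbf {1}\) is an N by 1 vector of ones.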
In contrast to the forward computation, the input to the BP procedure is the predicted result \(\mathbf {y}\), which is the output of the forward process. For layer \(l = L, \dots , 1\), one layer at a time, we compute \(\partial Q/\partial \mathbf {z}^{\left( l\right) }\) using Eqs. 9 or 10 depending on whether it is the last layer, the same for the conventional and quadratic neural networks. Then, we compute \(\partial Q/\partial \theta ^{(l)}\) according to Eqs. 11 (for conventional neurons) or 13 (for quadratic neurons) respectively, where \(\theta\) denotes a vector of all trainable parameters of the network. Finally, we compute \(\partial Q/\partial \mathbf {a}^{\left( l\right) }\), which is used in Eqs. 12 (for conventional neurons) and 14 (for quadratic neurons) respectively for the next iteration. After the gradient of the network is obtained, we update the parameters via the Adam optimizer in this study.
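A sketch of the backward pass for one quadratic layer is given below, together with a finite-difference check of the analytic gradients. It assumes the quadratic transform \(\left( \mathbf {a}\mathbf {w}_r+\mathbf {b}_r\right) \circ \left( \mathbf {a}\mathbf {w}_g+\mathbf {b}_g\right) +\left( \mathbf {a}\circ \mathbf {a}\right) \mathbf {w}_b+\mathbf {c}\) and a cross-entropy loss averaged over samples; the function names and test sizes are hypothetical.

```python
import numpy as np

def quad_forward(a, p):
    z_r = a @ p["w_r"] + p["b_r"]
    z_g = a @ p["w_g"] + p["b_g"]
    z_b = (a * a) @ p["w_b"] + p["c"]
    return z_r * z_g + z_b, (z_r, z_g)

def quad_backward(a, p, cache, dz):
    """Chain rule through z = z_r * z_g + z_b, given dz = dQ/dz."""
    z_r, z_g = cache
    dz_r, dz_g = dz * z_g, dz * z_r
    grads = {
        "w_r": a.T @ dz_r, "b_r": dz_r.sum(axis=0, keepdims=True),
        "w_g": a.T @ dz_g, "b_g": dz_g.sum(axis=0, keepdims=True),
        "w_b": (a * a).T @ dz, "c": dz.sum(axis=0, keepdims=True),
    }
    da = dz_r @ p["w_r"].T + dz_g @ p["w_g"].T + 2.0 * a * (dz @ p["w_b"].T)
    return grads, da

def softmax_cross_entropy(z, t):
    """Mean cross-entropy of softmax(z) against one-hot t, and its gradient dQ/dz."""
    shifted = z - z.max(axis=1, keepdims=True)
    log_y = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -np.mean(np.sum(t * log_y, axis=1)), (np.exp(log_y) - t) / len(t)

def numeric_grad(loss_fn, arr, eps=1e-6):
    """Central finite differences of loss_fn with respect to every entry of arr."""
    g = np.zeros_like(arr)
    for idx in np.ndindex(arr.shape):
        old = arr[idx]
        arr[idx] = old + eps
        up = loss_fn()
        arr[idx] = old - eps
        down = loss_fn()
        arr[idx] = old
        g[idx] = (up - down) / (2.0 * eps)
    return g

rng = np.random.default_rng(0)
N, D, K = 5, 3, 4
a = rng.standard_normal((N, D))
t = np.eye(K)[rng.integers(0, K, size=N)]          # one-hot labels
shapes = {"w_r": (D, K), "b_r": (1, K), "w_g": (D, K), "b_g": (1, K), "w_b": (D, K), "c": (1, K)}
p = {name: 0.5 * rng.standard_normal(shape) for name, shape in shapes.items()}

def loss_fn():
    return softmax_cross_entropy(quad_forward(a, p)[0], t)[0]

z, cache = quad_forward(a, p)
_, dz = softmax_cross_entropy(z, t)
grads, da = quad_backward(a, p, cache, dz)
max_err = max(np.max(np.abs(numeric_grad(loss_fn, p[k]) - grads[k])) for k in p)
max_err = max(max_err, np.max(np.abs(numeric_grad(loss_fn, a) - da)))
```

A small `max_err` indicates that the analytic gradients agree with the numerical ones, which is a common sanity check before plugging a hand-written backward pass into an optimizer.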
Whole training process
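Initialization. Let us use a series of integers to describe a feedforward neural network architecture of our interest,
\[ \left[ d^{\left( 1\right) }, d^{\left( 2\right) }, \dots , d^{\left( L+1\right) }\right] , \]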
where \(d^{\left( l\right) }\) represents the dimension of \(\mathbf {a}^{(l)}\). Then, the total number of neurons used in the network is \(\sum \nolimits ^{L}_{l=1} d^{\left( l+1\right) }\). Note that \(d^{\left( 1\right) } = D\), the dimension of input samples, and \(d^{\left( L+1\right) } = K\), the number of classes.
Then, the network can be randomly initialized with a vector of parameters \(\theta ^{\left( l\right) }\) for each layer. Specifically, for each layer \(l\in [1, L]\), let d_from be the input dimension \(d^{\left( l\right) }\) and d_to the output dimension \(d^{\left( l+1\right) }\). We set all weights (\(\mathbf {w}^{(l)}\) for a conventional neural network and \(\mathbf {w}_r^{(l)}, \mathbf {w}_g^{(l)}, \mathbf {w}_b^{(l)}\) for a quadratic neural network) and biases (\(\mathbf {b}^{(l)}\) for a conventional network and \(\mathbf {b}_r^{(l)}, \mathbf {b}_g^{(l)}, \mathbf {c}^{(l)}\) for a quadratic network) as follows:
where np stands for NumPy (version 1.23.0), a Python package. That is, the bias is a 1 by \(d^{\left( l+1\right) }\) zero matrix, and the weight is a \(d^{\left( l\right) }\) by \(d^{\left( l+1\right) }\) matrix.
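One possible initialization that matches the stated shapes is sketched below. Only the shapes and the zero biases come from the text; the distribution of the weights (a standard normal scaled by the square root of d_from) and the helper names are assumptions, not the authors' settings.

```python
import numpy as np

def init_layer(d_from, d_to, quadratic, rng):
    def weight():
        # Assumption: scaled Gaussian weights; the text fixes only the d_from x d_to shape.
        return rng.standard_normal((d_from, d_to)) / np.sqrt(d_from)

    def bias():
        return np.zeros((1, d_to))   # 1 by d_to zero matrix, as stated in the text

    if quadratic:
        return {"w_r": weight(), "w_g": weight(), "w_b": weight(),
                "b_r": bias(), "b_g": bias(), "c": bias()}
    return {"w": weight(), "b": bias()}

def init_network(dims, quadratic, seed=0):
    """dims = [d(1), ..., d(L+1)]; returns one parameter dictionary per layer."""
    rng = np.random.default_rng(seed)
    return [init_layer(dims[l], dims[l + 1], quadratic, rng) for l in range(len(dims) - 1)]

def count_neurons(dims):
    """Total number of neurons: the sum of d(l+1) over all L layers."""
    return sum(dims[1:])

quadratic_net = init_network([2, 3], quadratic=True)         # a single output layer
conventional_net = init_network([2, 100, 3], quadratic=False)
```

For example, `count_neurons([2, 3])` gives 3 and `count_neurons([2, 100, 3])` gives 103, which is the kind of size gap the experiments compare.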
Optimization. As shown in Fig. 1, given the neural network we just initialized and a training dataset containing samples \(\mathbf {x}\) and the corresponding labels \(\mathbf {t}\), we can repeat the forward computation and BP processes described in the above two subsections until the stopping criteria are satisfied. The cross-entropy losses on the training and validation samples are estimated during the training process.
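The optimization loop can be illustrated with a single quadratic output layer trained by Adam on a toy two-class problem (a tight cluster inside a broad one). Everything specific here is an assumption for the sketch: the toy data, the learning rate and other Adam hyperparameters, the fixed number of steps used as the stopping criterion, and the function names.

```python
import numpy as np

def forward(p, x):
    z_r = x @ p["w_r"] + p["b_r"]
    z_g = x @ p["w_g"] + p["b_g"]
    return z_r * z_g + (x * x) @ p["w_b"] + p["c"], z_r, z_g

def loss_and_grads(p, x, t):
    """Mean cross-entropy of the softmax output and its gradients."""
    z, z_r, z_g = forward(p, x)
    shifted = z - z.max(axis=1, keepdims=True)
    log_y = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    loss = -np.mean(np.sum(t * log_y, axis=1))
    dz = (np.exp(log_y) - t) / len(x)
    dz_r, dz_g = dz * z_g, dz * z_r
    grads = {
        "w_r": x.T @ dz_r, "b_r": dz_r.sum(axis=0, keepdims=True),
        "w_g": x.T @ dz_g, "b_g": dz_g.sum(axis=0, keepdims=True),
        "w_b": (x * x).T @ dz, "c": dz.sum(axis=0, keepdims=True),
    }
    return loss, grads

def adam_step(p, grads, state, lr=0.02, beta1=0.9, beta2=0.999, eps=1e-8):
    """One in-place Adam update with bias-corrected moment estimates."""
    state["step"] += 1
    for k in p:
        state["m"][k] = beta1 * state["m"][k] + (1.0 - beta1) * grads[k]
        state["v"][k] = beta2 * state["v"][k] + (1.0 - beta2) * grads[k] ** 2
        m_hat = state["m"][k] / (1.0 - beta1 ** state["step"])
        v_hat = state["v"][k] / (1.0 - beta2 ** state["step"])
        p[k] -= lr * m_hat / (np.sqrt(v_hat) + eps)

# Toy data (hypothetical): class 0 is a tight cluster, class 1 a broad one around it.
rng = np.random.default_rng(0)
n = 1000
x = np.vstack([0.5 * rng.standard_normal((n, 2)), 3.0 * rng.standard_normal((n, 2))])
labels = np.repeat([0, 1], n)
t = np.eye(2)[labels]                                    # one-hot labels

p = {k: 0.1 * rng.standard_normal((2, 2)) for k in ("w_r", "w_g", "w_b")}
p.update({k: np.zeros((1, 2)) for k in ("b_r", "b_g", "c")})
state = {"step": 0,
         "m": {k: np.zeros_like(v) for k, v in p.items()},
         "v": {k: np.zeros_like(v) for k, v in p.items()}}

losses = []
for _ in range(500):                                     # fixed step budget as stopping rule
    loss, grads = loss_and_grads(p, x, t)
    losses.append(loss)
    adam_step(p, grads, state)

accuracy = np.mean(forward(p, x)[0].argmax(axis=1) == labels)
```

In practice the loop would also evaluate the loss on held-out validation samples and stop when it no longer improves; that part is omitted to keep the sketch short.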
Results and discussion
Using the training methods in the preceding section, we optimized conventional and quadratic neural networks to solve a number of GMM-based classification problems. At the beginning, we solved a three-class problem in the two-dimensional (2D) space to illustrate the working principle. Then, we performed a systematic comparison between conventional and quadratic neural networks on samples with different numbers of classes and dimensions. Finally, we applied all methods to three real data sets. Meanwhile, we used the EM algorithm and Bayes inference to obtain the Bayes error rate as the performance benchmark of the neural networks.
Illustrative classification example
Our initial classification problem assumes a finite number of classes (in the first example, \(K = 3\)) in the 2D space (\(D = 2\)): two Gaussian clusters plus a background, which can be viewed as a special case of the Gaussian distribution. As in other network-based classifiers, one-hot label vectors were used in our networks as well. The parameters of the background were set to
where b indicates the background. Then, we randomly set the parameters of the other Gaussian clusters as
for \(k = 1, \dots , K-1\), where \(aa^T\) stands for matrix multiplication and np stands for NumPy (version 1.23.0), a Python package. Given the mean \(\mu _{k}\) and covariance \(\mathbf {C}_{k}\), we generated \(N_k\) points for each class except the background, where \(N_k \in \left[ 20000, 30000\right]\) was chosen randomly. Then, we generated \(N_{b}=\sum \nolimits ^{K-1}_{k=1} N_{k}\) points for the background. The entire dataset was shuffled and split into three parts: 50% as training samples, 20% as validation samples, and 30% as test samples. Figure 5 shows the scatter plot of sample points.
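The data generation and split can be sketched as follows. The covariance construction \(aa^T\), the range of \(N_k\), the background sample count, and the 50/20/30 split follow the text; the distributions used for the cluster means and for the matrix `a`, and the background parameters (a broad zero-mean Gaussian), are hypothetical stand-ins because the exact settings are not reproduced here.

```python
import numpy as np

def make_dataset(K=3, D=2, n_range=(20000, 30000), seed=0):
    """K - 1 random Gaussian clusters plus one background class (label K - 1)."""
    rng = np.random.default_rng(seed)
    xs, ts = [], []
    for k in range(K - 1):
        mu = rng.uniform(-5.0, 5.0, size=D)          # assumption: means uniform in a box
        a = rng.standard_normal((D, D))              # assumption: Gaussian entries for a
        cov = a @ a.T                                # a a^T is symmetric positive semi-definite
        n_k = rng.integers(n_range[0], n_range[1] + 1)
        xs.append(rng.multivariate_normal(mu, cov, size=n_k))
        ts.append(np.full(n_k, k))
    n_b = sum(len(x_k) for x_k in xs)                # N_b = sum of the cluster sizes
    # Hypothetical background: a broad isotropic Gaussian centred at the origin.
    xs.append(rng.multivariate_normal(np.zeros(D), 25.0 * np.eye(D), size=n_b))
    ts.append(np.full(n_b, K - 1))
    x, t = np.concatenate(xs), np.concatenate(ts)
    order = rng.permutation(len(x))                  # shuffle before splitting
    return x[order], t[order]

def split(x, t, fractions=(0.5, 0.2)):
    """Split into training, validation, and test parts (the test part is the remainder)."""
    n_train = int(fractions[0] * len(x))
    n_val = int(fractions[1] * len(x))
    train = (x[:n_train], t[:n_train])
    val = (x[n_train:n_train + n_val], t[n_train:n_train + n_val])
    test = (x[n_train + n_val:], t[n_train + n_val:])
    return train, val, test

x, t = make_dataset()
train, val, test = split(x, t)
one_hot_train = np.eye(3)[train[1]]                  # one-hot labels for network training
```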
We trained conventional and quadratic neural networks with different numbers of neurons for GMM classification. The decreasing loss on the validation samples during the training process is shown in Fig. 6. The decision boundaries are shown in Fig. 7 for the conventional and quadratic neural networks as well as the EM algorithm respectively. It took hundreds of neurons for the conventional network to approach the elliptical boundaries, while the quadratic network accurately fitted them with only three quadratic neurons.
The lighter the network structure, the higher the computational efficiency. Table 1 shows the time spent to train the conventional and quadratic neural networks, and the accuracies of the EM algorithm and the neural networks on the test samples. Our quadratic neural network with only one neural layer produced a performance closer to the EM benchmark than the conventional neural network of more than one hundred conventional neurons. Also, the time needed for the quadratic neural network is only about 7% of that of the conventional counterpart.
Systematic comparison
To systematically compare conventional and quadratic networks, we tested conventional and quadratic networks in 2D and three-dimensional (3D) spaces with \(K=5\) and 8 Gaussian clusters. In each case, we randomly generated 50 samples using the aforementioned method except we replaced the background by a Gaussian cluster and set \(N_k \in \left[ 6000, 9000\right]\). Typical scatter plots of these samples are presented in Fig. 8.
We trained and tested the EM algorithm and the conventional and quadratic networks with different numbers of layers/neurons in terms of the average accuracy. The results are summarized in Table 2. Very interestingly, in all cases, the accuracy of the quadratic networks with only an output layer of a few neurons is higher than that of the conventional network of over one hundred neurons. Meanwhile, the training time needed for quadratic neural networks is only about \(26.82\%\), on average, of that taken by the much more complicated conventional network. Generally speaking, the quadratic neural networks delivered a performance very close to that of the EM algorithm.
Real data
Finally, we applied conventional and quadratic networks to three real data sets from the UCI Machine Learning repository [20]: protein localization sites (yeast), pen-based recognition of handwritten digits (pendigits), and isolated letter speech recognition (isolet). All three data sets' attribute types are numerical. Some basic information about these datasets is given in Table 3. For the yeast dataset, we split the whole dataset in the same proportions as described in the first subsection. Typical yeast cell (Saccharomyces cerevisiae) images from the Cell Image Library [21] are shown in Fig. 9, visualized through transmission electron microscopy. For the pendigits and isolet datasets, with the test samples already provided, 30% of training samples were used for validation.
We trained and tested the EM algorithm and the conventional and quadratic networks with different numbers of layers/neurons on each dataset 20 times. The average accuracy of and time needed by each method are shown in Table 4. In each application, the quadratic neural network with only one layer of a few neurons has the highest accuracy, while its training time is about half that of the conventional networks, which are orders of magnitude larger than the quadratic version.
Conclusions
Although it has been well tested and has a solid theoretical foundation, the EM algorithm needs to take an entire dataset into memory, processes it iteratively, and is time-consuming, under the restriction that the data must come from a GMM. Furthermore, when new samples become available, the parameters need to be adjusted again. A neural network approach can be much more desirable, effective, and efficient, and is workable with many data models in principle thanks to its universal approximation nature. After a network is well trained, new samples can be used to fine-tune the network or processed for inference in a feedforward fashion, which is extremely efficient and generalizable to cases much more complicated than GMM. Very interestingly, compared to conventional networks, quadratic networks can deliver a performance close to that of the EM algorithm in the GMM cases and yet be orders of magnitude simpler than conventional networks for the same classification task.
In conclusion, in this paper we have numerically and experimentally demonstrated the superiority of quadratic networks over conventional ones. It is underlined that the quadratic neural network of a much lighter structure rivals the conventional network of a complexity orders of magnitude higher in solving the same classification problems. Clearly, the superior classification performance of quadratic networks could be translated to medical imaging tasks, especially radiomics.
Availability of data and materials
The datasets analysed during the current study are available in the UCI Machine Learning repository, http://archive.ics.uci.edu [20]. Applications and source codes are available at https://github.com/tianruiqi/QuadraticNeurons.
Abbreviations
ANN: Artificial neural network
BP: Backpropagation
EM: Expectation-maximization
GMM: Gaussian mixture model
2D: Two-dimensional
3D: Three-dimensional
References
Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, et al (2020) Language models are few-shot learners. Adv Neural Informat Proc Syst 33:1877-1901
Sakaguchi, K., Le Bras, R., Bhagavatula, C., Choi, Y.: Winogrande: An adversarial winograd schema challenge at scale. Proceedings of the AAAI Conference on Artificial Intelligence 34(05), 8732-8740 (2020)
Di Biase, G., Blum, H., Siegwart, R., Cadena, C.: Pixel-wise anomaly detection in complex driving scenes. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16918-16927 (2021)
Liu, Y., Zhang, J., Fang, L., Jiang, Q., Zhou, B.: Multimodal motion prediction with stacked transformers. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7577-7586 (2021)
Ma, X., Zhang, Y., Xu, D., Zhou, D., Yi, S., Li, H., et al.: Delving into localization errors for monocular 3d object detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4721-4730 (2021)
Vinyals O, Babuschkin I, Czarnecki WM, Mathieu M, Dudzik A, Chung J, et al (2019) Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575(7782):350-354. https://doi.org/10.1038/s41586-019-1724-z.
Moen E, Bannon D, Kudo T, Graf W, Covert M, Van Valen D (2019) Deep learning for cellular image analysis. Nat Methods 16(12):1233-1246. https://doi.org/10.1038/s41592-019-0403-1.
Isensee F, Jaeger PF, Kohl SAA, Petersen J, Maier-Hein KH (2021) nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nat Methods 18(2):203-211. https://doi.org/10.1038/s41592-020-01008-z.
Wang G, Ye JC, De Man B (2020) Deep learning for tomographic image reconstruction. Nat Mach Intell 2(12):737-748. https://doi.org/10.1038/s42256-020-00273-z.
Bennett KP, Brown EM, De Los Santos H, Poegel M, Kiehl TR, Patton EW, et al (2019) Identifying windows of susceptibility by temporal gene analysis. Sci Rep 9(1):2740. https://doi.org/10.1038/s41598-019-39318-8.
Petegrosso R, Li ZL, Kuang R (2020) Machine learning and statistical methods for clustering single-cell RNA-sequencing data. Brief Bioinform 21(4):1209-1223. https://doi.org/10.1093/bib/bbz063.
Arunkumar N, Mohammed MA, Ghani MKA, Ibrahim DA, Abdulhay E, Ramirez-Gonzalez G, et al (2019) K-means clustering and neural network for object detecting and identifying abnormality of brain tumor. Soft Comput 23(19):9083-9096. https://doi.org/10.1007/s00500-018-3618-7.
Huang H, Meng FZ, Zhou SH, Jiang F, Manogaran G (2019) Brain image segmentation based on FCM clustering algorithm and rough set. IEEE Access 7:12386-12396. https://doi.org/10.1109/ACCESS.2019.2893063.
Fan FL, Cong WX, Wang G (2018) A new type of neurons for machine learning. Int J Numer Methods Biomed Eng 34(2):e2920. https://doi.org/10.1002/cnm.2920.
Fan FL, Cong WX, Wang G (2018) Generalized backpropagation algorithm for training second-order neural networks. Int J Numer Methods Biomed Eng 34(5):e2956. https://doi.org/10.1002/cnm.2956.
Fan FL, Shan HM, Kalra MK, Singh R, Qian GH, Getzin M, et al (2019) Quadratic autoencoder (Q-AE) for low-dose CT denoising. IEEE Trans Med Imaging 39(6):2035-2050. https://doi.org/10.1109/TMI.2019.2963248.
Fan, F., Shan, H., Gjesteby, L., Wang, G.: Quadratic neural networks for CT metal artifact reduction. Developments in X-Ray Tomography XII 11113, 111130 (2019). International Society for Optics and Photonics.
Fan FL, Wang G (2020) Fuzzy logic interpretation of quadratic networks. Neurocomputing 374:10-21. https://doi.org/10.1016/j.neucom.2019.09.001.
Fan FL, Xiong JJ, Wang G (2020) Universal approximation with quadratic deep networks. Neural Netw 124:383-392. https://doi.org/10.1016/j.neunet.2020.01.007.
Dua, D., Graff, C.: UCI Machine Learning Repository (2017). http://archive.ics.uci.edu/ml. Accessed 2022-05-28.
Höög, J., Panagaki, D., Croft, J.: CIL:50813 - 50817, Saccharomyces cerevisiae (baker's yeast, budding yeast), Mixed population of S. cerevisiae cells. CIL. Dataset. (2020). http://cellimagelibrary.org/groups/50815. Accessed 2022-05-28.
Acknowledgements
Not applicable.
Funding
This work was supported in part by NIH, Nos. R01CA237267, R01HL151561, R21CA264772, and R01EB032716.
Contributions
GW suggested this research topic. TRQ designed the networks and experiments, and drafted the paper. Both discussed data analysis and revised the paper. All authors read and approved the final manuscript.
Ethics declarations
Consent for publication
All authors give consent for publication.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Qi, T., Wang, G. Superiority of quadratic over conventional neural networks for classification of gaussian mixture data. Vis. Comput. Ind. Biomed. Art 5, 23 (2022). https://doi.org/10.1186/s42492-022-00118-z
DOI: https://doi.org/10.1186/s42492-022-00118-z