Robust facial expression recognition system in higher poses

Facial expression recognition (FER) has numerous applications in computer security, neuroscience, psychology, and engineering. Owing to its non-intrusiveness, it is considered a useful technology for combating crime. However, FER is plagued with several challenges, the most serious of which is its poor prediction accuracy in severe head poses. The aim of this study, therefore, is to improve the recognition accuracy in severe head poses by proposing a robust 3D head-tracking algorithm based on an ellipsoidal model and an advanced ensemble of AdaBoost and support vector machine (SVM). The FER features are tracked from one frame to the next using the ellipsoidal tracking model, and the visible expressive facial key points are extracted using Gabor filters. The ensemble algorithm (Ada-AdaSVM) is then used for feature selection and classification. The proposed technique is evaluated using the Bosphorus, BU-3DFE, MMI, CK+, and BP4D-Spontaneous facial expression databases. The overall performance is outstanding.


Applications
Facial expression recognition (FER) is the automatic detection of the emotional state of a human face using computer-based technology. The field of study is currently a hotspot of research because it has increasing applications in several domains, such as psychology, sociology, health science, transportation, gaming, communication, security, and business. According to Panksepp [1], facial expressions and emotions guide the lives of people in a variety of ways, and emotions are key aspects that enlighten us in how we should act, from elementary processes to the most intricate acts [2,3].
The sporadic advancements in the use of facial expressions in neuropsychiatric complications have shown more positive results [4], and current studies are focusing on human behavior and the detection of mental illnesses [5,6].
FER can also affect data collection in specific research projects. For example, Shergill et al. [7] proposed an intelligent assistant FER framework that could be implemented in e-commerce to determine the product preferences of customers. The system captures the facial data as they browse the e-shop for products to acquire. Based on the facial expression, the systems can automatically suggest more products of possible interest.
Certain physiological features of people have been discovered to be useful as intelligent data in the search for criminals [8,9]. This theory is based on the tendency of someone with the ego to commit a high-profile crime, such as terrorism, to exhibit specific emotions such as anger and fear. Consequently, the accurate recognition of these expressions could lead to further security measures in apprehending criminals.
FER can also be valuable during the testing phase of video games. Target groups are frequently invited to play a game for a set amount of time, and their behaviors and emotions are observed as they play. Game developers may acquire more insights and valuable deductions about the emotions recorded during gameplay using FER technology, and incorporate the feedback into production.

Open Access
Visual Computing for Industry, Biomedicine, and Art *Correspondence: jkappati@ug.edu.gh

Technical issues on the use of two-dimensional facial data
Two-dimensional (2D) FER systems are extremely sensitive to head orientation. Therefore, to achieve good results, the subject must be constantly in a fronto-parallel orientation. The problem resulting from this is that the throughput of most site-access systems is significantly reduced. This implies that subjects are frequently required to perform several verifications to attain an ideal facial orientation. Consequently, surveillance systems operate on luck, hoping the subject faces the camera. Another problem that arises from the use of 2D technology is the illumination conditions of the surrounding environment. If the subject is in a setting with varying lighting conditions, FER reduces in accuracy because the FER processes are sensitive to the direction of lighting and the ensuing shading pattern. Consequently, cast shadows may obstruct recognition by concealing informative features.
Three-dimensional (3D) FER systems have a higher detection rate than 2D systems because of their higher intensity modality, and they also have more object description geometry information [10,11]. This demonstrates the importance of pushing FER into higher face orientations to improve its realism and practicality.

Related work
The primary focus of this study is to improve FER accuracy in higher facial orientations.
Yadav and Singha [12] adopted the Viola-Jones descriptor [13] to detect faces and used a combination of local binary patterns (LBP) and the histogram of gradients (HOG) as a feature extraction tool. Subsequently, traditional SVM with the k-means method was employed as a training algorithm. LBP feature extraction techniques, like Gabor filters, are orientation-selective, and thus, highly robust in tracking key facial features. However, the Viola-Jones descriptor is computationally demanding and has a low detection accuracy. Furthermore, the conventional SVM described in the study is slow to classify. Consequently, the overall architecture used in the study was computationally expensive. Yao et al. [14] proposed a linear SVM method that used AUs to recognize seven facial expression prototypes in the CK database. The Viola-Jones descriptor was again used as the face-detection technique. Although the goal of the study was to minimize computational complexity and enhance recognition accuracy, the resulting average recognition accuracy of 94.07% for females and 90.77% for males was too low for a viable implementation. Ashir et al. [15] also proposed an SVM-based multiclass classification for detecting seven facial expressions across four prominent databases. The Nyquist-Shannon sampling method [16] was used to compress the extracted facial feature samples. Although the sampling method reduces data loss, it is prone to aliasing issues, particularly when the bandwidth is extremely large. The Nyquist-Shannon sampling technique is difficult to deploy because it assumes that the sampled signal is completely band-limited. In real-world applications, this is a concern because no actual signal is genuinely and completely band-limited. The compressive sampling [17] paradigm could have been a better option because it is less restrictive.
Perez-Gomez et al. [18] recently proposed a 2D-3D FER system that used principal component analysis (PCA) and a genetic algorithm for feature selection, and a k-nearest neighbor (KNN) multiclass SVM for learning. In that study, the synthetic minority oversampling technique (SMOTE) [19] was used to balance the instances. However, SMOTE creates an equal number of synthetic samples for each minority data sample and relies on the hypothesis performance to update the distribution function. By contrast, the adaptive synthetic (ADASYN) [20] method generates more synthetic data for the minority class samples that are harder to learn than for those that are easy to learn. In addition, PCA uses observations from all the extracted features in the projection to the subspace and only considers linear relationships, ignoring the input multivariate structures. Compared to other recent studies, the findings of that study were not positive.
Li et al. [21] proposed a robust 3D local coordinate technique for extracting pose-invariant facial features at key points. The descriptor in this method is a multi-task sparse representation fine-grained matching algorithm. The method was evaluated using the Bosphorus datasets, and an average recognition accuracy of 98.9% was obtained. The success of this study is largely owed to the accurate tracking of 3D key points. This recent study is a primary driving force behind our proposed study.
The following are the significant contributions of this work: (1) A robust head-tracking algorithm that tracks facial features from one frame to the next, accounting for more features in the overall prediction process; (2) A unique ensemble approach that employs AdaBoost for feature selection, and a combination of AdaBoost and SVM for classification. AdaBoost is extremely fast, whereas SVM is extremely accurate. Consequently, the proposed technique becomes extremely fast while also improving the recognition accuracy.
The remainder of this paper is organized as follows. The Methods section describes the proposed strategy. The Results and discussion section presents the findings and their analysis. Finally, the Conclusions section concludes the study.

Methods
We robustly tracked the facial features from one frame to the next using 3D facial data. With 3D data, information, such as the size and shape of an object, can be correctly estimated in each frame without prior assumptions.
The first priority is to detect the focal points in each frame. The next step is to search for matching features or objects across all frames. This method addresses the changing behavior of a moving object and the preceding annotations of the scene. In this approach, the location of an object is projected by iteratively updating the object position from previous frames [22,23]. Figure 1 presents the framework of this study.

Architectural framework
The procedure loads the images and robustly tracks the features across frames using the proposed ellipsoidal model. Subsequently, the Gabor feature-extraction approach is applied; the Feature points extraction section explains the reason for using Gabor features in this study. Feature selection and classification are then executed using Ada-AdaSVM.

Ellipsoidal feature tracking method
Accurate tracking of a human face from the forehead, to the left cheek, to the chin, to the right cheek, and back to the same spot on the forehead where the tracking began demonstrates that the human face is best approximated by an ellipse. Thus, considering the 3D facial representation in Fig. 2 with $N$ feature points tracked across frames, we denote the set of features tracked at time $t$ by $\alpha(t)$, where $N$ is the number of most relevant feature points. In this study, we assumed $N$ to be 24. In addition, let $f_j(t) \in \alpha(t)$ denote a facial feature. As the features move from one frame to the next, at time $t + 1$ we have $f_j(t + 1) \in \alpha(t + 1)$. Assuming that $Y_j$ is the position of $\alpha_j$ on the 3D facial model and $\alpha_{j,p}[\phi(t + 1)]$ represents its back projection on the image plane, the 3D facial orientation at $t + 1$ is the vector $\phi(t + 1)$ that minimizes $\sum_{j=1}^{N} S_j^2$, where $S_j$ is the residual associated with the $j$-th feature point and its back projection.
This is a multi-view system based on the assumption that cameras are positioned around the subject to capture various rotation movements. Consequently, the facial image can be captured with a high degree of precision in any orientation. We extracted the features in the same manner as for 2D images. The right and left eyes, lips, and muscles around the cheeks are important parts of the face to consider, because even slight disruptions severely distort the muscles in these places. The Gabor technique is then used to extract the features of the captured face. The algorithm models a procedure that chooses a set of features and robustly tracks them from one frame to the next while discarding all other features that are no longer required for tracking. The ellipsoidal 3D face was modelled as shown in Fig. 3.
Adopting homogeneous coordinates for an ellipsoid with semi-axes $a$, $b$, and $c$, a point $(x, y, z, 1)^T$ lies on the surface when $x^2/a^2 + y^2/b^2 + z^2/c^2 = 1$ in the coordinate frame of the ellipsoid. The algorithm tracks the facial features that are most noticeably affected by slight deformation from one frame to the next using the brightness change constraint [24]. These features are usually near the eyes, mouth, cheeks, and edges, as shown in Fig. 4 and by contour $\tau$ in Fig. 3.
Fig. 3 Ellipsoidal face model
Fig. 4 Model of feature extraction points in 3D
Suppose that a pixel with luminance $I(x, y, t)$ moves from position $(x, y)^T$ at frame $t$ to position $(x + u, y + v)^T$ at frame $t + 1$ at a high frame rate. In this instance, we can deduce that
$$I(x, y, t) = I(x + u, y + v, t + 1) \tag{4}$$
By applying Taylor's series, and considering $I_x$ and $I_y$ as the spatial gradients and $I_t$ as the temporal derivative of the image, we can infer that
$$I_x u + I_y v + I_t = 0 \tag{5}$$
If a whole window $\omega_k$ is considered instead of a single pixel, we deduce that
$$I_x u + I_y v + I_t = 0 \quad \text{for all } (x, y) \in \omega_k \tag{6}$$
The solution of Eq. (6) is an optimization problem. By introducing the cost function, it follows that
$$J(u, v) = \sum_{\omega_k} \left( I_x u + I_y v + I_t \right)^2 \tag{7}$$
The optimal displacement vector that determines the new position of face window $\omega_k$ is given by:
$$(u_k, v_k) = \arg\min_{(u, v)} J(u, v) \tag{8}$$
where $(u_k, v_k)$ gives the new position of the window in the image. By computing the derivatives of $J$ with respect to $u$ and $v$ and equating them to zero, we obtain:
$$C_k \begin{pmatrix} u_k \\ v_k \end{pmatrix} = -\begin{pmatrix} \sum_{\omega_k} I_x I_t \\ \sum_{\omega_k} I_y I_t \end{pmatrix} \tag{9}$$
where
$$C_k = \begin{pmatrix} \sum_{\omega_k} I_x^2 & \sum_{\omega_k} I_x I_y \\ \sum_{\omega_k} I_x I_y & \sum_{\omega_k} I_y^2 \end{pmatrix}$$
Starting from the matrix of the 3D face image, the $j$-th level of the pyramid description of the face image is obtained by recursion from the preceding level. The displacement vector in Eq. (9) can also be rewritten in a per-level form. The displacement vector in Eq. (10) is computed at the deepest pyramid level $j_{\max}$ (in the Newton-Raphson fashion), and the result of the computation is propagated to the upper level $j_{\max} - 1$. Equation (12) was used as the initial estimate for the evaluation of the displacement vector of the 3D face, from which the final displacement vector is obtained. The visible features of the face can be extracted from any location on the face, as for any other 2D face.
The extracted features are candidates for predicting the overall expression of the face. The Gabor extraction technique is critical for extracting the maximum amount of information required for the classifier.
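The single-window displacement step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the standard least-squares form of the brightness constraint, central-difference gradients, a single pyramid level, and a hypothetical function name (`window_displacement`) and window half-width (`half`).

```python
import numpy as np

def window_displacement(frame_t, frame_t1, center, half=7):
    """Estimate the displacement (u, v) of one window between two frames.

    Solves C d = b in the least-squares sense, where C collects sums of
    gradient products over the window and b = -(sum Ix*It, sum Iy*It).
    `center` is (row, col); `half` is an assumed window half-width.
    """
    r, c = center
    i0 = frame_t.astype(float)
    i1 = frame_t1.astype(float)
    # np.gradient returns derivatives along axis 0 (rows, y) then axis 1 (cols, x).
    iy_full, ix_full = np.gradient(i0)
    it_full = i1 - i0  # temporal difference between consecutive frames
    win = (slice(r - half, r + half + 1), slice(c - half, c + half + 1))
    ix = ix_full[win].ravel()
    iy = iy_full[win].ravel()
    it = it_full[win].ravel()
    C = np.array([[ix @ ix, ix @ iy],
                  [ix @ iy, iy @ iy]])
    b = -np.array([ix @ it, iy @ it])
    # lstsq tolerates a near-singular C, which occurs in low-texture windows.
    d, *_ = np.linalg.lstsq(C, b, rcond=None)
    return d  # d[0] = u (along x / columns), d[1] = v (along y / rows)
```

The linearization only holds for small motions, which is why a coarse-to-fine pyramid is normally wrapped around a step like this one: the estimate at a coarse level seeds the search at the next finer level.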

Feature points extraction
The 2D Gabor filters are spatial sinusoids localized by a Gaussian window, and because they are orientation-, localization-, and frequency-selective, they are useful in this study. Representing images using Gabor wavelets provides flexibility because the details about their spatial relations are preserved in the process. The general form of the Gabor function is given by:
$$G(x, y, \theta, u, \sigma) = \frac{1}{2\pi\sigma^2} \exp\left\{-\frac{x^2 + y^2}{2\sigma^2}\right\} \exp\left\{2\pi i \left(R_1 + R_2\right)\right\} \tag{14}$$
where $R_1 = u x \cos\theta$ and $R_2 = u y \sin\theta$, $u$ is the spatial frequency of the band pass, $\theta$ is the spatial orientation, $\sigma$ is the standard deviation of the 2D Gaussian envelope, and $(x, y)$ is the position of the light impulse in the visual field. To allow for more robustness to illumination, we set the filter to zero direct current. The Gabor wavelet is then given by:
$$\tilde{G}(x, y, \theta, u, \sigma) = G(x, y, \theta, u, \sigma) - \frac{1}{q} \sum_{i=-n}^{n} \sum_{j=-n}^{n} G(i, j, \theta, u, \sigma)$$
where $x, y, \theta, u, \sigma$ are the parameters defined above, $(i, j)$ ranges over the positions of the 2D input points, $u$ sets the scale and $\theta$ the orientation of the Gabor kernel, $\sigma$ is the standard deviation of the Gaussian window in the kernel, $n$ is the maximum size of the peak, and $q$ is the size of the filter given by $q = (2n + 1)^2$. In this study, we used 8 orientations given by $0, \frac{\pi}{8}, \frac{\pi}{4}, \frac{3\pi}{8}, \frac{\pi}{2}, \frac{5\pi}{8}, \frac{3\pi}{4}, \frac{7\pi}{8}$. The filter responses are
$$G_1 = I * \Re(\tilde{G}), \qquad G_2 = I * \Im(\tilde{G})$$
where $I$ is a sub-image of the expressional face; $\Re$ and $\Im$ denote the real and imaginary parts of each Gabor kernel, respectively; and the star ($*$) is the convolution operator. The final magnitude response, representing the feature vectors, was computed by determining the square root of the sum of the squares of $G_1$ and $G_2$, that is, $\sqrt{G_1^2 + G_2^2}$. Figure 5 shows the magnitude response of a template image.
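As an illustration of the feature extraction just described, the following NumPy sketch builds a zero-DC complex Gabor kernel and computes the magnitude response for the 8 orientations. It is a sketch under assumptions rather than the authors' code: the kernel is taken as a Gaussian envelope times a complex sinusoid with the window mean subtracted, and the function names and default values (`u=0.1`, `sigma=4.0`, `n=8`) are hypothetical choices, since the paper's frequency settings are not recoverable from the text.

```python
import numpy as np

def gabor_kernel(u, theta, sigma, n):
    """Zero-DC complex Gabor kernel sampled on a (2n+1) x (2n+1) grid."""
    y, x = np.mgrid[-n:n + 1, -n:n + 1].astype(float)
    envelope = np.exp(-(x**2 + y**2) / (2.0 * sigma**2)) / (2.0 * np.pi * sigma**2)
    carrier = np.exp(2j * np.pi * (u * x * np.cos(theta) + u * y * np.sin(theta)))
    g = envelope * carrier
    # Subtracting the window mean makes the kernel sum to zero (zero DC).
    return g - g.mean()

def convolve_same(image, kernel):
    """Same-size linear 2D convolution via zero-padded FFT (NumPy only)."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    shape = (ih + kh - 1, iw + kw - 1)
    full = np.fft.ifft2(np.fft.fft2(image, shape) * np.fft.fft2(kernel, shape))
    top, left = (kh - 1) // 2, (kw - 1) // 2
    return full[top:top + ih, left:left + iw]

def gabor_magnitudes(image, u=0.1, sigma=4.0, n=8):
    """Magnitude responses for 8 orientations: 0, pi/8, ..., 7*pi/8."""
    responses = []
    for k in range(8):
        kernel = gabor_kernel(u, k * np.pi / 8, sigma, n)
        resp = convolve_same(image.astype(float), kernel)
        # For a real image, |resp| equals sqrt(G1^2 + G2^2), where G1 and G2
        # are the responses to the real and imaginary kernel parts.
        responses.append(np.abs(resp))
    return np.stack(responses)
```

Because the kernel has zero mean, a uniformly lit patch produces no response away from the image border, which is the property that gives the filter its robustness to illumination.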

Classification using Ada-AdaSVM
For this optimization problem, an SVM with a radial basis function (RBF) kernel was used as a weak classifier. This weak SVM classifier was trained to produce the optimum Gaussian value for the scale parameter $\delta$ and the regularization parameter $\partial$. Typically, the best parameters are $\partial = 1.0$ and $\delta = 0.1$. The feature selection hypothesis is then computed from the expression $\operatorname{sgn}\left(\sum_{t=1}^{T} \omega_t h_t^1(\phi_t^1)\right)$, where $T$ is the final iteration, $h_t^1$ is the hypothesis with the most discriminating information, and $\omega_t$ is the weight assigned to $h_t^1$ based on its classification performance. The learning process formulated in our recent study [25] is as follows:
Step 1: Input the training sets $[(y_1, x_1), (y_2, x_2), \ldots, (y_N, x_N)]$, $N = a + b$, where datasets $a$ and $b$ comprise the $y_i = +1$ and $y_i = -1$ samples, respectively. Initialize the scale parameter as $\delta = \delta_{ini}$ and set $\delta_{min}$ and $\delta_{step}$. The $x_i$ are the feature vectors selected by the AdaBoost algorithm, and the $y_i$ are their labels.
Step 2: Initialize the training set weights $w_i^1$.
Step 3: Apply the RBF-SVM kernel to train on the weighted training datasets using the leave-one-subject-out cross-validation (LOSOCV) approach, and compute the training error $\xi_t$ of the weak classifier $h_t$.
Step 4: At $\xi_t = \frac{1}{2}$, reduce $\delta$ by a factor of $\delta_{step}$ and then jump to Step 1.
Step 5: Set the weight $\omega_t$ of the constituent classifier $h_t$.
Step 6: Update the weights to $w_i^{t+1}$, where $N_t$ is a normalization constant chosen so that $\sum_{i=1}^{n} w_i^{t+1} = 1$.
Step 7: Output the final classifier as the sign of the weighted sum of the constituent classifiers.
The LOSOCV error is given by the expression $\frac{1}{2n} \sum_{i=1}^{n} \left| f_i(x_i) - l_i \right|$, where $n$ represents the total number of trained samples.
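The boosting loop in Steps 1 to 7 can be sketched as follows. This is a hedged illustration, not the authors' Ada-AdaSVM: the weak learner is a weighted RBF-kernel vote used as a lightweight stand-in for an RBF-kernel SVM, the initial weights are assumed uniform, and the error, classifier-weight, and update formulas are the standard AdaBoost ones, which the source does not confirm. All function names and default values are hypothetical.

```python
import numpy as np

def rbf_vote(x_train, y_train, w, delta):
    """Stand-in weak learner (NOT an SVM): weighted RBF-kernel vote."""
    def predict(x):
        d2 = ((x[:, None, :] - x_train[None, :, :]) ** 2).sum(axis=2)
        score = np.exp(-d2 / (2.0 * delta**2)) @ (w * y_train)
        return np.where(score >= 0, 1, -1)
    return predict

def boost(x, y, rounds=10, delta_ini=2.0, delta_min=0.1, delta_step=0.5):
    """AdaBoost over kernel weak learners with a shrinking scale delta."""
    n = len(y)
    w = np.full(n, 1.0 / n)            # Step 2: uniform weights (assumed)
    delta = delta_ini
    learners, alphas = [], []
    while len(learners) < rounds and delta >= delta_min:
        h = rbf_vote(x, y, w.copy(), delta)          # Step 3: train weak learner
        pred = h(x)
        err = w[pred != y].sum()                     # weighted training error
        if err >= 0.5:                               # Step 4: weak learner too weak
            delta *= delta_step                      # read here as multiplicative
            continue
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-12))  # Step 5
        w = w * np.exp(-alpha * y * pred)            # Step 6: reweight samples
        w /= w.sum()                                 # normalize so weights sum to 1
        learners.append(h)
        alphas.append(alpha)
        if err == 0.0:                               # nothing left to correct
            break

    def classify(xq):                                # Step 7: weighted vote
        total = sum(a * h(xq) for a, h in zip(alphas, learners))
        return np.sign(total)
    return classify
```

Shrinking the kernel scale when a weak learner does no better than chance makes the next learner more local, which is the usual motivation for coupling the scale parameter to the boosting loop.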

Facial expression datasets
The algorithm was trained and tested on five popular datasets (Bosphorus, BU-3DFE, MMI, CK+, and BP4D-Spontaneous) and executed on a 4-CPU processor running at approximately 2.2 GHz with 8192 MB of RAM.

Experiments on databases
Bosphorus contains 4666 images of 105 subjects [26], comprising 60 men and 45 women, with the majority being Caucasian; 27 of them were professional actors. The images were captured in various poses, expressions, and occlusion conditions. In addition to the 6 basic emotional expressions, various systematic head poses (13 yaw and pitch rotations) were present. The texture images have a resolution of 1600 × 1200 pixels, whereas the 3D faces comprise approximately 35,000 vertices [27]. Figure 6 presents sample datasets from Bosphorus. Occlusion images were discarded because they were not the focus of this study. The datasets used comprised 6 poses and 7 expressions. The images were partitioned into training and testing sets using the conventional LOSOCV approach. One specimen from each of the 6 groups of expressions was used as a test dataset during each training run, whereas the rest of the samples were used as a training set. Table 1 summarizes the FER in Bosphorus.
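The partitioning scheme can be illustrated with a short, dependency-free sketch. It assumes that LOSOCV holds out all samples of one subject per fold, and it reads the LOSOCV error expression as the error rate $\frac{1}{2n}\sum_i |f_i(x_i) - l_i|$ for labels in $\{-1, +1\}$; the function names and the subject identifiers in the usage example are hypothetical.

```python
def loso_splits(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices), one fold per subject."""
    for subject in sorted(set(subject_ids)):
        test = [i for i, sid in enumerate(subject_ids) if sid == subject]
        train = [i for i, sid in enumerate(subject_ids) if sid != subject]
        yield subject, train, test

def loso_error(predictions, labels):
    """Error rate (1 / 2n) * sum |f_i(x_i) - l_i| for labels in {-1, +1}.

    Each misclassification contributes |(+1) - (-1)| = 2, so dividing by 2n
    gives the fraction of held-out samples that were misclassified.
    """
    n = len(labels)
    return sum(abs(p - l) for p, l in zip(predictions, labels)) / (2 * n)

# Usage with made-up subject identifiers (illustrative only):
for subject, train_idx, test_idx in loso_splits(["s1", "s1", "s2", "s3", "s3"]):
    print(subject, train_idx, test_idx)
```

Holding out whole subjects, rather than individual images, keeps images of the same person from appearing in both the training and testing sets.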
The BU-3DFE database was created at Binghamton University [28]. There were 100 respondents, ranging in age from 18 to 70 years old. Whites, Blacks, East Asians, Middle East Asians, Indians, and Hispanics are among the ethnic groups. Each participant displayed 7 expressions: neutral and 6 archetypal facial expressions, each at 4 intensity levels. Figure 7 shows sample datasets in the database. The images were separated into training and testing sets using the same LOSOCV method as that used for the Bosphorus datasets, and the average recognition accuracy was 94.56%.
The MMI database comprises over 2900 high-resolution videos submitted by more than 20 students and research staff members, 44% of whom are female, ranging in age from 19 to 62 years old. Seventy-five subjects were included in total, and Fig. 8 shows samples.
In the CK+ database, the subjects were 81% Euro-American, 13% Afro-American, and 6% from other ethnic groups. The expressions included anger, contempt, disgust, fear, happiness, sadness, and surprise. Figure 9 presents sample datasets. A tenfold cross-validation procedure was used to partition the datasets into training and testing sets. The average recognition accuracy is 99.48%.
Finally, the BP4D-Spontaneous dataset is a 3D video collection of spontaneous facial expressions from young individuals. The database comprises 41 subjects (23 women and 18 men) ranging in age from 18 to 29 years old, including 11 Asians, 6 African-Americans, 4 Hispanics, and 20 Euro-Americans. Figure 10 shows sample images. We extracted expressions of anger, disgust, fear, pain, happiness, sadness, and surprise. The datasets were partitioned into training and testing sets using tenfold cross-validation. The average recognition accuracy is 97.2%. Figures 11 and 12 exhibit the respective confusion matrices for facial expressions and pose predictions in the Bosphorus database. Figures 13, 14, 15, and 16 show the rest of the confusion matrices for FER in BU-3DFE, MMI, CK+, and BP4D-Spontaneous, respectively.

Comparison of methods
In Table 2, the proposed method is compared to some recent techniques. These results demonstrate that the proposed method is promising. Figures 17, 18, and 19 show the performance for each of the 7 facial expressions. In the BU-3DFE database, many authors failed to report the performance of neutral expressions; thus, the comparison was performed using the other 6. The performance shown in Fig. 17 was encouraging. Figure 18 shows the performance on the CK+ database. Although the result, as shown in Fig. 18, depicts fierce rivalry between three current methods [29][30][31], the overall average recognition shows that the proposed technique is promising. In the Bosphorus database, the proposed method outperformed the most recent methods (Fig. 19). A comparison of the performances of the individual FER prototypes in the MMI and BP4D-Spontaneous databases could not be executed because there were no reported data for comparison at the time of compilation.
Statistical analysis using ANOVA shows the following results. In the Bosphorus database, an analysis of variance demonstrated statistically significant differences between the proposed technique and the following: Hariri et al. [36] (p = 0.001), Azazi et al. [37] (p < 0.001), and Moeini A and Moeini H [40] (p = 0.013). The outcome is the same in BU-3DFE: the variance analysis shows that a statistically significant difference (p < 0.05) exists between the proposed method and all other methods. However, in the CK+ FER database, the statistical analysis shows that, except for ref. [41], where a statistically significant difference (p < 0.05) exists, the remaining methods show no statistically significant differences (p > 0.05). Specifically, comparing the proposed method with An and Liu [29], Ch [30], and Liao et al. [31] yields p = 0.847, p = 0.909, and p = 0.991, respectively. Although the analysis appears to reveal a balanced performance between the proposed methodology and the last three techniques, the average recognition accuracy of the proposed method against any of them, as shown in Fig. 18, indicates that the proposed method is superior.

Conclusions
This study improves the FER performance in higher poses. 2D pose conversion schemes have been established to handle pose-invariant FER problems successfully within a small-scale pose variation. However, they often fail for large-scale, in-depth face variations because of the disjointedness of the image. Human face geometry is ellipsoidal; therefore, the feature points are robustly tracked from one frame to the next using an ellipsoidal model. We use the Gabor feature extraction technique for the salient visible features, mostly around the cheeks, eyes, mouth, and nose ridges. The Gabor feature extraction algorithm is useful for this study because it is selective toward orientation, localization, and frequency. We then used an ensemble classification technique, which combines SVM and AdaBoost, for feature selection and classification. The proposed technique outperforms the most recent and popular methods. In the future, we intend to investigate this problem using other feature extraction methods such as LBP and LBP + HOG.