STTGNet: a spatiotemporal network for human motion prediction based on transformer and graph convolution network
Visual Computing for Industry, Biomedicine, and Art volume 5, Article number: 19 (2022)
Abstract
In recent years, human motion prediction has become an active research topic in computer vision. However, owing to the complexity and stochastic nature of human motion, it remains a challenging problem. In previous works, human motion prediction has typically been treated as an inter-sequence problem, and most works have aimed to capture the temporal dependence between successive frames. Although these approaches focus on the effects of the temporal dimension, they rarely consider the correlation between different joints in space. Thus, the spatiotemporal coupling of human joints is considered here to propose STTGNet, a novel spatiotemporal network based on a transformer and a graph convolutional network (GCN). The temporal transformer is used to capture global temporal dependencies, and the spatial GCN module is used to establish local spatial correlations between the joints in each frame. To overcome the problems of error accumulation and discontinuity in motion prediction, a revision method based on a fusion strategy is also proposed, in which the current prediction frame is fused with the previous frame. The experimental results show that the proposed method yields lower prediction error and smoother predicted motion than previous methods. The effectiveness of the proposed method is also demonstrated by comparing it with the state-of-the-art method on the Human3.6M dataset.
Introduction
Human motion prediction is the prediction of future poses based on a provided sequence of observed poses. It has promising applications in areas such as human-robot interaction, automatic driving, human tracking, and medical care. Nowadays, motion capture equipment can accurately obtain human skeleton sequences, so it is feasible to use these sequences to predict the future poses of the human body. The human motion prediction problem is usually formulated as a sequence modeling problem, and a common approach is to model contextual information in the temporal dimension to capture the temporal dependence between successive frames.
In previous research, most methods have used sequential autoregressive or sequence-to-sequence encoder-decoder models. However, because human motion is a stochastic process, long-term historical information is difficult to capture, and such models tend to collapse to static poses as the prediction horizon grows. Therefore, motion prediction should depend not only on the temporal relationship between sequences, but also on the spatial coupling of different joints during motion. For example, in the action of 'walking,' the two arms swing in opposite directions, so the joints of the two arms influence each other throughout the movement. Spatiotemporal dependencies have also been considered in action recognition research [1, 2], further improving recognition rates. Recently, some research has also taken the spatial dependency into account. Li et al. [3] captured spatial dependencies through convolutional filters, but the dependencies were heavily influenced by the convolutional kernel. In addition, Mao et al. [4] used graph neural networks to model spatial correlation.
Past research indicates that relatively complex networks have generally been required to consider the temporal and spatial dependencies simultaneously. In addition, transformer models have become increasingly popular in computer vision and have achieved remarkable performance in recent years. Compared with other neural networks, a transformer is based entirely on attention mechanisms; it has no complex network structure, and its number of parameters is small. Even the most primitive transformer structure can produce results comparable to those of a complex neural network. Therefore, building on previous research, the transformer is introduced here as a replacement for previously used temporal models and is combined with other neural networks to model the spatiotemporal dependencies, thus effectively capturing more correlations in both the temporal and spatial dimensions.
Affected by cumulative error, the prediction error increases gradually with the prediction length. Moreover, when the prediction error increases suddenly, a "frame skipping" phenomenon occurs and the predicted motion becomes stiff. Given the continuity of human motion, this paper presents a prediction revision module based on a fusion strategy. The current prediction frame is fused with the previous frame, effectively reducing the prediction error and improving the continuity of the predicted action.
In short, the main contributions of this work can be summarized as follows.

A spatiotemporal network, STTGNet, consisting of a temporal transformer module and a spatial graph convolutional network (GCN) module is designed. The temporal transformer extracts the global temporal correlation, and the spatial GCN captures the local spatial coupling of the joints.

A prediction revision module is proposed, which can effectively reduce the prediction error and improve the smoothness of the prediction sequence, thereby alleviating the problem of error accumulation.

In the short-term motion prediction task, fewer parameters are used while achieving better prediction performance on the Human3.6M dataset, and for non-periodic actions, the prediction results are improved.
Related work
The purpose of human motion prediction is to predict the trend of human motion based on observed human motion. As a frontier research direction in artificial intelligence, this technology has been widely followed and studied. Early traditional methods [5,6,7,8,9,10,11] could effectively model a single simple motion through mathematical calculations. With the development of deep learning and large-scale motion datasets, deep learning methods have become a better choice for human motion prediction than traditional methods. Because human motion prediction is a highly time-dependent task and recurrent neural networks (RNNs) are well suited to time-series data, many works have applied RNNs and their variants to this problem. In addition, other works have attempted to take advantage of convolutional neural networks (CNNs), generative adversarial networks (GANs), and more. Therefore, the related works are roughly divided into RNN-based methods and other methods.
RNN-based methods
Fragkiadaki et al. [12] constructed an Encoder-Recurrent-Decoder with three LSTM layers, combined with nonlinear multilayer feedforward networks, to predict motion trends of the human skeleton in videos and synthesize novel motions while avoiding drift over long periods. To dynamically model the entire body and individual limbs, Jain et al. [13] proposed the S-RNN model, using a structural graph of nodes and edges composed of LSTMs for motion prediction; however, they ignored the problem of discontinuity between the observed and predicted poses. In addition, Martinez et al. [14] solved the discontinuity problem by using a simple gated recurrent unit (GRU) with a residual structure and demonstrated the effect of modeling velocities. To synthesize complex motions and generate unconstrained motion sequences, Zhou et al. [15] proposed an auto-conditioned RNN model capable of generating motion sequences of arbitrary length without the problem of stiffness. For static joints in prediction, Tang et al. [16] proposed a modified highway unit that effectively eliminated static poses by summarizing the historical poses associated with the current prediction, based on an RNN and a frame attention module. To guide the model to generate longer-term motion trajectories, Gopalakrishnan et al. [17] used derivative information as a computational feature in a neuro-temporal model with a two-level processing architecture containing a top-level and a bottom-level RNN. The hierarchical motion recurrent network proposed by Liu et al. [18] used LSTMs to model the global and local motion context hierarchically and captured the correlation between joints by representing the skeleton frame in a Lie algebra. Corona et al. [19] proposed a context-aware human motion prediction method, which used a semantic graph model to capture the influence of the spatial layout of objects in the scene and introduced an RNN to improve the accuracy of human motion prediction.
To combine the influence of the human trajectory on motion, Adeli et al. [20] used GRUs to encode trajectory and pose information, predicting both the human trajectory and the skeletal pose in an end-to-end structure. RNNs have excellent temporal modeling ability, but most works using RNNs ignored the spatial correlation between human joints.
Other methods
Li et al. [3] considered both the invariant and dynamical information of human motion and used a multilayer convolutional sequence-to-sequence model to learn features in space and time, resulting in more accurate predictions. Considering that the activity of each part of the body differs during movement, Guo and Choi [21] divided the body into five non-overlapping parts to learn local structural representations separately and obtained better results in long-term prediction. Similarly, Li et al. [22] further developed the idea of Guo and Choi [21], constructing an encoder-decoder structure composed of multiscale graphs to extract human motion features at different scales and further improve the prediction performance. Barsoum et al. [23] used a GAN to produce prediction output, adding a Gaussian distribution vector z to the GAN to increase the diversity of the predicted sequences. Two complementary discriminators were introduced in the adversarial geometry-aware encoder-decoder framework proposed by Gui et al. [24] to improve the accuracy of long-term motion prediction through both local and global discriminators. To move away from the end-to-end training of the human motion prediction task, Wang et al. [25] transformed it into a reinforcement learning problem, proposing a reinforcement learning formulation and an imitation learning algorithm that extended the generative adversarial imitation learning framework to make accurate pose predictions. Pavllo et al. [26] proposed a quaternion-based pose representation, which solved the ambiguity and discontinuity caused by Euler-angle and axis-angle representations, and presented two versions using an RNN and a CNN, respectively. The structured training made the predicted poses more accurate and the errors smaller, but the conversion to four-dimensional space was relatively complex. Mao et al. [4] designed a simple feedforward deep neural network that, unlike pose-space methods, encoded temporal information in trajectory space via the discrete cosine transform (DCT) based on a residual structure. The temporal variation of each human joint was represented as a linear combination of DCT coefficients, and a GCN was used to model the spatial dependence between joints. Building on this work, Mao et al. [27] later proposed a motion attention-based model to learn spatiotemporal dependence by forming motion estimates from historical information. The estimates were combined with the latest observed motion, and the combination was then fed into a GCN-based feedforward network. Recently, Mao et al. [28] investigated different levels of attention, applying attention to the whole body, body parts, and individual joints, and introduced a fusion module to combine this multi-level attention mechanism, achieving better performance. The advantages of GCNs were also found experimentally by Hermes et al. [29], who designed a spatiotemporal convolution with a GCN to extract spatiotemporal features, using a dilated causal convolution to model temporal information while also preserving local joint connectivity, yielding a lightweight autoregressive model. In contrast, Martínez-González et al. [30] proposed a non-autoregressive transformer model to infer pose sequences in parallel, with self-attention and encoder-decoder attention, and added a skeleton-based activity classification to the encoder to improve motion prediction through action recognition.
A CNN generally abstracts dependencies between sequences by performing convolution operations in the temporal dimension, but it is less effective at learning sequence relationships over longer periods. A GAN can effectively learn the temporal dependence of motion sequences through the adversarial training of a generator and a discriminator, but GANs are relatively difficult to train and their parameter tuning is complicated. Although RNNs are well suited to data with temporal dependencies, their ability to learn long-term correlations remains weak, whereas a transformer [31] can model the global dependencies of inputs and outputs through an attention mechanism, overcoming the limitations of RNNs in parallel computation and long-distance learning. In addition, most methods model temporal relations while ignoring the spatial correlation of joints, whereas a GCN can specifically handle non-Euclidean data and can capture the temporal and spatial dependencies of human joints through graphs defined on temporally connected motion trees. To our knowledge, the transformer is not yet widely used in human motion prediction, although it is well established in human pose estimation tasks [32, 33]. To use a more compact representation of the human skeleton, this study follows refs. [4, 27] and uses DCT coefficients for the motion transformation.
Methods
This study proposes STTGNet, based on a transformer and a GCN, which comprehensively considers the temporal and spatial dependence of human motion to improve the accuracy of motion prediction. The overall network framework is shown in Fig. 1.
First, the DCT is applied to encode the temporal information of each joint into the trajectory space. Second, the computed DCT coefficients are passed through a temporal position embedding (TPE), followed by a temporal transformer, to learn the global dependence of the whole temporal sequence. The correlations between local joints are then efficiently learned by the spatial GCN module, which is based on a stack of graph convolution blocks. Finally, in the testing phase, a prediction revision module is added to further correct the error of the predicted action. Compared with previous models, this model captures global and local dependencies in the temporal and spatial dimensions, respectively, and models the motion of human skeletal joints over time, making the prediction results more competitive.
Data preprocessing
Given a motion sequence X_{1 : N} = [X_{1}, X_{2}, X_{3}, ⋯, X_{N}] consisting of N consecutive human poses, where X_{t} ∈ R^{M} denotes the human pose at frame t and M is the dimension of the pose at each frame, the purpose of human motion prediction is to predict the pose sequence X_{N + 1 : N + T} for the next T frames. First, the last frame X_{N} is replicated T times to generate a temporal sequence of length N + T. In this way, the whole task becomes one of generating an output sequence \({\hat{X}}_{1:N+T}\) from the input sequence X_{1 : N + T}. The DCT can obtain a more compact representation by discarding high-frequency signals, which captures the smoothness of human motion well. Therefore, this study uses the DCT to map the human motion joints into a more compact trajectory space to facilitate the learning of overall features. Let \({\left\{{x}_{k,n}\right\}}_{n=1}^L\) represent the angle data of the kth joint in frames 1 to L; its DCT coefficients can be calculated as follows:

\[ C_{k,l}=\sqrt{\frac{2}{L}}\sum_{n=1}^{L}{x}_{k,n}\,\frac{1}{\sqrt{1+{\delta}_{l1}}}\cos \left(\frac{\pi }{2L}\left(2n-1\right)\left(l-1\right)\right) \]

where l ∈ {1, 2, ⋯, L} and \({\delta}_{ij}=\left\{\begin{array}{c}1,i=j\\ {}0,i\ne j\end{array}\right.\) .
Second, the computed DCT coefficients are sequentially fed into the temporal transformer (T-transformer) and spatial GCN (SGCN) modules to learn the dependencies in the temporal and spatial dimensions, respectively. Finally, the processed DCT coefficients are subjected to an inverse discrete cosine transform (IDCT) to recover the human motion pose data:

\[ {x}_{k,n}=\sqrt{\frac{2}{L}}\sum_{l=1}^{L}{C}_{k,l}\,\frac{1}{\sqrt{1+{\delta}_{l1}}}\cos \left(\frac{\pi }{2L}\left(2n-1\right)\left(l-1\right)\right) \]
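The DCT/IDCT pair above can be sketched in NumPy. The orthonormal DCT-II basis below follows the stated formula, including the \(1/\sqrt{1+\delta_{l1}}\) correction for the first coefficient; the function names are illustrative, and a full implementation would apply the transform to all K joint trajectories at once.

```python
import numpy as np

def dct_basis(L):
    """Orthonormal DCT-II basis D of shape (L, L); row l-1 holds the l-th
    cosine basis, with the 1/sqrt(1 + delta_l1) correction on row l = 1."""
    n = np.arange(1, L + 1)  # frame index
    l = np.arange(1, L + 1)  # coefficient index
    D = np.sqrt(2.0 / L) * np.cos(np.pi / (2 * L) * np.outer(l - 1, 2 * n - 1))
    D[0] /= np.sqrt(2.0)     # delta correction for the DC (l = 1) row
    return D

def dct(x):
    """Map one joint-angle trajectory x (length L) to its DCT coefficients."""
    return dct_basis(len(x)) @ x

def idct(C):
    """Recover the trajectory from its coefficients; D is orthogonal,
    so the inverse transform is simply the transpose of the basis."""
    return dct_basis(len(C)).T @ C
```

Because the basis matrix is orthogonal, zeroing the trailing (high-frequency) coefficients before applying `idct` yields the smoother, more compact trajectory representation described above.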
T-transformer
Compared with the RNNs commonly used in human motion prediction, the transformer has a stronger ability to extract long-distance features and can dynamically build long-range dependencies over input sequences, capturing long-distance dependencies more effectively. Considering these advantages, a transformer is used instead of an RNN or its variants to capture the relationships among more frames in the temporal dimension and obtain more temporal dependence. Unlike ref. [34], which uses a spatiotemporal transformer, this study builds a transformer-based network only in the temporal dimension; therefore, the temporal transformer (T-transformer) module is proposed.
T-transformer module
The proposed T-transformer module focuses on modeling the global dependencies between temporal frames in the input sequence; the network structure is shown in Fig. 2. As in a machine translation task, each human pose is regarded as a 'word,' and the future pose is predicted in the same way as the next 'word.' The sequence of human poses {X_{1}, X_{2}, ⋯, X_{N + T}} is transformed by the DCT into Z ∈ R^{(N + T) × K}, where K is the dimension of each pose. Before the T-transformer module is applied, the TPE is used to retain the position information of the temporal frames, and its result is added to the input sequence to obtain the input feature Z_{0} ∈ R^{(N + T) × K}. The T-transformer encoder consists of multi-head dot-product attention and a multilayer perceptron (MLP) to capture the temporal correlation of the input data, and its output is denoted as \({Z}_{L_T}\in {R}^{\left(N+T\right)\times K}\). The whole temporal transformer can be expressed as the following process:

\[ {Z}_l^{\prime }=\mathrm{MSA}\left(\mathrm{LN}\left({Z}_{l-1}\right)\right)+{Z}_{l-1} \]
\[ {Z}_l=\mathrm{MLP}\left(\mathrm{LN}\left({Z}_l^{\prime }\right)\right)+{Z}_l^{\prime } \]

where LN(⋅) represents layer normalization, MSA(⋅) denotes multi-head self-attention, and l = 1, 2, ⋯, L_{T} indicates that the T-transformer is stacked from L_{T} identical layers.
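The layer structure above (multi-head self-attention, then an MLP, each wrapped in a pre-norm residual connection) can be illustrated with a minimal single-head NumPy sketch. The parameter names and the ReLU activation are illustrative assumptions, not the exact configuration of STTGNet.

```python
import numpy as np

def layer_norm(z, eps=1e-5):
    # normalize each row (one temporal frame) to zero mean, unit variance
    mu = z.mean(-1, keepdims=True)
    var = z.var(-1, keepdims=True)
    return (z - mu) / np.sqrt(var + eps)

def softmax(a):
    a = a - a.max(-1, keepdims=True)  # numerical stability
    e = np.exp(a)
    return e / e.sum(-1, keepdims=True)

def self_attention(z, Wq, Wk, Wv):
    # single-head scaled dot-product self-attention over temporal frames
    Q, K, V = z @ Wq, z @ Wk, z @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return A @ V

def transformer_layer(z, p):
    """One pre-norm layer: Z' = Z + MSA(LN(Z)); Z_out = Z' + MLP(LN(Z'))."""
    z = z + self_attention(layer_norm(z), p["Wq"], p["Wk"], p["Wv"])
    z = z + np.maximum(0, layer_norm(z) @ p["W1"]) @ p["W2"]  # two fc layers
    return z
```

Stacking `transformer_layer` L_T times, with the TPE added to the input beforehand, mirrors the process written above.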
Multi-head self-attention
Multi-head attention is intended to gather information from subspaces at different positions using multiple heads. The input feature Z_{0} ∈ R^{(N + T) × K} is linearly transformed to obtain Q = Z_{0}W_{Q}, K = Z_{0}W_{K}, and V = Z_{0}W_{V}, where the weight matrices W_{Q}, W_{K}, W_{V} ∈ R^{K × K} and Q, K, V ∈ R^{(N + T) × K}. The three matrices Q, K, and V are then subjected to h different linear transformations (h being the number of heads), and dot-product attention is applied in parallel. Finally, the attention outputs of the h heads are concatenated. This process can be expressed as:

\[ \mathrm{MultiHead}\left(Q,K,V\right)=\mathrm{Concat}\left({\mathrm{head}}_1,\cdots, {\mathrm{head}}_h\right){W}_{out} \]
\[ {\mathrm{head}}_i=\mathrm{Attention}\left({Q}_i,{K}_i,{V}_i\right) \]

where W_{out} is the weight matrix applied to the concatenated attention outputs of the h heads; in this study, h takes the value 8.
Scaled dot-product attention
The dot-product attention used in this study is the scaled dot-product attention [31]. Its input is composed of a query matrix Q, a key matrix K, and a value matrix V. The attention output is computed by taking the dot product of each query with all keys, multiplying the result by a scaling factor, and obtaining the weights over the values with the Softmax function. The similarity score between Q and K can be calculated as follows:

\[ A=\mathrm{Softmax}\left(\frac{Q{K}^{\mathrm{T}}}{\sqrt{d}}\right) \]

where \(1 \left/ \sqrt{d}\right.\) is the scaling factor. Its purpose is proper normalization: as d increases, the dot products grow large, which pushes the Softmax function into saturation, where it produces only very small gradients. Ultimately, the output of the dot-product attention can be expressed as:

\[ \mathrm{Attention}\left(Q,K,V\right)= AV \]
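Scaled dot-product attention and its multi-head variant can be sketched together in NumPy. The head-splitting layout (reshaping the K-dimensional projections into h heads of dimension K/h) is a common convention assumed here for illustration.

```python
import numpy as np

def softmax(a):
    a = a - a.max(-1, keepdims=True)  # numerical stability
    e = np.exp(a)
    return e / e.sum(-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = Softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    A = softmax(Q @ K.swapaxes(-1, -2) / np.sqrt(d))
    return A @ V

def multi_head_attention(Z, Wq, Wk, Wv, Wout, h=8):
    """Split the projections into h heads, attend in parallel, concatenate,
    then mix the concatenated heads with Wout."""
    N, Kdim = Z.shape
    d = Kdim // h  # per-head dimension (assumes h divides Kdim)
    Q, K, V = Z @ Wq, Z @ Wk, Z @ Wv
    split = lambda X: X.reshape(N, h, d).transpose(1, 0, 2)  # -> (h, N, d)
    heads = scaled_dot_product_attention(split(Q), split(K), split(V))
    concat = heads.transpose(1, 0, 2).reshape(N, Kdim)       # Concat(head_1..h)
    return concat @ Wout
```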
MLP
The MLP is added to increase the nonlinearity of the network. In this study, the output of the multi-head attention is layer-normalized and then passed through two fully connected layers in turn, which can be expressed as:

\[ {Z}_{out}= fc\left( fc\left(\mathrm{LN}\left({Z}_{ma}\right)\right)\right) \]

where LN(⋅) denotes layer normalization, fc(⋅) denotes a fully connected layer, and Z_{ma} is the output of the multi-head self-attention layer.
SGCN
The proposed T-transformer module extracts only the temporal features of the sequence. However, because of motion coupling, joints also affect each other in space during motion. The human skeleton resembles a graph structure: its joints can be regarded as nodes, and the connections between joints as edges. Inspired by ref. [35], this study adopts a GCN module similar to those of refs. [4, 27]. The improved network structure, named SGCN, is shown in Fig. 3. The human skeleton is regarded as a fully connected graph with K nodes; the learnable adjacency matrix A ∈ R^{K × K} represents the connection strength between the nodes, and the feature matrix H^{(l)} ∈ R^{K × M} is the input of the graph convolution layer, where M is the feature dimension output by the previous layer. The output of the graph convolution block is obtained by combining the input with the trainable weight matrix W^{(l)} ∈ R^{M × \(\tilde{M}\)}, where \(\tilde{M}\) is the feature dimension of the output of the graph convolution layer. The entire graph convolution block can be expressed as:

\[ {H}^{\left(l+1\right)}=\mathrm{BN}\left(\sigma \left({A}^{(l)}{H}^{(l)}{W}^{(l)}\right)\right) \]

where BN(⋅) denotes batch normalization and σ(⋅) is the activation function. Both A^{(l)} and W^{(l)} are learned through backpropagation.
The K × (N + T) matrix output by the T-transformer is used as the input of the first SGCN layer, and each graph convolution block produces a matrix of size \(K\times \tilde{M}\). The SGCN module is constructed by stacking multiple such graph convolution blocks. To match the dimensions, the last layer is mapped back to the same dimension as the input matrix, and the output of the whole SGCN module is denoted \({SG}_{L_S}\in {R}^{K\times \left(N+T\right)}\). Long residual connections [36] are added between the ith and (L_S − i + 1)th blocks, i ∈ {1, ⋯, L_S/2}, as shown in Fig. 3. These long residual connections ease gradient propagation, prevent vanishing gradients, and accelerate training.
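A single graph-convolution block and a simplified stack might look as follows in NumPy. The tanh activation and the single outer residual connection are simplifications for illustration; the SGCN described above additionally uses batch normalization and symmetric long residuals between paired blocks.

```python
import numpy as np

def gc_block(H, A, W):
    """One graph-convolution block: H' = sigma(A @ H @ W).
    A (K x K) is a fully learnable adjacency over the K joint-trajectory
    nodes, H (K x M) the input features, W (M x M~) the trainable weights.
    Batch normalization is omitted from this sketch."""
    return np.tanh(A @ H @ W)

def sgcn(H, blocks):
    """Stack graph-convolution blocks; 'blocks' is a list of (A, W) pairs.
    A single long residual around the whole stack stands in for the paired
    long residual connections (assumes matching feature dimensions)."""
    out = H
    for A, W in blocks:
        out = gc_block(out, A, W)
    return out + H  # long residual connection
```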
Prediction revision module
A common problem in human motion prediction is that it is difficult to recover from prediction errors, which leads to error accumulation and discontinuous motions. Previous works have commonly addressed this problem with a sampling-based loss [14] or a convergence loss [21], or by constraining the internal state of the network through a GAN, both of which increase the number of network hyperparameters to some extent. Unlike previous works, this study adds a simple and effective prediction revision module in the testing phase to reduce the final prediction error of the model, as shown in Fig. 4. The module is based on a fusion strategy: the current prediction frame is fused with the prediction from the previous frame, and the fused value is used as the prediction for the current frame. The rationale is that human actions are continuous, so the difference between two adjacent frames should not be too great. If the current frame produces a large prediction error, fusing it with the prediction of the previous frame 'pulls' the current prediction back and prevents a sudden jump in motion, thereby reducing the prediction error and improving the smoothness of the motion. The specific fusion equation is:

\[ \hat{Y}=\alpha\, {\hat{Y}}_P+\beta\, {\hat{Y}}_C \]

where \({\hat{Y}}_P\) is the predicted value of the previous frame, \({\hat{Y}}_C\) is the predicted value of the current frame, \(\hat{Y}\) is the final predicted value of the current frame, and α and β are the fusion coefficients.
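The fusion rule can be sketched as a sequential pass over the predicted frames. Whether the 'previous frame' is the raw or the already revised prediction is an implementation detail not specified above; this sketch reuses the revised value, which compounds the smoothing.

```python
import numpy as np

def revise(predictions, alpha=0.125, beta=0.875):
    """Fuse each predicted frame with the (revised) previous frame:
    Y_hat = alpha * Y_prev + beta * Y_curr. The first frame is unchanged.
    predictions: array of shape (T, M), one pose vector per future frame."""
    revised = [predictions[0]]
    for frame in predictions[1:]:
        revised.append(alpha * revised[-1] + beta * frame)
    return np.stack(revised)
```

With the coefficients reported in the ablation (α = 0.125, β = 0.875), a frame that jumps away from its predecessor is pulled back by 12.5% toward the previous prediction, while smooth sequences pass through almost unchanged.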
Results and discussion
To demonstrate the effectiveness of the STTGNet proposed in this study, experiments were carried out on the Human3.6M dataset, and the results were compared and analyzed against state-of-the-art methods.
Experimental details
The proposed network model was implemented in the PyTorch framework and trained using the ADAM optimizer [37]. All experimental results were obtained using a single NVIDIA 1080Ti graphics card. The batch size was set to 32, the number of training epochs to 3000, and the learning rate to 0.0005. The parameter size of the network was 2.33M.
Joint angles were used to represent the human pose. Given the input joint angles, the corresponding coefficients were obtained using the DCT, and after training, the IDCT was applied to recover the predicted DCT coefficients to the corresponding angles. To train the network effectively, the average L_{1} distance between the predicted joint angles and the ground truth was used as the loss function. For a training sample, the loss function can be expressed as:

\[ \ell =\frac{1}{K\left(N+T\right)}\sum_{n=1}^{N+T}\sum_{k=1}^{K}\left|{\hat{x}}_{k,n}-{x}_{k,n}\right| \]

where \({\hat{x}}_{k,n}\) is the predicted value of the kth joint in the nth frame obtained from the network, and x_{k, n} is the corresponding ground truth.
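The loss is a straightforward mean of absolute differences over all K joints and N + T frames:

```python
import numpy as np

def l1_loss(pred, gt):
    """Average L1 distance: 1/(K(N+T)) * sum_{n,k} |x_hat_{k,n} - x_{k,n}|.
    pred, gt: arrays of shape (N + T, K) of predicted / ground-truth angles."""
    return np.abs(pred - gt).mean()
```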
Dataset
Human3.6M [38] is currently the most commonly used open-source dataset for the human motion prediction task. It contains 3.6 million 3D human poses recorded with the Vicon motion capture system, together with the corresponding RGB images, depth images, and body surface data acquired by 3D scanning equipment. It covers 15 actions, such as walking, eating, and discussing, performed by seven subjects; each subject performs each action in two trials, each sequence contains approximately 3000 to 5000 frames, and each frame contains 34 rows of data, including the global translation, global rotation, and 32 joint rotations relative to their parent joints. Following the data processing of previous works [4, 30], the global rotation, translation, and constant angles were removed. Following standard protocols [13, 14, 26], all motion sequences were downsampled to 25 frames per second; Subject 5 (S5) was used as the test set, Subject 11 (S11) as the validation set, and the remaining subjects as the training set.
Evaluation metric and baselines
Evaluation metric
To verify the validity of the experimental results fairly, the mean angular error (MAE) was used as the evaluation metric. Specifically:

\[ \mathrm{MAE}=\frac{1}{T}\sum_{n=1}^{T}\left\Vert {\hat{y}}_n-{y}_n\right\Vert_2 \]

where \({\hat{y}}_n\) is the predicted pose of the nth frame, and y_{n} is the corresponding ground truth. For this metric, the prediction results from 0 to 400 ms are highlighted and reported, following the baselines of previous works [13, 14].
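A common instantiation of this metric in the motion prediction literature, assumed here, computes the per-frame Euclidean distance between the predicted and ground-truth angle vectors and averages it over the predicted frames:

```python
import numpy as np

def mean_angular_error(pred, gt):
    """pred, gt: (T, M) angle sequences; per-frame Euclidean distance,
    averaged over the T predicted frames (assumed form of the MAE)."""
    return np.linalg.norm(pred - gt, axis=1).mean()
```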
Baselines
The proposed approach was compared with commonly used motion prediction baselines and several recent methods, including MultiGan [39], OoD [40], STConv [29], POTR-GCN [30], and ST-transformer [34], as well as the state-of-the-art methods HRI [27] and DMGNN [22]. For the prediction baselines, the results were taken from their respective papers; for HRI [27], the official code published on GitHub was reproduced.
Experimental results
Consistent with previous studies, the model was trained using 50 frames and predicted the poses for the next 10 frames. Table 1 [3, 14, 22, 27, 29, 30, 34, 39, 40] shows the joint angle errors of this model compared with the baselines for all actions on Human3.6M. For a more intuitive comparison, the best results are presented in bold and the suboptimal results in italics.
The comparison shows that, relative to the common motion prediction baselines [3, 14], STTGNet makes great improvements on all actions except 'Phoning.' This is mainly because the 'Phoning' movement has little spatiotemporal dependence: its motion occurs mainly at one hand while the rest of the body is almost static. Even so, STTGNet achieved suboptimal results on this action. Compared with the recently proposed method [34], which also uses transformers for motion prediction, the error produced by the proposed method is smaller for almost every action, resulting in a better average error. Because the ST-transformer [34] focuses on long-term prediction, it performs better for motions longer than 1 s, indicating that the advantage of the spatiotemporal transformer becomes more obvious as time increases. This study focused on short-term motion prediction and used only a temporal transformer to capture the temporal relationship, producing excellent results in short-term forecasting, which shows the effectiveness of the temporal transformer. Compared with the other recently proposed methods in refs. [22, 29, 30, 34, 39, 40], STTGNet achieved optimal results on more than half of the actions and nearly optimal results on the others. In the comparison of the average error, apart from the suboptimal result at 80 ms, STTGNet achieved the best results at 160, 320, and 400 ms. Moreover, compared with the state-of-the-art method, the average prediction error was reduced by 3.85% at 160 ms, 2.44% at 320 ms, and 5.32% at 400 ms. Furthermore, the experimental results show that the prediction error of STTGNet grew more slowly as the prediction time increased, indicating that the method accumulates little error.
Because STTGNet adopts the transformer structure, the model is relatively simple and has few parameters: the total parameter count is only 2.33M, whereas that of ref. [27] is 3.08M. To show the advantages of the method more intuitively, a visual comparison of some prediction results was made; the results are shown in Fig. 5.
Ablation experiment
Extensive ablation experiments were conducted on the Human3.6M dataset to validate the contributions of the various module components of the proposed STTGNet. To compare the impact of each component fairly, the structural parameters were fixed, except for the part being verified in each experiment.
The influence of the T-transformer module, SGCN module, and prediction revision module
The proposed T-transformer module, SGCN module, and prediction revision module were described in detail in the Methods section; here, the focus is on evaluating their impact on the whole network. The DCT coefficients are used as the input of the T-transformer module, and the TPE is included before this module to retain more information about the position of the temporal frames, so the impact of the TPE was evaluated as well. To prove the effect of each proposed module, the following combinations were explored in the ablation experiments: (1) applying only the T-transformer module; (2) applying only the SGCN module; (3) using the T-transformer and SGCN modules; (4) using the T-transformer, SGCN, and TPE; and (5) using the T-transformer, SGCN, TPE, and the prediction revision module (PR for short). The results are documented in Table 2.
The ablation results show that when the T-transformer and SGCN were used together, the effect was better than using either alone, further proving that the T-transformer and SGCN capture the dependencies in the temporal and spatial dimensions, respectively. When the TPE module was added, the prediction results reached the state-of-the-art level. After adding the prediction revision module, the prediction error at 80 ms increased by 0.01, mainly because for predictions with small error, the revision module may introduce new error. In all other cases, however, the prediction results improved to varying extents; the revision effect is especially obvious when the error is large.
Influence of network parameters
There are three important parameters in STTGNet: the number of attention heads H, the number of T-transformer layers L_{T}, and the number of S-GCN layers L_{S}. The experiment explored various combinations of these parameters to find the best composition of the network structure. Without the prediction revision module, the effect of different structural parameters on the experimental results is recorded in Table 3. The results show that the network achieved the best result when H = 8, L_{T} = 6, and L_{S} = 14.
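The parameter search described above can be sketched as a small grid search. The `train_and_evaluate` callback is a placeholder for a full training-and-evaluation run, and the candidate values below are illustrative assumptions, not the exact grid reported in Table 3.

```python
from itertools import product

def grid_search(train_and_evaluate,
                heads=(4, 8, 16),        # candidate values for H
                t_layers=(4, 6, 8),      # candidate values for L_T
                s_layers=(10, 12, 14)):  # candidate values for L_S
    """Try every (H, L_T, L_S) combination and keep the lowest-error one.

    train_and_evaluate(h, lt, ls) should train the network with the given
    structural parameters and return its mean prediction error.
    """
    best = None
    for h, lt, ls in product(heads, t_layers, s_layers):
        error = train_and_evaluate(h, lt, ls)
        if best is None or error < best[0]:
            best = (error, h, lt, ls)
    return best  # (lowest error, H, L_T, L_S)
```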
Influence of coefficients in prediction revision module
During the experiment, the effect of fusing the current predicted frame with the predicted value of the previous frame was investigated, so different combinations of the α and β coefficients were used to explore the optimal degree of fusion. Table 4 shows the results for different coefficients. The experiments show that the choice of coefficients affects the prediction results, with the smallest mean error obtained when α = 0.125 and β = 0.875. Therefore, this pair of coefficients was selected in the final experiment.
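As a minimal sketch, the revision step can be written as a per-frame weighted fusion. Assumptions: poses are stored as (frames, joints, 3) coordinate arrays, and each frame is fused with the already-revised previous frame; the exact tensor layout and fusion order in the authors' implementation may differ.

```python
import numpy as np

ALPHA, BETA = 0.125, 0.875  # best coefficient pair found in Table 4

def revise_predictions(pred, alpha=ALPHA, beta=BETA):
    """Smooth a predicted pose sequence by fusing consecutive frames.

    pred: array of shape (T, J, 3) -- T predicted frames, J joints, 3D coords.
    The first frame is kept unchanged; each later frame t is replaced by
    alpha * revised[t - 1] + beta * pred[t], which damps sudden jumps between
    successive frames and reduces accumulated error.
    """
    revised = np.asarray(pred, dtype=float).copy()
    for t in range(1, revised.shape[0]):
        revised[t] = alpha * revised[t - 1] + beta * revised[t]
    return revised
```

Because α + β = 1, the fusion is a convex combination, so each revised pose lies between the previous revised frame and the current raw prediction.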
Conclusions
The spatiotemporal network (STTGNet) proposed in this work uses two internal modules, the T-transformer and the S-GCN, to model the spatiotemporal dependence of human skeletal joints, while the prediction revision module reduces the cumulative error by fusing the current prediction frame with the prediction information of the previous frame, thereby better accomplishing the task of human motion prediction. Experiments on the Human3.6M dataset show that the proposed method achieves state-of-the-art results on most actions compared with the commonly used baselines and recently released motion prediction models. Although STTGNet produced excellent results in short-term motion prediction using relatively few parameters, there remains room to reduce the number of parameters and improve the results for long-term motion prediction. In future work, we will continue to build a lightweight network to further reduce the network parameters, study algorithms for learning the fusion weights of the revision module, and explore models for longer-term motion prediction.
Availability of data and materials
The datasets used or analyzed during the current study are publicly available.
Abbreviations
STTGNet: Spatiotemporal network based on transformer and GCN
GCN: Graph convolutional network
GAN: Generative adversarial network
RNN: Recurrent neural network
CNN: Convolutional neural network
LSTM: Long short-term memory
DCT: Discrete cosine transform
TPE: Temporal position embedding
IDCT: Inverse discrete cosine transform
T-transformer: Temporal transformer
MLP: Multilayer perceptron
MAE: Mean angular error
GRU: Gated recurrent unit
S-GCN: Spatial GCN
Acknowledgements
Thanks to Wei Mao of the Australian National University for his guidance and assistance during the experiment of this paper.
Funding
This work was supported in part by the Key Program of NSFC (Grant No. U1908214), Program for Innovative Research Team in University of Liaoning Province (LT2020015), the Support Plan for Key Field Innovation Team of Dalian (2021RT06), the Science and Technology Innovation Fund of Dalian (Grant No. 2020JJ25CY001).
Author information
Contributions
Lujing Chen, Rui Liu, Xin Yang, Dongsheng Zhou, Qiang Zhang, and Xiaopeng Wei participated in the literature search, data analysis, manuscript writing and editing; all the authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Chen, L., Liu, R., Yang, X. et al. STTGnet: a Spatiotemporal network for human motion prediction based on transformer and graph convolution network. Vis Comput Ind Biomed Art 5, 19 (2022). https://doi.org/10.1186/s42492-022-00112-5
Keywords
 Human motion prediction
 Transformer
 Graph convolutional network