Visual temporal attention

From HandWiki
Jump to: navigation, search
Video frames of the Parallel Bars action category in the UCF-101 dataset[1] (a) The highest ranking four frames in video temporal attention weights, in which the athlete is performing on the parallel bars; (b) The lowest ranking four frames in video temporal attention weights, in which the athlete is standing on the ground. All weights are predicted by the ATW CNN algorithm[2]. The highly weighted video frames generally captures the most distinctive movements relevant to the action category.

Visual temporal attention is a special case of visual attention that involves directing attention to specific instant of time. Similar to its spatial counterpart visual spatial attention, these attention modules have been widely implemented in video analytics in computer vision to provide enhanced performance and human interpretable explanation[3] of deep learning models.

As visual spatial attention mechanism allows human and/or computer vision systems to focus more on semantically more substantial regions in space, visual temporal attention modules enable machine learning algorithms to emphasize more on critical video frames in video analytics tasks, such as human action recognition. In convolutional neural network-based systems, the prioritization introduced by the attention mechanism is regularly implemented as a linear weighting layer with parameters determined by labeled training data[3].

Application in Action Recognition

ATW CNN architecture[4]. Three CNN streams are used to process spatial RGB images, temporal optical flow images, and temporal warped optical flow images, respectively. An attention model is employed to assign temporal weights between snippets for each stream/modality. Weighted sum is used to fuse predictions from the three streams/modalities.

Recent video segmentation algorithms often exploits both spatial and temporal attention mechanisms[2][4]. Research in human action recognition has accelerated significantly since the introduction of powerful tools such as Convolutional Neural Networks (CNNs). However, effective methods for incorporation of temporal information into CNNs are still being actively explored. Motivated by the popular recurrent attention models in natural language processing, the Attention-aware Temporal Weighted CNN (ATW CNN) is proposed[4] in videos, which embeds a visual attention model into a temporal weighted multi-stream CNN. This attention model is implemented as temporal weighting and it effectively boosts the recognition performance of video representations. Besides, each stream in the proposed ATW CNN framework is capable of end-to-end training, with both network parameters and temporal weights optimized by stochastic gradient descent (SGD) with back-propagation. Experimental results show that the ATW CNN attention mechanism contributes substantially to the performance gains with the more discriminative snippets by focusing on more relevant video segments.

See also


  1. Center, UCF (2013-10-17). "UCF101 - Action Recognition Data Set". 
  2. 2.0 2.1 Zang, Jinliang; Wang, Le; Liu, Ziyi; Zhang, Qilin; Hua, Gang; Zheng, Nanning (2018). "Attention-Based Temporal Weighted Convolutional Neural Network for Action Recognition". IFIP Advances in Information and Communication Technology. Cham: Springer International Publishing. pp. 97–108. doi:10.1007/978-3-319-92007-8_9. ISBN 978-3-319-92006-1. 
  3. 3.0 3.1 "NIPS 2017". 2017-10-20. 
  4. 4.0 4.1 4.2 Wang, Le; Zang, Jinliang; Zhang, Qilin; Niu, Zhenxing; Hua, Gang; Zheng, Nanning (2018-06-21). "Action Recognition by an Attention-Aware Temporal Weighted Convolutional Neural Network". Sensors (MDPI AG) 18 (7): 1979. doi:10.3390/s18071979. ISSN 1424-8220. PMID 29933555. PMC 6069475. CC-BY icon.svg Material was copied from this source, which is available under a Creative Commons Attribution 4.0 International License.