Convolutional Transformer Fusion Blocks for Multi-Modal Gesture Recognition
Gesture recognition defines an important information channel in human-computer interaction. Intuitively, combining inputs from multiple modalities improves the recognition rate. In this work, we explore multi-modal video-based gesture recognition tasks by fusing spatio-temporal representation of relevant distinguishing features from