Joint Spotting and Recognition of Micro-Expressions in Long Videos via Temporal Representation Learning

Mingyang Yin; Aiguo Song; Chaolong Qin

doi:10.1109/icrai68431.2025.11396696

Abstract

Facial micro-expressions (MEs) are brief and subtle facial movements that reveal genuine emotions, making them critical cues for affective analysis and human-robot interaction. In this work, we propose a unified framework for micro-expression spotting and recognition in long video sequences. Our model is built upon the Perceiver IO architecture, which enables scalable temporal modeling across variable-length sequences by encoding global context into a fixed-size latent space. To simultaneously address spotting and recognition, we adopt a dual-branch decoder that estimates frame-wise expression likelihoods and emotional categories, respectively. To enhance stability and boundary sensitivity, we introduce two auxiliary learning strategies: a soft KL-divergence-based consistency loss that enforces emotion prediction coherence within expression segments, and a boundary-aware contrastive loss that sharpens temporal boundaries between expressive and neutral frames. In addition, we introduce a phased supervision scheme that leverages ground truth segments in early training and pseudolabels in later stages. Experiments conducted on CAS(ME)² and SAMM-LV demonstrate that our method achieves state-of-the-art performance.
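The abstract does not give the exact form of the consistency loss, so the following is only an illustrative sketch of one plausible reading: within each expression segment, every frame's predicted emotion distribution is pulled toward the segment's mean distribution via a KL term. The function names (`softmax`, `kl_divergence`, `segment_consistency_loss`), the choice of the segment mean as the target, and the inclusive `(start, end)` segment format are all assumptions made for illustration, not the authors' implementation.

```python
import math


def softmax(logits):
    """Convert one frame's emotion logits to a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def segment_consistency_loss(frame_logits, segments):
    """Mean KL between each frame's prediction and its segment's mean prediction.

    frame_logits: list of per-frame emotion logits (one list per frame).
    segments: list of (start, end) frame indices, end inclusive (assumed format).
    """
    total, count = 0.0, 0
    for start, end in segments:
        probs = [softmax(frame_logits[t]) for t in range(start, end + 1)]
        n_classes = len(probs[0])
        # Segment-level "soft" target: the average distribution over the segment.
        mean = [sum(p[c] for p in probs) / len(probs) for c in range(n_classes)]
        for p in probs:
            total += kl_divergence(p, mean)
            count += 1
    return total / count if count else 0.0


# Hypothetical example: a 4-frame clip with one expression segment on frames 1..2.
coherent = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
incoherent = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 0.0]]
loss_coherent = segment_consistency_loss(coherent, [(1, 2)])
loss_incoherent = segment_consistency_loss(incoherent, [(1, 2)])
```

Under this reading, a segment whose frames agree on the emotion incurs zero loss, while frames that disagree are penalized in proportion to how far each distribution sits from the segment average.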
