Facial micro-expressions (MEs) are brief, subtle facial movements that reveal genuine emotions, making them critical cues for affective analysis and human-robot interaction. In this work, we propose a unified framework for micro-expression spotting and recognition in long video sequences. Our model is built on the Perceiver IO architecture, which enables scalable temporal modeling of variable-length sequences by encoding global context into a fixed-size latent space. To address spotting and recognition jointly, we adopt a dual-branch decoder whose two branches estimate frame-wise expression likelihoods and emotion categories, respectively. To improve training stability and boundary sensitivity, we introduce two auxiliary losses: a soft KL-divergence-based consistency loss that enforces coherent emotion predictions within each expression segment, and a boundary-aware contrastive loss that sharpens the temporal boundaries between expressive and neutral frames. In addition, we introduce a phased supervision scheme that uses ground-truth segments in early training and pseudo-labels in later stages. Experiments on CAS(ME)² and SAMM-LV demonstrate that our method achieves state-of-the-art performance.
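The abstract does not give the exact form of the consistency loss, so the following is only a minimal sketch of one plausible reading: within each expression segment, every frame's predicted emotion distribution is pulled toward the segment's mean distribution by a KL term. The function names (`softmax`, `kl_divergence`, `segment_consistency_loss`), the inclusive `(start, end)` segment convention, and the choice of the segment mean as the reference distribution are all assumptions made for illustration, not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def segment_consistency_loss(frame_logits, segments):
    """Average KL between each frame's emotion distribution and its segment mean.

    frame_logits: list of per-frame emotion logits (one list per frame).
    segments: list of (start, end) frame indices, inclusive (assumed convention).
    """
    losses = []
    for start, end in segments:
        probs = [softmax(frame_logits[t]) for t in range(start, end + 1)]
        n = len(probs)
        num_classes = len(probs[0])
        # Reference distribution: mean prediction over the segment (an assumption).
        mean = [sum(p[c] for p in probs) / n for c in range(num_classes)]
        losses.append(sum(kl_divergence(p, mean) for p in probs) / n)
    return sum(losses) / len(losses) if losses else 0.0


# Hypothetical example: three frames agree, the fourth predicts another class.
frame_logits = [[2.0, 0.0, 0.0]] * 3 + [[0.0, 2.0, 0.0]]
coherent = segment_consistency_loss(frame_logits, [(0, 2)])
incoherent = segment_consistency_loss(frame_logits, [(0, 3)])
```

Under this reading, a segment whose frames all predict the same distribution incurs essentially zero loss, while a segment with a disagreeing frame is penalized. In a real training setup the same idea would be written with a differentiable tensor library and combined with the spotting, recognition, and contrastive terms.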