The efficient integration of visual and tactile information is essential for slip detection and for evaluating grasp stability. However, existing research generally incorporates visual or tactile modalities as prior information, without fully exploring mechanisms for fusing the complementary modalities. In this article, we propose a visual–tactile fusion Transformer (VTF-Trans) for slip detection, designed to handle unaligned data in different formats and to facilitate cross-modal information exchange. The main contributions are as follows. First, VTF-Trans employs an improved dual-stream Transformer for feature extraction, together with a gated modal attention module that further refines cross-modal fusion. Compared with existing methods, VTF-Trans effectively integrates useful information from different modalities across multiple scales. Second, to extract deep multimodal information, we propose a cross-modal attention (CMA) mechanism. By defining cross-affinity based on single-modality affinities (token metrics), CMA naturally alleviates the gap between different modalities and domains. Third, we evaluate VTF-Trans on three datasets and conduct grasping experiments on unknown objects. Compared with state-of-the-art methods, VTF-Trans achieves the highest accuracy for robotic grasping and slip detection, demonstrating its practical applicability.
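As a rough illustration of the kind of cross-modal exchange described in this abstract, the sketch below shows generic scaled dot-product cross-attention between two token sequences of different lengths, followed by a simple sigmoid gate. It is not the VTF-Trans implementation: the abstract does not specify how CMA builds cross-affinity from single-modality affinities or how the gated modal attention module is parameterized, so the function names (`cross_attention`, `gated_fusion`), the gate definition, the toy token values, and the absence of learned projections are all assumptions made for illustration.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Each query token attends over all tokens of the other modality.

    The two sequences may have different lengths (unaligned data);
    only the feature dimension must match. No learned projections
    are used here, which is a simplification.
    """
    d = len(queries[0])
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in context])
        attended.append(
            [sum(w * k[j] for w, k in zip(weights, context)) for j in range(d)]
        )
    return attended


def gated_fusion(own, attended):
    """Blend each token with its cross-modal context.

    The gate is a hypothetical scalar sigmoid of the token/context
    similarity, not the gate used in the paper.
    """
    fused = []
    for x, c in zip(own, attended):
        g = 1.0 / (1.0 + math.exp(-dot(x, c)))
        fused.append([g * ci + (1.0 - g) * xi for xi, ci in zip(x, c)])
    return fused


# Toy tokens: 3 visual tokens and 5 tactile tokens, feature dimension 2.
visual = [[0.2, 0.1], [0.5, -0.3], [0.0, 0.4]]
tactile = [[0.1, 0.0], [-0.2, 0.3], [0.4, 0.4], [0.0, -0.1], [0.3, 0.2]]

# Exchange information in both directions, then gate.
fused_visual = gated_fusion(visual, cross_attention(visual, tactile))
fused_tactile = gated_fusion(tactile, cross_attention(tactile, visual))
```

Because attention weights are computed per query token over the whole context sequence, the visual and tactile streams do not need the same number of tokens; each fused sequence keeps the length of its own modality.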