Semantic Alignment on Action for Image Captioning

Authors: Da Huo, Marc A. Kastner, Takatsugu Hirayama, Takahiro Komamizu, Yasutomo Kawanishi, Ichiro Ide

Abstract:

Image captioning is a popular task in vision and language processing that aims to generate textual descriptions for images. Earlier approaches simply took the image and text as input and used self-attention to capture global dependencies. More recent research additionally uses objects detected in the input image, so-called object tags, as anchor points to ease the alignment between image and text through the attention mechanism. However, these methods consider only object information and neglect the actions and object interactions that also appear in the image, so actions are often not captured properly in the generated captions. To address this underrepresented dimension of semantic alignment, we take actions into account at the semantic level. Specifically, our work focuses on human actions and interactions, which ensures that the more salient parts of the image are captioned. We introduce a new type of tag, called an action tag, to anchor the action information. First, we provide a method for obtaining action tags using an action detection model that predicts the actions in the image. Next, we incorporate these action tags into the captioning model. Experimental results indicate that the proposed action tags help the model learn action semantics and capture salient actions, leading to perceived improvements in common performance metrics. Experiments on the MS-COCO Karpathy test split show that the proposed model, using action tags as anchors, achieves good scores on the BLEU-4 and CIDEr metrics. Furthermore, the number of action tags (no more than 5) is smaller than the number of object tags (commonly more than 20), so there is potential to reduce FLOPs by shortening the total sequence length. This indicates potential for efficient reasoning, and the approach may be applied to daily activity scenes in the future.
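
The remark about sequence length and FLOPs can be illustrated with a rough back-of-the-envelope estimate. The sketch below is not the paper's model or measurement. It assumes that tags are concatenated with caption tokens and image region features into one sequence, that action tags take the place of object tags, and that one self-attention layer costs roughly 4·n·d² multiply-accumulates for the projections plus 2·n²·d for the attention itself. The caption length, region count, and hidden size are hypothetical values chosen for illustration; only the tag counts (5 and 20) come from the abstract.

```python
def attention_cost(seq_len: int, hidden: int) -> int:
    """Approximate multiply-accumulate count for one self-attention layer.

    Counts the Q/K/V/output projections (4 * n * d^2) and the two
    attention matmuls, scores and weighted sum (2 * n^2 * d).
    Softmax, biases, and normalization are ignored.
    """
    projections = 4 * seq_len * hidden * hidden
    attention = 2 * seq_len * seq_len * hidden
    return projections + attention


def sequence_length(caption_tokens: int, tags: int, regions: int) -> int:
    """Total input length when caption tokens, tags, and regions are concatenated."""
    return caption_tokens + tags + regions


# Hypothetical sizes, chosen only for illustration (not from the paper).
CAPTION_TOKENS = 20
REGIONS = 50
HIDDEN = 768

# Tag counts quoted in the abstract: more than 20 object tags, at most 5 action tags.
with_object_tags = sequence_length(CAPTION_TOKENS, tags=20, regions=REGIONS)
with_action_tags = sequence_length(CAPTION_TOKENS, tags=5, regions=REGIONS)

cost_object = attention_cost(with_object_tags, HIDDEN)
cost_action = attention_cost(with_action_tags, HIDDEN)
ratio = cost_action / cost_object

print(f"sequence length: {with_object_tags} -> {with_action_tags}")
print(f"relative per-layer attention cost: {ratio:.2f}")
```

Under these assumed sizes the saving is modest, because the tags are a small share of the total sequence. The actual reduction depends on the real sequence composition and architecture.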

Type: Journal paper at IEEE Access, vol. 13, pp. 199615-199629

Publication date: October 2025

DOI: 10.1109/ACCESS.2025.3631093

Attached Files

preprint

If you have questions or ideas about this research, feel free to leave a comment below or send me an email. I will reply quickly.