IntentVC 2025: The ACM Multimedia Grand Challenge on Intention-Oriented Controllable Video Captioning


Authors: Takahiro Komamizu, Marc A. Kastner, Yasutomo Kawanishi, Trung Thanh Nguyen, Junan Chen

Abstract:

The IntentVC Challenge, held in conjunction with ACM Multimedia 2025, introduces a novel benchmark for intention-oriented controllable video captioning. Unlike conventional captioning methods that generate generic, scene-level summaries, IntentVC focuses on intention-specific generation. Participants are required to produce captions explicitly conditioned on user-defined intentions, such as emphasizing a specific object tracked within a video. To support this task, the challenge provides an extended version of the LaSOT dataset annotated with intention-focused captions across 70 object categories. A standardized evaluation protocol and public leaderboard enable fair and reproducible comparison among submitted methods. By advancing research in personalized and adaptive video understanding, IntentVC offers a platform for exploring controllable vision-language modeling with practical relevance for accessibility, retrieval, and human-AI interaction. In total, 23 teams comprising 58 active participants took part, submitting 1,443 entries. More information and resources are available at https://sites.google.com/view/intentvc/.

Type: 33rd ACM International Conference on Multimedia (MM ’25)

Publication date: To be published in Oct 2025

DOI: 10.1145/3746027.3762057


If you have questions or ideas about this research, feel free to leave a comment below or send me an email. I will reply quickly.