Abstract:
The IntentVC Challenge, held in conjunction with ACM Multimedia 2025, introduces a novel benchmark for intention-oriented controllable video captioning. Unlike conventional captioning methods that generate generic, scene-level summaries, IntentVC focuses on intention-specific generation: participants must produce captions explicitly conditioned on user-defined intentions, such as emphasizing a specific object tracked within a video. To support this task, the challenge provides an extended version of the LaSOT dataset annotated with intention-focused captions across 70 object categories. A standardized evaluation protocol and public leaderboard enable fair and reproducible comparison among submitted methods. By advancing research in personalized and adaptive video understanding, IntentVC offers a platform for exploring controllable vision-language modeling with practical relevance for accessibility, retrieval, and human-AI interaction. In total, 23 teams comprising 58 active participants took part, submitting 1,443 entries. More information and resources are available at https://sites.google.com/view/intentvc/.
Type: 33rd ACM International Conference on Multimedia (MM ’25)
Publication date: To appear, October 2025