Visual narrative generation transforms textual narratives into sequences of images illustrating the content of the text. However, generating visual narratives that are faithful to the input text and self-consistent across generated images remains an open challenge, due to the lack of knowledge constraints used for planning the stories. In this work, we propose a new benchmark, VinaBench, to address this challenge. Our benchmark annotates the underlying commonsense and discourse constraints in visual narrative samples, offering systematic scaffolds for learning the implicit strategies of visual storytelling. Based on the incorporated narrative constraints, we further propose novel metrics to closely evaluate the consistency of generated narrative images and the alignment of generations with the input textual narrative. Our results across three generative vision models demonstrate that learning with VinaBench's knowledge constraints effectively improves the faithfulness and cohesion of generated visual narratives.
VinaBench augments existing visual-textual narrative pairs with discourse and commonsense knowledge constraints, to offer scaffolds for learning consistent and faithful visual narrative generation and its evaluation.
We prompt a hybrid pipeline of VLMs and LLMs to annotate the VinaBench knowledge constraints. Our expert study verifies that the annotations are reliable, with high acceptance rates for all types of constraint labels and a fairly low percentage of disagreement between experts for each.
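To make the annotation pass concrete, here is a minimal sketch of how such a two-stage VLM-then-LLM annotation might look. The `query_vlm` and `query_llm` helpers and the prompt wording are hypothetical placeholders, not the exact pipeline used in the paper.

```python
# Hypothetical sketch of the constraint-annotation pass; helper functions
# and prompts are illustrative placeholders, not the paper's exact pipeline.

def query_vlm(image_path: str, prompt: str) -> str:
    """Placeholder for a vision-language model call (e.g., an image describer)."""
    raise NotImplementedError

def query_llm(prompt: str) -> str:
    """Placeholder for a text-only LLM call used to distill constraints."""
    raise NotImplementedError

def annotate_constraints(image_path: str, narrative_text: str) -> dict:
    # 1) Ground the image: ask a VLM to describe characters, objects, and scene.
    caption = query_vlm(
        image_path,
        "Describe the characters, objects, and setting shown in this image.",
    )

    # 2) Distill commonsense constraints (e.g., stable character attributes)
    #    and discourse constraints (links between the image and the text)
    #    with a text-only LLM over the caption and narrative.
    prompt = (
        f"Narrative text: {narrative_text}\n"
        f"Image description: {caption}\n"
        "List (a) commonsense constraints that should stay consistent across "
        "the story's images, and (b) discourse constraints linking this image "
        "to the narrative text."
    )
    return {"raw_constraints": query_llm(prompt)}
```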
The visual-textual narrative pairs in VinaBench are sampled from three diverse visual storytelling datasets, including Visual Writing Prompts (VWP), Storyboard20K and StorySalon.
Prior full-reference metrics, e.g., CLIP image embedding similarity (CLIP-I), directly match model generations to gold reference images, which may skew the evaluation toward irrelevant details of the specific gold reference. Moreover, prior metrics include no measure of visual narrative consistency, and may therefore overlook inconsistencies in model generations.
We propose a series of reference-free VQAScore-based metrics based on the annotated commonsense and discourse constraints in VinaBench, to better assess visual-textual narrative alignment and visual narrative consistency.
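As an illustration, the sketch below shows how a reference-free, constraint-based score could be computed: alignment averages a VQAScore-style yes-probability of each image satisfying its annotated constraints, and consistency checks story-level constraints against every image. The `vqa_yes_probability` helper and the question templates are hypothetical, and the actual metric formulations in VinaBench may differ.

```python
# Hypothetical sketch of reference-free, constraint-based scores.
# `vqa_yes_probability` stands in for a VQAScore-style call returning
# P("Yes") to a yes/no question about an image; it is a placeholder.

from statistics import mean

def vqa_yes_probability(image_path: str, question: str) -> float:
    """Placeholder for a VQAScore-style model query."""
    raise NotImplementedError

def alignment_score(image_paths: list[str],
                    constraints_per_image: list[list[str]]) -> float:
    """Average P(yes) that each generated image satisfies its own constraints."""
    scores = []
    for image, constraints in zip(image_paths, constraints_per_image):
        for c in constraints:
            scores.append(vqa_yes_probability(image, f"Does this image show: {c}?"))
    return mean(scores)

def consistency_score(image_paths: list[str],
                      global_constraints: list[str]) -> float:
    """Average P(yes) that story-level constraints (e.g., a character's
    appearance) hold in every image of the narrative."""
    scores = [
        vqa_yes_probability(image, f"Is this image consistent with: {c}?")
        for image in image_paths
        for c in global_constraints
    ]
    return mean(scores)
```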
We test three baseline VLMs that are optimized for visual narrative generation, including StoryGen, ARLDM, and MM-Interleaved, and evaluate the model generations with and without the augmentation of VinaBench knowledge constraints.
Takeaway I: Learning with VinaBench knowledge constraints significantly improves visual narrative consistency and alignment to input textual narrative.
Takeaway II: Visual narratives generated by VLMs still fall clearly behind the gold references, indicating substantial room for improvement.
Our human evaluation rates each visual narrative on a Likert scale from 1 to 5 (higher is better) across five aspects. We also study how well different automatic metrics correlate with the human evaluation, including the averages of our alignment and consistency metrics, denoted as Alignment and Consistency, compared with CLIP embedding similarity to the gold reference (CLIP-I) and to the input text (CLIP-T).
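A minimal sketch of such a correlation study, using Spearman rank correlation over per-narrative scores; the example values are placeholders and the paper's exact correlation statistic may differ.

```python
# Minimal sketch of correlating an automatic metric with human Likert ratings.
# The score lists are placeholder values; the actual statistic used in the
# paper may differ.

from scipy.stats import spearmanr

# Per-narrative automatic metric scores (e.g., our Alignment metric) and the
# corresponding mean human ratings on the 1-5 Likert scale.
metric_scores = [0.62, 0.71, 0.55, 0.80, 0.66]   # placeholder values
human_ratings = [3.2, 3.8, 2.9, 4.1, 3.5]        # placeholder values

rho, p_value = spearmanr(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```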
Takeaway III: Human evaluation supports the results of our automatic evaluation.
Takeaway IV: Compared to the CLIP-similarity metrics (CLIP-I and CLIP-T), our proposed alignment and consistency metrics correlate better with human evaluation.
We analyze how the faithfulness of knowledge constraints affects the faithfulness of the output visual narrative (to the input textual narrative).
Takeaway V: The faithfulness of the knowledge constraints correlates positively with the faithfulness of the output visual narrative to the input textual narrative, which highlights the importance of planning intermediate constraints for faithful visual narrative generation.
We study two groups of visual narratives generated by MM-Interleaved and ARLDM, each with (w/) and without (w/o) VinaBench constraints, compared to the gold reference.
Takeaway VI: Even with the augmentation of knowledge constraints, the visual narratives generated by VLMs still exhibit noticeable inconsistency flaws.
@inproceedings{gao2025vinabench,
title={VinaBench: Benchmark for Faithful and Consistent Visual Narratives},
author={Gao, Silin and Mathew, Sheryl and Mi, Li and Mamooler, Sepideh and Zhao, Mengjie and Wakaki, Hiromi and Mitsufuji, Yuki and Montariol, Syrielle and Bosselut, Antoine},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2025}
}
This website is adapted from VLM2-Bench, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Our VinaBench is available under the CC-BY 4.0 license for academic use with proper attribution. The images and annotations in this benchmark are intended solely for research purposes. Users are responsible for ensuring that their use of the data complies with applicable intellectual property laws and ethical guidelines. We encourage users to verify the sources and ensure compliance with any terms of service or licensing agreements.