Visual narrative generation transforms textual narratives into sequences of images illustrating the content of the text. However, generating visual narratives that are faithful to the input text and self-consistent across generated images remains an open challenge, due to the lack of knowledge constraints used for planning the stories. In this work, we propose a new benchmark, VinaBench, to address this challenge. Our benchmark annotates the underlying commonsense and discourse constraints in visual narrative samples, offering systematic scaffolds for learning the implicit strategies of visual storytelling. Based on the incorporated narrative constraints, we further propose novel metrics to closely evaluate the consistency of generated narrative images and the alignment of generations with the input textual narrative. Our results across three generative vision models demonstrate that learning with VinaBench's knowledge constraints effectively improves the faithfulness and cohesion of generated visual narratives.
VinaBench augments existing visual-textual narrative pairs with discourse and commonsense knowledge constraints, to offer scaffolds for learning consistent and faithful visual narrative generation and its evaluation.
We prompt a hybrid pipeline of VLMs and LLMs to annotate the VinaBench knowledge constraints. Our expert study verifies that the annotations are reliable, with high acceptance rates for all types of constraint labels and a fairly low percentage of disagreement between experts for each.
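To make the annotation pass concrete, here is a minimal sketch of how such a two-stage VLM-then-LLM annotation might look. The `query_vlm` and `query_llm` helpers and the prompt wording are hypothetical placeholders, not the exact pipeline used in the paper.

```python
# Hypothetical sketch of the constraint-annotation pass; helper functions
# and prompts are illustrative placeholders, not the paper's exact pipeline.

def query_vlm(image_path: str, prompt: str) -> str:
    """Placeholder for a vision-language model call (e.g., an image describer)."""
    raise NotImplementedError

def query_llm(prompt: str) -> str:
    """Placeholder for a text-only LLM call used to distill constraints."""
    raise NotImplementedError

def annotate_constraints(image_path: str, narrative_text: str) -> dict:
    # 1) Ground the image: ask a VLM to describe characters, objects, and scene.
    caption = query_vlm(
        image_path,
        "Describe the characters, objects, and setting shown in this image.",
    )

    # 2) Distill commonsense constraints (e.g., stable character attributes)
    #    and discourse constraints (links between the image and the text)
    #    with a text-only LLM over the caption and narrative.
    prompt = (
        f"Narrative text: {narrative_text}\n"
        f"Image description: {caption}\n"
        "List (a) commonsense constraints that should stay consistent across "
        "the story's images, and (b) discourse constraints linking this image "
        "to the narrative text."
    )
    return {"raw_constraints": query_llm(prompt)}
```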
The visual-textual narrative pairs in VinaBench are sampled from three diverse visual storytelling datasets, including Visual Writing Prompts (VWP), Storyboard20K and StorySalon.
Prior full-reference metrics, e.g., CLIP image embedding similarity (CLIP-I), directly match model generations to gold reference images, which may skew the evaluation toward irrelevant details of the specific gold reference. Moreover, prior metrics include no measure of visual narrative consistency, and may therefore overlook inconsistencies in model generations.
We propose a series of reference-free VQAScore-based metrics based on the annotated commonsense and discourse constraints in VinaBench, to better assess visual-textual narrative alignment and visual narrative consistency.
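As an illustration, the sketch below shows how a reference-free, constraint-based score could be computed: alignment averages a VQAScore-style yes-probability of each image satisfying its annotated constraints, and consistency checks story-level constraints against every image. The `vqa_yes_probability` helper and the question templates are hypothetical, and the actual metric formulations in VinaBench may differ.

```python
# Hypothetical sketch of reference-free, constraint-based scores.
# `vqa_yes_probability` stands in for a VQAScore-style call returning
# P("Yes") to a yes/no question about an image; it is a placeholder.

from statistics import mean

def vqa_yes_probability(image_path: str, question: str) -> float:
    """Placeholder for a VQAScore-style model query."""
    raise NotImplementedError

def alignment_score(image_paths: list[str],
                    constraints_per_image: list[list[str]]) -> float:
    """Average P(yes) that each generated image satisfies its own constraints."""
    scores = []
    for image, constraints in zip(image_paths, constraints_per_image):
        for c in constraints:
            scores.append(vqa_yes_probability(image, f"Does this image show: {c}?"))
    return mean(scores)

def consistency_score(image_paths: list[str],
                      global_constraints: list[str]) -> float:
    """Average P(yes) that story-level constraints (e.g., a character's
    appearance) hold in every image of the narrative."""
    scores = [
        vqa_yes_probability(image, f"Is this image consistent with: {c}?")
        for image in image_paths
        for c in global_constraints
    ]
    return mean(scores)
```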
We test three baseline VLMs that are optimized for visual narrative generation, including StoryGen, ARLDM, and MM-Interleaved, and evaluate the model generations with and without the augmentation of VinaBench knowledge constraints.
Takeaway I: Learning with VinaBench knowledge constraints significantly improves visual narrative consistency and alignment to input textual narrative.
Takeaway II: Visual narratives generated by VLMs still fall clearly behind the gold references, indicating substantial room for improvement.
Our human evaluation rates each visual narrative on a Likert scale from 1 to 5 (higher is better) across five aspects. We also study how well different automatic metrics correlate with the human evaluation, including the averages of our alignment and consistency metrics, denoted as Alignment and Consistency, compared with CLIP embedding similarity to the gold reference (CLIP-I) and to the input text (CLIP-T).
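A minimal sketch of such a correlation study, using Spearman rank correlation over per-narrative scores; the example values are placeholders and the paper's exact correlation statistic may differ.

```python
# Minimal sketch of correlating an automatic metric with human Likert ratings.
# The score lists are placeholder values; the actual statistic used in the
# paper may differ.

from scipy.stats import spearmanr

# Per-narrative automatic metric scores (e.g., our Alignment metric) and the
# corresponding mean human ratings on the 1-5 Likert scale.
metric_scores = [0.62, 0.71, 0.55, 0.80, 0.66]   # placeholder values
human_ratings = [3.2, 3.8, 2.9, 4.1, 3.5]        # placeholder values

rho, p_value = spearmanr(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```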
Takeaway III: Human evaluation supports the results of our automatic evaluation.
Takeaway IV: Compared to the CLIP-similarity metrics (CLIP-I and CLIP-T), our proposed alignment and consistency metrics correlate better with human evaluation.
We analyze how the faithfulness of knowledge constraints affects the faithfulness of the output visual narrative (to the input textual narrative).
Takeaway V: The faithfulness of the knowledge constraints correlates positively with the faithfulness of the output visual narrative to the input textual narrative, which highlights the importance of planning intermediate constraints for faithful visual narrative generation.
We study two groups of visual narratives generated by MM-Interleaved and ARLDM, each with (w/) and without (w/o) VinaBench constraints, compared to the gold reference.
Takeaway VI: Even with the augmentation of knowledge constraints, the visual narratives generated by VLMs still exhibit noticeable inconsistency flaws.
@inproceedings{gao2025vinabench,
title={VinaBench: Benchmark for Faithful and Consistent Visual Narratives},
author={Gao, Silin and Mathew, Sheryl and Mi, Li and Mamooler, Sepideh and Zhao, Mengjie and Wakaki, Hiromi and Mitsufuji, Yuki and Montariol, Syrielle and Bosselut, Antoine},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2025}
}
This website is adapted from VLM2-Bench, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Our VinaBench is available under the CC-BY 4.0 license for academic use with proper attribution. The images and annotations in this benchmark are intended solely for research purposes. Users are responsible for ensuring that their use of the data complies with applicable intellectual property laws and ethical guidelines. We encourage users to verify the sources and ensure compliance with any terms of service or licensing agreements.