Benchmarking Multimodal Models for Ukrainian Language Understanding Across Academic and Cultural Domains

Yurii Paniv, Artur Kiulian, Dmytro Chaplynskyi, M. Khandoga, Anton Polishko, Tetiana Bas, Guillermo Gabrielli

Published 2024 in Proceedings of the Fourth Ukrainian Natural Language Processing Workshop (UNLP 2025)

ABSTRACT

While the evaluation of multimodal English-centric models is an active area of research with numerous benchmarks, there is a profound lack of benchmarks and evaluation suites for low- and mid-resource languages. We introduce ZNO-Vision, a comprehensive multimodal Ukrainian-centric benchmark derived from the standardized university entrance examination (ZNO). The benchmark consists of over 4,300 expert-crafted questions spanning 12 academic disciplines, including mathematics, physics, chemistry, and humanities. We evaluated the performance of both open-source models and API providers, finding that only a handful of models performed above baseline. Alongside the new benchmark, we performed the first evaluation study of multimodal text generation for the Ukrainian language: we measured caption generation quality on the Multi30K-UK dataset, translated the VQA benchmark into Ukrainian, and measured performance degradation relative to the original English versions. Lastly, we tested a few models from a cultural perspective on their knowledge of national cuisine. We believe our work will advance multimodal generation capabilities for the Ukrainian language and that our approach could be useful for other low-resource languages.
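The abstract refers to two quantities: accuracy relative to a baseline, and degradation of a Ukrainian score relative to its English counterpart. The following is a minimal sketch of how such numbers could be computed, not the authors' evaluation code. It assumes that the baseline is uniform random guessing over the answer options and that degradation is reported as a fraction of the English score; the abstract specifies neither. All function names are hypothetical.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of questions where the predicted option equals the gold option."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def random_baseline(option_counts: Sequence[int]) -> float:
    """Expected accuracy of uniform random guessing (assumed baseline).

    option_counts[i] is the number of answer options for question i.
    """
    if not option_counts:
        raise ValueError("at least one question is required")
    return sum(1.0 / n for n in option_counts) / len(option_counts)


def relative_degradation(score_en: float, score_uk: float) -> float:
    """Drop from the English score to the Ukrainian score, as a fraction of the English score."""
    if score_en <= 0:
        raise ValueError("the English score must be positive")
    return (score_en - score_uk) / score_en
```

A model would count as above baseline in this sketch when `accuracy(...)` exceeds `random_baseline(...)` for the same question set. Any inputs passed to these functions are the reader's own; no values from the paper are reproduced here.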

PUBLICATION RECORD

Publication year: 2024
Venue: Proceedings of the Fourth Ukrainian Natural Language Processing Workshop (UNLP 2025)
Publication date: 2024-11-22
Fields of study: Physics, Chemistry, Computer Science, Linguistics, Education
Identifiers: DOI 10.48550/arXiv.2411.14647; arXiv 2411.14647

REFERENCES

MMTEB: Massive Multilingual Text Embedding Benchmark (2025)
From Bytes to Borsch: Fine-Tuning Gemma and Mistral for the Ukrainian Language Representation (2024; influential reference)
All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages (2024)
WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines (2024)
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (2024)
Setting up the Data Printer with Improved English to Ukrainian Machine Translation (2024)
PaliGemma: A versatile 3B VLM for transfer (2024)
M5 - A Diverse Benchmark to Assess the Performance of Large Multimodal Models Across Multilingual and Multicultural Vision-Language Tasks (2024)
Ukrainian Visual Word Sense Disambiguation Benchmark (2024)
The UNLP 2024 Shared Task on Fine-Tuning Large Language Models for Ukrainian (2024)
Benchmarking Vision Language Models for Cultural Understanding (2024)
MMMU: A Massive Multi-Discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI (2023)
Extension Multi30K: Multimodal Dataset for Integrated Vision and Language Research in Ukrainian (2023; influential reference)
A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge (2022)
BERTScore: Evaluating Text Generation with BERT (2019)
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2019)
A Call for Clarity in Reporting BLEU Scores (2018)
Multi30K: Multilingual English-German Image Descriptions (2016)
CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning (2016)
VQA: Visual Question Answering (2015; influential reference)
Visual7W: Grounded Question Answering in Images (2015)
From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions (2014)
Scalable training of L1-regularized log-linear models (2007)

CITED BY

What are Foundation Models Cooking in the Post-Soviet World? (2025)
Empowering Smaller Models: Tuning LLaMA and Gemma with Chain-of-Thought for Ukrainian Exam Tasks (2025)
CUS-QA: Local-Knowledge-Oriented Open-Ended Question Answering Dataset (2025)