Type and Complexity Signals in Multilingual Question Representations

Published 2025 in Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025)

ABSTRACT

This work investigates how a multilingual transformer model represents morphosyntactic properties of questions. We introduce the Question Type and Complexity (QTC) dataset, which contains sentences in seven languages annotated with question type and with complexity metrics including dependency length, tree depth, and lexical density. Our evaluation extends probing methods to regression labels with selectivity controls to quantify gains in generalizability. We compare layer-wise probes on frozen Glot500-m (Imani et al., 2023) representations against subword TF-IDF baselines and a fine-tuned model. Results show that statistical features classify questions effectively in languages with explicit marking, while neural probes better capture fine-grained structural complexity patterns. We use these results to evaluate when contextual representations outperform statistical baselines and whether parameter updates reduce the availability of pre-trained linguistic information.
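
For readers unfamiliar with the three complexity metrics named in the abstract, the following is a minimal sketch of how they could be computed from a single dependency parse. The definitions here are assumptions chosen for illustration (mean linear head-dependent distance, longest path to the root, share of content-word tokens); the abstract does not state the paper's exact definitions, which may differ. The function name `complexity_metrics`, the `CONTENT_UPOS` set, and the input layout are hypothetical.

```python
from typing import Dict, List

# Assumed set of UPOS tags treated as content words (an illustrative choice).
CONTENT_UPOS = {"NOUN", "PROPN", "VERB", "ADJ", "ADV"}


def complexity_metrics(heads: List[int], upos: List[str]) -> Dict[str, float]:
    """Compute simple complexity metrics for one parsed sentence.

    heads[i] is the 1-based index of the head of token i+1; 0 marks the root,
    following the CoNLL-U convention. upos[i] is the UPOS tag of token i+1.
    """
    n = len(heads)
    if n == 0 or n != len(upos):
        raise ValueError("heads and upos must be non-empty and equally long")

    # Dependency length: linear distance between each token and its head,
    # skipping the root, which has no head.
    lengths = [abs((i + 1) - h) for i, h in enumerate(heads) if h != 0]
    avg_dep_length = sum(lengths) / len(lengths) if lengths else 0.0

    def depth(token: int) -> int:
        # Number of arcs on the path from this token up to the root.
        steps = 0
        while heads[token - 1] != 0:
            token = heads[token - 1]
            steps += 1
            if steps > n:
                raise ValueError("cycle detected in head indices")
        return steps

    tree_depth = max(depth(t) for t in range(1, n + 1))

    # Lexical density: content-word tokens over all tokens. Punctuation is
    # kept in the denominator here, which is an assumption.
    lexical_density = sum(tag in CONTENT_UPOS for tag in upos) / n

    return {
        "avg_dep_length": avg_dep_length,
        "tree_depth": float(tree_depth),
        "lexical_density": lexical_density,
    }


# Hand-written parse of "What did the student read ?" (illustrative only).
example = complexity_metrics(
    heads=[5, 5, 4, 5, 0, 5],
    upos=["PRON", "AUX", "DET", "NOUN", "VERB", "PUNCT"],
)
print(example)
```

In this hand-annotated example the mean dependency length is 2.0, the tree depth is 2, and the lexical density is 2/6. Scores like these are the kind of continuous labels a regression probe would be trained to predict.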

PUBLICATION RECORD

Publication year: 2025
Venue: Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025)
Publication date: 2025-10-07
Fields of study: Linguistics, Computer Science
Identifiers: DOI 10.18653/v1/2025.mrl-main.28; arXiv 2510.06304

REFERENCES

Holmes: A Benchmark to Assess the Linguistic Competence of Language Models (2024)
Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages (2023; influential reference)
Sentence Complexity in Context (2021)
TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages (2020)
Profiling-UD: a Tool for Linguistic Profiling of Texts (2020; influential reference)
Universal Dependencies v2: An Evergrowing Multilingual Treebank Collection (2020)
Information-Theoretic Probing for Linguistic Structure (2020)
Probing Multilingual Sentence Representations With X-Probe (2019)
What do you learn from context? Probing for sentence structure in contextualized word representations (2019; influential reference)
BERT Rediscovers the Classical NLP Pipeline (2019)
What Does BERT Learn about the Structure of Language? (2019)
Designing and Interpreting Probes with Control Tasks (2019)
Unsupervised Cross-lingual Representation Learning at Scale (2019)
What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties (2018)
UDPipe 2.0 Prototype at CoNLL 2018 UD Shared Task (2018; influential reference)
Does String-Based Neural MT Learn Source Syntax? (2016)
XGBoost: A Scalable Tree Boosting System (2016)
Fine-grained Analysis of Sentence Embeddings Using Auxiliary Prediction Tasks (2016)
Large-scale evidence of dependency length minimization in 37 languages (2015)
What’s in an Embedding? Analyzing Word Embeddings through Multilingual Evaluation (2015)
Position of interrogative phrases in content questions (2013)
Efficiency and complexity in grammars (2004)
