Towards Realistic Error Models for Tabular Data

Philipp Jung, Sebastian Jäger, Nicholas Chandler, Felix Biessmann

Published 2025 in ACM Journal of Data and Information Quality

ABSTRACT

Errors in data are a key challenge in modern data management and processing systems. Monitoring and mitigating the risks associated with errors in data transformations and downstream applications, such as Machine Learning (ML) model training, requires a thorough understanding of how errors are generated and how they affect data pipelines. Unfortunately, scientific progress in the field faces two main challenges. First, research on data errors often does not adhere to the FAIR (Findable, Accessible, Interoperable, and Reusable) principles, which impedes reproducibility and comparisons. Second, existing data error models are oversimplified and fail to capture the complex statistical dependencies underlying the types and distributions of errors observed in real-world data. Building on prior work in the database management systems and statistics literature, we extend the theory on missing values to encompass a broader range of errors in tables and provide an overview of relevant error types. Combining error sampling mechanisms often observed in real data with a comprehensive categorization of errors, we introduce a latent factor model for tabular data errors that is simple to implement and can effectively model realistic error dependencies. Error sampling is decoupled from error types, which allows for simple extensions with more error types or sampling mechanisms. Using established benchmarks, we evaluate our model in two application scenarios: data cleaning and tabular ML tasks. In a comprehensive suite of experiments, we demonstrate the impact of realistic error models on data cleaning benchmarks. Our results also show that a simple generative error model captures a wide range of error mechanisms and offers a convenient formalization of data perturbations to improve the generalizability, robustness, and reproducibility of data cleaning research.
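The decoupling of error sampling from error types described in the abstract can be illustrated with a minimal sketch. This is not the paper's latent factor model or its implementation; every name below (`inject_errors`, `sample_completely_at_random`, `sample_depending_on`, `missing_value`, `typo`) is hypothetical and chosen for illustration only. The two sampling mechanisms are loosely modeled on the missing-completely-at-random and missing-at-random ideas from the missing-values literature, and the second one is simplified to a deterministic rule (corrupt the rows with the largest values in another column).

```python
import random
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]
# A mechanism decides WHICH rows of a column are corrupted (hypothetical interface).
Mechanism = Callable[[List[Row], str, float, random.Random], List[int]]
# An error type decides HOW a single value is corrupted (hypothetical interface).
ErrorType = Callable[[Any, random.Random], Any]


def sample_completely_at_random(rows: List[Row], column: str, rate: float,
                                rng: random.Random) -> List[int]:
    """Pick row indices uniformly, independent of any data values."""
    n = round(rate * len(rows))
    return sorted(rng.sample(range(len(rows)), n))


def sample_depending_on(other_column: str) -> Mechanism:
    """Build a mechanism that targets the rows with the largest values in another column."""
    def mechanism(rows: List[Row], column: str, rate: float,
                  rng: random.Random) -> List[int]:
        n = round(rate * len(rows))
        order = sorted(range(len(rows)),
                       key=lambda i: rows[i][other_column], reverse=True)
        return sorted(order[:n])
    return mechanism


def missing_value(value: Any, rng: random.Random) -> Any:
    """Error type: replace the value with None."""
    return None


def typo(value: Any, rng: random.Random) -> Any:
    """Error type: swap two adjacent characters of the value's string form."""
    text = str(value)
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def inject_errors(rows: List[Row], column: str, mechanism: Mechanism,
                  error_type: ErrorType, rate: float,
                  seed: int = 0) -> Tuple[List[Row], List[int]]:
    """Apply any error type to the rows chosen by any mechanism; the input is not modified."""
    rng = random.Random(seed)
    corrupted = [dict(row) for row in rows]
    indices = mechanism(rows, column, rate, rng)
    for i in indices:
        corrupted[i][column] = error_type(corrupted[i][column], rng)
    return corrupted, indices


# Usage: corrupt 30% of "city", targeting the rows with the highest "age".
rows = [{"age": age, "city": "Berlin"} for age in range(10)]
corrupted, indices = inject_errors(
    rows, "city", sample_depending_on("age"), missing_value, rate=0.3)
```

Because `inject_errors` only depends on the two callable interfaces, a new error type or a new sampling mechanism can be added in this sketch without touching the other, which is the kind of extensibility the abstract attributes to the decoupled design.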

PUBLICATION RECORD

  • Publication year

    2025

  • Venue

    ACM Journal of Data and Information Quality

  • Publication date

    2025-11-08

  • Fields of study

    Computer Science

  • Source metadata

    Semantic Scholar
