Biologically-Informed Hybrid Membership Inference Attacks on Generative Genomic Models

A. Belfiore, Jonathan Passerat-Palmbach, Dmitrii Usynin

Published 2025 in arXiv.org

ABSTRACT

The increased availability of genetic data has transformed genomics research, but it has also raised privacy concerns about how such sensitive data is handled. This work explores the use of language models (LMs) to generate synthetic genetic mutation profiles, using differential privacy (DP) to protect the sensitive genetic data. We empirically evaluate the privacy guarantees of our DP models by introducing a novel Biologically-Informed Hybrid Membership Inference Attack (biHMIA), which combines a traditional black-box MIA with contextual genomics metrics for greater attack power. Our experiments show that both small and large GPT-like transformer models are viable synthetic variant generators for small-scale genomics, and that our hybrid attack achieves, on average, higher adversarial success than traditional metric-based MIAs.
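
The idea of combining a black-box, metric-based membership signal with a genomics-aware signal can be illustrated with a minimal sketch. It is not the authors' biHMIA: the abstract does not specify the metrics, so the nearest-profile Hamming score, the rarity weighting by allele frequency, the mixing weight `alpha`, and all function names below are assumptions chosen for illustration.

```python
from typing import Sequence

Profile = Sequence[int]  # 1 = variant present at a site, 0 = absent


def closest_match_score(target: Profile, synthetic: Sequence[Profile]) -> float:
    """Black-box signal: 1 minus the normalised Hamming distance
    from the target to its nearest synthetic profile."""
    best = min(sum(a != b for a, b in zip(target, s)) for s in synthetic)
    return 1.0 - best / len(target)


def rare_variant_score(target: Profile, synthetic: Sequence[Profile],
                       allele_freq: Sequence[float]) -> float:
    """Hypothetical genomics signal: rarity-weighted share of the target's
    variants that also appear in at least one synthetic profile."""
    weights = [1.0 - f for f in allele_freq]  # rarer variant, larger weight
    carried = [i for i, v in enumerate(target) if v == 1]
    total = sum(weights[i] for i in carried)
    if total == 0:
        return 0.0
    hit = sum(weights[i] for i in carried if any(s[i] == 1 for s in synthetic))
    return hit / total


def hybrid_score(target: Profile, synthetic: Sequence[Profile],
                 allele_freq: Sequence[float], alpha: float = 0.5) -> float:
    """Assumed combination rule: convex mix of the two signals.
    A higher score suggests the target was in the training data."""
    return (alpha * closest_match_score(target, synthetic)
            + (1.0 - alpha) * rare_variant_score(target, synthetic, allele_freq))


# Toy data, invented for illustration only.
synthetic = [[1, 0, 1, 0], [0, 0, 0, 1]]
allele_freq = [0.5, 0.1, 0.01, 0.3]
print(hybrid_score([1, 0, 1, 0], synthetic, allele_freq))
print(hybrid_score([0, 1, 0, 0], synthetic, allele_freq))
```

In this toy setup, a profile that is reproduced in the synthetic set, including its rare variant, scores higher than one whose only variant never appears in the synthetic data. A real attack would also need a decision threshold calibrated on held-out data.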

PUBLICATION RECORD

Publication year: 2025
Venue: arXiv.org
Publication date: 2025-11-10
Fields of study: Biology, Computer Science
Identifiers: DOI 10.48550/arXiv.2511.07503; arXiv 2511.07503
Source metadata: Semantic Scholar
