Self-Boosting Large Language Models with Synthetic Preference Data

Qingxiu Dong, Li Dong, Xingxing Zhang, Zhifang Sui, Furu Wei

Published 2024 in International Conference on Learning Representations

ABSTRACT

Through alignment with human preferences, Large Language Models (LLMs) have advanced significantly in generating honest, harmless, and helpful responses. However, collecting high-quality preference data is a resource-intensive and creativity-demanding process, especially for the continual improvement of LLMs. We introduce SynPO, a self-boosting paradigm that leverages synthetic preference data for model alignment. SynPO employs an iterative mechanism wherein a self-prompt generator creates diverse prompts, and a response improver refines model responses progressively. This approach trains LLMs to autonomously learn the generative rewards for their own outputs and eliminates the need for large-scale annotation of prompts and human preferences. After four SynPO iterations, Llama3-8B and Mistral-7B show significant enhancements in instruction-following abilities, achieving over 22.1% win rate improvements on AlpacaEval 2.0 and ArenaHard. Simultaneously, SynPO improves the general performance of LLMs on various tasks, validated by a 3.2 to 5.0 average score increase on the well-recognized Open LLM leaderboard.
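
The loop described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the callables `generate_prompts`, `respond`, `improve`, and `train` are hypothetical stand-ins for the self-prompt generator, the current model, the response improver, and a preference-optimization step. Treating the improved response as "chosen" and the original draft as "rejected" is an assumption made here for illustration; the abstract does not state how the pairs are constructed.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # assumption: the improved response is preferred
    rejected: str  # assumption: the original draft is dispreferred


def synthetic_preference_round(
    generate_prompts: Callable[[int], List[str]],
    respond: Callable[[str], str],
    improve: Callable[[str, str], str],
    num_prompts: int,
) -> List[PreferencePair]:
    """Build one round of synthetic preference pairs (hypothetical helper)."""
    pairs = []
    for prompt in generate_prompts(num_prompts):
        draft = respond(prompt)
        refined = improve(prompt, draft)
        # Skip pairs where the improver changed nothing: no preference signal.
        if refined != draft:
            pairs.append(PreferencePair(prompt, chosen=refined, rejected=draft))
    return pairs


def self_boost(
    generate_prompts: Callable[[int], List[str]],
    respond: Callable[[str], str],
    improve: Callable[[str, str], str],
    train: Callable[[Callable[[str], str], List[PreferencePair]], Callable[[str], str]],
    num_prompts: int,
    iterations: int = 4,  # the abstract reports results after four iterations
) -> Tuple[Callable[[str], str], List[List[PreferencePair]]]:
    """Repeat: synthesize pairs with the current model, then update the model."""
    history = []
    for _ in range(iterations):
        pairs = synthetic_preference_round(
            generate_prompts, respond, improve, num_prompts
        )
        # `train` is a placeholder for any preference-optimization update;
        # it returns the responder to use in the next iteration.
        respond = train(respond, pairs)
        history.append(pairs)
    return respond, history
```

With toy stand-ins (for example, a `respond` that uppercases the prompt and an `improve` that appends a character), `self_boost` returns the final responder together with the pairs collected in each round, which makes the data flow easy to inspect before plugging in real models.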

PUBLICATION RECORD

Publication year: 2024
Venue: International Conference on Learning Representations
Publication date: 2024-10-09
Fields of study: Computer Science
Identifiers: DOI 10.48550/arXiv.2410.06961; arXiv 2410.06961
Source metadata: Semantic Scholar

REFERENCES

Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge
2024, cited by this paper
Arena Learning: Build Data Flywheel for LLMs Post-training via Simulated Chatbot Arena
2024, cited by this paper
Direct Preference Knowledge Distillation for Large Language Models
2024, cited by this paper
Towards Comprehensive Preference Data Collection for Reward Modeling
2024, cited by this paper
From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline
2024, influential reference
Spread Preference Annotation: Direct Preference Judgment for Efficient LLM Alignment
2024, cited by this paper
KTO: Model Alignment as Prospect Theoretic Optimization
2024, cited by this paper
ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search
2024, cited by this paper
Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment
2024, cited by this paper
SimPO: Simple Preference Optimization with a Reference-Free Reward
2024, influential reference
Self-Play Preference Optimization for Language Model Alignment
2024, influential reference
Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing
2024, cited by this paper
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
2024, influential reference
Self-Rewarding Language Models
2024, cited by this paper
Self-playing Adversarial Language Game Enhances LLM Reasoning
2024, cited by this paper
Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators
2024, influential reference
Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences
2024, cited by this paper
Scaling Data Diversity for Fine-Tuning Language Models in Human Alignment
2024, cited by this paper
HRLAIF: Improvements in Helpfulness and Harmlessness in Open-domain Reinforcement Learning From AI Feedback
2024, cited by this paper
Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models
2024, cited by this paper
Better Alignment with Instruction Back-and-Forth Translation
2024, cited by this paper
The Llama 3 Herd of Models
2024, cited by this paper
Won’t Get Fooled Again: Answering Questions with False Premises
2023, cited by this paper
GPT-4 Technical Report
2023, cited by this paper
Self-Refine: Iterative Refinement with Self-Feedback
2023, cited by this paper
WizardLM: Empowering Large Pre-Trained Language Models to Follow Complex Instructions
2023, influential reference
Enhancing Chat Language Models by Scaling High-quality Instructional Conversations
2023, cited by this paper
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
2023, cited by this paper
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
2023, cited by this paper
LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion
2023, influential reference
Judging LLM-as-a-judge with MT-Bench and Chatbot Arena
2023, influential reference
WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct
2023, cited by this paper
Reinforced Self-Training (ReST) for Language Modeling
2023, cited by this paper
Textbooks Are All You Need II: phi-1.5 technical report
2023, cited by this paper
SELF: Self-Evolution with Language Feedback
2023, influential reference
Safer-Instruct: Aligning Language Models with Automated Preference Data
2023, cited by this paper
Describing Differences between Text Distributions with Natural Language
2022, cited by this paper
Training language models to follow instructions with human feedback
2022, cited by this paper
Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor
2022, cited by this paper
Constitutional AI: Harmlessness from AI Feedback
2022, cited by this paper
Explaining Patterns in Data with Language Models via Interpretable Autoprompting
2022, cited by this paper
Self-Instruct: Aligning Language Models with Self-Generated Instructions
2022, influential reference
PROST: Physical Reasoning about Objects through Space and Time
2021, cited by this paper
Understanding Dataset Difficulty with V-Usable Information
2021, cited by this paper
WebGPT: Browser-assisted question-answering with human feedback
2021, cited by this paper
A General Language Assistant as a Laboratory for Alignment
2021, cited by this paper
Training Verifiers to Solve Math Word Problems
2021, cited by this paper
TruthfulQA: Measuring How Models Mimic Human Falsehoods
2021, cited by this paper
Measuring Massive Multitask Language Understanding
2020, cited by this paper
MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms
2019, cited by this paper
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
2019, cited by this paper
An Adversarial Winograd Schema Challenge at Scale
2019, cited by this paper
HellaSwag: Can a Machine Really Finish Your Sentence?
2019, cited by this paper
Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering
2018, cited by this paper
XNLI: Evaluating Cross-lingual Sentence Representations
2018, cited by this paper
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
2018, cited by this paper

CITED BY

Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models
2026, cites this paper
Self-Play Only Evolves When Self-Synthetic Pipeline Ensures Learnable Information Gain
2026, cites this paper
The Subjectivity of Respect in Police Traffic Stops: Modeling Community Perspectives in Body-Worn Camera Footage
2026, cites this paper
Reflect: Transparent Principle-Guided Reasoning for Constitutional Alignment at Scale
2026, cites this paper
Finding the Sweet Spot: Preference Data Construction for Scaling Preference Optimization
2025, cites this paper
From Generic Empathy to Personalized Emotional Support: A Self-Evolution Framework for User Preference Alignment
2025, cites this paper
Amulet: Putting Complex Multi-Turn Conversations on the Stand with LLM Juries
2025, cites this paper
Data Swarms: Optimizable Generation of Synthetic Evaluation Data
2025, cites this paper
RACE-Align: Retrieval-Augmented and Chain-of-Thought Enhanced Preference Alignment for Large Language Models
2025, cites this paper
From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment
2025, cites this paper
A Survey on Efficient Large Language Model Training: From Data-centric Perspectives
2025, cites this paper
Self-Improving Model Steering
2025, cites this paper
SGPO: Self-Generated Preference Optimization based on Self-Improver
2025, influential citation
Icon2: Aligning Large Language Models Using Self-Synthetic Preference Data via Inherent Regulation
2025, influential citation
SeaPO: Strategic Error Amplification for Robust Preference Optimization of Large Language Models
2025, cites this paper
Alignment through Meta-Weighted Online Sampling: Bridging the Gap between Data Generation and Preference Optimization
2025, influential citation
Revisiting Self-Play Preference Optimization: On the Role of Prompt Difficulty
2025, cites this paper
What Matters in Data for DPO?
2025, cites this paper
The Best Instruction-Tuning Data are Those That Fit
2025, cites this paper
Evolving LLMs' Self-Refinement Capability via Synergistic Training-Inference Optimization
2025, influential citation
RefineCoder: Iterative Improving of Large Language Models via Adaptive Critique Refinement for Code Generation
2025, cites this paper
Self-Guided Defense: Adaptive Safety Alignment for Reasoning Models via Synthesized Guidelines
2025, cites this paper
Learning from Self Critique and Refinement for Faithful LLM Summarization
2025, cites this paper
MapTrace: Scalable Data Generation for Route Tracing on Maps
2025, cites this paper
Fine-Tuning LLMs with Fine-Grained Human Feedback on Text Spans
2025, cites this paper
LLM-Driven Preference Data Synthesis for Proactive Prediction of the Next User Utterance in Human-Machine Dialogue
2025, cites this paper
GRAPH-GRPO-LEX: Contract Graph Modeling and Reinforcement Learning with Group Relative Policy Optimization
2025, cites this paper
AutoMalDesc: Large-Scale Script Analysis for Cyber Threat Research
2025, cites this paper
ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models
2024, cites this paper
Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration
2024, influential citation
Boosting Reward Model with Preference-Conditional Multi-Aspect Synthetic Data Generation
2024, cites this paper
Evolving LLMs’ Self-Refinement Capability via Iterative Preference Optimization
year unknown, cites this paper
RMBoost: Reward Model Training with Preference-Conditional Multi-Aspect Synthetic Data Generation
year unknown, cites this paper