Concrete Problems in AI Safety

Dario Amodei, Chris Olah, J. Steinhardt, P. Christiano, John Schulman, Dandelion Mané

Published 2016 in arXiv.org

ABSTRACT

Rapid progress in machine learning and artificial intelligence (AI) has brought increasing attention to the potential impacts of AI technologies on society. In this paper we discuss one such potential impact: the problem of accidents in machine learning systems, defined as unintended and harmful behavior that may emerge from poor design of real-world AI systems. We present a list of five practical research problems related to accident risk, categorized according to whether the problem originates from having the wrong objective function ("avoiding side effects" and "avoiding reward hacking"), an objective function that is too expensive to evaluate frequently ("scalable supervision"), or undesirable behavior during the learning process ("safe exploration" and "distributional shift"). We review previous work in these areas as well as suggesting research directions with a focus on relevance to cutting-edge AI systems. Finally, we consider the high-level question of how to think most productively about the safety of forward-looking applications of AI.

PUBLICATION RECORD

Publication year: 2016
Venue: arXiv.org
Publication date: 2016-06-21
Fields of study: Computer Science, Engineering
Identifiers: arXiv 1606.06565
Source metadata: Semantic Scholar

REFERENCES

Multi-objective Optimization (2018)
Generative Adversarial Nets (2018)
The future of employment: How susceptible are jobs to computerisation? (2017)
Superintelligence: Paths, Dangers, Strategies (2017)
Estimating individual treatment effect: generalization bounds and algorithms (2016)
Safe Exploration in Finite Markov Decision Processes with Gaussian Processes (2016)
Cooperative Inverse Reinforcement Learning (2016)
Quantilizers: A Safer Alternative to Maximizers for Limited Optimization (2016)
Avoiding Wireheading with Value Reinforcement Learning (2016)
Deep Exploration via Bootstrapped DQN (2016)
Uniform Coherence (2016)
Synthesizing the preferred inputs for neurons in neural networks via deep generator networks (2016)
Hiring by Algorithm: Predicting and Preventing Disparate Impact (2016)
Safely Interruptible Agents (2016)
The Risk of Automation for Jobs in OECD Countries: A Comparative Analysis (2016)
Learning Representations for Counterfactual Inference (2016)
Modelling and Simulation for Autonomous Systems (2016)
Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation (2016)
Asymptotic Convergence in Online Learning with Unbounded Delays (2016)
Deep Learning with Differential Privacy (2016)
Bounding and Minimizing Counterfactual Error (2016)
Trusted Machine Learning for Probabilistic Models (2016)
Mastering the game of Go with deep neural networks and tree search (2016)
Self-Modification of Policy and Utility Function in Rational Agents (2016)
Parametric Bounded Löb's Theorem and Robust Cooperation of Bounded Agents (2016)
Avoiding Imposters and Delinquents: Adversarial Crowdsourcing and Peer Prediction (2016)
Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization (2016)
The Second Machine Age: Work, Progress, and Prosperity in a Time of Brilliant Technologies (2016)
Practical Black-Box Attacks against Deep Learning Systems using Adversarial Examples (2016)
Unsupervised Risk Estimation with only Structural Assumptions (2016)
Practical Black-Box Attacks against Machine Learning (2016)
Distantly supervised information extraction using bootstrapped patterns (2015)
High-Confidence Off-Policy Evaluation (2015)
Incremental Knowledge Base Construction Using DeepDive (2015)
Human-level control through deep reinforcement learning (2015)
Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning (2015)
Weight Uncertainty in Neural Networks (2015)
Inceptionism: Going Deeper into Neural Networks (2015)
Learning the Preferences of Ignorant, Inconsistent Agents (2015)
Neural GPUs Learn Algorithms (2015)
Massively Multitask Networks for Drug Discovery (2015)
On-the-Job Learning with Bayesian Decision Theory (2015)
A comprehensive survey on safe reinforcement learning (2015)
Ethical guidelines for a superintelligence (2015)
Toward Idealized Decision Theory (2015)
A Formally Verified Hybrid System for the Next-Generation Airborne Collision Avoidance System (2015)
Estimation and Inference of Heterogeneous Treatment Effects using Random Forests (2015)
Calibrated Structured Prediction (2015)
High-Dimensional Continuous Control Using Generalized Advantage Estimation (2015)
Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks (2015)
The Security of Latent Dirichlet Allocation (2015)
Learning Fair Classifiers (2015)
Research Priorities for Robust and Beneficial Artificial Intelligence (2015)
Variational Information Maximisation for Intrinsically Motivated Reinforcement Learning (2015)
Motivated Value Selection for Artificial Agents (2015)
Understanding Neural Networks Through Deep Visualization (2015)
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin (2015)
Using Machine Teaching to Identify Optimal Training-Set Attacks on Machine Learners (2015)
Estimating Accuracy from Unlabeled Data (2014)
Nobel Lecture: Uncertainty Outside and Inside Economic Models (2014)
Spectral Methods Meet EM: A Provably Optimal Algorithm for Crowdsourcing (2014)
Utility function security in artificially intelligent agents (2014)
Machine Learning: The High Interest Credit Card of Technical Debt (2014)
Estimating the accuracies of multiple classifiers without labeled data (2014)
Domain-Adversarial Neural Networks (2014)
Reinforcement Learning and the Reward Engineering Principle (2014)
Optimizing the CVaR via Sampling (2014)
Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits (2014)
Deep neural networks are easily fooled: High confidence predictions for unrecognizable images (2014)
Policy Gradients Beyond Expectations: Conditional Value-at-Risk (2014)
Differential Privacy and Machine Learning: a Survey and Review (2014)
Amplify scientific discovery with artificial intelligence (2014)
Explaining and Harnessing Adversarial Examples (2014)
Active Reward Learning (2014)
Neural Turing Machines (2014)
Empowerment - an Introduction (2013)
Learning Fair Representations (2013)
Causal discovery with continuous additive noise models (2013)
Formal verification of distributed aircraft controllers (2013)
Intriguing properties of neural networks (2013)
Robust Markov Decision Processes (2013)
Safe Exploration in Markov Decision Processes (2012)
Counterfactual reasoning and learning systems: the example of computational advertising (2012)
A Method of Moments for Mixture Models and Hidden Markov Models (2012)
Towards Formal Verification of Freeway Traffic Control (2012)
Change-Point Detection in Time-Series Data by Relative Density-Ratio Estimation (2012)
ImageNet classification with deep convolutional neural networks (2012)
Unsupervised Supervised Learning II: Margin-Based Classification without Labels (2011)
Domain Adaptation with Coupled Subspaces (2011)
Learning What to Value (2011)
Finite-time regional verification of stochastic non-linear systems (2011)
Delusion, Survival, and Intelligent Agents (2011)
Unbiased look at dataset bias (2011)
Towards Making Unlabeled Data Never Hurt (2011)
Towards fully autonomous driving: Systems and algorithms (2011)
Model-based Utility Functions (2011)
Fairness through awareness (2011)
Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data (2010)
The security of machine learning (2010)
A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning (2010)

CITED BY

Clarify Before You Draw: Proactive Agents for Robust Text-to-CAD Generation (2026)
PATRA: Pattern-Aware Alignment and Balanced Reasoning for Time Series Question Answering (2026)
The Trajectory Alignment Coefficient in Two Acts: From Reward Tuning to Reward Learning (2026)
Minimal Computational Preconditions for Subjective Perspective in Artificial Agents (2026)
Replicable Constrained Bandits (2026)
A Mathematical Theory of Agency and Intelligence (2026)
Toward an artifact that designs itself: generative design science research approach (2026)
Responsible AI for General-Purpose Systems: Overview, Challenges, and A Path Forward (2026)
TNCOA: Efficient Exploration via Observation-Action Constraint on Trajectory-Based Intrinsic Reward (2026)
Model-Based Data-Efficient and Robust Reinforcement Learning (2026)
Do We Need Adam? Surprisingly Strong and Sparse Reinforcement Learning with SGD in LLMs (2026)
Agents in the Wild: Safety, Society, and the Illusion of Sociality on Moltbook (2026)
De-Decay: Defusing Computer Vision Model Degradation through Scalable and Actionable Human-Data Alignment (2026)
Natural Language Declarative Prompting (NLD-P): A Modular Governance Method for Prompt Design Under Model Drift (2026)
Drone-Aided Secure Task Offloading Optimization for Internet of Vehicles: Review, Challenges and Method (2026)
Safety Not Found (404): Hidden Risks of LLM-Based Robotics Decision Making (2026)
Why AI Alignment Failure Is Structural: Learned Human Interaction Structures and AGI as an Endogenous Evolutionary Shock (2026)
Polyphonic Intelligence: Constraint-Based Emergence, Pluralistic Inference, and Non-Dominating Integration (2026)
A Generative AI-Driven Reliability Layer for Action-Oriented Disaster Resilience (2026, influential citation)
BAP-SRL: Bayesian Adaptive Priority Safe Reinforcement Learning for Vehicle Motion Planning at Mixed Traffic Intersections (2026)
SSL: Sweet Spot Learning for Differentiated Guidance in Agentic Optimization (2026)
Proof-RM: A Scalable and Generalizable Reward Model for Math Proof (2026)
Quantitative Validation of Artificial Precognition Adaptive Cognized Control: Real-World Performance Evaluation Across Automotive and Railway Operational Deployments (2026)
Uncertainty-Aware Counterfactual Traffic Signal Control with Predictive Safety and Starvation-Avoidance Constraints Using Vision-Based Sensing (2026)
Autonomous Reward Shaping via Self-Generated Trajectories for Sparse-Reward Reinforcement Learning (2026)
Steering the Singularity: How Venture Capital Shapes the Governance and Future of Superintelligence (2026, influential citation)
MAVRL: Learning Reward Functions from Multiple Feedback Types with Amortized Variational Inference (2026)
A testable framework for AI alignment: Simulation Theology as an engineered worldview for silicon-based agents (2026)
IR$^3$: Contrastive Inverse Reinforcement Learning for Interpretable Detection and Mitigation of Reward Hacking (2026)
Hidden Topics: Measuring Sensitive AI Beliefs with List Experiments (2026)
A Deep Learning-Based CSI Prediction Method for LiFi Systems (2026)
Cultural diversity and artificial intelligence (2026)
Language Models’ Hall of Mirrors Problem: Why AI Alignment Requires Peircean Semiosis (2026)
Can AI mediation improve democratic deliberation? (2026)
Lightweight Yet Secure: Secure Scripting Language Generation via Lightweight LLMs (2026)
The Alignment Tax: Why Safety Shouldn’t Slow Innovation (2026)
LERA: Reinstating Judgment as a Structural Precondition for Execution in Automated Systems (2026)
Institutional AI: A Governance Framework for Distributional AGI Safety (2026)
What Do Learned Models Measure? (2026)
Reflect: Transparent Principle-Guided Reasoning for Constitutional Alignment at Scale (2026)
Learning Contextual Runtime Monitors for Safe AI-Based Autonomy (2026)
Safety Generalization Under Distribution Shift in Safe Reinforcement Learning: A Diabetes Testbed (2026)
Safer Policy Compliance with Dynamic Epistemic Fallback (2026)
How should AI Safety Benchmarks Benchmark Safety? (2026)
PolySAE: Modeling Feature Interactions in Sparse Autoencoders via Polynomial Decoding (2026)
Adversarial Reward Auditing for Active Detection and Mitigation of Reward Hacking (2026)
SAFE: Stable Alignment Finetuning with Entropy-Aware Predictive Control for Reinforcement Learning from Human Feedback (RLHF) (2026)
Learning with Adaptive Prototype Manifolds for Out-of-Distribution Detection (2026)
FairJudge: An Adaptive, Debiased, and Consistent LLM-as-a-Judge (2026)
An Interpretable Chest X-ray Classification Framework Using Prototype Memory and Counterfactual Consistency (2026)
Robust Trust (2026)
Improving Medical Visual Reinforcement Fine-Tuning via Perception and Reasoning Augmentation (2026)
Capability-Oriented Training Induced Alignment Risk (2026)
Intelligent AI Delegation (2026)
QuRL: Efficient Reinforcement Learning with Quantized Rollout (2026)
Automatically Finding Reward Model Biases (2026)
Kalman-Inspired Runtime Stability and Recovery in Hybrid Reasoning Systems (2026)
Intent Laundering: AI Safety Datasets Are Not What They Seem (2026)
The End of Pretraining for Large Language Models: The Future of Agentic and AI Reasoning Beyond Peak Data (2026)
The Invisible Gorilla Effect in Out-of-distribution Detection (2026)
DP-RFT: Learning to Generate Synthetic Text via Differentially Private Reinforcement Fine-Tuning (2026)
The Headless Firm: How AI Reshapes Enterprise Boundaries (2026)
IT2-ENFIS: Interval Type-2 Exclusionary Neuro-Fuzzy Inference System, an Attempt Toward Trustworthy Regression Learning (2026)
A Generalized Apprenticeship Learning Framework for Capturing Evolving Student Pedagogical Strategies (2026)
From algorithmic hallucinations to alien minds: Addressing the ideator's dilemma through entrepreneurial work (2026)
Specification-Guided Reinforcement Learning (2026)
Theory Trace Card: Theory-Driven Socio-Cognitive Evaluation of LLMs (2026)
Adaptive Conformal Prediction via Bayesian Uncertainty Weighting for Hierarchical Healthcare Data (2026)
AI Social Responsibility as Reachability: Execution-Level Semantics for the Social Responsibility Stack (2026)
Understanding Reward Hacking in Text-to-Image Reinforcement Learning (2026)
Integration of A Web Mdvr Howen Vehicle Surveillance System (Vss) and An Artificial Intelligence Based in Car Camera (Icc) For Fleet Safety PT. Putra Perkasa Abadi Jobsite Adaro Indonesia (2026)
Dynamic Intelligence Ceilings: Measuring Long-Horizon Limits of Planning and Creativity in Artificial Systems (2026)
The universal theory of core values in intelligent systems (UTCVIS): a systems, philosophical, and ethical inquiry (2026)
A white-box prompt injection attack on embodied AI agents driven by large language models (2026)
Semantic Laundering in AI Agent Architectures: Why Tool Boundaries Do Not Confer Epistemic Warrant (2026)
Agent Contracts: A Formal Framework for Resource-Bounded Autonomous AI Systems (2026)
AI Deployment Authorisation: A Global Standard for Machine-Readable Governance of High-Risk Artificial Intelligence (2026)
Breaking Up with Normatively Monolithic Agency with GRACE: A Reason-Based Neuro-Symbolic Architecture for Safe and Ethical AI Alignment (2026)
Is BatchEnsemble a Single Model? On Calibration and Diversity of Efficient Ensembles (2026)
Beyond Preferences: Learning Alignment Principles Grounded in Human Reasons and Values (2026)
The Relativity of AGI: Distributional Axioms, Fragility, and Undecidability (2026)
Beyond Outcome Verification: Verifiable Process Reward Models for Structured Reasoning (2026, influential citation)
Status Hierarchies in Language Models (2026)
DeRaDiff: Denoising Time Realignment of Diffusion Models (2026)
Factored Causal Representation Learning for Robust Reward Modeling in RLHF (2026)
Expected Return Causes Outcome-Level Mode Collapse in Reinforcement Learning and How to Fix It with Inverse Probability Scaling (2026)
Toward Fully Autonomous Driving: AI, Challenges, Opportunities, and Needs (2026)
Beyond Medical Chatbots: Meddollina and the Rise of Continuous Clinical Intelligence (2026)
TSPO: Breaking the Double Homogenization Dilemma in Multi-turn Search Policy Optimization (2026)
A Structured Approach to Safety Case Construction for AI Systems (2026)
From Asimov’s robot laws to the SET framework: integrating safety, ethics, and transparency in science, technology, and innovation policy (2026)
From Worst Case to Conditional Frontiers in Reinforcement Learning (2026)
Reward-free Alignment for Conflicting Objectives (2026)
Interpretability in Deep Time Series Models Demands Semantic Alignment (2026)
An AI ethics framework for a trustworthy autonomous drone system to support battlefield casualty triage (2026)
Subliminal Effects in Your Data: A General Mechanism via Log-Linearity (2026)
AgentXRay: White-Boxing Agentic Systems via Workflow Reconstruction (2026)
Alignment Verifiability in Large Language Models: Normative Indistinguishability under Behavioral Evaluation (2026)
Quantifying Edge Intelligence: Inference-Time Scaling Formalisms for Heterogeneous Computing (2026)
Cognitive Models and AI Algorithms Provide Templates for Designing Language Agents (2026)