Conditional Computation in Neural Networks for faster models

Emmanuel Bengio, Pierre-Luc Bacon, Joelle Pineau, Doina Precup

Published 2015 in arXiv.org

ABSTRACT

Deep learning has become the state-of-the-art tool in many applications, but the evaluation and training of deep models can be time-consuming and computationally expensive. The conditional computation approach has been proposed to tackle this problem (Bengio et al., 2013; Davis & Arel, 2013). It operates by selectively activating only parts of the network at a time. In this paper, we use reinforcement learning as a tool to optimize conditional computation policies. More specifically, we cast the problem of learning activation-dependent policies for dropping out blocks of units as a reinforcement learning problem. We propose a learning scheme motivated by computation speed, capturing the idea of wanting to have parsimonious activations while maintaining prediction accuracy. We apply a policy gradient algorithm for learning policies that optimize this loss function and propose a regularization mechanism that encourages diversification of the dropout policy. We present encouraging empirical results showing that this approach improves the speed of computation without impacting the quality of the approximation.
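
The idea in the abstract, a policy that looks at the incoming activations and decides which blocks of units to drop, trained with a policy gradient, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the per-block Bernoulli policy with sigmoid probabilities, and the plain REINFORCE estimator without a baseline are all assumptions made for illustration, and the paper's sparsity and diversification regularizers are omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_block_mask(x, W_pi, b_pi, rng):
    """Sample one keep/drop decision per block, conditioned on the input x.

    Hypothetical policy: independent Bernoulli per block with
    probability sigmoid(x @ W_pi + b_pi).
    """
    probs = sigmoid(x @ W_pi + b_pi)                  # (batch, n_blocks)
    mask = (rng.random(probs.shape) < probs).astype(x.dtype)
    return mask, probs

def gated_layer(x, W, b, mask, block_size):
    """ReLU layer whose units are zeroed block by block.

    For clarity this computes the dense layer and then masks it; an
    actual speed-up would require skipping the dropped blocks entirely.
    """
    h = np.maximum(0.0, x @ W + b)                    # (batch, n_blocks * block_size)
    return h * np.repeat(mask, block_size, axis=1)

def reinforce_grads(x, mask, probs, cost):
    """Score-function (REINFORCE) estimate of d E[cost] / d (W_pi, b_pi).

    For a Bernoulli policy with sigmoid probabilities, the derivative of
    log pi(mask | x) with respect to the logits is (mask - probs).
    """
    score = (mask - probs) * cost[:, None]            # (batch, n_blocks)
    grad_W = x.T @ score / x.shape[0]
    grad_b = score.mean(axis=0)
    return grad_W, grad_b
```

In this sketch, `cost` would be the per-example prediction loss observed after the forward pass with the sampled mask; a gradient-descent step on `W_pi` and `b_pi` along these estimates lowers the probability of masks that led to high loss.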

PUBLICATION RECORD

Publication year: 2015
Venue: arXiv.org
Publication date: 2015-11-19
Fields of study: Computer Science
Identifiers: arXiv 1511.06297
Source metadata: Semantic Scholar

REFERENCES

Pattern Recognition And Machine Learning (2016)
Delving into Transferable Adversarial Examples and Black-box Attacks (2016)
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (2015)
DRAW: A Recurrent Neural Network For Image Generation (2015)
Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification (2015)
Advances in Neural Information Processing Systems 27 (2014)
Recurrent Models of Visual Attention (2014)
Deterministic Policy Gradient Algorithms (2014)
Deep Networks with Internal Selective Attention through Feedback Connections (2014)
Deep Sequential Neural Network (2014)
Low-Rank Approximations for Conditional Feedforward Computation in Deep Neural Networks (2013)
Dropout Training as Adaptive Regularization (2013)
Adaptive dropout for training deep neural networks (2013)
A Survey on Policy Search for Robotics (2013)
Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation (2013)
Improving neural networks by preventing co-adaptation of feature detectors (2012)
Reading Digits in Natural Images with Unsupervised Feature Learning (2011, influential reference)
Deep learning via Hessian-free optimization (2010)
Understanding the difficulty of training deep feedforward neural networks (2010)
Learning Multiple Layers of Features from Tiny Images (2009)
Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning (2004)
Reinforcement Learning: An Introduction (1998)
Gradient-based learning applied to document recognition (1998, influential reference)
Learning long-term dependencies with gradient descent is difficult (1994)
Markov Decision Processes: Discrete Stochastic Dynamic Programming (1994)
Fast Exact Multiplication by the Hessian (1994)
Untersuchungen zu dynamischen neuronalen Netzen (1991)
Learning representations by back-propagating errors (1986)

CITED BY

Soft decision trees for survival analysis (2026)
DynaMoE: Dynamic Token-Level Expert Activation with Layer-Wise Adaptive Capacity for Mixture-of-Experts Neural Networks (2026)
Excitation: Momentum For Experts (2026)
Generalizing GNNs with Tokenized Mixture of Experts (2026)
LoopFormer: Elastic-Depth Looped Transformers for Latent Reasoning via Shortcut Modulation (2026)
AdaPonderLM: Gated Pondering Language Models with Token-Wise Adaptive Depth (2026)
Robustness of Mixtures of Experts to Feature Noise (2026)
Area-Efficient In-Memory Computing for Mixture-of-Experts via Multiplexing and Caching (2026)
Laws of Learning Dynamics and the Core of Learners (2026)
Speaker Adaptive Mixture of Weight-Decomposed LoRA Experts for On-Device End-to-End ASR (2025)
Topology-Assisted Spatio-Temporal Pattern Disentangling for Scalable MARL in Large-scale Autonomous Traffic Control (2025)
Directed Semi-Simplicial Learning with Applications to Brain Activity Decoding (2025)
Empowering Quantum Error Traceability with MoE for Automatic Calibration (2025)
Void in Language Models (2025)
Mixture-of-Shape-Experts (MoSE): End-to-End Shape Dictionary Framework to Prompt SAM for Generalizable Medical Segmentation (2025)
Zero-Overhead Introspection for Adaptive Test-Time Compute (2025)
CARN: Complexity-Aware Routing Network for Efficient and Adaptive Inference (2025)
Context-aware Sparse Spatiotemporal Learning for Event-based Vision (2025)
Encode, Think, Decode: Scaling test-time reasoning with recursive latent thoughts (2025)
Maximum Score Routing For Mixture-of-Experts (2025)
Lookup multivariate Kolmogorov-Arnold Networks (2025)
EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception (2025)
SparsyFed: Sparse Adaptive Federated Training (2025)
Improving Routing in Sparse Mixture of Experts with Graph of Tokens (2025)
Visual Instance-aware Prompt Tuning (2025)
Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation (2025)
Towards Interpretability Without Sacrifice: Faithful Dense Layer Decomposition with Mixture of Decoders (2025)
Efficiency Robustness of Dynamic Deep Learning Systems (2025)
Learning to Inference Adaptively for Multimodal Large Language Models (2025)
Load Balancing Mixture of Experts with Similarity Preserving Routers (2025)
Learning Unmasking Policies for Diffusion Language Models (2025)
Neural network task specialization via domain constraining (2025)
MS-NET-v2: modular selective network optimized by systematic generation of expert modules (2025)
Tight Clusters Make Specialized Experts (2025)
AdaPerceiver: Transformers with Adaptive Width, Depth, and Tokens (2025)
Dynamic Early-Exit Convolutional Neural Networks for Edge Vision: The Benefits, The Challenges, and the Road Ahead (2025)
SAL-YOLO-DeepSeek: a lightweight real-time detection and LLM-driven decision framework for intelligent escalator safety monitoring (2025)
Mixture of Experts Detection System for Enhanced Automatic Annotation in Photovoltaic Modules (2025)
AIM-W: An Input-Adaptive Framework for Interoperable Physiological Modeling on Wearables (2025)
Modeling Expert Interactions in Sparse Mixture of Experts via Graph Structures (2025)
Visual Programmability: A Guide for Code-as-Thought in Chart Understanding (2025)
Downsized and Compromised?: Assessing the Faithfulness of Model Compression (2025)
Switch-Based Multi-Part Neural Network (2025)
Leveraging Deep Q-Network Agents with Dynamic Routing Mechanisms in Convolutional Neural Networks for Enhanced and Reliable Classification of Alzheimer's Disease from MRI Scans (2025)
Context-aware Dynamic Pruning for Speech Foundation Models (2025)
Mixture of Experts in Large Language Models (2025)
A theory of initialisation’s impact on specialisation (2025)
Optimization of Layer Skipping and Frequency Scaling for Convolutional Neural Networks under Latency Constraint (2025)
Sequential Policy Gradient for Adaptive Hyperparameter Optimization (2025)
Learning When Not to Attend Globally (2025)
MPKD-DCFI: multi-path knowledge distillation via dynamic contextual feature interaction (2025)
Online deep learning’s role in conquering the challenges of streaming data: a survey (2025)
Algorithm for Describing Neuronal Electric Operation (2025)
Learning to Skip the Middle Layers of Transformers (2025)
Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters (2024)
Intrinsic User-Centric Interpretability through Global Mixture of Experts (2024)
GW-MoE: Resolving Uncertainty in MoE Router with Global Workspace Theory (2024)
The Entanglement of Communication and Computing in Enabling Edge Intelligence (2024)
Multi-Path Routing for Conditional Information Gain Trellis Using Cross-Entropy Search and Reinforcement Learning (2024)
Learning More Generalized Experts by Merging Experts in Mixture-of-Experts (2024)
MoE-FFD: Mixture of Experts for Generalized and Parameter-Efficient Face Forgery Detection (2024)
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models (2024)
Enhancing Fast Feed Forward Networks with Load Balancing and a Master Leaf Node (2024)
Adapting Neural Networks at Runtime: Current Trends in At-Runtime Optimizations for Deep Learning (2024)
Sparse Diffusion Policy: A Sparse, Reusable, and Flexible Policy for Robot Learning (2024)
Video Relationship Detection Using Mixture of Experts (2024)
Multilinear Mixture of Experts: Scalable Expert Specialization through Factorization (2024)
Unraveling the Mystery of Scaling Laws: Part I (2024)
Conditional computation in neural networks: Principles and research trends (2024)
Conditional Information Gain Trellis (2024)
GMoE: Empowering LLMs Fine-Tuning via MoE Graph Collaboration (2024)
Fast-Inf: Ultra-Fast Embedded Intelligence on the Batteryless Edge (2024)
AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning (2024)
Sparsely Gated Mixture of Experts Neural Network For Linearization of RF Power Amplifiers (2024)
Efficient Sparse Training with Structured Dropout (2024)
InterpretCC: Conditional Computation for Inherently Interpretable Neural Networks (2024)
Unveiling The Matthew Effect Across Channels: Assessing Layer Width Sufficiency via Weight Norm Variance (2024)
Ripple: Accelerating LLM Inference on Smartphones with Correlation-Aware Neuron Management (2024)
More Experts Than Galaxies: Conditionally-overlapping Experts With Biologically-Inspired Fixed Routing (2024)
Skipping Computations in Multimodal LLMs (2024)
Beyond Parameter Count: Implicit Bias in Soft Mixture of Experts (2024)
Natural Language Processing and Neurosymbolic AI: The Role of Neural Networks with Knowledge-Guided Symbolic Approaches (2024)
Retrieval with Learned Similarities (2024)
Mixture-of-Noises Enhanced Forgery-Aware Predictor for Multi-Face Manipulation Detection and Localization (2024)
Routers in Vision Mixture of Experts: An Empirical Study (2024)
LONG EXPOSURE: Accelerating Parameter-Efficient Fine-Tuning for LLMs under Shadowy Sparsity (2024)
Scaling Diffusion Transformers to 16 Billion Parameters (2024)
Active reinforcement learning versus action bias and hysteresis: control with a mixture of experts and nonexperts (2024)
Mixture of A Million Experts (2024)
Training-Free Activation Sparsity in Large Language Models (2024)
A Survey on Mixture of Experts in Large Language Models (2024)
MoDification: Mixture of Depths Made Easy (2024)
A Survey on Mixture of Experts (2024)
TIPS: Topologically Important Path Sampling for Anytime Neural Networks (2023)
Towards Accurate Image Coding: Improved Autoregressive Image Generation with Dynamic Vector Quantization (2023)
Lifting the Curse of Capacity Gap in Distilling Language Models (2023)
Keyword-Specific Acoustic Model Pruning for Open-Vocabulary Keyword Spotting (2023)
GradMDM: Adversarial Attack on Dynamic Networks (2023)
Memorization Capacity of Neural Networks with Conditional Computation (2023)
Deep Convolutional Tables: Deep Learning Without Convolutions (2023)