Deep Learning Optimization Techniques: Beyond Adam

The Adam optimizer has become the default choice for training deep neural networks. Its adaptive learning rates and momentum-based updates make it robust across many applications. However, as deep learning models grow larger and training dynamics become more complex, researchers and practitioners have developed advanced optimization techniques that can outperform Adam in specific scenarios.

From layer-wise learning rate adaptation to sharpness-aware minimization, these techniques address fundamental challenges in training deep networks: generalization, convergence speed, and stability. Understanding when and how to apply these methods can significantly impact your model's performance.

Limitations of Adam

While Adam is excellent for many tasks, it has known limitations:

Generalization gaps: Adam often generalizes worse than SGD with momentum on image classification tasks
Convergence issues: Adam can converge to suboptimal solutions in some cases
Learning rate sensitivity: The default hyperparameters are often not optimal, and the learning rate usually still needs tuning
Memory overhead: Adam stores two extra values (first and second moment estimates) for every parameter

These limitations have motivated research into more sophisticated optimization techniques.
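To put the memory overhead in numbers, here is a rough back-of-the-envelope sketch. It assumes optimizer state is kept in fp32 (4 bytes per value); the function name and the 1B-parameter model are hypothetical and only serve the illustration.

```python
# Rough optimizer-state memory estimate; assumes fp32 state (4 bytes per value).
BYTES_PER_VALUE = 4

# Number of extra per-parameter state tensors each optimizer keeps.
STATE_TENSORS = {
    "sgd": 0,           # plain SGD keeps no state
    "sgd_momentum": 1,  # one momentum buffer
    "adam": 2,          # first and second moment estimates
}


def optimizer_state_bytes(num_params: int, optimizer: str) -> int:
    """Return the bytes of optimizer state for a model with num_params weights."""
    return num_params * STATE_TENSORS[optimizer] * BYTES_PER_VALUE


num_params = 1_000_000_000  # hypothetical 1B-parameter model
for name in STATE_TENSORS:
    gigabytes = optimizer_state_bytes(num_params, name) / 1e9
    print(f"{name}: {gigabytes:.0f} GB of optimizer state")
```

Under these assumptions Adam needs about 8 GB of state for the model above, on top of the weights and gradients themselves, compared with 4 GB for SGD with momentum.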

Advanced Optimization Techniques

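1. LAMB (Layer-wise Adaptive Moments optimizer for Batch training)

LAMB applies layer-wise learning rate adaptation on top of Adam: it rescales each layer's Adam update by a trust ratio computed from the layer's weight norm and update norm. This enables training with very large batch sizes (the original paper reports BERT pretraining with batches of up to 64k) with little or no loss of accuracy.

Key innovation: Each layer gets its own effective learning rate, scaled by the ratio of the layer's weight norm to the norm of its Adam update. This keeps the relative size of every update bounded, which prevents exploding or vanishing updates in deep networks.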

When to use: Large batch training, very deep networks (BERT, Transformers), distributed training

2. LARS (Layer-wise Adaptive Rate Scaling)

LARS introduced the layer-wise learning rate idea that LAMB later built on. It computes a local learning rate for each layer from the ratio of the weight norm to the gradient norm and applies it to SGD with momentum. It was instrumental in training ResNet-50 on ImageNet with batch sizes of 8k and beyond.

Key innovation: A separate trust ratio for each layer, based on the weight norm and the gradient norm, enabling stable training with large learning rates.
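The layer-wise scaling can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not a full optimizer: the function names, the trust coefficient of 0.001, and the fallback of 1.0 for zero norms are assumptions made for the example.

```python
import math


def l2_norm(values):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))


def lars_local_lr(weights, grads, trust_coef=0.001, weight_decay=0.0, eps=1e-12):
    """Local learning-rate multiplier for one layer: proportional to ||w|| / ||g||."""
    w_norm = l2_norm(weights)
    g_norm = l2_norm(grads)
    if w_norm == 0.0 or g_norm == 0.0:
        return 1.0  # assumed fallback: leave the global learning rate unscaled
    return trust_coef * w_norm / (g_norm + weight_decay * w_norm + eps)


# Two hypothetical layers with identical weights but very different gradient scales.
weights = [0.5, -0.5, 0.5, -0.5]
small_grads = [0.01, 0.01, -0.01, 0.01]
large_grads = [1.0, 1.0, -1.0, 1.0]

print("small-gradient layer:", lars_local_lr(weights, small_grads))
print("large-gradient layer:", lars_local_lr(weights, large_grads))
```

The layer with small gradients receives a multiplier 100 times larger than the layer with large gradients, so both layers end up taking steps of similar size relative to their weights.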

When to use: Large batch image classification, training very wide networks

3. Sharpness-Aware Minimization (SAM)

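SAM explicitly seeks flat minima, which tend to generalize better than sharp minima. Instead of minimizing the loss at the current weights, it minimizes the worst-case loss within a small neighborhood around them, approximating the worst-case perturbation with a single normalized gradient ascent step.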

Key innovation: Seeks parameters that are not just low-loss but stable to small perturbations—this correlates with better generalization.

Implementation: Requires two forward-backward passes per step, roughly doubling the compute cost per step. In return it often improves accuracy, with gains of around 1-2 percentage points in favorable cases.
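Written out, the objective and the two-pass update look roughly as follows. The symbols are assumptions introduced for this sketch: L is the training loss, w the weights, rho the neighborhood radius, and eta the learning rate.

```latex
\min_{w} \; \max_{\|\epsilon\|_2 \le \rho} L(w + \epsilon),
\qquad
\hat{\epsilon}(w) = \rho \, \frac{\nabla_w L(w)}{\|\nabla_w L(w)\|_2},
\qquad
w \leftarrow w - \eta \, \nabla_w L(w) \Big|_{w + \hat{\epsilon}(w)}
```

The first pass computes the gradient at w to form the perturbation; the second pass computes the gradient at the perturbed point, and that gradient is applied to the original weights.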

When to use: When generalization is critical, especially in image classification and language modeling

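4. AdamW (Adam with Decoupled Weight Decay)

AdamW decouples weight decay from Adam's adaptive learning rates. With the original Adam, weight decay was usually implemented as L2 regularization added to the gradient, so the decay term was rescaled by the adaptive per-parameter step size and no longer behaved like true weight decay.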

Key innovation: Decouples weight decay from the gradient-based update, producing more effective regularization.
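A single scalar parameter is enough to see the difference. The sketch below is a hand-written Adam step, not a library API; the function name and hyperparameters are chosen for illustration, and the loss gradient is set to zero so that only the regularization term acts.

```python
import math


def adam_step(w, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8,
              l2=0.0, decoupled_wd=0.0):
    """One Adam step on a scalar weight.

    l2 adds the penalty to the gradient (classic Adam with L2 regularization);
    decoupled_wd shrinks the weight directly (AdamW-style).
    """
    g = grad + l2 * w                    # L2 term passes through the adaptive scaling
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)         # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)         # bias-corrected second moment
    adaptive_step = m_hat / (math.sqrt(v_hat) + eps)
    # The decoupled decay is added outside the adaptive scaling.
    w = w - lr * (adaptive_step + decoupled_wd * w)
    return w, m, v


# Zero loss gradient, so any change in w comes from regularization alone.
w_l2, _, _ = adam_step(1.0, grad=0.0, m=0.0, v=0.0, t=1, l2=0.01)
w_adamw, _, _ = adam_step(1.0, grad=0.0, m=0.0, v=0.0, t=1, decoupled_wd=0.01)

print("Adam + L2:", w_l2)
print("AdamW:    ", w_adamw)
```

With L2 regularization the first step shrinks the weight by roughly the full learning rate (from 1.0 to about 0.9), because Adam normalizes the tiny penalty gradient to unit scale. With decoupled decay the weight shrinks by lr times the decay coefficient (to 0.999), which is the behavior the coefficient was meant to control.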

When to use: Default choice for most transformer-based models (BERT, GPT)

5. Ranger (RAdam + Lookahead)

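Ranger combines RAdam (Rectified Adam) with Lookahead optimization. RAdam rectifies the high variance of the adaptive learning rate in the first steps of training, which acts like an automatic warmup, while Lookahead keeps a second set of slow weights that follow the fast weights for more stable convergence.

Key innovation: Two-level optimization: an inner optimizer (RAdam) updates the fast weights, and every k steps the slow weights are interpolated toward the fast weights, which are then reset to the slow weights.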
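The Lookahead half of this scheme is simple enough to sketch on a one-dimensional problem. To keep the example short, plain gradient descent stands in for RAdam as the inner optimizer; the function name and the values of k and alpha are assumptions for illustration.

```python
def lookahead_descent(grad_fn, w0, inner_lr=0.1, k=5, alpha=0.5, steps=20):
    """Minimize a 1-D function with Lookahead wrapped around plain gradient descent."""
    slow = w0
    fast = w0
    for step in range(1, steps + 1):
        fast = fast - inner_lr * grad_fn(fast)   # inner (fast) optimizer step
        if step % k == 0:
            slow = slow + alpha * (fast - slow)  # move slow weights toward fast weights
            fast = slow                          # restart fast weights from slow weights
    return slow


# f(w) = w^2 has gradient 2w and its minimum at w = 0.
final_w = lookahead_descent(lambda w: 2.0 * w, w0=1.0)
print("slow weight after 20 steps:", final_w)
```

In a real Ranger setup the inner step would be an RAdam update over all parameters, but the synchronization logic every k steps stays the same.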

When to use: When training is unstable, especially with small datasets or unusual architectures

6. NovoGrad

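NovoGrad computes the gradient's second moment per layer rather than per parameter, and normalizes each layer's gradient by it before applying momentum. Because the second-moment state shrinks to one scalar per layer, optimizer memory is roughly half of Adam's.

Key innovation: Layer-wise second moments reduce the memory footprint while providing benefits similar to Adam's adaptive scaling.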
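A minimal sketch of one NovoGrad-style update for a single layer, written with plain Python lists. The function name, the state dictionary layout, and the beta values are assumptions for illustration; a production implementation would operate on tensors.

```python
import math


def novograd_step(weights, grads, state, lr=0.01, beta1=0.95, beta2=0.98,
                  eps=1e-8, weight_decay=0.0):
    """Return updated weights for one layer; `state` holds the layer's moments."""
    sq_norm = sum(g * g for g in grads)  # squared gradient norm of the whole layer
    if "v" not in state:
        state["v"] = sq_norm             # first step: initialise with ||g||^2
    else:
        state["v"] = beta2 * state["v"] + (1 - beta2) * sq_norm
    denom = math.sqrt(state["v"]) + eps  # one scalar shared by the whole layer

    # The first moment is still kept per parameter.
    m_prev = state.get("m", [0.0] * len(weights))
    state["m"] = [beta1 * m + (g / denom + weight_decay * w)
                  for m, g, w in zip(m_prev, grads, weights)]
    return [w - lr * m for w, m in zip(weights, state["m"])]


layer_state = {}
new_weights = novograd_step([1.0, 2.0], [3.0, 4.0], layer_state)
print("second moment (one scalar for the layer):", layer_state["v"])
print("updated weights:", new_weights)
```

The second moment is a single float regardless of how many parameters the layer has, which is where the memory saving over Adam comes from.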

When to use: Memory-constrained training, very large models

7. AdamW with Warmup and Cosine Annealing

Combining warmup (gradually increasing the learning rate), decoupled weight decay, and cosine annealing is a strong baseline for transformer models and is often hard to beat.

Key insight: Simple modifications to Adam—proper weight decay and learning rate scheduling—can match or exceed more complex optimizers.

Practical Implementation Guide

Choosing an Optimizer

Default choice: AdamW with cosine annealing and warmup—works well for most deep learning tasks.

For transformers: AdamW with weight decay 0.01, learning rate 1e-4 to 1e-3, warmup 10% of steps.

For CNNs: SGD with momentum often generalizes better; consider LARS or LAMB for large batch training.

For unstable training: Try Ranger or add gradient clipping.

Learning Rate Scheduling

Learning rate scheduling often matters more than optimizer choice:

Warmup: Gradually increase LR from 0 to target; prevents early instability
Cosine annealing: Smooth decay following cosine curve; popular for transformers
Step decay: Reduce LR at specific milestones; simple and effective
Polynomial decay: Smooth decay to near-zero; used in some transformer implementations
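As an illustration of how warmup and cosine annealing fit together, here is a small dependency-free schedule function. The function name and default values are assumptions for this sketch; in PyTorch the same shape can be built from the schedulers in torch.optim.lr_scheduler.

```python
import math


def lr_at_step(step, total_steps, base_lr=1e-4, warmup_frac=0.1, min_lr=0.0):
    """Linear warmup from 0 to base_lr, then cosine decay down to min_lr."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    # Fraction of the decay phase completed, from 0.0 to 1.0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


total = 1000
for s in (0, 50, 100, 550, 1000):
    print(f"step {s:4d}: lr = {lr_at_step(s, total):.2e}")
```

The learning rate starts at zero, reaches the target after the first 10% of steps, and then follows a half cosine down to the minimum at the final step.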

Gradient Clipping

Essential for preventing exploding gradients, especially in RNNs and transformers:

Global norm clipping: Clip gradient norm to max (typically 1.0)
Value clipping: Clip individual gradients to [-max, max]
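The two clipping modes behave differently, which a small pure-Python sketch makes visible. The helper names are hypothetical; in PyTorch the corresponding utilities are torch.nn.utils.clip_grad_norm_ and torch.nn.utils.clip_grad_value_.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale all gradients by one factor so their combined L2 norm is at most max_norm.

    grads is a list of per-tensor gradient lists. Returns (clipped, original_norm).
    """
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [[g * scale for g in tensor] for tensor in grads], total_norm


def clip_by_value(grads, clip_value=1.0):
    """Clamp every gradient element independently to [-clip_value, clip_value]."""
    return [[max(-clip_value, min(clip_value, g)) for g in tensor] for tensor in grads]


example_grads = [[3.0, 0.0], [0.0, 4.0]]  # combined norm is 5.0
norm_clipped, original_norm = clip_by_global_norm(example_grads)
value_clipped = clip_by_value(example_grads)

print("original norm:", original_norm)
print("global norm clipping:", norm_clipped)
print("value clipping:", value_clipped)
```

Global norm clipping keeps the 3:4 ratio between the two gradients and only shortens the overall step, while value clipping flattens both to 1.0 and therefore changes the update direction.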

Performance Comparison

Optimizer	Memory	Convergence	Generalization	Best For
AdamW	High	Fast	Good	Default, Transformers
SGD+Momentum	Low	Slower	Best	CNNs, Classic ML
LAMB	High	Fast	Good	Large batch, Distributed
SAM	High	Slower	Best	Generalization critical
Ranger	High	Stable	Good	Unstable training

Code Examples

AdamW with Warmup and Cosine Annealing (PyTorch)

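import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

model = torch.nn.Linear(16, 4)  # placeholder model
total_steps, warmup_steps = 1000, 100  # warm up over the first 10% of steps

optimizer = AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=warmup_steps)
cosine = CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps)
scheduler = SequentialLR(optimizer, [warmup, cosine], milestones=[warmup_steps])

for step in range(total_steps):
    loss = model(torch.randn(8, 16)).pow(2).mean()  # placeholder batch and loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()  # advance the schedule once per optimizer step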

LAMB Optimizer

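# PyTorch core does not include LAMB; third-party packages do.
# For example, assuming `pip install torch-optimizer`:
#   import torch_optimizer
#   optimizer = torch_optimizer.Lamb(model.parameters(), lr=1e-4)

# Or implement the core LAMB update for a single parameter tensor
import torch

@torch.no_grad()
def lamb_update(p, state, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-6, weight_decay=0.01):
    grad = p.grad
    exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq']

    # Update Adam's moment estimates (bias correction omitted for brevity)
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

    # Adam-style direction plus decoupled weight decay
    update = exp_avg / (exp_avg_sq.sqrt() + eps) + weight_decay * p

    # Layer-wise trust ratio ||w|| / ||update||; fall back to 1 if either norm is 0
    w_norm, u_norm = p.norm(), update.norm()
    ratio = (w_norm / u_norm).item() if w_norm > 0 and u_norm > 0 else 1.0

    # Descend: subtract the rescaled update
    p.add_(update, alpha=-lr * ratio)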

SAM (Sharpness-Aware Minimization)

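# Simplified SAM implementation that wraps a base optimizer
import torch
from torch.optim import Optimizer

class SAM(Optimizer):
    def __init__(self, params, base_optimizer_cls, rho=0.05, **kwargs):
        super().__init__(params, dict(rho=rho, **kwargs))
        # The base optimizer shares this wrapper's parameter groups
        self.base_optimizer = base_optimizer_cls(self.param_groups, **kwargs)

    @torch.no_grad()
    def step(self, closure):
        # closure must zero the grads, run forward + backward, and return the loss
        closure = torch.enable_grad()(closure)

        # First forward-backward: compute gradient at w
        loss = closure()

        # Compute perturbation eps = rho * grad / ||grad|| and move to w + eps
        params = [p for g in self.param_groups for p in g['params'] if p.grad is not None]
        grad_norm = torch.norm(torch.stack([p.grad.norm() for p in params]))
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None: continue
                self.state[p]['eps'] = p.grad * group['rho'] / (grad_norm + 1e-12)
                p.add_(self.state[p]['eps'])

        # Second forward-backward: compute gradient at w + eps
        closure()

        # Restore w, then apply the gradient computed at w + eps
        for p in params:
            p.sub_(self.state[p]['eps'])
        self.base_optimizer.step()
        return loss

# Usage: optimizer = SAM(model.parameters(), torch.optim.SGD, lr=0.1, momentum=0.9)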

Conclusion

Adam remains an excellent default choice, but advanced optimization techniques can provide meaningful improvements in specific scenarios. LAMB enables large batch training, SAM improves generalization, and proper learning rate scheduling often matters more than optimizer choice.

Start with AdamW with warmup and cosine annealing—this works well for most applications. Only switch to more complex optimizers when you have specific needs: large batch training (LAMB), generalization-critical tasks (SAM), or unstable training (Ranger).

Remember that optimization is just one piece of the training puzzle. Data quality, architecture choice, and regularization often matter more than optimizer selection.