07 — TorchDynamo and TorchInductor
Two of the most important components in modern PyTorch are TorchDynamo and TorchInductor. Together, they power much of the performance story behind torch.compile.
TorchDynamo: Graph Capture from Python
TorchDynamo intercepts Python frame execution and attempts to extract regions of PyTorch operations into FX graphs.
Core idea:
- Keep writing regular PyTorch code.
- Capture graphable regions automatically.
- Fall back to eager mode for unsupported segments.
This makes adoption incremental: users can get speedups without a full rewrite.
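A minimal sketch of that workflow (the function name is ours; `backend="eager"` is a real torch.compile backend that exercises Dynamo's capture while executing with ordinary eager kernels, so the example needs no compiler toolchain):

```python
import torch

def scale_and_shift(x):
    # Plain PyTorch code; no annotations or rewrites needed.
    return torch.relu(x) * 2.0 + 1.0

# backend="eager" runs TorchDynamo's graph capture but executes the
# captured graph with the normal eager kernels, so results match the
# uncompiled function exactly.
compiled = torch.compile(scale_and_shift, backend="eager")

x = torch.randn(8)
assert torch.allclose(compiled(x), scale_and_shift(x))
```

The same call with the default backend would hand the captured graph to TorchInductor instead; the user-facing code is unchanged either way.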
AOTAutograd: Ahead-of-Time Gradient Graphing
Between capture and lowering, AOTAutograd can trace forward/backward graphs to enable optimizations across differentiation boundaries.
This matters because the backward pass often dominates training compute; tracing forward and backward together lets fusion and scheduling decisions span both passes instead of stopping at the autograd boundary.
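This stage can be exercised in isolation: `"aot_eager"` is a torch.compile backend that runs AOTAutograd's joint forward/backward tracing without Inductor's codegen (the function and variable names below are illustrative):

```python
import torch

def loss_fn(x, w):
    return torch.tanh(x @ w).sum()

# "aot_eager" traces forward and backward graphs ahead of time via
# AOTAutograd, then executes them with eager kernels.
compiled_loss = torch.compile(loss_fn, backend="aot_eager")

x = torch.randn(4, 3)
w = torch.randn(3, 2, requires_grad=True)
compiled_loss(x, w).backward()

# Gradients match the uncompiled version.
w_ref = w.detach().clone().requires_grad_(True)
loss_fn(x, w_ref).backward()
assert torch.allclose(w.grad, w_ref.grad, atol=1e-6)
```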
TorchInductor: Lowering and Code Generation
TorchInductor takes captured graphs and generates optimized code for target backends.
For GPUs, it commonly relies on Triton-generated kernels; for CPUs, it emits efficient code paths tailored to operation patterns.
Optimization opportunities include:
- Kernel fusion
- Memory access improvements
- Reduced intermediate tensor materialization
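The last two items are closely related. A language-level analogy in plain Python (this is not Inductor's actual generated code, just the shape of the transformation):

```python
# Unfused: two passes over the data, with one intermediate buffer
# materialized in memory between them.
def unfused(xs):
    tmp = [x * 2.0 for x in xs]        # intermediate tensor analogue
    return [t + 1.0 for t in tmp]      # second trip through memory

# Fused: a single pass and no intermediate buffer. Inductor performs
# the analogous rewrite on tensor kernels (e.g. via Triton on GPU),
# trading extra kernel launches and memory traffic for one combined loop.
def fused(xs):
    return [x * 2.0 + 1.0 for x in xs]

assert unfused([1.0, 2.0]) == fused([1.0, 2.0]) == [3.0, 5.0]
```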
Why This Architecture Works
The stack separates concerns:
- Capture high-level semantics from Python (Dynamo)
- Transform differentiation structure (AOTAutograd)
- Lower to efficient kernels (Inductor)
This modularity allows continued backend innovation without changing user code patterns.
Practical Constraints
Not all code is equally compilable. Graph breaks can occur due to unsupported Python patterns or highly dynamic behavior. Performance gains also vary by model architecture, hardware, and batch size.
Still, for many real workloads, this pipeline provides meaningful throughput improvements while preserving PyTorch ergonomics.