NanoChat Optimization
Project Description
NanoChat is a lightweight, educational LLM framework created by Andrej Karpathy that provides a minimal, end-to-end pipeline for building language models. This case study demonstrates how Artemis Intelligence was used to systematically optimize NanoChat's performance across training and inference workflows.
Project Repository: github.com/karpathy/nanochat
Goals
The purpose of this project is to apply Artemis Intelligence to NanoChat in order to:
- Optimize performance: improve training speed, inference latency, memory usage, and KV-cache efficiency.
- Benchmark and validate improvements: measure before vs. after using automated profiling.
- Support feature experimentation: use the Planner to propose and integrate new features (sampling, profiling hooks, training configs).
Optimization Workflows
1. Identify and fix bottlenecks
Optimise → Add Target → Agent Analysis → Detect Bottlenecks → Plan to Fix
2. Discover new feature opportunities
Using the Planner's "add new feature" instruction.
3. Scan for critical issues
Optimizations Implemented
| Optimization | Problem Addressed | Solution Applied |
|---|---|---|
| Auto Batch Size Discovery | Low GPU utilization (~35%) | Exponential search + binary refinement with safety margins |
| KV-Cache for Inference | O(T²) attention recomputation | Prefill once, cache key-value pairs for autoregressive decoding |
| torch.compile Integration | Python overhead, unoptimized kernels | Static shape compilation with operator fusion |
| Token Broadcasting Fix | Duplicate tokens across sequences | Independent sampling per sequence |
Performance Results
Training Optimizations
| Optimization | Training Speedup | Training Time Saved | Notes |
|---|---|---|---|
| Auto Batch | 1.90× (SFT), 1.04× (Base/Mid) | 47% / 4% | Maximizes GPU utilization |
| torch.compile | 1.67× | 40% | Operator fusion + static shapes |
| Combined | 3.17× (SFT), 1.74× (Base/Mid) | 68% / 43% | Compound effect |
Inference Optimizations
| Configuration | Wall-Clock Time | Throughput | Speedup | Time Saved |
|---|---|---|---|---|
| Baseline (batch=1, no KV-cache) | 21.41s | 81.4 tok/s | 1.0× | - |
| Optimized (batch=93, with KV-cache) | 2.04s | 855.7 tok/s | 10.5× | 91% |
Note: Throughput-based calculations suggest 88× theoretical speedup (99% time saved), but we benchmarked 10.5× speedup (91% time saved) in real wall-clock time. The difference is due to system-level overhead (memory management, kernel launches, synchronization).
Combined Results Summary
- SFT Training → 68% time reduction (3.17× faster)
- Base/Mid Training → 43% time reduction (1.74× faster)
- Inference → 91% time reduction (10.5× faster)
Detailed Optimizations
1. Auto Batch Size Discovery
Artemis Detection: Agent analysis revealed GPU utilization at only ~35% during SFT training, indicating severe under-batching.
Overview
Automated batch discovery using:
- Exponential growth
- Binary search
- 15% safety margin
- Batch caching
Discovery time: ~20 seconds
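The exact discovery routine is not reproduced in this case study; the sketch below illustrates the general pattern (exponential growth until the first out-of-memory failure, binary search between the last success and that failure, then a 15% safety margin). All names here are hypothetical, not NanoChat's or Artemis's actual API.

```python
import torch

def find_max_batch_size(try_step, start=1, safety_margin=0.15):
    """Illustrative sketch: find the largest batch size that fits in GPU memory.

    try_step(batch_size) should run one forward/backward pass and raise
    torch.cuda.OutOfMemoryError if the batch does not fit.
    """
    def fits(bs):
        try:
            try_step(bs)
            return True
        except torch.cuda.OutOfMemoryError:
            return False
        finally:
            torch.cuda.empty_cache()

    # Phase 1: exponential growth until the first failure.
    bs = start
    while fits(bs):
        bs *= 2
    lo, hi = bs // 2, bs          # largest known-good, smallest known-bad

    # Phase 2: binary search between the last success and the first failure.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fits(mid):
            lo = mid
        else:
            hi = mid

    # Phase 3: apply a safety margin to absorb fragmentation and activation spikes.
    return max(1, int(lo * (1 - safety_margin)))
```

The discovered value can then be cached (for example, keyed by model configuration and sequence length) so subsequent runs skip the ~20-second search.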
Why It Matters
- Removes manual tuning (10–30 minutes)
- Prevents underutilization (batch=4 → ~35% GPU usage)
- Adapts to GPU, model, and sequence length
- SFT benefits dramatically (severely under-batched)
SFT Training
| Metric | Hardcoded | Auto Batch | Improvement |
|---|---|---|---|
| Throughput | 50,164 tok/s | 95,337 tok/s | 1.90× |
| Time Saved | — | — | 47% |
| Batch Size | 4 | 93 | 23× larger |
| GPU Util | ~35% | ~92% | 2.6× |
Base/Mid Training
| Metric | Hardcoded | Auto Batch | Improvement |
|---|---|---|---|
| Throughput | baseline | baseline +4% | 1.04× |
| Time Saved | — | — | 4% |
The same auto-batching technique yielded a 1.90× speedup for SFT but only 1.04× for base training, mainly because of their different starting points: Base/Mid training was already near optimal (batch size 32 → 64), while SFT began heavily under-batched (batch size 4 → 93), leaving it far more room to improve.
2. KV-Cache for Inference
Artemis Detection: Planner's feature discovery identified missing KV-cache implementation—a standard optimization in modern LLM inference pipelines.
Problem
Without a cache, autoregressive decoding recomputes attention over the full prefix at every step, for O(T²) total work.
For 200 generated tokens, that is roughly 29,900 operations.
Solution
Cache keys/values:
- Prefill once
- Compute keys/values only for each new token
Complexity becomes O(T) (≈250 ops).
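NanoChat's actual cache layout is not shown here; the following minimal sketch (single attention head, no batching, illustrative shapes and names) shows the mechanic: project and cache K/V for the prompt once, then append one row per generated token and attend against the cache.

```python
import torch

d_model, prompt_len = 64, 32
wq, wk, wv = (torch.nn.Linear(d_model, d_model, bias=False) for _ in range(3))

def attend(q, k, v):
    # Scaled dot-product attention for a single new token: q is (1, d), k/v are (t, d).
    scores = (q @ k.T) / (d_model ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

# Prefill: project the whole prompt once and cache its keys/values.
prompt = torch.randn(prompt_len, d_model)
k_cache, v_cache = wk(prompt), wv(prompt)

# Decode: each new token is projected once and appended to the cache, so the
# per-step cost grows with the current sequence length instead of re-projecting
# the entire prefix at every step.
x_new = torch.randn(1, d_model)
k_cache = torch.cat([k_cache, wk(x_new)], dim=0)
v_cache = torch.cat([v_cache, wv(x_new)], dim=0)
out = attend(wq(x_new), k_cache, v_cache)   # (1, d_model)
```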
Measured Real-World Performance
Testing on NVIDIA A100-SXM4-80GB with 178.5M parameter model:
| Configuration | Wall-Clock Time | Throughput | Speedup |
|---|---|---|---|
| Baseline (batch=1, no cache) | 21.41s | 81.4 tok/s | 1.0× |
| Optimized (batch=93, with cache) | 2.04s | 855.7 tok/s | 10.5× |
Benchmarked Time Saved: 91%
Why KV-cache Is Not Used in Training
Training uses full-sequence teacher forcing, not autoregressive decoding.
3. torch.compile Integration
Artemis Detection: Agent analysis identified Python overhead and unoptimized kernel execution in training loops.
Overview
PyTorch 2.x graph compilation:
- Operator fusion
- Kernel specialization
- Removal of Python overhead
Using:
```python
model = torch.compile(model, dynamic=False)
```
Measured Results (100-step benchmark)
| Metric | No Compile | Compile | Improvement |
|---|---|---|---|
| Total Time | 64.4s | 38.6s | 40% less time |
| Throughput | 101,654 tok/s | 169,919 tok/s | 1.67× |
Why It Works
- 3–4 kernels fused into 1
- Static shapes allow optimized kernels
- Removes slow Python loops
Compilation requires discipline: Static shapes enable dramatic speedups (1.67×) but require careful tensor management.
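One common way to enforce this discipline is to make every compiled step see identical tensor shapes, e.g. by padding (or cropping) each batch to a fixed block size. The sketch below is illustrative only; `pad_to_block`, the block size, and the stand-in model are assumptions, not NanoChat's actual training loop.

```python
import torch
import torch.nn.functional as F

BLOCK_SIZE = 1024  # fixed sequence length the compiled graph expects (illustrative)

def pad_to_block(x: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    """Pad (or crop) a (B, T) token batch to (B, BLOCK_SIZE) so every compiled
    step sees the same shape and torch.compile never has to recompile."""
    B, T = x.shape
    if T >= BLOCK_SIZE:
        return x[:, :BLOCK_SIZE]
    return F.pad(x, (0, BLOCK_SIZE - T), value=pad_id)

# Stand-in model for the sketch; in NanoChat this would be the GPT module.
model = torch.nn.Sequential(torch.nn.Embedding(50304, 256), torch.nn.Linear(256, 256))
model = torch.compile(model, dynamic=False)  # specialize kernels to static shapes

tokens = torch.randint(0, 50304, (8, 517))   # ragged batch length
out = model(pad_to_block(tokens))            # always sees shape (8, BLOCK_SIZE)
```

Keeping the (B, T) shape constant lets torch.compile specialize and fuse kernels once instead of recompiling whenever a batch's length changes.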
4. Token Broadcasting Bug Fix
Artemis Detection: Scan caught an edge case where the first sampled token was incorrectly duplicated across all sequences.
Problem
The first sampled token was duplicated across all sequences.
Effects:
- No gradient diversity
- Reduced effective batch size
- Poor generation quality
Fix
```python
# Build a (micro_batch_size, T) tensor with one copy of the prompt tokens per
# sequence, so each row can then be sampled and advanced independently.
tokens = torch.tensor([tokens] * micro_batch_size, device=device)
```
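For the sampling side, the key property is that each row of the micro-batch draws its own next token. A minimal illustration follows (not NanoChat's actual decoding loop; shapes and vocabulary size are placeholders):

```python
import torch

# One decoding step for a micro-batch of 93 sequences over a 50,304-token vocab.
logits = torch.randn(93, 50304)
probs = torch.softmax(logits, dim=-1)

# torch.multinomial samples each row independently, so every sequence gets its
# own next token rather than a broadcast copy of the first sequence's sample.
next_tokens = torch.multinomial(probs, num_samples=1)   # shape (93, 1)
```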
Impact
- 23× more unique samples
- Correct gradient accumulation
- Improved generation diversity
Combined Performance Impact
Training Pipeline
SFT Training Results:
| Configuration | Throughput | Speedup | Time Saved |
|---|---|---|---|
| Baseline | 50,164 tok/s | 1.0× | - |
| + Auto Batch | 95,337 tok/s | 1.90× | 47% |
| + torch.compile | 159,213 tok/s | 3.17× | 68% |
Base/Mid Training Results:
| Configuration | Throughput | Speedup | Time Saved |
|---|---|---|---|
| Baseline | baseline | 1.0× | - |
| + Auto Batch | +4% | 1.04× | 4% |
| + torch.compile | +77% | 1.74× | 43% |
Inference Pipeline
| Configuration | Wall-Clock Time | Throughput | Speedup | Time Saved |
|---|---|---|---|---|
| Baseline (batch=1, no KV-cache) | 21.41s | 81.4 tok/s | 1.0× | - |
| Optimized (batch=93, with KV-cache) | 2.04s | 855.7 tok/s | 10.5× | 91% |
Technical Environment
- Hardware: NVIDIA A100-SXM4-80GB
- Model Size: 178.5M parameters
- Framework: PyTorch 2.x with torch.compile
- Precision: BFloat16
- Discovery Time: ~20 seconds for batch size optimization
- Benchmark Scripts:
test_combined_optimizations.py, measure_torch_compile_realistic.py
Conclusion
Through systematic application of Artemis Intelligence workflows, NanoChat achieved:
- 3.17× faster SFT training (68% time reduction)
- 1.74× faster Base/Mid training (43% time reduction)
- 10.5× faster inference (91% time reduction)
- Automated batch optimization eliminating 10-30 minutes of manual tuning per configuration
The training pipeline optimizations proved most impactful, with auto-batching addressing severe GPU underutilization (35% → 92%) and torch.compile providing consistent 1.67× speedups. Inference optimizations delivered substantial improvements as well, with a benchmarked 91% time reduction in production use.
These improvements make NanoChat more practical for real experimentation and research, while maintaining its transparent, hackable codebase designed for educational purposes.