NanoChat Optimization
Project Description
NanoChat is a lightweight, educational LLM framework created by Andrej Karpathy that provides a minimal, end-to-end pipeline for building language models. This case study demonstrates how Artemis Intelligence was used to systematically optimize NanoChat's performance across training and inference workflows.
Project Repository: github.com/karpathy/nanochat
Goals
The purpose of this project is to apply Artemis Intelligence to NanoChat in order to:
- Optimize performance: improve training speed, inference latency, memory usage, and KV-cache efficiency.
- Benchmark and validate improvements: measure before vs. after using automated profiling.
- Support feature experimentation: use the Planner to propose and integrate new features (sampling, profiling hooks, training configs).
Optimization Workflows
1. Identify and fix bottlenecks
Optimise → Add Target → Agent Analysis → Detect Bottlenecks → Plan to Fix
2. Discover new feature opportunities
Using the Planner's "add new feature" instruction.
3. Scan for critical issues
Optimizations Implemented
| Optimization | Problem Addressed | Solution Applied |
|---|---|---|
| Auto Batch Size Discovery | Low GPU utilization (~35%) | Exponential search + binary refinement with safety margins |
| KV-Cache for Inference | O(T²) attention recomputation | Prefill once, cache key-value pairs for autoregressive decoding |
| torch.compile Integration | Python overhead, unoptimized kernels | Static shape compilation with operator fusion |
| Token Broadcasting Fix | Duplicate tokens across sequences | Independent sampling per sequence |
Performance Results
Training Optimizations
| Optimization | Training Speedup | Training Time Saved | Notes |
|---|---|---|---|
| Auto Batch | 1.90× (SFT), 1.04× (Base/Mid) | 47% / 4% | Maximizes GPU utilization |
| torch.compile | 1.67× | 40% | Operator fusion + static shapes |
| Combined | 3.17× (SFT), 1.74× (Base/Mid) | 68% / 43% | Compound effect |
Inference Optimizations
| Configuration | Wall-Clock Time | Throughput | Speedup | Time Saved |
|---|---|---|---|---|
| Baseline (batch=1, no KV-cache) | 21.41s | 81.4 tok/s | 1.0× | - |
| Optimized (batch=93, with KV-cache) | 2.04s | 855.7 tok/s | 10.5× | 91% |
Note: Throughput-based calculations suggest 88× theoretical speedup (99% time saved), but we benchmarked 10.5× speedup (91% time saved) in real wall-clock time. The difference is due to system-level overhead (memory management, kernel launches, synchronization).
Combined Results Summary
- SFT Training → 68% time reduction (3.17× faster)
- Base/Mid Training → 43% time reduction (1.74× faster)
- Inference → 91% time reduction (10.5× faster)
Detailed Optimizations
1. Auto Batch Size Discovery
Artemis Detection: Agent analysis revealed GPU utilization at only ~35% during SFT training, indicating severe under-batching.
Overview
Automated batch discovery using:
- Exponential growth
- Binary search
- 15% safety margin
- Batch caching
Discovery time: ~20 seconds
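The exact discovery routine is not reproduced in this case study; the sketch below illustrates the general pattern (exponential growth until the first out-of-memory failure, binary search between the last success and that failure, then a 15% safety margin). All names here are hypothetical, not NanoChat's or Artemis's actual API.

```python
import torch

def find_max_batch_size(try_step, start=1, safety_margin=0.15):
    """Illustrative sketch: find the largest batch size that fits in GPU memory.

    try_step(batch_size) should run one forward/backward pass and raise
    torch.cuda.OutOfMemoryError if the batch does not fit.
    """
    def fits(bs):
        try:
            try_step(bs)
            return True
        except torch.cuda.OutOfMemoryError:
            return False
        finally:
            torch.cuda.empty_cache()

    # Phase 1: exponential growth until the first failure.
    bs = start
    while fits(bs):
        bs *= 2
    lo, hi = bs // 2, bs          # largest known-good, smallest known-bad

    # Phase 2: binary search between the last success and the first failure.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fits(mid):
            lo = mid
        else:
            hi = mid

    # Phase 3: apply a safety margin to absorb fragmentation and activation spikes.
    return max(1, int(lo * (1 - safety_margin)))
```

The discovered value can then be cached (for example, keyed by model configuration and sequence length) so subsequent runs skip the ~20-second search.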
Why It Matters
- Removes manual tuning (10–30 minutes)
- Prevents underutilization (batch=4 → ~35% GPU usage)
- Adapts to GPU, model, and sequence length
- SFT benefits dramatically (severely under-batched)
SFT Training
| Metric | Hardcoded | Auto Batch | Improvement |
|---|---|---|---|
| Throughput | 50,164 tok/s | 95,337 tok/s | 1.90× |
| Time Saved | — | — | 47% |
| Batch Size | 4 | 93 | 23× larger |
| GPU Util | ~35% | ~92% | 2.6× |
Base/Mid Training
| Metric | Hardcoded | Auto Batch | Improvement |
|---|---|---|---|
| Throughput | baseline | baseline +4% | 1.04× |
| Time Saved | — | — | 4% |
The same auto-batching technique yielded a 1.90× speedup for SFT but only 1.04× for base training, mainly because of their different starting points: Base/Mid training was already near optimal (batch size 32 → 64), while SFT began heavily under-batched (batch size 4 → 93), leaving it far more room to improve.
2. KV-Cache for Inference
Artemis Detection: Planner's feature discovery identified missing KV-cache implementation—a standard optimization in modern LLM inference pipelines.
Problem
Without a cache, autoregressive decoding recomputes attention over the full prefix at every step, for O(T²) total work.
For 200 generated tokens, that is roughly 29,900 operations.
Solution
Cache keys/values:
- Prefill once
- Compute keys/values only for each new token
Complexity becomes O(T) (≈250 ops).
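NanoChat's actual cache layout is not shown here; the following minimal sketch (single attention head, no batching, illustrative shapes and names) shows the mechanic: project and cache K/V for the prompt once, then append one row per generated token and attend against the cache.

```python
import torch

d_model, prompt_len = 64, 32
wq, wk, wv = (torch.nn.Linear(d_model, d_model, bias=False) for _ in range(3))

def attend(q, k, v):
    # Scaled dot-product attention for a single new token: q is (1, d), k/v are (t, d).
    scores = (q @ k.T) / (d_model ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

# Prefill: project the whole prompt once and cache its keys/values.
prompt = torch.randn(prompt_len, d_model)
k_cache, v_cache = wk(prompt), wv(prompt)

# Decode: each new token is projected once and appended to the cache, so the
# per-step cost grows with the current sequence length instead of re-projecting
# the entire prefix at every step.
x_new = torch.randn(1, d_model)
k_cache = torch.cat([k_cache, wk(x_new)], dim=0)
v_cache = torch.cat([v_cache, wv(x_new)], dim=0)
out = attend(wq(x_new), k_cache, v_cache)   # (1, d_model)
```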
Measured Real-World Performance
Testing on NVIDIA A100-SXM4-80GB with 178.5M parameter model:
| Configuration | Wall-Clock Time | Throughput | Speedup |
|---|---|---|---|
| Baseline (batch=1, no cache) | 21.41s | 81.4 tok/s | 1.0× |
| Optimized (batch=93, with cache) | 2.04s | 855.7 tok/s | 10.5× |
Benchmarked Time Saved: 91%
Why KV-cache Is Not Used in Training
Training uses full-sequence teacher forcing, not autoregressive decoding.
3. torch.compile Integration
Artemis Detection: Agent analysis identified Python overhead and unoptimized kernel execution in training loops.
Overview
PyTorch 2.x graph compilation:
- Operator fusion
- Kernel specialization
- Removal of Python overhead
Using:
```python
model = torch.compile(model, dynamic=False)
```
Measured Results (100-step benchmark)
| Metric | No Compile | Compile | Improvement |
|---|---|---|---|
| Total Time | 64.4s | 38.6s | 40% less time |
| Throughput | 101,654 tok/s | 169,919 tok/s | 1.67× |
Why It Works
- 3–4 kernels fused into 1
- Static shapes allow optimized kernels
- Removes slow Python loops
Compilation requires discipline: Static shapes enable dramatic speedups (1.67×) but require careful tensor management.
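One common way to enforce this discipline is to make every compiled step see identical tensor shapes, e.g. by padding (or cropping) each batch to a fixed block size. The sketch below is illustrative only; `pad_to_block`, the block size, and the stand-in model are assumptions, not NanoChat's actual training loop.

```python
import torch
import torch.nn.functional as F

BLOCK_SIZE = 1024  # fixed sequence length the compiled graph expects (illustrative)

def pad_to_block(x: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    """Pad (or crop) a (B, T) token batch to (B, BLOCK_SIZE) so every compiled
    step sees the same shape and torch.compile never has to recompile."""
    B, T = x.shape
    if T >= BLOCK_SIZE:
        return x[:, :BLOCK_SIZE]
    return F.pad(x, (0, BLOCK_SIZE - T), value=pad_id)

# Stand-in model for the sketch; in NanoChat this would be the GPT module.
model = torch.nn.Sequential(torch.nn.Embedding(50304, 256), torch.nn.Linear(256, 256))
model = torch.compile(model, dynamic=False)  # specialize kernels to static shapes

tokens = torch.randint(0, 50304, (8, 517))   # ragged batch length
out = model(pad_to_block(tokens))            # always sees shape (8, BLOCK_SIZE)
```

Keeping the (B, T) shape constant lets torch.compile specialize and fuse kernels once instead of recompiling whenever a batch's length changes.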
4. Token Broadcasting Bug Fix
Artemis Detection: Scan caught an edge case where the first sampled token was incorrectly duplicated across all sequences.
Problem
The first sampled token was duplicated across all sequences.
Effects:
- No gradient diversity
- Reduced effective batch size
- Poor generation quality
Fix
```python
# Build a (micro_batch_size, T) tensor with one copy of the prompt tokens per
# sequence, so each row can then be sampled and advanced independently.
tokens = torch.tensor([tokens] * micro_batch_size, device=device)
```
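For the sampling side, the key property is that each row of the micro-batch draws its own next token. A minimal illustration follows (not NanoChat's actual decoding loop; shapes and vocabulary size are placeholders):

```python
import torch

# One decoding step for a micro-batch of 93 sequences over a 50,304-token vocab.
logits = torch.randn(93, 50304)
probs = torch.softmax(logits, dim=-1)

# torch.multinomial samples each row independently, so every sequence gets its
# own next token rather than a broadcast copy of the first sequence's sample.
next_tokens = torch.multinomial(probs, num_samples=1)   # shape (93, 1)
```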
Impact
- 23× more unique samples
- Correct gradient accumulation
- Improved generation diversity
Combined Performance Impact
Training Pipeline
SFT Training Results:
| Configuration | Throughput | Speedup | Time Saved |
|---|---|---|---|
| Baseline | 50,164 tok/s | 1.0× | - |
| + Auto Batch | 95,337 tok/s | 1.90× | 47% |
| + torch.compile | 159,213 tok/s | 3.17× | 68% |
Base/Mid Training Results:
| Configuration | Throughput | Speedup | Time Saved |
|---|---|---|---|
| Baseline | baseline | 1.0× | - |
| + Auto Batch | +4% | 1.04× | 4% |
| + torch.compile | +77% | 1.74× | 43% |
Inference Pipeline
| Configuration | Wall-Clock Time | Throughput | Speedup | Time Saved |
|---|---|---|---|---|
| Baseline (batch=1, no KV-cache) | 21.41s | 81.4 tok/s | 1.0× | - |
| Optimized (batch=93, with KV-cache) | 2.04s | 855.7 tok/s | 10.5× | 91% |
Technical Environment
- Hardware: NVIDIA A100-SXM4-80GB
- Model Size: 178.5M parameters
- Framework: PyTorch 2.x with torch.compile
- Precision: BFloat16
- Discovery Time: ~20 seconds for batch size optimization
- Benchmark Scripts:
test_combined_optimizations.py, measure_torch_compile_realistic.py
Conclusion
Through systematic application of Artemis Intelligence workflows, NanoChat achieved:
- 3.17× faster SFT training (68% time reduction)
- 1.74× faster Base/Mid training (43% time reduction)
- 10.5× faster inference (91% time reduction)
- Automated batch optimization eliminating 10-30 minutes of manual tuning per configuration
The training pipeline optimizations proved most impactful, with auto-batching addressing severe GPU underutilization (35% → 92%) and torch.compile providing consistent 1.67× speedups. Inference optimizations delivered substantial improvements as well, with a benchmarked 91% time reduction in production use.
These improvements make NanoChat more practical for real experimentation and research, while maintaining its transparent, hackable codebase designed for educational purposes.