
NanoChat Optimization

Project Description

NanoChat is a lightweight, educational LLM framework created by Andrej Karpathy that provides a minimal, end-to-end pipeline for building language models. This case study demonstrates how Artemis Intelligence was used to systematically optimize NanoChat's performance across training and inference workflows.

Project Repository: github.com/karpathy/nanochat

Goals

The purpose of this project is to apply Artemis Intelligence to NanoChat in order to:

  • Optimize performance
    Improve training speed, inference latency, memory usage, and KV-cache efficiency.
  • Benchmark and validate improvements
    Measure before vs after using automated profiling.
  • Support feature experimentation
    Use the Planner to propose and integrate new features (sampling, profiling hooks, training configs).

Optimization Workflows

1. Identify and fix bottlenecks

Optimise → Add Target → Agent Analysis → Detect Bottlenecks → Plan to Fix

2. Discover new feature opportunities

Using the Planner's "add new feature" instruction.

3. Scan for critical issues

Scan → Plan to Fix

Optimizations Implemented

| Optimization | Problem Addressed | Solution Applied |
| --- | --- | --- |
| Auto Batch Size Discovery | Low GPU utilization (~35%) | Exponential search + binary refinement with safety margins |
| KV-Cache for Inference | O(T²) attention recomputation | Prefill once, cache key-value pairs for autoregressive decoding |
| torch.compile Integration | Python overhead, unoptimized kernels | Static shape compilation with operator fusion |
| Token Broadcasting Fix | Duplicate tokens across sequences | Independent sampling per sequence |

Performance Results

Training Optimizations

| Optimization | Training Speedup | Training Time Saved | Notes |
| --- | --- | --- | --- |
| Auto Batch | 1.90× (SFT), 1.04× (Base/Mid) | 47% / 4% | Maximizes GPU utilization |
| torch.compile | 1.67× | 40% | Operator fusion + static shapes |
| Combined | 3.17× (SFT), 1.74× (Base/Mid) | 68% / 43% | Compound effect |

Inference Optimizations

| Configuration | Wall-Clock Time | Throughput | Speedup | Time Saved |
| --- | --- | --- | --- | --- |
| Baseline (batch=1, no KV-cache) | 21.41s | 81.4 tok/s | 1.0× | - |
| Optimized (batch=93, with KV-cache) | 2.04s | 855.7 tok/s | 10.5× | 91% |

Note: Throughput-based calculations suggest 88× theoretical speedup (99% time saved), but we benchmarked 10.5× speedup (91% time saved) in real wall-clock time. The difference is due to system-level overhead (memory management, kernel launches, synchronization).

Combined Results Summary

  • SFT Training → 68% time reduction (3.17× faster)
  • Base/Mid Training → 43% time reduction (1.74× faster)
  • Inference → 91% time reduction (10.5× faster)

Detailed Optimizations

1. Auto Batch Size Discovery

Artemis Detection: Agent analysis revealed GPU utilization at only ~35% during SFT training, indicating severe under-batching.

Overview

Automated batch discovery using:

  • Exponential growth
  • Binary search
  • 15% safety margin
  • Batch caching

Discovery time: ~20 seconds
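
The routine below is a minimal sketch of this discovery procedure, assuming a try_batch(batch_size) helper that runs one forward/backward step at the given batch size and reports whether it fits in GPU memory; the helper and function names are illustrative, not NanoChat's actual implementation.

import torch

def find_max_batch_size(try_batch, start=1, safety_margin=0.15):
    """Discover the largest safe batch size.

    try_batch(bs) runs one forward/backward step at batch size `bs` and
    returns True if it completes without running out of memory.
    """
    # Phase 1: exponential growth until the first out-of-memory failure.
    bs = start
    while try_batch(bs):
        bs *= 2
    lo, hi = bs // 2, bs  # largest known-good, smallest known-bad

    # Phase 2: binary search between the last success and the first failure.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if try_batch(mid):
            lo = mid
        else:
            hi = mid

    # Phase 3: apply the safety margin so training keeps some memory headroom.
    torch.cuda.empty_cache()
    return max(1, int(lo * (1 - safety_margin)))

In practice, try_batch would catch torch.cuda.OutOfMemoryError and clear the CUDA cache between attempts, and the discovered value would be cached (for example, keyed by GPU, model size, and sequence length) so later runs skip the ~20-second search.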

Why It Matters

  • Removes manual tuning (10–30 minutes)
  • Prevents underutilization (batch=4 → ~35% GPU usage)
  • Adapts to GPU, model, and sequence length
  • SFT benefits dramatically (severely under-batched)

SFT Training

| Metric | Hardcoded | Auto Batch | Improvement |
| --- | --- | --- | --- |
| Throughput | 50,164 tok/s | 95,337 tok/s | 1.90× |
| Time Saved | - | - | 47% |
| Batch Size | 4 | 93 | 23× larger |
| GPU Util | ~35% | ~92% | +2.6× |

Base/Mid Training

| Metric | Hardcoded | Auto Batch | Improvement |
| --- | --- | --- | --- |
| Throughput | baseline | baseline +4% | 1.04× |
| Time Saved | - | - | 4% |
Note: The same auto-batching technique resulted in a 1.90× speedup for SFT but only 1.04× for base training, mainly because their initial configurations differed. Base/Mid training was already near optimal (batch size 32 → 64), while SFT began heavily under-batched (batch size 4 → 93), giving it much more room to improve.

2. KV-Cache for Inference

Artemis Detection: Planner's feature discovery identified missing KV-cache implementation—a standard optimization in modern LLM inference pipelines.

Problem

Autoregressive decoding recomputes attention over the entire prefix at every step, so total work grows as O(T²).
For 200 generated tokens this is ~29,900 attention operations (e.g., with a 50-token prompt: 50 + 51 + … + 249 = 29,900).

Solution

Cache keys/values:

  • Prefill the prompt once, storing its keys and values
  • For each new token, compute and append only that token's keys and values

Complexity becomes O(T): each token is processed once (≈250 ops for the same 50-token prompt + 200 generated tokens).
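
The loop below is a minimal sketch of prefill-then-decode with a KV-cache, assuming a model whose forward pass accepts and returns per-layer key/value tensors through a kv_cache argument; that signature is an assumption for illustration, not NanoChat's actual API.

import torch

@torch.no_grad()
def generate(model, prompt_ids, max_new_tokens):
    """Prefill the prompt once, then decode one token at a time,
    reusing cached keys/values instead of re-running the full prefix."""
    # Prefill: one forward pass over the whole prompt, building the cache.
    logits, kv_cache = model(prompt_ids, kv_cache=None)  # illustrative signature
    next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
    out = [next_id]

    # Decode: each step feeds only the newest token; attention reads the cache.
    for _ in range(max_new_tokens - 1):
        logits, kv_cache = model(next_id, kv_cache=kv_cache)
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        out.append(next_id)

    return torch.cat([prompt_ids] + out, dim=1)

The key property is that every decode step processes a single token and reads the rest from the cache, which is where the O(T²) → O(T) reduction comes from.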

Measured Real-World Performance

Testing on an NVIDIA A100-SXM4-80GB with a 178.5M-parameter model:

| Configuration | Wall-Clock Time | Throughput | Speedup |
| --- | --- | --- | --- |
| Baseline (batch=1, no cache) | 21.41s | 81.4 tok/s | 1.0× |
| Optimized (batch=93, with cache) | 2.04s | 855.7 tok/s | 10.5× |

Benchmarked Time Saved: 91%

Why KV-cache Is Not Used in Training

Training uses full-sequence teacher forcing: every position is computed in a single forward pass over the whole sequence, so there is no token-by-token decoding for a cache to accelerate.

3. torch.compile Integration

Artemis Detection: Agent analysis identified Python overhead and unoptimized kernel execution in training loops.

Overview

PyTorch 2.x graph compilation:

  • Operator fusion
  • Kernel specialization
  • Removal of Python overhead

Using:

model = torch.compile(model, dynamic=False)

Measured Results (100-step benchmark)

| Metric | No Compile | Compile | Improvement |
| --- | --- | --- | --- |
| Total Time | 64.4s | 38.6s | 40% faster |
| Throughput | 101,654 tok/s | 169,919 tok/s | 1.67× |

Why It Works

  • 3–4 kernels fused into 1
  • Static shapes allow optimized kernels
  • Removes slow Python loops

Compilation requires discipline: Static shapes enable dramatic speedups (1.67×) but require careful tensor management.
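
As an illustration of that "careful tensor management," the sketch below pads every batch to one fixed sequence length before the compiled forward pass, so torch.compile with dynamic=False always sees the same static shape and never triggers recompilation; pad_to_block, BLOCK_SIZE, and PAD_ID are hypothetical names, not NanoChat constants.

import torch
import torch.nn.functional as F

BLOCK_SIZE = 1024  # illustrative fixed sequence length, not a NanoChat constant
PAD_ID = 0         # illustrative padding token id

def pad_to_block(ids: torch.Tensor) -> torch.Tensor:
    """Right-pad a (batch, seq) token tensor to BLOCK_SIZE so the compiled
    graph always receives the same static shape and is never recompiled."""
    pad = BLOCK_SIZE - ids.size(1)
    return F.pad(ids, (0, pad), value=PAD_ID) if pad > 0 else ids[:, :BLOCK_SIZE]

# model = torch.compile(model, dynamic=False)  # as above
# logits = model(pad_to_block(batch_ids))      # every call sees (B, BLOCK_SIZE)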

4. Token Broadcasting Bug Fix

Artemis Detection: Scan caught an edge case where the first sampled token was incorrectly duplicated across all sequences.

Problem

The first sampled token was duplicated across all sequences.

Effects:

  • No gradient diversity
  • Reduced effective batch size
  • Poor generation quality

Fix

# Replicate the prompt into one row per sequence so each row samples independently
tokens = torch.tensor([tokens] * micro_batch_size, device=device)

Impact

  • 23× more unique samples
  • Correct gradient accumulation
  • Improved generation diversity
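
To make the failure mode concrete, here is a small self-contained illustration of the buggy broadcast pattern versus independent per-sequence sampling (not NanoChat's actual generation code):

import torch

torch.manual_seed(0)
probs = torch.softmax(torch.randn(4, 100), dim=-1)  # 4 sequences, 100-token vocab

# Buggy pattern: sample once, then broadcast the same token to every sequence.
first = torch.multinomial(probs[0], num_samples=1)
broadcast_tokens = first.expand(4, 1)                # all 4 rows identical

# Fixed pattern: one independent draw per sequence (row-wise multinomial).
independent_tokens = torch.multinomial(probs, num_samples=1)  # shape (4, 1), rows differ

Sampling row-wise is what restores per-sequence diversity and the gradient signal described above.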

Combined Performance Impact

Training Pipeline

SFT Training Results:

| Configuration | Throughput | Speedup | Time Saved |
| --- | --- | --- | --- |
| Baseline | 50,164 tok/s | 1.0× | - |
| + Auto Batch | 95,337 tok/s | 1.90× | 47% |
| + torch.compile | 159,213 tok/s | 3.17× | 68% |

Base/Mid Training Results:

| Configuration | Throughput | Speedup | Time Saved |
| --- | --- | --- | --- |
| Baseline | baseline | 1.0× | - |
| + Auto Batch | +4% | 1.04× | 4% |
| + torch.compile | +77% | 1.74× | 43% |

Inference Pipeline

| Configuration | Wall-Clock Time | Throughput | Speedup | Time Saved |
| --- | --- | --- | --- | --- |
| Baseline (batch=1, no KV-cache) | 21.41s | 81.4 tok/s | 1.0× | - |
| Optimized (batch=93, with KV-cache) | 2.04s | 855.7 tok/s | 10.5× | 91% |

Technical Environment

  • Hardware: NVIDIA A100-SXM4-80GB
  • Model Size: 178.5M parameters
  • Framework: PyTorch 2.x with torch.compile
  • Precision: BFloat16
  • Discovery Time: ~20 seconds for batch size optimization
  • Benchmark Scripts: test_combined_optimizations.py, measure_torch_compile_realistic.py

Conclusion

Through systematic application of Artemis Intelligence workflows, NanoChat achieved:

  • 3.17× faster SFT training (68% time reduction)
  • 1.74× faster Base/Mid training (43% time reduction)
  • 10.5× faster inference (91% time reduction)
  • Automated batch optimization eliminating 10-30 minutes of manual tuning per configuration

The training pipeline optimizations proved most impactful, with auto-batching addressing severe GPU underutilization (35% → 92%) and torch.compile providing a consistent 1.67× speedup. Inference optimizations delivered substantial improvements as well, with a benchmarked 91% reduction in wall-clock time.

These improvements make NanoChat more practical for real experimentation and research, while maintaining its transparent, hackable codebase designed for educational purposes.