ALE Benchmark Optimization
Project Overview
Goal: Optimize AI agents for competitive programming using the ALE Benchmark through evolutionary prompt engineering and advanced search strategies
Before: Baseline zero-shot agent (66% acceptance rate)
After: Optimized agent with CoT and evolutionary search (75% acceptance rate, a +13.6% relative improvement)
Target Users: AI researchers, competitive programming enthusiasts, teams building production-grade agents
Use Cases:
- Optimizing AI agents for algorithm design problems
- Evolutionary prompt refinement for competitive programming
- Cost-effective model selection (small models vs. large flagship models)
- Benchmarking agentic capabilities under real-world constraints
Background
AI agents are transitioning from experiments to production systems. The challenge: should you use the most powerful (and expensive) models, or can smaller, optimized models achieve comparable performance through intelligent engineering?
This is an optimization problem. When you have a narrowly defined goal—like competitive programming—you don't need maximum generalist capability. Smaller models (under 10B parameters) can match larger ones on specific tasks while running 15× faster and costing 10-30× less [Belcak et al., 2025]. The key insight: most agentic tasks are repetitive, scoped, and non-conversational.
The ALE Benchmark [Imajuku et al., 2025] provides a rigorous testbed for this question: can smaller, tuned models be evolved into agents that deliver reliable, efficient solutions at scale? ALE focuses on heuristic optimization problems drawn from real AtCoder contests, comprising 40 challenging tasks that resist saturation and leave significant room for improvement.
Implementation
We explored two optimization paths to improve the baseline agent's performance.
Phase 1: Prompt Optimization (CoT + Self-Correction)
Evolved prompts from simple chain-of-thought to sophisticated problem-solving with embedded self-correction. The baseline approach uses straightforward reasoning:
Baseline Prompt Structure — Simple chain-of-thought reasoning:
# From agent_baseline/prompts/prompt.py
SOLUTION APPROACH:
STEP 1 - PROBLEM UNDERSTANDING:
• Identify problem type (graph, optimization, simulation, etc.)
• Determine complexity constraints
• Extract key patterns and requirements
STEP 2 - ALGORITHM SELECTION:
• Choose primary approach based on problem type
• Consider alternative methods
• Plan implementation strategy
STEP 3 - IMPLEMENTATION STRATEGY:
• Select data structures
• Plan solution flow
• Consider edge cases
STEP 4 - CODE GENERATION:
• Implement the algorithm
• Add necessary optimizations
• Ensure correctness before optimization
Optimized Prompt Structure — Enhanced with embedded self-correction:
# From agent_optimized/prompts/prompt.py (383 lines)
STEP 5 - SELF-CORRECTION CHECKLIST:
Before writing the final code, mentally walk through your planned implementation:
• Empty inputs / smallest allowed input size
• Inputs at maximum constraint values
• Duplicates, sorted/reversed order, or special values if relevant
• Data types: Are all chosen types large enough? (e.g., 64-bit for 10^18)
• Does the solution logic handle all branches and early-exit cases?
• Time complexity: Will it meet the problem's constraints?
• Account for all relevant boundary and off-by-one errors
• If floating-point: Will precision issues impact correctness?
• For graph problems: disconnected graphs, self-loops, multiple edges?
• Are all edge cases described in the problem statement addressed?
Results: Acceptance Rate +13.6% (75.0%), Performance +8.9% (546.0 ± 61.5 points), Cost maintained ~$25/eval.
Phase 2: Search Optimization (Genetic Algorithm + Advanced Search)
Employed genetic algorithms with beam search and taboo memory for systematic solution space exploration. The optimized agent implements multiple complementary search strategies:
A* Search with Heuristics — Informed search for solution space exploration:
# From agent_optimized/search/algorithms.py
import heapq
from typing import Dict, List, Set, Tuple

class AStarSearch:
    def __init__(self, heuristic_fn=None, timing_manager=None):
        self.heuristic_fn = heuristic_fn or self._default_heuristic
        self.timing_manager = timing_manager
        self.open_set: List[Tuple[float, int, SolutionCandidate]] = []
        self.closed_set: Set[str] = set()
        self.g_scores: Dict[str, float] = {}  # Path cost from start
        self.f_scores: Dict[str, float] = {}  # g + heuristic (h)
        self._push_count = 0  # Tie-breaker so equal f_scores never compare candidates

    def add_candidate(self, candidate: SolutionCandidate, g_score=None):
        """Add candidate with g_score (cost) and computed f_score (priority)"""
        g_score = 0.0 if g_score is None else g_score
        h_score = self.heuristic_fn(candidate)  # Remaining work estimate
        f_score = g_score + h_score  # A* priority: f = g + h
        self.g_scores[candidate.get_hash()] = g_score
        self.f_scores[candidate.get_hash()] = f_score
        heapq.heappush(self.open_set, (f_score, self._push_count, candidate))
        self._push_count += 1

    def _default_heuristic(self, candidate: SolutionCandidate) -> float:
        """Heuristic: estimated remaining distance to a perfect solution"""
        remaining_score = max(0, 1.0 - candidate.score)
        if candidate.judge_result == "WRONG_ANSWER":
            remaining_score += 0.3  # Logic is wrong: more work needed
        elif candidate.judge_result == "TIME_LIMIT_EXCEEDED":
            remaining_score += 0.2  # Logic may be right: optimization possible
        return remaining_score * 0.8
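The excerpts in this section all reference a SolutionCandidate type defined elsewhere in the repository. For readability, here is a minimal sketch of the fields the snippets rely on (code, language, score, judge_result, get_hash); this is a hypothetical reconstruction, and the repo's actual class may differ:
# Hypothetical reconstruction of the candidate type used by the search code;
# the actual class in agent_optimized/search/ may differ.
import hashlib
from dataclasses import dataclass

@dataclass
class SolutionCandidate:
    code: str                      # Candidate source code
    language: str                  # e.g., "python" or "cpp"
    score: float = 0.0             # Normalized judge score in [0, 1]
    judge_result: str = "PENDING"  # e.g., "ACCEPTED", "WRONG_ANSWER"

    def get_hash(self) -> str:
        """Stable identity used by the taboo memory and A* bookkeeping."""
        return hashlib.sha256(self.code.encode("utf-8")).hexdigest()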
Genetic Algorithm — Population-based evolutionary search:
# From agent_optimized/search/algorithms.py
import random
from typing import List

class GeneticAlgorithm:
    def __init__(self, population_size=50, mutation_rate=0.1,
                 crossover_rate=0.7, elite_size=10):
        self.population_size = population_size
        self.mutation_rate = mutation_rate
        self.crossover_rate = crossover_rate
        self.elite_size = elite_size
        self.population: List[SolutionCandidate] = []
        self.generation = 0

    def _crossover(self, parent1, parent2) -> SolutionCandidate:
        """Combine two solutions at random crossover points"""
        code1_lines = parent1.code.split('\n')
        code2_lines = parent2.code.split('\n')
        point1 = random.randint(1, max(1, len(code1_lines) - 1))
        point2 = random.randint(1, max(1, len(code2_lines) - 1))
        # Offspring inherits a prefix from one parent and a suffix from the other
        offspring_code = '\n'.join(code1_lines[:point1] + code2_lines[point2:])
        return SolutionCandidate(offspring_code, parent1.language,
                                 score=(parent1.score + parent2.score) / 2)

    def evolve_generation(self):
        """Evolve population: select elite, create offspring, mutate"""
        self.population.sort(key=lambda x: x.score, reverse=True)
        # Keep best solutions unchanged (elitism)
        new_population = self.population[:self.elite_size]
        # Fill the remainder through crossover and occasional mutation
        while len(new_population) < self.population_size:
            parent1 = self._tournament_selection()
            parent2 = self._tournament_selection()
            child = self._crossover(parent1, parent2)
            if random.random() < self.mutation_rate:
                child = self._mutate(child)
            new_population.append(child)
        self.population = new_population
        self.generation += 1
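Note that evolve_generation depends on _tournament_selection and _mutate, which are elided from the excerpt above. Tournament selection is a standard GA operator: sample a few candidates at random and keep the fittest. A minimal sketch of the idea (the repo's implementation may differ):
import random

def _tournament_selection(self, tournament_size: int = 3) -> SolutionCandidate:
    """Select the fittest of a small random sample of the population."""
    contenders = random.sample(self.population,
                               min(tournament_size, len(self.population)))
    return max(contenders, key=lambda c: c.score)
Larger tournaments raise selection pressure toward elite solutions; smaller ones preserve population diversity.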
Taboo Search — Memory-based search to avoid revisiting solutions:
# From agent_optimized/search/algorithms.py
from collections import OrderedDict, defaultdict
from typing import Dict

class TabooSearch:
    def __init__(self, max_size=1000):
        self.max_size = max_size
        self.lru_cache: OrderedDict[str, SolutionCandidate] = OrderedDict()
        self.access_count: Dict[str, int] = defaultdict(int)

    def is_taboo(self, candidate: SolutionCandidate) -> bool:
        """Check if solution was recently explored"""
        candidate_hash = candidate.get_hash()
        if candidate_hash in self.lru_cache:
            self.lru_cache.move_to_end(candidate_hash)  # Refresh LRU position
            self.access_count[candidate_hash] += 1
            return True
        return False

    def add_to_taboo(self, candidate: SolutionCandidate):
        """Mark solution as recently explored"""
        candidate_hash = candidate.get_hash()
        if len(self.lru_cache) >= self.max_size:
            # Evict the least recently used entry
            oldest_hash, _ = self.lru_cache.popitem(last=False)
            self.access_count.pop(oldest_hash, None)
        self.lru_cache[candidate_hash] = candidate
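In a search loop, the taboo memory is consulted before spending evaluation budget on a candidate. A usage sketch, where generate_candidates and evaluate are placeholders for the agent's actual generation and judging steps, not repo functions:
taboo = TabooSearch(max_size=1000)

for candidate in generate_candidates():  # placeholder for LLM-driven generation
    if taboo.is_taboo(candidate):
        continue                         # Skip solutions already explored
    evaluate(candidate)                  # placeholder for the local judge
    taboo.add_to_taboo(candidate)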
Parallel Search Orchestration — Coordinate multiple algorithms:
# From agent_optimized/search/algorithms.py
from concurrent.futures import ThreadPoolExecutor

class ParallelSearchOrchestrator:
    def __init__(self, max_workers=None):
        self.executor = ThreadPoolExecutor(max_workers=max_workers or 8)
        self.timing_manager = TimingManager()
        self.taboo_search = TabooSearch(max_size=2000)
        self.work_stealing_queue = WorkStealingQueue()

    def run_parallel_search(self, initial_candidates,
                            search_algorithms=None, max_iterations=10):
        """Execute multiple search strategies in parallel"""
        # Launch A*, the genetic algorithm, and beam search simultaneously;
        # each algorithm explores the solution space independently.
        # Work stealing rebalances load when some threads finish early,
        # and the shared taboo memory prevents duplicate work across algorithms.
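Beam search appears among the launched strategies but is not excerpted above. The core idea is to expand many candidates in parallel while keeping only the top-k per iteration; a minimal sketch, where expand_fn and beam_width are illustrative rather than the repo's API:
def beam_search(initial_candidates, expand_fn, beam_width=5, max_iterations=10):
    """Keep only the best `beam_width` candidates alive at each iteration."""
    beam = sorted(initial_candidates, key=lambda c: c.score, reverse=True)[:beam_width]
    for _ in range(max_iterations):
        # Expand every surviving candidate into new variants
        successors = [child for cand in beam for child in expand_fn(cand)]
        if not successors:
            break
        # Prune back to the beam width, keeping the highest scorers
        beam = sorted(beam + successors, key=lambda c: c.score, reverse=True)[:beam_width]
    return beam[0]  # Best candidate found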
Results: Acceptance Rate +9.3% (72.2%), Performance 530.9 ± 31.2 points with reduced variance, 29% runtime reduction, ~$24/eval.
Results
Key Improvements:
- Prompt Optimization: +13.6% acceptance rate (66% → 75%), reduced errors through self-correction
- Search Optimization: +9.3% acceptance rate, 29% faster execution, more consistent performance
- Cost Efficiency: Maintained sub-$26 per evaluation while improving performance
- Problem Recovery: Critical problems improved dramatically (e.g., ahc007: -61 → 669 points)
Repository
Complete source code available at github.com/turintech/ale-bench-optimisation
Quick start:
git clone https://github.com/turintech/ale-bench-optimisation
cd ale-bench-optimisation
pip install -r requirements.txt
python3 main_optimized.py --problems ahc001
References
- Belcak, P., Heinrich, G., Diao, S., Fu, Y., Dong, X., Muralidharan, S., Lin, Y. C., & Molchanov, P. (2025). "Small Language Models are the Future of Agentic AI." arXiv preprint arXiv:2506.02153.
- Imajuku, Y., Horie, K., Iwata, Y., Aoki, K., Takahashi, N., & Akiba, T. (2025). "ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering." arXiv preprint arXiv:2506.09050.