ALE Benchmark Optimization
Project Overview
Goal: Optimize AI agents for competitive programming using the ALE Benchmark through evolutionary prompt engineering and advanced search strategies
Before: Baseline zero-shot agent (66% acceptance rate)
After: Optimized agent with CoT and evolutionary search (75% acceptance rate, a +13.6% relative improvement)
Target Users: AI researchers, competitive programming enthusiasts, teams building production-grade agents
Use Cases:
- Optimizing AI agents for algorithm design problems
- Evolutionary prompt refinement for competitive programming
- Cost-effective model selection (small models vs. large flagship models)
- Benchmarking agentic capabilities under real-world constraints
Background
AI agents are transitioning from experiments to production systems. The challenge: should you use the most powerful (and expensive) models, or can smaller, optimized models achieve comparable performance through intelligent engineering?
This is an optimization problem. When you have a narrowly defined goal—like competitive programming—you don't need maximum generalist capability. Smaller models (under 10B parameters) can match larger ones on specific tasks while running 15× faster and costing 10-30× less [Belcak et al., 2025]. The key insight: most agentic tasks are repetitive, scoped, and non-conversational.
The ALE Benchmark [Imajuku et al., 2025] provides a rigorous testbed for this question: can smaller, tuned models be evolved into agents that deliver reliable, efficient solutions at scale? ALE focuses on heuristic optimization problems drawn from real AtCoder contests, comprising 40 challenging tasks that resist saturation and leave significant room for improvement.
Implementation
We explored two optimization paths to improve the baseline agent's performance.
Phase 1: Prompt Optimization (CoT + Self-Correction)
Evolved prompts from simple chain-of-thought to sophisticated problem-solving with embedded self-correction. The baseline approach uses straightforward reasoning:
Baseline Prompt Structure — Simple chain-of-thought reasoning:
# From agent_baseline/prompts/prompt.py
SOLUTION APPROACH:
STEP 1 - PROBLEM UNDERSTANDING:
• Identify problem type (graph, optimization, simulation, etc.)
• Determine complexity constraints
• Extract key patterns and requirements
STEP 2 - ALGORITHM SELECTION:
• Choose primary approach based on problem type
• Consider alternative methods
• Plan implementation strategy
STEP 3 - IMPLEMENTATION STRATEGY:
• Select data structures
• Plan solution flow
• Consider edge cases
STEP 4 - CODE GENERATION:
• Implement the algorithm
• Add necessary optimizations
• Ensure correctness before optimization
Optimized Prompt Structure — Enhanced with embedded self-correction:
# From agent_optimized/prompts/prompt.py (383 lines)
STEP 5 - SELF-CORRECTION CHECKLIST:
Before writing the final code, mentally walk through your planned implementation:
• Empty inputs / smallest allowed input size
• Inputs at maximum constraint values
• Duplicates, sorted/reversed order, or special values if relevant
• Data types: Are all chosen types large enough? (e.g., 64-bit for 10^18)
• Does the solution logic handle all branches and early-exit cases?
• Time complexity: Will it meet the problem's constraints?
• Account for all relevant boundary and off-by-one errors
• If floating-point: Will precision issues impact correctness?
• For graph problems: disconnected graphs, self-loops, multiple edges?
• Are all edge cases described in the problem statement addressed?
Results: Acceptance Rate +13.6% (75.0%), Performance +8.9% (546.0 ± 61.5 points), Cost maintained ~$25/eval.
Phase 2: Search Optimization (Genetic Algorithm + Advanced Search)
Employed genetic algorithms with beam search and taboo memory for systematic solution space exploration. The optimized agent implements multiple complementary search strategies:
A* Search with Heuristics — Informed search for solution space exploration:
# From agent_optimized/search/algorithms.py
import heapq
from typing import Dict, List, Set, Tuple

class AStarSearch:
    def __init__(self, heuristic_fn=None, timing_manager=None):
        self.heuristic_fn = heuristic_fn or self._default_heuristic
        self.timing_manager = timing_manager
        self.open_set: List[Tuple[float, int, SolutionCandidate]] = []
        self.closed_set: Set[str] = set()
        self.g_scores: Dict[str, float] = {}  # Path cost from start
        self.f_scores: Dict[str, float] = {}  # g + heuristic (h)
        self._push_count = 0  # Tie-breaker so equal f_scores never compare candidates

    def add_candidate(self, candidate: SolutionCandidate, g_score=None):
        """Add candidate with g_score (cost) and computed f_score (priority)"""
        g_score = 0.0 if g_score is None else g_score
        h_score = self.heuristic_fn(candidate)  # Remaining work estimate
        f_score = g_score + h_score  # A* priority: f = g + h
        self.g_scores[candidate.get_hash()] = g_score
        self.f_scores[candidate.get_hash()] = f_score
        heapq.heappush(self.open_set, (f_score, self._push_count, candidate))
        self._push_count += 1

    def _default_heuristic(self, candidate: SolutionCandidate) -> float:
        """Heuristic: estimated remaining distance to a perfect solution"""
        remaining_score = max(0, 1.0 - candidate.score)
        if candidate.judge_result == "WRONG_ANSWER":
            remaining_score += 0.3  # Logic is wrong: more work needed
        elif candidate.judge_result == "TIME_LIMIT_EXCEEDED":
            remaining_score += 0.2  # Logic may be right: optimization possible
        return remaining_score * 0.8
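The excerpts in this section all reference a SolutionCandidate type defined elsewhere in the repository. For readability, here is a minimal sketch of the fields the snippets rely on (code, language, score, judge_result, get_hash); this is a hypothetical reconstruction, and the repo's actual class may differ:
# Hypothetical reconstruction of the candidate type used by the search code;
# the actual class in agent_optimized/search/ may differ.
import hashlib
from dataclasses import dataclass

@dataclass
class SolutionCandidate:
    code: str                      # Candidate source code
    language: str                  # e.g., "python" or "cpp"
    score: float = 0.0             # Normalized judge score in [0, 1]
    judge_result: str = "PENDING"  # e.g., "ACCEPTED", "WRONG_ANSWER"

    def get_hash(self) -> str:
        """Stable identity used by the taboo memory and A* bookkeeping."""
        return hashlib.sha256(self.code.encode("utf-8")).hexdigest()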
Genetic Algorithm — Population-based evolutionary search:
# From agent_optimized/search/algorithms.py
import random
from typing import List

class GeneticAlgorithm:
    def __init__(self, population_size=50, mutation_rate=0.1,
                 crossover_rate=0.7, elite_size=10):
        self.population_size = population_size
        self.mutation_rate = mutation_rate
        self.crossover_rate = crossover_rate
        self.elite_size = elite_size
        self.population: List[SolutionCandidate] = []
        self.generation = 0

    def _crossover(self, parent1, parent2) -> SolutionCandidate:
        """Combine two solutions at random crossover points"""
        code1_lines = parent1.code.split('\n')
        code2_lines = parent2.code.split('\n')
        point1 = random.randint(1, max(1, len(code1_lines) - 1))
        point2 = random.randint(1, max(1, len(code2_lines) - 1))
        # Offspring inherits a prefix from one parent and a suffix from the other
        offspring_code = '\n'.join(code1_lines[:point1] + code2_lines[point2:])
        return SolutionCandidate(offspring_code, parent1.language,
                                 score=(parent1.score + parent2.score) / 2)

    def evolve_generation(self):
        """Evolve population: select elite, create offspring, mutate"""
        self.population.sort(key=lambda x: x.score, reverse=True)
        # Keep best solutions unchanged (elitism)
        new_population = self.population[:self.elite_size]
        # Fill the remainder through crossover and occasional mutation
        while len(new_population) < self.population_size:
            parent1 = self._tournament_selection()
            parent2 = self._tournament_selection()
            child = self._crossover(parent1, parent2)
            if random.random() < self.mutation_rate:
                child = self._mutate(child)
            new_population.append(child)
        self.population = new_population
        self.generation += 1
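Note that evolve_generation depends on _tournament_selection and _mutate, which are elided from the excerpt above. Tournament selection is a standard GA operator: sample a few candidates at random and keep the fittest. A minimal sketch of the idea (the repo's implementation may differ):
import random

def _tournament_selection(self, tournament_size: int = 3) -> SolutionCandidate:
    """Select the fittest of a small random sample of the population."""
    contenders = random.sample(self.population,
                               min(tournament_size, len(self.population)))
    return max(contenders, key=lambda c: c.score)
Larger tournaments raise selection pressure toward elite solutions; smaller ones preserve population diversity.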
Taboo Search — Memory-based search to avoid revisiting solutions:
# From agent_optimized/search/algorithms.py
from collections import OrderedDict, defaultdict
from typing import Dict

class TabooSearch:
    def __init__(self, max_size=1000):
        self.max_size = max_size
        self.lru_cache: OrderedDict[str, SolutionCandidate] = OrderedDict()
        self.access_count: Dict[str, int] = defaultdict(int)

    def is_taboo(self, candidate: SolutionCandidate) -> bool:
        """Check if solution was recently explored"""
        candidate_hash = candidate.get_hash()
        if candidate_hash in self.lru_cache:
            self.lru_cache.move_to_end(candidate_hash)  # Refresh LRU position
            self.access_count[candidate_hash] += 1
            return True
        return False

    def add_to_taboo(self, candidate: SolutionCandidate):
        """Mark solution as recently explored"""
        candidate_hash = candidate.get_hash()
        if len(self.lru_cache) >= self.max_size:
            # Evict the least recently used entry
            oldest_hash, _ = self.lru_cache.popitem(last=False)
            self.access_count.pop(oldest_hash, None)
        self.lru_cache[candidate_hash] = candidate
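In a search loop, the taboo memory is consulted before spending evaluation budget on a candidate. A usage sketch, where generate_candidates and evaluate are placeholders for the agent's actual generation and judging steps, not repo functions:
taboo = TabooSearch(max_size=1000)

for candidate in generate_candidates():  # placeholder for LLM-driven generation
    if taboo.is_taboo(candidate):
        continue                         # Skip solutions already explored
    evaluate(candidate)                  # placeholder for the local judge
    taboo.add_to_taboo(candidate)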
Parallel Search Orchestration — Coordinate multiple algorithms:
# From agent_optimized/search/algorithms.py
from concurrent.futures import ThreadPoolExecutor

class ParallelSearchOrchestrator:
    def __init__(self, max_workers=None):
        self.executor = ThreadPoolExecutor(max_workers=max_workers or 8)
        self.timing_manager = TimingManager()
        self.taboo_search = TabooSearch(max_size=2000)
        self.work_stealing_queue = WorkStealingQueue()

    def run_parallel_search(self, initial_candidates,
                            search_algorithms=None, max_iterations=10):
        """Execute multiple search strategies in parallel"""
        # Launch A*, the genetic algorithm, and beam search simultaneously;
        # each algorithm explores the solution space independently.
        # Work stealing rebalances load when some threads finish early,
        # and the shared taboo memory prevents duplicate work across algorithms.
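Beam search appears among the launched strategies but is not excerpted above. The core idea is to expand many candidates in parallel while keeping only the top-k per iteration; a minimal sketch, where expand_fn and beam_width are illustrative rather than the repo's API:
def beam_search(initial_candidates, expand_fn, beam_width=5, max_iterations=10):
    """Keep only the best `beam_width` candidates alive at each iteration."""
    beam = sorted(initial_candidates, key=lambda c: c.score, reverse=True)[:beam_width]
    for _ in range(max_iterations):
        # Expand every surviving candidate into new variants
        successors = [child for cand in beam for child in expand_fn(cand)]
        if not successors:
            break
        # Prune back to the beam width, keeping the highest scorers
        beam = sorted(beam + successors, key=lambda c: c.score, reverse=True)[:beam_width]
    return beam[0]  # Best candidate found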
Results: Acceptance Rate +9.3% (72.2%), Performance 530.9 ± 31.2 points with reduced variance, 29% runtime reduction, ~$24/eval.
Results
Key Improvements:
- Prompt Optimization: +13.6% acceptance rate (66% → 75%), reduced errors through self-correction
- Search Optimization: +9.3% acceptance rate, 29% faster execution, more consistent performance
- Cost Efficiency: Maintained sub-$26 per evaluation while improving performance
- Problem Recovery: Critical problems improved dramatically (e.g., ahc007: -61 → 669 points)
Repository
Complete source code available at github.com/turintech/ale-bench-optimisation
Quick start:
git clone https://github.com/turintech/ale-bench-optimisation
cd ale-bench-optimisation
pip install -r requirements.txt
python3 main_optimized.py --problems ahc001
References
- Belcak, P., Heinrich, G., Diao, S., Fu, Y., Dong, X., Muralidharan, S., Lin, Y. C., & Molchanov, P. (2025). "Small Language Models are the Future of Agentic AI." arXiv preprint arXiv:2506.02153.
- Imajuku, Y., Horie, K., Iwata, Y., Aoki, K., Takahashi, N., & Akiba, T. (2025). "ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering." arXiv preprint arXiv:2506.09050.