Optimizing Problem-Solving Agents with Artemis

Author: William Evers-Hood

What is Math-Odyssey?

Math Odyssey is a benchmark of 387 mathematics problems spanning algebra, calculus, geometry, and number theory, ranging from high school exercises to advanced Olympiad-style challenges [Fang et al., 2024]. These problems serve as proxies for real-world business tasks that require complex text comprehension, logical reasoning, and precise calculation. Automating these capabilities has applications in finance, healthcare, and any data-driven business where accurate problem-solving drives operational efficiency.

Introduction

How can we optimize problem-solving agents to deliver maximum business value? With Artemis, the answer lies in intelligent configuration and systematic optimization. Rather than chasing theoretical maximums or pushing for marginal accuracy gains, Artemis enables rapid iteration and precise balancing of multiple objectives (cost, speed, and accuracy) to align agent performance with practical business needs.

In this use case, we demonstrate how Artemis transforms a general-purpose agent framework into a highly efficient problem-solving system, achieving a 37% cost reduction while maintaining strong performance on the Math Odyssey benchmark. This showcases Artemis's core strength: making sophisticated AI optimization accessible and practical for real-world deployment.

Our Approach

Agent Framework

We chose CrewAI as the framework for this experiment due to its strong market presence and wide adoption by major companies. According to CrewAI Inc., their framework is currently used by 60% of Fortune 500 companies [CrewAI Inc., 2025]. CrewAI is a general-purpose, agentic framework with extensive open-source support. Our objective is to demonstrate how Artemis can transform broadly accessible frameworks for specific, challenging tasks, showing that widely used tools can achieve exceptional performance when properly optimized.

Base LLM

We selected Gemini 2.5 Flash for this optimization, a model that balances capability and efficiency. This choice demonstrates how Artemis can leverage modern, efficient models within agentic frameworks to achieve enterprise-grade performance at a fraction of the cost of flagship models. The agentic structure amplifies the model's capabilities, enabling it to tackle complex problem-solving tasks that traditionally required the most expensive LLMs.

Agent Optimization Strategy

Artemis optimizes agents through intelligent manipulation of parameters, prompts, and problem structure, efficiently exploring the configuration space to discover optimal settings. This approach works with any agent framework, making it universally applicable across different platforms and implementations.

For this optimization, Artemis Intelligence ran for 3 cycles, generating 5 instances per cycle. Each instance was verified against a subset of the benchmark to ensure compatibility. Artemis then assessed the relative changes in efficiency and used a genetic algorithm to generate new configurations from the best candidates of the current generation.

For these optimizations, the prompt given to Artemis was along the following lines:

ex. This code is designed to solve math problems agentically. I would like it to be optimized to reduce the number of Prompt tokens and Completion tokens used to solve a given problem. Please accomplish this task by manipulating the parameters, attributes, and most importantly token limits in each file. Please balance efficiency gains with this codes ability to solve math problems.
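The generate-evaluate-select cycle described above can be sketched as a small genetic loop. This is only an illustration under stated assumptions, not Artemis's actual implementation: the `Config` fields, the toy `evaluate` fitness function, and the mutation ranges are all hypothetical stand-ins. In a real run, `evaluate` would execute the agent on a benchmark subset and score cost against accuracy.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    # Hypothetical knobs; real agents expose many more parameters.
    max_tokens: int
    temperature: float


def evaluate(cfg: Config) -> float:
    """Toy fitness (assumption): higher is better.

    Stands in for running the agent on a benchmark subset. It rewards
    smaller token budgets but penalizes budgets too small to finish.
    """
    cost = cfg.max_tokens / 4096
    accuracy = min(1.0, cfg.max_tokens / 1500) - 0.1 * abs(cfg.temperature - 0.2)
    return accuracy - 0.5 * cost


def crossover(a: Config, b: Config, rng: random.Random) -> Config:
    # Each field is inherited from one of the two parents.
    return Config(
        max_tokens=rng.choice([a.max_tokens, b.max_tokens]),
        temperature=rng.choice([a.temperature, b.temperature]),
    )


def mutate(cfg: Config, rng: random.Random) -> Config:
    # Small random perturbation, clamped to sensible ranges.
    return Config(
        max_tokens=max(256, int(cfg.max_tokens * rng.uniform(0.7, 1.1))),
        temperature=min(1.0, max(0.0, cfg.temperature + rng.uniform(-0.1, 0.1))),
    )


def optimize(baseline: Config, cycles: int = 3, instances: int = 5,
             keep: int = 2, seed: int = 0):
    rng = random.Random(seed)
    parents = [baseline]
    best = baseline
    history = []
    for _ in range(cycles):
        # Generate a new generation from the current best candidates.
        candidates = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(instances)
        ]
        ranked = sorted(candidates, key=evaluate, reverse=True)
        parents = ranked[:keep]
        # Never lose the best configuration seen so far.
        best = max([best, ranked[0]], key=evaluate)
        history.append(evaluate(best))
    return best, history


if __name__ == "__main__":
    best, history = optimize(Config(max_tokens=4096, temperature=0.7))
    print(best, history)
```

The defaults mirror the 3 cycles and 5 instances per cycle used in this experiment; the number of survivors per generation (`keep`) is an arbitrary choice made for the sketch.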

Why Configuration Matters More Than Models

The Math Odyssey benchmark illustrates why configuration matters: real-world problems require coordinating multiple reasoning strategies. The benchmark's 387 problems span algebra, calculus, geometry, and number theory, domains that require diverse approaches, verification strategies, and error handling. It takes little extrapolation to see that many business use cases will demand similar breadth and complexity from an agent. The agent used here is configured through:

12+ specialized prompts: one for each agent's role, background, task, and expected output, plus system-level prompts that apply to all tasks and cover general execution practices.
Over 20 configurable parameters, including temperatures, retry thresholds, verbosity, agent execution style, reporting style, and token limits.

Configuration Complexity

12+ specialized prompts, 20+ configurable parameters, >10^20 combinations

The configuration space is vast when considering natural language variations in prompts, parameter settings, and multi-agent coordination patterns. With over 10^20 possible combinations, more than ten times the estimated 7.5×10^18 grains of sand on Earth, finding optimal settings through manual tuning would require significant effort from highly skilled engineers. Artemis automates this exploration, discovering configurations that would be impractical to find manually.
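To give a feel for how quickly such a space grows, here is a back-of-the-envelope calculation. It assumes, purely for illustration, 20 parameters with 10 discrete settings each; the article does not state how the 10^20 figure was derived, and prompt wording variations would only add to the count.

```python
import math

# Assumption for illustration: 20 parameters, each with 10 discrete settings.
settings_per_parameter = [10] * 20

# The number of distinct configurations is the product of the option counts.
parameter_combinations = math.prod(settings_per_parameter)

# Estimate of grains of sand on Earth, as cited in the text.
GRAINS_OF_SAND = 7.5e18

ratio = parameter_combinations / GRAINS_OF_SAND

print(f"combinations: {parameter_combinations:.1e}")
print(f"ratio to grains of sand: {ratio:.1f}")
```

Under these assumptions the parameter grid alone reaches 10^20 configurations, roughly 13 times the sand estimate.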

Optimized vs Original Agent

Optimally configured prompts: The optimized agent uses prompts that are as brief as possible without handicapping the LLM with too little context.
Configured parameters: The CrewAI framework has 20+ parameters that influence agent behavior. Artemis is given a list of these parameters and configures them optimally.
Hard cost limits: The optimized agent's largest efficiency gain comes from Artemis setting and testing hard limits, in the form of token limits, on the expense applied to each problem. Some tweaking and basic analysis are required to find the optimal configuration for each application.
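The hard cost limits above can be pictured as a ceiling on what any single problem is allowed to spend. The sketch below is framework-agnostic and hypothetical: `CostLimits`, its fields, and the prices passed in are illustrative assumptions, not CrewAI parameters or real model pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostLimits:
    # Hypothetical limits; names are illustrative, not a framework API.
    max_prompt_tokens: int
    max_completion_tokens: int
    max_attempts: int


def worst_case_cost(limits: CostLimits,
                    prompt_price_per_1k: float,
                    completion_price_per_1k: float) -> float:
    """Upper bound on spend per problem if every attempt hits its caps."""
    per_attempt = (
        limits.max_prompt_tokens / 1000 * prompt_price_per_1k
        + limits.max_completion_tokens / 1000 * completion_price_per_1k
    )
    return limits.max_attempts * per_attempt


def within_limits(prompt_tokens: int, completion_tokens: int,
                  attempt: int, limits: CostLimits) -> bool:
    """True if a call stays inside the configured caps (attempt is 1-based)."""
    return (
        prompt_tokens <= limits.max_prompt_tokens
        and completion_tokens <= limits.max_completion_tokens
        and attempt <= limits.max_attempts
    )


if __name__ == "__main__":
    limits = CostLimits(max_prompt_tokens=2000, max_completion_tokens=1000,
                        max_attempts=3)
    # Prices are placeholders in arbitrary units per 1,000 tokens.
    print(worst_case_cost(limits, 1.0, 2.0))
    print(within_limits(1500, 1200, 1, limits))
```

Because the bound is known before any problem is run, tightening a limit trades a predictable cost ceiling against the risk that harder problems are cut off before a solution is reached, which is why the limits need to be tested rather than guessed.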

Results

[Figure: Results graph]

37% cost reduction: Artemis achieved dramatic efficiency gains, reducing operational costs by over a third.

[Figure: Results graph]

36% median cost reduction: The optimization created consistent efficiency improvements across the entire problem set, not just outliers.
Performance maintained: These efficiency gains came with minimal impact on accuracy.

[Figure: Results graph]

Artemis achieved its optimization objective: a 37% reduction in evaluation costs and a 36% reduction in median cost per problem, while maintaining strong accuracy (the ~6% reduction in accuracy was not statistically significant). This demonstrates the power of intelligent configuration over brute-force approaches.
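The two cost figures measure different things: the total reduction is dominated by expensive problems, while the median reduction reflects the typical problem. A minimal sketch of how both could be computed from per-problem costs follows. The numbers are made up for illustration and are not data from this experiment.

```python
from statistics import median


def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


def cost_summary(baseline_costs, optimized_costs):
    """Compare total and median per-problem cost between two runs."""
    return {
        "total_reduction_pct": percent_reduction(
            sum(baseline_costs), sum(optimized_costs)
        ),
        "median_reduction_pct": percent_reduction(
            median(baseline_costs), median(optimized_costs)
        ),
    }


# Made-up per-problem costs in arbitrary units (NOT experimental data).
baseline = [2.0, 4.0, 6.0, 28.0]
optimized = [1.0, 3.0, 5.0, 21.0]

summary = cost_summary(baseline, optimized)
print(summary)
```

In this toy data the single expensive problem drives most of the total saving, so the total reduction exceeds the median reduction. When the two figures are close, as in the results reported here, the savings are spread across the problem set rather than concentrated in outliers.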

Semantic Layer Differences

[Figure: Semantic layer differences]

Artemis transforms simple baseline prompts into action-focused, precise instructions. While simplicity is often valued in prompt engineering, Artemis discovers that strategic complexity, when properly structured, drives both accuracy and efficiency. The optimized prompts use clear, direct language that better aligns model behavior with business objectives.

[Figure: Semantic layer differences]

Artemis systematically refines language for clarity and efficiency. For example, vague instructions like "learn about it" become "Directly solve," while verbose requests transform into precise directives: "Provide a solution... with minimal explanation... necessary calculations and the final answer."

Parameter Differences

[Figure: Parameter differences]

[Figure: Parameter differences]

Artemis implements strategic parameter adjustments that maximize efficiency. Key optimizations include intelligent token limits that prevent runaway costs while maintaining quality, and enhanced semantic definitions that focus agents on efficient execution patterns.

Workflow

Artemis transforms the optimization workflow by automating the discovery of optimal configurations. This allows teams to focus on strategic decisions and business logic while Artemis handles the complex task of parameter tuning. The result is faster deployment, better performance, and the ability to rapidly explore different cost-performance trade-offs to meet specific business requirements.

Conclusion

Artemis delivers measurable value as an intelligent optimization system for AI agents. On the Math Odyssey benchmark, it demonstrated that true optimization isn't about marginal accuracy gains, but about achieving the right balance of cost, speed, and performance for your specific use case.

By automating the discovery of optimal configurations, Artemis enables:

37% cost reduction with no statistically significant loss in accuracy
Rapid deployment of optimized agents
Strategic focus for development teams
Scalable optimization across different frameworks and models

For businesses deploying AI agents at scale, Artemis represents a critical capability: the ability to systematically optimize performance while controlling costs, making enterprise AI both powerful and practical.

References

Fang, M., Wan, X., Lu, F., Xing, F., & Zou, K. (2024). MathOdyssey: Benchmarking Mathematical Problem-Solving Skills in Large Language Models Using Odyssey Math Data. arXiv. https://arxiv.org/abs/2406.18321

CrewAI Inc. (2025). CrewAI (Version 0.177.0) [Computer software]. Available from https://github.com/crewAIInc/crewAI

Environmental Literacy Council. (n.d.). How Many Grains of Sand Are on the Earth? Accessed September 2025. https://enviroliteracy.org/how-many-grains-of-sand-are-there-on-earth/
