SimpleTES is a C++-backed, Python-driven framework designed for open-ended problem solving where the path to a solution depends less on longer model outputs and more on iterative, evaluator-guided search. It implements the propose → evaluate → refine loop as a first-class runtime primitive—allocating finite test-time compute across parallel exploration, refinement depth, candidate selection, and history-aware prompting. The project emerged from the paper Evaluation-driven Scaling for Scientific Discovery, and its reference implementation has discovered state-of-the-art solutions on 21 open-ended tasks across domains including program synthesis, circuit design, physical law discovery, and numerical optimization.

SimpleTES is neither a language model nor a fine-tuning toolkit. It is a scheduler and policy engine that orchestrates repeated interactions between a language model and an external evaluator—such as a compiler, verifier, simulator, or runtime profiler.
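In pseudocode, the core loop looks roughly like the sketch below. Everything here is illustrative rather than SimpleTES's actual API: propose stands in for an LLM call, and evaluate follows the score-plus-metadata contract described under "Getting it running".

# Illustrative sketch of the propose -> evaluate -> refine loop.
# None of these names are SimpleTES's real API; evaluate() is assumed
# to return (numeric_score, metadata) as described below.
def search(llm, evaluator, budget, num_chains, k_candidates):
    best = None
    per_chain = budget // num_chains
    for _ in range(num_chains):                      # parallel exploration
        history = []                                 # drives history-aware prompting
        spent = 0
        while spent + k_candidates <= per_chain:
            # propose: draw K candidates conditioned on prior feedback
            candidates = [llm.propose(history) for _ in range(k_candidates)]
            # evaluate: score each candidate with the external evaluator
            scored = [(evaluator.evaluate(c), c) for c in candidates]
            spent += k_candidates
            (score, meta), winner = max(scored, key=lambda s: s[0][0])  # local best-of-K
            if best is None or score > best[0]:
                best = (score, winner)
            # refine: compound the winner and its feedback into the next prompt
            history.append((winner, score, meta))
    return best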

What it does

  • Executes adaptive, budget-aware search over candidate solutions using a fixed evaluator budget N, distributed across four tunable levers: parallel chains (--num-chains), local best-of-K selection (--k-candidates), feedback-compounding refinement depth, and history-to-prompt selection policies (--selector); see the budget sketch after this list.
  • Supports pluggable LLM backends—including local models via vLLM or Hugging Face Transformers—and integrates with task-specific evaluators written in Python.
  • Ships with a runnable benchmark tree in datasets/, covering 21 problems with predefined seeds, assets, and evaluators (e.g., verifying synthesized Python programs or scoring circuit area/latency).
  • Provides two entry points: main_wizard.py for interactive exploration and main.py for scripted, reproducible, or cluster-based execution.
  • Releases the highest-scoring artifacts from the paper in best_results/, including discovered programs, circuit netlists, and fitted equations—each accompanied by evaluation logs and scores.
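
As a rough illustration of how a budget might split across those levers (an assumption made for exposition: the levers are treated as composing multiplicatively, which the actual selector policies need not do):

# Illustrative budget arithmetic only; SimpleTES's real accounting may differ.
N = 96                                    # fixed evaluator-call budget
num_chains, k_candidates = 8, 3           # --num-chains, --k-candidates
depth = N // (num_chains * k_candidates)  # refinement rounds per chain: 4
assert num_chains * k_candidates * depth == N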

Getting it running

SimpleTES requires Python 3.11 or later. The recommended installation uses uv, a fast Python package manager:

uv sync
uv sync --extra vllm  # enables vLLM token-forcing backend (optional)

If uv is unavailable, use pip:

pip install -e .

After installation, launch an interactive session with:

python main_wizard.py

This starts a guided CLI that lets you select a task from the datasets/ collection, choose an LLM backend, and adjust search parameters like --num-chains or --k-candidates. For scripted or batch runs, use main.py:

python main.py --task datasets/program_synthesis/humaneval_plus_v2/ --num-chains 8 --k-candidates 3
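
For larger sweeps, a thin driver can shell out to main.py with the documented flags. The script below is a sketch, not a utility shipped with the project:

# sweep.py: illustrative batch driver, not part of SimpleTES.
import subprocess

TASK = "datasets/program_synthesis/humaneval_plus_v2/"
for chains in (2, 4, 8, 16):
    # one reproducible run per chain count, using only documented flags
    subprocess.run(
        ["python", "main.py", "--task", TASK,
         "--num-chains", str(chains), "--k-candidates", "3"],
        check=True,
    )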

The framework expects evaluators to be implemented as Python classes conforming to a simple interface—evaluate(candidate) must return a numeric score and optional metadata. Task definitions live in datasets/ as self-contained directories with evaluator.py, prompt_template.txt, and seeds.json.
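
A minimal evaluator might look like the sketch below. Only the evaluate(candidate) contract (a numeric score plus optional metadata) comes from the interface described above; the class name, constructor, and pass/fail scoring rule are illustrative assumptions:

# evaluator.py: illustrative sketch of a task evaluator.
# Only the evaluate(candidate) -> (score, metadata) contract is from the
# description above; the class name and scoring rule are assumptions.
import subprocess
import tempfile

class ProgramSynthesisEvaluator:
    """Scores a candidate Python program by running its test suite."""

    def __init__(self, tests: str):
        self.tests = tests  # test code appended to every candidate

    def evaluate(self, candidate: str):
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(candidate + "\n" + self.tests)
            path = f.name
        proc = subprocess.run(
            ["python", path], capture_output=True, text=True, timeout=30
        )
        score = 1.0 if proc.returncode == 0 else 0.0
        metadata = {"stderr": proc.stderr[-500:]}  # feedback for refinement
        return score, metadata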

Who this is for

Researchers and engineers working on test-time scaling, automated scientific discovery, or evaluator-coupled reasoning—especially where correctness, runtime behavior, or physical constraints matter more than linguistic fluency. It suits use cases like generating verified code, optimizing hardware designs under timing constraints, or fitting interpretable models to experimental data. If your problem has a well-defined, scriptable evaluator (e.g., a SAT solver, a physics simulator, or a kernel profiler) and you want to systematically allocate compute across exploration and refinement rather than just token count, SimpleTES provides the runtime scaffolding.

It is not intended for end users seeking turnkey AI assistants, chat interfaces, or fine-tuned model deployments. There is no web UI, no API server, and no model hosting layer. It assumes familiarity with Python scripting, CLI tooling, and evaluator design. The simpletes/ subdirectory contains the core engine—scheduler logic, policy dispatch, and LLM interaction—but is not meant to be imported and used as a library outside the provided CLI workflows.

How it compares

SimpleTES diverges from mainstream test-time scaling approaches like Chain-of-Thought, Self-Consistency, or Tree-of-Thoughts, which typically increase compute by sampling more tokens or more answer candidates from a static prompt. Instead, it treats the evaluator as the central driver and scales the loop itself. It shares conceptual ground with programs like AlphaDev (search over assembly kernels) or DreamCoder (probabilistic program induction), but is more general in its task interface and policy configuration. Unlike reinforcement learning frameworks (e.g., RLlib) or Bayesian optimization toolkits (e.g., BoTorch), SimpleTES does not require reward shaping, gradient updates, or probabilistic modeling—it relies on deterministic, evaluator-provided scores and explicit search levers.

It is lighter than full-scale autonomous agent systems (e.g., AutoGen or Microsoft’s Agent Framework) but more structured than ad-hoc prompt engineering or simple best-of-N sampling. Its reliance on external evaluators makes it less portable than pure LLM-based solvers, but more grounded in objective outcomes.

The project has 82 stars on GitHub and is hosted at https://github.com/wq-will/SimpleTES, with documentation and a live demo available at https://www.wizardquant.com/will/simpletes.