sim-benchmark.
A public benchmark for developing and evaluating LLM agents on real CAE and EDA simulation workflows.
Current public suite: 31 tasks (20 LTspice circuits and 11 OpenFOAM 11 CFD cases), with artifact-grounded scoring and no LLM judge.
Current public leaderboard.
| Model | LTspice circuits (20 tasks) | OpenFOAM fluids (11 tasks) |
|---|---|---|
| Claude Opus 4.6 | 0.986 | 0.918 |
| MiniMax-M2.7-highspeed | 0.899 | 0.804 |
| MiniMax-M2.5-highspeed | 0.899 | 0.706 |
| MiniMax-M2.7 | 0.838 | 0.675 |
The OpenFOAM 11 cases cover Bénard convection, dam-break multiphase flow, DNS turbulence, an oblique shock, non-Newtonian flow, the pitzDaily backward-facing step, and lid-driven cavities. Scores range from 0 to 1; the GitHub repo contains the produced files and run details for every row.
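Artifact-grounded scoring means a reported number is checked against a value parsed out of a file the run actually produced, rather than graded by another LLM. A minimal sketch of that idea, with a hypothetical `score_numeric_claim` helper and an invented LTspice-style log line (the benchmark's real scoring code lives in the repo):

```python
import re

def score_numeric_claim(reported: float, artifact_text: str, pattern: str,
                        rel_tol: float = 0.05) -> float:
    """Score 1.0 if the agent's reported value matches the value parsed
    from the produced artifact within rel_tol, else 0.0. Illustrative
    sketch only, not the benchmark's actual contract."""
    m = re.search(pattern, artifact_text)
    if m is None:
        return 0.0  # claim not backed by any artifact value
    measured = float(m.group(1))
    denom = max(abs(measured), 1e-12)  # guard against division by zero
    return 1.0 if abs(reported - measured) / denom <= rel_tol else 0.0

# Hypothetical measurement line from a simulator log
log = "gain: v(out)/v(in)=9.98 at 1000 Hz"
print(score_numeric_claim(10.0, log, r"=([-\d.eE+]+)"))  # → 1.0
```

Real task scoring would combine several such checks per task, which is one way a per-suite score between 0 and 1 can arise.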
From prompt to trusted simulation evidence.
A passing agent has to do more than name the right physics: it has to operate the software, get a run to complete, and report numerical results that can be checked against the files it produced.
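"Get a run to complete" is itself checkable from artifacts. As one hedged example of such a check, an OpenFOAM solver log ends with a bare `End` line on clean exit and contains `FOAM FATAL ERROR` on failure; a hypothetical `run_completed` helper might look like this (a sketch, not the benchmark's actual verifier):

```python
def run_completed(solver_log: str) -> bool:
    """Heuristic completion check on an OpenFOAM solver log: clean runs
    finish with a bare 'End' line; failed runs print 'FOAM FATAL ERROR'.
    Illustrative only."""
    if "FOAM FATAL ERROR" in solver_log:
        return False
    return any(line.strip() == "End" for line in solver_log.splitlines())

print(run_completed("ExecutionTime = 4.2 s\nEnd\n"))  # → True
print(run_completed("--> FOAM FATAL ERROR\n"))        # → False
```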
The technical contract, task files, leaderboard artifacts, and reproduction guide are all public in the GitHub repo.
Two ways to use it.
The public benchmark is a starting point for both model builders and engineering teams. Labs can use it to develop industrial simulation capability; CAE leaders can use it to evaluate whether agents can actually run simulation work before choosing a model.