svd · ai · lab
public benchmark  ·  Harbor-based

sim-benchmark.

A public benchmark for developing and evaluating LLM agents on real CAE and EDA simulation workflows.

// leaderboard

Current public leaderboard rows.

Model leaderboard scores · 0 to 1

Model                      LTspice circuits (20 tasks)    OpenFOAM fluids (3 tasks)
Claude Opus 4.6            0.986                          1.000
MiniMax-M2.5-highspeed     0.936                          0.408
MiniMax-M2.7               0.884                          0.284

Initial task set: 20 LTspice circuit tasks and 3 OpenFOAM fluid tasks. Scores range from 0 to 1; the GitHub repo contains the produced files and run details.
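For readers parsing the columns: a 0-to-1 suite score of this kind is most naturally a mean over per-task scores. The Python sketch below assumes exactly that; this is an assumption for illustration, and the actual aggregation rule is documented in the repo.

```python
def column_score(task_scores: list[float]) -> float:
    """Aggregate per-task scores in [0, 1] into one leaderboard cell.
    Assumes a plain mean; the benchmark's real rule is defined in the repo."""
    if not task_scores:
        raise ValueError("no task scores")
    if any(not 0.0 <= s <= 1.0 for s in task_scores):
        raise ValueError("task scores must lie in [0, 1]")
    return sum(task_scores) / len(task_scores)

# Three hypothetical fluid tasks: one full pass, one partial credit, one failure.
print(round(column_score([1.0, 0.5, 0.0]), 3))  # 0.5
```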

// what is tested

From prompt to trusted simulation evidence.

A passing agent has to do more than name the right physics. It has to operate the software, get a run to complete, and report numerical results that can be verified from the produced files.

Operate · Can it use real simulation tools? The model works inside LTspice and OpenFOAM environments, not a toy text prompt.
Recover · Can it debug the workflow? Solver setup, convergence, logs, and post-processing all create failure modes beyond ordinary coding tasks.
Evidence · Can engineers trust the output? The verifier checks produced artifacts. There is no LLM judge and no credit for unreplayable numbers.
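To make the evidence bar concrete, here is a minimal sketch of an artifact check in Python. It assumes a hypothetical task whose LTspice run must write a .meas result named vout_max into its .log file; the path, measurement name, expected value, and tolerance are all illustrative, and the actual verifier contract lives in the GitHub repo.

```python
import re
from pathlib import Path

# Hypothetical task spec; real expected values and tolerances live in the repo's task files.
EXPECTED = {"meas": "vout_max", "value": 4.20, "rel_tol": 0.02}

# LTspice .meas results land in the run's .log file as lines like:
#   vout_max: MAX(v(out))=4.19872 FROM 0 TO 0.001
MEAS_RE = re.compile(r"^\s*(\w+):.*=\s*([-+]?\d[\d.eE+-]*)", re.MULTILINE)

def verify_ltspice_log(log_path: Path, expected: dict) -> bool:
    """Replay check on a produced artifact: parse .meas results out of the
    agent's LTspice log and compare the named one against the expected value."""
    if not log_path.exists():
        return False  # no artifact, no credit
    # Note: LTspice can write logs as UTF-16 on some platforms; a real
    # verifier would detect the encoding instead of ignoring errors.
    text = log_path.read_text(encoding="utf-8", errors="ignore")
    results = {m.group(1): float(m.group(2)) for m in MEAS_RE.finditer(text)}
    got = results.get(expected["meas"])
    if got is None:
        return False  # the claimed number never appeared in a produced file
    return abs(got - expected["value"]) <= expected["rel_tol"] * abs(expected["value"])

if __name__ == "__main__":
    print("PASS" if verify_ltspice_log(Path("run/task_01.log"), EXPECTED) else "FAIL")
```

The point of this design is that a confident-sounding number with no replayable artifact behind it scores nothing.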

The technical contract, task files, leaderboard artifacts, and reproduction guide are public in the GitHub repo.

// why this matters

Two ways to use it.

The public benchmark is a starting point for both model builders and engineering teams. Labs can use it to develop industrial simulation capability; CAE leaders can use it to evaluate whether agents can actually run simulation work before choosing a model.

For LLM labs · Develop industrial simulation capability. Use a public task suite where models must handle real tools, long-running workflows, numerical outputs, and solver artifacts.
For CAE leaders · Evaluate AI automation before choosing a model. Compare whether agents can actually run simulation work, not just explain engineering concepts in a chat window.