sim-benchmark.
A public benchmark for developing and evaluating LLM agents on real CAE and EDA simulation workflows.
Current public suite: 31 tasks (20 LTspice circuits and 11 OpenFOAM 11 CFD cases), with artifact-grounded scoring and no LLM judge.
Current public leaderboard.
| Model | LTspice circuits (20 tasks) | OpenFOAM fluids (11 tasks) |
|---|---|---|
| Claude Opus 4.6 | 0.986 | 0.918 |
| MiniMax-M2.7-highspeed | 0.899 | 0.804 |
| MiniMax-M2.5-highspeed | 0.899 | 0.706 |
| MiniMax-M2.7 | 0.838 | 0.675 |
The OpenFOAM 11 cases cover Bénard convection, dam-break multiphase flow, DNS turbulence, an oblique shock, non-Newtonian flow, the pitzDaily backward-facing step, and lid-driven cavities. Scores range from 0 to 1; the GitHub repo contains the produced files and run details for every row.
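Artifact-grounded scoring means a reported number is checked against a value parsed out of a file the run actually produced, rather than graded by another LLM. A minimal sketch of that idea, with a hypothetical `score_numeric_claim` helper and an invented LTspice-style log line (the benchmark's real scoring code lives in the repo):

```python
import re

def score_numeric_claim(reported: float, artifact_text: str, pattern: str,
                        rel_tol: float = 0.05) -> float:
    """Score 1.0 if the agent's reported value matches the value parsed
    from the produced artifact within rel_tol, else 0.0. Illustrative
    sketch only, not the benchmark's actual contract."""
    m = re.search(pattern, artifact_text)
    if m is None:
        return 0.0  # claim not backed by any artifact value
    measured = float(m.group(1))
    denom = max(abs(measured), 1e-12)  # guard against division by zero
    return 1.0 if abs(reported - measured) / denom <= rel_tol else 0.0

# Hypothetical measurement line from a simulator log
log = "gain: v(out)/v(in)=9.98 at 1000 Hz"
print(score_numeric_claim(10.0, log, r"=([-\d.eE+]+)"))  # → 1.0
```

Real task scoring would combine several such checks per task, which is one way a per-suite score between 0 and 1 can arise.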
From prompt to trusted simulation evidence.
A passing agent has to do more than name the right physics: it has to operate the software, get a run to complete, and report numerical results that can be checked against the files it produced.
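"Get a run to complete" is itself checkable from artifacts. As one hedged example of such a check, an OpenFOAM solver log ends with a bare `End` line on clean exit and contains `FOAM FATAL ERROR` on failure; a hypothetical `run_completed` helper might look like this (a sketch, not the benchmark's actual verifier):

```python
def run_completed(solver_log: str) -> bool:
    """Heuristic completion check on an OpenFOAM solver log: clean runs
    finish with a bare 'End' line; failed runs print 'FOAM FATAL ERROR'.
    Illustrative only."""
    if "FOAM FATAL ERROR" in solver_log:
        return False
    return any(line.strip() == "End" for line in solver_log.splitlines())

print(run_completed("ExecutionTime = 4.2 s\nEnd\n"))  # → True
print(run_completed("--> FOAM FATAL ERROR\n"))        # → False
```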
The technical contract, task files, leaderboard artifacts, and reproduction guide are all public in the GitHub repo.
Two ways to use it.
The public benchmark is a starting point for both model builders and engineering teams. Labs can use it to develop industrial simulation capability; CAE leaders can use it to evaluate whether agents can actually run simulation work before choosing a model.