alloy-llm-benchmark

Scripts for running an Alloy benchmark workflow:

  1. generate Alloy models from English descriptions,
  2. score generated models against reference models and known instances.

1) Setup

Python environment

python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install openai

API key setup

This repository expects your OpenAI API key in:

./secret/key

Create the file and paste only the raw key value into it.
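For example (the key value below is a placeholder — substitute your own):

```shell
# Create the secret directory and key file, then restrict it to your user.
mkdir -p secret
printf '%s' 'sk-your-key-here' > secret/key
chmod 600 secret/key
```

Using printf rather than echo avoids a trailing newline in the file.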

2) Java requirements (for generation and scoring)

Generation and scoring use the following Java tools:

  1. scoring/alloy-diff.jar (Java 17)
  2. scoring/org.alloytools.alloy.dist-6.2.0.jar (Java 17)
  3. scoring/CompoSAT.jar (Java 8)

Generation syntax checks and scoring require explicit Java homes:

  1. JAVA_HOME_17
  2. JAVA_HOME_8

Set these in your shell before running generation or scoring:

export JAVA_HOME_17="/path/to/jdk-17"
export JAVA_HOME_8="/path/to/jdk-8"

Verify both:

echo "$JAVA_HOME_17"
"$JAVA_HOME_17/bin/java" -version
"$JAVA_HOME_17/bin/javac" -version

echo "$JAVA_HOME_8"
"$JAVA_HOME_8/bin/java" -version

Expected major versions:

  1. JAVA_HOME_17/bin/java reports 17
  2. JAVA_HOME_8/bin/java reports 1.8 (or 8)
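The two version formats can be checked in a script with a small helper; `java_major_of` is our own sketch, not something shipped with the repository:

```shell
# Parse a Java version string ("17.0.9" or "1.8.0_392") into its major number.
java_major_of() {
  case "$1" in
    1.*) v=${1#1.}; echo "${v%%[._]*}" ;;  # "1.8.0_392" -> 8
    *)   echo "${1%%.*}" ;;                # "17.0.9"    -> 17
  esac
}

if [ -n "${JAVA_HOME_17:-}" ]; then
  v=$("$JAVA_HOME_17/bin/java" -version 2>&1 | awk -F '"' '/version/ {print $2; exit}')
  [ "$(java_major_of "$v")" = 17 ] || echo "warning: JAVA_HOME_17 is not a JDK 17"
fi
if [ -n "${JAVA_HOME_8:-}" ]; then
  v=$("$JAVA_HOME_8/bin/java" -version 2>&1 | awk -F '"' '/version/ {print $2; exit}')
  [ "$(java_major_of "$v")" = 8 ] || echo "warning: JAVA_HOME_8 is not a JDK 8"
fi
```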

3) Benchmark layout

The default run in this repository uses:

benchmark/descriptions/   # input English descriptions (.md)
benchmark/outputs/        # generated Alloy outputs (.als)
benchmark/models/         # reference Alloy models (.als)
benchmark/instances/      # XML instances grouped by model/scope
benchmark/generalInstances/ # exact-scope XML instances grouped by model/scope

Prompt scaffolding is read from:

prompts/english-alloy-prefix.txt
prompts/english-alloy-suffix.txt

4) Generate Alloy outputs from descriptions

Run:

python3 scripts/main.py benchmark/descriptions benchmark/outputs

What this does:

  1. reads each .md description file,
  2. wraps it with the prompt prefix and suffix,
  3. calls the LLM in parallel,
  4. checks the generated .als syntax,
  5. if syntax is invalid, retries up to 2 times (3 attempts total) using a repair prompt that includes prior attempts and syntax errors,
  6. writes all attempts as <model>.attempt1.als, <model>.attempt2.als, <model>.attempt3.als (as needed),
  7. writes/overwrites <model>.als with the final attempt for compatibility.
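The generate-and-repair loop above can be sketched roughly as follows; `generate`, `check_syntax`, and `build_repair_prompt` are stand-ins for the repository's actual functions, not its real API:

```python
MAX_ATTEMPTS = 3  # one initial call plus up to two repair retries


def generate_with_repair(description, generate, check_syntax, build_repair_prompt):
    """Return the list of attempted .als texts; the last entry is the final output."""
    attempts, errors = [], []
    prompt = description
    for attempt in range(1, MAX_ATTEMPTS + 1):
        als = generate(prompt)
        attempts.append(als)          # saved as <model>.attempt{attempt}.als
        errs = check_syntax(als)
        if not errs:
            break                     # syntactically valid: stop retrying
        errors.append(errs)
        # The repair prompt includes prior attempts and their syntax errors.
        prompt = build_repair_prompt(description, attempts, errors)
    return attempts                   # the final attempt is also written as <model>.als
```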

5) Score outputs against models and instances

Run:

python3 scripts/score.py benchmark/outputs benchmark/models benchmark/instances benchmark/generalInstances

Optional explicit report path:

python3 scripts/score.py benchmark/outputs benchmark/models benchmark/instances benchmark/generalInstances benchmark/outputs/scores.txt

What this does for each model:

  1. selects the final attempt file for each model (<model>.attemptN.als if present, otherwise <model>.als),
  2. computes the syntax-attempt score (out of 3): 3/3 if attempt 1 is syntactically valid, 2/3 if attempt 2, 1/3 if attempt 3, 0/3 if none,
  3. runs Ringert/SemDiff implications in both directions for scopes 1..max(composat_max_scope, general_max_scope),
  4. checks CompoSAT instances from the original model against the output model (original => output),
  5. checks general instances from the original model against the output model (original => output),
  6. runs CompoSAT on the output model for scopes 1..composat_max_scope, then checks those instances against the original model (output => original),
  7. runs InstanceGenerator on the output model for scopes 1..general_max_scope, then checks those instances against the original model (output => original).
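Steps 1 and 2 can be sketched like this (a simplification under our assumptions, not the script's actual code):

```python
from pathlib import Path


def final_attempt_file(outputs_dir, model):
    """Prefer the highest-numbered <model>.attemptN.als, else fall back to <model>.als."""
    for n in (3, 2, 1):
        p = Path(outputs_dir) / f"{model}.attempt{n}.als"
        if p.exists():
            return p
    return Path(outputs_dir) / f"{model}.als"


def syntax_attempt_score(first_valid_attempt):
    """3/3 if attempt 1 was valid, 2/3 for attempt 2, 1/3 for attempt 3, 0/3 if none."""
    if first_valid_attempt is None:
        return 0
    return 4 - first_valid_attempt
```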

Notes:

  1. CompoSAT and InstanceGenerator runs in scoring each have a 300s timeout per scope.
  2. If a timeout occurs, scores.txt includes a clear TIMEOUT line listing the timed-out scope(s).
  3. Multiple models can be scored in parallel; control this with SCORE_MODEL_WORKERS (default: up to 4, capped by the model count).
  4. CompoSAT scopes and general-instance scopes for a single model are each processed sequentially.
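The per-scope timeout behaviour can be sketched with Python's subprocess module; the function name and the exact TIMEOUT wording here are illustrative, not the script's real output:

```python
import subprocess


def run_scope_with_timeout(cmd, scope, timeout_s=300):
    """Run one per-scope tool invocation; return (stdout, timed_out)."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
        return result.stdout, False
    except subprocess.TimeoutExpired:
        # The scoring report records a TIMEOUT line naming the timed-out scope.
        return f"TIMEOUT scope {scope}", True
```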

Example:

SCORE_MODEL_WORKERS=3 python3 scripts/score.py benchmark/outputs benchmark/models benchmark/instances benchmark/generalInstances

The scoring report is written to:

benchmark/outputs/scores.txt

6) Model configuration

The OpenAI model is currently set in:

scripts/openAI.py

Look for the model= argument in client.chat.completions.create(...).

About

A benchmark consisting of Alloy models and English descriptions, used to test how well LLMs can generate Alloy models.
