Scripts for running an Alloy benchmark workflow:
- generate Alloy models from English descriptions,
- score generated models against reference models and known instances.
```
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install openai
```

This repository expects your OpenAI key in:

```
./secret/key
```

Create the file and paste only the raw key value into it.
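A minimal sketch of how the key could be loaded from that file (the loading code actually used by `scripts/openAI.py` may differ; `load_api_key` is a hypothetical helper):

```python
from pathlib import Path

def load_api_key(path: str = "secret/key") -> str:
    # Read the raw key value and strip any trailing newline/whitespace.
    key = Path(path).read_text(encoding="utf-8").strip()
    if not key:
        raise ValueError(f"no API key found in {path}")
    return key
```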
Generation and scoring use Java tools:

- `scoring/alloy-diff.jar` (Java 17)
- `scoring/org.alloytools.alloy.dist-6.2.0.jar` (Java 17)
- `scoring/CompoSAT.jar` (Java 8)

Generation syntax checks and scoring require explicit Java homes:

- `JAVA_HOME_17`
- `JAVA_HOME_8`
Set these in your shell before running scoring:

```
export JAVA_HOME_17="/path/to/jdk-17"
export JAVA_HOME_8="/path/to/jdk-8"
```

Verify both:

```
echo "$JAVA_HOME_17"
"$JAVA_HOME_17/bin/java" -version
"$JAVA_HOME_17/bin/javac" -version
echo "$JAVA_HOME_8"
"$JAVA_HOME_8/bin/java" -version
```

Expected major versions:

- `$JAVA_HOME_17/bin/java` reports 17
- `$JAVA_HOME_8/bin/java` reports 1.8 (or 8)
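The version check above can also be done programmatically; a hedged sketch in Python (note that `java -version` prints to stderr, not stdout):

```python
import os
import re
import subprocess

def parse_java_major(version_output: str) -> int:
    """Parse the major version from `java -version` output.
    Handles both the legacy scheme ("1.8.0_292" -> 8) and the
    modern one ("17.0.2" -> 17)."""
    m = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if not m:
        raise RuntimeError(f"could not parse version from: {version_output!r}")
    major = int(m.group(1))
    return int(m.group(2)) if major == 1 else major

def java_major_version(java_home: str) -> int:
    # `java -version` writes its banner to stderr.
    proc = subprocess.run(
        [os.path.join(java_home, "bin", "java"), "-version"],
        capture_output=True, text=True, check=True,
    )
    return parse_java_major(proc.stderr)
```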
The default run in this repository uses:

```
benchmark/descriptions/      # input English descriptions (.md)
benchmark/outputs/           # generated Alloy outputs (.als)
benchmark/models/            # reference Alloy models (.als)
benchmark/instances/         # XML instances grouped by model/scope
benchmark/generalInstances/  # exact-scope XML instances grouped by model/scope
```
Prompt scaffolding is read from:

```
prompts/english-alloy-prefix.txt
prompts/english-alloy-suffix.txt
```
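Generation wraps each description between these two files; a minimal sketch of the assembly (the exact concatenation in `scripts/main.py` may differ; `build_prompt` is a hypothetical helper):

```python
from pathlib import Path

def build_prompt(description: str, prompt_dir: str = "prompts") -> str:
    # Prefix + description + suffix, separated by blank lines.
    prefix = Path(prompt_dir, "english-alloy-prefix.txt").read_text(encoding="utf-8")
    suffix = Path(prompt_dir, "english-alloy-suffix.txt").read_text(encoding="utf-8")
    return f"{prefix.rstrip()}\n\n{description.strip()}\n\n{suffix.lstrip()}"
```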
Run:

```
python3 scripts/main.py benchmark/descriptions benchmark/outputs
```

What this does:
- reads each `.md` description file,
- wraps it with the prompt prefix and suffix,
- calls the LLM in parallel,
- checks the generated `.als` syntax,
- if syntax is invalid, retries up to 2 times (3 attempts total) using a repair prompt that includes prior attempts and syntax errors,
- writes all attempts as `<model>.attempt1.als`, `<model>.attempt2.als`, `<model>.attempt3.als` (as needed),
- writes/overwrites `<model>.als` with the final attempt for compatibility.
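The generate-check-repair loop above can be sketched as follows (the real generator, syntax checker, and repair-prompt wording live in `scripts/main.py` and may differ; `generate` and `check_syntax` here are hypothetical stand-ins):

```python
def generate_with_retries(description, generate, check_syntax, max_attempts=3):
    """Call the LLM, re-prompting with prior attempts and errors on failure.
    Returns (final_text, attempts) where attempts is a list of (text, error)."""
    attempts = []
    prompt = description
    for n in range(1, max_attempts + 1):
        text = generate(prompt)
        error = check_syntax(text)  # None if the .als parses cleanly
        attempts.append((text, error))
        if error is None:
            break
        # Repair prompt includes the failed attempt and its syntax errors.
        prompt = (f"{description}\n\nPrevious attempt {n}:\n{text}\n"
                  f"Syntax errors:\n{error}\nPlease fix the model.")
    return attempts[-1][0], attempts
```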
Run:

```
python3 scripts/score.py benchmark/outputs benchmark/models benchmark/instances benchmark/generalInstances
```

Optional explicit report path:

```
python3 scripts/score.py benchmark/outputs benchmark/models benchmark/instances benchmark/generalInstances benchmark/outputs/scores.txt
```

What this does for each model:
- selects the final attempt file (`<model>.attemptN.als` if present, otherwise `<model>.als`),
- computes a syntax-attempt score (`N/3`): `3/3` if attempt 1 is syntactically valid, `2/3` if attempt 2, `1/3` if attempt 3, `0/3` if none,
- runs Ringert/SemDiff implications in both directions for scopes `1..max(composat_max_scope, general_max_scope)`,
- checks CompoSAT instances from the original model against the output model (`original => output`),
- checks general instances from the original model against the output model (`original => output`),
- runs CompoSAT on the output model for scopes `1..composat_max_scope`, then checks those instances against the original model (`output => original`),
- runs `InstanceGenerator` on the output model for scopes `1..general_max_scope`, then checks those instances against the original model (`output => original`).
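The syntax-attempt score maps the first syntactically valid attempt to a score out of 3; a minimal sketch of that mapping (`syntax_attempt_score` is a hypothetical helper, not necessarily the name used in `scripts/score.py`):

```python
def syntax_attempt_score(first_valid_attempt):
    """Map the first syntactically valid attempt (1..3, or None) to N/3:
    attempt 1 -> 3/3, attempt 2 -> 2/3, attempt 3 -> 1/3, none valid -> 0/3."""
    if first_valid_attempt is None:
        return "0/3"
    return f"{4 - first_valid_attempt}/3"
```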
Notes:
- CompoSAT runs in scoring have a `300s` timeout per scope. `InstanceGenerator` runs in scoring also have a `300s` timeout per scope.
- If a timeout occurs, `scores.txt` includes a clear `TIMEOUT` line listing the timed-out scope(s).
- Multiple models can be scored in parallel. Control this with `SCORE_MODEL_WORKERS` (default: up to `4`, capped by the model count).
- CompoSAT scopes and general-instance scopes for a single model are each processed sequentially.
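The per-model parallelism described above can be sketched with a thread pool sized from `SCORE_MODEL_WORKERS` (the real orchestration lives in `scripts/score.py`; `score_model` here is a hypothetical per-model worker):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def score_all(models, score_model):
    # Default worker count: up to 4, capped by the number of models.
    workers = int(os.environ.get("SCORE_MODEL_WORKERS", "4"))
    workers = max(1, min(workers, len(models)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Results come back in the same order as the input models.
        return list(pool.map(score_model, models))
```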
Example:

```
SCORE_MODEL_WORKERS=3 python3 scripts/score.py benchmark/outputs benchmark/models benchmark/instances benchmark/generalInstances
```

The scoring report is written to:

```
benchmark/outputs/scores.txt
```
The OpenAI model is currently set in:

```
scripts/openAI.py
```

Look for the `model=` argument in `client.chat.completions.create(...)`.
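For reference, that call has roughly the following shape (a sketch, not the repo's actual code; the prompt assembly and model name in `scripts/openAI.py` may differ, and `"gpt-4o"` below is only a placeholder):

```python
def build_messages(prompt: str) -> list:
    # A single user turn carrying the full assembled prompt.
    return [{"role": "user", "content": prompt}]

def generate_alloy(prompt: str, model: str = "gpt-4o") -> str:
    # "gpt-4o" is a placeholder; the repo picks the real name via model=.
    from openai import OpenAI  # imported lazily so the sketch loads offline
    client = OpenAI(api_key=open("secret/key").read().strip())
    resp = client.chat.completions.create(
        model=model,
        messages=build_messages(prompt),
    )
    return resp.choices[0].message.content
```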