OpenAdapt-ML is the model-agnostic, domain-agnostic ML engine for GUI automation agents. It:
- Turns GUI interaction trajectories (screens + actions + goals) into datasets.
- Fine-tunes vision-language models (VLMs) via LoRA/QLoRA.
- Exposes a simple runtime policy API for predicting the next GUI action.
- Is validated end-to-end on synthetic, semantic UIs before integrating real data.
This repo contains only generic ML and data logic.
v1 focuses on a minimal but complete pipeline:
- Canonical schema for GUI interaction data:
  - `Session`, `Episode`, `Step`, `Observation`, `Action`.
  - Episodes are goal-conditioned.
  - Steps optionally include thought/reasoning.
- Synthetic semantic UIs:
  - Login screen, settings panel, browser-style screens, etc.
  - Each episode encodes a concrete goal (e.g. "Log in as alice", "Clear browser cache").
- Dataset builder for goal-conditioned next-action SFT:
  - Produces chat-style samples with images + messages.
  - Assistant outputs parseable textual actions (e.g. `CLICK(x=0.42, y=0.73)`).
- VLM adapter abstraction with a Qwen-VL implementation:
  - Uses Hugging Face `transformers` + `peft` (LoRA/QLoRA).
  - Designed primarily for Qwen3-VL but extensible.
- Minimal training loop / CLI:
  - Loads a base VLM, attaches LoRA, trains on synthetic data, saves the adapter.
  - Training hyperparameters align with common Qwen SFT recipes.
- Runtime policy:
  - Given a screenshot and a goal, predicts the next action.
  - Parses textual actions into a structured `Action` object.
v1 explicitly does not implement:
- Real OS hooks, replay mechanisms, or system tray integration.
- Application-specific ingestion logic, custom taxonomies, or integration layers.
- Large-scale distributed training or production deployment stacks.
- Advanced workflow mining (clustering, summarization, segmentation).
Directory layout:
- `openadapt_ml/schemas/sessions.py` — canonical dataclasses.
- `ingest/synthetic.py` — semantic synthetic UIs and scripted episodes.
- `datasets/next_action.py` — goal-conditioned next-action SFT builder.
- `models/base_adapter.py` — `BaseVLMAdapter`, device utils.
- `models/qwen_vl.py` — Qwen-family adapter with LoRA/QLoRA.
- `training/trainer.py` — minimal training loop over adapters + datasets.
- `runtime/policy.py` — `AgentPolicy` for inference.
- `scripts/train.py` — config-driven training entrypoint.
- `scripts/demo_policy.py` — simple end-to-end demo on synthetic data.
This structure keeps schemas, ingest, datasets, models, training, and runtime cleanly separated.
All data is normalized into the following entities (Python dataclasses, type-hinted):
- `ActionType` (string or `Literal`):
  - For v1: `"click"`, `"double_click"`, `"right_click"`, `"drag"`, `"scroll"`, `"type"`, `"key_press"`, `"wait"`, `"done"`, `"failed"`.
  - We start by implementing `click`, `type`, and `done`; others are reserved for future use.
- `Action`:
  - `type: str` — one of the action types.
  - `x: float | None` — normalized [0, 1] horizontal coordinate (optional).
  - `y: float | None` — normalized [0, 1] vertical coordinate (optional).
  - `text: str | None` — text payload for `type` actions.
  - `raw: dict[str, Any] | None` — raw extra metadata (element ids, key names, etc.).
- `Observation`:
  - `image_path: str | None` — path to screenshot (PNG, etc.).
  - `meta: dict[str, Any] | None` — optional metadata (window title, app name, URL, etc.).
- `Step`:
  - `t: float` — timestamp or relative time.
  - `observation: Observation`.
  - `action: Action` — the action taken at this step.
  - `thought: str | None` — optional reasoning behind the action.
- `Episode`:
  - `id: str`.
  - `goal: str` — required, natural-language description of the task.
  - `steps: list[Step]`.
  - `summary: str | None` — optional high-level description.
  - `success: bool | None` — optional outcome flag (ground-truth goal completion).
  - `workflow_id: str | None` — optional cluster / taxonomy label (workflow type).
- `Session`:
  - `id: str`.
  - `episodes: list[Episode]`.
  - `meta: dict[str, Any] | None` — session-level metadata.
This schema is the contract between ingest, datasets, models, and runtime.
- Training unit: dataset builders and training loops operate on `list[Episode]`. `Session` is primarily an ingest/container type for grouping episodes and metadata.
- Structured vs textual actions:
  - `Action` is the canonical, structured representation stored in episodes.
  - The textual DSL (e.g. `CLICK(x=0.42, y=0.73)`) is derived from `Action` for SFT samples and runtime prompts; it is not stored verbatim in the `Episode`.
- `DONE` vs success semantics:
  - Synthetic generators will include a terminal `Action(type="done")` once the scripted goal is achieved.
  - `Episode.success` records whether the goal was actually achieved in that episode.
  - This lets us evaluate both task success and whether the policy learns to emit `DONE()` at the right time.
- Thought usage in v1:
  - `Step.thought` is present for future ReAct-style supervision.
  - v1 training ignores `thought` and only supervises the final action string.
  - Thought supervision is a v1.1+ extension.
We want synthetic data that:
- Looks like real desktop UIs (not abstract grids).
- Has semantically meaningful elements (username field, login button, etc.).
- Encodes clear goals and scripted sequences that achieve them.
This follows the spirit of omnimcp's synthetic_ui.py and matches patterns in VideoAgentTrek (task-oriented episodes).
Initial synthetic UIs (v1):
- Login screen
  - Fields: username, password; checkbox: "Remember Me"; link: "Forgot Password?"; button: "Login".
  - Example goal: "Log in with username 'alice' and password 'hunter2'."
- Settings panel
  - Toggles: e.g. "Enable notifications", "Send usage data".
  - Buttons: "Save", "Cancel".
  - Example goal: "Disable usage data and save settings."
- Browser-style UI (later in v1 or v1.1)
  - Toolbar: back, forward, refresh, settings.
  - Settings subpanel: "Clear cache", "Clear cookies".
  - Example goal: "Clear the browser cache and cookies."
Each scenario will be generated by dedicated helpers under `openadapt_ml/ingest/synthetic.py`.
For each scenario, we script an Episode with:
- A goal string.
- A sequence of Steps with:
  - An `Observation.image_path` for each intermediate UI state.
  - An `Action` with normalized coordinates or text.
  - An optional `thought` summarizing the intent.
Example (conceptual only):
- Goal: "Clear the browser cache and cookies."
- Steps:
  - Step 0: click settings icon.
  - Step 1: click "Privacy" section.
  - Step 2: click "Clear data" button.
  - Step 3: ensure cookies checkbox is enabled.
  - Step 4: click "Confirm".
  - Step 5: `done` action when goal complete.
`openadapt_ml/ingest/synthetic.py` exposes:

```python
def generate_synthetic_sessions(
    num_sessions: int = 10,
    seed: int | None = None,
    output_dir: str | os.PathLike[str] | None = None,
    jitter: bool = True,
) -> list[Session]:
    """Generate a list of synthetic Sessions with semantic login episodes."""
```

All action coordinates are normalized relative to the screenshot image (range [0, 1]).
If needed, the original screen resolution can be stored in Observation.meta.
Synthetic generators save PNGs to disk and store their paths in Observation.image_path for simplicity and easy inspection.
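The normalization convention can be illustrated with a small helper (the name `normalize_coords` is hypothetical; the real generators produce normalized coordinates directly):

```python
def normalize_coords(px: float, py: float, width: int, height: int) -> tuple[float, float]:
    """Map pixel coordinates to the normalized [0, 1] range used by Action.x / Action.y.

    `width` / `height` are the screenshot resolution at capture time, which can
    be stored in Observation.meta if the original pixel space must be recovered.
    """
    return px / width, py / height


# e.g. a click at pixel (640, 320) on a 1280x640 screenshot maps to (0.5, 0.5)
x, y = normalize_coords(640.0, 320.0, 1280, 640)
```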
We train the VLM via supervised fine-tuning (SFT) on chat-style samples.
For v1, we encode actions as simple textual commands, e.g.:
- `CLICK(x=0.42, y=0.73)`
- `TYPE(text="alice")`
- `DONE()`
This keeps things:
- Parseable via regex.
- Human-readable.
- Compatible with standard Qwen SFT setups (text-only supervision on top of images).
Internal SFT sample (conceptual):

```python
{
    "images": ["/path/to/screen.png"],
    "messages": [
        {
            "role": "system",
            "content": "You are a GUI automation agent. Given a screen and a user goal, predict the single next action. Actions include CLICK, TYPE, SCROLL, DRAG, KEY, and DONE.",
        },
        {
            "role": "user",
            "content": (
                "Goal: Clear the browser cache and cookies.\n"
                # Optional: "Previous actions: CLICK(x=0.95, y=0.10), CLICK(x=0.5, y=0.6)\n"
                "Current screen: see the attached image.\n"
                "Predict the next action."
            ),
        },
        {
            "role": "assistant",
            "content": "CLICK(x=0.95, y=0.10)",
        },
    ],
}
```

Later we may optionally include history and/or thoughts in the user content, but v1 can start with:
- Goal + current screen → next action.
Under `openadapt_ml/datasets/next_action.py` we provide:
- `build_next_action_sft_samples(episodes: list[Episode]) -> list[dict]`.
- `NextActionDataset`, a thin `torch.utils.data.Dataset` wrapper.
- Utilities such as `format_action` for formatting `Action` objects into action strings.
The action DSL is the contract between dataset builders, adapters, the runtime policy, and external agents. For v1 it is deliberately small and strict:
- Allowed actions
  - `CLICK(x=<float in [0,1]>, y=<float in [0,1]>)`
  - `TYPE(text="...")`
  - `WAIT()`
  - `DONE()`
- Coordinate range
  - `x` and `y` are normalized to `[0, 1]` relative to the screenshot resolution at the time of capture.
- Single action per response
  - Assistant outputs supervised during training contain exactly one DSL action, optionally preceded by a `Thought:` line (ReAct-style) in higher-level prompts.
  - Runtime parsing in `AgentPolicy` extracts a single `Action` from the model's text; additional text is ignored.
- Failure behavior
  - If parsing fails (no valid DSL action found), the policy returns `Action(type="failed")` with the raw text attached in `Action.raw`.
  - Downstream callers must treat `failed` as non-executable and may choose to retry, fall back, or log.
These invariants must remain stable unless the DSL is explicitly versioned and all adapters + parsers are updated in lockstep.
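A regex-based parser that honors these invariants might look like the following sketch (the name `parse_action` and the dict return type are illustrative; the real parser lives in `AgentPolicy` and returns an `Action`):

```python
import re
from typing import Any

# One pattern per v1 action; surrounding text is ignored via `search`.
_PATTERNS = [
    ("click", re.compile(r"CLICK\(x=(?P<x>[0-9.]+),\s*y=(?P<y>[0-9.]+)\)")),
    ("type", re.compile(r'TYPE\(text="(?P<text>[^"]*)"\)')),
    ("wait", re.compile(r"WAIT\(\)")),
    ("done", re.compile(r"DONE\(\)")),
]


def parse_action(text: str) -> dict[str, Any]:
    """Extract a single DSL action from model output; 'failed' if none found."""
    for action_type, pattern in _PATTERNS:
        m = pattern.search(text)
        if m is None:
            continue
        groups = m.groupdict()
        return {
            "type": action_type,
            "x": float(groups["x"]) if "x" in groups else None,
            "y": float(groups["y"]) if "y" in groups else None,
            "text": groups.get("text"),
        }
    # No valid DSL action: mirror Action(type="failed") with the raw text attached.
    return {"type": "failed", "raw": {"text": text}}
```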
For an Episode with steps `[s0, s1, ..., sN]` we create one SFT sample per step that has both an observation and an action, including the terminal `DONE()` step (if present). This allows the model to learn termination behavior.
For each step `k in range(len(steps))`:
- `image = steps[k].observation.image_path`
- `goal = episode.goal`
- `target_action` = textual serialization of `steps[k].action`, e.g. `"CLICK(x=0.42, y=0.73)"` or `"DONE()"`.
v1 deliberately uses a one-step Markov assumption:
- Input: `(goal, current screen)`
- Output: next action
History and thought are not used in v1 training, but the SFT sample format leaves room to add them later (by augmenting the user message).
`openadapt_ml/models/base_adapter.py` will provide a helper:
- Use `"cuda"` if `torch.cuda.is_available()`.
- Else `"mps"` if `torch.backends.mps.is_available()`.
- Else `"cpu"`.
Adapters take an explicit device but default to this helper.
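The selection logic, factored as a pure function for illustration (in the real helper the flags come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`; the function name is hypothetical):

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Device preference used by adapters: CUDA > Apple MPS > CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"
```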
The base class encodes the common API:
- Construction: wraps a Hugging Face model + processor + device.
- `prepare_inputs(batch: list[dict]) -> dict`:
  - Takes a list of SFT samples and returns HF model inputs (`input_ids`, `pixel_values`, `attention_mask`, `labels`, etc.).
- `compute_loss(inputs: dict) -> torch.Tensor`:
  - Runs the model and returns the loss (labels already in `inputs`).
- `generate(sample: dict, max_new_tokens: int = 64) -> str`:
  - Single-sample generation for runtime; returns assistant text.
Adapters are stateless model wrappers: they do not own optimizers, schedulers, or training state. All training state lives in the training loop.
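The API above could be expressed as an abstract base class, sketched here without the torch type annotations (the real `BaseVLMAdapter` lives in `models/base_adapter.py` and types `compute_loss` as returning `torch.Tensor`):

```python
from abc import ABC, abstractmethod


class BaseVLMAdapter(ABC):
    """Stateless wrapper around a HF model + processor + device.

    Adapters own no optimizers, schedulers, or training state; that all
    lives in the training loop.
    """

    def __init__(self, model, processor, device):
        self.model = model
        self.processor = processor
        self.device = device

    @abstractmethod
    def prepare_inputs(self, batch: list[dict]) -> dict:
        """Turn SFT samples into model inputs (input_ids, pixel_values, ...)."""

    @abstractmethod
    def compute_loss(self, inputs: dict):
        """Run the model and return the loss (labels already in `inputs`)."""

    @abstractmethod
    def generate(self, sample: dict, max_new_tokens: int = 64) -> str:
        """Single-sample generation for runtime; returns assistant text."""
```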
`openadapt_ml/models/qwen_vl.py` will implement a `QwenVLAdapter` targeting the Qwen-VL family via Hugging Face `transformers` + `peft`:
- Primary target: Qwen3-VL (e.g. `Qwen/Qwen3-VL-8B-Instruct`).
- Secondary / compatibility target (future): Qwen2.5-VL (e.g. `Qwen/Qwen2.5-VL-7B-Instruct` or compatible derivatives).
- `from_pretrained(model_name: str, lora_config: LoraConfig | dict | None, load_in_4bit: bool = True, device: torch.device | None = None)`:
  - Loads a base Qwen-VL model + processor.
  - Applies 4-bit loading (QLoRA) when requested (on CUDA-capable GPUs).
  - Attaches a LoRA adapter using `peft`.
- `prepare_inputs` (v1, Qwen3-VL path):
  - Reinterprets generic SFT samples into Qwen-style multimodal `messages` where the user content includes both an `image` entry and a `text` entry.
  - Uses `processor.apply_chat_template(..., tokenize=True, add_generation_prompt=False, return_dict=True, return_tensors="pt")` to obtain model inputs.
  - For v1 synthetic experiments, uses full-sequence supervision (`labels = input_ids`) for simplicity; assistant-only masking is a planned refinement.
- `compute_loss` and `generate` delegate to the underlying HF model.
`openadapt_ml/models/api_adapter.py` implements `ApiVLMAdapter` for hosted VLM APIs (Anthropic Claude and OpenAI GPT). This adapter is inference-only and supports `generate` but not `prepare_inputs` or `compute_loss`.
Configuration is managed via `openadapt_ml/config.py`, which uses `pydantic-settings` to load API keys from:
- `.env` file in the repository root (recommended for local development).
- Environment variables (for CI/containers).
- Explicit `api_key` parameter passed to the adapter constructor.
Priority order: explicit parameter > settings (from `.env`) > environment variables > raise error.
Example `.env` file:

```
ANTHROPIC_API_KEY=your-anthropic-api-key-here
OPENAI_API_KEY=your-openai-api-key-here
```
The settings are loaded once at import time and cached in a global settings
object. This approach:
- Eliminates the need to manually export environment variables during development.
- Keeps sensitive keys out of the repository (`.env` is gitignored).
- Supports both local `.env` workflow and traditional environment variable injection for deployment.
We follow common Qwen-VL LoRA/QLoRA conventions, exposing a similar set of knobs. At a high level:
- Model:
  - `model_name`: e.g. `"Qwen/Qwen3-VL-8B-Instruct"`.
  - `load_in_4bit`: bool flag for QLoRA.
- LoRA:
  - `r` (rank): e.g. 16.
  - `alpha`: e.g. 32.
  - `dropout`: e.g. 0.05.
  - `target_modules`: e.g. `["q_proj", "v_proj"]`.
- Training:
  - `num_train_epochs`: 1–3.
  - `per_device_train_batch_size`: small (1–4).
  - `gradient_accumulation_steps`: 4–8.
  - `learning_rate`: ~2e-4 (LoRA typical range).
  - `warmup_ratio`: 0.03–0.1.
  - `weight_decay`: 0.0–0.1.
  - `lr_scheduler_type`: `"cosine"` or `"linear"`.
  - `max_grad_norm`: 1.0.
- Logging / saving:
  - `logging_steps`, `save_steps`, `save_total_limit`.
- Evaluation (optional in v1):
  - `eval_steps`: run a small synthetic eval every N training steps, if set.
  - `eval_episodes`: number of synthetic episodes to sample for validation.
  - `metric`: name of the primary metric (e.g. `"action_accuracy"`, `"coordinate_error"`).
These will be represented in a small `TrainingConfig` dataclass and a corresponding YAML/JSON schema used by `scripts/train.py`.
openadapt_ml/training/trainer.py exposes a simple helper that:
- Instantiates an optimizer (e.g. `AdamW`).
- Iterates over a `DataLoader` built from the SFT dataset.
- Uses the adapter to prepare inputs and compute loss.
- Supports gradient accumulation and basic logging.
- Saves LoRA weights and config at the end.
v1 assumes single-device training (one CUDA GPU, MPS, or CPU) with small per-device batch sizes and gradient accumulation. Multi-GPU / distributed training and accelerate are future extensions.
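The loop can be sketched as follows. The name `train` is hypothetical, and the loss object is assumed to be torch-like (supporting division, `.backward()`, and `.item()`); saving LoRA weights would happen after the loop:

```python
def train(adapter, dataloader, optimizer, grad_accum_steps: int = 4) -> int:
    """Minimal single-device loop with gradient accumulation.

    Returns the number of optimizer steps taken.
    """
    optimizer.zero_grad()
    num_steps = 0
    for i, batch in enumerate(dataloader):
        inputs = adapter.prepare_inputs(batch)
        # Scale the loss so accumulated gradients match a larger effective batch.
        loss = adapter.compute_loss(inputs) / grad_accum_steps
        loss.backward()
        if (i + 1) % grad_accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
            num_steps += 1
            print(f"optimizer step {num_steps}: loss={loss.item() * grad_accum_steps:.4f}")
    return num_steps
```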
On Apple Silicon (M1/M2/M3, macOS), `bitsandbytes` does not currently provide GPU-based 4-bit quantization (QLoRA). Attempts to use `load_in_4bit=True` on an M-series Mac will fail or fall back to unsupported paths, since there is no CUDA backend.
For v1 this implies:
- Local development on M2: use standard LoRA (no 4-bit) with full-/mixed-precision weights and keep synthetic datasets and training configs small for speed.
- GPU training (e.g., cloud): enable `load_in_4bit: true` and use `bitsandbytes` on CUDA-capable GPUs where 4-bit QLoRA is supported.
The CLI script will:
- Load a config file (YAML or JSON).
- Generate synthetic sessions (`generate_synthetic_sessions`).
- Flatten to episodes and build SFT samples.
- Create a `NextActionDataset`.
- Instantiate `QwenVLAdapter` with LoRA.
- Run the training loop.
- Save:
  - LoRA adapter weights.
  - Training config.
  - Dataset / synthetic scenario metadata.
`openadapt_ml/runtime/policy.py` defines an `AgentPolicy` that wraps a trained adapter:
- `__init__(adapter: BaseVLMAdapter, device: torch.device)`.
- `predict_action(image: Image.Image, goal: str, history: list[Action] | None = None, think: bool = False) -> tuple[Action, str | None]`:
  - Builds a single SFT-style sample with:
    - System prompt (instructions + supported actions).
    - User prompt including the goal and reference to the current screen.
    - (Optional) short history summary, if provided.
  - Calls `adapter.generate` to get an assistant text string.
  - Parses the text into:
    - an `Action` object (type, x/y, text), and
    - an optional `thought` string (if we adopt a `Thought: ...\nAction: ...` pattern later).
If parsing fails (e.g. malformed text), `predict_action` returns an `Action` with `type="failed"` and attaches the raw text and error info in `Action.raw`; the thought return value is `None`. This lets callers handle failures gracefully (retry, fallback, or logging).
v1 expects the assistant output to be an action-only string (e.g. `"CLICK(x=0.5, y=0.5)"`). In later versions we may adopt a combined `Thought: ...\nAction: ...` format; the parser will be designed to optionally extract a thought prefix if present.
A simple `scripts/demo_policy.py` will:
- Load a trained LoRA adapter onto a base Qwen-VL model.
- Load a synthetic screenshot from disk.
- Call `AgentPolicy.predict_action` with a goal string.
- Print the resulting `Action` and optionally compare it to ground truth.
The v1 design intentionally leaves room for:
- Thought supervision:
  - Store `thought` in `Step` and optionally supervise a reasoning prefix before the action (ReAct-style).
- Action history in prompts:
- Include a short natural-language history of previous actions.
- Element-based actions:
- Move from pure coordinates to actions referencing specific UI elements (ids, bounds).
- Multi-frame context:
- Use multiple images per sample to capture temporal transitions.
- Real data ingest:
  - Implement arbitrary external interaction trace ingestors outside this repo, mapping their traces into the `Session` / `Episode` schema.
- Cloud runtime:
  - Separate repo (e.g. `openadapt-cloud`) providing generic Docker images and job specs for training and serving adapters on Azure or other clouds.
For now, the priority is to get the v1 pipeline working end-to-end on synthetic semantic UIs, so we can demonstrate:
- Data schema coherence.
- Qwen-VL LoRA/QLoRA integration.
- Goal-conditioned next-action prediction.
- A clean runtime API for GUI agents.
The design supports two complementary evaluation modes:
In an executable environment (e.g., a desktop automation arena), we can:
- Execute actions predicted by `AgentPolicy` step-by-step.
- Observe the resulting UI state.
- Decide whether the goal was achieved.
Typical metrics:
- Task success rate: fraction of episodes where the agent achieves the goal within a step budget.
- Step efficiency: number of steps taken vs. scripted or optimal.
This mode depends on an external environment and is not required for the initial synthetic, offline validation, but the `AgentPolicy` interface is designed to plug into such environments.
For recorded interaction data (synthetic or real), where we have episodes with images and ground-truth actions, we can evaluate the policy without executing actions by:
- For each step, calling `AgentPolicy.predict_action` on the screenshot + goal.
- Comparing the predicted `Action` to the ground-truth `Action`.
- Aggregating metrics across steps and episodes.
Representative metrics:
- Action type accuracy.
- Coordinate error for spatial actions.
- Episode success rate (all steps correct and DONE at the right time).
`openadapt_ml.evals.trajectory_matching` and `scripts/eval_policy.py` implement this mode for the synthetic login benchmark and are reused for future scenarios.
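The core metric computation can be sketched as follows (the function name and dict-based action representation are illustrative; the real implementation operates on `Action` objects):

```python
import math


def trajectory_metrics(pairs) -> dict:
    """Compute action-type accuracy and mean coordinate error.

    `pairs` is a list of (predicted, ground_truth) action dicts with a
    "type" key and optional "x"/"y" keys (normalized [0, 1] coordinates).
    """
    type_hits = 0
    coord_errors = []
    for pred, gold in pairs:
        if pred["type"] == gold["type"]:
            type_hits += 1
        # Coordinate error only for spatial actions where both sides have coords.
        if gold.get("x") is not None and pred.get("x") is not None:
            coord_errors.append(
                math.hypot(pred["x"] - gold["x"], pred["y"] - gold["y"])
            )
    return {
        "action_type_accuracy": type_hits / len(pairs) if pairs else 0.0,
        "mean_coordinate_error": (
            sum(coord_errors) / len(coord_errors) if coord_errors else None
        ),
    }
```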
- High-level overview and Quickstart: see `README.md`.
- Prioritized build plan and acceptance criteria: see `docs/roadmap.md`.