skill-up

📖 User Manual


Overview

skill-up is a CLI evaluation framework for Agent Skill developers. Declare your eval environment, dependencies, test cases, and grading strategy in evals/eval.yaml and evals/cases/*.yaml, then run evaluations locally or in CI to generate structured reports.

Warning

This project is still at an early stage: the code is not yet fully stable, and CLI commands, configuration fields, and public APIs may change in future releases. Review the CHANGELOG and verify compatibility before using it in production.

Features

  • Declarative Eval Config: Define evaluation environment, engine, model, and cases through YAML (eval.yaml + cases/*.yaml).
  • Multi-Engine Support: Works with Qoder CLI, Claude Code, and Codex as Agent Engines.
  • Flexible Judging: Supports rule_based, script, and agent_judge evaluation strategies.
  • Structured Reports: Outputs Anthropic-compatible grading.json, benchmark.json, benchmark.md, plus result.json, JUnit XML, and HTML reports.
  • Anthropic Compatible: Import evals.json via skill-up import, or auto-detect with --auto.
  • CI-Ready: Designed for local development and continuous integration pipelines.

Why skill-up

The official Agent Skills evaluation guide describes the right evaluation loop: write realistic cases, run with and without the Skill, grade outputs, aggregate results, and iterate. skill-up turns that workflow into a reusable CLI:

  • Replaces ad hoc run folders with a declarative eval.yaml + cases/*.yaml format.
  • Automates workspace setup, Skill installation, Agent Engine invocation, judging, and report generation.
  • Supports multiple engines (claude_code, codex, qodercli) instead of tying the workflow to one client.
  • Keeps compatibility with Anthropic-style evals.json while adding richer judges, CI-friendly commands, and structured reports.

Installation

Install with the script:

curl -fsSL https://raw.githubusercontent.com/alibaba/skill-up/main/install.sh | bash

The installer downloads the matching binary from GitHub Releases.

To build locally from a checkout, install Go 1.25 or later:

make build
# or
go build -o bin/skill-up ./cmd/skill-up

Quick Start

1. Create Eval Config

In your Skill directory, create evals/eval.yaml:

schema_version: v1alpha1

environment:
  type: none

engine:
  name: claude_code

cases:
  files:
    - evals/cases/hello-world.yaml

When evals/eval.yaml lives under a directory that contains SKILL.md, skill-up installs the current Skill automatically. The omitted fields use defaults: JSON report output, timeout_seconds: 300, max_turns: 10, and parallelism: 1.
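For reference, the same config with those defaults spelled out might look like the sketch below. The values are the documented defaults, but the exact field names and nesting are assumptions here; consult Writing Evals for the real schema.

```yaml
schema_version: v1alpha1

environment:
  type: none

engine:
  name: claude_code

# Assumed placement of the documented defaults (verify against the schema):
run:
  timeout_seconds: 300   # default
  max_turns: 10          # default
  parallelism: 1         # default

report:
  format: json           # default: JSON report output

cases:
  files:
    - evals/cases/hello-world.yaml
```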

For the full eval.yaml schema, see Writing Evals.

2. Write an Eval Case

Create evals/cases/hello-world.yaml:

input:
  prompt: |
    Please generate a Hello World program

expect:
  must_contain:
    - "Hello"
    - "World"

The case id defaults to the filename (hello-world). Add a judge block only when you need script-based or agent-based grading.
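As a sketch of where such a judge block would sit, assuming a script-based judge (the documented judge types are rule_based, script, and agent_judge, but the field names below are illustrative, not the verified schema):

```yaml
input:
  prompt: |
    Please generate a Hello World program

expect:
  must_contain:
    - "Hello"
    - "World"

# Hypothetical judge block; see Writing Evals for the actual fields.
judge:
  type: script
```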

3. Validate Config

skill-up validate

This step is optional, but useful before the first run: it checks eval.yaml and all referenced case files without starting an Agent Engine.

4. Run Evaluation

skill-up run

Results are written to <skill-name>-workspace/iteration-1/.

For engineering conventions (Conventional Commits, Git hooks, golangci-lint), see CONTRIBUTING.md.

User config

skill-up auto-loads an optional user-level config that supplies default OpenTelemetry env vars and per-environment runtime kwargs. The embedded defaults are empty; downstream consumers maintain their own config file.

Discovery chain (lowest to highest precedence)

embed (empty) < user (~/.config/skill-up/config.yaml) < project ($PWD/.skill-up.yaml) < explicit (--config)
| Source | Path |
| --- | --- |
| embed | empty Config{} — no vendor defaults baked in |
| user | $SKILL_EVAL_CONFIG, else $XDG_CONFIG_HOME/skill-up/config.yaml, else ~/.config/skill-up/config.yaml |
| project | $PWD/.skill-up.yaml |
| explicit | --config <path> (must exist) |

Missing files at the user and project layers are silently skipped; a missing --config path is a hard error. A corrupt config at any layer also fails the run.
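The user-layer lookup order in the table above can be sketched with plain shell parameter expansion (the semantics are assumed from the table, not taken from skill-up's source):

```shell
# User-layer config path: $SKILL_EVAL_CONFIG wins; otherwise fall back to
# $XDG_CONFIG_HOME/skill-up/config.yaml, and finally ~/.config/skill-up/config.yaml.
config_path="${SKILL_EVAL_CONFIG:-${XDG_CONFIG_HOME:-$HOME/.config}/skill-up/config.yaml}"
echo "$config_path"
```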

Quickstart

skill-up init              # writes ~/.config/skill-up/config.yaml (XDG-aware)
skill-up init --local      # writes $PWD/.skill-up.yaml
skill-up init --print      # prints the template to stdout
skill-up init --force      # overwrite an existing file

Schema

schema_version: v1alpha1
kind: SkillEvalConfig

telemetry:
  service_name: skill-up                              # OTEL_SERVICE_NAME
  traces_exporter: otlp                                 # OTEL_TRACES_EXPORTER
  traces:
    endpoint: http://localhost:4317                     # OTEL_EXPORTER_OTLP_TRACES_ENDPOINT (4317 for grpc, 4318/v1/traces for http/protobuf)
    protocol: grpc                                      # OTEL_EXPORTER_OTLP_TRACES_PROTOCOL (grpc | http/protobuf); skill-up defaults to grpc
  resource_attributes:                                  # serialized into OTEL_RESOURCE_ATTRIBUTES
    deployment.environment: local
  verbose: false                                        # if true, also enables OTEL_LOG_* payload capture

env:                                                    # arbitrary defaults, applied only-if-unset
  OTEL_EXPORTER_OTLP_HEADERS: authorization=${OTLP_TOKEN}

runtime_kwargs:                                         # keyed by environment.type
  opensandbox:
    base_url: http://localhost:8080
    # extensions: '{}'

Precedence

For environment variables: any value already set in the process environment wins; the config only fills in missing keys.
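This "only if unset" behavior matches shell default-value expansion; a minimal illustration:

```shell
unset OTEL_TRACES_EXPORTER                 # simulate a key missing from the environment
export OTEL_SERVICE_NAME=my-service        # simulate a key already set by the user
: "${OTEL_SERVICE_NAME:=skill-up}"         # config default ignored: the env value wins
: "${OTEL_TRACES_EXPORTER:=otlp}"          # config default applied: the key was unset
echo "$OTEL_SERVICE_NAME $OTEL_TRACES_EXPORTER"   # my-service otlp
```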

For runtime_kwargs: explicit --runtime-kwarg on run > eval.yaml environment.kwargs > user-config runtime_kwargs[environment.type].

Secrets

Prefer ${ENV_VAR} references inside the config file rather than baking secret literals. The redaction mechanism (userconfig.Redact) masks fields tagged secret:"true" when printing; currently no Config field carries the tag, but the mechanism is in place for future fields.

Importing evals.json

Use skill-up import to migrate an Anthropic-compatible evals.json into the YAML layout used by this repo:

skill-up import ./evals/evals.json --output ./evals

CLI Overview

| Command | Description |
| --- | --- |
| skill-up run [path] | Run evaluation cases and produce reports |
| skill-up validate [path] | Validate eval.yaml and case files |
| skill-up list-cases [path] | List all cases referenced by the config |
| skill-up report <result.json> | Generate reports from a previous run |
| skill-up import <evals.json> | Import Anthropic evals.json to YAML cases |
| skill-up debug judge <input.json> | Debug judge module with a JSON input |
| skill-up debug report <input.json> | Debug report module with a JSON input |

License

Apache License 2.0 — see LICENSE.
