[xegpu] Add matmul cost model and tile size selector #156
Pull request overview
Adds a XeGPU matmul tile-parameter cost model plus device-spec plumbing, and updates the XeGPU matmul/MLP scheduling path (and examples) to auto-populate missing tiling parameters when a shape is not present in the JSON parameter DB.
Changes:
- Introduces `XeGPUSpecs` (device specs DB) and a `matmul_costmodel` grid-search/roofline estimator to rank valid tiling configs.
- Replaces the old function-based parameter selector with `XeGPUParameterSelector`, and wires `mlp_schedule` to generate/fill missing tiling parameters per layer.
- Refactors examples to pass only required `(m,n,k)` (plus optional `--target`) and rely on the schedule to complete parameters; reuses centralized constraint checks.
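The "fill missing tiling parameters" flow can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the tile-parameter names, the `"MxNxK"` DB key format, the stand-in fallback heuristic, and the rule that user-supplied values override defaults are hypothetical and are not taken from `XeGPUParameterSelector` or `mlp_schedule`.

```python
import json

REQUIRED_KEYS = ("m", "n", "k")
# Hypothetical tile-parameter names, for illustration only.
TILE_KEYS = ("wg_m", "wg_n", "sg_m", "sg_n", "k_tile")


def largest_divisor(dim, candidates):
    """Return the first candidate (largest first) that evenly divides dim."""
    return next(c for c in candidates if dim % c == 0)


def cost_model_fallback(m, n, k):
    """Stand-in for the cost model: pick the largest dividing tile sizes."""
    wg_m = largest_divisor(m, (256, 128, 64, 32, 16, 8, 1))
    wg_n = largest_divisor(n, (256, 128, 64, 32, 16, 8, 1))
    return {
        "wg_m": wg_m,
        "wg_n": wg_n,
        "sg_m": largest_divisor(wg_m, (32, 16, 8, 1)),
        "sg_n": largest_divisor(wg_n, (32, 16, 8, 1)),
        "k_tile": largest_divisor(k, (32, 16, 8, 1)),
    }


def complete_params(layer, db):
    """Fill missing tile parameters for one layer: DB lookup first, then fallback."""
    missing = [key for key in REQUIRED_KEYS if key not in layer]
    if missing:
        raise ValueError(f"missing required entries: {missing}")
    if all(key in layer for key in TILE_KEYS):
        return dict(layer)  # nothing to fill in
    shape_key = f"{layer['m']}x{layer['n']}x{layer['k']}"  # assumed key format
    defaults = db.get(shape_key) or cost_model_fallback(
        layer["m"], layer["n"], layer["k"]
    )
    # Assumption: values supplied by the caller override the defaults.
    return {**defaults, **layer}


# A toy parameter DB with a single known shape.
db = json.loads(
    '{"4096x4096x4096": {"wg_m": 256, "wg_n": 256, '
    '"sg_m": 32, "sg_n": 32, "k_tile": 32}}'
)
layers = [{"m": 4096, "n": 4096, "k": 4096}, {"m": 128, "n": 64, "k": 32}]
completed = [complete_params(layer, db) for layer in layers]
```

The first layer is served from the toy DB, while the second shape is absent and goes through the fallback, which mirrors the per-layer behavior described in the list above.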
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| lighthouse/schedule/xegpu/xegpu_specs.py | Adds device-spec DB and XeGPUSpecs used by the cost model/selector. |
| lighthouse/schedule/xegpu/xegpu_parameter_selector.py | Implements class-based param selection with JSON DB lookup + cost-model fallback. |
| lighthouse/schedule/xegpu/mlp_schedule.py | Adds pre-processing to auto-fill missing layer tile parameters via selector; moves constants to shared constraints module. |
| lighthouse/schedule/xegpu/matmul_costmodel.py | Adds config generation + simple roofline-based performance estimation. |
| lighthouse/schedule/xegpu/matmul_constraints.py | Centralizes tiling/prefetch validity checks and shared constants. |
| lighthouse/schedule/xegpu/__init__.py | Exposes new selector/specs/constraint helper via package exports. |
| examples/xegpu/tune_matmul_gridsearch.py | Switches to shared check_constraints and adds GPU target selection. |
| examples/xegpu/torch_matmul.py | Simplifies parameter init to (m,n,k) + optional target; removes legacy selector usage. |
| examples/xegpu/mlp.py | Passes per-layer (m,n,k) (and optional target) and relies on schedule for completion. |
| examples/xegpu/matmul.py | Passes (m,n,k) (and optional target) and relies on schedule for completion. |
rengolin left a comment
Are you matching the perf from the GA auto-tuner?
side comment: would be nice to merge the matmul and the mlp python files. Just make matmul an mlp without element-wise.
The method used in XeGPUParameterSelector is within ~5% of the best known config for compute-bound cases. For memory-bound cases, the gap can be up to ~50%. We cannot really do better without tuning the prefetch and load tile configs. The next PR will introduce a tuning method that uses this cost model to generate the search candidates. It's always at least as good as what the GA tuning produces.
You mean the matmul and mlp examples? They're separate mainly because the convenient CLI is a little different in these two cases. We could refactor the common bits for sure. I'd leave this for another PR to keep the diff clearer to review.
Another PR for sure. Even moving this to Kernel Bench would be preferable to refactoring.
Extends the XeGPU MLP/matmul schedule with a cost model that can be used to generate valid tile sizes for a given (M, N, K) matmul shape.
- `XeGPUSpecs`: Object that contains GPU specifications required by the cost model.
- `XeGPUParameterSelector`: Param selector is now a class that uses `XeGPUSpecs` and can generate valid tile size configurations if the (M, N, K) case is not found in the existing parameter JSON file.
- `mlp_schedule` still takes a list of param dicts, one for each layer. However, only the `"m"`, `"n"`, `"k"` entries are required; if any parameter is missing, `XeGPUParameterSelector` is called to populate the tile sizes.
- `generate_configs` generates valid workgroup, subgroup, and k tile size configurations and estimates their performance based on a simple roofline model. Returns configs sorted by estimated performance.
- `generate_prefetch_tiles` generates all valid thread cooperative prefetch strategies, sorted by the number of cooperative threads. No performance estimate is provided.

Currently data types are assumed to be `float16` and `float32` for A/B and C, respectively. To be generalized later.

We can now execute any nicely-shaped matrix multiplication without the need to define tile sizes. If the matmul is compute-bound, performance should be decent.
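To illustrate what ranking tile configs with a simple roofline model can look like, here is a rough sketch. It is not the code in `matmul_costmodel.py`: the function names, the device numbers, the candidate tile sizes, and the traffic model (each workgroup re-reads its A and B panels from memory, with no cache reuse) are all assumptions chosen for illustration. It only covers workgroup tiles and uses the `float16` inputs / `float32` output assumption stated above.

```python
from itertools import product

# Hypothetical device numbers, for illustration only (not a real GPU spec).
PEAK_FLOPS = 100e12     # FLOP/s
PEAK_BANDWIDTH = 1e12   # bytes/s
A_B_BYTES = 2           # float16 inputs A and B
C_BYTES = 4             # float32 output C


def estimate_time(m, n, k, wg_m, wg_n):
    """Roofline estimate: the slower of compute time and memory time."""
    flops = 2 * m * n * k
    # Assumed traffic model: each workgroup loads a (wg_m x k) panel of A and
    # a (k x wg_n) panel of B, and C is written once.
    num_wg = (m // wg_m) * (n // wg_n)
    bytes_moved = num_wg * (wg_m + wg_n) * k * A_B_BYTES + m * n * C_BYTES
    return max(flops / PEAK_FLOPS, bytes_moved / PEAK_BANDWIDTH)


def rank_configs(m, n, k, candidates=(64, 128, 256)):
    """Enumerate workgroup tiles that divide the problem; sort by estimated time."""
    configs = [
        {"wg_m": wg_m, "wg_n": wg_n, "time": estimate_time(m, n, k, wg_m, wg_n)}
        for wg_m, wg_n in product(candidates, repeat=2)
        if m % wg_m == 0 and n % wg_n == 0
    ]
    return sorted(configs, key=lambda c: c["time"])


ranked = rank_configs(1024, 1024, 1024)
best = ranked[0]
```

Under these assumed numbers, larger workgroup tiles reduce redundant loads of A and B, so the estimate moves from memory-bound toward compute-bound as the tile grows. A real model would also need subgroup and k tile sizes plus the hardware validity constraints.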