Changelog

v2.8.0 (2026-01-20)

🚀 Major features

  • HAMi DRA is now supported. For details, see https://github.com/Project-HAMi/HAMi-DRA
  • Enable leader election among multiple schedulers (#1553)
  • Support CDI mode on NVIDIA devices (#1552)
  • Optimize HAMi WebUI. For details, see https://github.com/Project-HAMi/HAMi-WebUI
  • Sync with NVIDIA k8s-device-plugin v0.18.0 (#1541)
  • Add hami_build_info metrics and version print (#1581)
  • Watch and hot reload the updated certificate (#1573)

🐛 Major bug fixes

  • Update HAMi-core to fix vllm-related issues: #1381 #1461 by (@archlitchi) in #1478
  • Fix: Calculation error for quotas by (@luohua13) in #1400
  • Fix: vXPU feature may not work properly on P800 nodes (#1569)
  • Fix scheduler allocating an incorrect MIG instance (#1518)

📝 What's Changed

🔨 Other Changes

Committers: Contributors

Full Changelog: https://github.com/Project-HAMi/HAMi/compare/v2.7.1...v2.8.0

v2.7.1 (2025-11-07)

🚀 Major features

  • No major features in this release.

🐛 Major bug fixes

📝 What's Changed

🔨 Other Changes

Committers: 🆕 New Contributors

Full Changelog: https://github.com/Project-HAMi/HAMi/compare/v2.7.0...v2.7.1

v2.7.0 (2025-09-26)

🚀 Major features

🐛 Major bug fixes

  • fix: Before executing MIG partitioning, suppress NVML usage in o… by (@Goend) in #1095
  • Fix golint-CI by (@archlitchi) in #1127
  • fix: override node score failure for kunlun #1137 by (@ouyangluwei163) in #1138
  • fix: Multi-node scoring nodes are inaccurate by (@ouyangluwei163) in #1147
  • fix: An error occurred while create Iluvatar pod by (@ouyangluwei163) in #1149
  • Fix e2e CI by (@archlitchi) in #1165
  • fix: Add option for overwrite schedulerName by (@Shouren) in #1163
  • fix: using go-safecast to fix incorrect conversion of numbers by (@Shouren) in #1183
  • fix: deal with security issues reported by Trivy in image by (@Shouren) in #1189
  • fix: wrong Pod's UID and empty Pod's name in log of webhook.go by (@Shouren) in #1092
  • fix: concurrent map writes error in scheduler.calcScore #1269 by (@Shouren) in #1270
  • fix: release dangling node lock by (@peachest) in #1271
  • fix: incorrect NUMA node information retrieved (issue #1275) by (@abstractmj) in #1276
  • fix(security): resolve issues reported by Code scanning in Security by (@Shouren) in #1280
  • fix: fix golangci-lint error by (@DSFans2014) in #1319
  • Fix: device allocation missing containers with no device request by (@FouoF) in #1299
  • fix: update int8Slice to uint8Slice for better type clarity and consistency by (@yxxhero) in #1357

📝 What's Changed

📚 Documentation

  • documentation: add Known Issues for dynamic mig support by (@Goend) in #1122
  • docs: fix broken link by (@lixd) in #1125
  • clearly list supported devices doc references at README by (@FouoF) in #1155
  • docs: update ascend910b-support docs by (@DSFans2014) in #1321

🔨 Other Changes

  • Optimize Fit-in-device logic to make it device-specific by (@archlitchi) in #1097
  • feat(scheduler): make node lock timeout configurable by (@Kevinz857) in #1117
  • feature: mig mode-change #1116 by (@ouyangluwei163) in #1124
  • feat: Add new labels in .github/release.yml by (@Shouren) in #1066
  • feat(scheduler-role): use a scoped-down role for scheduler by (@Antvirf) in #1152
  • feat(helm): optionally disable admission webhook by (@Antvirf) in #1145
  • remove redundant metrics for vgpu allocation by (@FouoF) in #1169
  • refactor: clean up code and improve maintainability by (@Wangmin362) in #1195
  • refactor: Ranging over SplitSeq is more efficient by (@Shouren) in #1239
  • feat: NodeLockTimeout set from env by (@miaobyte) in #1244
  • refactor: move watchAndFeedback function to feedback.go by (@miaobyte) in #1248
  • feat: add informer-based pod cache to reduce API server load by (@miaobyte) in #1250
  • feat: Add option to disable device plugin at values.yaml. by (@FouoF) in #1274
  • refactor(util/nodelock): replace manual polling with k8s.io/client-go/util/retry by (@mayooot) in #1252
  • refactor: Remove annotation in Devices interfaces by (@Shouren) in #1343
  • feat: update the Ascend910 scheduling policy by (@DSFans2014) in #1344
  • feat(nvidia): default gpucores=100 when memory is exclusive and cores… by (@xrwang8) in #1354
  • Prerelease-v2.6 by (@archlitchi) in #1108
  • add new reviewers Shouren and ouyangluwei163 by (@wawa0210) in #1131
  • Support topology-awareness for Kunlunxin device by (@archlitchi) in #1121
  • Support Metax sGPU Qos Policy by (@Kyrie336) in #1123
  • add global image for chart by (@calvin0327) in #1133
  • fix: Skip admission webhook when Pod's scheduler is already assigned. by (@ghostloda) in #1041
  • Add node configs to docs by (@wylswz) in #1159
  • build(deps): upgrade golang to 1.24.4 by (@Shouren) in #1172
  • build(deps): Upgrade golang image in ci to 1.24.4 by (@Shouren) in #1176
  • build(deps): Upgrade controller-runtime to 0.21.0 by (@Shouren) in #1171
  • build(deps): Bump github.com/NVIDIA/nvidia-container-toolkit by (@Shouren) in #1170
  • Add unit tests for Fit Function for enflame,hygon, metax, mthreads, nvidia by (@Wangmin362) in #1199
  • [Misc] update hami-core version by (@chaunceyjiang) in #1201
  • Improve the impl of DevicePluginConfigs.Nodeconfig overwriting NvidiaConfig by (@FouoF) in #1158
  • Add unit tests for cambricon's Fit Function by (@Wangmin362) in #1198
  • Add unit tests for Ascend's Fit Function by (@Wangmin362) in #1197
  • Fix unnecessary repeated calculation when generating pod resource requests by (@litaixun) in #1215
  • Fix the log message wording when updating node annotations by (@litaixun) in #1214
  • If the memory requested for the MIG device is the same as the template value, it will result in CardNotFoundCustom Filter Rule. by (@zgqqiang) in #1179
  • updated dri section to combine text for better readability by (@mpetason) in #1216
  • feat: Add nvidia gpu topology scheduler by (@fyp711) in #1028
  • add issue translate robot by (@wawa0210) in #1232
  • add issue translate robot by (@wawa0210) in #1234
  • perf(util/nodelock): Use clientset Patch instead of Update. by (@mayooot) in #1192
  • Update hami-core and fix readme documents by (@archlitchi) in #1240
  • Update hami-core version to fix by (@archlitchi) in #1256
  • [Snyk] Security upgrade tensorflow/tensorflow from latest-gpu to 2.20.0rc0-gpu by (@wawa0210) in #1243
  • feat: Add an action of 'Close stale issue and PRs' in github workflow by (@Shouren) in #1083
  • Welcome fyp711 to become a HAMi member by (@wawa0210) in #1288
  • Add values readme by (@clcc2019) in #1267
  • Support Metax sGPU device health check by (@Kyrie336) in #1295
  • Optimize pkg/util.go and distribute logics to corresponding logics by (@archlitchi) in #1296
  • cleanup: Clear and correct ascend device name by (@FouoF) in #1315
  • bugfix: Nvidia card abnormal pod will still continue to schedule by (@zgqqiang) in #1336
  • Fix CI, add 910B4-1 template and fix vGPUmonitor metrics error by (@archlitchi) in #1345
  • add httpTargetPort to values.yaml by (@flpanbin) in #1356
  • Update kunlunxin documents by (@archlitchi) in #1366
  • update chart version and hami-core by (@archlitchi) in #1369

Committers: 🆕 New Contributors

Full Changelog: https://github.com/Project-HAMi/HAMi/compare/v2.6.1...v2.7.0

v2.6.0 (2025-06-07)

🚀 Major features

  • Optimize scheduler log
  • Support enflame gcu-share
  • Support metax GPU and metax sGPU
  • Helm chart add checksum annotation for restarting hami component after ConfigMap modification
  • Support for using RuntimeClass with nvidia devices
  • Add support for profiling via net/http/pprof package
  • Add nvidia gpu topology score registry to node
  • Feat: vGPUmonitor support MigInfo metrics

🐛 Major bug fixes

  • Fix getting stuck on driver 570+
  • Fix device memory not counted properly in comfyUI task
  • Fix cambricon devices not allocated properly
  • Fix wrong log and container request device count error
  • Fix inconsistent vgpu-devices-allocated annotations
  • Fix removing node devices from node manager
  • Fix: Dynamic GPU partitioning lacks single-GPU-level granularity
  • Fix device memory count error on cuMallocAsync
  • Fix scheduler crash when a 'mig' task accidentally runs on a 'hami-core' GPU
  • Fix multi-process device memory count

📝 What's Changed

⬆️ Dependencies

  • Bump docker/build-push-action from 6.11.0 to 6.13.0 by (@dependabot) in #837
  • Bump golang.org/x/net from 0.26.0 to 0.35.0 by (@dependabot) in #859
  • Bump aquasecurity/trivy-action from 0.29.0 to 0.30.0 by (@dependabot) in #941
  • Bump docker/login-action from 3.3.0 to 3.4.0 by (@dependabot) in #942
  • Bump docker/build-push-action from 6.13.0 to 6.15.0 by (@dependabot) in #899
  • build(deps): bump docker/build-push-action from 6.15.0 to 6.16.0 by (@dependabot) in #1024
  • build(deps): bump docker/build-push-action from 6.16.0 to 6.17.0 by (@dependabot) in #1052
  • build(deps): bump docker/build-push-action from 6.17.0 to 6.18.0 by (@dependabot) in #1091

🔨 Other Changes

  • fix: Enhance GPU metrics collection and error handling in vGPU monitor by (@haitwang-cloud) in #827
  • refactor: update service configurations for device plugin and scheduler by (@haitwang-cloud) in #799
  • add ut for scheduler/score by (@shijinye) in #853
  • add ut for device/metax by (@shijinye) in #850
  • Remove duplicate log fields by (@learner0810) in #860
  • [docs] Fix default nvidia.resourceCoreName value in config.md by (@chinaran) in #842
  • Update libvgpu.so by (@archlitchi) in #876
  • update example.png by (@rockpanda) in #874
  • support ascend 910B2 by (@ouyangluwei163) in #885
  • fix docs typos by (@JinVei) in #869
  • Accelerate node score calculations using multiple goroutines by (@learner0810) in #824
  • Support GPU sharing with Metax sGPU by (@Kyrie336) in #895
  • docs: fix broken community links by (@agilgur5) in #907
  • add config gpu core isolation policy for webhook by (@lengrongfu) in #901
  • feat: support scheduler replicas > 1 by (@Azusa-Yuan) in #898
  • docs: add syntax highlighting to various code blocks by (@agilgur5) in #906
  • Fix UT not being properly executed during CI phase by (@archlitchi) in #911
  • typo: fix typos in log and comment by (@popsiclexu) in #917
  • feat: Add kube-qps and kube-burst parameters. by (@chaunceyjiang) in #769
  • docs: Update MAINTAINERS file with current contributor information by (@Nimbus318) in #918
  • Nominate chaunceyjiang to reviewer by (@chaunceyjiang) in #926
  • build: update dependencies and remove unused cdiapi by (@yxxhero) in #903
  • add lengrongfu to reviewers by (@lengrongfu) in #937
  • chore: add namespace override for multi-namespace deployments by (@chinaran) in #924
  • fix: hygon dcu concurrent creation conflict by (@joy717) in #921
  • Fix the wrong description of device registry in protocol.md by (@hurricane1988) in #910
  • chore: helm chart support scheduler webhook cert-manager by (@chinaran) in #951
  • refactor(scheduler): replace init methods with constructor functions by (@yxxhero) in #905
  • add Dependencies policy and Security policy by (@yangshiqi) in #934
  • scheduler: fix blocking of the nodeNotify channel when a node changes by (@Iceber) in #964
  • docs: Update Ascend910 support documentation by (@zhaikangqi331) in #988
  • update iluvatar's docs by (@yangshiqi) in #995
  • refactor: replace interface{} with any in various files by (@yxxhero) in #1000
  • scheduler: fix duplicate handling of the node label selector by (@Iceber) in #965
  • refactor(.github/workflows/ci.yaml): Update golangci-lint to v2.0 and modify .golangci.yaml by (@yxxhero) in #1002
  • update hami arch by (@wawa0210) in #1007
  • Update README.md by (@yowenter) in #1005
  • refactor: simplify code by using modern constructs by (@Shouren) in #978
  • scheduler: fix removing node devices from node manager by (@Iceber) in #966
  • feat: Add support for profiling via net/http/pprof package by (@Shouren) in #963
  • Support Enflame gcushare for enflame devices by (@archlitchi) in #1013
  • docs: Remove ACTIVE_OOM_KILLER environment variable description by (@chinaran) in #1015
  • refactor(vGPUmonitor): change Run to RunE and return errors by (@yxxhero) in #999
  • refactored the filter logs and event messages to enhance their clarity by (@Wangmin362) in #1023
  • feat: Support for using RuntimeClass with nvidia devices by (@chinaran) in #1021
  • fix wrong log and container request device count error by (@Wangmin362) in #1020
  • feat: helm chart add checksum annotation for restarting hami component after ConfigMap modification by (@chinaran) in #1022
  • fix vgpu-devices-allocated annotations are inconsistent #991 by (@ouyangluwei163) in #1012
  • add Enflame GCU S60 into roadmap. by (@winston-zhang-orz) in #1030
  • add nvidia-smi command show cuda version info by (@lengrongfu) in #953
  • Separate options from client to make the responsibility more clear. by (@yangshiqi) in #938
  • Add nvidia gpu topology score registry to node by (@lengrongfu) in #1018
  • fix(cicd): update ci.yaml to upload coverage to Codecov by (@Shouren) in #1056
  • feat(Actions): Add an action to label pr automatically by (@Shouren) in #1053
  • fix: Improve Metax GPU usability and fix related issues by (@Kyrie336) in #1063
  • fix(chart): support GKE pre-release versions via kubeVersion '-0' by (@Nimbus318) in #1072
  • fix: Dynamic GPU partitioning lacks single-GPU-level granularity. (#1… by (@Goend) in #1061
  • update maintainer information by (@wawa0210) in #1079
  • add LIBCUDA_LOG_LEVEL env to device-plugin by (@lengrongfu) in #1087
  • fix: missing apiVersion in serviceMonitor dashboard docs by (@ntheanh201) in #1077
  • test(pkg/util): Add some unit tests for pkg/util by (@Shouren) in #1067
  • feat: vGPUmonitor support MigInfo metrics by (@ouyangluwei163) in #1048
  • update hami-core version by (@lengrongfu) in #1082

Committers: 🆕 New Contributors

Full Changelog: https://github.com/Project-HAMi/HAMi/compare/v2.5.3...v2.6.0

v2.5.3 (2025-08-05)

🚀 Major features

  • No major features in this release.

🐛 Major bug fixes

  • Bug fixes related to issues #1181, #1055, #1219, #1230, #1191

📝 What's Changed

🔨 Other Changes

Full Changelog: https://github.com/Project-HAMi/HAMi/compare/v2.5.2...v2.5.3

v2.5.2 (2025-05-26)

🚀 Major features

  • No major features in this release.

🐛 Major bug fixes

  • Fix device usage metrics (31992) not being accessible

📝 What's Changed

🔨 Other Changes

  • No other changes in this release.

Full Changelog: https://github.com/Project-HAMi/HAMi/compare/v2.5.1...v2.5.2

v2.5.1 (2025-05-06)

🚀 Major features

  • No major features in this release.

🐛 Major bug fixes

  • Fix: Update handling of version strings in Helm template and helpers.tpl by (@HJJ256) in #845
  • fix: Set passDeviceSpecsEnabled to false by default in device plugin by (@Nimbus318) in #872
  • fix: scheduler ignores KUBECONFIG env even if this environment variable is set by (@Shouren) in #681
  • fix: correct device filter initialization order by (@Nimbus318) in #857
  • fix parseNvidiaNumaInfo index out of range by (@flpanbin) in #889
  • Fix cambricon pods not being recognized by HAMi scheduler by (@archlitchi) in #947
  • fix ubuntu base image in Dockerfile.withlib by (@flpanbin) in #944
  • fix: Add error handling for nvml.Init in NvidiaDevicePlugin by (@yxxhero) in #982
  • Fix device memory count error on cuMallocAsync by (@archlitchi) in #1029

📝 What's Changed

🔨 Other Changes

  • Release v2.5 by (@archlitchi) in #1034
  • Update tag to v2.5.1 by (@archlitchi) in #1035
  • Fix: Update handling of version strings in Helm template and helpers.tpl by (@HJJ256) in #845
  • Update libvgpu.so by (@archlitchi) in #876
  • fix: Set passDeviceSpecsEnabled to false by default in device plugin by (@Nimbus318) in #872
  • fix: scheduler ignores KUBECONFIG env even if this environment variable is set by (@Shouren) in #681
  • fix: correct device filter initialization order by (@Nimbus318) in #857
  • fix parseNvidiaNumaInfo index out of range by (@flpanbin) in #889
  • Fix cambricon pods not being recognized by HAMi scheduler by (@archlitchi) in #947
  • fix ubuntu base image in Dockerfile.withlib by (@flpanbin) in #944
  • fix: Add error handling for nvml.Init in NvidiaDevicePlugin by (@yxxhero) in #982
  • Fix device memory count error on cuMallocAsync by (@archlitchi) in #1029
  • Bump golang.org/x/net from 0.26.0 to 0.33.0 by (@dependabot) in #839

Committers: Contributors

Full Changelog: https://github.com/Project-HAMi/HAMi/compare/v2.5.0...v2.5.1

v2.5.0 (2025-02-06)

🚀 Major features

  • Support the dynamic MIG feature; please refer to this document
  • Reinstalling HAMi will NOT crash GPU tasks
  • Put all configurations into a ConfigMap; you can customize the HAMi installation by modifying its content: see details

🐛 Major bug fixes

  • Fix an issue where hami-core gets stuck on tasks using 'cuMallocAsync'
  • Fix hami-core getting stuck on images with a high glibc version, like 'tf-serving:latest'

📝 What's Changed

⬆️ Dependencies

  • Bump aquasecurity/trivy-action from 0.28.0 to 0.29.0 by (@dependabot) in #631
  • Bump nvidia/cuda from 12.4.1-base-ubuntu22.04 to 12.6.3-base-ubuntu22.04 in /docker by (@dependabot) in #676
  • Bump actions/upload-artifact from 4.4.3 to 4.5.0 by (@dependabot) in #717
  • Bump docker/build-push-action from 6.9.0 to 6.10.0 by (@dependabot) in #644
  • Bump docker/build-push-action from 6.10.0 to 6.11.0 by (@dependabot) in #792

🔨 Other Changes

Committers: 🆕 New Contributors

Full Changelog: https://github.com/Project-HAMi/HAMi/compare/v2.4.1...v2.5.0