- HAMi DRA is now supported; for details, see https://github.com/Project-HAMi/HAMi-DRA
- Enable leader election among multiple schedulers (#1553)
- Support CDI mode on NVIDIA devices (#1552)
- Optimize HAMi WebUI; see https://github.com/Project-HAMi/HAMi-WebUI
- Sync with k8s-device-plugin from nvidia v0.18.0 (#1541)
- Add hami_build_info metrics and version print (#1581)
- Watch and hot reload the updated certificate (#1573)
- Update HAMi-core to fix vllm-related issues: #1381 #1461 by (@archlitchi) in #1478
- Fix: Calculation error for quotas by (@luohua13) in #1400
- Fix: vXPU feature may not work properly on P800 node (#1569)
- Fix scheduler allocating an incorrect MIG instance (#1518)
- Mock-device-plugin is now ready to use; see https://github.com/Project-HAMi/mock-device-plugin
- Ascend device plugin now supports the vNPU feature for both HAMi and Volcano; see https://github.com/Project-HAMi/ascend-device-plugin
- Refine Node Register logic (#1499)
- Update go version to v1.25.5
- Fix release CI by (@archlitchi) in #1373
- Fix: failed clusterrolebinding when changing release name or chart name by (@FouoF) in #1380
- fix: e2e ginkgo version mismatch by (@FouoF) in #1391
- fix: check pod nil in `ReleaseNodeLock` by (@DSFans2014) in #1372
- fix: upgrade nvidia-mig-parted to v0.12.2 to solve security issues by (@Shouren) in #1388
- fix: scheduler flaky test by (@FouoF) in #1402
- Fix: After removing the device plugin from the gpu node, it can still… by (@luohua13) in #1456
- Fix concurrent map iteration and map write fatal error. by (@litaixun) in #1452
- fix: fix typos by (@DSFans2014) in #1434
- Fix CI error of the PR #1470, #1326, #1033 by (@archlitchi) in #1473
- Fix concurrent map read write fatal error. by (@litaixun) in #1476
- add podInfos in DeviceUsage to enhance scheduling decision by (@Kyrie336) in #1362
- Update device-numa acquisition logic by (@archlitchi) in #1403
- Improved support for iluvatar GPUs by (@qiangwei1983) in #1399
- Improve: Replace `StrategicMergePatchType` by `MergePatchType` by (@luohua13) in #1431
- optimize schedule failure event by (@Kyrie336) in #1444
- archlitchi (@archlitchi)
- FouoF (@FouoF)
- DSFans2014 (@DSFans2014)
- Shouren (@Shouren)
- luohua13 (@luohua13)
- litaixun (@litaixun)
- Kyrie336 (@Kyrie336)
- qiangwei1983 (@qiangwei1983)
Full Changelog: https://github.com/Project-HAMi/HAMi/compare/v2.7.1...v2.8.0
- No major features in this release.
- Update HAMi-core to fix vllm-related issues: #1381 #1461 by (@archlitchi) in #1478
- Fix: Calculation error for quotas by (@luohua13) in #1400
- Fix release CI by (@archlitchi) in #1373
- Fix: failed clusterrolebinding when changing release name or chart name by (@FouoF) in #1380
- fix: e2e ginkgo version mismatch by (@FouoF) in #1391
- fix: check pod nil in `ReleaseNodeLock` by (@DSFans2014) in #1372
- fix: upgrade nvidia-mig-parted to v0.12.2 to solve security issues by (@Shouren) in #1388
- fix: scheduler flaky test by (@FouoF) in #1402
- Fix: After removing the device plugin from the gpu node, it can still… by (@luohua13) in #1456
- Fix concurrent map iteration and map write fatal error. by (@litaixun) in #1452
- fix: fix typos by (@DSFans2014) in #1434
- Fix CI error of the PR #1470, #1326, #1033 by (@archlitchi) in #1473
- Fix concurrent map read write fatal error. by (@litaixun) in #1476
- add podInfos in DeviceUsage to enhance scheduling decision by (@Kyrie336) in #1362
- Update device-numa acquisition logic by (@archlitchi) in #1403
- Improved support for iluvatar GPUs by (@qiangwei1983) in #1399
- Improve: Replace `StrategicMergePatchType` by `MergePatchType` by (@luohua13) in #1431
- optimize schedule failure event by (@Kyrie336) in #1444
- Release v2.7.1 by (@archlitchi) in #1480
- luohua13 (@luohua13)
- qiangwei1983 (@qiangwei1983)
- eltociear (@eltociear)
- daixiang0 (@daixiang0)
- zhegemingzimeibanquan (@zhegemingzimeibanquan)
Full Changelog: https://github.com/Project-HAMi/HAMi/compare/v2.7.0...v2.7.1
- Metax sGPU topology aware by (@Kyrie336) in #1193
- NVIDIA Resourcequota by (@FouoF) in #1359
- Kunlunxin topology-aware scheduling by (@FouoF) in #1141
- Kunlunxin vxpu support #1016 by (@ouyangluwei163) (@archlitchi) in #1337
- Enflame GCU topology-awareness (#1040) by (@zhaikangqi331) in #1334
- AWS-neuron device and device-core allocation by (@archlitchi) in #1238
- Aggregated Scheduling Failure Events by (@Wangmin362) in #1333
- fix: Before executing MIG partitioning, suppress NVML usage in o… by (@Goend) in #1095
- Fix golint-CI by (@archlitchi) in #1127
- fix: override node score failure for kunlun #1137 by (@ouyangluwei163) in #1138
- fix: Multi-node scoring nodes are inaccurate by (@ouyangluwei163) in #1147
- fix: An error occurred while creating Iluvatar pod by (@ouyangluwei163) in #1149
- Fix e2e CI by (@archlitchi) in #1165
- fix: Add option for overwrite schedulerName by (@Shouren) in #1163
- fix: using go-safecast to fix incorrect conversion of numbers by (@Shouren) in #1183
- fix: deal with security issues reported by Trivy in image by (@Shouren) in #1189
- fix: wrong Pod's UID and empty Pod's name in log of webhook.go by (@Shouren) in #1092
- fix: concurrent map writes error in scheduler.calcScore #1269 by (@Shouren) in #1270
- fix: release dangling node lock by (@peachest) in #1271
- fix: fix err which retrieved incorrect NUMA node information issue #1275 by (@abstractmj) in #1276
- fix(security): resolve issues reported by Code scanning in Security by (@Shouren) in #1280
- fix: fix golangci-lint error by (@DSFans2014) in #1319
- Fix: device allocation missing containers with no device request by (@FouoF) in #1299
- fix: update int8Slice to uint8Slice for better type clarity and consistency by (@yxxhero) in #1357
- documentation: add Known Issues for dynamic mig support by (@Goend) in #1122
- docs: fix broken link by (@lixd) in #1125
- clearly list supported devices doc references at README by (@FouoF) in #1155
- docs: update ascend910b-support docs by (@DSFans2014) in #1321
- Optimize Fit-in-device logic to make it device-specific by (@archlitchi) in #1097
- feat(scheduler): make node lock timeout configurable by (@Kevinz857) in #1117
- feature: mig mode-change #1116 by (@ouyangluwei163) in #1124
- feat: Add new labels in .github/release.yml by (@Shouren) in #1066
- feat(scheduler-role): use a scoped-down role for scheduler by (@Antvirf) in #1152
- feat(helm): optionally disable admission webhook by (@Antvirf) in #1145
- remove redundant metrics for vgpu allocation by (@FouoF) in #1169
- refactor: clean up code and improve maintainability by (@Wangmin362) in #1195
- refactor: Ranging over SplitSeq is more efficient by (@Shouren) in #1239
- feat:NodeLockTimeout set from env by (@miaobyte) in #1244
- refactor: move watchAndFeedback function to feedback.go by (@miaobyte) in #1248
- feat: add informer-based pod cache to reduce API server load by (@miaobyte) in #1250
- feat: Add option to disable device plugin at values.yaml. by (@FouoF) in #1274
- refactor(util/nodelock): replace manual polling with k8s.io/client-go/util/retry by (@mayooot) in #1252
- refactor: Remove annotation in Devices interfaces by (@Shouren) in #1343
- feat: update the `Ascend910` scheduling policy by (@DSFans2014) in #1344
- feat(nvidia): default gpucores=100 when memory is exclusive and cores… by (@xrwang8) in #1354
- Prerelease-v2.6 by (@archlitchi) in #1108
- add new reviewers Shouren and ouyangluwei163 by (@wawa0210) in #1131
- Support topology-awareness for Kunlunxin device by (@archlitchi) in #1121
- Support Metax sGPU Qos Policy by (@Kyrie336) in #1123
- add global image for chart by (@calvin0327) in #1133
- fix: Skip admission webhook when Pod's scheduler is already assigned. by (@ghostloda) in #1041
- Add node configs to docs by (@wylswz) in #1159
- build(deps): upgrade golang to 1.24.4 by (@Shouren) in #1172
- build(deps): Upgrade golang image in ci to 1.24.4 by (@Shouren) in #1176
- build(deps): Upgrade controller-runtime to 0.21.0 by (@Shouren) in #1171
- build(deps): Bump github.com/NVIDIA/nvidia-container-toolkit by (@Shouren) in #1170
- Add unit tests for Fit Function for enflame,hygon, metax, mthreads, nvidia by (@Wangmin362) in #1199
- [Misc] update hami-core version by (@chaunceyjiang) in #1201
- Improve the impl of DevicePluginConfigs.Nodeconfig overwriting NvidiaConfig by (@FouoF) in #1158
- Add unit tests for cambricon's Fit Function by (@Wangmin362) in #1198
- Add unit tests for Ascend's Fit Function by (@Wangmin362) in #1197
- Fix unnecessary repeated calculation when generating pod resource requests by (@litaixun) in #1215
- Fix the log message wording when updating node annotations by (@litaixun) in #1214
- If the memory requested for the MIG device is the same as the template value, it will result in CardNotFoundCustom Filter Rule. by (@zgqqiang) in #1179
- updated dri section to combine text for better readability by (@mpetason) in #1216
- feat: Add nvidia gpu topology scheduler by (@fyp711) in #1028
- add issue translate robot by (@wawa0210) in #1232
- add issue translate robot by (@wawa0210) in #1234
- perf(util/nodelock): Use clientset Patch instead of Update. by (@mayooot) in #1192
- Update hami-core and fix readme documents by (@archlitchi) in #1240
- Update hami-core version to fix by (@archlitchi) in #1256
- [Snyk] Security upgrade tensorflow/tensorflow from latest-gpu to 2.20.0rc0-gpu by (@wawa0210) in #1243
- feat: Add an action of 'Close stale issue and PRs' in github workflow by (@Shouren) in #1083
- Welcome fyp711 to become a HAMi member by (@wawa0210) in #1288
- Add values readme by (@clcc2019) in #1267
- Support Metax sGPU device health check by (@Kyrie336) in #1295
- Optimize pkg/util.go and distribute logics to corresponding logics by (@archlitchi) in #1296
- cleanup: Clear and correct ascend device name by (@FouoF) in #1315
- bugfix: Nvidia card abnormal pod will still continue to schedule by (@zgqqiang) in #1336
- Fix CI, add 910B4-1 template and fix vGPUmonitor metrics error by (@archlitchi) in #1345
- add httpTargetPort to values.yaml by (@flpanbin) in #1356
- Update kunlunxin documents by (@archlitchi) in #1366
- update chart version and hami-core by (@archlitchi) in #1369
- Kevinz857 (@Kevinz857)
- FouoF (@FouoF)
- Antvirf (@Antvirf)
- wylswz (@wylswz)
- litaixun (@litaixun)
- zgqqiang (@zgqqiang)
- mpetason (@mpetason)
- fyp711 (@fyp711)
- mayooot (@mayooot)
- miaobyte (@miaobyte)
- peachest (@peachest)
- abstractmj (@abstractmj)
- clcc2019 (@clcc2019)
- DSFans2014 (@DSFans2014)
- xrwang8 (@xrwang8)
Full Changelog: https://github.com/Project-HAMi/HAMi/compare/v2.6.1...v2.7.0
- Optimize scheduler log
- Support enflame gcu-share
- Support metax GPU and metax sGPU
- Helm chart add checksum annotation for restarting hami component after ConfigMap modification
- Support for using RuntimeClass with nvidia devices
- Add support for profiling via net/http/pprof package
- Add nvidia gpu topology score registry to node
- Feat: vGPUmonitor support MigInfo metrics
- Fix stuck in driver 570+
- Fix device memory not counted properly in comfyUI task
- Fix cambricon devices not allocated properly
- Fix wrong log and container request device count error
- Fix vgpu-devices-allocated annotations are inconsistent
- Fix removing node devices from node manager
- Fix: Dynamic GPU partitioning lacks single-GPU-level granularity
- Fix device memory count error on cuMallocAsync
- Fix scheduler crash if a 'mig' task running accidentally on a 'hami-core' GPU
- Fix multi-process device memory count
- Bump docker/build-push-action from 6.11.0 to 6.13.0 by (@dependabot) in #837
- Bump golang.org/x/net from 0.26.0 to 0.35.0 by (@dependabot) in #859
- Bump aquasecurity/trivy-action from 0.29.0 to 0.30.0 by (@dependabot) in #941
- Bump docker/login-action from 3.3.0 to 3.4.0 by (@dependabot) in #942
- Bump docker/build-push-action from 6.13.0 to 6.15.0 by (@dependabot) in #899
- build(deps): bump docker/build-push-action from 6.15.0 to 6.16.0 by (@dependabot) in #1024
- build(deps): bump docker/build-push-action from 6.16.0 to 6.17.0 by (@dependabot) in #1052
- build(deps): bump docker/build-push-action from 6.17.0 to 6.18.0 by (@dependabot) in #1091
- fix: Enhance GPU metrics collection and error handling in vGPU monitor by (@haitwang-cloud) in #827
- refactor: update service configurations for device plugin and scheduler by (@haitwang-cloud) in #799
- add ut for scheduler/score by (@shijinye) in #853
- add ut for device/metax by (@shijinye) in #850
- Remove duplicate log fields by (@learner0810) in #860
- [docs] Fix default nvidia.resourceCoreName value in config.md by (@chinaran) in #842
- Update libvgpu.so by (@archlitchi) in #876
- update example.png by (@rockpanda) in #874
- support ascend 910B2 by (@ouyangluwei163) in #885
- fix docs typos by (@JinVei) in #869
- Accelerate node score calculations using multiple goroutines by (@learner0810) in #824
- Support Metax sGPU for GPU sharing by (@Kyrie336) in #895
- docs: fix broken community links by (@agilgur5) in #907
- add config gpu core isolation policy for webhook by (@lengrongfu) in #901
- feat: support scheduler replicas > 1 by (@Azusa-Yuan) in #898
- docs: add syntax highlighting to various code blocks by (@agilgur5) in #906
- Fix UT not being properly executed during CI phase by (@archlitchi) in #911
- typo: fix typos in log and comment by (@popsiclexu) in #917
- feat: Add kube-qps and kube-burst parameters. by (@chaunceyjiang) in #769
- docs: Update MAINTAINERS file with current contributor information by (@Nimbus318) in #918
- Nominate chaunceyjiang to reviewer by (@chaunceyjiang) in #926
- build: update dependencies and remove unused cdiapi by (@yxxhero) in #903
- add lengrongfu to reviewers by (@lengrongfu) in #937
- chore: add namespace override for multi-namespace deployments by (@chinaran) in #924
- fix: hygon dcu concurrent creation conflict by (@joy717) in #921
- Fix the wrong description of device registry in protocol.md by (@hurricane1988) in #910
- chore: helm chart support scheduler webhook cert-manager by (@chinaran) in #951
- refactor(scheduler): replace init methods with constructor functions by (@yxxhero) in #905
- add Dependencies policy and Security policy by (@yangshiqi) in #934
- scheduler: fix blocked the nodeNotify channel when node changes by (@Iceber) in #964
- docs: Update Ascend910 support documentation by (@zhaikangqi331) in #988
- update iluvatar's docs by (@yangshiqi) in #995
- refactor: replace interface{} with any in various files by (@yxxhero) in #1000
- scheduler: fix duplicate handling of the node label selector by (@Iceber) in #965
- refactor(.github/workflows/ci.yaml): Update golangci-lint to v2.0 and modify .golangci.yaml by (@yxxhero) in #1002
- update hami arch by (@wawa0210) in #1007
- Update README.md by (@yowenter) in #1005
- refactor: simplify code by using modern constructs by (@Shouren) in #978
- scheduler: fix removing node devices from node manager by (@Iceber) in #966
- feat: Add support for profiling via net/http/pprof package by (@Shouren) in #963
- Support Enflame gcushare for enflame devices by (@archlitchi) in #1013
- docs: Remove ACTIVE_OOM_KILLER environment variable description by (@chinaran) in #1015
- refactor(vGPUmonitor): change Run to RunE and return errors by (@yxxhero) in #999
- refactored the filter logs and event messages to enhance their clarity by (@Wangmin362) in #1023
- feat: Support for using RuntimeClass with nvidia devices by (@chinaran) in #1021
- fix wrong log and container request device count error by (@Wangmin362) in #1020
- feat: helm chart add checksum annotation for restarting hami component after ConfigMap modification by (@chinaran) in #1022
- fix vgpu-devices-allocated annotations are inconsistent #991 by (@ouyangluwei163) in #1012
- add Enflame GCU S60 into roadmap. by (@winston-zhang-orz) in #1030
- add nvidia-smi command show cuda version info by (@lengrongfu) in #953
- Separate options from client to make the responsibility more clear. by (@yangshiqi) in #938
- Add nvidia gpu topology score registry to node by (@lengrongfu) in #1018
- fix(cicd): update ci.yaml to upload coverage to Codecov by (@Shouren) in #1056
- feat(Actions): Add an action to label pr automatically by (@Shouren) in #1053
- fix: Improve Metax GPU usability and fix related issues by (@Kyrie336) in #1063
- fix(chart): support GKE pre-release versions via kubeVersion '-0' by (@Nimbus318) in #1072
- fix: Dynamic GPU partitioning lacks single-GPU-level granularity. (#1… by (@Goend) in #1061
- update maintainer information by (@wawa0210) in #1079
- add LIBCUDA_LOG_LEVEL env to device-plugin by (@lengrongfu) in #1087
- fix: missing apiVersion in serviceMonitor dashboard docs by (@ntheanh201) in #1077
- test(pkg/util): Add some unit tests for pkg/util by (@Shouren) in #1067
- feat: vGPUmonitor support MigInfo metrics by (@ouyangluwei163) in #1048
- update hami-core version by (@lengrongfu) in #1082
- rockpanda (@rockpanda)
- ouyangluwei163 (@ouyangluwei163)
- JinVei (@JinVei)
- Shouren (@Shouren)
- Kyrie336 (@Kyrie336)
- agilgur5 (@agilgur5)
- Azusa-Yuan (@Azusa-Yuan)
- popsiclexu (@popsiclexu)
- hurricane1988 (@hurricane1988)
- Iceber (@Iceber)
- zhaikangqi331 (@zhaikangqi331)
- yowenter (@yowenter)
- Wangmin362 (@Wangmin362)
- winston-zhang-orz (@winston-zhang-orz)
- Goend (@Goend)
- ntheanh201 (@ntheanh201)
Full Changelog: https://github.com/Project-HAMi/HAMi/compare/v2.5.3...v2.6.0
- No major features in this release.
- Bug fixes related to issues #1181, #1055, #1219, #1230, #1191
- Release v2.5.1 - fix e2e workflow by (@archlitchi) in #1037
- Release v2.5.2 by (@archlitchi) in #1080
Full Changelog: https://github.com/Project-HAMi/HAMi/compare/v2.5.2...v2.5.3
- No major features in this release.
- Fix device usage metrics(31992) can't be accessed
- No other changes in this release.
Full Changelog: https://github.com/Project-HAMi/HAMi/compare/v2.5.1...v2.5.2
- No major features in this release.
- Fix: Update handling of version strings in Helm template and helpers.tpl by (@HJJ256) in #845
- fix: Set passDeviceSpecsEnabled to false by default in device plugin by (@Nimbus318) in #872
- fix: scheduler ignores KUBECONFIG env even if this environment variable is set by (@Shouren) in #681
- fix: correct device filter initialization order by (@Nimbus318) in #857
- fix parseNvidiaNumaInfo index out of range by (@flpanbin) in #889
- Fix cambricon pods not being recognized by HAMi scheduler by (@archlitchi) in #947
- fix ubuntu base image in Dockerfile.withlib by (@flpanbin) in #944
- fix: Add error handling for nvml.Init in NvidiaDevicePlugin by (@yxxhero) in #982
- Fix device memory count error on cuMallocAsync by (@archlitchi) in #1029
- Release v2.5 by (@archlitchi) in #1034
- Update tag to v2.5.1 by (@archlitchi) in #1035
- Update libvgpu.so by (@archlitchi) in #876
- Bump golang.org/x/net from 0.26.0 to 0.33.0 by (@dependabot) in #839
- archlitchi (@archlitchi)
- HJJ256 (@HJJ256)
- Nimbus318 (@Nimbus318)
- Shouren (@Shouren)
- flpanbin (@flpanbin)
- yxxhero (@yxxhero)
- dependabot (@dependabot)
Full Changelog: https://github.com/Project-HAMi/HAMi/compare/v2.5.0...v2.5.1
- Support dynamic mig feature, please refer to this document
- Reinstalling HAMi will NOT crash GPU tasks
- All configurations are now placed in a ConfigMap; you can customize the HAMi installation by modifying its content: see details
- Fix an issue where hami-core gets stuck on tasks using 'cuMallocAsync'
- Fix hami-core stuck on high glib images, like 'tf-serving:latest'
- Bump aquasecurity/trivy-action from 0.28.0 to 0.29.0 by (@dependabot) in #631
- Bump nvidia/cuda from 12.4.1-base-ubuntu22.04 to 12.6.3-base-ubuntu22.04 in /docker by (@dependabot) in #676
- Bump actions/upload-artifact from 4.4.3 to 4.5.0 by (@dependabot) in #717
- Bump docker/build-push-action from 6.9.0 to 6.10.0 by (@dependabot) in #644
- Bump docker/build-push-action from 6.10.0 to 6.11.0 by (@dependabot) in #792
- Fix Kubernetes version string handling by stripping metadata by (@Nimbus318) in #623
- Update vGPUmonitor to add dynamic adjustment on core and memory limit by (@archlitchi) in #624
- feat: support device plugin daemonset update strategy by (@devenami) in #628
- add ut about schedule policy by (@yt-huang) in #638
- Fix: Refactor the license based on the approaches used in OpenSearch and ElasticSearch. by (@haitwang-cloud) in #626
- add ut for the scheduler by (@shijinye) in #645
- docs(issue-tmpl): add FAQ link to issue templates by (@Nimbus318) in #647
- fix: filter device registry to node by (@lengrongfu) in #639
- Add self-hosted runner by (@archlitchi) in #659
- fix-example-yaml by (@WQL782795) in #667
- update docs by (@yangshiqi) in #668
- add ut for ascend by (@shijinye) in #664
- optimization map init in test by (@lengrongfu) in #678
- Optimize monitor by (@for800000) in #683
- fix code lint failed by (@lengrongfu) in #685
- fix(helm): Add NODE_NAME env var to the vgpu-monitor container from spec.nodeName by (@Nimbus318) in #687
- fix vGPUmonitor deviceidx is always 0 by (@lengrongfu) in #684
- add ut for pkg/scheduler/event.go by (@Penguin-zlh) in #688
- add ut for nodes by (@shijinye) in #695
- add license for pkg/scheduler/event_test.go by (@Penguin-zlh) in #706
- fix: exception happen when creating multiple ascend-gpu pods concurrently by (@lijm87) in #575
- add ut for device/nvidia by (@shijinye) in #657
- add ut for pkg/monitor/nvidia/v0/spec.go by (@yt-huang) in #670
- Enable Dynamic-mig feature for HAMi by (@archlitchi) in #708
- Fix chart can not be deployed properly by (@archlitchi) in #711
- Fix NodeLock issue by (@archlitchi) in #714
- fix example yaml by (@lixd) in #709
- add ut for device/cambricon by (@shijinye) in #712
- Update dynamic mig documents and examples by (@archlitchi) in #718
- random time may be zero by (@shijinye) in #697
- fix grafana dashboard and clarify dashboard usage more clearly. by (@jiangsanyin) in #543
- doc(README): add examples for GPU sharing and update-examples by (@xiaoyao) in #665
- add ut for github.com/Project-HAMi/HAMi/pkg/scheduler/pod.go by (@yt-huang) in #673
- Add design document to 'dynamic-mig' feature by (@archlitchi) in #725
- fix(doc): fix a typo and resolve markdown warnings in the tasklist by (@elrondwong) in #724
- add ut for pkg/util/nodelock/nodelock.go by (@learner0810) in #719
- test: add ut for pkg/version/version.go by (@Penguin-zlh) in #677
- Update on mig mode by (@archlitchi) in #726
- Update documents for config & config_cn by (@archlitchi) in #729
- set PASS_DEVICE_SPECS ENV to device-plugin by (@jingzhe6414) in #690
- fix device-plugin-version by (@learner0810) in #743
- feat: Return the nodes that failed to be scheduled back to the scheduler by (@chaunceyjiang) in #746
- fix(log): fix missing log output in nvidiadeviceplugin server by (@elrondwong) in #735
- support configuration resources limits and requests by (@flpanbin) in #739
- feat(test): add TestMarshalNodeDevices scenarios by (@elrondwong) in #747
- print flags for device-plugin and scheduler by (@flpanbin) in #756
- Fix typos, add more contributors and maintainers. by (@yangshiqi) in #765
- Add a mind map(Chinese and English) to help understand this project by (@oceanweave) in #764
- [Docs] update config pages by (@windsonsea) in #760
- add ut for device-map by (@KubeKyrie) in #762
- refactor(ci): use go.mod file for Go version in workflows by (@yxxhero) in #766
- support set log level for device plugin by (@flpanbin) in #771
- feat: Restart/Upgrade device-plugin will not affect services. by (@chaunceyjiang) in #767
- add ut nvml devices by (@KubeKyrie) in #773
- add ut for device-map by (@KubeKyrie) in #772
- Optimize the time format layout by (@learner0810) in #741
- fix: nvidia-device-plugin no version info by (@chaunceyjiang) in #779
- HAMi supports e2e by (@Rei1010) in #775
- Proposal: enable E2E test by (@Rei1010) in #633
- add ut for device/iluvatar by (@shijinye) in #795
- add ut for device/hygon by (@shijinye) in #787
- add ut for pkg/monitor/nvidia/v1 by (@shijinye) in #780
- refactor(logging): enhance log messages for device resource counting by (@haitwang-cloud) in #778
- Enrich pod health check by (@Rei1010) in #801
- docs: fix broken link by (@lixd) in #802
- Optimize the E2E execution logic by (@Rei1010) in #803
- optimize MetricsBindAddress to MetricsBindPort by (@phoenixwu0229) in #796
- fix: handle the node nil issue & E2E test failure by (@haitwang-cloud) in #804
- add ut for device/mthreads by (@shijinye) in #808
- fix: Resolve formatting issue in ConfigMap causing display anomalies by (@lixd) in #814
- [docs] Update ascend910b-support.md by (@windsonsea) in #816
- Refine metrics logs by (@haitwang-cloud) in #817
- Update mig-related logics and refine logs by (@archlitchi) in #833
- Add 910B4 config to device-configmap for ascend by (@lijm87) in #828
- [docs] fix: glibc version requirement in README by (@chinaran) in #826
- Update HAMi-core for v2.5.0 by (@archlitchi) in #834
- Fix multi-process device memory count issue by (@archlitchi) in #835
- bump version to v2.5.0 by (@wawa0210) in #836
- Fix CI by (@archlitchi) in #838
- Fix CI release by (@archlitchi) in #840
- Fix release ci by (@archlitchi) in #841
- Fix Dockerfile to make CI pass by (@archlitchi) in #846
- Fix E2E failure with pod status check by (@Rei1010) in #847
- Fix scheduler crash if a 'mig' task running accidentally on a 'hami-core' GPU by (@archlitchi) in #848
- yt-huang (@yt-huang)
- shijinye (@shijinye)
- WQL782795 (@WQL782795)
- yangshiqi (@yangshiqi)
- for800000 (@for800000)
- Penguin-zlh (@Penguin-zlh)
- lixd (@lixd)
- jiangsanyin (@jiangsanyin)
- xiaoyao (@xiaoyao)
- elrondwong (@elrondwong)
- learner0810 (@learner0810)
- jingzhe6414 (@jingzhe6414)
- flpanbin (@flpanbin)
- oceanweave (@oceanweave)
- windsonsea (@windsonsea)
- KubeKyrie (@KubeKyrie)
- yxxhero (@yxxhero)
- Rei1010 (@Rei1010)
- phoenixwu0229 (@phoenixwu0229)
- chinaran (@chinaran)
Full Changelog: https://github.com/Project-HAMi/HAMi/compare/v2.4.1...v2.5.0