HAMi + prometheus-k8s + grafana实现vgpu虚拟化监控
最近长沙跑了半个多月,跟甲方客户对了下项目指标,许久没更新
回来后继续研究如何实现 grafana实现HAMi vgpu虚拟化监控,毕竟合同里写了需要体现gpu资源限制和算力共享以及体现算力卡资源共享监控
先说下为啥要用HAMi吧, 一个重要原因是公司有人引见了这个工具的作者, 很多问题我都可以直接向作者提问
HAMi,是一个国产的GPU与国产加速卡(支持的GPU与国产加速卡型号与具体特性请查看此项目官网:https://github.com/Project-HAMi/HAMi/)虚拟化开源项目,实现以kubernetes为基础的容器场景下GPU或加速卡虚拟化。HAMi原名“k8s-vGPU-scheduler”,
最初由我司开源,现已在国内与国际上愈加流行,是管理Kubernetes中异构设备的中间件。它可以管理不同类型的异构设备(如GPU、NPU等),在Pod之间共享异构设备,根据设备的拓扑信息和调度策略做出更好的调度决策。为了阐述的简明性,本文只提供一种可行的办法,最终实现使用prometheus抓取监控指标并作为数据源、使用grafana来展示监控信息的目的。
本文假定已经部署好Kubernetes集群、HAMi。以下涉及到的相关组件都是在kubernetes集群内安装的,相关组件或软件版本信息如下:
组件或软件名称 | 版本 | 备注 |
---|---|---|
kubernetes集群 | v1.23.1 | AMD64构架服务器环境下 |
HAMi | 根据向开源作者提问,当前HAMi版本发行机制还不够成熟,暂以安装HAMi的scheduler.kubeScheduler.imageTag 参数值为其版本,此值要跟kubernetes版本看齐 | 项目地址:https://github.com/Project-HAMi/HAMi/ |
kube-prometheus stack | prom/prometheus:v2.27.1 | 关于监控的安装参见实现prometheus+grafana的监控部署_prometheus grafana监控部署-CSDN博客 |
dcgm-exporter | nvcr.io/nvidia/k8s/dcgm-exporter:3.3.9-3.6.1-ubuntu22.04 |
HAMi 的默认安装方式是通过helm,添加Helm仓库:
helm repo add hami-charts https://project-hami.github.io/HAMi/
检查Kubernetes版本并安装HAMi(服务器版本为1.23.1):
helm install hami hami-charts/hami --set scheduler.kubeScheduler.imageTag=v1.23.1 -n kube-system
验证hami安装成功
kubectl get pods -n kube-system
确认hami-device-plugin和hami-scheduler都处于Running状态表示安装成功。
把helm安装转为hami-install.yaml
helm template hami hami-charts/hami --set scheduler.kubeScheduler.imageTag=v1.23.1 -n kube-system > hami-install.yaml
该格式部署
---
# Source: hami/templates/device-plugin/monitorserviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:name: hami-device-pluginnamespace: "kube-system"labels:app.kubernetes.io/component: "hami-device-plugin"helm.sh/chart: hami-2.4.0app.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamiapp.kubernetes.io/version: "2.4.0"app.kubernetes.io/managed-by: Helm
---
# Source: hami/templates/scheduler/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:name: hami-schedulernamespace: "kube-system"labels:app.kubernetes.io/component: "hami-scheduler"helm.sh/chart: hami-2.4.0app.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamiapp.kubernetes.io/version: "2.4.0"app.kubernetes.io/managed-by: Helm
---
# Source: hami/templates/device-plugin/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:name: hami-device-pluginlabels:app.kubernetes.io/component: hami-device-pluginhelm.sh/chart: hami-2.4.0app.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamiapp.kubernetes.io/version: "2.4.0"app.kubernetes.io/managed-by: Helm
data:config.json: |{"nodeconfig": [{"name": "m5-cloudinfra-online02","devicememoryscaling": 1.8,"devicesplitcount": 10,"migstrategy":"none","filterdevices": {"uuid": [],"index": []}}]}
---
# Source: hami/templates/scheduler/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:name: hami-schedulerlabels:app.kubernetes.io/component: hami-schedulerhelm.sh/chart: hami-2.4.0app.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamiapp.kubernetes.io/version: "2.4.0"app.kubernetes.io/managed-by: Helm
data:config.json: |{"kind": "Policy","apiVersion": "v1","extenders": [{"urlPrefix": "https://127.0.0.1:443","filterVerb": "filter","bindVerb": "bind","enableHttps": true,"weight": 1,"nodeCacheCapable": true,"httpTimeout": 30000000000,"tlsConfig": {"insecure": true},"managedResources": [{"name": "nvidia.com/gpu","ignoredByScheduler": true},{"name": "nvidia.com/gpumem","ignoredByScheduler": true},{"name": "nvidia.com/gpucores","ignoredByScheduler": true},{"name": "nvidia.com/gpumem-percentage","ignoredByScheduler": true},{"name": "nvidia.com/priority","ignoredByScheduler": true},{"name": "cambricon.com/vmlu","ignoredByScheduler": true},{"name": "hygon.com/dcunum","ignoredByScheduler": true},{"name": "hygon.com/dcumem","ignoredByScheduler": true },{"name": "hygon.com/dcucores","ignoredByScheduler": true},{"name": "iluvatar.ai/vgpu","ignoredByScheduler": true}],"ignoreable": false}]}
---
# Source: hami/templates/scheduler/configmapnew.yaml
apiVersion: v1
kind: ConfigMap
metadata:name: hami-scheduler-newversionlabels:app.kubernetes.io/component: hami-schedulerhelm.sh/chart: hami-2.4.0app.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamiapp.kubernetes.io/version: "2.4.0"app.kubernetes.io/managed-by: Helm
data:config.yaml: |apiVersion: kubescheduler.config.k8s.io/v1kind: KubeSchedulerConfigurationleaderElection:leaderElect: falseprofiles:- schedulerName: hami-schedulerextenders:- urlPrefix: "https://127.0.0.1:443"filterVerb: filterbindVerb: bindnodeCacheCapable: trueweight: 1httpTimeout: 30senableHTTPS: truetlsConfig:insecure: truemanagedResources:- name: nvidia.com/gpuignoredByScheduler: true- name: nvidia.com/gpumemignoredByScheduler: true- name: nvidia.com/gpucoresignoredByScheduler: true- name: nvidia.com/gpumem-percentageignoredByScheduler: true- name: nvidia.com/priorityignoredByScheduler: true- name: cambricon.com/vmluignoredByScheduler: true- name: hygon.com/dcunumignoredByScheduler: true- name: hygon.com/dcumemignoredByScheduler: true- name: hygon.com/dcucoresignoredByScheduler: true- name: iluvatar.ai/vgpuignoredByScheduler: true
---
# Source: hami/templates/scheduler/device-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:name: hami-scheduler-devicelabels:app.kubernetes.io/component: hami-schedulerhelm.sh/chart: hami-2.4.0app.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamiapp.kubernetes.io/version: "2.4.0"app.kubernetes.io/managed-by: Helm
data:device-config.yaml: |-nvidia:resourceCountName: nvidia.com/gpuresourceMemoryName: nvidia.com/gpumemresourceMemoryPercentageName: nvidia.com/gpumem-percentageresourceCoreName: nvidia.com/gpucoresresourcePriorityName: nvidia.com/priorityoverwriteEnv: falsedefaultMemory: 0defaultCores: 0defaultGPUNum: 1deviceSplitCount: 10deviceMemoryScaling: 1deviceCoreScaling: 1cambricon:resourceCountName: cambricon.com/vmluresourceMemoryName: cambricon.com/mlu.smlu.vmemoryresourceCoreName: cambricon.com/mlu.smlu.vcorehygon:resourceCountName: hygon.com/dcunumresourceMemoryName: hygon.com/dcumemresourceCoreName: hygon.com/dcucoresmetax:resourceCountName: "metax-tech.com/gpu"mthreads:resourceCountName: "mthreads.com/vgpu"resourceMemoryName: "mthreads.com/sgpu-memory"resourceCoreName: "mthreads.com/sgpu-core"iluvatar: resourceCountName: iluvatar.ai/vgpuresourceMemoryName: iluvatar.ai/vcuda-memoryresourceCoreName: iluvatar.ai/vcuda-corevnpus:- chipName: 910BcommonWord: Ascend910AresourceName: huawei.com/Ascend910AresourceMemoryName: huawei.com/Ascend910A-memorymemoryAllocatable: 32768memoryCapacity: 32768aiCore: 30templates:- name: vir02memory: 2184aiCore: 2- name: vir04memory: 4369aiCore: 4- name: vir08memory: 8738aiCore: 8- name: vir16memory: 17476aiCore: 16- chipName: 910B3commonWord: Ascend910BresourceName: huawei.com/Ascend910BresourceMemoryName: huawei.com/Ascend910B-memorymemoryAllocatable: 65536memoryCapacity: 65536aiCore: 20aiCPU: 7templates:- name: vir05_1c_16gmemory: 16384aiCore: 5aiCPU: 1- name: vir10_3c_32gmemory: 32768aiCore: 10aiCPU: 3- chipName: 310P3commonWord: Ascend310PresourceName: huawei.com/Ascend310PresourceMemoryName: huawei.com/Ascend310P-memorymemoryAllocatable: 21527memoryCapacity: 24576aiCore: 8aiCPU: 7templates:- name: vir01memory: 3072aiCore: 1aiCPU: 1- name: vir02memory: 6144aiCore: 2aiCPU: 2- name: vir04memory: 12288aiCore: 4aiCPU: 4
---
# Source: hami/templates/device-plugin/monitorrole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:name: hami-device-plugin-monitor
rules:- apiGroups:- ""resources:- podsverbs:- get- create- watch- list- update- patch- apiGroups:- ""resources:- nodesverbs:- get- update- list- patch
---
# Source: hami/templates/device-plugin/monitorrolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:name: hami-device-pluginlabels:app.kubernetes.io/component: "hami-device-plugin"helm.sh/chart: hami-2.4.0app.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamiapp.kubernetes.io/version: "2.4.0"app.kubernetes.io/managed-by: Helm
roleRef:apiGroup: rbac.authorization.k8s.iokind: ClusterRole#name: cluster-adminname: hami-device-plugin-monitor
subjects:- kind: ServiceAccountname: hami-device-pluginnamespace: "kube-system"
---
# Source: hami/templates/scheduler/rolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:name: hami-schedulerlabels:app.kubernetes.io/component: "hami-scheduler"helm.sh/chart: hami-2.4.0app.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamiapp.kubernetes.io/version: "2.4.0"app.kubernetes.io/managed-by: Helm
roleRef:apiGroup: rbac.authorization.k8s.iokind: ClusterRolename: cluster-admin
subjects:- kind: ServiceAccountname: hami-schedulernamespace: "kube-system"
---
# Source: hami/templates/device-plugin/monitorservice.yaml
apiVersion: v1
kind: Service
metadata:name: hami-device-plugin-monitorlabels:app.kubernetes.io/component: hami-device-pluginhelm.sh/chart: hami-2.4.0app.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamiapp.kubernetes.io/version: "2.4.0"app.kubernetes.io/managed-by: Helm
spec:externalTrafficPolicy: Localselector:app.kubernetes.io/component: hami-device-plugintype: NodePortports:- name: monitorportport: 31992targetPort: 9394nodePort: 31992
---
# Source: hami/templates/scheduler/service.yaml
apiVersion: v1
kind: Service
metadata:name: hami-schedulerlabels:app.kubernetes.io/component: hami-schedulerhelm.sh/chart: hami-2.4.0app.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamiapp.kubernetes.io/version: "2.4.0"app.kubernetes.io/managed-by: Helm
spec:type: NodePortports:- name: httpport: 443targetPort: 443nodePort: 31998protocol: TCP- name: monitorport: 31993targetPort: 9395nodePort: 31993protocol: TCPselector:app.kubernetes.io/component: hami-schedulerapp.kubernetes.io/name: hamiapp.kubernetes.io/instance: hami
---
# Source: hami/templates/device-plugin/daemonsetnvidia.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:name: hami-device-pluginlabels:app.kubernetes.io/component: hami-device-pluginhelm.sh/chart: hami-2.4.0app.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamiapp.kubernetes.io/version: "2.4.0"app.kubernetes.io/managed-by: Helm
spec:selector:matchLabels:app.kubernetes.io/component: hami-device-pluginapp.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamitemplate:metadata:labels:app.kubernetes.io/component: hami-device-pluginhami.io/webhook: ignoreapp.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamispec:imagePullSecrets: []serviceAccountName: hami-device-pluginpriorityClassName: system-node-criticalhostPID: truehostNetwork: truecontainers:- name: device-pluginimage: projecthami/hami:latestimagePullPolicy: "IfNotPresent"lifecycle:postStart:exec:command: ["/bin/sh","-c", "cp -f /k8s-vgpu/lib/nvidia/* /usr/local/vgpu/"]command:- nvidia-device-plugin- --config-file=/device-config.yaml- --mig-strategy=none- --disable-core-limit=false- -v=falseenv:- name: NODE_NAMEvalueFrom:fieldRef:fieldPath: spec.nodeName- name: NVIDIA_MIG_MONITOR_DEVICESvalue: all- name: HOOK_PATHvalue: /usr/localsecurityContext:allowPrivilegeEscalation: falsecapabilities:drop: ["ALL"]add: ["SYS_ADMIN"]volumeMounts:- name: device-pluginmountPath: /var/lib/kubelet/device-plugins- name: libmountPath: /usr/local/vgpu- name: usrbinmountPath: /usrbin- name: deviceconfigmountPath: /config- name: hosttmpmountPath: /tmp- name: device-configmountPath: /device-config.yamlsubPath: device-config.yaml- name: vgpu-monitorimage: projecthami/hami:latestimagePullPolicy: "IfNotPresent"command: ["vGPUmonitor"]securityContext:allowPrivilegeEscalation: falsecapabilities:drop: ["ALL"]add: ["SYS_ADMIN"]env:- name: NVIDIA_VISIBLE_DEVICESvalue: "all"- name: NVIDIA_MIG_MONITOR_DEVICESvalue: all- name: HOOK_PATHvalue: /usr/local/vgpu volumeMounts:- name: ctrsmountPath: /usr/local/vgpu/containers- name: dockersmountPath: /run/docker- name: containerdsmountPath: /run/containerd- name: sysinfomountPath: /sysinfo- name: hostvarmountPath: /hostvarvolumes:- name: ctrshostPath:path: /usr/local/vgpu/containers- name: hosttmphostPath:path: /tmp- name: dockershostPath:path: /run/docker- name: containerdshostPath:path: /run/containerd- name: device-pluginhostPath:path: /var/lib/kubelet/device-plugins- name: libhostPath:path: /usr/local/vgpu- name: usrbinhostPath:path: /usr/bin- name: sysinfohostPath:path: /sys- name: hostvarhostPath:path: /var- name: deviceconfigconfigMap:name: hami-device-plugin- name: device-configconfigMap:name: hami-scheduler-devicenodeSelector: gpu: "on"
---
# Source: hami/templates/scheduler/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:name: hami-schedulerlabels:app.kubernetes.io/component: hami-schedulerhelm.sh/chart: hami-2.4.0app.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamiapp.kubernetes.io/version: "2.4.0"app.kubernetes.io/managed-by: Helm
spec:replicas: 1selector:matchLabels:app.kubernetes.io/component: hami-schedulerapp.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamitemplate:metadata:labels:app.kubernetes.io/component: hami-schedulerapp.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamihami.io/webhook: ignorespec:imagePullSecrets: []serviceAccountName: hami-schedulerpriorityClassName: system-node-criticalcontainers:- name: kube-schedulerimage: registry.cn-hangzhou.aliyuncs.com/google_containers/kube-scheduler:v1.31.0imagePullPolicy: "IfNotPresent"command:- kube-scheduler- --config=/config/config.yaml- -v=4- --leader-elect=true- --leader-elect-resource-name=hami-scheduler- --leader-elect-resource-namespace=kube-systemvolumeMounts:- name: scheduler-configmountPath: /config- name: vgpu-scheduler-extenderimage: projecthami/hami:latestimagePullPolicy: "IfNotPresent"env:command:- scheduler- --http_bind=0.0.0.0:443- --cert_file=/tls/tls.crt- --key_file=/tls/tls.key- --scheduler-name=hami-scheduler- --metrics-bind-address=:9395- --node-scheduler-policy=binpack- --gpu-scheduler-policy=spread- --device-config-file=/device-config.yaml- --debug- -v=4ports:- name: httpcontainerPort: 443protocol: TCPvolumeMounts:- name: tls-configmountPath: /tls- name: device-configmountPath: /device-config.yamlsubPath: device-config.yamlvolumes:- name: tls-configsecret:secretName: hami-scheduler-tls- name: scheduler-configconfigMap:name: hami-scheduler-newversion- name: device-configconfigMap:name: hami-scheduler-device
---
# Source: hami/templates/scheduler/webhook.yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:name: hami-webhook
webhooks:- admissionReviewVersions:- v1beta1clientConfig:service:name: hami-schedulernamespace: kube-systempath: /webhookport: 443failurePolicy: IgnorematchPolicy: Equivalentname: vgpu.hami.ionamespaceSelector:matchExpressions:- key: hami.io/webhookoperator: NotInvalues:- ignoreobjectSelector:matchExpressions:- key: hami.io/webhookoperator: NotInvalues:- ignorereinvocationPolicy: Neverrules:- apiGroups:- ""apiVersions:- v1operations:- CREATEresources:- podsscope: '*'sideEffects: NonetimeoutSeconds: 10
---
# Source: hami/templates/scheduler/job-patch/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:name: hami-admissionannotations:"helm.sh/hook": pre-install,pre-upgrade,post-install,post-upgrade"helm.sh/hook-delete-policy": before-hook-creation,hook-succeededlabels:helm.sh/chart: hami-2.4.0app.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamiapp.kubernetes.io/version: "2.4.0"app.kubernetes.io/managed-by: Helmapp.kubernetes.io/component: admission-webhook
---
# Source: hami/templates/scheduler/job-patch/clusterrole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:name: hami-admissionannotations:"helm.sh/hook": pre-install,pre-upgrade,post-install,post-upgrade"helm.sh/hook-delete-policy": before-hook-creation,hook-succeededlabels:helm.sh/chart: hami-2.4.0app.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamiapp.kubernetes.io/version: "2.4.0"app.kubernetes.io/managed-by: Helmapp.kubernetes.io/component: admission-webhook
rules:- apiGroups:- admissionregistration.k8s.ioresources:#- validatingwebhookconfigurations- mutatingwebhookconfigurationsverbs:- get- update
---
# Source: hami/templates/scheduler/job-patch/clusterrolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:name: hami-admissionannotations:"helm.sh/hook": pre-install,pre-upgrade,post-install,post-upgrade"helm.sh/hook-delete-policy": before-hook-creation,hook-succeededlabels:helm.sh/chart: hami-2.4.0app.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamiapp.kubernetes.io/version: "2.4.0"app.kubernetes.io/managed-by: Helmapp.kubernetes.io/component: admission-webhook
roleRef:apiGroup: rbac.authorization.k8s.iokind: ClusterRolename: hami-admission
subjects:- kind: ServiceAccountname: hami-admissionnamespace: "kube-system"
---
# Source: hami/templates/scheduler/job-patch/role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:name: hami-admissionannotations:"helm.sh/hook": pre-install,pre-upgrade,post-install,post-upgrade"helm.sh/hook-delete-policy": before-hook-creation,hook-succeededlabels:helm.sh/chart: hami-2.4.0app.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamiapp.kubernetes.io/version: "2.4.0"app.kubernetes.io/managed-by: Helmapp.kubernetes.io/component: admission-webhook
rules:- apiGroups:- ""resources:- secretsverbs:- get- create
---
# Source: hami/templates/scheduler/job-patch/rolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:name: hami-admissionannotations:"helm.sh/hook": pre-install,pre-upgrade,post-install,post-upgrade"helm.sh/hook-delete-policy": before-hook-creation,hook-succeededlabels:helm.sh/chart: hami-2.4.0app.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamiapp.kubernetes.io/version: "2.4.0"app.kubernetes.io/managed-by: Helmapp.kubernetes.io/component: admission-webhook
roleRef:apiGroup: rbac.authorization.k8s.iokind: Rolename: hami-admission
subjects:- kind: ServiceAccountname: hami-admissionnamespace: "kube-system"
---
# Source: hami/templates/scheduler/job-patch/job-createSecret.yaml
apiVersion: batch/v1
kind: Job
metadata:name: hami-admission-createannotations:"helm.sh/hook": pre-install,pre-upgrade"helm.sh/hook-delete-policy": before-hook-creation,hook-succeededlabels:helm.sh/chart: hami-2.4.0app.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamiapp.kubernetes.io/version: "2.4.0"app.kubernetes.io/managed-by: Helmapp.kubernetes.io/component: admission-webhook
spec:template:metadata:name: hami-admission-createlabels:helm.sh/chart: hami-2.4.0app.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamiapp.kubernetes.io/version: "2.4.0"app.kubernetes.io/managed-by: Helmapp.kubernetes.io/component: admission-webhookhami.io/webhook: ignorespec:imagePullSecrets: []containers:- name: createimage: liangjw/kube-webhook-certgen:v1.1.1imagePullPolicy: IfNotPresentargs:- create- --cert-name=tls.crt- --key-name=tls.key- --host=hami-scheduler.kube-system.svc,127.0.0.1- --namespace=kube-system- --secret-name=hami-scheduler-tlsrestartPolicy: OnFailureserviceAccountName: hami-admissionsecurityContext:runAsNonRoot: truerunAsUser: 2000
---
# Source: hami/templates/scheduler/job-patch/job-patchWebhook.yaml
apiVersion: batch/v1
kind: Job
metadata:name: hami-admission-patchannotations:"helm.sh/hook": post-install,post-upgrade"helm.sh/hook-delete-policy": before-hook-creation,hook-succeededlabels:helm.sh/chart: hami-2.4.0app.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamiapp.kubernetes.io/version: "2.4.0"app.kubernetes.io/managed-by: Helmapp.kubernetes.io/component: admission-webhook
spec:template:metadata:name: hami-admission-patchlabels:helm.sh/chart: hami-2.4.0app.kubernetes.io/name: hamiapp.kubernetes.io/instance: hamiapp.kubernetes.io/version: "2.4.0"app.kubernetes.io/managed-by: Helmapp.kubernetes.io/component: admission-webhookhami.io/webhook: ignorespec:imagePullSecrets: []containers:- name: patchimage: liangjw/kube-webhook-certgen:v1.1.1imagePullPolicy: IfNotPresentargs:- patch- --webhook-name=hami-webhook- --namespace=kube-system- --patch-validating=false- --secret-name=hami-scheduler-tlsrestartPolicy: OnFailureserviceAccountName: hami-admissionsecurityContext:runAsNonRoot: truerunAsUser: 2000
部署dcgm-exporter
apiVersion: apps/v1
kind: DaemonSet
metadata:name: "dcgm-exporter"labels:app.kubernetes.io/name: "dcgm-exporter"app.kubernetes.io/version: "3.6.1"
spec:updateStrategy:type: RollingUpdateselector:matchLabels:app.kubernetes.io/name: "dcgm-exporter"app.kubernetes.io/version: "3.6.1"template:metadata:labels:app.kubernetes.io/name: "dcgm-exporter"app.kubernetes.io/version: "3.6.1"name: "dcgm-exporter"spec:containers:- image: "nvcr.io/nvidia/k8s/dcgm-exporter:3.3.9-3.6.1-ubuntu22.04"env:- name: "DCGM_EXPORTER_LISTEN"value: ":9400"- name: "DCGM_EXPORTER_KUBERNETES"value: "true"name: "dcgm-exporter"ports:- name: "metrics"containerPort: 9400securityContext:runAsNonRoot: falserunAsUser: 0capabilities:add: ["SYS_ADMIN"]volumeMounts:- name: "pod-gpu-resources"readOnly: truemountPath: "/var/lib/kubelet/pod-resources"volumes:- name: "pod-gpu-resources"hostPath:path: "/var/lib/kubelet/pod-resources"---kind: Service
apiVersion: v1
metadata:name: "dcgm-exporter"labels:app.kubernetes.io/name: "dcgm-exporter"app.kubernetes.io/version: "3.6.1"
spec:selector:app.kubernetes.io/name: "dcgm-exporter"app.kubernetes.io/version: "3.6.1"ports:- name: "metrics"port: 9400
dcgm-exporter安装成功
参考这个hami-vgpu dashboard 下载panel 的json文件
hami-vgpu-dashboard | Grafana Labs 导入后grafana中将创建一个名为“hami-vgpu-dashboard”的dashboard,但此页面中有一些Panel如vGPUCorePercentage还没有数据
ServiceMonitor
是 Prometheus Operator 中的一个自定义资源,主要用于监控 Kubernetes 中的服务。它的作用包括:
1. 自动化发现
ServiceMonitor
允许 Prometheus 自动发现和监控 Kubernetes 中的服务。通过定义 ServiceMonitor
,您可以告诉 Prometheus 监控特定服务的端点。
2. 配置抓取参数
您可以在 ServiceMonitor
中设置抓取的相关参数,例如:
- 抓取间隔:定义 Prometheus 多频繁抓取数据(如每 30 秒)。
- 超时:定义抓取请求的超时时间。
- 标签选择器:指定要监控的服务的标签,确保 Prometheus 仅抓取相关服务的数据。
dcgm-exporter需要配置两个service monitor
hami-device-plugin-svc-monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:name: hami-device-plugin-svc-monitornamespace: kube-system
spec:selector:matchLabels:app.kubernetes.io/component: hami-device-pluginnamespaceSelector:matchNames:- kube-systemendpoints:- path: /metricsport: monitorportinterval: "15s"honorLabels: falserelabelings:- sourceLabels: [__meta_kubernetes_endpoints_name]regex: hami-.*replacement: $1action: keep- sourceLabels: [__meta_kubernetes_pod_node_name]regex: (.*)targetLabel: node_namereplacement: ${1}action: replace- sourceLabels: [__meta_kubernetes_pod_host_ip]regex: (.*)targetLabel: ipreplacement: $1action: replace
hami-scheduler-svc-monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:name: hami-scheduler-svc-monitornamespace: kube-system
spec:selector:matchLabels:app.kubernetes.io/component: hami-schedulernamespaceSelector:matchNames:- kube-systemendpoints:- path: /metricsport: monitorinterval: "15s"honorLabels: falserelabelings:- sourceLabels: [__meta_kubernetes_endpoints_name]regex: hami-.*replacement: $1action: keep- sourceLabels: [__meta_kubernetes_pod_node_name]regex: (.*)targetLabel: node_namereplacement: ${1}action: replace- sourceLabels: [__meta_kubernetes_pod_host_ip]regex: (.*)targetLabel: ipreplacement: $1action: replace
确认创建的ServiceMonitor
启动gpu pod一个测试下
apiVersion: v1
kind: Pod
metadata:name: gpu-pod-1
spec:restartPolicy: Nevercontainers:- name: cuda-containerimage: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.2.1command: ["sleep", "infinity"]resources:limits:nvidia.com/gpu: 1nvidia.com/gpumem: 1000nvidia.com/gpucores: 10
如果看到pod一直pending 状态
检查下节点如果出现下面gpu为0的情况
需要
docker:1:下载NVIDIA-DOCKER2安装包并安装2:修改/etc/docker/daemon.json文件内容加上{"default-runtime": "nvidia","runtimes": {"nvidia": {"path": "/usr/bin/nvidia-container-runtime","runtimeArgs": []}},}k8s:1:下载k8s-device-plugin 镜像2:编写nvidia-device-plugin.yml创建驱动pod
使用这个yml进行创建
apiVersion: apps/v1
kind: DaemonSet
metadata:name: nvidia-device-plugin-daemonsetnamespace: kube-system
spec:selector:matchLabels:name: nvidia-device-plugin-dsupdateStrategy:type: RollingUpdatetemplate:metadata:labels:name: nvidia-device-plugin-dsspec:tolerations:- key: nvidia.com/gpuoperator: Existseffect: NoSchedulepriorityClassName: "system-node-critical"containers:- image: nvidia/k8s-device-plugin:1.11name: nvidia-device-plugin-ctrenv:- name: FAIL_ON_INIT_ERRORvalue: "false"securityContext:allowPrivilegeEscalation: falsecapabilities:drop: ["ALL"]volumeMounts:- name: device-pluginmountPath: /var/lib/kubelet/device-pluginsvolumes:- name: device-pluginhostPath:path: /var/lib/kubelet/device-plugins
gpu pod启动后进入查看下, gpu内存和限制的大小相同设置成功
访问下{scheduler node ip}:31993/metrics
日志最后有两行
vGPUPodsDeviceAllocated{containeridx="0",deviceusedcore="40",deviceuuid="GPU-7666e9de-679b-a768-51c6-260b81cd00ec",nodename="192.168.110.126",podname="gpu-pod-1",podnamespace="default",zone="vGPU"} 1.048576e+10 vGPUPodsDeviceAllocated{containeridx="0",deviceusedcore="40",deviceuuid="GPU-7666e9de-679b-a768-51c6-260b81cd00ec",nodename="192.168.110.126",podname="gpu-pod-2",podnamespace="default",zone="vGPU"} 1.048576e+10
可以看到相同deviceuuid的gpu被不同pod共享使用
exec进入hami-device-plugin daemonset里面执行nvidia-smi -L 可以看到机器上所有显卡的信息
root@node126:/# nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 4090 (UUID: GPU-7666e9de-679b-a768-51c6-260b81cd00ec)
GPU 1: NVIDIA GeForce RTX 4090 (UUID: GPU-9f32af29-1a72-6e47-af2c-72b1130a176b)
root@node126:/#
之前创建的两个serviceMonitor会去请求
app.kubernetes.io/component: hami-scheduler 和app.kubernetes.io/component: hami-device-plugin 的/metrics 接口获取数据
当gpu-pod跑起来以后查看hami-vgpu-metrics-dashboard
相关文章:
![](https://i-blog.csdnimg.cn/direct/fae4cead91d04aaebc52671bb5e770e8.png)
HAMi + prometheus-k8s + grafana实现vgpu虚拟化监控
最近长沙跑了半个多月,跟甲方客户对了下项目指标,许久没更新 回来后继续研究如何实现 grafana实现HAMi vgpu虚拟化监控,毕竟合同里写了需要体现gpu资源限制和算力共享以及体现算力卡资源共享监控 先说下为啥要用HAMi吧, 一个重要原…...
![](https://i-blog.csdnimg.cn/img_convert/3f122012bc8a7aa7f52b21b14e30b0ae.png)
Java基于SSM框架的在线视频教育系统小程序【附源码、文档】
博主介绍:✌IT徐师兄、7年大厂程序员经历。全网粉丝15W、csdn博客专家、掘金/华为云//InfoQ等平台优质作者、专注于Java技术领域和毕业项目实战✌ 🍅文末获取源码联系🍅 👇🏻 精彩专栏推荐订阅👇dz…...
![](https://i-blog.csdnimg.cn/direct/a61b643327dc4926a9f80221ea9bc3d7.png#pic_center)
mysql本地安装和pycharm链接数据库操作
MySQL本地安装和相关操作 Python相关:基础、函数、数据类型、面向、模块。 前端开发:HTML、CSS、JavaScript、jQuery。【静态页面】 Java前端; Python前端; Go前端 -> 【动态页面】直观: 静态,写死了…...
![](https://www.ngui.cc/images/no-images.jpg)
Unity编程与游戏开发-编程与游戏开发的关系
游戏开发是一个复杂的多领域合作过程,涵盖了从创意构思到最终实现的多个方面。在这个过程中,技术、设计与美术三大核心要素相互交织,缺一不可。在游戏开发的过程中,Unity作为一款强大的跨平台游戏引擎,凭借其高效的开发工具和庞大的社区支持,成为了很多游戏开发者的首选工…...
![](https://www.ngui.cc/images/no-images.jpg)
2025年第三届“华数杯”国际赛A题解题思路与代码(Python版)
游泳竞技策略优化模型代码详解 第一题:速度优化模型 在这一部分,我们将详细解析如何通过数学建模来优化游泳运动员在不同距离比赛中的速度分配策略。 1. 模型概述 我们的模型主要包含三个核心文件: speed_optimization.py: 速度优化的核…...
![](https://i-blog.csdnimg.cn/img_convert/bcf12740fed791f3486b091a9bf0a047.gif)
针对服务器磁盘爆满,MySql数据库始终无法启动,怎么解决
(点击即可进入聊天助手) 很多站长在运营网站的过程当中都会遇到一个问题,就是网站突然无法打开,数据一直无法启动 无论是强制重启还是,删除网站内的所有应用,数据库一直无法启动 这个时候,就需要常见的运维手段了,需要对服务器后台各个资源,进行逐一排查…...
![](https://www.ngui.cc/images/no-images.jpg)
[Android]service命令的使用
在前面的讨论中,我们说到,如果在客户端懒得使用aidl文件生成的接口类进行binder,可以使用IBinder的transcat方法 Parcel dataParcel = Parcel.obtain(); Parcel resultParcel = Parcel.obtain();dataParcel.writeInterfaceToken(DESCRIPTOR);//发起请求 aProxyBinder.trans…...
![](https://i-blog.csdnimg.cn/direct/45aeecbb6c154745a96074994ce6fe0e.png#pic_center)
【芯片封测学习专栏 -- Substrate | RDL Interposer | Si Interposer | 嵌入式硅桥(EMIB)详细介绍】
请阅读【嵌入式开发学习必备专栏 Cache | MMU | AMBA BUS | CoreSight | Trace32 | CoreLink | ARM GCC | CSH】 文章目录 OverviewSubstrate(衬底或基板)Substrate 定义Substrate 特点与作用Substrate 实例 RDL Interposer(重布线层中介层&a…...
![](https://www.ngui.cc/images/no-images.jpg)
spring cloud注册nacos并从nacos上拉取配置文件,spring cloud不会自动读取bootstrap.yml文件
目录 踩坑问题记录前言版本说明spring cloudb不会自动读取bootstrap.yml文件问题解决spring cloud注册nacos并从nacos上拉取配置文件后话 踩坑问题记录 1、spring cloudb不会自动读取bootstrap.yml文件 2、spring cloud注册nacos并从nacos上拉取配置文件 前言 使用cloud Ali…...
![](https://i-blog.csdnimg.cn/direct/3842805b0bc04fe3bbc89a2231034e87.png)
【深度学习地学应用|滑坡制图、变化检测、多目标域适应、感知学习、深度学习】跨域大尺度遥感影像滑坡制图方法:基于原型引导的领域感知渐进表示学习(一)
【深度学习地学应用|滑坡制图、变化检测、多目标域适应、感知学习、深度学习】跨域大尺度遥感影像滑坡制图方法:基于原型引导的领域感知渐进表示学习(一) 【深度学习地学应用|滑坡制图、变化检测、多目标域适应、感知学习、深度学习】跨域大…...
![](https://i-blog.csdnimg.cn/img_convert/72d37fc5cba2d645a308c2fe760eb2d8.png)
Spring Boot 支持哪些日志框架
Spring Boot 支持多种日志框架,主要包括以下几种: SLF4J (Simple Logging Facade for Java) Logback(默认)Log4j 2Java Util Logging (JUL) 其中,Spring Boot 默认使用 SLF4J 和 Logback 作为日志框架。如果你需要使…...
![](https://i-blog.csdnimg.cn/direct/10a584b1706a495397db10c0078212e3.png)
【翻译】2025年华数杯国际赛数学建模题目+翻译pdf自取
保存至本地网盘 链接:https://pan.quark.cn/s/f82a1fa7ed87 提取码:6UUw 2025年“华数杯”国际大学生数学建模竞赛比赛时间于2025年1月11日(周六)06:00开始,至1月15日(周三)09:00结束ÿ…...
![](https://i-blog.csdnimg.cn/direct/29eacd711dc64a278b6934b3568ad2b9.png)
qt 窗口(window/widget)绘制/渲染顺序 QPainter QPaintDevice Qpainter渲染 失效 无效 原因
qt窗体布局 窗体渲染过程 qt中窗体渲染逻辑顺序为 本窗体->子窗体/控件 递归,也就是说先渲染父窗体再渲染子窗体。其中子窗体按加入时的先后顺序进行渲染。通过下方的函数调用堆栈可以看出窗体都是在widget组件源码的widgetprivate::drawwidget中进行渲染的&am…...
![](https://i-blog.csdnimg.cn/direct/7cd750dd9cfa44039bfeb59c198076e0.png)
TIOBE编程语言排行靠前的编程语言的吉祥物
Python的吉祥物:小蟒蛇 Python语言的吉祥物是一只名叫"Pythonidae"(或简称"Py")的小蟒蛇。这个吉祥物由Tobias Kohn设计于2005年,它的形象借鉴了真实的蟒蛇,但加入了一些可爱和友善的特点。小蟒蛇…...
![](https://i-blog.csdnimg.cn/direct/21166ad279654c1b82cbf180a44cf1bc.gif)
【前端动效】HTML + CSS 实现打字机效果
目录 1. 效果展示 2. 思路分析 2.1 难点 2.2 实现思路 3. 代码实现 3.1 html部分 3.2 css部分 3.3 完整代码 4. 总结 1. 效果展示 如图所示,这次带来的是一个有趣的“擦除”效果,也可以叫做打字机效果,其中一段文本从左到右逐渐从…...
![](https://i-blog.csdnimg.cn/direct/83f3c09ca3e54dbb8fdcd14a8cb59457.png)
大疆上云API连接遥控器和无人机
文章目录 1、部署大疆上云API关于如何连接我们自己部署的上云API2、开启无人机和遥控器并连接自己部署的上云API如果遥控器和无人机没有对频的情况下即只有遥控器没有无人机的情况下如果遥控器和无人机已经对频好了的情况下 4、订阅无人机或遥控器的主题信息4.1、订阅无人机实时…...
![](https://i-blog.csdnimg.cn/direct/50d4fc49cce54243bf9167762ac72def.png)
JS逆向-atob和btoa分析
声明:本文只作学习研究,禁止用于非法用途,否则后果自负,如有侵权,请告知删除,谢谢! 故事是这样的,有位读者朋友需要模拟登录一个网站: aHR0cDovL3d3dy56bGRzai5jb20v 我…...
![](https://i-blog.csdnimg.cn/direct/d1e68c3cf9244237be1f5c1a2879f0a3.png#pic_center)
primitive 编写着色器材质
import { nextTick, onMounted, ref } from vue import * as Cesium from cesium import gsap from gsaponMounted(() > { ... })// 1、创建矩形几何体,Cesium.RectangleGeometry:几何体,Rectangle:矩形 let rectGeometry new…...
![](https://i-blog.csdnimg.cn/direct/d8871c64be5b48fb89d7153674a6c28f.jpeg)
计算机视觉算法实战——车道线检测
✨个人主页欢迎您的访问 ✨期待您的三连 ✨ ✨个人主页欢迎您的访问 ✨期待您的三连 ✨ ✨个人主页欢迎您的访问 ✨期待您的三连✨ 车道线检测是计算机视觉领域的一个重要研究方向,尤其在自动驾驶和高级驾驶辅助…...
网络安全-安全散列函数,信息摘要SHA-1,MD5原理
安全散列函数 单向散列函数或者安全散列函数之所以重要,不仅在于消息认证(消息摘要。数据指纹)。还有数字签名(加强版的消息认证)和验证数据的完整性。常见的单向散列函数有MD5和SHA 散列函数的要求 散列函数的目的是文件、消息或者其它数据…...
![](https://i-blog.csdnimg.cn/blog_migrate/1150ee702154b551455cbfb92c616f74.png)
树莓派-5-GPIO的应用实验之GPIO的编码方式和SDK介绍
文章目录 1 GPIO编码方式1.1 管脚信息1.2 使用场合1.3 I2C总线1.4 SPI总线2 RPI.GPIO2.1 PWM脉冲宽度调制2.2 静态函数2.2.1 函数setmode()2.2.2 函数setup()2.2.3 函数output()2.2.4 函数input()2.2.5 捕捉引脚的电平改变2.2.5.1 函数wait_for_edge()2.2.5.2 函数event_detect…...
![](https://i-blog.csdnimg.cn/direct/8094c8f713fd42479d18bc7c43bb77ce.png)
《零基础Go语言算法实战》【题目 2-10】接口的实现
《零基础Go语言算法实战》 【题目 2-10】接口的实现 请指出下面代码中存在的问题。 type Programmer struct { Name string } func (p *Programmer) String() string { return fmt.Sprintf("print: %v", p) } func main() { p : &Programmer{} p.String()…...
![](https://i-blog.csdnimg.cn/img_convert/cd92d02848604a3c52ce0d3b1cb32c2b.png)
Win32汇编学习笔记10.OD插件
Win32汇编学习笔记10.OD插件-C/C基础-断点社区-专业的老牌游戏安全技术交流社区 - BpSend.net 筛选器异常插件 被调试程序: 📎TestUnh.zip 我们用OD条试试发现,无法断下 筛选器异常 异常产生之后 异常首先会给调试器 调试器不处理就会给 SEH , SEH 不处理的话有又给…...
![](https://www.ngui.cc/images/no-images.jpg)
在vscode中已经安装了插件Live Server,但是在命令面板确找不到
1、VS Code缓存问题: 有时VS Code的缓存可能导致插件无法正确加载。尝试删除VS Code缓存文件夹(如C:\Users\你的用户名\AppData\Roaming\Code)中的文件,并重启VS Code。 2、重新安装插件: 尝试卸载Live S…...
![](https://www.ngui.cc/images/no-images.jpg)
C# SQL ASP.NET Web
留学生的课程答疑 按照要求完成程序设计、数据库设计、用户手册等相关技术文档; 要求 1. 计算机相关专业,本科以上学历,至少有1年以上工作经验或实习经历。 2. 熟练掌握WinForm程序开发,或ASP.NET Web编程。 3. 熟悉C#中网络…...
![](https://www.ngui.cc/images/no-images.jpg)
联想java开发面试题及参考答案
IP 协议是哪一层的? IP 协议(Internet Protocol)属于网络层协议。 网络层主要负责将数据从源节点传输到目标节点,它在整个网络通信体系中起到了承上启下的关键作用。在分层网络模型中,下层(如数据链路层)为网络层提供物理链路的连接和帧传输服务。数据链路层关注的是在相…...
![](https://www.ngui.cc/images/no-images.jpg)
Node.js中的fs模块:文件与目录操作(写入、读取、复制、移动、删除、重命名等)
在Node.js中,fs模块提供了多种方法来处理文件和目录操作,使得数据的持久性保存和文件管理变得简单。下面将介绍文件读写、文件复制、文件移动、文件重命名、文件删除、文件夹创建与删除以及查看资源状态等常用操作。 首先,在使用写入和读取功…...
![](https://i-blog.csdnimg.cn/img_convert/293b6258969cb6dfe70741e02bcad261.png)
代码的形状:重构的方向
大概2周前写了一篇《代码的形状:从外到内的探索与实践》 涵树:代码的形状:从外到内的探索与实践 觉得这个话题还可以继续,它是一个从无形到有形的过程,而这个过程感觉就是王阳明先生说的“心即理”的探寻过程。 我讨论代码的形状ÿ…...
![](https://www.ngui.cc/images/no-images.jpg)
2021 年 3 月青少年软编等考 C 语言五级真题解析
目录 T1. 红与黑思路分析T2. 密室逃脱思路分析T3. 求逆序对数思路分析T4. 最小新整数思路分析T1. 红与黑 有一间长方形的房子,地上铺了红色、黑色两种颜色的正方形瓷砖。你站在其中一块黑色的瓷砖上,只能向相邻的黑色瓷砖移动。请写一个程序,计算你总共能够到达多少块黑色的…...
![](https://www.ngui.cc/images/no-images.jpg)
华为C语言编程规范总结
1.头文件更改会导致所有直接或间接包含该头文件的的C文件重新编译,会增加大量编译工作量,延长编译时间,因此: 1.1 头文件里尽量少包含头文件 1.2 头文件应向稳定的方向包含 2.每一个.c文件应有一个同名.h文件,…...
![](/images/no-images.jpg)
做暧昧免费视频大全网站/网络广告人社区
5712. 你能构造出连续值的最大数目 此题同时具备dp和贪心的双重思维。秒啊! class Solution { public:int getMaximumConsecutive(vector<int>& coins) {sort(coins.begin(),coins.end());int x 0;for(int y:coins){if(y > x1) break;x y;}return x…...
![](/images/no-images.jpg)
网站开发语言windows/行者seo
1. 问题描述: 给定一个非空二叉树, 返回一个由每层节点平均值组成的数组。 示例 1: 输入: 3 / \ 9 20 / \ 15 7 输出:[3, 14.5, 11] 解释: 第 0 层的平均值是 3 , 第1层是 14.5 , 第2层是 11 …...
![](https://img-blog.csdnimg.cn/img_convert/9699d41bada3a79406b32a406be354d8.png)
江山网站建设/品牌推广的目的和意义
抓包分析先抓包分析一下登录的请求【图1-1】图1-1按照加密的参数,我们一个个分析。首先是 _csrf ,这个参数比较简单,一般是用来防止跨域***的,感兴趣的朋友可以借助搜索引擎了解一下,不是重点我们就不详聊了。直接检索…...
![](/images/no-images.jpg)
怎么用程序做网站/seo研究中心道一老师
在讲我创业故事前,先讲一个几天前的一次招标事件。济南一家公司招标,有两项内容,一个是crm系统建设含手机app,另一项目是财务相关项目。我拿到招标需求后,到我们决定去做前,离提交电子技术材料只有三天半时…...
![](/images/no-images.jpg)
静态网站建设毕业论文/自助建站工具
Remove Edit and Delete privileges on that related object from the profile. They will disappear from the related list. Its one thing to want to restrict the users edit/delete rights at the object level...
![](https://img-blog.csdnimg.cn/fbd438830fe34c8e9242a683269dc1b5.png)
网站关键字个数/自动点击器怎么用
NIO虽然称为Non-Blocking IO(非阻塞IO),但它支持阻塞IO、非阻塞IO和IO多路复用模式这几种方式的使用。 同步IO模式 NIO服务器端 Slf4j public class NIOBlockingServer {public static void main(String[] args) throws IOException {Ser…...