Starting nv-hostengine, plus supplementary notes
TL;DR 
If you run into the errors unable to establish a connection to the specified host: localhost or Unable to connect to host engine. Host engine connection invalid/disconnected. when using dcgmi commands such as dcgmi discovery -l, you can fix them by invoking the nv-hostengine binary directly inside the dcgm-exporter Pod to start a host engine.
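Condensed into commands, the fix is just the following sketch (run inside the dcgm-exporter container; the expected responses are shown as comments and will vary with your DCGM version):
nv-hostengine        # starts a host engine listening on localhost:5555
dcgmi discovery -l   # should now connect and list the node's GPUs instead of erroring out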
Starting an nv-hostengine instance
In the examples that follow I will work entirely with the tools and command-line utilities bundled inside the dcgm-exporter Pod that ships with the GPU Operator, so there is no need to install or fiddle with NVIDIA DCGM (Data Center GPU Manager) on the node itself.
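For reference, locating and entering that Pod might look like this (a minimal sketch; the gpu-operator namespace is an assumption specific to my setup, and the Pod name is the one that appears in the sessions below, so adjust both for your cluster):
kubectl -n gpu-operator get pods | grep dcgm-exporter                  # find the Pod name
kubectl -n gpu-operator exec -it nvidia-dcgm-exporter-pkcrb -- bash    # open a shell inside it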
After finding the dcgm-exporter Pod's name and entering the Pod, you can run ps -A to list every process inside the dcgm-exporter container and confirm whether nv-hostengine is already running:
root@nvidia-dcgm-exporter-pkcrb:/$ ps -A
    PID TTY          TIME CMD
      1 ?        00:00:02 dcgm-exporter
     40 pts/0    00:00:00 bash
     48 pts/0    00:00:00 ps
Before starting it, also check that the nv-hostengine binary can be found on the PATH:
root@nvidia-dcgm-exporter-pkcrb:/$ which nv-hostengine
/usr/bin/nv-hostengine
Then simply invoke the binary; nv-hostengine will start listening on localhost:5555 and handle requests from then on:
root@nvidia-dcgm-exporter-pkcrb:/$ nv-hostengine
Started host engine version 3.3.0 using port number: 5555
Extended notes
Simulating DCGM error metrics and data with dcgmi
The DCGM documentation covers error simulation and injection: Error Injection — NVIDIA DCGM Documentation latest documentation
The steps and logic go like this:
- Start the nv-hostengine daemon
- Start monitoring through DCGM
- Decide which GPU error you want to inject
- Inject the error with dcgmi test --inject (see the sketch after this list)
- DCGM should then report the corresponding error
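As a concrete illustration, injecting a fake metric and reading it back could look like the sketch below; the field ID 150 (DCGM_FI_DEV_GPU_TEMP, the GPU temperature) and the injected value are arbitrary choices for demonstration, so substitute the field that matches the error you want to simulate:
dcgmi test --inject --gpuid 0 -f 150 -v 9999   # inject a bogus temperature sample for GPU 0
dcgmi dmon -e 150 -i 0 -c 1                    # poll field 150 once to observe the injected value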
nv-hostengine's odd zombie-process behavior
After starting nv-hostengine, running ps -A again to inspect the live processes reveals a <defunct> nv-hostengine entry:
root@nvidia-dcgm-exporter-pkcrb:/$ ps -A
    PID TTY          TIME CMD
      1 ?        00:00:04 dcgm-exporter
     61 pts/0    00:00:00 bash
     71 ?        00:00:00 nv-hostengine <defunct>
     72 ?        00:00:00 nv-hostengine
     79 pts/0    00:00:00 ps
This is most likely because nv-hostengine daemonizes itself by forking: the parent process (pid 71 above) exits right away, and since PID 1 inside this container is dcgm-exporter rather than an init that reaps orphaned children, the dead parent lingers as a zombie while the child (pid 72) keeps running.
According to the documentation, nv-hostengine can be shut down with nv-hostengine -t when it is no longer needed, but in practice the command errors out and refuses to terminate it:
nv-hostengine -t
Host engine already running with pid 72
Unable to terminate host engine.
Here is a fuller example from another Pod, where even pid 357, which the engine claims is still running, is itself a zombie:
root@nvidia-dcgm-exporter-4f599:/$ nv-hostengine
Host engine already running with pid 357
root@nvidia-dcgm-exporter-4f599:/$ ps -A
    PID TTY          TIME CMD
      1 ?        00:41:59 dcgm-exporter
    320 ?        00:00:00 dcgmproftester1 <defunct>
    321 ?        00:00:00 dcgmproftester1 <defunct>
    322 ?        00:00:00 dcgmproftester1 <defunct>
    323 ?        00:00:00 dcgmproftester1 <defunct>
    324 ?        00:00:00 dcgmproftester1 <defunct>
    325 ?        00:00:00 dcgmproftester1 <defunct>
    326 ?        00:00:00 dcgmproftester1 <defunct>
    327 ?        00:00:00 dcgmproftester1 <defunct>
    356 ?        00:00:00 nv-hostengine <defunct>
    357 ?        00:00:00 nv-hostengine <defunct>
    404 pts/0    00:00:00 bash
    415 pts/0    00:00:00 ps
At that point, the only fix I could find was to delete the dcgm-exporter Pod with kubectl and let it be recreated.
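Deleting the Pod is a one-liner (a sketch; the gpu-operator namespace is again an assumption, and the Pod name is the one from the session above):
kubectl -n gpu-operator delete pod nvidia-dcgm-exporter-4f599   # the DaemonSet recreates the Pod automatically
Once the replacement Pod comes up (note the new name nvidia-dcgm-exporter-cq8bw below), dcgmi discovery -l connects normally again: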
root@nvidia-dcgm-exporter-cq8bw:/$ dcgmi discovery -l
8 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information                                                   |
+--------+----------------------------------------------------------------------+
| 0      | Name: NVIDIA GeForce RTX 3090                                        |
|        | PCI Bus ID: 00000000:1B:00.0                                         |
|        | Device UUID: GPU-69cd4dac-cefb-45e8-9bff-429dfd4bc8c3                |
+--------+----------------------------------------------------------------------+
| 1      | Name: NVIDIA GeForce RTX 3090                                        |
|        | PCI Bus ID: 00000000:1C:00.0                                         |
|        | Device UUID: GPU-ac2d4b59-b78e-42c7-8db6-ef28b01540e1                |
+--------+----------------------------------------------------------------------+
| 2      | Name: NVIDIA GeForce RTX 3090                                        |
|        | PCI Bus ID: 00000000:1D:00.0                                         |
|        | Device UUID: GPU-d8f40ec7-2c07-4d2e-ae9b-a51e7377e052                |
+--------+----------------------------------------------------------------------+
| 3      | Name: NVIDIA GeForce RTX 3090                                        |
|        | PCI Bus ID: 00000000:1E:00.0                                         |
|        | Device UUID: GPU-5f969c85-ffc4-44ac-aa13-10824844087e                |
+--------+----------------------------------------------------------------------+
0 NvSwitches found.
+-----------+
| Switch ID |
+-----------+
+-----------+
2 CPUs found.
+--------+----------------------------------------------------------------------+
| CPU ID | Device Information                                                   |
+--------+----------------------------------------------------------------------+
| 0      | Name: Grace TH500                                                    |
|        | Cores: 0-19,40-59                                                    |
+--------+----------------------------------------------------------------------+
| 1      | Name: Grace TH500                                                    |
|        | Cores: 20-39,60-79                                                   |
+--------+----------------------------------------------------------------------+
Running an embedded DCGM Engine with the GPU Operator's bundled DCGM
If you are experimenting with an embedded DCGM Engine, the issue Error starting embedded DCGM engine · Issue #16 · triton-inference-server/model_analyzer may help you troubleshoot.
By default, installing the GPU Operator does not enable the so-called "embedded DCGM Engine", as you can see in the GPU Operator source: https://github.com/NVIDIA/gpu-operator/blob/4113883838a514cf528ae67f3cf599f79b52fc02/deployments/gpu-operator/values.yaml#L283-L294
The rendered target resource is of kind ClusterPolicy, named cluster-policy, and you can view its detailed configuration with:
sudo kubectl get ClusterPolicy cluster-policy -o yaml
  dcgm:
    enabled: false
    hostPort: 5555
    image: dcgm
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia/cloud-native
    version: 3.3.0-1-ubuntu22.04
As you can see, enabled ends up being false, so the dcgm module is indeed not turned on here.
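If you only care about that single flag, a jsonpath query is quicker (a sketch that assumes the dcgm block sits under .spec, as in the rendered output above):
sudo kubectl get ClusterPolicy cluster-policy -o jsonpath='{.spec.dcgm.enabled}'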
To enable it, just edit the ClusterPolicy resource with:
sudo kubectl edit ClusterPolicy cluster-policy
Set dcgm.enabled to true, then wait for the Pod to finish starting and everything is ready to use.
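Alternatively, if you prefer a non-interactive change, a merge patch should achieve the same thing (again assuming dcgm sits under .spec):
sudo kubectl patch ClusterPolicy cluster-policy --type merge -p '{"spec":{"dcgm":{"enabled":true}}}'
sudo kubectl -n gpu-operator get pods -w   # watch the DCGM-related Pods come up; the namespace is an assumption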
Further reading
References
- Kubernetes概览 — Cloud Atlas beta documentation, together with cloud-atlas/source/kubernetes/gpu/dcgm-exporter.rst at a168158169f5aad0f7683d698cc740f1846374fe · huataihuang/cloud-atlas (github.com), whose implementation references nv-hostengine
- Getting Started — gpu-operator 23.6.0 documentation
- Azure's Moneo references and invokes nv-hostengine: Moneo/src/worker/shutdown.sh at 70c0a2d75355d82909784c886ed0fc169a49a033 · Azure/Moneo (github.com)
- DataDog's GPU integration also uses nv-hostengine: integrations-core/dcgm/README.md at 764d52840cf6cf694d0eaec5929fc7d799b7fc29 · DataDog/integrations-core (github.com)
- The llm-action project references it as well: github.com/liguodongiot/llm-action/blob/c9d8a3570b9e156ef6af035cf1b21294cb3eedbf/docs/llm-base/monitor.md?plain=1#L13-L16
- NVIDIA's official DCGM Dockerfile uses nv-hostengine directly: DCGM/docker/Dockerfile.ubi8 at a33560c9c138c617f3ee6cb50df11561302e5743 · NVIDIA/DCGM (github.com)
- Monitor Your Computing System with Prometheus, Grafana, Alertmanager, and Nvidia DCGM
- Error: unable to establish a connection to the specified host: localhost · Issue #43 · NVIDIA/DCGM