Prometheus GPU Monitoring

1. Prometheus GPU monitoring

  • Install DCGM
  • Package used: datacenter-gpu-manager_1.7.2_amd64.deb
# dcgmi --version

dcgmi  version: 1.7.2

2. Install gpu-monitoring-tools

# git clone https://github.com/NVIDIA/gpu-monitoring-tools.git
# cd gpu-monitoring-tools/
# make binary
go build -o dcgm-exporter github.com/NVIDIA/gpu-monitoring-tools/pkg
# make install
go build -o dcgm-exporter github.com/NVIDIA/gpu-monitoring-tools/pkg
install -m 557 dcgm-exporter /usr/bin/dcgm-exporter
install -m 557 -D ./etc/dcgm-exporter/default-counters.csv /etc/dcgm-exporter/default-counters.csv
install -m 557 -D ./etc/dcgm-exporter/dcp-metrics-included.csv /etc/dcgm-exporter/dcp-metrics-included.csv
  • Run dcgm-exporter
# which dcgm-exporter
/usr/bin/dcgm-exporter
# dcgm-exporter
INFO[0000] Starting dcgm-exporter
INFO[0000] DCGM successfully initialized!
INFO[0000] Pipeline starting
INFO[0000] Starting webserver
  • Test: the exporter should return monitoring data
# curl 192.168.1.2:9400/metrics
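
The /metrics output is long, so it can help to filter it down to one metric. The following is a minimal sketch, not part of the original setup: metric_samples is a hypothetical helper name, and it assumes the exporter emits the standard Prometheus text format with a gpu="..." label and no trailing timestamps.

```shell
#!/bin/sh
# Hypothetical helper: print "<gpu label> <value>" for every sample of one metric.
# Reads Prometheus text exposition format on stdin.
metric_samples() {
  awk -v name="$1" '
    /^#/ { next }                       # skip HELP/TYPE comment lines
    {
      # the metric name is everything before "{" or the first space
      n = $0
      sub(/[{ ].*/, "", n)
      if (n != name) next
      gpu = "-"                         # placeholder when no gpu label is present
      if (match($0, /gpu="[^"]*"/)) {
        gpu = substr($0, RSTART + 5, RLENGTH - 6)
      }
      print gpu, $NF                    # assumes the value is the last field
    }'
}
```

Usage, with the exporter address from the step above: curl -s 192.168.1.2:9400/metrics | metric_samples DCGM_FI_DEV_GPU_UTIL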

2.1. Start dcgm-exporter at boot

  • Create the service unit: vim /lib/systemd/system/dcgm-exporter.service
[Unit]
Description=dcgm-exporter service

[Service]
User=root
ExecStart=/usr/bin/dcgm-exporter

TimeoutStopSec=10
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
# systemctl daemon-reload
# systemctl enable dcgm-exporter.service
# systemctl start dcgm-exporter.service
# systemctl status dcgm-exporter.service
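
After enabling the service, a quick check that it is both active and answering HTTP requests can catch a unit that starts but fails to serve metrics. This is a sketch under assumptions: exporter_ready is a hypothetical helper name and the 5-second timeout is an arbitrary choice.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the unit is active and /metrics responds.
# $1 is the exporter address, e.g. 127.0.0.1:9400
exporter_ready() {
  # is-active --quiet returns non-zero when the unit is not running
  systemctl is-active --quiet dcgm-exporter.service || return 1
  # -f makes curl fail on HTTP errors; the output itself is discarded
  curl -sf --max-time 5 "http://$1/metrics" >/dev/null
}
```

Usage: exporter_ready 127.0.0.1:9400 && echo "dcgm-exporter is up"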

3. Update the Prometheus configuration

  • Add a scrape job for dcgm-exporter
    # dcgm-exporter
  - job_name: 'gpu'
    static_configs:
    - targets: ['192.168.1.2:9400']
# cat prometheus.yml
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']


    # node_exporter
  - job_name: 'node'
    static_configs:
    - targets: ['127.0.0.1:9100','192.168.1.2:9100']

    # dcgm-exporter
  - job_name: 'gpu'
    static_configs:
    - targets: ['192.168.1.2:9400']
  • Restart Prometheus
# systemctl restart prometheus.service
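
A YAML indentation mistake in prometheus.yml will stop Prometheus from starting, so it may be worth validating the file before the restart. The sketch below uses promtool, which ships with Prometheus; the function name restart_prometheus_checked and the default path /etc/prometheus/prometheus.yml are assumptions, so adjust the path to your installation.

```shell
#!/bin/sh
# Hypothetical helper: restart Prometheus only if the config passes validation.
restart_prometheus_checked() {
  config="${1:-/etc/prometheus/prometheus.yml}"   # assumed default location
  if promtool check config "$config"; then
    systemctl restart prometheus.service
  else
    echo "prometheus.yml failed validation; not restarting" >&2
    return 1
  fi
}
```

Usage: restart_prometheus_checked /path/to/prometheus.yml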

4. Grafana

5. Use Grafana dashboard 9957, which can switch between nodes

6. Grafana settings

  • Power usage; instance is the exporter's IP address (with port)
DCGM_FI_DEV_POWER_USAGE{instance="192.168.1.101:9400"}
  • GPU utilization
DCGM_FI_DEV_GPU_UTIL{instance="192.168.1.101:9400"}

7. Use dashboard 12027

    # dcgm-exporter
  - job_name: 'gpu-metrics'
    static_configs:
    - targets: ['127.0.0.1:9400','192.168.1.101:9400','192.168.1.102:9400']

  • Configure panels manually
  • List the GPU metrics
curl http://127.0.0.1:9400/metrics
  • Power usage
DCGM_FI_DEV_POWER_USAGE{instance="127.0.0.1:9400"}
  • Memory used
DCGM_FI_DEV_FB_USED{instance="127.0.0.1:9400"}
  • Total memory (used + free)
DCGM_FI_DEV_FB_USED{instance="127.0.0.1:9400"}+DCGM_FI_DEV_FB_FREE{instance="127.0.0.1:9400"}
  • GPU utilization
DCGM_FI_DEV_GPU_UTIL{instance="127.0.0.1:9400"}
  • GPU memory utilization (memory copy utilization, not the share of memory in use)
DCGM_FI_DEV_MEM_COPY_UTIL{instance="192.168.0.114:9400"}
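
The used and free framebuffer metrics above can also be combined into a percentage of memory in use. The sketch below builds that PromQL expression and, optionally, runs it against the Prometheus HTTP API (/api/v1/query). The helper names fb_used_percent_expr and prom_query are hypothetical, and the Prometheus address localhost:9090 in the usage line is taken from the scrape configuration earlier in this post.

```shell
#!/bin/sh
# Hypothetical helper: print a PromQL expression for framebuffer usage in percent.
# $1 is the exporter instance label, e.g. 127.0.0.1:9400
fb_used_percent_expr() {
  inst="$1"
  printf '100 * DCGM_FI_DEV_FB_USED{instance="%s"} / (DCGM_FI_DEV_FB_USED{instance="%s"} + DCGM_FI_DEV_FB_FREE{instance="%s"})\n' \
    "$inst" "$inst" "$inst"
}

# Hypothetical helper: run an instant query against the Prometheus HTTP API.
# $1 is the Prometheus address, $2 is the PromQL expression.
prom_query() {
  curl -s --get "http://$1/api/v1/query" --data-urlencode "query=$2"
}
```

Usage: prom_query localhost:9090 "$(fb_used_percent_expr 127.0.0.1:9400)". The same expression can be pasted into a Grafana panel as-is.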

8. Use the GPU-Nodes-Metrics-Nvidia dashboard (12639)
