Monitoring GPUs on Nodes
If you collect node metrics with prometheus, you are probably running node-exporter. NVIDIA provides an image called dcgm-exporter that exports GPU metrics. Combining dcgm-exporter with node-exporter lets you collect GPU metrics as well.
dcgm-exporter
The dcgm (Data Center GPU Manager) exporter is a simple shell script that starts nv-hostengine, reads the GPU metrics every second, and writes them out in prometheus format.
Configuring the Nodes
First, I set a taint and labels to separate the ordinary nodes from the GPU nodes. Most likely you are already using a DaemonSet to run node-exporter.
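To keep the plain node-exporter DaemonSet on the ordinary nodes only, I put the taint nvidia.com/gpu=:NoSchedule on the GPU nodes, so pods without a matching toleration are not scheduled there. To run node-exporter + dcgm-exporter on the GPU nodes, I used the label hardware-type=NVIDIAGPU, which the GPU DaemonSet selects on.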
The nvidia.com/brand label serves no real purpose at the moment, but I added it anyway.
kubectl taint nodes ${node} nvidia.com/gpu=:NoSchedule
kubectl label nodes ${node} "nvidia.com/brand=${label}"
kubectl label nodes ${node} hardware-type=NVIDIAGPU
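To confirm that the taint and labels actually landed on a node, something like the following can be used. This is a minimal sketch: the helper name show_gpu_node_setup is hypothetical, and it only wraps standard kubectl get calls.

```shell
#!/bin/sh
# Hypothetical helper: print the taints and the two labels set above for one node.
show_gpu_node_setup() {
  node="$1"
  # Taints as "key=value:effect", one per line.
  kubectl get node "$node" \
    -o jsonpath='{range .spec.taints[*]}{.key}={.value}:{.effect}{"\n"}{end}'
  # -L adds one output column per label key.
  kubectl get node "$node" -L hardware-type -L nvidia.com/brand
}

# Usage: show_gpu_node_setup "${node}"
# To list every node the GPU DaemonSet's nodeSelector would match:
#   kubectl get nodes -l hardware-type=NVIDIAGPU
```

A tainted GPU node should show a line such as nvidia.com/gpu=:NoSchedule in the first part of the output.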
Adding dcgm-exporter to the Existing node-exporter
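dcgm-exporter writes the GPU metrics to a file, and node-exporter reads that file through its textfile collector (--collector.textfile.directory) and exposes the GPU metrics together with its own. prometheus therefore scrapes both from the same endpoint.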
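The file hand-off can be sketched as below. This is not dcgm-exporter's actual script; write_textfile_metric and the metric name example_gpu_temp are hypothetical, and the value is made up. It only illustrates the usual pattern for the textfile collector: write to a temporary file whose name does not end in .prom, then rename it, so node-exporter never reads a half-written file.

```shell
#!/bin/sh
# Hypothetical helper, not part of dcgm-exporter: write one gauge sample
# into a textfile-collector directory using write-then-rename.
write_textfile_metric() {
  dir="$1"; name="$2"; labels="$3"; value="$4"
  # Temporary name does not end in ".prom", so the collector ignores it.
  tmp="${dir}/${name}.prom.$$"
  {
    printf '# TYPE %s gauge\n' "$name"
    printf '%s{%s} %s\n' "$name" "$labels" "$value"
  } > "$tmp"
  # A rename inside one directory replaces the target atomically.
  mv "$tmp" "${dir}/${name}.prom"
}

# Example with an illustrative sample:
#   write_textfile_metric /run/prometheus example_gpu_temp 'gpu="0"' 41
```

In the manifests that follow, the shared directory is /run/prometheus, mounted read-only into node-exporter and writable in the dcgm-exporter container.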
For GPU nodes
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app.kubernetes.io/name: node-exporter
    app.kubernetes.io/instance: gpu-node-exporter
    app.kubernetes.io/part-of: prometheus
    app.kubernetes.io/managed-by: argo-system
  name: prometheus-gpu-node-exporter
  namespace: argo-system
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/name: node-exporter
      app.kubernetes.io/instance: gpu-node-exporter
      app.kubernetes.io/part-of: prometheus
      app.kubernetes.io/managed-by: argo-system
  template:
    metadata:
      labels:
        app.kubernetes.io/name: node-exporter
        app.kubernetes.io/instance: gpu-node-exporter
        app.kubernetes.io/part-of: prometheus
        app.kubernetes.io/managed-by: argo-system
    spec:
      nodeSelector:
        hardware-type: NVIDIAGPU
      containers:
      - args:
        - --path.procfs=/host/proc
        - --path.sysfs=/host/sys
        - "--collector.textfile.directory=/run/prometheus"
        image: prom/node-exporter:v0.18.1
        imagePullPolicy: IfNotPresent
        name: prometheus-node-exporter
        ports:
        - containerPort: 9100
          hostPort: 9100
          name: metrics
          protocol: TCP
        resources:
          limits:
            cpu: 500m
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 100Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host/proc
          name: proc
          readOnly: true
        - mountPath: /host/sys
          name: sys
          readOnly: true
        - name: collector-textfiles
          readOnly: true
          mountPath: /run/prometheus
      - image: nvidia/dcgm-exporter:1.4.6
        name: nvidia-dcgm-exporter
        securityContext:
          runAsNonRoot: false
          runAsUser: 0
        volumeMounts:
        - name: collector-textfiles
          mountPath: /run/prometheus
      dnsPolicy: ClusterFirst
      hostNetwork: true
      hostPID: true
      restartPolicy: Always
      serviceAccount: prometheus-node-exporter
      serviceAccountName: prometheus-node-exporter
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
      - effect: NoSchedule
        key: node-role.kubernetes.io/ingress
        operator: Exists
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
      volumes:
      - hostPath:
          path: /proc
          type: ""
        name: proc
      - hostPath:
          path: /sys
          type: ""
        name: sys
      - name: collector-textfiles
        emptyDir:
          medium: Memory
      - name: pod-gpu-resources
        hostPath:
          path: /var/lib/kubelet/pod-resources
  updateStrategy:
    type: OnDelete
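Once the GPU DaemonSet is running, a quick check is to fetch node-exporter's /metrics on a GPU node (it listens on hostPort 9100) and look for the GPU lines. The sketch below assumes the GPU metric names start with dcgm_, which may differ between dcgm-exporter versions; gpu_metrics is a hypothetical helper and the address in the usage line is a placeholder.

```shell
#!/bin/sh
# Hypothetical helper: fetch /metrics from node-exporter on one host and
# keep only the lines assumed to come from dcgm-exporter ("dcgm_" prefix).
gpu_metrics() {
  host="$1"
  curl -s "http://${host}:9100/metrics" | grep '^dcgm_'
}

# Usage (placeholder address):
#   gpu_metrics 10.0.0.12
```

If nothing is printed, check that both containers mount the collector-textfiles volume at /run/prometheus and that the dcgm-exporter container is running.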
For ordinary nodes
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app.kubernetes.io/name: node-exporter
    app.kubernetes.io/instance: node-exporter
    app.kubernetes.io/part-of: prometheus
    app.kubernetes.io/managed-by: argo-system
  name: prometheus-node-exporter
  namespace: argo-system
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/name: node-exporter
      app.kubernetes.io/instance: node-exporter
      app.kubernetes.io/part-of: prometheus
      app.kubernetes.io/managed-by: argo-system
  template:
    metadata:
      labels:
        app.kubernetes.io/name: node-exporter
        app.kubernetes.io/instance: node-exporter
        app.kubernetes.io/part-of: prometheus
        app.kubernetes.io/managed-by: argo-system
    spec:
      containers:
      - args:
        - --path.procfs=/host/proc
        - --path.sysfs=/host/sys
        image: prom/node-exporter:v0.18.1
        imagePullPolicy: IfNotPresent
        name: prometheus-node-exporter
        ports:
        - containerPort: 9100
          hostPort: 9100
          name: metrics
          protocol: TCP
        resources:
          limits:
            cpu: 500m
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 100Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host/proc
          name: proc
          readOnly: true
        - mountPath: /host/sys
          name: sys
          readOnly: true
      dnsPolicy: ClusterFirst
      hostNetwork: true
      hostPID: true
      restartPolicy: Always
      serviceAccount: prometheus-node-exporter
      serviceAccountName: prometheus-node-exporter
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
      - effect: NoSchedule
        key: node-role.kubernetes.io/ingress
        operator: Exists
      volumes:
      - hostPath:
          path: /proc
          type: ""
        name: proc
      - hostPath:
          path: /sys
          type: ""
        name: sys
  updateStrategy:
    type: OnDelete