TIG stack for monitoring AMDGPU stats

I made this little Ruby script and some config files to monitor AMDGPU stats with Grafana via Telegraf and InfluxDB. It uses the JSON mode of amdgpu_top to collect the stats and feed them into Telegraf.

I started off with a simpler method, running the script under Telegraf’s inputs.execd plugin, but the telegraf user doesn’t have access to whatever memory amdgpu_top needs, and that’s probably for a good reason. So the next best thing is running the script as a systemd service and chucking the influx line protocol at a TCP socket that Telegraf listens on.
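
To make that concrete, here is a minimal sketch of what "chucking line protocol at a TCP socket" amounts to in plain Ruby. The send_ilp helper is hypothetical (the actual setup pipes the script’s output through nc instead), and the host and port are simply the values used in the Telegraf config further down.

```ruby
require 'socket'

# Hypothetical helper, not part of the actual script: open a TCP connection,
# write one line of influx line protocol, and close the connection again.
def send_ilp(line, host: "127.0.0.1", port: 11233)
  TCPSocket.open(host, port) do |sock|
    # Line protocol is newline-delimited, so make sure the line is terminated.
    sock.write(line.end_with?("\n") ? line : "#{line}\n")
  end
end

# Example (needs something listening on the port, e.g. Telegraf's socket_listener):
#   send_ilp("gpu_stats edge_temperature_c=45 #{Time.now.to_i * 1_000_000_000}")
```

Opening a connection per line is only sensible for poking at the listener by hand. Also note that the listener below is configured with max_connections = 1, so try this with the service stopped, otherwise the connection will most likely be turned away.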

If you want to copy it, check out the GitHub repo penguinpowernz/amdgpu_tig (TIG stack for monitoring your AMD GPU in Grafana), but I’ll copy the instructions here too:

We use Docker for running InfluxDB and Grafana:

sudo apt-get install docker.io
sudo usermod -aG docker $USER
# log out and back in so the new group membership takes effect
docker run --restart=always -d --name influx -p 8086:8086 influxdb:1.12
docker run --restart=always -d --name grafana -p 3000:3000 grafana/grafana
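
If you want to confirm both containers came up before carrying on, something like this works as a quick check. It relies on InfluxDB 1.x answering GET /ping with a 204 and Grafana answering GET /api/health with a 200. The up? helper, the CHECKS table and the localhost URLs are assumptions for this sketch, so adjust them if Docker runs on another machine.

```ruby
require 'net/http'
require 'uri'

# Endpoints to probe, assuming both containers publish their ports on localhost.
CHECKS = {
  "influx"  => "http://127.0.0.1:8086/ping",
  "grafana" => "http://127.0.0.1:3000/api/health",
}.freeze

# Returns true when the URL answers with any 2xx status, and false on
# connection errors, timeouts or any other status.
def up?(url)
  res = Net::HTTP.get_response(URI(url))
  res.is_a?(Net::HTTPSuccess)
rescue StandardError
  false
end

# Example:
#   CHECKS.each { |name, url| puts "#{name}: #{up?(url) ? 'up' : 'down'}" }
```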

We also need Telegraf and amdgpu_top… they weren’t in the Ubuntu repos:

wget https://dl.influxdata.com/telegraf/releases/telegraf_1.36.3-1_amd64.deb
sudo dpkg -i telegraf_1.36.3-1_amd64.deb
wget https://github.com/Umio-Yasuno/amdgpu_top/releases/download/v0.11.0/amdgpu-top_0.11.0-1_amd64.deb
sudo dpkg -i amdgpu-top_0.11.0-1_amd64.deb

Also Ruby and netcat if you don’t have them:

sudo apt-get install --no-install-recommends ruby netcat

A pretty simple script to parse and dump the amdgpu_top output (I keep it in /usr/local/bin/amdgpu_tlgfd):

#!/usr/bin/env ruby

require 'json'
require 'open3'

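cmd = "amdgpu_top -J"

# flush every line straight away; Ruby buffers stdout when it is a pipe (e.g. into nc)
$stdout.sync = true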

# return the sensors hash
def pull_sensors(data)
  return nil unless data["devices"]
  return nil unless data["devices"][0]
  return nil unless data["devices"][0]["Sensors"]
  return data["devices"][0]["Sensors"]
end

# return an ILP string
def convert_to_ilp(data)
  name = "gpu_stats"
  tags = {}
  fields = {}

  fields["average_power_w"] = data["Average Power"]["value"]
  fields["edge_critical_temperature_c"] = data["Edge Critical Temperature"]["value"]
  fields["edge_emergency_temperature_c"] = data["Edge Emergency Temperature"]["value"]
  fields["edge_temperature_c"] = data["Edge Temperature"]["value"]
  fields["fclk_mhz"] = data["FCLK"]["value"]
  fields["fan_rpm"] = data["Fan"]["value"]
  fields["fan_max_rpm"] = data["Fan Max"]["value"]
  fields["gfx_power_w"] = data["GFX Power"]["value"]
  fields["gfx_mclk_mhz"] = data["GFX_MCLK"]["value"]
  fields["gfx_sclk_mhz"] = data["GFX_SCLK"]["value"]
  fields["junction_critical_temperature_c"] = data["Junction Critical Temperature"]["value"]
  fields["junction_emergency_temperature_c"] = data["Junction Emergency Temperature"]["value"]
  fields["junction_temperature_c"] = data["Junction Temperature"]["value"]
  fields["memory_critical_temperature_c"] = data["Memory Critical Temperature"]["value"]
  fields["memory_emergency_temperature_c"] = data["Memory Emergency Temperature"]["value"]
  fields["memory_temperature_c"] = data["Memory Temperature"]["value"]
  fields["vddgfx_mv"] = data["VDDGFX"]["value"]

  #return "#{name},#{tags.map{|k,v| "#{k}=#{v}"}.join(",")} #{fields.map{|k,v| "#{k}=#{v}"}.join(",")} #{Time.now.to_i*1000*1000*1000}"
  return "#{name} #{fields.map{|k,v| "#{k}=#{v}"}.join(",")} #{Time.now.to_i*1000*1000*1000}"
end

def process_line(line)
  parsed = JSON.parse(line)
  return unless parsed

  sensors = pull_sensors(parsed)
  return unless sensors

  ilp = convert_to_ilp(sensors)
  return unless ilp

  puts ilp
end

if ARGV.include?("-d")
  line = STDIN.gets
  process_line(line)
  exit
end

Open3.popen3(cmd) do |stdin, stdout, stderr, wait_thr|
  stdout.each_line do |line|
    begin
      process_line(line)
    rescue => e
      # report on stderr so error text never ends up in the line protocol stream
      warn "Error: #{e.message}"
    end
  end
end
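
One thing to watch: if a card doesn’t report one of those sensors, the data["..."]["value"] lookup raises and the whole sample is dropped. Below is a sketch of a more forgiving variant, assuming a missing sensor simply shows up as absent or null in the JSON. SENSOR_FIELDS and sensors_to_ilp are names made up for this example, and only a few of the sensors are listed to keep it short.

```ruby
# Hypothetical mapping of amdgpu_top sensor names to field names
# (a subset of the ones used in the script above).
SENSOR_FIELDS = {
  "Average Power"        => "average_power_w",
  "Edge Temperature"     => "edge_temperature_c",
  "Junction Temperature" => "junction_temperature_c",
  "Fan"                  => "fan_rpm",
}.freeze

# Build an ILP line from the sensors hash, skipping sensors that are missing
# or have no numeric value. Returns nil when nothing usable is left.
def sensors_to_ilp(sensors, name: "gpu_stats", time: Time.now)
  fields = SENSOR_FIELDS.map do |sensor, field|
    entry = sensors[sensor]
    value = entry.is_a?(Hash) ? entry["value"] : nil
    "#{field}=#{value}" if value.is_a?(Numeric)
  end.compact
  return nil if fields.empty?

  "#{name} #{fields.join(",")} #{time.to_i * 1_000_000_000}"
end
```

With this, a card without a fan sensor still produces a line carrying the remaining fields, and process_line already skips a nil result.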

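A very simple service file (save it as amdgpu.service, e.g. in /etc/systemd/system/, so that the systemctl start amdgpu further down finds it):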

[Unit]
Description=AMDGPU stats tracking to telegraf
After=telegraf.service

[Service]
ExecStart=/bin/sh -c "/usr/local/bin/amdgpu_tlgfd | nc 127.0.0.1 11233"

And then some simple Telegraf config (just chuck it in /etc/telegraf/telegraf.d/amdgpu.conf):

[[outputs.influxdb]]
  urls = ["http://127.0.0.1:8086"]
  database = "telegraf"

[[inputs.socket_listener]]
  service_address = "tcp4://127.0.0.1:11233"
  max_connections = 1
  data_format = "influx"

Restart some things:

sudo systemctl daemon-reload
sudo systemctl restart telegraf
sudo systemctl start amdgpu
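
To check that data is actually landing in InfluxDB before fiddling with Grafana, you can ask the InfluxDB 1.x HTTP API directly. A rough sketch, assuming the database and measurement names from the configs above (telegraf and gpu_stats) and InfluxDB reachable on localhost; query_uri and latest_point are made-up helper names.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Build the /query URL for an InfluxQL statement (InfluxDB 1.x HTTP API).
def query_uri(influxql, db: "telegraf", base: "http://127.0.0.1:8086")
  uri = URI("#{base}/query")
  uri.query = URI.encode_www_form(db: db, q: influxql)
  uri
end

# Turn the JSON body of a /query response into a column => value hash
# for the first returned row, or nil when there is no data yet.
def latest_point(body)
  series = JSON.parse(body).dig("results", 0, "series", 0)
  return nil unless series

  series["columns"].zip(series["values"][0]).to_h
end

# Example (needs the stack running):
#   uri = query_uri('SELECT * FROM "gpu_stats" ORDER BY time DESC LIMIT 1')
#   p latest_point(Net::HTTP.get(uri))
```

If that prints nil, nothing has arrived yet, so check the Telegraf and amdgpu service logs first.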

And Bob’s your uncle’s twisted stepbrother, twice removed…

Oh yeah, plus when you log in to Grafana, set up the data source (InfluxDB, pointing at the influx container on port 8086, with the telegraf database) to look like this:

And you can grab the dashboard JSON from the repo and import it.