I made this little Ruby script and some config files to monitor AMDGPU stats in Grafana via Telegraf and InfluxDB. It uses the JSON mode of amdgpu_top to collect the stats and feed them into Telegraf.
I started off with a simpler method, running the script under Telegraf’s inputs.execd plugin, but the telegraf user doesn’t have access to whatever memory amdgpu_top needs, and that’s probably for a good reason. So the next best thing is running the script as a systemd service and chucking the Influx line protocol at a TCP socket that Telegraf listens on.
If you want to copy it, check out the GitHub repo penguinpowernz/amdgpu_tig (TIG stack for monitoring your AMD GPU in Grafana), but I’ll copy the instructions here too:
We use Docker for running InfluxDB and Grafana:
sudo apt-get install docker.io
sudo usermod -aG docker $USER
# relogin
docker run --restart=always -d --name influx -p 8086:8086 influxdb:1.12
docker run --restart=always -d --name grafana -p 3000:3000 grafana/grafana
We also need Telegraf and amdgpu_top, which weren’t in the Ubuntu repos:
wget https://dl.influxdata.com/telegraf/releases/telegraf_1.36.3-1_amd64.deb
sudo dpkg -i telegraf_1.36.3-1_amd64.deb
wget https://github.com/Umio-Yasuno/amdgpu_top/releases/download/v0.11.0/amdgpu-top_0.11.0-1_amd64.deb
sudo dpkg -i amdgpu-top_0.11.0-1_amd64.deb
Also Ruby and netcat if you don’t have them:
sudo apt-get install --no-install-recommends ruby netcat
A pretty simple script to parse and dump the amdgpu_top output (I keep it in /usr/local/bin/amdgpu_tlgfd):
#!/usr/bin/env ruby
require 'json'
require 'open3'

# stdout is a pipe to nc when run as a service, so flush every line
# straight away instead of waiting for Ruby's output buffer to fill
$stdout.sync = true

cmd = "amdgpu_top -J"

# return the sensors hash of the first device, or nil if it is missing
def pull_sensors(data)
  return nil unless data["devices"]
  return nil unless data["devices"][0]
  return nil unless data["devices"][0]["Sensors"]
  return data["devices"][0]["Sensors"]
end

# return an ILP string
def convert_to_ilp(data)
  name = "gpu_stats"
  tags = {}
  fields = {}
  fields["average_power_w"] = data["Average Power"]["value"]
  fields["edge_critical_temperature_c"] = data["Edge Critical Temperature"]["value"]
  fields["edge_emergency_temperature_c"] = data["Edge Emergency Temperature"]["value"]
  fields["edge_temperature_c"] = data["Edge Temperature"]["value"]
  fields["fclk_mhz"] = data["FCLK"]["value"]
  fields["fan_rpm"] = data["Fan"]["value"]
  fields["fan_max_rpm"] = data["Fan Max"]["value"]
  fields["gfx_power_w"] = data["GFX Power"]["value"]
  fields["gfx_mclk_mhz"] = data["GFX_MCLK"]["value"]
  fields["gfx_sclk_mhz"] = data["GFX_SCLK"]["value"]
  fields["junction_critical_temperature_c"] = data["Junction Critical Temperature"]["value"]
  fields["junction_emergency_temperature_c"] = data["Junction Emergency Temperature"]["value"]
  fields["junction_temperature_c"] = data["Junction Temperature"]["value"]
  fields["memory_critical_temperature_c"] = data["Memory Critical Temperature"]["value"]
  fields["memory_emergency_temperature_c"] = data["Memory Emergency Temperature"]["value"]
  fields["memory_temperature_c"] = data["Memory Temperature"]["value"]
  fields["vddgfx_mv"] = data["VDDGFX"]["value"]
  # variant with tags, unused while tags is empty:
  #return "#{name},#{tags.map{|k,v| "#{k}=#{v}"}.join(",")} #{fields.map{|k,v| "#{k}=#{v}"}.join(",")} #{Time.now.to_i*1000*1000*1000}"
  # timestamp is in nanoseconds
  return "#{name} #{fields.map{|k,v| "#{k}=#{v}"}.join(",")} #{Time.now.to_i*1000*1000*1000}"
end

# parse one JSON line from amdgpu_top and print it as ILP
def process_line(line)
  parsed = JSON.parse(line)
  return unless parsed
  sensors = pull_sensors(parsed)
  return unless sensors
  ilp = convert_to_ilp(sensors)
  return unless ilp
  puts ilp
end

# debug mode: convert a single JSON line read from stdin, then exit
if ARGV.include?("-d")
  line = STDIN.gets
  process_line(line)
  exit
end

Open3.popen3(cmd) do |stdin, stdout, stderr, wait_thr|
  stdout.each_line do |line|
    begin
      process_line(line)
    rescue => e
      # errors go to stderr so they are not piped to Telegraf as bogus data
      warn "Error: #{e.message}"
    end
  end
end
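One thing to watch with the script above: if your card doesn’t report one of the listed sensors, the lookup raises and that whole sample gets dropped by the rescue in the main loop. Here is a minimal sketch of a more forgiving conversion, assuming missing or null sensors are possible on some cards. SENSOR_FIELDS and safe_ilp are hypothetical names (not in the repo), and only three sensors are mapped to keep it short:

```ruby
# Hypothetical mapping of amdgpu_top sensor names to field names
# (shortened; extend it with the rest of the sensors from the script).
SENSOR_FIELDS = {
  "Edge Temperature" => "edge_temperature_c",
  "Fan"              => "fan_rpm",
  "GFX Power"        => "gfx_power_w",
}.freeze

# Build a line-protocol string from whichever sensors are present.
# Returns nil when no sensor has a numeric value, since a line with
# no fields is not valid line protocol.
def safe_ilp(sensors, measurement = "gpu_stats", now = Time.now)
  fields = {}
  SENSOR_FIELDS.each do |sensor, field|
    entry = sensors[sensor]
    value = entry.is_a?(Hash) ? entry["value"] : nil
    fields[field] = value if value.is_a?(Numeric)
  end
  return nil if fields.empty?

  ts = now.to_i * 1_000_000_000 # nanoseconds
  "#{measurement} #{fields.map { |k, v| "#{k}=#{v}" }.join(",")} #{ts}"
end
```

Swapping this in for convert_to_ilp would mean a card without a fan sensor still reports its temperatures, at the cost of silently leaving out whatever is missing.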
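A very simple service file (it has to be called amdgpu.service to match the start command further down, e.g. /etc/systemd/system/amdgpu.service):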
[Unit]
Description=AMDGPU stats tracking to telegraf
After=telegraf.service
[Service]
ExecStart=sh -c "/usr/local/bin/amdgpu_tlgfd | nc 127.0.0.1 11233"
And then some simple Telegraf config (just chuck it in /etc/telegraf/telegraf.d/amdgpu.conf):
[[outputs.influxdb]]
urls = ["http://127.0.0.1:8086"]
database = "telegraf"
[[inputs.socket_listener]]
service_address = "tcp4://127.0.0.1:11233"
max_connections = 1
data_format = "influx"
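If you want to check that the listener accepts line protocol before wiring up the service, a rough sketch like this pushes a single hand-written point at it. send_test_point and the gpu_stats_test measurement are made up for illustration; the host and port are the ones from the config above:

```ruby
require 'socket'

# Hypothetical helper: send one line-protocol line to the socket_listener.
# Returns the line that was sent.
def send_test_point(host = "127.0.0.1", port = 11233, line = nil)
  # Default to a throwaway measurement with a nanosecond timestamp
  line ||= "gpu_stats_test value=1 #{Time.now.to_i * 1_000_000_000}"
  TCPSocket.open(host, port) do |sock|
    sock.write(line + "\n") # line protocol is newline-delimited
  end
  line
end
```

Calling send_test_point with no arguments after Telegraf has been restarted should make a gpu_stats_test measurement show up in the telegraf database. Because max_connections is 1, run it while the amdgpu service is stopped, otherwise the listener is already taken.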
Restart some things:
systemctl daemon-reload
systemctl restart telegraf
systemctl start amdgpu
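To confirm data is actually landing in InfluxDB before building dashboards, you could ask the InfluxDB 1.x /query HTTP endpoint for the latest value. This is only a sketch: influx_query_uri and latest_edge_temperature are hypothetical helper names, and it assumes the database and port from the setup above with no authentication enabled:

```ruby
require 'net/http'
require 'uri'
require 'json'

# Hypothetical helper: build the URI for an InfluxDB 1.x /query request.
def influx_query_uri(query, db: "telegraf", host: "127.0.0.1", port: 8086)
  uri = URI::HTTP.build(host: host, port: port, path: "/query")
  uri.query = URI.encode_www_form(db: db, q: query)
  uri
end

# Hypothetical helper: fetch the most recent edge temperature written by
# the script and return the parsed JSON response.
def latest_edge_temperature
  uri = influx_query_uri('SELECT last("edge_temperature_c") FROM "gpu_stats"')
  JSON.parse(Net::HTTP.get(uri))
end
```

Running `puts latest_edge_temperature` a few seconds after starting the service should print a response containing a gpu_stats series. If the series is missing, the service log and the Telegraf log are the first places to look.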
And Bob’s your uncle’s twisted stepbrother twice removed…

