Restart Ethereum miner after error without exit

caprica · June 13, 2017, 9:36pm

I'm running the Genoil cpp-ethereum miner (https://github.com/Genoil/cpp-ethereum) on a VM with GPU passthrough and overclocking. It runs mostly stable with a memory clock offset of 1500 MHz (this yields 32 MH/s on a GTX 1070), but occasionally it outputs an error and stops hashing without exiting:

Cuda error in func 'search' at line 346 : an illegal memory access was encountered.

I'm wondering what would be a good approach here. I could check the output for that string, kill the process and start over, or I could check GPU utilization and restart if it drops in consecutive tests. ethminer outputs about 5 lines every second.

Anyway, I guess I almost answered my own question here, but I don't like doing 5 checks every second, and the GPU check could trigger on a network failure and probably for other reasons too. So I guess I'm asking whether people have any other suggestions for doing this more elegantly.

I usually run the miner in screen, as I like to be able to bring it back up and have the highlighting work, but I guess that's a luxury I could drop.

mrpopo · June 14, 2017, 2:18am

Have you tried dropping the clocks on the VRAM? I mined Litecoin back in the day and remember that my mining clients were very sensitive to VRAM and RAM clock speeds.

caprica · June 14, 2017, 5:16am

I could probably get it more stable, but a few more MH/s is worth a crash per day as long as the process gets going again shortly after it stops.

I haven't done anything with the RAM clock and probably can't do much there either: it's a server/workstation motherboard with ECC memory and a Xeon CPU.

lightonflux · June 16, 2017, 10:26am

If you have a distro with systemd and can tell the program to quit on failure/error, then you could use systemd's restart functionality:

Restart=: This indicates the circumstances under which systemd will attempt to automatically restart the service. This can be set to values like “always”, “on-success”, “on-failure”, “on-abnormal”, “on-abort”, or “on-watchdog”. These will trigger a restart according to the way that the service was stopped.

Source

If the software does not support exiting on CUDA errors then ask for the feature in their issue tracker.

caprica · June 17, 2017, 6:16pm

Thanks, it's an interesting suggestion. I'd have to put in a feature request or find it in the code and change it, but for now I'll go with a wrapper script. I'll post it here for reference if it works.
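
For reference, here is a minimal sketch of what such a wrapper could look like in Python. It is not the script from this thread. It assumes the miner writes the error to stdout or stderr and that the text "an illegal memory access was encountered" is a reliable marker. The names `run_until_error` and `supervise` are made up for illustration. Reading the pipe line by line blocks until the miner prints something, so there is no need to check 5 times a second.

```python
import subprocess
import sys
import time

# Substring of the CUDA error quoted in the first post (assumed to be stable).
ERROR_MARKER = "an illegal memory access was encountered"


def run_until_error(cmd, marker=ERROR_MARKER, out=sys.stdout):
    """Run cmd, echo its output, and kill it as soon as marker appears.

    Returns True if the marker was seen, False if the process ended by itself.
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr in case the error is logged there
        text=True,
        bufsize=1,  # line-buffered on the reading side
    )
    seen = False
    # Iterating over the pipe blocks until the next line arrives (no polling).
    for line in proc.stdout:
        out.write(line)
        out.flush()
        if marker in line:
            seen = True
            proc.kill()  # the miner has stopped hashing, so a hard kill is assumed OK
            break
    proc.stdout.close()
    proc.wait()
    return seen


def supervise(cmd, delay=5.0, max_runs=None):
    """Start cmd again after every error or exit; returns the number of runs."""
    runs = 0
    while max_runs is None or runs < max_runs:
        run_until_error(cmd)
        runs += 1
        time.sleep(delay)  # short pause so a broken setup does not spin
    return runs
```

To use it, call `supervise(...)` with the miner command as a list of arguments, for example `supervise(sys.argv[1:])` at the bottom of the script. Two caveats: a program writing to a pipe may buffer its output, which would delay detection, and colored output is often disabled when stdout is not a terminal, so the highlighting in screen may be lost.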