GCP - GPU staging time reduction

I have an application that requires the smallest possible boot time/TTL for VMs with GPUs attached in GCP Compute Engine. To keep costs down, my infrastructure depends on starting and stopping instances as demand rises and falls.

I have achieved sub-5-second start times with custom images without GPUs, but as soon as I attach a GPU, the time to "RUNNING" is always 20-30s or more.

I have tried multiple distros: Clear Linux, prepackaged Nvidia driver images, minimal installs of Fedora, a minimised Debian, plus reductions to the kernel and userspace. systemd-analyze says my boot time is 3s, but starting the VM with a GPU attached takes 20-30s in "STAGING" before it reaches RUNNING.
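
For reference, this is roughly how I'm timing it: a small Python sketch that starts the instance and polls its status until it reports RUNNING. The instance and zone names are placeholders, and it assumes an authenticated gcloud CLI.

```python
# Rough timing harness: start the instance, then poll its status and measure
# how long it takes to go from the start request to RUNNING (which includes
# the STAGING phase). Instance/zone are placeholders; assumes the gcloud CLI
# is installed and authenticated.
import subprocess
import time

INSTANCE = "gpu-render-vm"   # placeholder name
ZONE = "us-central1-a"       # placeholder zone

def instance_status() -> str:
    result = subprocess.run(
        ["gcloud", "compute", "instances", "describe", INSTANCE,
         "--zone", ZONE, "--format=value(status)"],
        capture_output=True, text=True, check=True)
    return result.stdout.strip()

t0 = time.monotonic()
subprocess.run(
    ["gcloud", "compute", "instances", "start", INSTANCE,
     "--zone", ZONE, "--async"],
    check=True)

while instance_status() != "RUNNING":
    time.sleep(1)

print(f"time to RUNNING: {time.monotonic() - t0:.1f}s")
```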

This only occurs when the GPU is attached to the VM; with it removed, the VM starts within the time reported by systemd-analyze. The behaviour is consistent across all distros and boot images.

Are there any packages or documentation I am missing that would speed up this staging time with a GPU attached, or is this a limitation of GCP's internal staging of GPU instances?

I’d much appreciate any help or advice.

While I don’t work on GCP directly, I think it’s safe to say that the provisioning complexity is huge, and this adds to latency.

Is it always 30s? How many data points do you have, and what does that distribution look like? What are you being charged for in terms of time?

To illustrate a highly plausible scenario: there might be 20-50 different developer teams within the GCP/Technical Infrastructure part of Google alone, on top of whatever Nvidia is doing, all writing service software that has to coordinate. That could mean hundreds of database transactions, logging, checks against various budgets and prioritisation across various systems to make sure nothing slips by, while the device itself goes through volatile memory purges, power resets, firmware uploads and "2+2=5"-style sanity checks before it's handed to your VM. Before any of that, a low-priority/scientific workload (e.g. protein folding) might need to live-migrate to a different host to keep bin packing consistent.

Then again, if you're just using the GPUs for ML training, I don't think a few seconds makes a huge difference.
If you're doing it at scale and providing e.g. a cloud rendering service, you could probably afford to keep a VM pre-spun-up, ready to receive work from your job scheduling service (rough sketch below).
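
Something like this, as a very rough sketch of the warm-pool idea; start_gpu_vm() is a hypothetical stand-in for whatever actually provisions and starts the instance.

```python
# Minimal warm-pool sketch: keep a couple of already-RUNNING GPU VMs on hand,
# hand one out immediately when a job arrives, and start a replacement in the
# background. start_gpu_vm() is hypothetical; plug in the real provisioning call.
import queue
import threading

WARM_POOL_SIZE = 2
warm_pool: "queue.Queue[str]" = queue.Queue()

def start_gpu_vm() -> str:
    """Start a GPU VM, block until it is RUNNING, return its name."""
    raise NotImplementedError  # gcloud / Compute API call goes here

def refill() -> None:
    warm_pool.put(start_gpu_vm())

def handle_job(job_id: str) -> str:
    vm = warm_pool.get()                                   # instant hand-off, no STAGING wait
    threading.Thread(target=refill, daemon=True).start()   # replace it in the background
    return vm                                              # dispatch job_id to this VM

# prime the pool when the scheduler service starts
for _ in range(WARM_POOL_SIZE):
    threading.Thread(target=refill, daemon=True).start()
```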


Thanks for the help. In relation to your questions, here's my data:

Logging the results from 20 different VMs across different distros, start times of day and locations, the staging time is consistent. While I haven't measured to sub-second precision, the staging time is always 28-30 seconds. Time of day, distro image and machine type do not seem to affect the results.

To clarify my application: a hard requirement is that GPU VMs start within a 10-second timeframe, as it's used for 3D rendering. A user requests the site, a VM is assigned to that user and an RTC stream is initialised, so waiting times need to be short. Each client gets a VM in their region, with VMs assigned up to a maximum limit (rough shape of that assignment step below).
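
For illustration only, the assignment step looks roughly like this; the region pools, limits and helper names are placeholders, not my actual code.

```python
# Rough shape of the per-region assignment: match a user to their region,
# hand them a stopped GPU VM if the region is under its cap, otherwise reject
# or queue the request. All names and limits here are placeholders.
from collections import defaultdict
from typing import Optional

MAX_VMS_PER_REGION = 10
stopped = defaultdict(list)   # region -> names of stopped GPU VMs
in_use = defaultdict(set)     # region -> VMs currently streaming to a client

def assign_vm(user_region: str) -> Optional[str]:
    if len(in_use[user_region]) >= MAX_VMS_PER_REGION or not stopped[user_region]:
        return None            # at capacity for this region: queue or reject
    vm = stopped[user_region].pop()
    in_use[user_region].add(vm)
    # start_instance(vm) and init_rtc_stream(vm) would run here -- this is
    # exactly where the 28-30s STAGING wait lands for the user.
    return vm
```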

I agree that keeping the VM running constantly circumvents this issue, but for this application it isn't practical. In terms of price, one of the benefits of GCP is the per-minute billing and preemptible pricing ($0.21 per hour), which is why my architecture is designed this way: the maximum time these machines run is 1 hour, with an average of 28 minutes.
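
As a quick back-of-the-envelope check on those numbers (the $0.21/hour rate and session lengths are the figures above):

```python
# Back-of-the-envelope session cost at the quoted preemptible rate.
RATE_PER_HOUR = 0.21    # $/hour, figure quoted above
AVG_SESSION_MIN = 28
MAX_SESSION_MIN = 60

print(f"average session: ${RATE_PER_HOUR * AVG_SESSION_MIN / 60:.3f}")  # ~$0.098
print(f"maximum session: ${RATE_PER_HOUR * MAX_SESSION_MIN / 60:.3f}")  # $0.210
```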

I'll keep playing around; 30s isn't awful, but if anyone finds a workaround please do let me know.


What are the GPUs? If you're using Nvidia cards, there is a lot happening in the background to get them running.

Are you using modeset or nomodeset on boot of the VMs?


I'm using Clear Linux for this project. I just ran some tests with nomodeset and modeset in the kernel cmdline: no difference, still 28s. I'm running one Tesla T4 in a VM with 8 vCPUs, 16 GB RAM and a 50 GB SSD on Clear Linux.

I’ve posted this elsewhere and a Google Support Team member says:

This is an internal limitation in GCE and GKE and beside baking the drivers into the OS image there’s not a lot that can be currently done to remediate this.
However, I noticed that startup times have dropped over time, so there is some improvement in this matter. You can report this via Public Issue Tracker to follow development.
You can also consider using Committed Use Discount or Sustained Use Discounts. It may be beneficial in the long run to keep the instances running and therefore avoid the startup problem altogether.

To save someone time: "baking the drivers into the OS image" does not change the TTL of the VM; there is no difference with or without.

Thanks for the response @Mastic_Warrior; if you come across anything, do let me know.


Will do.

I am surprised nomodeset did not make a difference. It is good that a Google rep got back to you. That is some quick customer service!

If you're also experiencing this issue and would like to track its progress, I created an issue report:
https://issuetracker.google.com/issues/200575905