Managed Services
Using managed services such as Vertex AI, BigQuery, and AutoML allows an organization to focus on models and services rather than on hardware. With these products, Google handles infrastructure maintenance behind the scenes, so planned maintenance events are largely invisible to the workload.
Checkpointing
For long-running jobs and training pipelines, implement checkpoints that periodically save state (model weights, optimizer state). Tools such as PyTorch's
torch.save()
or TensorFlow's
tf.train.Checkpoint
make this straightforward. If a VM restart or shutdown occurs, the job can resume from the last checkpoint.
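The resume-from-last-checkpoint pattern can be sketched framework-agnostically with the standard library; a real training loop would save framework state with torch.save() or tf.train.Checkpoint rather than pickle, and the path, interval, and loop body below are illustrative:

```python
import os
import pickle

CKPT_PATH = "checkpoint.pkl"  # hypothetical checkpoint location

def save_checkpoint(step, state, path=CKPT_PATH):
    # Write atomically: dump to a temp file, then rename, so an
    # interruption mid-write never leaves a corrupt checkpoint.
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)

def load_checkpoint(path=CKPT_PATH):
    # Return (step, state); (0, {}) when no checkpoint exists yet.
    if not os.path.exists(path):
        return 0, {}
    with open(path, "rb") as f:
        ckpt = pickle.load(f)
    return ckpt["step"], ckpt["state"]

def train(total_steps=100, ckpt_every=10):
    # Resume from wherever the last run left off.
    step, state = load_checkpoint()
    while step < total_steps:
        state["loss"] = 1.0 / (step + 1)  # stand-in for a real training step
        step += 1
        if step % ckpt_every == 0:
            save_checkpoint(step, state)
    return step, state
```

If the VM is restarted mid-run, calling train() again picks up from the last saved step instead of step 0.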
Redundancy
If disruption of an inference API is a concern, deploy the workload across multiple zones or regions. Leverage global load balancers to route traffic away from disrupted zones, and replicate workloads across zones so replacement instances can spin up quickly.
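One way to get this on Compute Engine is a regional managed instance group, which spreads identical VMs across zones and can recreate instances that fail a health check. A hedged sketch, where the group, template, and health-check names are placeholders:

```shell
# Spread identical inference VMs across three zones in one region
# (inference-mig and inference-template are placeholder names).
gcloud compute instance-groups managed create inference-mig \
  --region=us-central1 \
  --zones=us-central1-a,us-central1-b,us-central1-c \
  --template=inference-template \
  --size=3

# Enable autohealing: recreate instances that fail their health check,
# waiting 300s after boot before the first check counts.
gcloud compute instance-groups managed update inference-mig \
  --region=us-central1 \
  --health-check=inference-health-check \
  --initial-delay=300
```

A global load balancer pointed at groups like this in several regions completes the picture: traffic shifts to healthy backends while the group replaces disrupted instances.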
VM Metadata Server
Within the VM where a workload is running, one can query the VM's metadata server. Any process on the VM automatically has access to the server and to valuable information about the VM, such as upcoming maintenance events. A lightweight script can poll every 30 seconds (or at some other interval), and its output can feed a notification system that alerts when planned maintenance is approaching.
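A minimal poller might look like the sketch below. The endpoint and the Metadata-Flavor header are the documented Compute Engine metadata interface; the fetch and notify hooks are injectable (here notify is just print) so the logic can be adapted or exercised off-VM:

```python
import urllib.request

MAINTENANCE_URL = (
    "http://metadata.google.internal/computeMetadata/v1"
    "/instance/maintenance-event"
)

def fetch_maintenance_event(url=MAINTENANCE_URL):
    # The metadata server requires this header on every request.
    req = urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.read().decode()

def check_once(fetch=fetch_maintenance_event, notify=print):
    # "NONE" means nothing is scheduled; any other value
    # (e.g. "MIGRATE_ON_HOST_MAINTENANCE") signals a planned event.
    event = fetch()
    if event != "NONE":
        notify(f"Planned maintenance ahead: {event}")
        return True
    return False
```

On the VM, wrapping check_once() in a loop with time.sleep(30) gives the 30-second polling described above; notify can be swapped for a call into whatever alerting system is in place.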
Partition Workloads
Instead of running an enormous workload on a single machine or in a single location, consider partitioning it into smaller, restartable sub-jobs. Smaller units of work reduce exposure to interruption: if one sub-job is disrupted, only that sub-job needs to rerun.
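A minimal sketch of the pattern, assuming each sub-job writes its result to its own file so a restarted run skips partitions that already finished (the output directory and the squaring "work" are illustrative stand-ins):

```python
import json
import os

def partition(items, size):
    # Split the workload into restartable sub-jobs of `size` items each.
    return [items[i:i + size] for i in range(0, len(items), size)]

def run_partitioned(items, size=100, out_dir="results"):
    os.makedirs(out_dir, exist_ok=True)
    chunks = partition(items, size)
    for idx, chunk in enumerate(chunks):
        out_path = os.path.join(out_dir, f"part-{idx}.json")
        if os.path.exists(out_path):
            continue  # finished before an interruption; skip on rerun
        result = [x * x for x in chunk]  # stand-in for the real work
        with open(out_path, "w") as f:
            json.dump(result, f)
    return len(chunks)
```

After an interruption, simply rerunning run_partitioned() redoes only the missing partitions, so the cost of a VM restart is bounded by one sub-job rather than the whole workload.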