Multiple Compute or Node Failures

When multiple compute nodes fail, all the replicas of a Slot map go down. For instance, if two nodes are down, both replicas of a Slot map go down (the default Slot replica count is two). This causes the entire local replica set to fail and triggers a shutdown of all the cdl-ep pods. If GeoHA is configured, a CDL Geo Replication (GR) failover is also triggered, and the NF application talks to the cdl-ep at the remote site to process requests. If GeoHA is not configured, the failure ultimately results in service downtime.
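The decision described above can be summarized as a small sketch. This is an illustrative model only, not CDL code: the function name `outcome_after_node_failures`, the `Outcome` enum, and the parameter names are hypothetical, and the case where at least one replica survives is an assumption added for completeness.

```python
from enum import Enum


class Outcome(Enum):
    # Assumption: with at least one replica up, the local site keeps serving.
    LOCAL_REPLICA_AVAILABLE = "at least one local Slot replica is still up"
    GEO_FAILOVER = "cdl-ep pods shut down; NF uses the remote site cdl-ep"
    SERVICE_DOWNTIME = "cdl-ep pods shut down; no remote site to fall back to"


def outcome_after_node_failures(replicas_down: int,
                                slot_replica_count: int = 2,
                                geo_ha_configured: bool = False) -> Outcome:
    """Model the outcome for one Slot map after compute node failures."""
    if not 0 <= replicas_down <= slot_replica_count:
        raise ValueError("replicas_down must be between 0 and slot_replica_count")
    # The local replica set fails only when every replica of the map is down.
    if replicas_down < slot_replica_count:
        return Outcome.LOCAL_REPLICA_AVAILABLE
    # Entire local replica set is down: the result depends on GeoHA.
    if geo_ha_configured:
        return Outcome.GEO_FAILOVER
    return Outcome.SERVICE_DOWNTIME
```

With the default replica count of two, losing both replicas yields `GEO_FAILOVER` when GeoHA is configured and `SERVICE_DOWNTIME` otherwise.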

When the failed nodes recover, the Slot pods and the Slot replica set recover and, if both replicas were down during the failure, synchronize from the remote site. This synchronization happens only if GeoHA is configured. The Slot Recovery from Remote Peer figure depicts the initial synchronization from the remote peer when the entire local replica set was down.
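As a rough illustration of the recovery rule, the sketch below picks the source of the initial synchronization. It is a hypothetical model, not CDL code: the function name, its parameters, and the string labels are invented for this example, and the "local-replica" branch (a surviving local replica acts as the source) is an assumption not stated in this section.

```python
from typing import Optional


def initial_sync_source(all_local_replicas_were_down: bool,
                        geo_ha_configured: bool) -> Optional[str]:
    """Return where a recovering Slot pod would get its initial data from."""
    if not all_local_replicas_were_down:
        # Assumption: a surviving local replica can serve as the source.
        return "local-replica"
    if geo_ha_configured:
        # Entire local replica set was down: sync from the remote peer.
        return "remote-site"
    # No local replica survived and no remote site exists to sync from.
    return None
```

A return value of `None` models the case where no synchronization source is available because the entire local replica set was down and GeoHA is not configured.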