Control Plane Baremetal Node Failure - Unplanned

This section describes how to replace a failed primary control plane 1 Bare Metal node in a stacked cluster. When the primary control plane 1 Bare Metal node fails, its status changes to NotReady. Use the following command to view the status of the nodes in the cluster:
kubectl get nodes 
In the following example, the status of the primary control plane 1 node changes to NotReady after it fails.
user1-cloud@kali-stacked-controlplane:~$ kubectl get nodes
NAME                         STATUS     ROLES           AGE    VERSION
kali-stacked-controlplane1   NotReady   control-plane   136m   v1.21.0
kali-stacked-controlplane2   Ready      control-plane   10h    v1.21.0
kali-stacked-controlplane3   Ready      control-plane   10h    v1.21.0
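
When the node list is long, filtering the output can make the failed node easier to spot. The following is a minimal sketch, not part of the documented procedure: the function name not_ready_nodes is hypothetical, and it assumes the default column layout of kubectl get nodes shown above (NAME first, STATUS second).

```shell
# Print the names of nodes whose STATUS does not begin with "Ready".
# Reads the default tabular output of `kubectl get nodes` on stdin.
# A status such as "Ready,SchedulingDisabled" is still treated as Ready.
not_ready_nodes() {
  awk 'NR > 1 { split($2, s, ","); if (s[1] != "Ready") print $1 }'
}

# Usage (assumes kubectl is already configured for the cluster):
#   kubectl get nodes | not_ready_nodes
```

With the example output above, the filter would print only the name of the failed control plane 1 node.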

All pods on the failed primary control plane 1 Bare Metal node remain either terminated or in the Pending state. You can verify the status of the pods using the kubectl get pods command, as shown in the following example:

user1-cloud@kali-stacked-controlplane3:~$ kubectl get pods -A
NAMESPACE           NAME                                             READY   STATUS    RESTARTS   AGE
kube-system         calico-kube-controllers-5d7fff4bc6-lxkpc         1/1     Running   0          7h26m
kube-system         calico-node-tx7zg                                1/1     Running   0          10h
kube-system         calico-node-v6m7v                                1/1     Running   0          10h
kube-system         coredns-66d57f55d9-6dnsn                         1/1     Running   0          136m
kube-system         coredns-66d57f55d9-rdtbd                         1/1     Running   0          136m
kube-system         etcd-kali-stacked-controlplane2                  1/1     Running   0          10h
kube-system         etcd-kali-stacked-controlplane3                  1/1     Running   0          10h
kube-system         kube-apiserver-kali-stacked-controlplane2        1/1     Running   0          10h
kube-system         kube-apiserver-kali-stacked-controlplane3        1/1     Running   0          10h
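
To pick out the affected pods from a long listing, the output can be filtered on the STATUS column. This is a rough sketch under stated assumptions: the function name pods_not_running is hypothetical, and the input is assumed to be the default output of kubectl get pods -A, where STATUS is the fourth column.

```shell
# Print "namespace/name status" for every pod that is neither Running
# nor Completed. Reads `kubectl get pods -A` output on stdin and skips
# the header line.
pods_not_running() {
  awk 'NR > 1 && $4 != "Running" && $4 != "Completed" { print $1 "/" $2, $4 }'
}

# Usage (assumes kubectl is already configured for the cluster):
#   kubectl get pods -A | pods_not_running
# To list only the pods scheduled on the failed node instead:
#   kubectl get pods -A -o wide --field-selector spec.nodeName=kali-stacked-controlplane1
```

An empty result means that every listed pod is Running or Completed.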

To replace the failed primary control plane 1 Bare Metal node:

  1. Delete the failed primary control plane 1 Bare Metal node using the following command:

    kubectl delete node node_name 

    Example:

    user1-cloud@kali-stacked-controlplane3:~$ kubectl delete node kali-stacked-controlplane1
    node "kali-stacked-controlplane1" deleted
    
  2. Assign the primary control plane 1 Bare Metal node to maintenance mode in the cluster configuration using the following commands:

    configure 
      clusters cluster_name 
      nodes controlplane1 
      maintenance true 
      commit 
      end 

    Example:

    SMI Cluster Deployer(config)# clusters kali-stacked 
    SMI Cluster Deployer(config-clusters-kali-stacked)# nodes controlplane1 
    SMI Cluster Deployer(config-nodes-controlplane1)# maintenance true 
    SMI Cluster Deployer(config-nodes-controlplane1)# commit
    Commit complete.
    SMI Cluster Deployer(config-nodes-controlplane1)# end
    SMI Cluster Deployer# 

  3. The node is ready for the RMA process.

    Note

    If the remaining nodes need to be upgraded or NFs need to be synchronized, you can run a cluster sync while the node is in this state. This sync is not part of the RMA process.

  4. Add the node back to the cluster when it is repaired or replaced and available.

    Note

    If you add the node back after it is repaired, clear the boot drive and the virtual drive on the node first so that no state from the previous installation remains. Removing the virtual drive is not required for a new replacement node.

  5. Attach the new primary control plane 1 Bare Metal node and remove it from the maintenance mode in the cluster configuration using the following commands:

    configure 
      clusters cluster_name 
      nodes controlplane1 
      maintenance false 
      commit 
      end 

    Example:

    SMI Cluster Deployer(config)# clusters kali-stacked 
    SMI Cluster Deployer(config-clusters-kali-stacked)# nodes controlplane1 
    SMI Cluster Deployer(config-nodes-controlplane1)# maintenance false 
    SMI Cluster Deployer(config-nodes-controlplane1)# commit
    Commit complete.
    SMI Cluster Deployer(config-nodes-controlplane1)# end
    SMI Cluster Deployer# 

  6. Run the cluster synchronization using the following command:

    clusters cluster_name actions sync run debug true 

    Example:

    SMI Cluster Deployer# clusters kali-stacked actions sync run debug true
    This will run sync.  Are you sure? [no,yes] yes
    message accepted

  7. Verify the status of the cluster using the following command:

    clusters cluster_name actions k8s cluster-status 

    Example:

    SMI Cluster Deployer# clusters kali-stacked actions k8s cluster-status     
    pods-desired-count 40
    pods-ready-count 39
    pods-desired-are-ready true
    etcd-healthy true
    all-ok true
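
The final verification can also be checked mechanically once the status text has been saved. The sketch below is illustrative only: the function name check_cluster_status and the file name status.txt are hypothetical, and it assumes the output of the cluster-status action has been captured to a text file in the format shown in step 7.

```shell
# Succeed (exit 0) only when the status text on stdin reports both
# "etcd-healthy true" and "all-ok true". If the desired and ready pod
# counts differ, print a note to stderr without failing the check.
check_cluster_status() {
  awk '
    $1 == "pods-desired-count" { desired = $2 }
    $1 == "pods-ready-count"   { ready = $2 }
    $1 == "etcd-healthy"       { etcd = $2 }
    $1 == "all-ok"             { ok = $2 }
    END {
      if (desired != ready)
        print "note: " ready " of " desired " pods ready" > "/dev/stderr"
      exit !(etcd == "true" && ok == "true")
    }'
}

# Usage (status.txt is a hypothetical file holding the captured output):
#   check_cluster_status < status.txt && echo "cluster healthy"
```

For the example output in step 7, the check would succeed and also note that 39 of 40 pods are ready.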

NOTES:

  • clusters cluster_name - Specifies the K8s cluster.

  • nodes controlplane1 - Specifies the primary control plane 1 Bare Metal node.

  • maintenance true/false - Places the primary control plane 1 Bare Metal node in maintenance mode (true) or removes it from maintenance mode (false).

  • actions sync run debug true - Synchronizes the cluster configuration.

  • actions k8s cluster-status - Displays the status of the cluster.
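
Because steps 2 and 5 differ only in the maintenance value, the command sequence can be produced from a small template. This is a sketch for convenience, not a feature of the SMI Cluster Deployer: the function name maintenance_commands is hypothetical, and it only prints the commands described above so that they can be reviewed and entered in the CLI.

```shell
# Print the configuration commands that set the maintenance flag for
# one node. Arguments: cluster name, node name, and "true" or "false".
maintenance_commands() {
  cluster=$1
  node=$2
  state=$3
  case $state in
    true|false) ;;
    *) echo "state must be true or false" >&2; return 1 ;;
  esac
  printf '%s\n' \
    "configure" \
    "clusters $cluster" \
    "nodes $node" \
    "maintenance $state" \
    "commit" \
    "end"
}

# Usage:
#   maintenance_commands kali-stacked controlplane1 true
```

Rejecting any value other than true or false guards against a typo producing an invalid maintenance line.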