Worker Bare Metal Node Failure - Unplanned

This section describes the procedure to replace a failed primary worker Bare Metal node in a stacked cluster. When the primary worker Bare Metal node fails, its status changes to NotReady.

Verify the status of the nodes in the cluster using the following command:
kubectl get nodes 
In the following example, the status of the primary worker node changes to NotReady after it fails.
user1-cloud@kali-stacked-controlplane1:~$ kubectl get nodes
NAME                        STATUS     ROLES           AGE    VERSION
kali-stacked-cmts-worker1   NotReady   <none>          38m    v1.21.0
kali-stacked-cmts-worker2   Ready      <none>          38m    v1.21.0
kali-stacked-cmts-worker3   Ready      <none>          38m    v1.21.0
kali-stacked-controlplane1  Ready      control-plane   125m   v1.21.0
kali-stacked-controlplane2  Ready      control-plane   13h    v1.21.0
kali-stacked-controlplane3  Ready      control-plane   13h    v1.21.0
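Failed nodes can also be picked out programmatically by filtering the STATUS column. The sketch below parses a sample of the output shown above (the node names are copied from that example); on a live cluster, pipe `kubectl get nodes --no-headers` into the same filter instead of the sample text.

```shell
# Print the names of nodes whose STATUS column is NotReady.
# Live usage: kubectl get nodes --no-headers | not_ready_nodes
not_ready_nodes() {
  awk '$2 == "NotReady" {print $1}'
}

# Sample `kubectl get nodes --no-headers` output from the example above:
sample='kali-stacked-cmts-worker1   NotReady   <none>          38m    v1.21.0
kali-stacked-cmts-worker2   Ready      <none>          38m    v1.21.0
kali-stacked-controlplane1  Ready      control-plane   125m   v1.21.0'

printf '%s\n' "$sample" | not_ready_nodes    # prints: kali-stacked-cmts-worker1
```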

To replace the failed primary worker Bare Metal node in the cluster:

  1. Delete the failed primary worker Bare Metal node using the following command:

    kubectl delete node node_name 

    Example:

    user1-cloud@kali-stacked-controlplane3:~$ kubectl delete node kali-stacked-cmts-worker1
    node "kali-stacked-cmts-worker1" deleted
    
  2. Assign the primary worker Bare Metal node to maintenance mode in the cluster configuration using the following commands:

    configure 
      clusters cluster_name 
      nodes worker_node 
      maintenance true 
      commit 
      end 

    Example:

    SMI Cluster Deployer(config)# clusters kali-stacked
    SMI Cluster Deployer(config-clusters-kali-stacked)# nodes cmts-worker1   
    SMI Cluster Deployer(config-nodes-cmts-worker1)# maintenance true     
    SMI Cluster Deployer(config-nodes-cmts-worker1)# commit
    Commit complete.
    SMI Cluster Deployer(config-nodes-cmts-worker1)# end
  3. The node is now ready for the RMA process.

    Note

    If the remaining nodes must be upgraded or NFs must be synchronized, run a cluster sync in this state. However, this sync is not part of the RMA process.

  4. Add the node back to the cluster when it is repaired or replaced and available.

    Note

    If you add a node back after it is repaired, ensure that the disks are clean by clearing the boot drive and virtual drive on the node. This ensures that the virtual drive carries no residual state from its previous deployment. Removing the virtual drive is not required for a new replacement node.

  5. Attach the new primary worker Bare Metal node and remove it from maintenance mode in the cluster configuration using the following commands:

    configure 
      clusters cluster_name 
      nodes worker_node 
      maintenance false 
      commit 
      end 

    Example:

    SMI Cluster Deployer(config)# clusters kali-stacked 
    SMI Cluster Deployer(config-clusters-kali-stacked)# nodes cmts-worker1 
    SMI Cluster Deployer(config-nodes-cmts-worker1)# maintenance false 
    SMI Cluster Deployer(config-nodes-cmts-worker1)# commit
    Commit complete.
    SMI Cluster Deployer(config-nodes-cmts-worker1)# end
    SMI Cluster Deployer# 
  6. Run the cluster synchronization using the following command:

    clusters cluster_name actions sync run debug true 

    Example:

    SMI Cluster Deployer# clusters kali-stacked actions sync run debug true
    This will run sync.  Are you sure? [no,yes] yes
    message accepted
  7. Verify the status of the cluster using the following command:

    clusters cluster_name actions k8s cluster-status 

    Example:

    SMI Cluster Deployer# clusters kali-stacked actions k8s cluster-status 
    pods-desired-count 67
    pods-ready-count 67
    pods-desired-are-ready true
    etcd-healthy true
    all-ok true
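The health fields in this output can be checked in a script rather than by eye. The sketch below inspects a sample of the cluster-status output shown above (the field names, such as all-ok, are taken from that example); on a live deployer, capture the command output into the variable first.

```shell
# Report cluster health based on the `all-ok` field of the cluster-status output.
check_cluster_status() {
  case "$1" in
    *"all-ok true"*) echo "cluster healthy" ;;
    *)               echo "cluster NOT healthy" ;;
  esac
}

# Sample cluster-status output from the example above:
status='pods-desired-count 67
pods-ready-count 67
pods-desired-are-ready true
etcd-healthy true
all-ok true'

check_cluster_status "$status"    # prints: cluster healthy
```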
  8. Verify the status of the pods redeployed on the added worker node using the following command:

    clusters cluster_name nodes worker_node actions k8s pod-status show-pod-details 

    Example:

    SMI Cluster Deployer# clusters kali-stacked nodes cmts-worker1 actions k8s pod-status show-pod-details 
    Value for 'show-pod-details' [false,true]: true
    pods {
        name calico-node-67gs6
        namespace kube-system
        owner-kind DaemonSet
        owner-name calico-node
        ready true
    }
    pods {
        name coredns-f9fd979d6-b2gsb
        namespace kube-system
        owner-kind ReplicaSet
        owner-name coredns-f9fd979d6
        ready true
    }
    pods {
        name kube-proxy-5m9qh
        namespace kube-system
        owner-kind DaemonSet
        owner-name kube-proxy
        ready true
    }
    pods {
        name maintainer-2nxlq
        namespace kube-system
        owner-kind DaemonSet
        owner-name maintainer
        ready true
    }
    pods {
        name charts-cee-2020-02-0-i21-4
        namespace registry
        owner-kind StatefulSet
        owner-name charts-cee-2020-02-0-i21
        ready true
    }
    pods {
        name charts-cluster-deployer-2020-02-0-i22-5
        namespace registry
        owner-kind StatefulSet
        owner-name charts-cluster-deployer-2020-02-0-i22
        ready true
    }
    pods {
        name registry-cee-2020-02-0-i21-5
        namespace registry
        owner-kind StatefulSet
        owner-name registry-cee-2020-02-0-i21
        ready true
    }
    pods {
        name registry-cluster-deployer-2020-02-0-i22-5
        namespace registry
        owner-kind StatefulSet
        owner-name registry-cluster-deployer-2020-02-0-i22
        ready true
    }
    pods {
        name software-unpacker-3
        namespace registry
        owner-kind StatefulSet
        owner-name software-unpacker
        ready true
    }
    pods {
        name keepalived-jrj4g
        namespace smi-vips
        owner-kind DaemonSet
        owner-name keepalived
        ready true
    }
    pods-count 10
    pods-available-to-drain-count 6
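As a quick sanity check, the number of ready pods in this output should match the pods-count field (10 in the example above). The sketch below counts ready entries in a shortened sample of the pod-status output; on a live deployer, pipe the full command output through the same filter.

```shell
# Count pods reported as ready in the pod-status output.
ready_pod_count() {
  grep -c 'ready true'
}

# A shortened sample of the pod-status output shown above:
pods='pods {
    name calico-node-67gs6
    ready true
}
pods {
    name kube-proxy-5m9qh
    ready true
}'

printf '%s\n' "$pods" | ready_pod_count    # prints: 2
```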
Note

You can follow the same procedure to replace one or more failed worker nodes in the cluster.

NOTES:

  • clusters cluster_name - Specifies the K8s cluster.

  • nodes worker_node - Specifies the primary worker Bare Metal node.

  • maintenance true/false - Moves the primary worker Bare Metal node into or out of maintenance mode.

  • actions sync run debug true - Synchronizes the cluster configuration.

  • actions k8s cluster-status - Displays the status of the cluster.