Worker Bare Metal Node Failure - Unplanned
This section describes the procedure to replace a failed primary worker Bare Metal node in a stacked cluster. When the primary worker Bare Metal node fails, the node status changes to NotReady. Verify the node status using the following command:
kubectl get nodes
In the following example, the status of the primary worker node changes to NotReady after it fails:

user1-cloud@kali-stacked-controlplane1:~$ kubectl get nodes
NAME                         STATUS     ROLES           AGE    VERSION
kali-stacked-cmts-worker1    NotReady   <none>          38m    v1.21.0
kali-stacked-cmts-worker2    Ready      <none>          38m    v1.21.0
kali-stacked-cmts-worker3    Ready      <none>          38m    v1.21.0
kali-stacked-controlplane1   Ready      control-plane   125m   v1.21.0
kali-stacked-controlplane2   Ready      control-plane   13h    v1.21.0
kali-stacked-controlplane3   Ready      control-plane   13h    v1.21.0
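Before you delete the node, you can optionally confirm the failure with standard kubectl commands (these are generic Kubernetes checks, not SMI Cluster Deployer commands). On a failed node, the Ready condition typically reports Unknown because the kubelet has stopped posting status:

kubectl describe node kali-stacked-cmts-worker1
kubectl get node kali-stacked-cmts-worker1 -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'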
To replace the failed primary worker Bare Metal node in the cluster:
-
Delete the failed primary worker Bare Metal node using the following command:
kubectl delete node node_name
Example:
user1-cloud@kali-stacked-controlplane3:~$ kubectl delete node kali-stacked-cmts-worker1
node "kali-stacked-cmts-worker1" deleted
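You can optionally confirm that the node object has been removed (standard kubectl; a missing node returns a NotFound error):

user1-cloud@kali-stacked-controlplane3:~$ kubectl get node kali-stacked-cmts-worker1
Error from server (NotFound): nodes "kali-stacked-cmts-worker1" not found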
-
Assign the primary worker Bare Metal node to maintenance mode in the cluster configuration using the following commands:
configure
   clusters cluster_name
      nodes worker_node
         maintenance true
         commit
         end
Example:
SMI Cluster Deployer(config)# clusters kali-stacked
SMI Cluster Deployer(config-clusters-kali-stacked)# nodes cmts-worker1
SMI Cluster Deployer(config-nodes-cmts-worker1)# maintenance true
SMI Cluster Deployer(config-nodes-cmts-worker1)# commit
Commit complete.
SMI Cluster Deployer(config-nodes-cmts-worker1)# end
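You can optionally verify the committed maintenance flag with a show command (the path below assumes the Deployer's standard ConfD-style running-config syntax):

SMI Cluster Deployer# show running-config clusters kali-stacked nodes cmts-worker1 maintenance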
-
The node is ready for the RMA process.
Note: If the remaining nodes need to be upgraded or NFs need to be synchronized, run a cluster sync in this state. However, the sync is not part of the RMA process.
-
Add the node back to the cluster when it is repaired or replaced and available.
Note: If you add a node back after it is repaired, ensure that the disks are clean by clearing the boot drive and virtual drive on the node. This ensures that the virtual drive carries no state from before the failure when you add the node back. Clearing the virtual drive is not required for a new replacement node.
-
Attach the new primary worker Bare Metal node and remove it from maintenance mode in the cluster configuration using the following commands:
configure
   clusters cluster_name
      nodes worker_node
         maintenance false
         commit
         end
Example:
SMI Cluster Deployer(config)# clusters kali-stacked
SMI Cluster Deployer(config-clusters-kali-stacked)# nodes cmts-worker1
SMI Cluster Deployer(config-nodes-cmts-worker1)# maintenance false
SMI Cluster Deployer(config-nodes-cmts-worker1)# commit
Commit complete.
SMI Cluster Deployer(config-nodes-cmts-worker1)# end
SMI Cluster Deployer#
-
Run the cluster synchronization using the following command:
clusters cluster_name actions sync run debug true
Example:
SMI Cluster Deployer# clusters kali-stacked actions sync run debug true
This will run sync.  Are you sure? [no,yes] yes
message accepted
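While the synchronization runs, you can optionally watch the node rejoin the cluster and transition to Ready from any control plane node (standard kubectl; the -w flag streams status updates):

user1-cloud@kali-stacked-controlplane1:~$ kubectl get nodes -w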
-
Verify the status of the cluster using the following command:
clusters cluster_name actions k8s cluster-status
Example:
SMI Cluster Deployer# clusters kali-stacked actions k8s cluster-status
pods-desired-count 67
pods-ready-count 67
pods-desired-are-ready true
etcd-healthy true
all-ok true
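As an optional cross-check of the cluster-status output, you can list any pods that are not in the Running phase (standard kubectl; an empty result is consistent with all-ok true):

kubectl get pods -A --field-selector=status.phase!=Running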
-
Verify the status of the pods redeployed on the added worker node using the following command:
clusters cluster_name nodes worker_node actions k8s pod-status show-pod-details
Example:

SMI Cluster Deployer# clusters kali-stacked nodes cmts-worker1 actions k8s pod-status show-pod-details
Value for 'show-pod-details' [false,true]: true
pods {
    name calico-node-67gs6
    namespace kube-system
    owner-kind DaemonSet
    owner-name calico-node
    ready true
}
pods {
    name coredns-f9fd979d6-b2gsb
    namespace kube-system
    owner-kind ReplicaSet
    owner-name coredns-f9fd979d6
    ready true
}
pods {
    name kube-proxy-5m9qh
    namespace kube-system
    owner-kind DaemonSet
    owner-name kube-proxy
    ready true
}
pods {
    name maintainer-2nxlq
    namespace kube-system
    owner-kind DaemonSet
    owner-name maintainer
    ready true
}
pods {
    name charts-cee-2020-02-0-i21-4
    namespace registry
    owner-kind StatefulSet
    owner-name charts-cee-2020-02-0-i21
    ready true
}
pods {
    name charts-cluster-deployer-2020-02-0-i22-5
    namespace registry
    owner-kind StatefulSet
    owner-name charts-cluster-deployer-2020-02-0-i22
    ready true
}
pods {
    name registry-cee-2020-02-0-i21-5
    namespace registry
    owner-kind StatefulSet
    owner-name registry-cee-2020-02-0-i21
    ready true
}
pods {
    name registry-cluster-deployer-2020-02-0-i22-5
    namespace registry
    owner-kind StatefulSet
    owner-name registry-cluster-deployer-2020-02-0-i22
    ready true
}
pods {
    name software-unpacker-3
    namespace registry
    owner-kind StatefulSet
    owner-name software-unpacker
    ready true
}
pods {
    name keepalived-jrj4g
    namespace smi-vips
    owner-kind DaemonSet
    owner-name keepalived
    ready true
}
pods-count 10
pods-available-to-drain-count 6
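An equivalent optional view from kubectl (a generic Kubernetes cross-check, not an SMI Cluster Deployer command) lists the pods scheduled on the replaced worker node:

kubectl get pods -A --field-selector spec.nodeName=kali-stacked-cmts-worker1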
Note: You can follow the same procedure to replace one or more failed worker nodes in the cluster.
NOTES:
-
clusters cluster_name - Specifies the K8s cluster.
-
nodes worker_node - Specifies the primary worker Bare Metal node.
-
maintenance true/false - Assigns the primary worker Bare Metal node to maintenance mode or removes it from maintenance mode.
-
actions sync run debug true - Synchronizes the cluster configuration.
-
actions k8s cluster-status - Displays the status of the cluster.