During cluster creation, one of the EC2 instances came up bad. The node was removed with the kubectl delete node <node> command, but after running ocadm clusters nodes <cluster> we noticed its IP address was still in the list: the Deployment Manager (DM) still includes the Kubernetes-deleted node in its list of workers. This is the cause of the random "Host not reachable" errors.
Deleting a node from the cluster via Kubernetes does not remove all traces of it: the DM master still believes the host exists and is part of the ODAS cluster. The planners then assign tasks to that node, which produces the "cannot reach" error. While we can scale the cluster up and down in general, we do not currently have a way to remove an arbitrary node.
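To confirm the mismatch, you can diff the two views of the cluster. This is a minimal sketch, assuming cluster id 61 (as in this ticket) and that worker IPs appear in the INTERNAL-IP column (the sixth) of kubectl get nodes -o wide:

# Nodes Kubernetes knows about
kubectl get nodes -o wide | awk 'NR>1 {print $6}' | sort > k8s-nodes.txt
# Nodes the DM still tracks (output is a space-separated list of IPs)
./ocadm clusters nodes 61 | tr ' ' '\n' | sort > dm-nodes.txt
# Any IP only in dm-nodes.txt is a stale entry the DM still hands to the planners
comm -13 k8s-nodes.txt dm-nodes.txt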
The recommendations, for now, are as follows:
1. Recreate the cluster; or
2. Scale the cluster down to a size of n-1, where the offending node is the nth node listed in the output of ./ocadm clusters nodes <cluster id>, wait a few minutes for the cluster to settle, and then scale back up.
In this ticket, the ocadm output is:
ocadm clusters nodes 61
10.180.44.81 10.180.45.104 10.180.45.9 10.180.45.164 10.180.45.106 10.180.45.254 10.180.44.15 10.180.45.6 10.180.44.84 10.180.44.35 10.180.44.142 10.180.44.30 10.180.44.162 10.180.44.126 10.180.45.100 10.180.44.213 10.180.45.71 10.180.45.84 10.180.45.74 10.180.44.17 10.180.44.22 10.180.44.51 10.180.44.38 10.180.44.9 10.180.44.155 10.180.44.233 10.180.45.233 10.180.45.168 10.180.44.224 10.180.45.140 10.180.44.225 10.180.45.246 10.180.44.118 10.180.44.113 10.180.44.191 10.180.45.200 10.180.44.201 10.180.45.105 10.180.45.153 10.180.44.45 10.180.44.116
The node in question is the fourth entry in the list (the third worker): 10.180.45.164.
So, following option 2, you would scale the cluster down to size 3 (4 - 1), wait a few minutes for the cluster to settle, and then scale back up to the desired size, as sketched below.
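As commands, the workaround for this ticket might look like the following. The scale subcommand shown here is an assumption, not verified against the ocadm CLI; check ./ocadm clusters --help (or your ocadm documentation) for the exact syntax before running:

# Hedged sketch of option 2 for this ticket (cluster 61, 41 entries total).
# The "scale" subcommand name and arguments are assumptions; verify first.
./ocadm clusters scale 61 3     # offending node is entry 4, so scale to 4 - 1 = 3
# Wait a few minutes for the cluster to settle, then confirm 10.180.45.164
# is gone from the DM's list:
./ocadm clusters nodes 61
# Scale back up to the original size:
./ocadm clusters scale 61 41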