A user runs a request on a healthy ODAS cluster but receives "Host not reachable" errors.
When a worker node is removed from an ODAS cluster with the kubectl delete node <node> command, its IP address remains in the Deployment Manager's (DM's) node list, as observed with the ocadm clusters nodes <cluster> command. The Planner still treats the dropped node as valid and assigns it tasks, which triggers the "Host not reachable" error.
The DM updates correctly when an administrator grows or shrinks the cluster's node count. The process for removing an arbitrary node -- for example, because its EC2 instance has failed -- is currently incomplete.
We recommend the following workarounds until this process is fixed:
- Recreate the cluster; or
- Reduce the cluster size from n to n-1 nodes, so that the removed node is dropped from the DM's list. Allow some time for all cluster members to receive the state change, then restore the original cluster size.
$ ocadm clusters nodes 61
10.180.44.81 10.180.45.104 10.180.45.9 10.180.45.164 10.180.45.106 10.180.45.254 10.180.44.15 10.180.45.6 10.180.44.84 10.180.44.35 10.180.44.142 10.180.44.30 10.180.44.162 10.180.44.126 10.180.45.100 10.180.44.213 10.180.45.71 10.180.45.84 10.180.45.74 10.180.44.17 10.180.44.22 10.180.44.51 10.180.44.38 10.180.44.9 10.180.44.155 10.180.44.233 10.180.45.233 10.180.45.168 10.180.44.224 10.180.45.140 10.180.44.225 10.180.45.246 10.180.44.118 10.180.44.113 10.180.44.191 10.180.45.200 10.180.44.201 10.180.45.105 10.180.45.153 10.180.44.45 10.180.44.116
Assume the fourth node in the list (10.180.45.164) was removed using the kubectl delete node <node> command. You could recreate the cluster to update the DM's node list. If that is impractical, you could instead reduce the cluster size to 40 (here n = 41). Allow some time for the cluster to update itself, then restore the original node count.
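As a sanity check before resizing, you can confirm which entry is stale by diffing the DM's list against the nodes Kubernetes still reports. A minimal sketch in Python, assuming you have pasted in the output of ocadm clusters nodes and kubectl as plain strings (the function name and sample data below are illustrative, not part of any Okera API):

```python
def stale_nodes(dm_output: str, live_ips: set) -> set:
    """Return IPs the DM still lists but Kubernetes no longer runs.

    dm_output: whitespace-separated IPs from `ocadm clusters nodes <cluster>`.
    live_ips:  the node IPs Kubernetes currently reports.
    """
    dm_ips = set(dm_output.split())
    return dm_ips - live_ips

# Illustrative data: a shortened DM list, and the live set after
# `kubectl delete node` removed 10.180.45.164.
dm_output = "10.180.44.81 10.180.45.104 10.180.45.9 10.180.45.164"
live = {"10.180.44.81", "10.180.45.104", "10.180.45.9"}

print(stale_nodes(dm_output, live))  # {'10.180.45.164'}
```

Any IP this prints is a node the Planner may still schedule tasks on, producing the "Host not reachable" error until the workaround above is applied.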