How can I use system logs to validate that all of my ODAS services are running?
- A Deployment Manager (DM) is launched first on an EC2 instance.
- The DM launches the other EC2 nodes required for the cluster, using Kubernetes in the background.
- A single DM can control one or more ODAS clusters.
- Kubernetes is responsible for managing the services and scaling them as needed.
- Kubernetes runs these services in containers:
  - ODAS planner
  - ODAS workers
  - ODAS REST server
Troubleshooting the Components
Deployment Manager (DM)
To see if the DM came up successfully, search the DM log for this message:
"DeploymentManager started in master mode"
Also in the log, check the "System Configuration" block for valid configuration values. Pay special attention to the values for DB_URL, DB_NAME, DB_USERNAME, DB_PASSWORD, PORT_CONFIGURATION, and S3_STAGING_DIR.
- Are the credentials correct?
- Are the URL and the port specified correctly?
- Does the S3_STAGING_DIR exist and have the correct permissions?
Check the values for variables that are site-specific (such as SYSTEM_TOKEN) as well.
Logs (cluster nodes): /var/log/cerebro/deployment-manager.log
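Both DM log checks can be scripted as a first pass. This Python sketch looks for the startup message and pulls the configuration keys listed above out of the log, flagging any that are missing or empty. It assumes (unverified) that each "System Configuration" entry is printed on its own line as KEY=VALUE or KEY: VALUE; adjust the regular expression if your log differs. The function names are hypothetical and the default path is the log location given above. The script cannot tell whether credentials are correct or whether S3_STAGING_DIR exists, so those questions still need a manual check.

```python
import re
from pathlib import Path

DM_LOG = "/var/log/cerebro/deployment-manager.log"  # path given in this article
DM_READY_MESSAGE = "DeploymentManager started in master mode"

# Keys this article says to review in the "System Configuration" block.
REQUIRED_KEYS = ("DB_URL", "DB_NAME", "DB_USERNAME", "DB_PASSWORD",
                 "PORT_CONFIGURATION", "S3_STAGING_DIR")


def read_log(path: str = DM_LOG) -> str:
    """Read the DM log; errors="replace" stops a stray bad byte aborting the read."""
    return Path(path).read_text(encoding="utf-8", errors="replace")


def dm_started(log_text: str) -> bool:
    """True if the DM startup message appears anywhere in the log."""
    return DM_READY_MESSAGE in log_text


def extract_config(log_text: str, keys=REQUIRED_KEYS) -> dict:
    """Map each key to the last value logged for it, or None if never seen.

    ASSUMPTION: settings appear one per line as KEY=VALUE or KEY: VALUE.
    """
    found = {key: None for key in keys}
    for key in keys:
        pattern = re.compile(rf"\b{re.escape(key)}\s*[=:]\s*(.*)$")
        for line in log_text.splitlines():
            match = pattern.search(line)
            if match:
                found[key] = match.group(1).strip()
    return found


def problem_keys(config: dict) -> list:
    """Keys that are missing or empty and therefore need a closer look."""
    return [key for key, value in config.items() if not value]
```

On the DM host, `text = read_log()` followed by `dm_started(text)` and `problem_keys(extract_config(text))` gives a quick summary before you read the log by hand.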
An ODAS cluster failure is usually due to node configuration issues or to services not starting (often because of security credentials). To verify that the cluster is running, run the following command from the shell on the DM node:
/opt/cerebro/cerebro_cli clusters list
The output of a healthy cluster looks something like this:
$ /opt/cerebro/cerebro_cli clusters list
description  id  name    numNodes  numRunningServices  owner  statusCode  statusMessage          type
-----------  --  ------  --------  ------------------  -----  ----------  ---------------------  ------------------
             36  080_c1  2         7/7                 admin  READY       All services running.  STANDALONE_CLUSTER
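To check this output from a script, a Python sketch like the following can parse it. It assumes the output is a fixed-width table whose column boundaries are marked by the row of dashes, as in the sample above, and it treats a cluster as healthy only when statusMessage is "All services running." and the two numbers in numRunningServices match. The helper names are hypothetical.

```python
import re
import subprocess

CLI = "/opt/cerebro/cerebro_cli"  # CLI path used throughout this article


def clusters_list() -> str:
    """Run `cerebro_cli clusters list` and return its raw output."""
    result = subprocess.run([CLI, "clusters", "list"],
                            capture_output=True, text=True, check=True)
    return result.stdout


def parse_fixed_width(output: str) -> list:
    """Parse a fixed-width table whose second line is a row of dash groups.

    ASSUMPTION: each run of dashes marks one column and values start inside
    their column; a leading "$ command" prompt line is ignored.
    """
    lines = [line for line in output.splitlines()
             if line.strip() and not line.lstrip().startswith("$")]
    if len(lines) < 2:
        return []
    header, dashes, rows = lines[0], lines[1], lines[2:]
    starts = [m.start() for m in re.finditer(r"-+", dashes)]

    def cut(line: str) -> list:
        # A column runs from its own start to the next column's start;
        # the last column runs to the end of the line.
        bounds = starts[1:] + [len(line)]
        return [line[a:b].strip() for a, b in zip(starts, bounds)]

    names = cut(header)
    return [dict(zip(names, cut(row))) for row in rows]


def unhealthy_clusters(output: str) -> list:
    """Return (id, name, statusMessage) for each cluster that is not fully up."""
    bad = []
    for row in parse_fixed_width(output):
        running, _, total = row.get("numRunningServices", "").partition("/")
        healthy = (row.get("statusMessage") == "All services running."
                   and running != "" and running == total)
        if not healthy:
            bad.append((row.get("id"), row.get("name"), row.get("statusMessage")))
    return bad
```

On the DM node, `unhealthy_clusters(clusters_list())` returns an empty list when every cluster reports all services running; the ids it does return are the ones to pass to the commands below.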
If statusMessage shows anything other than "All services running.", investigate further:
- Identify the host:port of each CDAS service.
/opt/cerebro/cerebro_cli clusters endpoints <id>
- Determine the EC2 instances that constitute the cluster.
/opt/cerebro/cerebro_cli clusters nodes <id>
- Note: One of these nodes does not run any ODAS services, because it is the Kubernetes master. To confirm the identity of this node, run the following command against the node that runs no services. Only the master node reports "isMaster": true.
/opt/cerebro/cerebro_cli agent <node_ip_address> kubernetes-info
- If you suspect Kubernetes is at fault, refer to the following link for troubleshooting advice.
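The commands above can be wrapped in Python to find the Kubernetes master without checking each node by hand. This sketch assumes you have copied the node IP addresses from the clusters nodes output yourself (its exact format is not shown here) and that kubernetes-info prints the literal text "isMaster": true for the master; run_cli and find_master are hypothetical helpers.

```python
import re
import subprocess

CLI = "/opt/cerebro/cerebro_cli"  # CLI path used throughout this article

# Matches "isMaster": true, tolerating whitespace around the colon.
IS_MASTER = re.compile(r'"isMaster"\s*:\s*true\b')


def run_cli(*args: str) -> str:
    """Run cerebro_cli with the given arguments and return its stdout."""
    result = subprocess.run([CLI, *args], capture_output=True, text=True, check=True)
    return result.stdout


def find_master(node_ips, runner=run_cli) -> list:
    """Return the IPs whose kubernetes-info output reports "isMaster": true."""
    return [ip for ip in node_ips
            if IS_MASTER.search(runner("agent", ip, "kubernetes-info"))]
```

The same helper covers the other two steps, for example `run_cli("clusters", "endpoints", "36")` and `run_cli("clusters", "nodes", "36")` for the cluster with id 36. `find_master` should return exactly one IP; an empty list or several IPs is worth a closer look at the cluster's Kubernetes state.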
ODAS Cluster - Service Containers
Logs: in service containers. See below.
If the service is running, point your browser to the WebUI endpoint of the problematic service. The WebUI shows information about the host, the OS, and the process, and provides access to the configuration, logs, and service metrics.
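Before opening a browser, you can confirm from a shell that the WebUI answers at all. This Python sketch assumes the WebUI is served over plain HTTP at a host:port you have taken from clusters endpoints <id>; the URL scheme and the webui_reachable helper are assumptions, not part of the ODAS tooling.

```python
import urllib.request


def webui_reachable(url: str, timeout: float = 5.0) -> bool:
    """Return True if the WebUI answers an HTTP GET with a non-error status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status < 400
    except OSError:
        # URLError, HTTPError (4xx/5xx), refused connections and timeouts
        # all derive from OSError.
        return False
```

A False result for a service that `clusters list` reports as running suggests a networking or port problem rather than a crashed process.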
To collect this information manually:
- Log in to the Kubernetes master node.
- SSH to a host running the failing service. By default, there is one planner service (running on only one node) and multiple workers (one per cluster node).
- Run the command docker images. Ensure that the service Docker image has loaded successfully. For example, a CDAS Docker image has a name like "cerebro/cdas".
- Run docker ps and confirm that the service is up. Note that the Okera/Cerebro canary service (which runs on every node) is intended to test a basic container configuration.
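The docker images and docker ps checks can be scripted as below. This Python sketch uses the standard --format option of both commands to list image repositories and the images of running containers, then looks for the "cerebro/cdas" prefix mentioned above; the actual image names on your nodes may differ, and check_node is a hypothetical helper.

```python
import subprocess


def docker_lines(*args: str) -> list:
    """Run a docker CLI command and return its non-empty output lines."""
    result = subprocess.run(["docker", *args],
                            capture_output=True, text=True, check=True)
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def check_node(prefix: str = "cerebro/cdas", runner=docker_lines) -> dict:
    """Report whether an image with this prefix is present and running."""
    # {{.Repository}} and {{.Image}} are Go-template fields accepted by
    # `docker images --format` and `docker ps --format` respectively.
    repositories = runner("images", "--format", "{{.Repository}}")
    running_images = runner("ps", "--format", "{{.Image}}")
    return {
        "image_loaded": any(r.startswith(prefix) for r in repositories),
        "container_running": any(i.startswith(prefix) for i in running_images),
    }
```

Run it on the node itself, after the SSH step. An image that is loaded but not running points at a container that failed to start rather than an image that never arrived.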