Troubleshooting gadmin start exe / gadmin start all failures caused by the executor failing to start

Problem

The executor is a core infrastructure service that runs on all nodes, much like a daemon. Troubleshooting the executor generally starts with making sure the machine itself is online and has network connectivity. After that, verify that the other nodes in the cluster can ssh to it, because the executor service is started via ssh. These troubleshooting steps apply to all 3.x versions.

Error message

gadmin start exe
[   Info] Starting EXE
[  Error] ExternalError (Failed to start executor(s); Timeout waiting executor at 10.10.10.10 to start; Process exited with status 1)

Diagnosis

Useful information to learn:

1. Which node the executor is not starting on. In the example above, it is 10.10.10.10.
2. Whether that node is the machine the command was run on or a different one.
3. Whether the error names just one IP address or several. Several IP addresses indicate that the executor is failing to start on multiple nodes.
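
When the error output is long, the affected addresses can be pulled out of it mechanically. The following is a minimal sketch that assumes the error text follows the same "Timeout waiting executor at <IP>" wording as the sample above; failed_executor_ips is a hypothetical helper name, not a gadmin feature.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each unique IPv4 address that appears after
# "executor at" in text read from stdin. Assumes the message format matches
# the sample error shown in this article.
failed_executor_ips() {
  grep -oE 'executor at [0-9]+(\.[0-9]+){3}' | awk '{print $3}' | sort -u
}

# Usage:
#   gadmin start exe 2>&1 | failed_executor_ips
```

One address in the output points at a single node; several addresses suggest a problem that affects multiple nodes.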

Check the connectivity of the machine(s) identified in the step above:

1. Ping the machine.
2. Use netcat to check whether port 22 is open on that machine: nc -zv 10.10.10.10 22
3. Use netcat to check whether port 9177 is open on that machine: nc -zv 10.10.10.10 9177
4. If you can access one of the machines where the executor start is failing, run the same checks from that machine against the one where you originally encountered the error.
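
The checks above can be wrapped in a small function and repeated for each affected node. This is only a sketch: check_node is a hypothetical name, the ping options assume the Linux iputils ping (where -W is a timeout in seconds), and the 3-second netcat timeout is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Hypothetical helper: ping a host, then probe ports 22 (ssh) and 9177,
# the two ports named in the steps above.
check_node() {
  local host="$1" port
  if ping -c 1 -W 2 "$host" >/dev/null 2>&1; then
    echo "$host ping: ok"
  else
    echo "$host ping: FAILED"
  fi
  for port in 22 9177; do
    # -z only tests whether the port accepts a connection; -w sets the timeout.
    if nc -z -w 3 "$host" "$port" >/dev/null 2>&1; then
      echo "$host port $port: open"
    else
      echo "$host port $port: CLOSED or filtered"
    fi
  done
}

# Usage:
#   check_node 10.10.10.10
```

A failed ping does not always mean the host is down, since some networks block ICMP, so read the port results alongside it.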

Verify ssh connectivity to the machine:

ssh -i ~/.ssh/tigergraph_rsa tigergraph@10.10.10.10
Note: This command assumes the default settings. If you have changed the ssh port or key on your server, you will need to modify the command accordingly. Use gssh to determine your ssh configuration.
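
For a quick pass/fail answer without an interactive login, the ssh check can be run non-interactively. The sketch below assumes OpenSSH; check_ssh is a hypothetical helper, the key path and port default to the values used in this article, and the 5-second connect timeout is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Hypothetical helper: returns 0 if key-based ssh login as the tigergraph
# user succeeds, non-zero otherwise.
# Arguments: host [key path] [ssh port]
check_ssh() {
  local host="$1" key="${2:-$HOME/.ssh/tigergraph_rsa}" port="${3:-22}"
  # BatchMode=yes makes ssh fail instead of prompting for a password;
  # "true" is a no-op remote command, so only the login itself is tested.
  ssh -i "$key" -p "$port" -o BatchMode=yes -o ConnectTimeout=5 "tigergraph@$host" true
}

# Usage:
#   check_ssh 10.10.10.10 && echo "ssh ok" || echo "ssh FAILED"
```

Running the same check from another cluster node toward the failing one is the more relevant test, since that is the direction in which the executor is started.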

Workaround

If the affected node cannot be recovered, the following procedures may apply:

Remove failed node: https://docs.tigergraph.com/tigergraph-server/current/ha/remove-failed-node
Cluster expansion: https://docs.tigergraph.com/tigergraph-server/current/cluster-resizing/expand-a-cluster

Solution

There is no generic solution to this issue. The diagnosis steps should provide enough information to determine which fix is needed. If a firewall is blocking port 22 or port 9177, open those ports between the cluster nodes. If ssh is failing, troubleshoot the ssh connectivity. For examples, see: https://docs.digitalocean.com/support/how-to-troubleshoot-ssh-connectivity-issues/