It’s useful to know the component parts of a problem. Not just their nature, but their number. If there’s a component you don’t know about, you may spend excessive time on fruitless attempts to solve other parts that in reality rely upon the undiscovered aspect.
Last week I experienced this while working on an OpenStack setup. Instances were to be launched by a remote glideinWMS server, using HTCondor's EC2 interface. When invoked manually through a simple job description, instances did start on the cloud controller node. When invoked by glideinWMS, they failed. The remote end logged an HTTP 414 return code, indicating that the request URI was too long, but nothing was logged at the OpenStack end relating to the launching of instances. (Other requests from glideinWMS were succeeding, so it wasn't a simple connectivity problem.)
At first I thought it might be a quota problem, but increasing the defaults had no effect. What was more puzzling was the complete lack of any local trace of the launch request. Eventually I found this OpenStack bug, which looked like it might explain that. Talking it over with two of the incredibly helpful glideinWMS developers, I found some WSGI code with an internally specified limit on the length of incoming requests. It turned out this code was included in OpenStack; when I raised the limit and restarted the Nova API service, the launch requests coming from the remote site started working immediately. The bug was reported and a fix proposed (again, by a glideinWMS developer).
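To see why nothing appeared in the local logs, it helps to picture where such a limit sits. The sketch below is a hypothetical WSGI wrapper, not OpenStack's actual code: the name `limit_uri_length` and the 8192-byte default are illustrative assumptions. The point it demonstrates is that an over-long request is rejected with 414 before the wrapped application ever runs, so the application's own logging never fires.

```python
def limit_uri_length(app, max_uri_length=8192):
    """Wrap a WSGI app, rejecting over-long request URIs with 414.

    Hypothetical illustration; the parameter name and default value
    are assumptions, not taken from OpenStack's source.
    """
    def middleware(environ, start_response):
        uri = environ.get('PATH_INFO', '') + '?' + environ.get('QUERY_STRING', '')
        if len(uri) > max_uri_length:
            # The request never reaches the wrapped app, so the app's
            # own logging records nothing about it -- matching the
            # silence seen on the OpenStack side.
            start_response('414 Request-URI Too Long',
                           [('Content-Type', 'text/plain')])
            return [b'Request-URI Too Long\n']
        return app(environ, start_response)
    return middleware
```

With a guard like this in the server layer, raising the limit (and restarting the service so the new value takes effect) is exactly the kind of change that makes previously rejected requests start succeeding.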
The whole process was very frustrating until it clicked that I was dealing with two problems, one of which was unhelpfully obscuring the evidence needed to solve the other.