Architect Flow Timeout at 200 Concurrent Users

Dealing with a very strange bug here with Architect flow execution during load tests. Using JMeter to simulate 200 concurrent inbound calls. The flow triggers a simple data lookup via API. At 100 users, it works fine. At 200, 30% of calls get a 504 Gateway Timeout on the API request node. Environment is us-east-1. Checked rate limits, looks okay. What is the correct way to handle API latency spikes in Architect flows under high concurrency?

Check your request_timeout and connection_pool_size settings within the API integration node configuration. The 504 Gateway Timeout is rarely a pure network latency issue at this scale; it is almost always a resource contention problem where the Architect flow engine cannot acquire a connection from the pool fast enough to handle the burst of 200 concurrent requests.

In my experience managing high-volume BYOC trunks, the default connection pooling in Architect is quite conservative. When you hit 200 concurrent users, the queue for available HTTP connections fills up faster than the requests can complete, leading to immediate timeouts before the upstream API even processes the load.
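To make the pool exhaustion concrete, here is a rough back-of-the-envelope sketch in Python. The pool size of 20, the 1-second service time, and the 5-second timeout are made-up numbers for illustration, not Architect defaults, and the model assumes the whole burst arrives at once and is served first-come, first-served.

```python
import math


def worst_case_wait(concurrent_requests: int, pool_size: int, service_time_s: float) -> float:
    """Seconds the last request in a burst waits for a free connection.

    Assumes every request arrives at the same instant and each one holds
    a connection for exactly service_time_s.
    """
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    # Full "waves" of requests that must finish before the last one starts.
    waves_ahead = math.ceil(concurrent_requests / pool_size) - 1
    return waves_ahead * service_time_s


def timed_out_fraction(concurrent_requests: int, pool_size: int,
                       service_time_s: float, timeout_s: float) -> float:
    """Fraction of the burst whose queue wait plus service time exceeds timeout_s."""
    late = 0
    for i in range(concurrent_requests):
        wave = i // pool_size                   # 0-based wave this request runs in
        finish = (wave + 1) * service_time_s    # when its response comes back
        if finish > timeout_s:
            late += 1
    return late / concurrent_requests


# Hypothetical numbers: pool of 20, 1 s per lookup, 5 s timeout.
print(timed_out_fraction(100, 20, 1.0, 5.0))  # 0.0
print(timed_out_fraction(200, 20, 1.0, 5.0))  # 0.5
```

With these assumed values the same pool that copes with 100 callers starts timing out once the burst doubles, even though the upstream API is no slower per request.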

You need to adjust the max_connections_per_host parameter in your integration profile, which is the pool-size setting I mean above. Increase it significantly, perhaps to 50 or 100, depending on what your upstream API’s rate limits can absorb. Also set request_timeout high enough to cover your API’s 99th-percentile response time under load, plus a buffer for network jitter. A setting of 5000ms is often too aggressive for complex data lookups under load.
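One way to pick the timeout is to derive it from latencies measured during the load test instead of guessing. The sketch below is a minimal Python example; the sample latencies and the 500 ms jitter buffer are assumptions for illustration, so substitute numbers from your own JMeter results.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def suggested_timeout_ms(latencies_ms, pct=99, jitter_buffer_ms=500):
    """Timeout = chosen latency percentile plus a fixed buffer for network jitter."""
    return int(percentile(latencies_ms, pct) + jitter_buffer_ms)


# Hypothetical response times (ms) collected from a load test run.
observed = [120, 150, 180, 200, 250, 300, 400, 900, 1500, 4200]
print(suggested_timeout_ms(observed))  # 4700
```

With only a handful of samples the 99th percentile is just the maximum, so feed it the full result set from the test run.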

Here is a sample configuration snippet for your integration profile to illustrate the necessary adjustments:

{
  "integration_profile": {
    "name": "HighConcurrencyAPI",
    "connection_pool": {
      "max_connections_per_host": 100,
      "keep_alive_timeout": 30000
    },
    "request_settings": {
      "request_timeout": 10000,
      "retry_policy": {
        "max_retries": 2,
        "retry_interval": 1000
      }
    }
  }
}

Also, verify that your upstream API supports persistent connections. If it drops connections frequently, the overhead of a new TLS handshake for each of the 200 calls will make the timeouts worse. A retry policy with exponential backoff, rather than the fixed retry_interval shown in the snippet, can help absorb transient failures during these spikes, so that a call is dropped only after its retries are exhausted instead of on the first premature timeout. This should make flow execution noticeably more stable during peak load tests.
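For reference, exponential backoff with jitter can be sketched in a few lines of Python. This is a generic illustration of the technique, not Architect's retry implementation; the function names, the 1-second base delay, and the 10-second cap are all assumptions.

```python
import random
import time


def backoff_delays(base_s=1.0, factor=2.0, max_retries=2, cap_s=10.0):
    """Upper bound on the delay before each retry: base, base*factor, ... capped at cap_s."""
    return [min(cap_s, base_s * factor ** attempt) for attempt in range(max_retries)]


def call_with_backoff(request, max_retries=2, base_s=1.0, factor=2.0, cap_s=10.0,
                      sleep=time.sleep, rng=random.random):
    """Call request(); on TimeoutError, retry up to max_retries times with backoff."""
    delays = backoff_delays(base_s, factor, max_retries, cap_s)
    for attempt in range(max_retries + 1):
        try:
            return request()
        except TimeoutError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the failure to the caller
            # "Full jitter": sleep a random fraction of the backoff delay so
            # retries from many concurrent calls do not all land together.
            sleep(delays[attempt] * rng())
```

The jitter matters under load: if 60 calls time out together and all retry after exactly one second, they hit the pool as a second burst.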

This is a standard resource contention issue, similar to what we see when bulk export jobs hit S3 bucket limits during legal discovery syncs. The suggestion above regarding connection_pool_size is spot on. When you push 200 concurrent requests, the default pool often exhausts available handles before the downstream API can respond, causing the Architect engine to drop the connection and return that 504. It is not necessarily about network latency, but about how the flow engine manages state for simultaneous execution paths.

To mitigate this, adjust the API Integration Node settings directly. Increase connection_pool_size to somewhere in the 50-100 range for this specific node, rather than relying on the global default. Also, set request_timeout_ms above your API’s P95 latency under load, perhaps 8-10 seconds, to prevent premature drops. If the target API supports it, enable keep-alive to reuse TCP connections and reduce handshake overhead.

{
  "node_id": "api_lookup_01",
  "configuration": {
    "connection_pool_size": 100,
    "request_timeout_ms": 10000,
    "retry_on_timeout": true,
    "max_retries": 2
  }
}

In my experience with high-volume data exports, adding simple retry logic with exponential backoff helps smooth out these bursts. If the first attempt fails due to a transient timeout, a quick retry often succeeds once a connection slot frees up. This preserves the chain of custody for the call data by ensuring the lookup completes, even under load.

Monitor the node execution logs for any 502 or 504 errors after applying these changes. If timeouts persist, consider offloading the heavy API calls to a webhook or a queue-based system, which decouples flow execution from the latency of the external dependency. This approach is much more stable for legal hold scenarios where data integrity is paramount.
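As a rough illustration of the queue-based idea, the Python sketch below pushes lookups onto a queue and drains it with a fixed number of workers, so the upstream API never sees more than `workers` requests at once regardless of burst size. Everything here is hypothetical: `run_lookups` and the `lookup` callable stand in for whatever service would sit between the flow and the API.

```python
import queue
import threading


def run_lookups(keys, lookup, workers=4):
    """Run lookup(key) for every key through a fixed-size worker pool fed by a queue."""
    jobs = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            key = jobs.get()
            if key is None:          # sentinel: no more work for this worker
                jobs.task_done()
                return
            try:
                value = lookup(key)
            except Exception as exc:
                value = exc          # record the failure instead of killing the worker
            with lock:
                results[key] = value
            jobs.task_done()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for key in keys:
        jobs.put(key)
    for _ in threads:
        jobs.put(None)               # one sentinel per worker
    jobs.join()
    for t in threads:
        t.join()
    return results
```

The trade-off is that the flow no longer gets the answer inline; it has to pick up the result later, for example through a callback, which is the decoupling described above.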