A trading bot is only as good as the infrastructure running it. You can have the most sophisticated ML models in the world, but if your server crashes at 2 AM and misses four hours of market data, your models are flying blind when the market opens. This week I focused on building the operational infrastructure that keeps the Quantum AI Trading Bot running 24/7 for paper trading — and the engineering challenges were more interesting than I expected.
The Infrastructure Stack
The bot runs on a Google Cloud Platform VM, which provides the reliability and network performance needed for consistent market data access. The stack looks like this:
- Compute: GCP e2-standard-4 (4 vCPU, 16GB RAM) — sufficient for the current scale of 289 symbols
- OS: Ubuntu 22.04 LTS with automatic security updates
- Process Manager: systemd for service lifecycle management
- Data Store: Local SSD for the time-series ring buffer, with hourly snapshots to GCS
- Monitoring: Custom health checks exposed via a lightweight HTTP endpoint
- Alerting: Webhook-based alerts to a private notification channel
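systemd is what keeps the bot process alive across crashes and VM reboots. A minimal unit file might look like this — a sketch, where the paths, service user, and module name are illustrative assumptions, not the actual deployment:

```ini
# /etc/systemd/system/trading-bot.service — hypothetical paths and user
[Unit]
Description=Quantum AI Trading Bot
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=trader
WorkingDirectory=/opt/trading-bot
ExecStart=/opt/trading-bot/venv/bin/python -m bot.main
# Restart automatically on crashes, with a short cooldown
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```

`Restart=on-failure` handles the common case; the watchdog described below covers the failures systemd can't see.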
The Interactive Brokers Gateway sits at the center of this stack. It's the bridge between the bot and the market — providing real-time data feeds and paper trading order execution through IB's TWS API.
The IB Gateway Challenge
Interactive Brokers' Gateway application is powerful but temperamental. It requires a running GUI session (even in headless mode), needs daily authentication refreshes, and has a tendency to drop connections under certain network conditions. Building reliable automation around it required solving several problems:
Headless GUI. The IB Gateway requires an X11 display even when running on a headless server. I use Xvfb (X virtual framebuffer) to provide a virtual display, with a wrapper script that starts Xvfb before launching the Gateway:

```bash
#!/bin/bash
# Point the Gateway at a virtual X display, start Xvfb on it, then launch.
export DISPLAY=:99
Xvfb :99 -screen 0 1024x768x24 &
sleep 2  # give Xvfb a moment to come up before the Gateway tries to connect
/opt/ibgateway/ibgateway &
```
Authentication. IB requires periodic authentication, and their two-factor auth can't be fully automated (by design — it's a security measure). I implemented a semi-automated flow: the bot detects when authentication is needed, sends a notification, and waits for manual 2FA approval. Once authenticated, sessions typically last 24-48 hours before requiring renewal.
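That semi-automated flow can be sketched as a detect-notify-wait loop. Here `is_authenticated` and `send_webhook` are hypothetical stand-ins for the real IB session check and the alert channel — this is a sketch of the shape of the flow, not the production code:

```python
import asyncio

async def ensure_authenticated(is_authenticated, send_webhook,
                               poll_interval=30, timeout=1800):
    """Detect a lost session, notify, then wait for manual 2FA approval.

    is_authenticated and send_webhook are injected stand-ins for the
    real IB session check and webhook notifier (both hypothetical here).
    Returns True once the session is restored, False on timeout.
    """
    if await is_authenticated():
        return True
    await send_webhook("IB Gateway needs re-authentication (2FA approval required)")
    waited = 0.0
    while waited < timeout:
        # Poll until the human approves the 2FA prompt on their device
        await asyncio.sleep(poll_interval)
        waited += poll_interval
        if await is_authenticated():
            return True
    return False
```

The key design point is that the human is only in the loop for the 2FA tap itself; detection, notification, and resumption are all automatic.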
Connection Resilience. The TWS API connection drops occasionally — sometimes due to IB server maintenance, sometimes due to network blips. The bot implements exponential backoff reconnection with jitter:
```python
import asyncio
import logging
import random

logger = logging.getLogger(__name__)

class ConnectionManager:
    def __init__(self, max_retries=10, base_delay=1.0):
        self.max_retries = max_retries
        self.base_delay = base_delay
        self.consecutive_failures = 0

    async def connect_with_retry(self):
        for attempt in range(self.max_retries):
            try:
                await self.connect()  # actual TWS API connection, defined elsewhere
                self.consecutive_failures = 0
                return True
            except ConnectionError:
                # Exponential backoff with jitter, so simultaneous reconnects
                # don't hammer IB's servers in lockstep
                delay = self.base_delay * (2 ** attempt) + random.uniform(0, 1)
                logger.warning(f"Connection attempt {attempt + 1} failed, retrying in {delay:.1f}s")
                await asyncio.sleep(delay)
        return False
```
Health Monitoring
The bot exposes a health endpoint that reports the status of every component:
```json
{
  "status": "healthy",
  "uptime_hours": 142.3,
  "components": {
    "ib_gateway": {"status": "connected", "last_heartbeat": "2s ago"},
    "data_pipeline": {"status": "running", "symbols_active": 289, "data_freshness": "1.2s"},
    "ml_inference": {"status": "running", "predictions_today": 2847},
    "risk_manager": {"status": "active", "gates_triggered_today": 34},
    "paper_portfolio": {"value": 1043267.50, "positions": 12, "daily_pnl": "+0.34%"}
  },
  "alerts": []
}
```
A watchdog process polls this endpoint every 30 seconds. If any component reports unhealthy status for more than 5 minutes, the watchdog attempts automatic recovery in a defined sequence:
1. Restart the failed component
2. If the component fails to recover, restart the IB Gateway connection
3. If the Gateway fails to recover, restart the entire application
4. If the application fails, reboot the VM and send an alert
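The escalation ladder above reduces to trying an ordered list of recovery actions until a health check passes. A minimal sketch, where the action functions are hypothetical stand-ins for the real restart logic:

```python
def run_escalation(actions, is_healthy):
    """Try recovery actions in order until the health check passes.

    actions: list of (name, fn) pairs, ordered from least to most
    disruptive -- e.g. restart_component, restart_gateway, restart_app,
    reboot_vm (all stand-ins here, not the real implementations).
    Returns the name of the action that restored health, or None if
    the whole ladder was exhausted.
    """
    for name, action in actions:
        action()
        if is_healthy():
            return name
    return None
```

Ordering the actions from least to most disruptive means the common failures (a dropped Gateway connection) never trigger the expensive remedies (a VM reboot).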
This escalation ladder handles most failure modes automatically. In the first month of operation, the watchdog resolved 14 incidents without any manual intervention — mostly IB Gateway connection drops that recovered after a reconnect.
Data Integrity Under Failure
The trickiest part of building resilient infrastructure isn't restarting processes — it's ensuring data integrity across restarts. When the bot restarts, it needs to:
1. Recover the current paper portfolio state (positions, orders, cash balance)
2. Reconstruct the feature engineering state (rolling statistics, technical indicators)
3. Reload the ML model weights and verify they match the expected version
4. Resume data collection without creating gaps in the time series
I solved this with a checkpoint system that writes the full application state to disk every 60 seconds. On restart, the bot loads the latest checkpoint, fast-forwards through any missed data from the buffered data store, and resumes normal operation. The gap between the last checkpoint and the restart is typically under 2 minutes, and the fast-forward process handles this gracefully.
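The core of any such checkpoint system is making the write atomic, so a crash mid-write never leaves a torn file. A minimal sketch, assuming the application state is JSON-serializable (the real state format is richer than this):

```python
import json
import os
import tempfile

def write_checkpoint(state, path):
    """Atomically persist state: write a temp file, then rename over the target."""
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # make sure bytes hit disk before the rename
        os.replace(tmp_path, path)  # atomic on POSIX filesystems
    except BaseException:
        os.unlink(tmp_path)
        raise

def load_checkpoint(path):
    """Load the latest checkpoint, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)
```

Because the rename is atomic, a reader always sees either the previous complete checkpoint or the new one — never a half-written file.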
Deployment Without Downtime
Deploying new code to a running trading system — even a paper trading one — requires care. I implemented a blue-green deployment pattern:
1. The new version starts alongside the old version, connecting to the data pipeline but not generating orders
2. The new version runs in shadow mode for 30 minutes, generating signals that are logged but not executed
3. Shadow mode signals are compared against the live version's signals to detect anomalies
4. If the signals are consistent (within expected variance), traffic switches to the new version
5. The old version enters a 1-hour grace period before shutdown
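The consistency check in the steps above can be sketched as a per-symbol agreement test. The signal format (symbol → score) and the tolerance values here are illustrative assumptions, not the actual anomaly-detection logic:

```python
def signals_consistent(live, shadow, tolerance=0.05, min_agreement=0.95):
    """Compare per-symbol signals from the live and shadow versions.

    live, shadow: dicts mapping symbol -> signal value (illustrative
    format). Returns True if enough common symbols agree within the
    tolerance -- i.e. the new version is safe to promote.
    """
    common = set(live) & set(shadow)
    if not common:
        return False
    agree = sum(1 for s in common if abs(live[s] - shadow[s]) <= tolerance)
    return agree / len(common) >= min_agreement
```

A threshold below 100% agreement is deliberate: small numerical differences between versions are expected, and the check should only block promotion when the new version genuinely disagrees with the old one.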
This process is overkill for paper trading, but it's exactly the kind of discipline I want to build into the system from day one. The same deployment pattern, adapted appropriately, is used in the UAPK Gateway for rolling out policy updates to AI governance rules — where a bad deployment could cause real harm to the businesses governed by the framework.
Cost Optimization
Running a GCP VM 24/7 isn't free, and I've been deliberate about keeping costs manageable for what is fundamentally a research project:
- Spot instances for the training pipeline (saves ~60% on compute)
- Committed use discounts for the always-on inference VM (saves ~30%)
- Lifecycle policies on GCS storage to automatically archive data older than 90 days
- Right-sizing the VM quarterly based on actual CPU and memory utilization
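The 90-day archival rule can be expressed as a GCS lifecycle configuration applied to the bucket. A sketch — the target storage class is an assumption (ARCHIVE is the cheapest tier for rarely-read data):

```json
{
  "rule": [
    {
      "action": {"type": "SetStorageClass", "storageClass": "ARCHIVE"},
      "condition": {"age": 90}
    }
  ]
}
```

Once set on the bucket, GCS applies the transition automatically; no cron jobs or manual sweeps are needed.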
Current monthly infrastructure cost: approximately $85. That includes compute, storage, network egress, and data provider API fees. Not trivial, but reasonable for the learning value and the engineering portfolio it builds.
Lessons Learned
Automation is an investment. The first month of building automation felt slow compared to just SSHing in and restarting things manually. By the second month, the automation had saved me more time than I'd spent building it. By the third month, I'd stopped thinking about infrastructure entirely and could focus on ML research.

Failure modes are creative. I prepared for server crashes, network outages, and API errors. I didn't prepare for the IB Gateway silently dropping a subset of data subscriptions while reporting healthy status. Now I have a data completeness check that catches this specific failure mode.

Monitor everything, alert selectively. I log hundreds of metrics but only alert on actionable conditions. An alert that fires too often gets ignored. An alert that never fires isn't doing its job. Finding the right sensitivity took several iterations.

What's Next
The infrastructure is stable and largely self-managing. The next focus is on the ML pipeline improvements identified by the backtesting analysis — particularly regime detection and improved bear market performance. Having reliable infrastructure means I can iterate on the models quickly without worrying about operational issues disrupting experiments. The bot will continue running in paper trading mode, accumulating data and refining its predictions one trade at a time.