Build a Multi-Region AI Inference Pipeline with Global Infrastructure
What You'll Build
By the end of this tutorial, you'll have a working multi-region inference pipeline that intelligently routes requests between cloud regions. Your FastAPI service will select the optimal region based on real-time latency measurements and cost optimization, with automatic failover when regions become unavailable.
This architecture matters because compute pricing varies across regions. You'll build a system that balances cost differences with latency requirements. The tutorial uses local mock servers for testing before deploying to real cloud infrastructure, so you can validate the routing logic without incurring cloud costs during development.
Prerequisites
- Python 3.10+ installed locally (python --version to check)
- pip package manager
- curl for testing endpoints
- jq for parsing JSON responses: brew install jq (macOS) or apt-get install jq (Ubuntu)
- AWS account (optional, for the real deployment in Step 8 only)
- AWS CLI v2 (optional for Step 8): aws --version
- Terraform 1.6+ (optional for Step 8): terraform --version
- Basic understanding of REST APIs and async Python
- Estimated time: 60-90 minutes (Steps 1-7), additional 30-60 minutes for Step 8 if deploying to cloud
Install core Python dependencies:
pip install fastapi==0.104.1 uvicorn==0.24.0 httpx==0.25.0 pydantic==2.5.0
Expected output:
Successfully installed fastapi-0.104.1 uvicorn-0.24.0 httpx-0.25.0 pydantic-2.5.0
Step-by-Step Instructions
Step 1: Set Up the Project Structure
Create the project directory and file structure:
mkdir multi-region-inference
cd multi-region-inference
mkdir -p src terraform
touch src/__init__.py src/main.py src/region_router.py src/region_config.py
Verify the structure:
tree -L 2
Expected output:
.
├── src
│   ├── __init__.py
│   ├── main.py
│   ├── region_config.py
│   └── region_router.py
└── terraform
What just happened: You've created a standard Python project layout. The src directory contains your application code, and terraform will hold infrastructure-as-code files if you deploy to real cloud regions in Step 8.
Step 2: Create the Region Configuration
Create src/region_config.py:
import os
from typing import Dict, List

from pydantic import BaseModel


class RegionEndpoint(BaseModel):
    """Configuration for a single region endpoint"""
    name: str
    endpoint: str
    cost_per_1k_tokens: float  # Cost in USD
    priority: int  # Lower number = higher priority
    provider: str  # "aws" or "azure"


# Region configurations with representative cost values
REGIONS: Dict[str, RegionEndpoint] = {
    "us-east-1": RegionEndpoint(
        name="us-east-1",
        endpoint=os.getenv("AWS_US_EAST_ENDPOINT", "http://localhost:8001"),
        cost_per_1k_tokens=0.50,
        priority=3,
        provider="aws"
    ),
    "me-south-1": RegionEndpoint(
        name="me-south-1",  # AWS Bahrain - Middle East
        endpoint=os.getenv("AWS_ME_SOUTH_ENDPOINT", "http://localhost:8002"),
        cost_per_1k_tokens=0.35,
        priority=1,
        provider="aws"
    ),
    "ap-south-1": RegionEndpoint(
        name="ap-south-1",  # AWS Mumbai - South Asia
        endpoint=os.getenv("AWS_AP_SOUTH_ENDPOINT", "http://localhost:8003"),
        cost_per_1k_tokens=0.32,
        priority=1,
        provider="aws"
    ),
    "brazilsouth": RegionEndpoint(
        name="brazilsouth",  # Azure Brazil
        endpoint=os.getenv("AZURE_BRAZIL_ENDPOINT", "http://localhost:8004"),
        cost_per_1k_tokens=0.38,
        priority=2,
        provider="azure"
    ),
}


def get_sorted_regions() -> List[RegionEndpoint]:
    """Returns regions sorted by priority, then by cost"""
    return sorted(
        REGIONS.values(),
        key=lambda x: (x.priority, x.cost_per_1k_tokens)
    )
What just happened: You've defined configurations for four regions with different cost profiles. The priority system favors lower-cost regions (me-south-1 and ap-south-1) while maintaining fallback options. Endpoints default to localhost for local testing but can be overridden with environment variables for production deployment.
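To sanity-check the sort order, you can run a throwaway snippet from a Python REPL started in the project root (not part of the app, just a quick check):

from src.region_config import get_sorted_regions

for r in get_sorted_regions():
    print(f"{r.name}: priority={r.priority}, cost=${r.cost_per_1k_tokens}/1k tokens")

# Expected order: ap-south-1, me-south-1 (both priority 1, tie broken by cost),
# then brazilsouth (priority 2), then us-east-1 (priority 3)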
Step 3: Build the Smart Region Router
Create src/region_router.py:
import asyncio
import time
from typing import Optional, Dict, Any

import httpx

from src.region_config import get_sorted_regions, RegionEndpoint


class RegionRouter:
    """Routes inference requests to optimal regions based on latency and cost"""

    def __init__(self, timeout: float = 5.0):
        self.timeout = timeout
        self.latency_cache: Dict[str, float] = {}
        self.failure_count: Dict[str, int] = {}

    async def measure_latency(self, region: RegionEndpoint) -> Optional[float]:
        """
        Ping region endpoint to measure actual latency.
        Returns latency in seconds, or None if region is unreachable.
        """
        try:
            start = time.time()
            async with httpx.AsyncClient(timeout=self.timeout) as client:
                response = await client.get(f"{region.endpoint}/health")
                if response.status_code == 200:
                    latency = time.time() - start
                    self.latency_cache[region.name] = latency
                    # Clear the streak on success so the count tracks
                    # *consecutive* failures, as the selection logic expects
                    self.failure_count.pop(region.name, None)
                    return latency
        except Exception as e:
            print(f"Failed to reach {region.name}: {e}")
            self.failure_count[region.name] = self.failure_count.get(region.name, 0) + 1
            # Drop any stale cached latency so an unreachable region is
            # filtered out immediately instead of being selected again
            self.latency_cache.pop(region.name, None)
        return None

    async def select_best_region(self) -> Optional[RegionEndpoint]:
        """
        Select optimal region based on:
        1. Priority (cost tier)
        2. Measured latency
        3. Failure history (exclude regions with 3+ consecutive failures)
        """
        regions = get_sorted_regions()

        # Measure latency for all regions concurrently
        latency_tasks = [self.measure_latency(r) for r in regions]
        await asyncio.gather(*latency_tasks)

        # Filter out unhealthy regions
        available_regions = [
            r for r in regions
            if self.failure_count.get(r.name, 0) < 3
            and self.latency_cache.get(r.name) is not None
        ]

        if not available_regions:
            print("WARNING: No healthy regions available!")
            return None

        # Calculate weighted score: priority + normalized latency
        def score_region(region: RegionEndpoint) -> float:
            latency = self.latency_cache.get(region.name, 999)
            # Lower score is better
            return region.priority + (latency * 10)

        best_region = min(available_regions, key=score_region)
        latency_ms = self.latency_cache.get(best_region.name, 0) * 1000
        print(f"Selected region: {best_region.name} "
              f"(latency: {latency_ms:.0f}ms, cost: ${best_region.cost_per_1k_tokens}/1k tokens)")
        return best_region

    async def route_inference(self, payload: Dict[str, Any]) -> Dict[str, Any]:
        """
        Route inference request to best available region.
        Returns response with metadata about selected region.
        """
        best_region = await self.select_best_region()
        if not best_region:
            raise Exception("No healthy regions available for inference")

        try:
            async with httpx.AsyncClient(timeout=30.0) as client:
                response = await client.post(
                    f"{best_region.endpoint}/v1/inference",
                    json=payload
                )
                response.raise_for_status()
                result = response.json()

            # Add routing metadata to response
            result["_metadata"] = {
                "region": best_region.name,
                "provider": best_region.provider,
                "cost_per_1k": best_region.cost_per_1k_tokens
            }
            return result
        except Exception as e:
            print(f"Inference failed on {best_region.name}: {e}")
            self.failure_count[best_region.name] = self.failure_count.get(best_region.name, 0) + 1
            raise
What just happened: This is the core routing engine. It measures real latency to each region using concurrent health checks, maintains a failure count to avoid repeatedly trying dead regions, and selects the optimal region using a weighted scoring system that balances cost priority with actual latency. The route_inference method handles the actual request forwarding and adds metadata about which region processed the request.
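To make the scoring concrete, here is the arithmetic with the simulated latencies you'll configure in Step 6 (illustrative values; real measurements will vary):

# Worked example of the score_region formula: score = priority + latency_seconds * 10
# (lower wins). Latencies below match the simulated values used in Step 6.
candidates = {
    "ap-south-1":  (1, 0.051),   # 1 + 0.51 = 1.51  <- winner
    "me-south-1":  (1, 0.062),   # 1 + 0.62 = 1.62
    "brazilsouth": (2, 0.081),   # 2 + 0.81 = 2.81
    "us-east-1":   (3, 0.102),   # 3 + 1.02 = 4.02
}
for name, (priority, latency) in candidates.items():
    print(f"{name}: score = {priority + latency * 10:.2f}")

Because latency is multiplied by 10, every 100ms of latency is worth one priority tier: a priority-1 region has to be more than 100ms slower than a priority-2 region before the router prefers the costlier tier.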
Step 4: Create the FastAPI Application
Create src/main.py:
from typing import Dict, Any

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

from src.region_router import RegionRouter

app = FastAPI(
    title="Multi-Region Inference API",
    description="Intelligent routing for AI inference across global regions"
)
router = RegionRouter()


class InferenceRequest(BaseModel):
    """Request schema for inference endpoint"""
    prompt: str
    max_tokens: int = 100
    temperature: float = 0.7


class InferenceResponse(BaseModel):
    """Response schema with result and routing metadata"""
    result: str
    metadata: Dict[str, Any]


@app.get("/health")
async def health_check():
    """Health check endpoint for load balancers"""
    return {"status": "healthy", "service": "multi-region-router"}


@app.post("/v1/inference", response_model=InferenceResponse)
async def inference(request: InferenceRequest):
    """
    Main inference endpoint - automatically routes to optimal region.

    The router selects the best region based on cost and latency,
    then forwards the request and returns the result with metadata.
    """
    try:
        # model_dump() is the Pydantic v2 replacement for the deprecated .dict()
        payload = request.model_dump()
        result = await router.route_inference(payload)
        return InferenceResponse(
            result=result.get("output", ""),
            metadata=result.get("_metadata", {})
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


@app.get("/regions/status")
async def region_status():
    """
    Debug endpoint showing current region health metrics.
    Useful for monitoring and troubleshooting routing decisions.
    """
    return {
        "latency_cache": router.latency_cache,
        "failure_count": router.failure_count
    }


if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
What just happened: You've created a FastAPI application that wraps the region router. The /v1/inference endpoint accepts inference requests and automatically routes them to the optimal region. The /regions/status endpoint exposes internal routing metrics for debugging and monitoring.
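If you want a quick regression check that doesn't depend on the mock servers, a minimal smoke test with FastAPI's TestClient looks like this (a hypothetical test_main.py; assumes pip install pytest, then run pytest test_main.py from the project root):

# test_main.py - hypothetical minimal smoke test for the router app
from fastapi.testclient import TestClient

from src.main import app

client = TestClient(app)

def test_health():
    response = client.get("/health")
    assert response.status_code == 200
    assert response.json() == {"status": "healthy", "service": "multi-region-router"}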
Step 5: Create Mock Regional Endpoints for Testing
Create mock_regional_server.py in the project root directory:
"""
Mock regional inference server for local testing.
Simulates a real inference endpoint with configurable latency.
"""
import sys
import time
import random
from fastapi import FastAPI
import uvicorn
def create_mock_server(region_name: str, latency_ms: int):
"""Create a mock server that simulates a regional endpoint"""
app = FastAPI()
@app.get("/health")
async def health():
# Simulate network latency
time.sleep(latency_ms / 1000.0)
return {"status": "healthy", "region": region_name}
@app.post("/v1/inference")
async def inference(payload: dict):
# Simulate inference processing time
time.sleep(random.uniform(0.5, 1.5))
return {
"output": f"Mock response from {region_name}: {payload.get('prompt', '')[:50]}...",
"tokens_used": payload.get('max_tokens', 100)
}
return app
if __name__ == "__main__":
# Parse command line arguments
region = sys.argv[1] if len(sys.argv) > 1 else "us-east-1"
port = int(sys.argv[2]) if len(sys.argv) > 2 else 8001
latency = int(sys.argv[3]) if len(sys.argv) > 3 else 50
app = create_mock_server(region, latency)
print(f"Starting mock server for {region} on port {port} (simulated latency: {latency}ms)")
uvicorn.run(app, host="0.0.0.0", port=port, log_level="warning")
What just happened: This creates mock regional inference endpoints with configurable latency characteristics. Each mock server simulates a real regional deployment, allowing you to test the routing logic locally without deploying to actual cloud infrastructure. The latency parameter lets you simulate geographic distance.
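If you'd rather not juggle the five terminals used in the next step, a small launcher can start all four mocks as background subprocesses (a hypothetical run_mocks.py convenience script, sketched with the same port and latency assignments used below):

# run_mocks.py - hypothetical convenience script that starts all four mock
# regional servers as subprocesses. Ctrl+C terminates them all.
import subprocess
import sys

MOCKS = [  # (region, port, simulated latency in ms)
    ("us-east-1", "8001", "100"),
    ("me-south-1", "8002", "60"),
    ("ap-south-1", "8003", "50"),
    ("brazilsouth", "8004", "80"),
]

procs = [
    subprocess.Popen([sys.executable, "mock_regional_server.py", region, port, latency])
    for region, port, latency in MOCKS
]
try:
    for p in procs:
        p.wait()
except KeyboardInterrupt:
    for p in procs:
        p.terminate()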
Step 6: Test the Multi-Region Router Locally
You'll need 5 terminal windows for this step. Open them all and navigate to your project directory in each.
Sub-step 6a: Start the mock regional servers
In terminal 1 (US East - higher latency):
python mock_regional_server.py us-east-1 8001 100
In terminal 2 (Middle East - lower latency):
python mock_regional_server.py me-south-1 8002 60
In terminal 3 (South Asia - lowest latency):
python mock_regional_server.py ap-south-1 8003 50
In terminal 4 (Brazil - medium latency):
python mock_regional_server.py brazilsouth 8004 80
Expected output in each terminal:
Starting mock server for [region] on port [port] (simulated latency: [X]ms)
INFO: Started server process [PID]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:[port]
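Before starting the router, you can spot-check that a mock is reachable:

curl -s http://localhost:8002/health

Expected output (after roughly 60ms of simulated latency):

{"status":"healthy","region":"me-south-1"}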
Sub-step 6b: Start the main router service
In terminal 5, start the router from the project root. Run it as a module so the from src... imports resolve:
python -m src.main
Expected output:
INFO: Started server process [12345]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
What just happened: You now have a complete simulated multi-region environment running locally. Four mock servers represent different geographic regions with realistic latency profiles, and the main router service is ready to intelligently distribute requests among them.
Step 7: Send Test Requests and Verify Routing
Sub-step 7a: Test a single inference request
Open a new terminal (terminal 6) and send a test request:
curl -X POST http://localhost:8000/v1/inference \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Explain quantum computing in simple terms",
    "max_tokens": 150,
    "temperature": 0.7
  }'
Expected output:
{
  "result": "Mock response from ap-south-1: Explain quantum computing in simple terms...",
  "metadata": {
    "region": "ap-south-1",
    "provider": "aws",
    "cost_per_1k": 0.32
  }
}
In terminal 5 (where the router is running), you should see:
Selected region: ap-south-1 (latency: 52ms, cost: $0.32/1k tokens)
Sub-step 7b: Check region health status
curl -s http://localhost:8000/regions/status | jq
Expected output:
{
  "latency_cache": {
    "us-east-1": 0.102,
    "me-south-1": 0.062,
    "ap-south-1": 0.051,
    "brazilsouth": 0.081
  },
  "failure_count": {}
}
Sub-step 7c: Test failover behavior
Stop the ap-south-1 server (press Ctrl+C in terminal 3), then send another request:
curl -X POST http://localhost:8000/v1/inference \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Test failover", "max_tokens": 50}'
Expected output: The response should now come from me-south-1 (the next best region):
{
  "result": "Mock response from me-south-1: Test failover...",
  "metadata": {
    "region": "me-south-1",
    "provider": "aws",
    "cost_per_1k": 0.35
  }
}
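Before restarting the server, you can see the failure tracking at work: the failed health check from the failover request is now recorded:

curl -s http://localhost:8000/regions/status | jq .failure_count

Expected output:

{
  "ap-south-1": 1
}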
Restart the ap-south-1 server in terminal 3 for the next steps.
What just happened: You've verified that the router correctly selects the lowest-cost, lowest-latency region (ap-south-1) under normal conditions, and automatically fails over to the next best region when the primary becomes unavailable. This demonstrates the core value proposition: cost optimization with reliability.
Step 8: Deploy to Real Cloud Regions (Optional)
Note: This step requires an AWS account and will incur real cloud infrastructure costs. Skip it if local testing is all you need.
Sub-step 8a: Install additional prerequisites
pip install boto3==1.34.0
Verify AWS CLI is configured:
aws sts get-caller-identity
Expected output:
{
  "UserId": "AIDAXXXXXXXXXXXXXXXXX",
  "Account": "123456789012",
  "Arn": "arn:aws:iam::123456789012:user/your-username"
}
Sub-step 8b: Enable required AWS regions
Some regions like me-south-1 require manual opt-in. Enable them via AWS Console or CLI:
# Check which regions are enabled
aws account list-regions --region-opt-status-contains ENABLED ENABLED_BY_DEFAULT
# Enable Middle East region (if not already enabled)
aws account enable-region --region-name me-south-1
Wait 5-10 minutes for the region to become fully available.
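You can poll the opt-in status from the CLI; proceed once it reports ENABLED:

aws account get-region-opt-status --region-name me-south-1

Expected output once the region is ready:

{
    "RegionName": "me-south-1",
    "RegionOptStatus": "ENABLED"
}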
Sub-step 8c: Create Terraform configuration
Create terraform/main.tf:
terraform {
  required_version = ">= 1.6"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

# Define a provider alias for each region
provider "aws" {
  alias  = "us_east"
  region = "us-east-1"
}

provider "aws" {
  alias  = "me_south"
  region = "me-south-1"
}

provider "aws" {
  alias  = "ap_south"
  region = "ap-south-1"
}

# Example: Create an ECS cluster in each region.
# This is a simplified example - production requires security groups,
# load balancers, auto-scaling, and monitoring.
resource "aws_ecs_cluster" "inference_us_east" {
  provider = aws.us_east
  name     = "inference-cluster-us-east-1"
}

resource "aws_ecs_cluster" "inference_me_south" {
  provider = aws.me_south
  name     = "inference-cluster-me-south-1"
}

resource "aws_ecs_cluster" "inference_ap_south" {
  provider = aws.ap_south
  name     = "inference-cluster-ap-south-1"
}