Thursday, February 19, 2026

Build a Multi-Region AI Inference Pipeline with Global Infrastructure

What You'll Build

By the end of this tutorial, you'll have a working multi-region inference pipeline that intelligently routes requests between cloud regions. Your FastAPI service will select the optimal region based on real-time latency measurements and cost optimization, with automatic failover when regions become unavailable.

This architecture matters because compute pricing varies across regions. You'll build a system that balances cost differences with latency requirements. The tutorial uses local mock servers for testing before deploying to real cloud infrastructure, so you can validate the routing logic without incurring cloud costs during development.

Prerequisites

  • Python 3.10+ installed locally (python --version to check)
  • pip package manager
  • curl for testing endpoints
  • jq for parsing JSON responses: brew install jq (macOS) or apt-get install jq (Ubuntu)
  • AWS Account (optional for Step 8 - real deployment only)
  • AWS CLI v2 (optional for Step 8): aws --version
  • Terraform 1.6+ (optional for Step 8): terraform --version
  • Basic understanding of REST APIs and async Python
  • Estimated time: 60-90 minutes (Steps 1-7), additional 30-60 minutes for Step 8 if deploying to cloud

Install core Python dependencies:

pip install fastapi==0.104.1 uvicorn==0.24.0 httpx==0.25.0 pydantic==2.5.0

Expected output:

Successfully installed fastapi-0.104.1 uvicorn-0.24.0 httpx-0.25.0 pydantic-2.5.0

Step-by-Step Instructions

Step 1: Set Up the Project Structure

Create the project directory and file structure:

mkdir multi-region-inference
cd multi-region-inference
mkdir -p src terraform
touch src/__init__.py src/main.py src/region_router.py src/region_config.py

Verify the structure:

tree -L 2

Expected output:

.
├── src
│   ├── __init__.py
│   ├── main.py
│   ├── region_config.py
│   └── region_router.py
└── terraform

What just happened: You've created a standard Python project layout. The src directory contains your application code, and terraform will hold infrastructure-as-code files if you deploy to real cloud regions in Step 8.

Step 2: Create the Region Configuration

Create src/region_config.py:

import os
from typing import Dict, List
from pydantic import BaseModel


class RegionEndpoint(BaseModel):
    """Configuration for a single region endpoint"""
    name: str
    endpoint: str
    cost_per_1k_tokens: float  # Cost in USD
    priority: int  # Lower number = higher priority
    provider: str  # "aws" or "azure"


# Region configurations with representative cost values
REGIONS: Dict[str, RegionEndpoint] = {
    "us-east-1": RegionEndpoint(
        name="us-east-1",
        endpoint=os.getenv("AWS_US_EAST_ENDPOINT", "http://localhost:8001"),
        cost_per_1k_tokens=0.50,
        priority=3,
        provider="aws"
    ),
    "me-south-1": RegionEndpoint(
        name="me-south-1",  # AWS Bahrain - Middle East
        endpoint=os.getenv("AWS_ME_SOUTH_ENDPOINT", "http://localhost:8002"),
        cost_per_1k_tokens=0.35,
        priority=1,
        provider="aws"
    ),
    "ap-south-1": RegionEndpoint(
        name="ap-south-1",  # AWS Mumbai - South Asia
        endpoint=os.getenv("AWS_AP_SOUTH_ENDPOINT", "http://localhost:8003"),
        cost_per_1k_tokens=0.32,
        priority=1,
        provider="aws"
    ),
    "brazilsouth": RegionEndpoint(
        name="brazilsouth",  # Azure Brazil
        endpoint=os.getenv("AZURE_BRAZIL_ENDPOINT", "http://localhost:8004"),
        cost_per_1k_tokens=0.38,
        priority=2,
        provider="azure"
    ),
}


def get_sorted_regions() -> List[RegionEndpoint]:
    """Returns regions sorted by priority, then by cost"""
    return sorted(
        REGIONS.values(),
        key=lambda x: (x.priority, x.cost_per_1k_tokens)
    )

What just happened: You've defined configurations for four regions with different cost profiles. The priority system favors lower-cost regions (me-south-1 and ap-south-1) while maintaining fallback options. Endpoints default to localhost for local testing but can be overridden with environment variables for production deployment.
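
Optional quick check: you can verify the sort order before wiring the config into the router. This tiny script (a hypothetical helper, not part of the main app) prints the regions in routing order when run from the project root:

from src.region_config import get_sorted_regions

for region in get_sorted_regions():
    print(f"{region.priority}  {region.name:12s}  ${region.cost_per_1k_tokens}/1k tokens")

You should see ap-south-1 and me-south-1 first (priority 1, cheapest first), then brazilsouth, then us-east-1.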

Step 3: Build the Smart Region Router

Create src/region_router.py:

import asyncio
import time
from typing import Optional, Dict, Any
import httpx
from src.region_config import get_sorted_regions, RegionEndpoint


class RegionRouter:
    """Routes inference requests to optimal regions based on latency and cost"""
    
    def __init__(self, timeout: float = 5.0):
        self.timeout = timeout
        self.latency_cache: Dict[str, float] = {}
        self.failure_count: Dict[str, int] = {}
        
    async def measure_latency(self, region: RegionEndpoint) -> Optional[float]:
        """
        Ping region endpoint to measure actual latency.
        Returns latency in seconds, or None if region is unreachable.
        """
        try:
            start = time.time()
            async with httpx.AsyncClient(timeout=self.timeout) as client:
                response = await client.get(f"{region.endpoint}/health")
                latency = time.time() - start
            if response.status_code == 200:
                self.latency_cache[region.name] = latency
                # Clear the counter so failures are tracked consecutively
                self.failure_count.pop(region.name, None)
                return latency
            # A non-200 response also counts as a failure
            self._record_failure(region.name)
            return None
        except Exception as e:
            print(f"Failed to reach {region.name}: {e}")
            self._record_failure(region.name)
            return None

    def _record_failure(self, region_name: str) -> None:
        """Increment the failure counter and evict any stale latency reading."""
        self.failure_count[region_name] = self.failure_count.get(region_name, 0) + 1
        # Evicting the cached latency excludes an unreachable region immediately
        self.latency_cache.pop(region_name, None)
    
    async def select_best_region(self) -> Optional[RegionEndpoint]:
        """
        Select optimal region based on:
        1. Priority (cost tier)
        2. Measured latency
        3. Failure history (exclude regions with 3+ consecutive failures)
        """
        regions = get_sorted_regions()
        
        # Measure latency for all regions concurrently
        latency_tasks = [self.measure_latency(r) for r in regions]
        await asyncio.gather(*latency_tasks)
        
        # Filter out unhealthy regions
        available_regions = [
            r for r in regions 
            if self.failure_count.get(r.name, 0) < 3
            and self.latency_cache.get(r.name) is not None
        ]
        
        if not available_regions:
            print("WARNING: No healthy regions available!")
            return None
        
        # Weighted score: priority tier plus latency scaled so that every
        # 100ms adds one point; the lowest score wins
        def score_region(region: RegionEndpoint) -> float:
            latency = self.latency_cache.get(region.name, 999)
            return region.priority + (latency * 10)
        
        best_region = min(available_regions, key=score_region)
        latency_ms = self.latency_cache.get(best_region.name, 0) * 1000
        print(f"Selected region: {best_region.name} "
              f"(latency: {latency_ms:.0f}ms, cost: ${best_region.cost_per_1k_tokens}/1k tokens)")
        return best_region
    
    async def route_inference(self, payload: Dict[str, Any]) -> Dict[str, Any]:
        """
        Route inference request to best available region.
        Returns response with metadata about selected region.
        """
        best_region = await self.select_best_region()
        
        if not best_region:
            raise Exception("No healthy regions available for inference")
        
        try:
            async with httpx.AsyncClient(timeout=30.0) as client:
                response = await client.post(
                    f"{best_region.endpoint}/v1/inference",
                    json=payload
                )
                response.raise_for_status()
                result = response.json()
                
                # Add routing metadata to response
                result["_metadata"] = {
                    "region": best_region.name,
                    "provider": best_region.provider,
                    "cost_per_1k": best_region.cost_per_1k_tokens
                }
                return result
        except Exception as e:
            print(f"Inference failed on {best_region.name}: {e}")
            self._record_failure(best_region.name)
            raise

What just happened: This is the core routing engine. It measures real latency to each region using concurrent health checks, maintains a failure count to avoid repeatedly trying dead regions, and selects the optimal region using a weighted scoring system that balances cost priority with actual latency. The route_inference method handles the actual request forwarding and adds metadata about which region processed the request.
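
To make the scoring concrete, here is a standalone sketch of the same formula score_region uses, with illustrative latency numbers rather than real measurements:

# Same formula as score_region: each 100ms of latency adds one point
def score(priority: int, latency_s: float) -> float:
    return priority + latency_s * 10

print(score(1, 0.051))  # ap-south-1:  1.51  -> lowest score wins
print(score(1, 0.062))  # me-south-1:  1.62
print(score(2, 0.081))  # brazilsouth: 2.81
print(score(3, 0.102))  # us-east-1:   4.02

Note the implied trade-off: a priority-1 region has to be more than 100ms slower than a priority-2 region before it loses, which is exactly the cost bias you want.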

Step 4: Create the FastAPI Application

Create src/main.py:

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import Dict, Any
from src.region_router import RegionRouter

app = FastAPI(
    title="Multi-Region Inference API",
    description="Intelligent routing for AI inference across global regions"
)
router = RegionRouter()


class InferenceRequest(BaseModel):
    """Request schema for inference endpoint"""
    prompt: str
    max_tokens: int = 100
    temperature: float = 0.7


class InferenceResponse(BaseModel):
    """Response schema with result and routing metadata"""
    result: str
    metadata: Dict[str, Any]


@app.get("/health")
async def health_check():
    """Health check endpoint for load balancers"""
    return {"status": "healthy", "service": "multi-region-router"}


@app.post("/v1/inference", response_model=InferenceResponse)
async def inference(request: InferenceRequest):
    """
    Main inference endpoint - automatically routes to optimal region.
    
    The router selects the best region based on cost and latency,
    then forwards the request and returns the result with metadata.
    """
    try:
        payload = request.model_dump()  # Pydantic v2; .dict() is deprecated
        result = await router.route_inference(payload)
        return InferenceResponse(
            result=result.get("output", ""),
            metadata=result.get("_metadata", {})
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


@app.get("/regions/status")
async def region_status():
    """
    Debug endpoint showing current region health metrics.
    Useful for monitoring and troubleshooting routing decisions.
    """
    return {
        "latency_cache": router.latency_cache,
        "failure_count": router.failure_count
    }


if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

What just happened: You've created a FastAPI application that wraps the region router. The /v1/inference endpoint accepts inference requests and automatically routes them to the optimal region. The /regions/status endpoint exposes internal routing metrics for debugging and monitoring.
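
You can also exercise the endpoint programmatically once everything is running in Step 6. A minimal httpx client sketch, assuming the router is listening on localhost:8000:

import asyncio
import httpx


async def main():
    async with httpx.AsyncClient(timeout=60.0) as client:
        response = await client.post(
            "http://localhost:8000/v1/inference",
            json={"prompt": "Hello from the client", "max_tokens": 50},
        )
        response.raise_for_status()
        data = response.json()
        print(data["result"])
        print("Served by:", data["metadata"].get("region"))


asyncio.run(main())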

Step 5: Create Mock Regional Endpoints for Testing

Create mock_regional_server.py in the project root directory:

"""
Mock regional inference server for local testing.
Simulates a real inference endpoint with configurable latency.
"""
import sys
import asyncio
import random
from fastapi import FastAPI
import uvicorn


def create_mock_server(region_name: str, latency_ms: int):
    """Create a mock server that simulates a regional endpoint"""
    app = FastAPI()
    
    @app.get("/health")
    async def health():
        # Simulate network latency without blocking the event loop
        await asyncio.sleep(latency_ms / 1000.0)
        return {"status": "healthy", "region": region_name}
    
    @app.post("/v1/inference")
    async def inference(payload: dict):
        # Simulate inference processing time without blocking the event loop
        await asyncio.sleep(random.uniform(0.5, 1.5))
        return {
            "output": f"Mock response from {region_name}: {payload.get('prompt', '')[:50]}...",
            "tokens_used": payload.get('max_tokens', 100)
        }
    
    return app


if __name__ == "__main__":
    # Parse command line arguments
    region = sys.argv[1] if len(sys.argv) > 1 else "us-east-1"
    port = int(sys.argv[2]) if len(sys.argv) > 2 else 8001
    latency = int(sys.argv[3]) if len(sys.argv) > 3 else 50
    
    app = create_mock_server(region, latency)
    print(f"Starting mock server for {region} on port {port} (simulated latency: {latency}ms)")
    uvicorn.run(app, host="0.0.0.0", port=port, log_level="warning")

What just happened: This creates mock regional inference endpoints with configurable latency characteristics. Each mock server simulates a real regional deployment, allowing you to test the routing logic locally without deploying to actual cloud infrastructure. The latency parameter lets you simulate geographic distance.
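
If you'd rather not juggle four terminals in Step 6, you can start all the mocks from one process. This is a hypothetical convenience script (launch_mocks.py), not part of the tutorial flow; it shells out to the same mock_regional_server.py with the Step 6 arguments:

import subprocess
import sys

# (region, port, simulated latency in ms) -- matches Step 6
MOCKS = [
    ("us-east-1", 8001, 100),
    ("me-south-1", 8002, 60),
    ("ap-south-1", 8003, 50),
    ("brazilsouth", 8004, 80),
]

procs = [
    subprocess.Popen([sys.executable, "mock_regional_server.py", name, str(port), str(lat)])
    for name, port, lat in MOCKS
]
try:
    for proc in procs:
        proc.wait()
except KeyboardInterrupt:
    # Ctrl+C tears down all four mock servers together
    for proc in procs:
        proc.terminate()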

Step 6: Test the Multi-Region Router Locally

You'll need 5 terminal windows for this step. Open them all and navigate to your project directory in each.

Sub-step 6a: Start the mock regional servers

In terminal 1 (US East - higher latency):

python mock_regional_server.py us-east-1 8001 100

In terminal 2 (Middle East - lower latency):

python mock_regional_server.py me-south-1 8002 60

In terminal 3 (South Asia - lowest latency):

python mock_regional_server.py ap-south-1 8003 50

In terminal 4 (Brazil - medium latency):

python mock_regional_server.py brazilsouth 8004 80

Expected output in each terminal:

Starting mock server for [region] on port [port] (simulated latency: [X]ms)
INFO:     Started server process [PID]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:[port]

Sub-step 6b: Start the main router service

In terminal 5, start the router as a module (running python src/main.py directly would break the src.* imports):

python -m src.main

Expected output:

INFO:     Started server process [12345]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

What just happened: You now have a complete simulated multi-region environment running locally. Four mock servers represent different geographic regions with realistic latency profiles, and the main router service is ready to intelligently distribute requests among them.
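
Before sending real requests, you can confirm all four mocks answer their health checks, either with curl against each port or with a short probe like this sketch (it assumes the ports from sub-step 6a):

import httpx

for port in (8001, 8002, 8003, 8004):
    reply = httpx.get(f"http://localhost:{port}/health", timeout=5.0)
    print(port, reply.json())  # e.g. 8003 {'status': 'healthy', 'region': 'ap-south-1'}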

Step 7: Send Test Requests and Verify Routing

Sub-step 7a: Test a single inference request

Open a new terminal (terminal 6) and send a test request:

curl -X POST http://localhost:8000/v1/inference \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Explain quantum computing in simple terms",
    "max_tokens": 150,
    "temperature": 0.7
  }'

Expected output:

{
  "result": "Mock response from ap-south-1: Explain quantum computing in simple terms...",
  "metadata": {
    "region": "ap-south-1",
    "provider": "aws",
    "cost_per_1k": 0.32
  }
}

In terminal 5 (where the router is running), you should see:

Selected region: ap-south-1 (latency: 52ms, cost: $0.32/1k tokens)

Sub-step 7b: Check region health status

curl -s http://localhost:8000/regions/status | jq

Expected output:

{
  "latency_cache": {
    "us-east-1": 0.102,
    "me-south-1": 0.062,
    "ap-south-1": 0.051,
    "brazilsouth": 0.081
  },
  "failure_count": {}
}

Sub-step 7c: Test failover behavior

Stop the ap-south-1 server (press Ctrl+C in terminal 3), then send another request:

curl -X POST http://localhost:8000/v1/inference \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Test failover", "max_tokens": 50}'

Expected output: The response should now come from me-south-1 (the next best region):

{
  "result": "Mock response from me-south-1: Test failover...",
  "metadata": {
    "region": "me-south-1",
    "provider": "aws",
    "cost_per_1k": 0.35
  }
}

Restart the ap-south-1 server in terminal 3 for the next steps.

What just happened: You've verified that the router correctly selects the lowest-cost, lowest-latency region (ap-south-1) under normal conditions, and automatically fails over to the next best region when the primary becomes unavailable. This demonstrates the core value proposition: cost optimization with reliability.
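
If you want to see the routing decision hold up over many requests, a small driver sketch like this tallies which region served each call (it assumes the full local stack from Step 6 is running):

import asyncio
from collections import Counter
import httpx


async def main():
    tally = Counter()
    async with httpx.AsyncClient(timeout=60.0) as client:
        for i in range(10):
            response = await client.post(
                "http://localhost:8000/v1/inference",
                json={"prompt": f"request {i}", "max_tokens": 20},
            )
            tally[response.json()["metadata"]["region"]] += 1
    print(tally)  # with all regions healthy, expect ap-south-1 to take every request


asyncio.run(main())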

Step 8: Deploy to Real Cloud Regions (Optional)

Note: This step requires an AWS account and will incur cloud infrastructure costs. Skip this step if you want to stay with local testing only.

Sub-step 8a: Install additional prerequisites

pip install boto3==1.34.0

Verify AWS CLI is configured:

aws sts get-caller-identity

Expected output:

{
    "UserId": "AIDAXXXXXXXXXXXXXXXXX",
    "Account": "123456789012",
    "Arn": "arn:aws:iam::123456789012:user/your-username"
}

Sub-step 8b: Enable required AWS regions

Some regions like me-south-1 require manual opt-in. Enable them via AWS Console or CLI:

# Check which regions are enabled
aws account list-regions --region-opt-status-contains ENABLED ENABLED_BY_DEFAULT

# Enable Middle East region (if not already enabled)
aws account enable-region --region-name me-south-1

Wait 5-10 minutes for the region to become fully available.
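
Since boto3 is installed, you can also poll the opt-in status from Python. A sketch using the Account API (it assumes your default credentials carry the account:GetRegionOptStatus permission):

import boto3

account = boto3.client("account")
status = account.get_region_opt_status(RegionName="me-south-1")
print(status["RegionName"], status["RegionOptStatus"])
# Proceed once the status reports ENABLED (or ENABLED_BY_DEFAULT)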

Sub-step 8c: Create Terraform configuration

Create terraform/main.tf:

terraform {
  required_version = ">= 1.6"
  
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

# Define provider for each region
provider "aws" {
  alias  = "us_east"
  region = "us-east-1"
}

provider "aws" {
  alias  = "me_south"
  region = "me-south-1"
}

provider "aws" {
  alias  = "ap_south"
  region = "ap-south-1"
}

# Example: Create ECS cluster in each region
# This is a simplified example - production requires security groups,
# load balancers, auto-scaling, and monitoring

resource "aws_ecs_cluster" "inference_us_east" {
  provider = aws.us_east
  name     = "inference-cluster-us-east-1"
}

resource "aws_ecs_cluster" "inference_me_south" {
  provider = aws.me_south
  name     = "inference-cluster-me-south-1"
}

resource "
