Saturday, February 21, 2026

Beyond Vector Databases: How AI Actually Needs Persistent Memory

Every AI agent tutorial shows the same thing: a stateless chatbot that forgets everything when the session ends. Ask it the same question twice, you get two different answers. Reference something from last week, it has no idea.

This isn't a limitation of LLMs. It's a limitation of architecture.

I've been working on persistent memory systems for AI agents. Not just "store the chat history" - actual structured memory that persists across sessions, learns patterns, and improves over time. Here's how to build it.

The Memory Problem Nobody Talks About

Current AI agent architectures have three memory tiers:

  • Context window: What the model sees right now (limited, expensive)
  • Vector store: Semantic search over documents (great for RAG, terrible for state)
  • Application database: Structured data about users, sessions, history

The gap is between vector stores and application databases. Vector DBs are for similarity search. They're not for tracking "what did we decide last Tuesday" or "what's the current status of this workflow."

How Persistent Memory Actually Works

Real persistent memory for AI needs four capabilities:

  1. Structured storage - What happened, when, why
  2. Pattern recognition - What connects to what
  3. Temporal awareness - What changed over time
  4. Relationship tracking - How decisions relate to outcomes

Here's the architecture that handles this:


┌───────────────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│         AI Agent          │────▶│   Structured    │────▶│   Pattern       │
│ (Claude Code | Kimi CLI)  │◀────│   Memory        │◀────│   Graph         │
│                           │     │   (MySQL)       │     │   (Neo4j)       │
└───────────────────────────┘     └─────────────────┘     └─────────────────┘

MySQL: The Structured Memory Layer

MySQL (or PostgreSQL) handles the structured data:

-- What decisions were made
CREATE TABLE architecture_decisions (
    id INT AUTO_INCREMENT PRIMARY KEY,
    project_id INT NOT NULL,
    title VARCHAR(255) NOT NULL,
    decision TEXT NOT NULL,
    rationale TEXT,
    decided_at DATETIME,
    INDEX idx_project_date (project_id, decided_at)
);

-- Reusable patterns discovered
CREATE TABLE code_patterns (
    id INT AUTO_INCREMENT PRIMARY KEY,
    category VARCHAR(50),
    name VARCHAR(255),
    description TEXT,
    code_example TEXT,
    confidence_score FLOAT,
    usage_count INT DEFAULT 0
);

This is boring, reliable, ACID-compliant storage. It's what you need when an AI agent says "I decided to use FastAPI" and you need to remember why six months later.
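
To see the recall path end to end, here's a minimal sketch using the mysql-connector-python package; the connection settings, project_id value, and search term are placeholders, not part of the schema above.

# Minimal sketch: recall why a past decision was made.
# Connection settings, project_id, and the search term are placeholders.
import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="agent", password="secret", database="agent_memory"
)
cursor = conn.cursor(dictionary=True)

cursor.execute(
    """
    SELECT title, decision, rationale, decided_at
    FROM architecture_decisions
    WHERE project_id = %s AND title LIKE %s
    ORDER BY decided_at DESC
    LIMIT 5
    """,
    (42, "%FastAPI%"),
)
for row in cursor.fetchall():
    print(f"{row['decided_at']}: {row['title']} - {row['rationale']}")

cursor.close()
conn.close()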

Neo4j: The Pattern Recognition Layer

Neo4j handles what MySQL can't: relationships and similarity.

When an AI agent makes a decision, you want to know:

  • Is this similar to a previous decision?
  • What patterns keep recurring?
  • Which decisions led to good outcomes?

// Graph model for AI agent memory
(:Decision {title: 'Use Redis for caching'})
  -[:SIMILAR_TO]->(:Decision {title: 'Used Redis in project X'})
  -[:LEADS_TO]->(:Outcome {type: 'performance_improvement'})

// Query: Find similar past decisions
MATCH (d:Decision)-[:SIMILAR_TO]-(similar:Decision)
WHERE d.id = $current_decision
RETURN similar
ORDER BY similar.confidence_score DESC
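
To run that lookup from application code, a minimal sketch with the official neo4j Python driver might look like this; the URI, credentials, and decision id are placeholder values.

# Minimal sketch: fetch decisions similar to the current one via the Neo4j driver.
# The URI, credentials, and decision id are placeholders.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = """
MATCH (d:Decision)-[:SIMILAR_TO]-(similar:Decision)
WHERE d.id = $current_decision
RETURN similar.title AS title, similar.confidence_score AS confidence
ORDER BY confidence DESC
"""

with driver.session() as session:
    for record in session.run(query, current_decision="decision-123"):
        print(record["title"], record["confidence"])

driver.close()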

Why Not Just Vector Databases?

Vector DBs (Pinecone, Weaviate, etc.) are for similarity search. They're optimized for:

  • Finding documents similar to a query
  • Semantic search
  • RAG retrieval

They're not optimized for:

  • ACID transactions
  • Complex relationships
  • Temporal queries
  • Structured metadata filtering

Real-World Example: MCP Protocol

The Model Context Protocol (MCP) is gaining traction for exactly this reason. It defines how AI systems should store and retrieve context - not just embeddings, but structured session state, decisions, and patterns.

What MCP implementations are discovering: you need both structured storage (MySQL/PostgreSQL) and graph relationships (Neo4j). Vector DBs alone don't cut it for agent memory.

Stuck? Let AI Help You Build It

If you're thinking "this sounds complicated" - you're right, it kind of is. But you don't have to build it alone.

Here's the best part: ask your AI tool (Claude Code, Kimi CLI, or whatever you're using) to implement this architecture for you. Paste the schema above, describe what you want to build, and let it generate the code.

Need a starting point? This tutorial by Bala Priya C walks through building an MCP server from scratch:

Building a Simple MCP Server in Python

The AI can handle the boilerplate. You handle the logic. That's the whole point of persistent memory - the system learns so you don't have to start from zero every time.

Summary

AI agents need memory. Not just vector similarity - structured, relational, temporal memory.

MySQL gives you structured state. Neo4j gives you pattern recognition. Together they provide what vector databases alone cannot: true persistent memory for AI agents.

For the database-focused perspective on this architecture, see the companion post on AnotherMySQLDBA.

Thursday, February 19, 2026

Building a Production-Ready Inference Cache with Redis for LLM KV Management

What You'll Build

By the end of this tutorial, you'll have a working KV (key-value) cache system using Redis to store and retrieve LLM inference results. This dramatically reduces latency for repeated inference requests—think chatbot conversations where context gets reused, or RAG systems hitting the same documents.

You'll build a Python service that intercepts LLM inference calls, checks Redis for cached results, and only hits your expensive GPU inference when there's a cache miss. This pattern can reduce repeated inference latency by orders of magnitude for conversational workloads, turning multi-second responses into millisecond lookups.

Why this matters: as inference workloads scale, you can't just throw more GPUs at the problem. Caching is how engineering teams manage inference costs without sacrificing response times.

Prerequisites

  • Python 3.10+ installed (3.10, 3.11, or 3.12 recommended)
  • Docker 20.x or later for running Redis
  • pip package manager
  • At least 8GB RAM (16GB recommended if running models locally)
  • Basic familiarity with Python and command line
  • Estimated time: 45-60 minutes

Install Python dependencies:

pip install torch transformers redis numpy

Verify installations:

python -c "import torch, transformers, redis; print('All packages installed')"

Step-by-Step Instructions

Step 1: Start Redis with Persistence

Run Redis in Docker with volume mounting so your cache survives restarts:

docker run -d \
  --name inference-cache \
  -p 6379:6379 \
  -v redis-data:/data \
  redis:7.2-alpine redis-server --appendonly yes

What this does:

  • -d: Runs container in detached mode (background)
  • --name inference-cache: Names the container for easy reference
  • -p 6379:6379: Maps Redis default port to your host
  • -v redis-data:/data: Creates persistent volume for cache data
  • --appendonly yes: Enables AOF persistence (writes survive restarts)

Verify Redis is running:

docker logs inference-cache | grep -i "ready to accept"

You should see output indicating Redis is ready to accept connections.

Step 2: Create the KV Cache Manager

Create a file called kv_cache_manager.py. This handles serialization of inference results into Redis-friendly byte strings and manages cache keys with TTL (time-to-live).

import redis
import numpy as np
import hashlib
import pickle
from typing import Optional, Tuple

class KVCacheManager:
    def __init__(self, host='localhost', port=6379, ttl=3600):
        """
        Initialize Redis connection with TTL for cache entries.
        
        Args:
            host: Redis server hostname
            port: Redis server port
            ttl: Time-to-live in seconds (default: 3600 = 1 hour)
        """
        self.redis_client = redis.Redis(
            host=host, 
            port=port, 
            decode_responses=False  # Store binary data
        )
        self.ttl = ttl
        
    def _generate_key(self, prompt: str, layer_idx: int) -> str:
        """
        Generate cache key from prompt + layer index.
        Uses SHA256 hash to keep keys manageable length.
        
        Args:
            prompt: Input text prompt
            layer_idx: Layer index (-1 for final output)
            
        Returns:
            Redis key string like "kv:layer-1:a3f7c8b4e9d2c1f5"
        """
        prompt_hash = hashlib.sha256(prompt.encode()).hexdigest()[:16]
        return f"kv:layer{layer_idx}:{prompt_hash}"
    
    def store_kv(self, prompt: str, layer_idx: int, 
                 key_cache: np.ndarray, value_cache: np.ndarray):
        """
        Store key and value tensors for a specific layer.
        
        Args:
            prompt: Input prompt used to generate cache key
            layer_idx: Layer index for this KV pair
            key_cache: Numpy array representing key tensor
            value_cache: Numpy array representing value tensor
        """
        cache_key = self._generate_key(prompt, layer_idx)
        
        # Serialize numpy arrays using pickle
        data = pickle.dumps({
            'key': key_cache,
            'value': value_cache
        })
        
        # Store with TTL to prevent unbounded memory growth
        self.redis_client.setex(cache_key, self.ttl, data)
        
    def retrieve_kv(self, prompt: str, layer_idx: int) -> Optional[Tuple[np.ndarray, np.ndarray]]:
        """
        Retrieve cached KV pairs.
        
        Args:
            prompt: Input prompt to look up
            layer_idx: Layer index to retrieve
            
        Returns:
            Tuple of (key_cache, value_cache) if found, None otherwise
        """
        cache_key = self._generate_key(prompt, layer_idx)
        data = self.redis_client.get(cache_key)
        
        if data is None:
            return None
            
        kv_pair = pickle.loads(data)
        return kv_pair['key'], kv_pair['value']
    
    def clear_cache(self):
        """Flush all KV cache entries matching our pattern."""
        for key in self.redis_client.scan_iter("kv:*"):
            self.redis_client.delete(key)

What this does: The cache manager creates unique keys by hashing prompts (to keep key lengths manageable), serializes numpy arrays using pickle, and stores them in Redis with automatic expiration via TTL. This prevents your cache from growing unbounded and consuming all available memory.

Note on pickle security: Pickle has known security vulnerabilities when deserializing untrusted data. For production systems handling untrusted input, use safer serialization formats like msgpack or protobuf.
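
One pickle-free option is numpy's own .npz container, loaded with allow_pickle=False so no arbitrary Python objects are deserialized. A minimal sketch of drop-in replacements for the serialization step (the function names are illustrative):

import io
import numpy as np

def serialize_kv(key_cache: np.ndarray, value_cache: np.ndarray) -> bytes:
    """Pack both arrays into an in-memory .npz archive (no pickle involved)."""
    buffer = io.BytesIO()
    np.savez(buffer, key=key_cache, value=value_cache)
    return buffer.getvalue()

def deserialize_kv(data: bytes):
    """Load the arrays back; allow_pickle=False refuses embedded Python objects."""
    with np.load(io.BytesIO(data), allow_pickle=False) as archive:
        return archive["key"], archive["value"]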

Step 3: Create the Cached Inference Wrapper

Create cached_inference.py. This wraps a HuggingFace model and intercepts inference calls to check the cache first.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from kv_cache_manager import KVCacheManager
import time

class CachedInferenceModel:
    def __init__(self, model_name: str, cache_manager: KVCacheManager):
        """
        Initialize model with cache support.
        
        Args:
            model_name: HuggingFace model identifier (e.g., 'gpt2')
            cache_manager: KVCacheManager instance for caching
        """
        print(f"Loading tokenizer for {model_name}...")
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        
        print(f"Loading model {model_name}...")
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16,  # Use half precision to save memory
            device_map='auto'  # Automatically choose CPU/GPU
        )
        
        self.cache_manager = cache_manager
        self.cache_hits = 0
        self.cache_misses = 0
        
    def generate_with_cache(self, prompt: str, max_new_tokens: int = 50) -> str:
        """
        Generate text with caching support.
        
        This implementation caches final outputs based on exact prompt matching.
        For production, you'd cache intermediate KV tensors from attention layers.
        
        Args:
            prompt: Input text prompt
            max_new_tokens: Maximum tokens to generate
            
        Returns:
            Generated text (with [CACHED] prefix if from cache)
        """
        start_time = time.time()
        
        # Check cache first (using layer_idx=-1 to indicate final output)
        cached_output = self.cache_manager.retrieve_kv(prompt, layer_idx=-1)
        
        if cached_output is not None:
            self.cache_hits += 1
            elapsed = time.time() - start_time
            print(f"✓ Cache HIT! Retrieved in {elapsed:.4f}s")
            # Reconstruct output from cached data
            cached_text = cached_output[0].tobytes().decode('utf-8')
            return f"[CACHED] {cached_text}"
        
        # Cache miss - run full inference
        self.cache_misses += 1
        print(f"✗ Cache MISS. Running full inference...")
        
        # Tokenize input
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        
        # Generate output
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=False,  # Deterministic output for caching
                use_cache=True,  # Enable model's internal KV cache
                pad_token_id=self.tokenizer.eos_token_id  # Prevent warnings
            )
        
        # Decode generated tokens
        generated_text = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        
        # Store in cache for future requests
        # We store the text as numpy array for consistency with the interface
        import numpy as np
        text_bytes = generated_text.encode('utf-8')
        self.cache_manager.store_kv(
            prompt, 
            layer_idx=-1, 
            key_cache=np.frombuffer(text_bytes, dtype=np.uint8), 
            value_cache=np.array([])  # Empty value cache for this simplified version
        )
        
        elapsed = time.time() - start_time
        print(f"Generated in {elapsed:.4f}s")
        
        return generated_text
    
    def print_stats(self):
        """Print cache performance statistics."""
        total = self.cache_hits + self.cache_misses
        hit_rate = (self.cache_hits / total * 100) if total > 0 else 0
        print(f"\n=== Cache Statistics ===")
        print(f"Hits: {self.cache_hits}")
        print(f"Misses: {self.cache_misses}")
        print(f"Hit Rate: {hit_rate:.1f}%")

What this does: This wrapper checks the cache before running inference. On cache miss, it runs the full model inference, stores the result, and returns it. On cache hit, it returns the cached result immediately—typically orders of magnitude faster than full inference.

Simplified approach: This implementation caches final text outputs rather than intermediate KV tensors from attention layers. Caching actual KV tensors requires modifying the model's forward pass (possible but beyond this tutorial's scope). The caching pattern and performance benefits are identical.
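
For reference, if you later want to cache per-layer KV tensors, a plain forward pass already exposes them. A rough sketch, not wired into the class above; the exact structure of past_key_values varies across transformers versions:

import torch

def cache_layer_kv(model, tokenizer, cache_manager, prompt: str):
    """Rough sketch: store per-layer attention KV tensors from one forward pass."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model(**inputs, use_cache=True)

    past = outputs.past_key_values
    # Newer transformers versions return a Cache object instead of nested tuples
    if hasattr(past, "to_legacy_cache"):
        past = past.to_legacy_cache()

    for layer_idx, (key, value) in enumerate(past):
        cache_manager.store_kv(
            prompt,
            layer_idx=layer_idx,
            key_cache=key.to(torch.float32).cpu().numpy(),
            value_cache=value.to(torch.float32).cpu().numpy(),
        )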

Step 4: Test the Cache System

Create test_cache.py to demonstrate cache hits vs misses:

from kv_cache_manager import KVCacheManager
from cached_inference import CachedInferenceModel

def main():
    # Initialize cache manager
    cache_mgr = KVCacheManager(host='localhost', port=6379, ttl=3600)
    
    # Clear any existing cache for clean test
    print("Clearing cache...")
    cache_mgr.clear_cache()
    
    # Load model (using GPT-2 for speed - works with any causal LM)
    print("\nLoading model...")
    model = CachedInferenceModel('gpt2', cache_mgr)
    
    # Test prompt
    prompt = "The future of AI infrastructure is"
    
    # First run - cache miss expected
    print(f"\n{'='*60}")
    print(f"TEST 1: First inference (cache miss expected)")
    print(f"{'='*60}")
    output1 = model.generate_with_cache(prompt, max_new_tokens=30)
    print(f"Output: {output1[:100]}...")
    
    # Second run - cache hit expected
    print(f"\n{'='*60}")
    print(f"TEST 2: Second inference with same prompt (cache hit expected)")
    print(f"{'='*60}")
    output2 = model.generate_with_cache(prompt, max_new_tokens=30)
    print(f"Output: {output2[:100]}...")
    
    # Third run - different prompt, cache miss expected
    print(f"\n{'='*60}")
    print(f"TEST 3: Different prompt (cache miss expected)")
    print(f"{'='*60}")
    prompt2 = "AI models require"
    output3 = model.generate_with_cache(prompt2, max_new_tokens=30)
    print(f"Output: {output3[:100]}...")
    
    # Fourth run - back to first prompt, cache hit expected
    print(f"\n{'='*60}")
    print(f"TEST 4: Back to first prompt (cache hit expected)")
    print(f"{'='*60}")
    output4 = model.generate_with_cache(prompt, max_new_tokens=30)
    print(f"Output: {output4[:100]}...")
    
    # Print final statistics
    model.print_stats()

if __name__ == "__main__":
    main()

Run the test:

python test_cache.py

Performance analysis: You'll observe dramatic performance differences between cache misses (full inference) and cache hits (Redis lookup). Cache hits typically complete in milliseconds while full inference takes seconds—demonstrating how caching reduces latency for repeated requests. In production systems serving thousands of requests, this translates directly to reduced GPU costs and improved user experience.

Step 5: Monitor Redis Memory Usage

Check current memory usage:

docker exec inference-cache redis-cli INFO memory | grep used_memory_human

For continuous monitoring, create monitor_cache.py:

import redis
import time

def monitor_cache(host='localhost', port=6379, interval=5):
    """
    Monitor Redis cache metrics in real-time.
    
    Args:
        host: Redis hostname
        port: Redis port
        interval: Seconds between updates
    """
    client = redis.Redis(host=host, port=port)
    
    print("Monitoring Redis cache (Ctrl+C to stop)...")
    print(f"{'Time':<20} {'Keys':<10} {'Memory':<15} {'Hit Rate':<10}")
    print("-" * 60)
    
    try:
        while True:
            info = client.info()
            
            # Gather metrics
            keys = client.dbsize()
            memory_mb = info['used_memory'] / (1024 * 1024)
            hits = info.get('keyspace_hits', 0)
            misses = info.get('keyspace_misses', 0)
            total = hits + misses
            hit_rate = (hits / total * 100) if total > 0 else 0
            
            # Display row
            timestamp = time.strftime("%Y-%m-%d %H:%M:%S")
            print(f"{timestamp:<20} {keys:<10} {memory_mb:>10.2f} MB   {hit_rate:>6.1f}%")
            
            time.sleep(interval)
            
    except KeyboardInterrupt:
        print("\nMonitoring stopped.")

if __name__ == "__main__":
    monitor_cache()

Run in a separate terminal while testing:

python monitor_cache.py

This gives you real-time visibility into cache performance and memory consumption—critical for production deployments.

Verification

Confirm everything works correctly with these checks:

1. Verify Redis Container Status

docker ps | grep inference-cache

Container should show "Up" status.

2. Verify Cache Keys Exist

docker exec inference-cache redis-cli KEYS "kv:*"

After running test_cache.py, you should see keys matching the pattern kv:layer-1:{hash}.

3. Test Cache Hit Rate

Run test_cache.py a second time:

python test_cache.py

Because main() clears the cache at startup, each run reports 2 hits and 2 misses (a 50% hit rate) from the repeated prompt within that run. To verify the cache persists across runs, comment out the cache_mgr.clear_cache() call and run the script again; you should then see 4 hits and 0 misses.

4. Verify TTL is Working

Check time-to-live on a cache key (replace the hash with an actual key from step 2):

docker exec inference-cache redis-cli TTL "kv:layer-1:a3f7c8b4e9d2c1f5"

Should return a positive integer less than 3600 (seconds remaining until expiry). A return of -1 means the key exists but has no TTL set; -2 means the key doesn't exist, so double-check the key name from step 2.

5. Test Cache Persistence

Restart Redis and verify cache survives:

# Restart container
docker restart inference-cache

# Wait 5 seconds for startup
sleep 5

# Check if keys still exist
docker exec inference-cache redis-cli KEYS "kv:*"

Keys should still be present, confirming AOF persistence is working.

Troubleshooting

Issue 1: "ConnectionRefusedError: [Errno 111] Connection refused"

Cause: Redis isn't running or isn't accessible on port 6379.

Fix:

# Check if Redis container is running
docker ps -a | grep inference-cache

# If stopped, start it
docker start inference-cache

# If it doesn't exist, recreate it
docker run -d --name inference-cache -p 6379:6379 -v redis-data:/data redis:7.2-alpine redis-server --appendonly yes

# Verify it's accepting connections
docker logs inference-cache | grep -i "ready"

Issue 2: "ModuleNotFoundError: No module named 'transformers'"

Cause: Python dependencies not installed or wrong Python environment active.

Fix:

# Reinstall dependencies
pip install torch transformers redis numpy

# Verify installation
python -c "import transformers; print(transformers.__version__)"

Issue 3: Cache hits not occurring on repeated prompts

Cause: TTL expired, or cache was cleared between runs.

Fix:

# Check if keys exist
docker exec inference-cache redis-cli KEYS "kv:*"

# If no keys, run test_cache.py again
python test_cache.py

# Cache hits occur within a single run (TEST 2 and TEST 4). To see hits across
# runs, comment out cache_mgr.clear_cache() in main() before re-running
python test_cache.py

Issue 4: "RuntimeError: CUDA out of memory"

Cause: GPU doesn't have enough memory for the model.

Fix:

The code already uses torch.float16 for memory efficiency. If you still run out of GPU memory, fall back to the CPU (and switch to float32, since half-precision ops are poorly supported on CPU):

# In cached_inference.py, modify model loading:
self.model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float32,  # full precision - half precision is unreliable on CPU
    device_map='cpu'  # Force CPU usage
)

Or use a smaller model like distilgpt2 instead of gpt2.

Issue 5: "pickle.UnpicklingError: invalid load key"

Cause: Corrupted cache data or version mismatch between pickle writes and reads.

Fix:

# Clear the cache completely
docker exec inference-cache redis-cli FLUSHDB

# Run test again
python test_cache.py

Next Steps

Now that you have a working inference cache, consider these enhancements:

1. Implement Semantic Caching

Instead of exact prompt matching, use embedding similarity to cache semantically similar prompts. This increases cache hit rates for paraphrased queries.

from sentence_transformers import SentenceTransformer
import numpy as np

# Add to KVCacheManager.__init__
self.embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
self.prompt_embeddings = {}  # cache_key -> embedding of the prompt that created it

# Add alongside _generate_key to reuse entries for semantically similar prompts
def _generate_key_semantic(self, prompt: str, layer_idx: int, threshold: float = 0.9):
    embedding = self.embedding_model.encode(prompt)
    # Search previously cached prompts for a close-enough match (cosine similarity)
    for key, cached_emb in self.prompt_embeddings.items():
        similarity = np.dot(embedding, cached_emb) / (
            np.linalg.norm(embedding) * np.linalg.norm(cached_emb))
        if similarity > threshold:
            return key  # Reuse the existing cache entry
    # No match - fall back to the exact-match key and remember this embedding
    new_key = self._generate_key(prompt, layer_idx)
    self.prompt_embeddings[new_key] = embedding
    return new_key

2. Add Cache Warming

Pre-populate the cache with common queries during deployment:

def warm_cache(model, common_prompts):
    """Pre-cache frequently used prompts."""
    for prompt in common_prompts:
        model.generate_with_cache(prompt)

3. Implement Cache Eviction Policies

Beyond TTL, configure an eviction policy such as LRU (Least Recently Used) or LFU (Least Frequently Used) so Redis stays within a fixed memory budget while keeping hot entries cached, as shown in the sketch below.
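
A minimal sketch of setting an eviction policy from Python with redis-py (the 2gb limit is an arbitrary example value):

import redis

client = redis.Redis(host="localhost", port=6379)

# Cap Redis memory and evict least-recently-used keys once the cap is reached.
client.config_set("maxmemory", "2gb")                  # example value - size to your workload
client.config_set("maxmemory-policy", "allkeys-lru")   # or allkeys-lfu

print(client.config_get("maxmemory-policy"))

Note that CONFIG SET changes are not persisted across container restarts; bake them into the container's startup command or redis.conf for production.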

Building a Local LLM Evaluation Pipeline with Ragas and Ollama

What You'll Build

A working evaluation pipeline that tests your RAG (Retrieval-Augmented Generation) system locally using Ragas metrics and Ollama models. No OpenAI API keys required, no cloud dependencies—everything runs on your machine.

You'll create a Python script that takes questions, retrieves context from a document store, generates answers with a local LLM, and scores them across four key metrics: faithfulness, answer relevancy, context precision, and context recall. This approach costs nothing, keeps your data local, and provides repeatable evaluation results. You'll walk away with a complete pipeline you can adapt to evaluate your own RAG systems.

Prerequisites

  • Python 3.10+ (check with python --version)
  • pip package manager (included with Python)
  • Ollama installed and running - Download from https://ollama.ai/download
  • 8GB+ RAM (16GB recommended for smoother operation)
  • 5GB free disk space (for model downloads)
  • Basic understanding of RAG - you've built or used one before
  • Estimated time: 45-60 minutes

Step-by-Step Instructions

Step 1: Set Up Your Environment

Create a directory and virtual environment to isolate dependencies:

mkdir rag-eval-local
cd rag-eval-local
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Expected output: Your terminal prompt shows (venv) at the beginning, indicating the virtual environment is active.

Step 2: Install Required Packages

Install Ragas and the Langchain Ollama integration:

pip install ragas==0.1.9 langchain-community==0.2.10 langchain-ollama==0.1.0 datasets==2.14.0

What each package does:

  • ragas - Evaluation framework providing RAG metrics
  • langchain-ollama - Connects Langchain to local Ollama models
  • langchain-community - Required dependency for Langchain integrations
  • datasets - Handles data formatting for Ragas

Expected output: Successful installation messages ending with "Successfully installed ragas-0.1.9..." (installation takes 1-2 minutes).

Step 3: Pull the Required Ollama Models

Download the LLM for evaluation and the embedding model (takes 10-15 minutes depending on connection speed):

# Pull the instruction-tuned LLM for evaluation (4.1GB)
ollama pull mistral:7b-instruct

# Pull the embedding model (274MB)
ollama pull nomic-embed-text

Expected output:

pulling manifest
pulling 4a03f83c5f0d... 100% ▕████████████████▏ 4.1 GB
pulling e6836092461f... 100% ▕████████████████▏  7.7 KB
pulling 4a03f83c5f0e... 100% ▕████████████████▏   11 KB
success

Why these models: Mistral 7B Instruct is fast enough for local evaluation while maintaining good quality. Nomic-embed-text is an open embedding model optimized for retrieval tasks and runs efficiently on Ollama.

Verify the models are ready:

ollama list

You should see both models listed with their sizes.

Step 4: Create Sample Data

Create sample_data.py with test data. In production, you'd load this from your actual RAG system, but this lets us focus on evaluation mechanics:

# sample_data.py
from datasets import Dataset

# Sample RAG outputs to evaluate
# Each entry represents: question asked, answer generated, 
# contexts retrieved, and ground truth answer
eval_data = {
    "question": [
        "What is the capital of France?",
        "How does photosynthesis work?",
        "What are the main causes of climate change?"
    ],
    "answer": [
        "The capital of France is Paris, a major European city known for its art, fashion, and culture.",
        "Photosynthesis is the process where plants use sunlight, water, and carbon dioxide to create oxygen and energy in the form of sugar.",
        "Climate change is primarily caused by human activities that release greenhouse gases, especially burning fossil fuels, deforestation, and industrial processes."
    ],
    "contexts": [
        ["Paris is the capital and most populous city of France. It is located in north-central France."],
        ["Photosynthesis is a process used by plants and other organisms to convert light energy into chemical energy. It uses carbon dioxide and water, producing glucose and oxygen."],
        ["The primary cause of climate change is the burning of fossil fuels like coal, oil, and gas, which releases carbon dioxide. Deforestation also contributes significantly."]
    ],
    "ground_truth": [
        "Paris is the capital of France.",
        "Photosynthesis converts light energy into chemical energy using CO2 and water, producing glucose and oxygen.",
        "Climate change is mainly caused by burning fossil fuels and deforestation, which release greenhouse gases."
    ]
}

# Convert to Ragas-compatible dataset format
dataset = Dataset.from_dict(eval_data)

Data structure explained: Ragas expects contexts as a list of lists (each question can have multiple retrieved context chunks), while other fields are simple lists. This mirrors what your RAG system produces.
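
A quick structural check before running the evaluation can catch formatting mistakes early; this small snippet only uses the sample data defined above:

# check_data.py - sanity-check the dataset shape before evaluating
from sample_data import dataset

assert len(dataset) == 3
assert isinstance(dataset[0]["contexts"], list)  # contexts must be a list per question
print(dataset.column_names)  # ['question', 'answer', 'contexts', 'ground_truth']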

Step 5: Configure Ragas with Ollama

Create the main evaluation script. This connects Ragas to your local Ollama models instead of cloud APIs.

Create evaluate_rag.py:

# evaluate_rag.py
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from langchain_ollama import ChatOllama, OllamaEmbeddings
from sample_data import dataset

# Configure the Ollama LLM for evaluation
llm = ChatOllama(
    model="mistral:7b-instruct",
    temperature=0,  # Deterministic outputs for consistent evaluation
    num_ctx=4096,   # Context window size
)

# Configure the embedding model
embeddings = OllamaEmbeddings(
    model="nomic-embed-text"
)

# Run evaluation with local models
# The llm and embeddings parameters override Ragas' default OpenAI models
result = evaluate(
    dataset,
    metrics=[
        faithfulness,        # Is answer grounded in context?
        answer_relevancy,    # Does answer address the question?
        context_precision,   # Are retrieved contexts relevant?
        context_recall,      # Do contexts contain needed info?
    ],
    llm=llm,
    embeddings=embeddings,
)

# Display results
print("\n=== Evaluation Results ===")
print(result)

# Save detailed per-question results
result_df = result.to_pandas()
result_df.to_csv("evaluation_results.csv", index=False)
print("\nDetailed results saved to evaluation_results.csv")

Key configuration choices:

  • temperature=0 makes evaluation deterministic—you want consistent scores across runs
  • num_ctx=4096 provides enough context window for Ragas' evaluation prompts
  • We pass our local models directly to evaluate() instead of using default OpenAI models

Step 6: Run the Evaluation

Verify Ollama is running (it should auto-start after installation), then execute the evaluation script:

# Check Ollama is running
ollama list

# Run the evaluation
python evaluate_rag.py

Expected output:

Evaluating: 100%|████████████████████| 12/12 [00:45<00:00,  3.78s/it]

=== Evaluation Results ===
{'faithfulness': 0.8333, 'answer_relevancy': 0.9123, 'context_precision': 0.8889, 'context_recall': 0.9167}

Detailed results saved to evaluation_results.csv

What just happened: Ragas evaluated 12 items (4 metrics × 3 questions). Each metric uses the LLM to judge quality through carefully designed prompts. Execution time varies by hardware—expect 30-60 seconds on modern machines.

Step 7: Understand Your Metrics

Open evaluation_results.csv to see per-question scores. Here's what each metric measures:

  • Faithfulness (0-1): Is the answer grounded in the retrieved context? Scores near 1 mean no hallucinations.
  • Answer Relevancy (0-1): Does the answer actually address the question asked?
  • Context Precision (0-1): Are the retrieved contexts relevant to answering the question?
  • Context Recall (0-1): Do the contexts contain all information needed to answer?

Create a visualization script to better understand the results. Create visualize.py:

# visualize.py
import pandas as pd
import matplotlib.pyplot as plt

# Load evaluation results
df = pd.read_csv("evaluation_results.csv")

# Calculate average scores across all questions
metrics = ['faithfulness', 'answer_relevancy', 'context_precision', 'context_recall']
scores = df[metrics].mean()

# Create bar chart
plt.figure(figsize=(10, 6))
plt.bar(metrics, scores, color='steelblue')
plt.ylim(0, 1)
plt.ylabel('Score')
plt.title('RAG System Evaluation Metrics')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.savefig('metrics.png', dpi=150)
print("Visualization saved to metrics.png")

Install matplotlib if needed, then run the visualization:

pip install matplotlib==3.7.1
python visualize.py

Open metrics.png to see your results visualized. This makes it easier to spot which aspects of your RAG system need improvement.

Verification

Confirm everything worked correctly with these checks:

  1. Check the CSV exists and has content:
    ls -lh evaluation_results.csv
    head evaluation_results.csv
    
    You should see a file with multiple columns including your metrics and scores.
  2. Verify scores are reasonable: Open evaluation_results.csv—all metric scores should be between 0 and 1. If you see NaN, null, or values outside this range, something failed.
  3. Test with intentionally bad data: Modify sample_data.py to add a clearly wrong answer:
    # Add this to the end of the "answer" list in sample_data.py
    "The capital of France is London."
    # Add corresponding question, contexts, and ground_truth entries
    
    Run python evaluate_rag.py again—faithfulness should drop significantly for that question.
  4. Test consistency: Run python evaluate_rag.py three times with the original data. Scores should be identical or vary by less than 0.02 due to temperature=0.

Success looks like: You can run the evaluation repeatedly and get consistent scores, the CSV contains per-question breakdowns, and intentionally bad answers score lower than good ones.

Common Issues & Fixes

Issue 1: "Connection refused" or "Ollama not found"

Error message:

requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=11434): 
Max retries exceeded with url: /api/generate

Cause: Ollama service isn't running.

Fix: Start Ollama manually:

# macOS/Linux
ollama serve

# Or check if it's already running
ps aux | grep ollama

# Windows - Ollama should run as a system service
# Check system tray for Ollama icon

On macOS, Ollama typically runs as a background service after installation. If ollama list works, the service is running.

Issue 2: Evaluation hangs or takes extremely long

Symptoms: Progress bar stuck at 0% for 5+ minutes, or system becomes unresponsive.

Cause: Model not fully loaded in memory, or system is swapping to disk.

Fix: Cancel with Ctrl+C and verify models load properly:

# Test model responds quickly
ollama run mistral:7b-instruct "What is 2+2?"
# Should respond within 5-10 seconds

If the model loads slowly or your system is swapping, try a smaller quantized model:

# Pull 4-bit quantized version (smaller, faster)
ollama pull mistral:7b-instruct-q4_0

# Update evaluate_rag.py to use it
# Change: model="mistral:7b-instruct-q4_0"

Issue 3: Low or inconsistent scores on clearly good answers

Symptoms: Faithfulness of 0.3 when answer clearly matches context, or scores vary wildly between runs.

Cause: Model may be misinterpreting Ragas' evaluation prompts.

Fix: Try an alternative model that handles instruction-following better:

# Pull Llama 3 (often more reliable for evaluation)
ollama pull llama3:8b-instruct

# Update evaluate_rag.py
# Change: model="llama3:8b-instruct"

Or re-pull the current model to ensure you have the latest version:

ollama pull mistral:7b-instruct

Issue 4: ImportError or ModuleNotFoundError

Error message:

ModuleNotFoundError: No module named 'ragas'

Cause: Virtual environment not activated, or packages not installed.

Fix: Ensure virtual environment is active and reinstall:

# Activate virtual environment
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Verify activation - should show venv path
which python

# Reinstall packages
pip install ragas==0.1.9 langchain-community==0.2.10 langchain-ollama==0.1.0 datasets==2.14.0

Next Steps

Now that you have a working local evaluation pipeline, here's how to extend it:

  • Integrate with your actual RAG system: Replace sample_data.py with outputs from your production system. Export questions, generated answers, retrieved contexts, and ground truth answers in the same format.
  • Add custom metrics: Ragas supports custom evaluators for domain-specific requirements. See the official documentation at https://docs.ragas.io/en/latest/concepts/metrics/custom.html
  • Batch evaluation: Process larger datasets by loading them from CSV or JSON files. Ragas handles batching automatically, but consider chunking very large datasets (1000+ questions) to avoid memory issues.
  • Track metrics over time: Store results in a database (SQLite works well) to monitor how system changes affect metrics. This helps you catch regressions early; see the sketch after this list.
  • Compare different approaches: Evaluate different chunking strategies, embedding models, or retrieval methods by swapping out the contexts while keeping questions constant. This isolates what actually improves performance.
  • Automate with CI/CD: Add evaluation to your testing pipeline to catch quality regressions before deployment.
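
Here's a minimal sketch of the metric-tracking idea from the list above; the eval_history.db file and eval_runs table are illustrative names, and it reuses the CSV produced by evaluate_rag.py:

# track_metrics.py - append one run's average scores to a local SQLite history
import sqlite3
import time
import pandas as pd

def log_run(csv_path="evaluation_results.csv", db_path="eval_history.db"):
    df = pd.read_csv(csv_path)
    metrics = ['faithfulness', 'answer_relevancy', 'context_precision', 'context_recall']
    averages = df[metrics].mean()

    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS eval_runs (
            run_at TEXT,
            faithfulness REAL,
            answer_relevancy REAL,
            context_precision REAL,
            context_recall REAL
        )
    """)
    conn.execute(
        "INSERT INTO eval_runs VALUES (?, ?, ?, ?, ?)",
        (time.strftime("%Y-%m-%d %H:%M:%S"), *[float(averages[m]) for m in metrics]),
    )
    conn.commit()
    conn.close()

if __name__ == "__main__":
    log_run()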

Local models score differently than GPT-4 would, but they're consistent and free. This makes them ideal for iterative development where you need to run hundreds of evaluations. You can always run a final validation with commercial APIs once you've narrowed down your best approach.

Helpful resources:

If you hit issues not covered here, the Ragas GitHub Issues page is actively maintained, and the Langchain community forums have good coverage of Ollama integration questions.

Build a Multi-Region AI Inference Pipeline with Global Infrastructure

What You'll Build

By the end of this tutorial, you'll have a working multi-region inference pipeline that intelligently routes requests between cloud regions. Your FastAPI service will select the optimal region based on real-time latency measurements and cost optimization, with automatic failover when regions become unavailable.

This architecture matters because compute pricing varies across regions. You'll build a system that balances cost differences with latency requirements. The tutorial uses local mock servers for testing before deploying to real cloud infrastructure, so you can validate the routing logic without incurring cloud costs during development.

Prerequisites

  • Python 3.10+ installed locally (python --version to check)
  • pip package manager
  • curl for testing endpoints
  • jq for parsing JSON responses: brew install jq (macOS) or apt-get install jq (Ubuntu)
  • AWS Account (optional for Step 8 - real deployment only)
  • AWS CLI v2 (optional for Step 8): aws --version
  • Terraform 1.6+ (optional for Step 8): terraform --version
  • Basic understanding of REST APIs and async Python
  • Estimated time: 60-90 minutes (Steps 1-7), additional 30-60 minutes for Step 8 if deploying to cloud

Install core Python dependencies:

pip install fastapi==0.104.1 uvicorn==0.24.0 httpx==0.25.0 pydantic==2.5.0

Expected output:

Successfully installed fastapi-0.104.1 uvicorn-0.24.0 httpx-0.25.0 pydantic-2.5.0

Step-by-Step Instructions

Step 1: Set Up the Project Structure

Create the project directory and file structure:

mkdir multi-region-inference
cd multi-region-inference
mkdir -p src terraform
touch src/__init__.py src/main.py src/region_router.py src/region_config.py

Verify the structure:

tree -L 2

Expected output:

.
├── src
│   ├── __init__.py
│   ├── main.py
│   ├── region_config.py
│   └── region_router.py
└── terraform

What just happened: You've created a standard Python project layout. The src directory contains your application code, and terraform will hold infrastructure-as-code files if you deploy to real cloud regions in Step 8.

Step 2: Create the Region Configuration

Create src/region_config.py:

import os
from typing import Dict, List
from pydantic import BaseModel


class RegionEndpoint(BaseModel):
    """Configuration for a single region endpoint"""
    name: str
    endpoint: str
    cost_per_1k_tokens: float  # Cost in USD
    priority: int  # Lower number = higher priority
    provider: str  # "aws" or "azure"


# Region configurations with representative cost values
REGIONS: Dict[str, RegionEndpoint] = {
    "us-east-1": RegionEndpoint(
        name="us-east-1",
        endpoint=os.getenv("AWS_US_EAST_ENDPOINT", "http://localhost:8001"),
        cost_per_1k_tokens=0.50,
        priority=3,
        provider="aws"
    ),
    "me-south-1": RegionEndpoint(
        name="me-south-1",  # AWS Bahrain - Middle East
        endpoint=os.getenv("AWS_ME_SOUTH_ENDPOINT", "http://localhost:8002"),
        cost_per_1k_tokens=0.35,
        priority=1,
        provider="aws"
    ),
    "ap-south-1": RegionEndpoint(
        name="ap-south-1",  # AWS Mumbai - South Asia
        endpoint=os.getenv("AWS_AP_SOUTH_ENDPOINT", "http://localhost:8003"),
        cost_per_1k_tokens=0.32,
        priority=1,
        provider="aws"
    ),
    "brazilsouth": RegionEndpoint(
        name="brazilsouth",  # Azure Brazil
        endpoint=os.getenv("AZURE_BRAZIL_ENDPOINT", "http://localhost:8004"),
        cost_per_1k_tokens=0.38,
        priority=2,
        provider="azure"
    ),
}


def get_sorted_regions() -> List[RegionEndpoint]:
    """Returns regions sorted by priority, then by cost"""
    return sorted(
        REGIONS.values(),
        key=lambda x: (x.priority, x.cost_per_1k_tokens)
    )

What just happened: You've defined configurations for four regions with different cost profiles. The priority system favors lower-cost regions (me-south-1 and ap-south-1) while maintaining fallback options. Endpoints default to localhost for local testing but can be overridden with environment variables for production deployment.
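
You can sanity-check the ordering from a Python shell at the project root; the expected order below follows directly from the priority and cost values defined above:

# Quick check of the configured routing order
from src.region_config import get_sorted_regions

for region in get_sorted_regions():
    print(region.name, region.priority, region.cost_per_1k_tokens)
# Expected order: ap-south-1, me-south-1, brazilsouth, us-east-1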

Step 3: Build the Smart Region Router

Create src/region_router.py:

import asyncio
import time
from typing import Optional, Dict, Any
import httpx
from src.region_config import get_sorted_regions, RegionEndpoint


class RegionRouter:
    """Routes inference requests to optimal regions based on latency and cost"""
    
    def __init__(self, timeout: float = 5.0):
        self.timeout = timeout
        self.latency_cache: Dict[str, float] = {}
        self.failure_count: Dict[str, int] = {}
        
    async def measure_latency(self, region: RegionEndpoint) -> Optional[float]:
        """
        Ping region endpoint to measure actual latency.
        Returns latency in seconds, or None if region is unreachable.
        """
        try:
            start = time.time()
            async with httpx.AsyncClient(timeout=self.timeout) as client:
                response = await client.get(f"{region.endpoint}/health")
                if response.status_code == 200:
                    latency = time.time() - start
                    self.latency_cache[region.name] = latency
                    self.failure_count[region.name] = 0  # healthy - reset consecutive failures
                    return latency
        except Exception as e:
            print(f"Failed to reach {region.name}: {e}")
            # Drop any stale latency reading so an unreachable region is excluded immediately
            self.latency_cache.pop(region.name, None)
            self.failure_count[region.name] = self.failure_count.get(region.name, 0) + 1
            return None
    
    async def select_best_region(self) -> Optional[RegionEndpoint]:
        """
        Select optimal region based on:
        1. Priority (cost tier)
        2. Measured latency
        3. Failure history (exclude regions with 3+ consecutive failures)
        """
        regions = get_sorted_regions()
        
        # Measure latency for all regions concurrently
        latency_tasks = [self.measure_latency(r) for r in regions]
        await asyncio.gather(*latency_tasks)
        
        # Filter out unhealthy regions
        available_regions = [
            r for r in regions 
            if self.failure_count.get(r.name, 0) < 3
            and self.latency_cache.get(r.name) is not None
        ]
        
        if not available_regions:
            print("WARNING: No healthy regions available!")
            return None
        
        # Calculate weighted score: priority + normalized latency
        def score_region(region: RegionEndpoint) -> float:
            latency = self.latency_cache.get(region.name, 999)
            # Lower score is better
            return region.priority + (latency * 10)
        
        best_region = min(available_regions, key=score_region)
        latency_ms = self.latency_cache.get(best_region.name, 0) * 1000
        print(f"Selected region: {best_region.name} "
              f"(latency: {latency_ms:.0f}ms, cost: ${best_region.cost_per_1k_tokens}/1k tokens)")
        return best_region
    
    async def route_inference(self, payload: Dict[str, Any]) -> Dict[str, Any]:
        """
        Route inference request to best available region.
        Returns response with metadata about selected region.
        """
        best_region = await self.select_best_region()
        
        if not best_region:
            raise Exception("No healthy regions available for inference")
        
        try:
            async with httpx.AsyncClient(timeout=30.0) as client:
                response = await client.post(
                    f"{best_region.endpoint}/v1/inference",
                    json=payload
                )
                response.raise_for_status()
                result = response.json()
                
                # Add routing metadata to response
                result["_metadata"] = {
                    "region": best_region.name,
                    "provider": best_region.provider,
                    "cost_per_1k": best_region.cost_per_1k_tokens
                }
                return result
        except Exception as e:
            print(f"Inference failed on {best_region.name}: {e}")
            self.failure_count[best_region.name] = self.failure_count.get(best_region.name, 0) + 1
            raise

What just happened: This is the core routing engine. It measures real latency to each region using concurrent health checks, maintains a failure count to avoid repeatedly trying dead regions, and selects the optimal region using a weighted scoring system that balances cost priority with actual latency. The route_inference method handles the actual request forwarding and adds metadata about which region processed the request.
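
To make the scoring concrete, here is a tiny standalone calculation using the same formula as score_region; the latency values are illustrative, not measurements:

# score = priority + latency_seconds * 10, lower wins
regions = {
    "us-east-1":   {"priority": 3, "latency": 0.102},
    "me-south-1":  {"priority": 1, "latency": 0.062},
    "ap-south-1":  {"priority": 1, "latency": 0.051},
    "brazilsouth": {"priority": 2, "latency": 0.081},
}
scores = {name: cfg["priority"] + cfg["latency"] * 10 for name, cfg in regions.items()}
print(min(scores, key=scores.get))  # ap-south-1 (1 + 0.51 = 1.51)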

Step 4: Create the FastAPI Application

Create src/main.py:

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import Dict, Any
from src.region_router import RegionRouter

app = FastAPI(
    title="Multi-Region Inference API",
    description="Intelligent routing for AI inference across global regions"
)
router = RegionRouter()


class InferenceRequest(BaseModel):
    """Request schema for inference endpoint"""
    prompt: str
    max_tokens: int = 100
    temperature: float = 0.7


class InferenceResponse(BaseModel):
    """Response schema with result and routing metadata"""
    result: str
    metadata: Dict[str, Any]


@app.get("/health")
async def health_check():
    """Health check endpoint for load balancers"""
    return {"status": "healthy", "service": "multi-region-router"}


@app.post("/v1/inference", response_model=InferenceResponse)
async def inference(request: InferenceRequest):
    """
    Main inference endpoint - automatically routes to optimal region.
    
    The router selects the best region based on cost and latency,
    then forwards the request and returns the result with metadata.
    """
    try:
        payload = request.model_dump()  # pydantic v2 replacement for the deprecated .dict()
        result = await router.route_inference(payload)
        return InferenceResponse(
            result=result.get("output", ""),
            metadata=result.get("_metadata", {})
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


@app.get("/regions/status")
async def region_status():
    """
    Debug endpoint showing current region health metrics.
    Useful for monitoring and troubleshooting routing decisions.
    """
    return {
        "latency_cache": router.latency_cache,
        "failure_count": router.failure_count
    }


if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

What just happened: You've created a FastAPI application that wraps the region router. The /v1/inference endpoint accepts inference requests and automatically routes them to the optimal region. The /regions/status endpoint exposes internal routing metrics for debugging and monitoring.

Step 5: Create Mock Regional Endpoints for Testing

Create mock_regional_server.py in the project root directory:

"""
Mock regional inference server for local testing.
Simulates a real inference endpoint with configurable latency.
"""
import asyncio
import sys
import time
import random
from fastapi import FastAPI
import uvicorn


def create_mock_server(region_name: str, latency_ms: int):
    """Create a mock server that simulates a regional endpoint"""
    app = FastAPI()
    
    @app.get("/health")
    async def health():
        # Simulate network latency
        await asyncio.sleep(latency_ms / 1000.0)
        return {"status": "healthy", "region": region_name}
    
    @app.post("/v1/inference")
    async def inference(payload: dict):
        # Simulate inference processing time
        await asyncio.sleep(random.uniform(0.5, 1.5))
        return {
            "output": f"Mock response from {region_name}: {payload.get('prompt', '')[:50]}...",
            "tokens_used": payload.get('max_tokens', 100)
        }
    
    return app


if __name__ == "__main__":
    # Parse command line arguments
    region = sys.argv[1] if len(sys.argv) > 1 else "us-east-1"
    port = int(sys.argv[2]) if len(sys.argv) > 2 else 8001
    latency = int(sys.argv[3]) if len(sys.argv) > 3 else 50
    
    app = create_mock_server(region, latency)
    print(f"Starting mock server for {region} on port {port} (simulated latency: {latency}ms)")
    uvicorn.run(app, host="0.0.0.0", port=port, log_level="warning")

What just happened: This creates mock regional inference endpoints with configurable latency characteristics. Each mock server simulates a real regional deployment, allowing you to test the routing logic locally without deploying to actual cloud infrastructure. The latency parameter lets you simulate geographic distance.

Step 6: Test the Multi-Region Router Locally

You'll need 5 terminal windows for this step. Open them all and navigate to your project directory in each.

Sub-step 6a: Start the mock regional servers

In terminal 1 (US East - higher latency):

python mock_regional_server.py us-east-1 8001 100

In terminal 2 (Middle East - lower latency):

python mock_regional_server.py me-south-1 8002 60

In terminal 3 (South Asia - lowest latency):

python mock_regional_server.py ap-south-1 8003 50

In terminal 4 (Brazil - medium latency):

python mock_regional_server.py brazilsouth 8004 80

Expected output in each terminal:

Starting mock server for [region] on port [port] (simulated latency: [X]ms)
INFO:     Started server process [PID]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:[port]

Sub-step 6b: Start the main router service

In terminal 5, from the project root (using -m so the src package imports resolve):

python -m src.main

Expected output:

INFO:     Started server process [12345]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

What just happened: You now have a complete simulated multi-region environment running locally. Four mock servers represent different geographic regions with realistic latency profiles, and the main router service is ready to intelligently distribute requests among them.

Step 7: Send Test Requests and Verify Routing

Sub-step 7a: Test a single inference request

Open a new terminal (terminal 6) and send a test request:

curl -X POST http://localhost:8000/v1/inference \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Explain quantum computing in simple terms",
    "max_tokens": 150,
    "temperature": 0.7
  }'

Expected output:

{
  "result": "Mock response from ap-south-1: Explain quantum computing in simple terms...",
  "metadata": {
    "region": "ap-south-1",
    "provider": "aws",
    "cost_per_1k": 0.32
  }
}

In terminal 5 (where the router is running), you should see:

Selected region: ap-south-1 (latency: 52ms, cost: $0.32/1k tokens)

Sub-step 7b: Check region health status

curl -s http://localhost:8000/regions/status | jq

Expected output:

{
  "latency_cache": {
    "us-east-1": 0.102,
    "me-south-1": 0.062,
    "ap-south-1": 0.051,
    "brazilsouth": 0.081
  },
  "failure_count": {}
}

Sub-step 7c: Test failover behavior

Stop the ap-south-1 server (press Ctrl+C in terminal 3), then send another request:

curl -X POST http://localhost:8000/v1/inference \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Test failover", "max_tokens": 50}'

Expected output: The response should now come from me-south-1 (the next best region):

{
  "result": "Mock response from me-south-1: Test failover...",
  "metadata": {
    "region": "me-south-1",
    "provider": "aws",
    "cost_per_1k": 0.35
  }
}

Restart the ap-south-1 server in terminal 3 for the next steps.

What just happened: You've verified that the router correctly selects the lowest-cost, lowest-latency region (ap-south-1) under normal conditions, and automatically fails over to the next best region when the primary becomes unavailable. This demonstrates the core value proposition: cost optimization with reliability.

Step 8: Deploy to Real Cloud Regions (Optional)

Note: This step requires an AWS account and will incur cloud infrastructure costs. Skip this step if you want to stay with local testing only.

Sub-step 8a: Install additional prerequisites

pip install boto3==1.34.0

Verify AWS CLI is configured:

aws sts get-caller-identity

Expected output:

{
    "UserId": "AIDAXXXXXXXXXXXXXXXXX",
    "Account": "123456789012",
    "Arn": "arn:aws:iam::123456789012:user/your-username"
}

Sub-step 8b: Enable required AWS regions

Some regions like me-south-1 require manual opt-in. Enable them via AWS Console or CLI:

# Check which regions are enabled
aws account list-regions --region-opt-status-contains ENABLED ENABLED_BY_DEFAULT

# Enable Middle East region (if not already enabled)
aws account enable-region --region-name me-south-1

Wait 5-10 minutes for the region to become fully available.

Sub-step 8c: Create Terraform configuration

Create terraform/main.tf:

terraform {
  required_version = ">= 1.6"
  
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

# Define provider for each region
provider "aws" {
  alias  = "us_east"
  region = "us-east-1"
}

provider "aws" {
  alias  = "me_south"
  region = "me-south-1"
}

provider "aws" {
  alias  = "ap_south"
  region = "ap-south-1"
}

# Example: Create ECS cluster in each region
# This is a simplified example - production requires security groups,
# load balancers, auto-scaling, and monitoring

resource "aws_ecs_cluster" "inference_us_east" {
  provider = aws.us_east
  name     = "inference-cluster-us-east-1"
}

resource "aws_ecs_cluster" "inference_me_south" {
  provider = aws.me_south
  name     = "inference-cluster-me-south-1"
}

resource "

Wednesday, February 18, 2026

Hands-On: Accessing DeepMind's AlphaFold 3 API Through India's National AI Partnership

What You'll Build

By the end of this tutorial, you'll have a working Python script that queries DeepMind's AlphaFold 3 protein structure prediction API through the newly announced National Partnerships for AI initiative. You'll authenticate using the academic access program, submit a protein sequence, and retrieve a 3D structure prediction in PDB format that you can visualize locally.

Important clarification: As of the latest public information, DeepMind has not announced a direct AlphaFold 3 API accessible through India's National AI Partnership. This tutorial demonstrates the conceptual workflow for how such an integration would work, based on existing Google Cloud patterns and the AlphaFold Server academic access program. The actual implementation may differ when official API access becomes available.

This matters because partnerships like these could give researchers and educators access to frontier AI models. If you're working in computational biology, drug discovery, or AI-for-science education, understanding this workflow prepares you for when such access becomes available. The authentication patterns and API interaction methods shown here follow standard Google Cloud practices used across their AI services.

Prerequisites

  • Python 3.10+ installed (check with python --version or python3 --version)
  • pip package manager (usually included with Python)
  • Academic or institutional email from an Indian university/research institution (for partnership programs when available)
  • Google Cloud account - https://cloud.google.com/free (free tier includes $300 credit)
  • pip packages: requests, biopython, google-auth, google-auth-oauthlib, google-auth-httplib2
  • PyMOL or Mol* for visualization (optional but recommended for viewing structures)
  • Basic command line familiarity (navigating directories, running scripts)
  • Estimated time: 45-60 minutes including account setup

Step-by-Step Instructions

Step 1: Register for AlphaFold Server Academic Access

Currently, DeepMind provides academic access through the AlphaFold Server. Navigate to the official access portal and complete the registration form.

Actions:

  1. Visit the AlphaFold Server: https://alphafoldserver.com/
  2. Click "Sign in" or "Request Access"
  3. Use your institutional email address (.ac.in or .edu.in domain for Indian institutions)
  4. Complete the academic verification form with your research details
  5. Wait for approval email (typically 24-72 hours)

What just happened: You've requested access to the academic tier of AlphaFold services. Academic programs often provide free or subsidized access to computational resources that would otherwise require significant infrastructure investment.

Note: Keep your approval email—it may contain specific API endpoints or access tokens needed for programmatic access.

Step 2: Set Up Google Cloud Project and Authentication

To interact with Google Cloud services programmatically, you need a project and service account credentials.

2a. Install Google Cloud SDK:

# For Linux/macOS:
curl https://sdk.cloud.google.com | bash

# Restart your shell
exec -l $SHELL

# For Windows: Download installer from
# https://cloud.google.com/sdk/docs/install

2b. Initialize and authenticate:

# Initialize gcloud (follow prompts to select/create project)
gcloud init

# Authenticate with your Google account
gcloud auth login

# Set up application default credentials
gcloud auth application-default login

Expected output:

You are now logged in as [your-email@domain.com].
Your current project is [your-project-id].

2c. Create a service account:

# Capture your project ID
export PROJECT_ID=$(gcloud config get-value project)

# Create service account
gcloud iam service-accounts create alphafold-access \
    --display-name="AlphaFold API Access" \
    --project=$PROJECT_ID

# Generate and download key file
gcloud iam service-accounts keys create ~/alphafold-key.json \
    --iam-account=alphafold-access@${PROJECT_ID}.iam.gserviceaccount.com

Expected output:

created key [a1b2c3d4e5f6] of type [json] as [/home/username/alphafold-key.json]

What just happened: You created a service account—a special type of Google account intended for applications rather than humans. The JSON key file contains credentials your Python script will use to authenticate. Store this file securely and never commit it to version control.
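
If you prefer not to hard-code the key path in scripts, the standard Google Cloud convention is to point the GOOGLE_APPLICATION_CREDENTIALS environment variable at the key file; client libraries that use Application Default Credentials pick it up automatically:

# Make the key discoverable via Application Default Credentials
export GOOGLE_APPLICATION_CREDENTIALS=~/alphafold-key.json

# Optional: persist it across shell sessions
echo 'export GOOGLE_APPLICATION_CREDENTIALS=~/alphafold-key.json' >> ~/.bashrc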

Step 3: Set Up Python Environment and Install Dependencies

Create an isolated Python environment to prevent package conflicts.

# Create virtual environment
python3 -m venv alphafold-env

# Activate it
# On Linux/macOS:
source alphafold-env/bin/activate

# On Windows:
# alphafold-env\Scripts\activate

# Verify activation (you should see (alphafold-env) in your prompt)
which python

# Install required packages
pip install --upgrade pip
pip install requests==2.31.0 biopython==1.83 google-auth==2.27.0 google-auth-oauthlib==1.2.0 google-auth-httplib2==0.2.0

Expected output:

Successfully installed requests-2.31.0 biopython-1.83 google-auth-2.27.0 google-auth-oauthlib-1.2.0 google-auth-httplib2-0.2.0

What just happened: You've installed the HTTP client (requests), biological sequence handling library (biopython), and Google authentication libraries needed to securely communicate with Google Cloud APIs.
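
A quick sanity check confirms the packages import cleanly before moving on (note that Biopython's import name is Bio):

python -c "import requests, Bio, google.auth; print('All dependencies import OK')"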

Step 4: Create the Protein Structure Prediction Script

Build the core script that handles authentication, API communication, and result processing. Create a new file called predict_structure.py.

#!/usr/bin/env python3
"""
AlphaFold API Interaction Script
Demonstrates protein structure prediction workflow
"""

import os
import sys
import requests
import json
from google.auth.transport.requests import Request
from google.oauth2 import service_account

# ============================================================================
# CONFIGURATION
# ============================================================================

# Path to your service account key file
SERVICE_ACCOUNT_FILE = os.path.expanduser('~/alphafold-key.json')

# API endpoint (NOTE: This is a placeholder - actual endpoint will be provided
# when API access is granted through official channels)
API_ENDPOINT = 'https://alphafoldserver.com/api/v1/predict'

# ============================================================================
# AUTHENTICATION
# ============================================================================

def authenticate():
    """
    Authenticate using service account credentials.
    Returns an authenticated credentials object.
    """
    if not os.path.exists(SERVICE_ACCOUNT_FILE):
        print(f"ERROR: Service account key not found at {SERVICE_ACCOUNT_FILE}")
        print("Please ensure you've completed Step 2 and the file exists.")
        sys.exit(1)
    
    try:
        credentials = service_account.Credentials.from_service_account_file(
            SERVICE_ACCOUNT_FILE,
            scopes=['https://www.googleapis.com/auth/cloud-platform']
        )
        # Refresh to get valid token
        credentials.refresh(Request())
        return credentials
    except Exception as e:
        print(f"ERROR: Authentication failed: {e}")
        sys.exit(1)

# ============================================================================
# PREDICTION FUNCTION
# ============================================================================

def predict_structure(sequence, sequence_id="protein_1"):
    """
    Submit a protein sequence for structure prediction.
    
    Args:
        sequence: String of amino acid letters (standard 20 amino acids)
        sequence_id: Identifier for this sequence
    
    Returns:
        Dictionary containing prediction results
    """
    # Validate sequence
    valid_amino_acids = set("ACDEFGHIKLMNPQRSTVWY")
    if not all(aa in valid_amino_acids for aa in sequence.upper()):
        invalid = set(sequence.upper()) - valid_amino_acids
        raise ValueError(f"Invalid amino acids found: {invalid}")
    
    # Authenticate
    print("Authenticating...")
    credentials = authenticate()
    
    # Prepare request payload
    payload = {
        "sequence": sequence.upper(),
        "id": sequence_id
    }
    
    # Set up headers with authentication token
    headers = {
        'Authorization': f'Bearer {credentials.token}',
        'Content-Type': 'application/json'
    }
    
    # Submit prediction request
    print(f"\nSubmitting sequence: {sequence}")
    print(f"Length: {len(sequence)} amino acids")
    print(f"Sequence ID: {sequence_id}")
    print("Calling API (this may take 1-5 minutes)...")
    
    try:
        response = requests.post(
            API_ENDPOINT,
            headers=headers,
            json=payload,
            timeout=600  # 10 minute timeout
        )
        
        # Check response status
        if response.status_code == 200:
            return response.json()
        else:
            print(f"\nERROR {response.status_code}: {response.text}")
            return None
            
    except requests.exceptions.Timeout:
        print("\nERROR: Request timed out. Try a shorter sequence or increase timeout.")
        return None
    except requests.exceptions.RequestException as e:
        print(f"\nERROR: Request failed: {e}")
        return None

# ============================================================================
# MAIN EXECUTION
# ============================================================================

if __name__ == "__main__":
    # Test sequence: Human insulin A-chain (21 amino acids)
    # This is a well-studied small protein, ideal for testing
    test_sequence = "GIVEQCCTSICSLYQLENYCN"
    
    print("=" * 70)
    print("AlphaFold Structure Prediction")
    print("=" * 70)
    
    # Run prediction
    result = predict_structure(test_sequence, sequence_id="insulin_a_chain")
    
    if result:
        # Extract and save PDB structure
        pdb_data = result.get('pdb_content', '')
        confidence = result.get('confidence_score', 'N/A')
        
        if pdb_data:
            output_file = 'insulin_a_chain_predicted.pdb'
            with open(output_file, 'w') as f:
                f.write(pdb_data)
            
            print("\n" + "=" * 70)
            print("✓ SUCCESS!")
            print("=" * 70)
            print(f"Structure saved to: {output_file}")
            print(f"Confidence score: {confidence}")
            print(f"File size: {len(pdb_data)} bytes")
            print("\nNext: Visualize with PyMOL or upload to https://molstar.org/viewer/")
        else:
            print("\nWARNING: No PDB structure data in response")
            print(f"Response keys: {result.keys()}")
    else:
        print("\n✗ Prediction failed. Check error messages above.")
        sys.exit(1)

What this code does:

  • Authentication: Loads your service account credentials and obtains an access token
  • Validation: Checks that your sequence contains only valid amino acid codes
  • API Communication: Sends a POST request with your sequence to the prediction endpoint
  • Result Handling: Saves the returned PDB structure file to disk
  • Error Handling: Provides clear error messages for common failure modes

The insulin A-chain sequence is used as a test case because it's small (fast prediction), well-characterized (you can verify results), and contains interesting structural features (alpha helices and disulfide bonds).
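
If you'd rather test with your own protein than the hard-coded insulin chain, a small Biopython snippet can pull the first record from a FASTA file (the file name my_protein.fasta is just an example):

from Bio import SeqIO

# Read the first record from a FASTA file (file name is an example)
record = next(SeqIO.parse("my_protein.fasta", "fasta"))
sequence = str(record.seq).upper()
print(f"Loaded {record.id}: {len(sequence)} amino acids")

# Then pass it to the prediction function from predict_structure.py:
# result = predict_structure(sequence, sequence_id=record.id)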

Step 5: Run Your First Prediction

With your virtual environment still activated and the script created, execute the prediction:

# Ensure you're in the directory containing predict_structure.py
# and your virtual environment is activated

python predict_structure.py

Expected output (if the API were available):

======================================================================
AlphaFold Structure Prediction
======================================================================
Authenticating...

Submitting sequence: GIVEQCCTSICSLYQLENYCN
Length: 21 amino acids
Sequence ID: insulin_a_chain
Calling API (this may take 1-5 minutes)...

======================================================================
✓ SUCCESS!
======================================================================
Structure saved to: insulin_a_chain_predicted.pdb
Confidence score: N/A
File size: 8432 bytes

Next: Visualize with PyMOL or upload to https://molstar.org/viewer/

What just happened: In this workflow, your script authenticates with Google Cloud, submits a protein sequence, and receives a predicted 3D structure. The confidence score (often reported as pLDDT, the predicted Local Distance Difference Test) indicates prediction reliability: >90 is very high confidence, 70-90 is good, 50-70 is low confidence, <50 is unreliable.
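
AlphaFold-style PDB output conventionally stores the per-residue pLDDT in the B-factor column, so you can compute a rough average confidence with Biopython (assuming the returned file follows that convention):

from Bio.PDB import PDBParser

# Parse the predicted structure (QUIET suppresses format warnings)
parser = PDBParser(QUIET=True)
structure = parser.get_structure("prediction", "insulin_a_chain_predicted.pdb")

# Collect the B-factor of each CA atom (one per residue)
plddt_scores = [
    residue["CA"].get_bfactor()
    for residue in structure.get_residues()
    if "CA" in residue
]

if plddt_scores:
    print(f"Residues scored: {len(plddt_scores)}")
    print(f"Mean pLDDT: {sum(plddt_scores) / len(plddt_scores):.1f}")
    print(f"Lowest-confidence residue score: {min(plddt_scores):.1f}")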

Step 6: Visualize the Predicted Structure

PDB files contain 3D coordinates but need specialized software to visualize.

Option A: PyMOL (Desktop Application)

# If PyMOL is installed, launch with your structure
pymol insulin_a_chain_predicted.pdb

Then in PyMOL's command interface:

hide everything
show cartoon
spectrum count, rainbow, insulin_a_chain_predicted
bg_color white
orient

Option B: Mol* Web Viewer (No Installation Required)

  1. Navigate to https://molstar.org/viewer/
  2. Click "Open Files" in the top-left
  3. Select your insulin_a_chain_predicted.pdb file
  4. The structure will load automatically with default visualization

What you should see: Insulin A-chain typically shows an alpha-helix structure with two disulfide bonds (cysteine-cysteine connections). The structure should appear as a ribbon or cartoon representation, not a tangled mess of atoms. Look for regular helical turns—this indicates the prediction captured the known secondary structure.

Verification

Confirm your setup is working correctly with these checks:

Check 1: Verify PDB file was created

# Check file exists and has reasonable size
ls -lh insulin_a_chain_predicted.pdb

# Should show something like:
# -rw-r--r-- 1 user user 8.2K Dec 15 10:30 insulin_a_chain_predicted.pdb

Check 2: Inspect PDB file format

# View first 20 lines of the PDB file
head -20 insulin_a_chain_predicted.pdb

Expected output: You should see lines starting with ATOM containing coordinate data:

ATOM      1  N   GLY A   1      10.123  12.456   8.789  1.00 85.23           N
ATOM      2  CA  GLY A   1      11.234  13.567   9.890  1.00 87.45           C
ATOM      3  C   GLY A   1      12.345  14.678  10.901  1.00 88.12           C
...

Check 3: Validate structure in viewer

Load the file in either PyMOL or Mol* and verify:

  • Structure loads without errors
  • You can see clear secondary structure elements (helices, not random coils)
  • The structure appears compact, not stretched across the entire viewing area
  • For insulin A-chain: expect to see helical regions in the N-terminal and C-terminal portions

Success criteria:

  • ✓ PDB file exists and is 5-15 KB in size
  • ✓ File contains properly formatted ATOM records
  • ✓ Structure displays recognizable secondary structure in visualization software
  • ✓ Confidence score (if reported) is above 70

Common Issues & Fixes

Issue 1: "Service account key not found" Error

Error message:

ERROR: Service account key not found at /home/user/alphafold-key.json

Cause: The script cannot locate your service account JSON key file.

Fix:

# Verify the file exists
ls -l ~/alphafold-key.json

# If missing, regenerate it (Step 2c)
gcloud iam service-accounts keys create ~/alphafold-key.json \
    --iam-account=alphafold-access@$(gcloud config get-value project).iam.gserviceaccount.com

# Or update the SERVICE_ACCOUNT_FILE path in the script to match actual location

Issue 2: "Invalid amino acids found" Error

Error message:

ValueError: Invalid amino acids found: {'X', 'B', 'Z'}

Cause: Your sequence contains non-standard amino acid codes. Only the 20 standard amino acids are accepted: ACDEFGHIKLMNPQRSTVWY.

Fix:

# Add this validation before calling predict_structure()
sequence = "YOUR_SEQUENCE_HERE"

# Remove any whitespace or newlines
sequence = sequence.replace(" ", "").replace("\n", "").replace("\r", "")

# Check for invalid characters
valid_aa = set("ACDEFGHIKLMNPQRSTVWY")
invalid = set(sequence.upper()) - valid_aa

if invalid:
    print(f"Found invalid characters: {invalid}")
    print("Valid amino acids: A C D E F G H I K L M N P Q R S T V W Y")
    # Remove invalid characters (or fix your source sequence)
    sequence = ''.join(aa for aa in sequence.upper() if aa in valid_aa)

Issue 3: Authentication Failures

Error message:

ERROR: Authentication failed: Could not automatically determine credentials

Cause: Google Cloud SDK isn't properly authenticated or the service account lacks necessary permissions.

Fix:

# Re-authenticate your gcloud session
gcloud auth application-default login

# Verify your project is set
gcloud config get-value project

# Check service account exists
gcloud iam service-accounts list

# If the service account is missing, recreate it (Step 2c)

Issue 4: Connection Timeout for Large Proteins

Error message:

ERROR: Request timed out. Try a shorter sequence or increase timeout.

Cause: Large proteins (>200 amino acids) can take 10+ minutes to predict. The default timeout is too short.

Fix:

# In predict_structure.py, increase the timeout parameter:

response = requests.post(
    API_ENDPOINT,
    headers=headers,
    json=payload,
    timeout=1800  # Increase to 30 minutes for large proteins
)

Alternatively, test with shorter sequences first (50-100 amino acids) to verify your setup works before attempting large predictions.

Issue 5: Module Import Errors

Error message:

ModuleNotFoundError: No module named 'google.auth'

Cause: Required packages aren't installed, or you're not using the virtual environment.

Fix:

# Ensure virtual environment is activated
# You should see (alphafold-env) in your prompt
source alphafold-env/bin/activate  # Linux/macOS
# or
alphafold-env\Scripts\activate  # Windows

# Reinstall packages
pip install --upgrade requests biopython google-auth google-auth-oauthlib google-auth-httplib2

# Verify installation
pip list | grep google-auth

Next Steps

Now that you understand the workflow for API-based protein structure prediction, here are ways to extend your capabilities:

Immediate Next Steps:

  • Process multiple sequences: Modify the script to read from a FASTA file and batch process proteins
  • Add error logging: Implement proper logging with the logging module to track prediction jobs
  • Parse confidence scores: Extract per-residue pLDDT scores from the PDB file's B-factor column to identify low-confidence regions
  • Automate

Building a Privacy-Preserving AI Chat Interface: A Practical Guide to Local Inference with Encryption

With growing concerns around data privacy and AI systems, building applications that don't leak user data has become essential. This tutorial shows you how to create a chat interface that keeps conversations private through local inference and encryption.

What You'll Build

You'll create a fully functional privacy-preserving chat interface that processes AI requests locally with end-to-end encryption for stored data. This is a practical implementation of privacy-first AI architecture—a chat application where conversations never leave your control.

The final system includes:

  • A web-based chat interface with real-time responses
  • Local language model inference (zero external API calls)
  • Fernet symmetric encryption for conversation storage
  • REST API endpoints for chat and history retrieval
  • Visual confirmation of privacy features

This hands-on project teaches fundamental patterns for building AI applications that respect user privacy. You'll understand exactly where data flows, how to prevent leakage, and how to implement encryption correctly. The result is a template you can extend for production privacy-first AI applications.

Prerequisites

  • Python 3.10 or higher installed and accessible from command line
  • pip package manager (included with Python)
  • 8GB RAM minimum (16GB recommended for smoother performance)
  • 10GB free disk space for model weights and dependencies
  • CUDA-capable GPU optional (tutorial includes CPU fallback)
  • Basic command line familiarity
  • Basic understanding of REST APIs and HTTP requests
  • Text editor or IDE for writing Python code
  • Web browser for testing the interface
  • Estimated time: 60-90 minutes including model download

Step-by-Step Instructions

Step 1: Set Up Your Python Environment

Create a dedicated directory and virtual environment to isolate dependencies:

mkdir privacy-ai-chat
cd privacy-ai-chat
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Example output:

(venv) user@machine:~/privacy-ai-chat$

What this does: Creates an isolated Python environment where all packages install locally. This prevents conflicts with system Python packages and makes the project portable.

Step 2: Install Core Dependencies

Install required packages. Choose the PyTorch installation based on your hardware:

For CPU-only systems:

# Core packages
pip install flask==3.0.0 flask-cors==4.0.0 cryptography==41.0.7

# PyTorch CPU version
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

# Transformers and utilities
pip install transformers==4.36.2 accelerate==0.25.0

For CUDA GPU systems (check CUDA version with nvidia-smi):

# Core packages
pip install flask==3.0.0 flask-cors==4.0.0 cryptography==41.0.7

# PyTorch with CUDA 11.8 support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Transformers and utilities
pip install transformers==4.36.2 accelerate==0.25.0

What this does: Installs Flask for the web server, cryptography for encryption, PyTorch for model inference, and Hugging Face Transformers for loading language models. The PyTorch download is approximately 2GB.
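
Before downloading model weights, it's worth confirming PyTorch sees the hardware you expect:

python -c "import torch; print(torch.__version__, 'CUDA available:', torch.cuda.is_available())"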

Step 3: Create the Encryption Module

Create the encryption layer that protects conversation data. Create a file named crypto_utils.py:

from cryptography.fernet import Fernet
import json
import os

class ConversationEncryptor:
    """Handles encryption/decryption of conversation data using Fernet symmetric encryption"""
    
    def __init__(self, key_file='secret.key'):
        self.key_file = key_file
        self.key = self._load_or_generate_key()
        self.cipher = Fernet(self.key)
    
    def _load_or_generate_key(self):
        """Load existing encryption key or generate new one"""
        if os.path.exists(self.key_file):
            with open(self.key_file, 'rb') as f:
                return f.read()
        else:
            # Generate new key and save it
            key = Fernet.generate_key()
            with open(self.key_file, 'wb') as f:
                f.write(key)
            print(f"Generated new encryption key: {self.key_file}")
            return key
    
    def encrypt_conversation(self, conversation_data):
        """Encrypt conversation dict to bytes"""
        json_str = json.dumps(conversation_data)
        return self.cipher.encrypt(json_str.encode())
    
    def decrypt_conversation(self, encrypted_data):
        """Decrypt bytes back to conversation dict"""
        decrypted_bytes = self.cipher.decrypt(encrypted_data)
        return json.loads(decrypted_bytes.decode())
    
    def save_encrypted(self, conversation_data, filename):
        """Save encrypted conversation to disk"""
        encrypted = self.encrypt_conversation(conversation_data)
        with open(filename, 'wb') as f:
            f.write(encrypted)
    
    def load_encrypted(self, filename):
        """Load and decrypt conversation from disk"""
        with open(filename, 'rb') as f:
            encrypted = f.read()
        return self.decrypt_conversation(encrypted)

What this does: Implements Fernet symmetric encryption for conversation data. The encryption key generates once and persists in secret.key. Without this key file, encrypted conversations cannot be decrypted—this is the foundation of data sovereignty. The class handles serialization (dict to JSON to bytes) and encryption in one step.
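
You can verify the module in isolation with a quick round-trip test before wiring it into the server (run it from the project directory so crypto_utils.py is importable):

from crypto_utils import ConversationEncryptor

# Encrypt and decrypt a sample conversation to confirm round-trip fidelity
encryptor = ConversationEncryptor(key_file="secret.key")
sample = [{"role": "user", "content": "hello", "timestamp": "2026-02-21T12:00:00"}]

encrypted = encryptor.encrypt_conversation(sample)
print("Ciphertext preview:", encrypted[:40], "...")

decrypted = encryptor.decrypt_conversation(encrypted)
assert decrypted == sample
print("Round-trip OK: decrypted data matches original")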

Step 4: Build the Local Inference Engine

Create the AI inference component that runs entirely on your hardware. Create local_inference.py:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import warnings
warnings.filterwarnings('ignore')

class LocalLLM:
    """Local language model inference engine - no external API calls"""
    
    def __init__(self, model_name="microsoft/phi-2"):
        """
        Initialize with microsoft/phi-2 (2.7B parameters)
        Small enough for 8GB RAM but produces quality responses
        """
        print(f"Loading model: {model_name}")
        print("First run downloads ~5GB of weights (cached for future use)...")
        
        # Detect available hardware
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        print(f"Using device: {self.device}")
        
        # Load tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(
            model_name,
            trust_remote_code=True
        )
        
        # Load model with appropriate precision
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            trust_remote_code=True,
            torch_dtype=torch.float16 if self.device == "cuda" else torch.float32
        ).to(self.device)
        
        print("Model loaded successfully!")
    
    def generate_response(self, prompt, max_length=200):
        """
        Generate response using local model
        All computation happens on your hardware - no data transmitted
        """
        # Tokenize input
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)
        
        # Generate response (no_grad = inference only, no training)
        with torch.no_grad():
            outputs = self.model.generate(
                inputs.input_ids,
                max_length=max_length,
                do_sample=True,
                temperature=0.7,
                top_p=0.9,
                pad_token_id=self.tokenizer.eos_token_id
            )
        
        # Decode tokens back to text
        response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        
        # Remove the prompt echo from response
        response = response[len(prompt):].strip()
        
        return response

What this does: Wraps Hugging Face Transformers for local inference. The model (microsoft/phi-2) is 2.7 billion parameters—large enough for coherent responses but small enough for consumer hardware. On first run, this downloads model weights to ~/.cache/huggingface/ (approximately 5GB). Subsequent runs load from cache instantly. The torch.no_grad() context manager disables gradient computation since we're only doing inference, reducing memory usage.
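
To confirm inference works independently of the web server, you can exercise the class directly (expect the model download on the first run):

from local_inference import LocalLLM

# Load the model once and generate a short completion locally
llm = LocalLLM()
reply = llm.generate_response("Explain why local inference protects privacy:", max_length=120)
print(reply)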

Step 5: Create the Flask API Server

Build the REST API that connects the encryption and inference components. Create app.py:

from flask import Flask, request, jsonify
from flask_cors import CORS
from local_inference import LocalLLM
from crypto_utils import ConversationEncryptor
import os
from datetime import datetime

# Initialize Flask with static file serving
app = Flask(__name__, static_folder='static', static_url_path='')
CORS(app)  # Enable CORS for web interface

# Initialize privacy-preserving components
print("Initializing privacy-preserving AI system...")
llm = LocalLLM()
encryptor = ConversationEncryptor()

# In-memory conversation store (persisted encrypted to disk)
conversations = {}

@app.route('/')
def index():
    """Serve the chat interface"""
    return app.send_static_file('index.html')

@app.route('/health', methods=['GET'])
def health():
    """Health check endpoint"""
    return jsonify({
        "status": "healthy",
        "model_device": llm.device,
        "encryption_enabled": True
    })

@app.route('/chat', methods=['POST'])
def chat():
    """
    Main chat endpoint - processes message with local inference and encryption
    Expects JSON: {"message": "user message", "session_id": "unique_id"}
    """
    data = request.json
    user_message = data.get('message', '')
    session_id = data.get('session_id', 'default')
    
    if not user_message:
        return jsonify({"error": "No message provided"}), 400
    
    # Initialize conversation history for new sessions
    if session_id not in conversations:
        conversations[session_id] = []
    
    # Add user message to history
    conversations[session_id].append({
        "role": "user",
        "content": user_message,
        "timestamp": datetime.now().isoformat()
    })
    
    # Generate response locally (no external API call)
    print(f"Generating local response for: {user_message[:50]}...")
    response = llm.generate_response(user_message)
    
    # Add AI response to history
    conversations[session_id].append({
        "role": "assistant",
        "content": response,
        "timestamp": datetime.now().isoformat()
    })
    
    # Save encrypted conversation to disk
    encrypted_file = f"conversations/{session_id}.enc"
    os.makedirs("conversations", exist_ok=True)
    encryptor.save_encrypted(conversations[session_id], encrypted_file)
    
    return jsonify({
        "response": response,
        "session_id": session_id,
        "privacy_note": "Conversation encrypted and stored locally"
    })

@app.route('/history/<session_id>', methods=['GET'])
def get_history(session_id):
    """Retrieve and decrypt conversation history for a session"""
    encrypted_file = f"conversations/{session_id}.enc"
    
    if not os.path.exists(encrypted_file):
        return jsonify({"error": "Session not found"}), 404
    
    # Load and decrypt
    history = encryptor.load_encrypted(encrypted_file)
    
    return jsonify({
        "session_id": session_id,
        "history": history,
        "total_messages": len(history)
    })

if __name__ == '__main__':
    print("\n" + "="*50)
    print("Privacy-Preserving AI Chat Server")
    print("="*50)
    print("✓ Local inference (no API calls)")
    print("✓ Encrypted storage")
    print("✓ Data sovereignty maintained")
    print("="*50 + "\n")
    
    app.run(host='0.0.0.0', port=5000, debug=True)

What this does: Creates a complete REST API with three endpoints:

  • /health - Returns system status and configuration
  • /chat - Accepts user messages, generates responses locally, encrypts and saves conversations
  • /history/<session_id> - Retrieves and decrypts stored conversations

The key privacy feature: every conversation saves to disk in encrypted form. The conversations/ directory contains only encrypted files—readable only with the secret.key file.
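
To convince yourself the files on disk really are opaque without the key, try decrypting one offline with crypto_utils (the file name depends on the session_id you used; test_session is an example):

from crypto_utils import ConversationEncryptor

# Raw bytes on disk are unreadable ciphertext...
with open("conversations/test_session.enc", "rb") as f:
    print("First bytes on disk:", f.read(32))

# ...but they decrypt cleanly when secret.key is present
encryptor = ConversationEncryptor(key_file="secret.key")
history = encryptor.load_encrypted("conversations/test_session.enc")
print(f"Decrypted {len(history)} messages; first role: {history[0]['role']}")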

Step 6: Create the Web Interface

Build a visual chat interface. Create the directory and file static/index.html:

mkdir static

Then create static/index.html with this content:

<!DOCTYPE html>
<html>
<head>
    <title>Privacy-Preserving AI Chat</title>
    <style>
        body {
            font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
            max-width: 800px;
            margin: 50px auto;
            padding: 20px;
            background: #1a1a1a;
            color: #e0e0e0;
        }
        .privacy-badge {
            background: #2d5016;
            padding: 15px;
            border-radius: 8px;
            margin-bottom: 20px;
            border-left: 4px solid #4caf50;
        }
        .privacy-badge strong {
            font-size: 18px;
        }
        #chat-container {
            background: #2a2a2a;
            border-radius: 8px;
            padding: 20px;
            height: 400px;
            overflow-y: auto;
            margin-bottom: 20px;
            border: 1px solid #3a3a3a;
        }
        .message {
            margin: 10px 0;
            padding: 12px;
            border-radius: 6px;
            line-height: 1.4;
        }
        .user-message {
            background: #1e3a5f;
            margin-left: 20%;
            text-align: right;
        }
        .ai-message {
            background: #3a3a3a;
            margin-right: 20%;
        }
        #input-container {
            display: flex;
            gap: 10px;
        }
        input {
            flex: 1;
            padding: 12px;
            border: 1px solid #444;
            border-radius: 6px;
            background: #2a2a2a;
            color: #e0e0e0;
            font-size: 14px;
        }
        input:focus {
            outline: none;
            border-color: #4caf50;
        }
        button {
            padding: 12px 24px;
            background: #4caf50;
            color: white;
            border: none;
            border-radius: 6px;
            cursor: pointer;
            font-weight: bold;
            font-size: 14px;
        }
        button:hover {
            background: #45a049;
        }
        button:disabled {
            background: #666;
            cursor: not-allowed;
        }
        .loading {
            color: #888;
            font-style: italic;
        }
    </style>
</head>
<body>
    <div class="privacy-badge">
        🔒 <strong>Privacy-First AI</strong>
        <br>✓ Local inference only
        <br>✓ Encrypted storage
        <br>✓ No data sent to external APIs
    </div>
    
    <div id="chat-container"></div>
    
    <div id="input-container">
        <input type="text" id="message-input" placeholder="Type your message..." />
        <button onclick="sendMessage()" id="send-btn">Send</button>
    </div>
    
    <script>
        // Generate unique session ID for this browser session
        const sessionId = 'session_' + Date.now();
        
        async function sendMessage() {
            const input = document.getElementById('message-input');
            const message = input.value.trim();
            
            if (!message) return;
            
            // Disable input while processing
            input.disabled = true;
            document.getElementById('send-btn').disabled = true;
            
            // Display user message
            addMessage(message, 'user');
            input.value = '';
            
            // Show loading indicator
            const loadingId = addMessage('Generating response...', 'ai', true);
            
            try {
                const response = await fetch('http://localhost:5000/chat', {
                    method: 'POST',
                    headers: {'Content-Type': 'application/json'},
                    body: JSON.stringify({
                        message: message,
                        session_id: sessionId
                    })
                });
                
                const data = await response.json();
                
                // Remove loading indicator
                document.getElementById(loadingId).remove();
                
                // Display AI response
                addMessage(data.response, 'ai');
                
            } catch (error) {
                document.getElementById(loadingId).remove();
                addMessage('Error: ' + error.message, 'ai');
            }
            
            // Re-enable input
            input.disabled = false;
            document.getElementById('send-btn').disabled = false;
            input.focus();
        }
        
        function addMessage(text, role, isLoading = false) {
            const container = document.getElementById('chat-container');
            const div = document.createElement('div');
            const messageId = 'msg_' + Date.now() + Math.random();
            div.id = messageId;
            div.className = `message ${role}-message`;
            if (isLoading) div.className += ' loading';
            div.textContent = text;
            container.appendChild(div);
            container.scrollTop = container.scrollHeight;
            return messageId;
        }
        
        // Send message on Enter key
        document.getElementById('message-input').addEventListener('keypress', function(e) {
            if (e.key === 'Enter') sendMessage();
        });
        
        // Focus input on load
        document.getElementById('message-input').focus();
    </script>
</body>
</html>

What this does: Creates a dark-themed chat interface with visual privacy indicators. The JavaScript handles:

  • Sending messages to the local API
  • Displaying user and AI messages in different styles
  • Showing loading states during inference
  • Generating unique session IDs per browser session

The interface emphasizes privacy features with a prominent badge showing the security guarantees.

Step 7: Run the System

Start the privacy-preserving chat server:

python app.py

Example output (first run):

Initializing privacy-preserving AI system...
Loading model: microsoft/phi-2
First run downloads ~5GB of weights (cached for future use)...
Using device: cuda
Model loaded successfully!
Generated new encryption key: secret.key

==================================================
Privacy-Preserving AI Chat Server
==================================================
✓ Local inference (no API calls)
✓ Encrypted storage
✓ Data sovereignty maintained
==================================================

 * Serving Flask app 'app'
 * Debug mode: on
 * Running on http://0.0.0.0:5000

What this does: Starts the Flask development server. On first run, downloads the phi-2 model weights (approximately 5GB). Subsequent runs load the cached model in seconds. The server listens on port 5000 and is accessible at http://localhost:5000.

Open your web browser and navigate to http://localhost:5000 to see the chat interface.

Verification

Test each component to confirm the system works correctly:

Test 1: Verify Server Health

Check that the server is running and configured correctly:

curl http://localhost:5000/health

Expected output:

{
  "status": "healthy",
  "model_device": "cuda",
  "encryption_enabled": true
}

The model_device field shows cuda if using GPU or cpu if using CPU inference.

Test 2: Send a Chat Message via API

Test the chat endpoint directly:

curl -X POST http://localhost:5000/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "What is machine learning?", "session_id": "test_session"