Thursday, February 19, 2026

Building a Local LLM Evaluation Pipeline with Ragas and Ollama

What You'll Build

A working evaluation pipeline that tests your RAG (Retrieval-Augmented Generation) system locally using Ragas metrics and Ollama models. No OpenAI API keys required, no cloud dependencies—everything runs on your machine.

You'll create a Python script that takes questions, retrieves context from a document store, generates answers with a local LLM, and scores them across four key metrics: faithfulness, answer relevancy, context precision, and context recall. This approach costs nothing, keeps your data local, and provides repeatable evaluation results. You'll walk away with a complete pipeline you can adapt to evaluate your own RAG systems.

Prerequisites

  • Python 3.10+ (check with python --version)
  • pip package manager (included with Python)
  • Ollama installed and running - Download from https://ollama.ai/download
  • 8GB+ RAM (16GB recommended for smoother operation)
  • 5GB free disk space (for model downloads)
  • Basic understanding of RAG - you've built or used one before
  • Estimated time: 45-60 minutes

Step-by-Step Instructions

Step 1: Set Up Your Environment

Create a directory and virtual environment to isolate dependencies:

mkdir rag-eval-local
cd rag-eval-local
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Expected output: Your terminal prompt shows (venv) at the beginning, indicating the virtual environment is active.

Step 2: Install Required Packages

Install Ragas and the Langchain Ollama integration:

pip install ragas==0.1.9 langchain-community==0.2.10 langchain-ollama==0.1.0 datasets==2.14.0

What each package does:

  • ragas - Evaluation framework providing RAG metrics
  • langchain-ollama - Connects Langchain to local Ollama models
  • langchain-community - Required dependency for Langchain integrations
  • datasets - Handles data formatting for Ragas

Expected output: Successful installation messages ending with "Successfully installed ragas-0.1.9..." (installation takes 1-2 minutes).

Step 3: Pull the Required Ollama Models

Download the LLM for evaluation and the embedding model (takes 10-15 minutes depending on connection speed):

# Pull the instruction-tuned LLM for evaluation (4.1GB)
ollama pull mistral:7b-instruct

# Pull the embedding model (274MB)
ollama pull nomic-embed-text

Expected output:

pulling manifest
pulling 4a03f83c5f0d... 100% ▕████████████████▏ 4.1 GB
pulling e6836092461f... 100% ▕████████████████▏  7.7 KB
pulling 4a03f83c5f0e... 100% ▕████████████████▏   11 KB
success

Why these models: Mistral 7B Instruct is fast enough for local evaluation while maintaining good quality. Nomic-embed-text is an open embedding model optimized for retrieval tasks and runs efficiently on Ollama.

Verify the models are ready:

ollama list

You should see both models listed with their sizes.

Step 4: Create Sample Data

Create sample_data.py with test data. In production, you'd load this from your actual RAG system, but this lets us focus on evaluation mechanics:

# sample_data.py
from datasets import Dataset

# Sample RAG outputs to evaluate
# Each entry represents: question asked, answer generated, 
# contexts retrieved, and ground truth answer
eval_data = {
    "question": [
        "What is the capital of France?",
        "How does photosynthesis work?",
        "What are the main causes of climate change?"
    ],
    "answer": [
        "The capital of France is Paris, a major European city known for its art, fashion, and culture.",
        "Photosynthesis is the process where plants use sunlight, water, and carbon dioxide to create oxygen and energy in the form of sugar.",
        "Climate change is primarily caused by human activities that release greenhouse gases, especially burning fossil fuels, deforestation, and industrial processes."
    ],
    "contexts": [
        ["Paris is the capital and most populous city of France. It is located in north-central France."],
        ["Photosynthesis is a process used by plants and other organisms to convert light energy into chemical energy. It uses carbon dioxide and water, producing glucose and oxygen."],
        ["The primary cause of climate change is the burning of fossil fuels like coal, oil, and gas, which releases carbon dioxide. Deforestation also contributes significantly."]
    ],
    "ground_truth": [
        "Paris is the capital of France.",
        "Photosynthesis converts light energy into chemical energy using CO2 and water, producing glucose and oxygen.",
        "Climate change is mainly caused by burning fossil fuels and deforestation, which release greenhouse gases."
    ]
}

# Convert to Ragas-compatible dataset format
dataset = Dataset.from_dict(eval_data)

Data structure explained: Ragas expects contexts as a list of lists (each question can have multiple retrieved context chunks), while other fields are simple lists. This mirrors what your RAG system produces.
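
This structure is straightforward to produce from a live system. Below is a rough sketch of assembling the same four lists from your own pipeline; retrieve_contexts and generate_answer are hypothetical placeholders for your retrieval and generation calls, not functions from this tutorial:

# build_eval_data.py (sketch; adapt to your own RAG code)
from datasets import Dataset
from my_rag_app import retrieve_contexts, generate_answer  # hypothetical imports

questions = ["What is the capital of France?"]
ground_truths = ["Paris is the capital of France."]

eval_data = {"question": [], "answer": [], "contexts": [], "ground_truth": []}
for q, gt in zip(questions, ground_truths):
    contexts = retrieve_contexts(q)        # list[str] of retrieved chunks
    answer = generate_answer(q, contexts)  # answer string from your LLM
    eval_data["question"].append(q)
    eval_data["answer"].append(answer)
    eval_data["contexts"].append(contexts)  # keeps the list-of-lists shape
    eval_data["ground_truth"].append(gt)

dataset = Dataset.from_dict(eval_data)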

Step 5: Configure Ragas with Ollama

Create the main evaluation script. This connects Ragas to your local Ollama models instead of cloud APIs.

Create evaluate_rag.py:

# evaluate_rag.py
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from langchain_ollama import ChatOllama, OllamaEmbeddings
from sample_data import dataset

# Configure the Ollama LLM for evaluation
llm = ChatOllama(
    model="mistral:7b-instruct",
    temperature=0,  # Deterministic outputs for consistent evaluation
    num_ctx=4096,   # Context window size
)

# Configure the embedding model
embeddings = OllamaEmbeddings(
    model="nomic-embed-text"
)

# Run evaluation with local models
# The llm and embeddings parameters override Ragas' default OpenAI models
result = evaluate(
    dataset,
    metrics=[
        faithfulness,        # Is answer grounded in context?
        answer_relevancy,    # Does answer address the question?
        context_precision,   # Are retrieved contexts relevant?
        context_recall,      # Do contexts contain needed info?
    ],
    llm=llm,
    embeddings=embeddings,
)

# Display results
print("\n=== Evaluation Results ===")
print(result)

# Save detailed per-question results
result_df = result.to_pandas()
result_df.to_csv("evaluation_results.csv", index=False)
print("\nDetailed results saved to evaluation_results.csv")

Key configuration choices:

  • temperature=0 makes evaluation deterministic—you want consistent scores across runs
  • num_ctx=4096 provides enough context window for Ragas' evaluation prompts
  • We pass our local models directly to evaluate() instead of using default OpenAI models

Step 6: Run the Evaluation

Verify Ollama is running (it should auto-start after installation), then execute the evaluation script:

# Check Ollama is running
ollama list

# Run the evaluation
python evaluate_rag.py

Expected output:

Evaluating: 100%|████████████████████| 12/12 [00:45<00:00,  3.78s/it]

=== Evaluation Results ===
{'faithfulness': 0.8333, 'answer_relevancy': 0.9123, 'context_precision': 0.8889, 'context_recall': 0.9167}

Detailed results saved to evaluation_results.csv

What just happened: Ragas evaluated 12 items (4 metrics × 3 questions). Each metric uses the LLM to judge quality through carefully designed prompts. Execution time varies by hardware—expect 30-60 seconds on modern machines.

Step 7: Understand Your Metrics

Open evaluation_results.csv to see per-question scores (a snippet for flagging low-scoring rows follows the list below). Here's what each metric measures:

  • Faithfulness (0-1): Is the answer grounded in the retrieved context? Scores near 1 mean no hallucinations.
  • Answer Relevancy (0-1): Does the answer actually address the question asked?
  • Context Precision (0-1): Are the retrieved contexts relevant to answering the question?
  • Context Recall (0-1): Do the contexts contain all information needed to answer?
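
Before visualizing, you can flag weak rows directly from the CSV. This short snippet assumes the question and metric columns are carried through to evaluation_results.csv, which is what result.to_pandas() produces in ragas 0.1.x; the 0.7 threshold is just an illustrative cutoff:

# flag_low_scores.py
import pandas as pd

df = pd.read_csv("evaluation_results.csv")
metrics = ["faithfulness", "answer_relevancy", "context_precision", "context_recall"]

# Print any question that scores below the threshold on any metric
threshold = 0.7
weak = df[(df[metrics] < threshold).any(axis=1)]
print(weak[["question"] + metrics].to_string(index=False))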

A quick chart makes the results easier to scan. Create visualize.py:

# visualize.py
import pandas as pd
import matplotlib.pyplot as plt

# Load evaluation results
df = pd.read_csv("evaluation_results.csv")

# Calculate average scores across all questions
metrics = ['faithfulness', 'answer_relevancy', 'context_precision', 'context_recall']
scores = df[metrics].mean()

# Create bar chart
plt.figure(figsize=(10, 6))
plt.bar(metrics, scores, color='steelblue')
plt.ylim(0, 1)
plt.ylabel('Score')
plt.title('RAG System Evaluation Metrics')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.savefig('metrics.png', dpi=150)
print("Visualization saved to metrics.png")

Install matplotlib if needed, then run the visualization:

pip install matplotlib==3.7.1
python visualize.py

Open metrics.png to see your results visualized. This makes it easier to spot which aspects of your RAG system need improvement.

Verification

Confirm everything worked correctly with these checks:

  1. Check the CSV exists and has content:
    ls -lh evaluation_results.csv
    head evaluation_results.csv
    
    You should see a file with multiple columns including your metrics and scores.
  2. Verify scores are reasonable: Open evaluation_results.csv—all metric scores should be between 0 and 1. If you see NaN, null, or values outside this range, something failed.
  3. Test with intentionally bad data: Modify sample_data.py to add a clearly wrong answer (a complete example entry is sketched after this list):
    # Add this to the end of the "answer" list in sample_data.py
    "The capital of France is London."
    # Add corresponding question, contexts, and ground_truth entries
    
    Run python evaluate_rag.py again—faithfulness should drop significantly for that question.
  4. Test consistency: Run python evaluate_rag.py three times with the original data. With temperature=0, the scores should be identical across runs, or vary by less than 0.02.
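
For check 3, one complete set of entries you could append looks like the following. Add these lines to sample_data.py after the eval_data dictionary and before the Dataset.from_dict(eval_data) call so all four lists stay the same length:

# Intentionally bad entry for the faithfulness test
eval_data["question"].append("What is the capital of France?")
eval_data["answer"].append("The capital of France is London.")  # contradicts the context
eval_data["contexts"].append(
    ["Paris is the capital and most populous city of France."]
)
eval_data["ground_truth"].append("Paris is the capital of France.")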

Success looks like: You can run the evaluation repeatedly and get consistent scores, the CSV contains per-question breakdowns, and intentionally bad answers score lower than good ones.

Common Issues & Fixes

Issue 1: "Connection refused" or "Ollama not found"

Error message:

requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=11434): 
Max retries exceeded with url: /api/generate

Cause: Ollama service isn't running.

Fix: Start Ollama manually:

# macOS/Linux
ollama serve

# Or check if it's already running
ps aux | grep ollama

# Windows - Ollama should run as a system service
# Check system tray for Ollama icon

On macOS, Ollama typically runs as a background service after installation. If ollama list works, the service is running.

Issue 2: Evaluation hangs or takes extremely long

Symptoms: Progress bar stuck at 0% for 5+ minutes, or system becomes unresponsive.

Cause: Model not fully loaded in memory, or system is swapping to disk.

Fix: Cancel with Ctrl+C and verify models load properly:

# Test model responds quickly
ollama run mistral:7b-instruct "What is 2+2?"
# Should respond within 5-10 seconds

If the model loads slowly or your system is swapping, the 7B model is likely too large for your available RAM (the default mistral:7b-instruct tag is already a 4-bit quantized build, so pulling a quantized variant won't help). Try a smaller model instead:

# Pull a smaller model (roughly 2GB)
ollama pull phi3:mini

# Update evaluate_rag.py to use it
# Change: model="phi3:mini"

Issue 3: Low or inconsistent scores on clearly good answers

Symptoms: Faithfulness of 0.3 when answer clearly matches context, or scores vary wildly between runs.

Cause: Model may be misinterpreting Ragas' evaluation prompts.

Fix: Try an alternative model that handles instruction-following better:

# Pull Llama 3 8B (instruction-tuned by default; often more reliable for evaluation)
ollama pull llama3:8b

# Update evaluate_rag.py
# Change: model="llama3:8b"

Or re-pull the current model to ensure you have the latest version:

ollama pull mistral:7b-instruct

Issue 4: ImportError or ModuleNotFoundError

Error message:

ModuleNotFoundError: No module named 'ragas'

Cause: Virtual environment not activated, or packages not installed.

Fix: Ensure virtual environment is active and reinstall:

# Activate virtual environment
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Verify activation - should show venv path
which python

# Reinstall packages
pip install ragas==0.1.9 langchain-community==0.2.10 langchain-ollama==0.1.0 datasets==2.14.0

Next Steps

Now that you have a working local evaluation pipeline, here's how to extend it:

  • Integrate with your actual RAG system: Replace sample_data.py with outputs from your production system. Export questions, generated answers, retrieved contexts, and ground truth answers in the same format (a minimal loading sketch follows this list).
  • Add custom metrics: Ragas supports custom evaluators for domain-specific requirements. See the official documentation at https://docs.ragas.io/en/latest/concepts/metrics/custom.html
  • Batch evaluation: Process larger datasets by loading them from CSV or JSON files. Ragas handles batching automatically, but consider chunking very large datasets (1000+ questions) to avoid memory issues.
  • Track metrics over time: Store results in a database (SQLite works well; see the second sketch after this list) to monitor how system changes affect metrics. This helps you catch regressions early.
  • Compare different approaches: Evaluate different chunking strategies, embedding models, or retrieval methods by swapping out the contexts while keeping questions constant. This isolates what actually improves performance.
  • Automate with CI/CD: Add evaluation to your testing pipeline to catch quality regressions before deployment.
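
As a starting point for integrating your own data and for batch evaluation, here is a minimal loading sketch. The file name rag_export.jsonl and its field names are assumptions; adapt them to whatever your system actually exports:

# load_eval_set.py (sketch)
import json
from datasets import Dataset

# Hypothetical export: one JSON object per line with the four Ragas fields
records = []
with open("rag_export.jsonl") as f:
    for line in f:
        row = json.loads(line)
        records.append({
            "question": row["question"],
            "answer": row["answer"],
            "contexts": row["contexts"],  # must be a list of strings
            "ground_truth": row["ground_truth"],
        })

dataset = Dataset.from_list(records)  # pass this to evaluate() as in Step 5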
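
And for tracking metrics over time, here is a minimal sketch that appends each run's average scores to a local SQLite table; the database file name and schema are placeholders:

# track_metrics.py (sketch)
import datetime
import sqlite3

import pandas as pd

df = pd.read_csv("evaluation_results.csv")
metrics = ["faithfulness", "answer_relevancy", "context_precision", "context_recall"]
means = df[metrics].mean()

conn = sqlite3.connect("eval_history.db")
conn.execute("CREATE TABLE IF NOT EXISTS runs (run_at TEXT, metric TEXT, score REAL)")
run_at = datetime.datetime.now().isoformat()
conn.executemany(
    "INSERT INTO runs VALUES (?, ?, ?)",
    [(run_at, m, float(means[m])) for m in metrics],
)
conn.commit()
conn.close()
print("Run recorded in eval_history.db")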

Local models score differently than GPT-4 would, but they're consistent and free. This makes them ideal for iterative development where you need to run hundreds of evaluations. You can always run a final validation with commercial APIs once you've narrowed down your best approach.

Helpful resources:

If you hit issues not covered here, the Ragas GitHub Issues page is actively maintained, and the Langchain community forums have good coverage of Ollama integration questions.
