Code RAG Benchmark Plan

Kodit Benchmark: SWE-bench Implementation

Overview

This document describes the SWE-bench benchmark implementation for evaluating Kodit’s code retrieval capabilities. The benchmark uses real-world GitHub issues from the SWE-bench dataset to measure how much Kodit’s retrieval improves LLM patch generation.


1. Why SWE-bench?

SWE-bench tests repository-level issue resolution—a task where retrieval provides significant value:

| Kodit Capability | SWE-bench Requirement | Alignment |
|---|---|---|
| Index Git repositories | Real GitHub repos at specific commits | ✅ Perfect |
| Hybrid search (BM25 + semantic) | Find relevant code for bug fixing | ✅ Perfect |
| AST-based snippet extraction | Locate functions/classes to modify | ✅ Perfect |
| Filter by repository | Each task targets a specific repo | ✅ Perfect |

Why SWE-bench over RepoEval?

| Feature | SWE-bench | RepoEval |
|---|---|---|
| Exact commit hashes | base_commit field | ❌ Snapshots only |
| Evaluation method | ✅ Real test execution | ⚠️ Token similarity |
| Task complexity | Real bug fixes | Function completion |
| Retrieval impact | High (large repos) | Medium |

From the SWE-bench leaderboard:

  • RAG-based approaches (BM25 retrieval + LLM) achieve 4-7% on Lite
  • Agentless-Lite with embedding retrieval achieves 32% on Lite
  • This demonstrates significant headroom for better retrieval

2. Dataset

2.1 Data Source

The benchmark uses the official SWE-bench datasets from Hugging Face:

| Dataset | Size | Use Case |
|---|---|---|
| princeton-nlp/SWE-bench_Lite | 300 instances | Primary benchmark |
| princeton-nlp/SWE-bench_Verified | 500 instances | Extended benchmark |
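
For quick inspection, the Lite split can be loaded directly with the Hugging Face datasets library. A minimal sketch; the 300-instance split is published as "test":

from datasets import load_dataset

# Load SWE-bench Lite from Hugging Face; each row follows the format in section 2.3.
dataset = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")

print(len(dataset))               # 300 instances
print(dataset[0]["instance_id"])  # a unique identifier such as "django__django-11049"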

2.2 Repositories

SWE-bench Lite covers 12 popular Python repositories:

| Repository | Instances | Description |
|---|---|---|
| django/django | 114 | Web framework |
| sympy/sympy | 77 | Symbolic mathematics |
| matplotlib/matplotlib | 23 | Plotting library |
| scikit-learn/scikit-learn | 23 | Machine learning |
| pytest-dev/pytest | 17 | Testing framework |
| sphinx-doc/sphinx | 16 | Documentation generator |
| astropy/astropy | 6 | Astronomy library |
| psf/requests | 6 | HTTP library |
| pylint-dev/pylint | 6 | Code linter |
| pydata/xarray | 5 | N-D arrays |
| mwaskom/seaborn | 4 | Statistical visualization |
| pallets/flask | 3 | Web microframework |

2.3 Instance Format

Each instance contains:

{
    "instance_id": "django__django-11049",        # Unique identifier
    "repo": "django/django",                      # GitHub repository
    "base_commit": "17455e924e24...",            # Exact commit to checkout
    "problem_statement": "...",                   # Issue description (natural language)
    "hints_text": "...",                          # Optional hints
    "patch": "diff --git a/...",                  # Ground truth fix
    "test_patch": "diff --git a/...",            # Test additions
    "FAIL_TO_PASS": ["test_invalid_string..."],  # Tests that should pass after fix
    "PASS_TO_PASS": ["test_other..."],           # Tests that should remain passing
    "version": "3.0",                             # Library version
    "environment_setup_commit": "...",           # Commit for environment setup
}
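
Section 10 notes that these fields are wrapped in an immutable SWEBenchInstance dataclass. A minimal sketch of what that might look like, assuming the fields map one-to-one onto the raw instance:

from dataclasses import dataclass

@dataclass(frozen=True)
class SWEBenchInstance:
    """One SWE-bench task; field names mirror the raw instance above."""
    instance_id: str
    repo: str
    base_commit: str
    problem_statement: str
    hints_text: str
    patch: str
    test_patch: str
    FAIL_TO_PASS: list[str]
    PASS_TO_PASS: list[str]
    version: str
    environment_setup_commit: str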

2.4 Example Task

Instance: django__django-11049
Commit: 17455e924e243e7a55e8a38f45966d8cbb27c273

Problem Statement:
  Correct expected format in invalid DurationField error message.
  The current error message says "[DD] [HH:[MM:]]ss[.uuuuuu]" but should
  be "[DD] [[HH:]MM:]ss[.uuuuuu]" because seconds are mandatory.

Expected Patch:
  diff --git a/django/db/models/fields/__init__.py
  -                     "[DD] [HH:[MM:]]ss[.uuuuuu] format.")
  +                     "[DD] [[HH:]MM:]ss[.uuuuuu] format.")

Tests to Fix:
  ["test_invalid_string (model_fields.test_durationfield.TestValidation)"]

3. Experimental Design

3.1 Conditions

| Condition | Description |
|---|---|
| Baseline | LLM generates patch with only the problem statement |
| BM25 | LLM generates patch with BM25-retrieved context (SWE-bench baseline) |
| Kodit | LLM generates patch with Kodit-retrieved context |
| Oracle | LLM generates patch with gold file context (upper bound) |

3.2 Metrics

Primary Metric:

  • Resolve Rate: Percentage of instances where generated patch makes FAIL_TO_PASS tests pass
  • Resolve Rate Delta: Resolve(Kodit) - Resolve(BM25) — the improvement over baseline RAG

Secondary Metrics:

  • Retrieval Recall@k: Fraction of files modified by the gold patch that appear in the top-k results (see the sketch after this list)
  • Context Utilization: How often retrieved context appears in generated patches
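
Recall@k falls out of the gold patch and the retriever's ranked file list. A minimal sketch; it assumes standard "diff --git a/... b/..." headers in the patch field:

import re

def gold_files(patch: str) -> set[str]:
    """Files touched by the ground-truth patch, parsed from its diff headers."""
    return set(re.findall(r"^diff --git a/(\S+) b/", patch, flags=re.MULTILINE))

def recall_at_k(retrieved_files: list[str], patch: str, k: int = 5) -> float:
    """Fraction of gold-patch files that appear in the top-k retrieved files."""
    gold = gold_files(patch)
    if not gold:
        return 0.0
    return len(gold & set(retrieved_files[:k])) / len(gold)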

3.3 Evaluation

Evaluation uses the official SWE-bench harness with Docker containers:

  1. Apply generated patch to repository at base_commit
  2. Run FAIL_TO_PASS tests in isolated environment
  3. Verify PASS_TO_PASS tests still pass (no regressions)
  4. The instance counts as “resolved” only if all conditions are met

Running the SWE-bench Harness

Install and run the official evaluation harness:

# Install SWE-bench
pip install swebench

# Run evaluation (requires Docker with ~100GB disk space)
python -m swebench.harness.run_evaluation \
    --dataset_name princeton-nlp/SWE-bench_Lite \
    --predictions_path results/predictions.jsonl \
    --max_workers 8 \
    --run_id kodit_eval

# For Mac M-series (ARM), build images locally:
python -m swebench.harness.run_evaluation \
    --dataset_name princeton-nlp/SWE-bench_Lite \
    --predictions_path results/predictions.jsonl \
    --max_workers 8 \
    --namespace '' \
    --run_id kodit_eval

Performance: ~30 mins for Lite (300 instances) on 16 cores with cache_level=env.

Cloud option: Run on Modal to avoid local Docker setup:

pip install modal swebench[modal]
modal setup
python -m swebench.harness.run_evaluation \
    --dataset_name princeton-nlp/SWE-bench_Lite \
    --predictions_path results/predictions.jsonl \
    --modal true

4. Running the Benchmark

4.1 Quick Start

# Step 1: Setup - clone repos at specific commits, index with Kodit
uv run kodit benchmark setup --dataset swebench-lite

# Step 2: Run a single instance (for testing)
uv run kodit benchmark run-one django__django-11049

# Step 3: Run full benchmark
uv run kodit benchmark run --dataset swebench-lite --condition kodit

# Step 4: Evaluate predictions
uv run kodit benchmark evaluate results/predictions.jsonl

4.2 CLI Commands

# Show available instances
uv run kodit benchmark list --dataset swebench-lite --repo django/django

# Run specific instances
uv run kodit benchmark run \
    --instances django__django-11049 django__django-13447 \
    --model claude-3-5-sonnet-20241022 \
    --condition kodit

# Compare conditions
uv run kodit benchmark compare \
    results/baseline.jsonl \
    results/kodit.jsonl

4.3 Configuration Options

| Option | Default | Description |
|---|---|---|
| --dataset | swebench-lite | Dataset variant (lite, verified, full) |
| --model | claude-3-5-sonnet-20241022 | LiteLLM model identifier |
| --condition | kodit | Retrieval condition (baseline, bm25, kodit, oracle) |
| --top-k | 5 | Number of files/snippets to retrieve |
| --instances | all | Specific instance IDs to run |
| --repo | all | Filter to specific repository |

5. Architecture

5.1 Directory Structure

benchmarks/
├── __init__.py
├── cli.py                    # CLI commands (setup, run, evaluate)
├── swebench/
│   ├── __init__.py
│   ├── instance.py           # SWEBenchInstance dataclass
│   ├── loader.py             # HuggingFace dataset loader
│   ├── repository.py         # Git clone/checkout management
│   ├── retriever.py          # Kodit retrieval wrapper
│   ├── prompt.py             # Prompt templates
│   ├── generator.py          # LLM patch generation
│   └── evaluator.py          # SWE-bench harness wrapper
├── repos/                    # Cloned repositories (gitignored)
│   └── django__django-11049/ # Instance-specific checkout
├── results/                  # Benchmark outputs
│   └── predictions.jsonl
└── cache/                    # Indexed repository cache

5.2 Setup Process

The setup command prepares repositories for benchmarking:

  1. Load dataset: Fetch from princeton-nlp/SWE-bench_Lite

  2. Clone repositories: For each unique (repo, base_commit) pair:

    git clone https://github.com/{repo} repos/{instance_id}
    cd repos/{instance_id}
    git checkout {base_commit}
  3. Index with Kodit: For each cloned repository (see the sketch after this list):

    • POST to /api/v1/repositories with file:// URI
    • Wait for indexing to complete
    • Store mapping: instance_id → repository_id
  4. Cache index: Save Kodit database for reuse
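
A minimal sketch of steps 2-3 for one instance, assuming a local Kodit server and httpx as the HTTP client. Only the endpoint and the file:// URI come from the plan; the request and response body shapes are assumptions:

import subprocess
from pathlib import Path

import httpx

KODIT_URL = "http://localhost:8080"  # assumed local Kodit server address

def prepare_repository(instance_id: str, repo: str, base_commit: str) -> str:
    """Clone the repo at the exact commit and register it with Kodit (steps 2-3)."""
    repo_dir = Path("benchmarks/repos") / instance_id
    if not repo_dir.exists():
        subprocess.run(
            ["git", "clone", f"https://github.com/{repo}", str(repo_dir)], check=True
        )
    subprocess.run(["git", "checkout", base_commit], cwd=repo_dir, check=True)

    # Register the checkout via a file:// URI; request/response fields are assumed.
    response = httpx.post(
        f"{KODIT_URL}/api/v1/repositories",
        json={"uri": repo_dir.resolve().as_uri()},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["id"]  # store the instance_id -> repository_id mapping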

5.3 Benchmark Pipeline

┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│ Load Instance│───▶│   Retrieve   │───▶│ Build Prompt │
│  from HF     │    │  from Kodit  │    │  (issue+ctx) │
└──────────────┘    └──────────────┘    └──────┬───────┘
                                               │
┌──────────────┐    ┌──────────────┐           │
│   Evaluate   │◀───│   Generate   │◀──────────┘
│  with Docker │    │    Patch     │
└──────────────┘    └──────────────┘
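
Per instance, the pipeline is a straight chain. A sketch with hypothetical names (retriever, generator, build_prompt) that roughly match the modules in 5.1:

async def run_instance(instance, retriever, generator) -> dict:
    """Retrieve context, build the prompt, generate a patch, return a prediction row."""
    files = await retriever.retrieve(instance, k=5)
    prompt = build_prompt(instance.problem_statement, files)  # see 6.2
    patch = await generator.generate(prompt)                  # hypothetical LLM call
    return {
        "instance_id": instance.instance_id,
        "model_name_or_path": "kodit-claude",
        "model_patch": patch,
    }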

6. Implementation Details

6.1 Retrieval Strategy

The KoditRetriever queries Kodit with the problem statement:

class KoditRetriever:
    async def retrieve(self, instance: SWEBenchInstance, k: int = 5) -> list[RetrievedFile]:
        # Extract key terms from problem statement
        keywords = self._extract_keywords(instance.problem_statement)

        results = await self._search_service.search(
            user_intent=instance.problem_statement,
            keywords=keywords,
            source_repo=f"github.com/{instance.repo}",
        )

        # Group snippets by file, return top-k files
        files = self._group_by_file(results)
        return files[:k]
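
The _group_by_file step collapses snippet-level hits into a ranked file list. A sketch that assumes each search result exposes a file_path attribute, results arrive in relevance order, and RetrievedFile takes path and snippets (hypothetical constructor):

    def _group_by_file(self, results: list) -> list["RetrievedFile"]:
        """Collapse snippet hits into files, keeping best-rank order."""
        by_file: dict[str, list] = {}
        for snippet in results:  # results assumed to be relevance-ordered
            by_file.setdefault(snippet.file_path, []).append(snippet)
        return [RetrievedFile(path=p, snippets=s) for p, s in by_file.items()]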

6.2 Prompt Template

Following the SWE-bench BM25 baseline format:

You will be provided with a partial code base and an issue statement explaining
a problem to resolve.

<issue>
{problem_statement}
</issue>

<code>
[start of {file_path_1}]
{file_content_1}
[end of {file_path_1}]

[start of {file_path_2}]
{file_content_2}
[end of {file_path_2}]
</code>

Generate a patch in unified diff format that resolves the issue.
Only output the patch, no explanations.
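
A minimal prompt builder for this template; file objects are assumed to expose path and content attributes (hypothetical names):

def build_prompt(problem_statement: str, files: list) -> str:
    """Render the BM25-baseline-style prompt from the issue text and retrieved files."""
    code_context = "\n\n".join(
        f"[start of {f.path}]\n{f.content}\n[end of {f.path}]" for f in files
    )
    return (
        "You will be provided with a partial code base and an issue statement explaining\n"
        "a problem to resolve.\n\n"
        f"<issue>\n{problem_statement}\n</issue>\n\n"
        f"<code>\n{code_context}\n</code>\n\n"
        "Generate a patch in unified diff format that resolves the issue.\n"
        "Only output the patch, no explanations."
    )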

6.3 Prediction Format (Target Output)

Our benchmark must produce a JSONL file compatible with the SWE-bench evaluation harness:

{"instance_id": "django__django-11049", "model_name_or_path": "kodit-claude", "model_patch": "diff --git a/django/db/models/fields/__init__.py b/django/db/models/fields/__init__.py\nindex abc123..def456 100644\n--- a/django/db/models/fields/__init__.py\n+++ b/django/db/models/fields/__init__.py\n@@ -1000,7 +1000,7 @@\n-                     \"[DD] [HH:[MM:]]ss[.uuuuuu] format.\")\n+                     \"[DD] [[HH:]MM:]ss[.uuuuuu] format.\")"}
{"instance_id": "django__django-13447", "model_name_or_path": "kodit-claude", "model_patch": "diff --git a/..."}

Required fields:

  • instance_id: Must match exactly (e.g., django__django-11049)
  • model_name_or_path: Identifier for tracking (e.g., kodit-claude-sonnet)
  • model_patch: Unified diff format with diff --git header
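
Writing the prediction file is one json.dumps call per record; a minimal sketch:

import json
from pathlib import Path

def write_predictions(predictions: list[dict], path: str = "results/predictions.jsonl") -> None:
    """Serialize one JSON object per line with the three required fields."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w") as f:
        for pred in predictions:
            row = {
                "instance_id": pred["instance_id"],
                "model_name_or_path": pred["model_name_or_path"],
                "model_patch": pred["model_patch"],
            }
            f.write(json.dumps(row) + "\n")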

This file is then passed to the SWE-bench harness:

python -m swebench.harness.run_evaluation \
    --predictions_path results/predictions.jsonl \
    ...

7. Results Format

7.1 Per-Instance Results

{
  "instance_id": "django__django-11049",
  "condition": "kodit",
  "retrieved_files": ["django/db/models/fields/__init__.py"],
  "retrieval_recall": 1.0,
  "generated_patch": "diff --git a/...",
  "resolved": true,
  "fail_to_pass_results": {"test_invalid_string": "PASSED"},
  "latency_ms": 2340
}

7.2 Aggregate Results

{
  "benchmark": "swebench-lite",
  "model": "claude-3-5-sonnet-20241022",
  "timestamp": "2024-01-15T10:30:00Z",
  "conditions": {
    "baseline": {
      "resolve_rate": 0.15,
      "retrieval_recall_5": 0.0,
      "instances_run": 300
    },
    "bm25": {
      "resolve_rate": 0.22,
      "retrieval_recall_5": 0.45,
      "instances_run": 300
    },
    "kodit": {
      "resolve_rate": 0.28,
      "retrieval_recall_5": 0.62,
      "instances_run": 300
    }
  },
  "kodit_delta_vs_baseline": "+13%",
  "kodit_delta_vs_bm25": "+6%"
}
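
The roll-up from per-instance records (7.1) to the aggregate block is straightforward; a sketch:

def aggregate_conditions(per_instance: list[dict]) -> dict:
    """Compute per-condition resolve rate and recall@5 from per-instance results."""
    by_condition: dict[str, list[dict]] = {}
    for record in per_instance:
        by_condition.setdefault(record["condition"], []).append(record)

    summary = {}
    for condition, records in by_condition.items():
        n = len(records)
        summary[condition] = {
            "resolve_rate": round(sum(r["resolved"] for r in records) / n, 2),
            "retrieval_recall_5": round(sum(r["retrieval_recall"] for r in records) / n, 2),
            "instances_run": n,
        }
    return summary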

8. Expected Results

Based on SWE-bench leaderboard data and CodeRAG-Bench findings:

| Condition | Expected Resolve Rate |
|---|---|
| Baseline (no retrieval) | ~15% |
| BM25 retrieval | ~22% |
| Kodit retrieval | ~28% |
| Oracle (gold files) | ~45% |

Key Insight: The gap between BM25 (22%) and Oracle (45%) represents the potential improvement from better retrieval. Kodit’s hybrid search should capture more of this potential than pure BM25.


9. Troubleshooting

Repository cloning fails

  • Ensure network access to GitHub
  • Some repos may require authentication for private forks
  • Use --skip-clone if repos already exist locally

Indexing takes too long

  • Large repos (django, sympy) can take 10-30 minutes
  • Use --repo flag to test with smaller repos first (flask, requests)
  • Pre-indexed caches can be shared across runs

Docker evaluation fails

  • Ensure Docker daemon is running
  • SWE-bench requires significant disk space for containers
  • Use --dry-run to test pipeline without evaluation

API key issues

  • Set appropriate API keys for your LLM provider
  • ANTHROPIC_API_KEY for Claude models
  • OPENAI_API_KEY for OpenAI models

10. Implementation Checklist

| # | Task | Priority | Status |
|---|---|---|---|
| 1 | Create SWEBenchInstance dataclass | High | ✅ DONE |
| 2 | Implement HuggingFace dataset loader (download command) | High | ✅ DONE |
| 3 | Implement prepare-instance command (clone repo, index with Kodit) | High | TODO |
| 4 | Implement Kodit retrieval wrapper | High | TODO |
| 5 | Implement prompt builder | High | TODO |
| 6 | Implement patch generator | High | TODO |
| 7 | Output predictions in SWE-bench JSONL format | High | TODO |
| 8 | Add result aggregation and reporting | Medium | TODO |
| 9 | Add BM25 baseline comparison | Medium | TODO |

Completed

  • SWEBenchInstance (src/benchmark/swebench/instance.py): Immutable dataclass with all SWE-bench fields
  • DatasetLoader (src/benchmark/swebench/loader.py): Downloads from HuggingFace, saves/loads JSON
  • download command: uv run kodit-benchmark download --dataset lite
  • Dataset stored at: benchmarks/data/swebench-lite.json (300 instances)

Next Step

prepare-instance command: Takes a single instance ID, clones the repo at the exact commit, starts a fresh Kodit server, and indexes the repository.

uv run kodit-benchmark prepare-instance django__django-11049

This will:

  1. Look up the instance from the downloaded dataset
  2. Clone django/django to benchmarks/repos/django__django-11049/
  3. Checkout the exact base_commit
  4. Start a fresh Kodit server (using existing start-kodit infrastructure)
  5. Index the repository via Kodit API
  6. Wait for indexing to complete

11. References