Model Comparison Tool

A standalone CLI tool to compare the tagging performance of AI models using your existing Karakeep bookmarks.

Features

  • Two comparison modes:
    • Model vs Model: Compare two AI models against each other
    • Model vs Existing: Compare a new model against existing AI-generated tags on your bookmarks
  • Fetches existing bookmarks from your Karakeep instance
  • Runs tagging inference with AI models
  • Random shuffling: Models/tags are randomly assigned to "Model A" or "Model B" for each bookmark to eliminate bias
  • Blind comparison: Model names are hidden during voting (only shown as "Model A" and "Model B")
  • Interactive voting interface
  • Shows final results with winner

Setup

Environment Variables

The tool is configured through environment variables. All are required unless marked optional or given a default:

# Karakeep API configuration
KARAKEEP_API_KEY=your_api_key_here
KARAKEEP_SERVER_ADDR=https://your-karakeep-instance.com

# Comparison mode (default: model-vs-model)
# - "model-vs-model": Compare two models against each other
# - "model-vs-existing": Compare a model against existing AI tags
COMPARISON_MODE=model-vs-model

# Models to compare
# MODEL1_NAME: The new model to test (always required)
# MODEL2_NAME: The second model to compare against (required only for model-vs-model mode)
MODEL1_NAME=gpt-4o-mini
MODEL2_NAME=claude-3-5-sonnet

# OpenAI/OpenRouter API configuration (for running inference)
# OPENAI_BASE_URL is optional and defaults to OpenAI
OPENAI_API_KEY=your_openai_or_openrouter_key
OPENAI_BASE_URL=https://openrouter.ai/api/v1

# Optional: Number of bookmarks to test (default: 10)
COMPARE_LIMIT=10

Using OpenRouter

For OpenRouter, set:

OPENAI_BASE_URL=https://openrouter.ai/api/v1
OPENAI_API_KEY=your_openrouter_key

Using OpenAI Directly

For OpenAI directly:

OPENAI_API_KEY=your_openai_key
# OPENAI_BASE_URL can be omitted for direct OpenAI
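
The configuration rules in this section can be sketched as a small validation function. This is a minimal illustration, not the tool's actual code: the `CompareConfig` shape, the `loadConfig` name, and the error messages are assumptions, and the environment is passed in as a plain record so the function stays easy to test.

```typescript
// Hypothetical config shape; field names are illustrative, not the tool's own.
type ComparisonMode = "model-vs-model" | "model-vs-existing";

interface CompareConfig {
  apiKey: string;
  serverAddr: string;
  mode: ComparisonMode;
  model1: string;
  model2: string | undefined;
  openaiApiKey: string;
  openaiBaseUrl: string | undefined;
  limit: number;
}

function isComparisonMode(value: string): value is ComparisonMode {
  return value === "model-vs-model" || value === "model-vs-existing";
}

// `env` would typically be process.env; it is a parameter here (assumption).
function loadConfig(env: Record<string, string | undefined>): CompareConfig {
  const required = (name: string): string => {
    const value = env[name];
    if (!value) {
      throw new Error(`Missing required environment variable: ${name}`);
    }
    return value;
  };

  // COMPARISON_MODE defaults to model-vs-model, as documented above.
  const mode = env["COMPARISON_MODE"] ?? "model-vs-model";
  if (!isComparisonMode(mode)) {
    throw new Error(`Unknown COMPARISON_MODE: ${mode}`);
  }

  // COMPARE_LIMIT defaults to 10, as documented above.
  const limit = Number(env["COMPARE_LIMIT"] ?? "10");
  if (!Number.isInteger(limit) || limit <= 0) {
    throw new Error("COMPARE_LIMIT must be a positive integer");
  }

  return {
    apiKey: required("KARAKEEP_API_KEY"),
    serverAddr: required("KARAKEEP_SERVER_ADDR"),
    mode,
    model1: required("MODEL1_NAME"),
    // MODEL2_NAME is only required when two models are compared.
    model2: mode === "model-vs-model" ? required("MODEL2_NAME") : env["MODEL2_NAME"],
    openaiApiKey: required("OPENAI_API_KEY"),
    openaiBaseUrl: env["OPENAI_BASE_URL"],
    limit,
  };
}
```

Throwing on the first missing variable keeps the failure message specific, which matches the "clear error message" behavior the tool aims for.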

Usage

cd tools/compare-models
pnpm install
pnpm run

Run with environment file

Create a .env file:

KARAKEEP_API_KEY=your_api_key
KARAKEEP_SERVER_ADDR=https://your-karakeep-instance.com
MODEL1_NAME=gpt-4o-mini
MODEL2_NAME=claude-3-5-sonnet
OPENAI_API_KEY=your_openai_key
COMPARE_LIMIT=10

Then run:

pnpm run

Using directly with node

If you prefer to run the compiled JavaScript directly:

pnpm build
export KARAKEEP_API_KEY=your_api_key
export KARAKEEP_SERVER_ADDR=https://your-karakeep-instance.com
export MODEL1_NAME=gpt-4o-mini
export MODEL2_NAME=claude-3-5-sonnet
export OPENAI_API_KEY=your_openai_key
node dist/index.js

Comparison Modes

Model vs Model Mode

Compare two different AI models against each other:

COMPARISON_MODE=model-vs-model
MODEL1_NAME=gpt-4o-mini
MODEL2_NAME=claude-3-5-sonnet

This mode runs inference with both models on each bookmark and lets you choose which tags are better.

Model vs Existing Mode

Compare a new model against existing AI-generated tags on your bookmarks:

COMPARISON_MODE=model-vs-existing
MODEL1_NAME=gpt-4o-mini
# MODEL2_NAME is not required in this mode

This mode is useful for:

  • Testing if a new model produces better tags than your current model
  • Evaluating whether to switch from one model to another
  • Quality assurance on existing AI tags

Note: This mode only compares bookmarks that already have AI-generated tags (tags with attachedBy: "ai"). Bookmarks without AI tags are automatically filtered out.

Usage Flow

  1. The tool fetches your latest link bookmarks from Karakeep

    • In model-vs-existing mode, only bookmarks with existing AI tags are included
  2. For each bookmark, it runs tagging with the configured model(s) and randomly assigns the two resulting tag sets to "Model A" and "Model B"

  3. You'll see a side-by-side comparison (randomly shuffled each time):

    === Bookmark 1/10 ===
    How to Build Better AI Systems
    https://example.com/article
    This article explores modern approaches to...
    
    ─────────────────────────────────────
    
    Model A (blind):
      • ai
      • machine-learning
      • engineering
    
    Model B (blind):
      • artificial-intelligence
      • ML
      • software-development
    
    ─────────────────────────────────────
    
    Which tags do you prefer? [1=Model A, 2=Model B, s=skip, q=quit] >
    
  4. Choose your preference:

    • 1 - Vote for Model A
    • 2 - Vote for Model B
    • s or skip - Skip this comparison
    • q or quit - Exit early and show current results
  5. After completing all comparisons (or quitting early), results are displayed:

    ───────────────────────────────────────
    === FINAL RESULTS ===
    ───────────────────────────────────────
    gpt-4o-mini: 6 votes
    claude-3-5-sonnet: 3 votes
    Skipped: 1
    Errors: 0
    ───────────────────────────────────────
    Total bookmarks tested: 10
    
    🏆 WINNER: gpt-4o-mini
    ───────────────────────────────────────
    
  6. The actual model names are shown only in the final results. During voting, you see only "Model A" and "Model B"
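
The blind assignment in steps 2 to 6 can be sketched roughly as follows. The `Candidate` and `BlindPair` types and both function names are hypothetical, introduced only for illustration. The random source is injectable so the behavior can be checked deterministically.

```typescript
// One side of a comparison: where the tags came from and the tags themselves.
// `source` is a model name or a label for existing tags (illustrative only).
interface Candidate {
  source: string;
  tags: string[];
}

interface BlindPair {
  modelA: Candidate;
  modelB: Candidate;
}

// Randomly decide which candidate is displayed as "Model A".
// `random` defaults to Math.random and must return a number in [0, 1).
function assignBlind(
  first: Candidate,
  second: Candidate,
  random: () => number = Math.random,
): BlindPair {
  return random() < 0.5
    ? { modelA: first, modelB: second }
    : { modelA: second, modelB: first };
}

// Map a blind vote ("1" = Model A, "2" = Model B) back to the real source.
function resolveVote(pair: BlindPair, vote: "1" | "2"): string {
  return vote === "1" ? pair.modelA.source : pair.modelB.source;
}
```

Because the assignment is redrawn for every bookmark, a consistent preference for the left or right column does not systematically favor either model.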

Bookmark Filtering

The tool currently tests only:

  • Link-type bookmarks (not text notes or assets)
  • Non-archived bookmarks
  • Latest N bookmarks (where N is COMPARE_LIMIT)
  • In model-vs-existing mode: Only bookmarks with existing AI tags (tags with attachedBy: "ai")
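
As an illustration of these filtering rules, here is a sketch over a simplified bookmark shape. The `Bookmark` and `Tag` types below are assumptions and do not mirror the Karakeep SDK types. Whether the limit is applied before or after the AI-tag filter is also an assumption (applied after, here).

```typescript
// Simplified, hypothetical shapes for illustration only.
interface Tag {
  name: string;
  attachedBy: "ai" | "human";
}

interface Bookmark {
  id: string;
  type: "link" | "text" | "asset";
  archived: boolean;
  createdAt: string; // ISO 8601 timestamp
  tags: Tag[];
}

function selectBookmarks(
  bookmarks: Bookmark[],
  limit: number,
  mode: "model-vs-model" | "model-vs-existing",
): Bookmark[] {
  return bookmarks
    // Only link-type, non-archived bookmarks are tested.
    .filter((b) => b.type === "link" && !b.archived)
    // In model-vs-existing mode, keep only bookmarks that already have AI tags.
    .filter(
      (b) =>
        mode !== "model-vs-existing" ||
        b.tags.some((t) => t.attachedBy === "ai"),
    )
    // Newest first, then take the latest N.
    .sort((a, b) => Date.parse(b.createdAt) - Date.parse(a.createdAt))
    .slice(0, limit);
}
```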

Architecture

This tool leverages Karakeep's shared infrastructure:

  • API Client: Uses @karakeep/sdk for type-safe API interactions with proper authentication
  • Inference: Reuses @karakeep/shared/inference for OpenAI client with structured output support
  • Prompts: Uses @karakeep/shared/prompts for consistent tagging prompt generation with token management
  • No code duplication - all core functionality is shared with the main Karakeep application

Error Handling

  • If a model fails to generate tags for a bookmark, an error is shown and comparison continues
  • Errors are counted separately in final results
  • Missing required environment variables will cause the tool to exit with a clear error message
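
The "continue on error, count errors separately" behavior could look roughly like the loop below. This is a sketch under stated assumptions: `Outcome`, `Tally`, `runComparisons`, and `pickWinner` are hypothetical names, `compareOne` stands in for inference plus the interactive vote, and returning `null` on a tie is a choice made for this example, since the handling of ties is not described here.

```typescript
type Outcome =
  | { kind: "vote"; winner: string }
  | { kind: "skip" }
  | { kind: "quit" };

interface Tally {
  votes: Record<string, number>;
  skipped: number;
  errors: number;
}

// Process items one at a time. A failure in one comparison is reported through
// `onError`, counted, and does not stop the remaining comparisons.
async function runComparisons<T>(
  items: T[],
  compareOne: (item: T) => Promise<Outcome>,
  onError: (item: T, error: unknown) => void = () => {},
): Promise<Tally> {
  const tally: Tally = { votes: {}, skipped: 0, errors: 0 };
  for (const item of items) {
    try {
      const outcome = await compareOne(item);
      if (outcome.kind === "quit") break; // exit early, keep current results
      if (outcome.kind === "skip") {
        tally.skipped += 1;
      } else {
        tally.votes[outcome.winner] = (tally.votes[outcome.winner] ?? 0) + 1;
      }
    } catch (error) {
      onError(item, error);
      tally.errors += 1;
    }
  }
  return tally;
}

// Returns the name with the most votes, or null when there are no votes
// or the top count is shared (assumed tie handling).
function pickWinner(tally: Tally): string | null {
  let best: string | null = null;
  let bestVotes = 0;
  let tied = false;
  for (const [name, count] of Object.entries(tally.votes)) {
    if (count > bestVotes) {
      best = name;
      bestVotes = count;
      tied = false;
    } else if (count === bestVotes && count > 0) {
      tied = true;
    }
  }
  return tied ? null : best;
}
```

Awaiting each comparison inside the loop keeps inference sequential, so the tally is only ever updated from one place.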

Build

To compile the tool to JavaScript:

pnpm build

The compiled output is written to dist/index.js and can be run with node.

Notes

  • The tool is designed for manual, human-in-the-loop evaluation
  • No results are persisted; they are only displayed in the console
  • Content is fetched with includeContent=true from Karakeep API
  • Uses Karakeep SDK (@karakeep/sdk) for type-safe API interactions
  • Inference runs sequentially to keep state management simple
  • Running via pnpm run is recommended for the best experience (it uses tsx for development)
  • Random shuffling: For each bookmark, models are randomly assigned to "Model A" or "Model B" to eliminate position bias. The actual model names are only revealed in the final results.