- Two comparison modes:
  - Model vs Model: Compare two AI models against each other
  - Model vs Existing: Compare a new model against existing AI-generated tags on your bookmarks
- Fetches existing bookmarks from your Karakeep instance
- Runs tagging inference with AI models
- Random shuffling: Models/tags are randomly assigned to "Model A" or "Model B" for each bookmark to eliminate position bias
- Blind comparison: Model names are hidden during voting (only shown as "Model A" and "Model B")
- Interactive voting interface
- Shows final results with the winner

Setup

Environment Variables

Required environment variables:

# Karakeep API configuration
KARAKEEP_API_KEY=your_api_key_here
KARAKEEP_SERVER_ADDR=https://your-karakeep-instance.com

# Comparison mode (default: model-vs-model)
# - "model-vs-model": Compare two models against each other
# - "model-vs-existing": Compare a model against existing AI tags
COMPARISON_MODE=model-vs-model

# Models to compare
# MODEL1_NAME: The new model to test (always required)
# MODEL2_NAME: The second model to compare against (required only for model-vs-model mode)
MODEL1_NAME=gpt-4o-mini
MODEL2_NAME=claude-3-5-sonnet

# OpenAI/OpenRouter API configuration (for running inference)
OPENAI_API_KEY=your_openai_or_openrouter_key
# OPENAI_BASE_URL is optional and defaults to the OpenAI API
OPENAI_BASE_URL=https://openrouter.ai/api/v1

# Optional: Number of bookmarks to test (default: 10)
COMPARE_LIMIT=10
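
As a rough illustration of how these variables fit together, the sketch below reads and validates them from an environment object. The `CompareConfig` shape and the `loadCompareConfig` name are hypothetical and are not taken from the tool's source; only the variable names, defaults, and required/optional rules come from the list above.

```typescript
// Hypothetical config shape for illustration; not the tool's actual types.
type ComparisonMode = "model-vs-model" | "model-vs-existing";

interface CompareConfig {
  karakeepApiKey: string;
  karakeepServerAddr: string;
  mode: ComparisonMode;
  model1: string;
  model2?: string; // only set in model-vs-model mode
  openaiApiKey: string;
  openaiBaseUrl?: string; // undefined means "use the client's OpenAI default"
  limit: number;
}

function loadCompareConfig(
  env: Record<string, string | undefined>,
): CompareConfig {
  // Fail with a clear message when a required variable is missing or empty.
  const required = (name: string): string => {
    const value = env[name];
    if (!value) {
      throw new Error(`Missing required environment variable: ${name}`);
    }
    return value;
  };

  const rawMode = env["COMPARISON_MODE"] ?? "model-vs-model";
  let mode: ComparisonMode;
  if (rawMode === "model-vs-model" || rawMode === "model-vs-existing") {
    mode = rawMode;
  } else {
    throw new Error(`Invalid COMPARISON_MODE: ${rawMode}`);
  }

  const limit = Number.parseInt(env["COMPARE_LIMIT"] ?? "10", 10);
  if (!Number.isInteger(limit) || limit <= 0) {
    throw new Error("COMPARE_LIMIT must be a positive integer");
  }

  return {
    karakeepApiKey: required("KARAKEEP_API_KEY"),
    karakeepServerAddr: required("KARAKEEP_SERVER_ADDR"),
    mode,
    model1: required("MODEL1_NAME"),
    // MODEL2_NAME is required only when two models are compared.
    model2: mode === "model-vs-model" ? required("MODEL2_NAME") : undefined,
    openaiApiKey: required("OPENAI_API_KEY"),
    openaiBaseUrl: env["OPENAI_BASE_URL"] || undefined,
    limit,
  };
}
```

Calling `loadCompareConfig(process.env)` at startup would be one way to surface configuration mistakes before any bookmarks are fetched.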

Using OpenRouter

For OpenRouter, set:

OPENAI_BASE_URL=https://openrouter.ai/api/v1
OPENAI_API_KEY=your_openrouter_key

Using OpenAI Directly

For OpenAI directly:

OPENAI_API_KEY=your_openai_key
# OPENAI_BASE_URL can be omitted for direct OpenAI

Usage

Run with pnpm (Recommended)

cd tools/compare-models
pnpm install
pnpm run

Run with environment file

Create a .env file:

KARAKEEP_API_KEY=your_api_key
KARAKEEP_SERVER_ADDR=https://your-karakeep-instance.com
MODEL1_NAME=gpt-4o-mini
MODEL2_NAME=claude-3-5-sonnet
OPENAI_API_KEY=your_openai_key
COMPARE_LIMIT=10

Then run:

pnpm run

Using directly with node

If you prefer to run the compiled JavaScript directly:

pnpm build
export KARAKEEP_API_KEY=your_api_key
export KARAKEEP_SERVER_ADDR=https://your-karakeep-instance.com
export MODEL1_NAME=gpt-4o-mini
export MODEL2_NAME=claude-3-5-sonnet
export OPENAI_API_KEY=your_openai_key
node dist/index.js

Comparison Modes

Model vs Model Mode

Compare two different AI models against each other:

COMPARISON_MODE=model-vs-model
MODEL1_NAME=gpt-4o-mini
MODEL2_NAME=claude-3-5-sonnet

This mode runs inference with both models on each bookmark and lets you choose which tags are better.

Model vs Existing Mode

Compare a new model against existing AI-generated tags on your bookmarks:

COMPARISON_MODE=model-vs-existing
MODEL1_NAME=gpt-4o-mini
# MODEL2_NAME is not required in this mode

This mode is useful for:

- Testing if a new model produces better tags than your current model
- Evaluating whether to switch from one model to another
- Quality assurance on existing AI tags

Note: This mode only compares bookmarks that already have AI-generated tags (tags with attachedBy: "ai"). Bookmarks without AI tags are automatically filtered out.
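
The filter described in the note can be pictured roughly as follows. This is a minimal sketch: the `TagRef` and `TaggedBookmark` shapes are simplified, hypothetical stand-ins for the SDK types, and the `"human"` value is an assumption about what non-AI tags carry. Only the `attachedBy: "ai"` check comes from this README.

```typescript
// Simplified, hypothetical shapes; the real SDK types carry more fields.
interface TagRef {
  name: string;
  attachedBy: "ai" | "human"; // "human" is assumed for manually added tags
}

interface TaggedBookmark {
  id: string;
  tags: TagRef[];
}

// Names of the tags that were attached by AI; these act as the
// "existing" contender in model-vs-existing mode.
function existingAiTagNames(bookmark: TaggedBookmark): string[] {
  return bookmark.tags
    .filter((tag) => tag.attachedBy === "ai")
    .map((tag) => tag.name);
}

// Keep only bookmarks that have at least one AI-generated tag.
function withExistingAiTags(bookmarks: TaggedBookmark[]): TaggedBookmark[] {
  return bookmarks.filter((b) => existingAiTagNames(b).length > 0);
}
```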

Usage Flow

The tool fetches your latest link bookmarks from Karakeep
- In model-vs-existing mode, only bookmarks with existing AI tags are included
For each bookmark, it runs tagging and randomly assigns the two resulting tag sets to "Model A" and "Model B"

You'll see a side-by-side comparison (randomly shuffled each time):

=== Bookmark 1/10 ===
How to Build Better AI Systems
https://example.com/article
This article explores modern approaches to...

─────────────────────────────────────

Model A (blind):
  • ai
  • machine-learning
  • engineering

Model B (blind):
  • artificial-intelligence
  • ML
  • software-development

─────────────────────────────────────

Which tags do you prefer? [1=Model A, 2=Model B, s=skip, q=quit] >

Choose your preference:
- 1 - Vote for Model A
- 2 - Vote for Model B
- s or skip - Skip this comparison
- q or quit - Exit early and show current results
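
The mapping from typed input to an action could look something like the sketch below. The `VoteChoice` type and `parseVote` function are hypothetical names, and trimming whitespace and ignoring case are assumptions rather than documented behavior.

```typescript
// Hypothetical names; the accepted inputs mirror the list above.
type VoteChoice = "A" | "B" | "skip" | "quit";

// Returns null for unrecognized input so the caller can ask again.
function parseVote(input: string): VoteChoice | null {
  switch (input.trim().toLowerCase()) {
    case "1":
      return "A";
    case "2":
      return "B";
    case "s":
    case "skip":
      return "skip";
    case "q":
    case "quit":
      return "quit";
    default:
      return null;
  }
}
```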

After completing all comparisons (or quitting early), results are displayed:

───────────────────────────────────────
=== FINAL RESULTS ===
───────────────────────────────────────
gpt-4o-mini: 6 votes
claude-3-5-sonnet: 3 votes
Skipped: 1
Errors: 0
───────────────────────────────────────
Total bookmarks tested: 10

🏆 WINNER: gpt-4o-mini
───────────────────────────────────────

The actual model names are shown only in the final results. During voting you see only "Model A" and "Model B".
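
One way to implement this kind of blind assignment is sketched below. The `Contender` and `BlindPair` types and both function names are hypothetical, and the tool's actual shuffling code may differ. The random source is injectable only so the behavior can be checked deterministically.

```typescript
// Hypothetical types for illustration.
interface Contender {
  label: string; // real model name, or a label for the existing tags
  tags: string[];
}

interface BlindPair {
  modelA: Contender;
  modelB: Contender;
}

// Randomly decide which contender is shown as "Model A" for this bookmark.
function assignBlind(
  first: Contender,
  second: Contender,
  random: () => number = Math.random,
): BlindPair {
  return random() < 0.5
    ? { modelA: first, modelB: second }
    : { modelA: second, modelB: first };
}

// Map a blind vote back to the real label once the vote has been cast.
function revealVote(pair: BlindPair, vote: "A" | "B"): string {
  return vote === "A" ? pair.modelA.label : pair.modelB.label;
}
```

Because the assignment is redrawn for every bookmark, a voter who habitually prefers the first column does not systematically favor one model.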

Bookmark Filtering

The tool currently tests only:

- Link-type bookmarks (not text notes or assets)
- Non-archived bookmarks
- Latest N bookmarks (where N is COMPARE_LIMIT)
- In model-vs-existing mode: Only bookmarks with existing AI tags (tags with attachedBy: "ai")
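
Put together, the selection rules might be expressed along these lines. This is a sketch under assumptions: the `CandidateBookmark` shape is hypothetical and simplified, and applying the filters before taking the latest N is a guess about ordering that this README does not confirm.

```typescript
// Hypothetical, simplified bookmark shape for illustration.
interface CandidateBookmark {
  id: string;
  archived: boolean;
  contentType: "link" | "text" | "asset";
  createdAt: string; // ISO 8601 timestamp
  aiTagCount: number; // number of tags with attachedBy: "ai"
}

function selectForComparison(
  bookmarks: CandidateBookmark[],
  limit: number,
  requireAiTags: boolean, // true in model-vs-existing mode
): CandidateBookmark[] {
  return bookmarks
    .filter((b) => b.contentType === "link")
    .filter((b) => !b.archived)
    .filter((b) => !requireAiTags || b.aiTagCount > 0)
    // Newest first; filter() already returned a copy, so sorting is safe.
    .sort((x, y) => Date.parse(y.createdAt) - Date.parse(x.createdAt))
    .slice(0, limit);
}
```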

Architecture

This tool leverages Karakeep's shared infrastructure:

- API Client: Uses @karakeep/sdk for type-safe API interactions with proper authentication
- Inference: Reuses @karakeep/shared/inference for OpenAI client with structured output support
- Prompts: Uses @karakeep/shared/prompts for consistent tagging prompt generation with token management
- No code duplication: all core functionality is shared with the main Karakeep application

Error Handling

- If a model fails to generate tags for a bookmark, an error is shown and the comparison continues with the next bookmark
- Errors are counted separately in the final results
- Missing required environment variables cause the tool to exit with a clear error message
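
The continue-on-error behavior, combined with sequential processing, can be sketched as a simple loop. All names here (`VoteTally`, `RoundOutcome`, `runComparison`, `pickWinner`) are hypothetical, and treating a tie or zero votes as "no winner" is an assumption since this README does not say how ties are reported.

```typescript
// Hypothetical types for illustration.
interface VoteTally {
  votes: Record<string, number>;
  skipped: number;
  errors: number;
}

type RoundOutcome =
  | { kind: "vote"; winner: string }
  | { kind: "skip" }
  | { kind: "quit" };

// Processes items one at a time; a failed round is counted and skipped over.
async function runComparison<T>(
  items: T[],
  modelNames: string[],
  runRound: (item: T) => Promise<RoundOutcome>,
): Promise<VoteTally> {
  const tally: VoteTally = { votes: {}, skipped: 0, errors: 0 };
  for (const name of modelNames) tally.votes[name] = 0;

  for (const item of items) {
    try {
      const outcome = await runRound(item);
      if (outcome.kind === "quit") break; // exit early, keep current results
      if (outcome.kind === "skip") {
        tally.skipped += 1;
      } else {
        tally.votes[outcome.winner] = (tally.votes[outcome.winner] ?? 0) + 1;
      }
    } catch (err) {
      tally.errors += 1;
      const message = err instanceof Error ? err.message : String(err);
      console.error(`Round failed, continuing: ${message}`);
    }
  }
  return tally;
}

// Returns null when nobody received votes or the top two are tied (assumed).
function pickWinner(tally: VoteTally): string | null {
  const ranked = Object.entries(tally.votes).sort((x, y) => y[1] - x[1]);
  const [top, runnerUp] = ranked;
  if (!top || top[1] === 0) return null;
  if (runnerUp && runnerUp[1] === top[1]) return null;
  return top[0];
}
```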

Build

To compile the tool to JavaScript:

pnpm build

The compiled output will be in dist/index.js.

Notes

- The tool is designed for manual, human-in-the-loop evaluation
- No results are persisted; they are only displayed in the console
- Content is fetched with includeContent=true from the Karakeep API
- Uses the Karakeep SDK (@karakeep/sdk) for type-safe API interactions
- Inference runs sequentially to keep state management simple
- Using pnpm run is recommended for the best experience (it uses tsx for development)
- Random shuffling: For each bookmark, models are randomly assigned to "Model A" or "Model B" to eliminate position bias. The actual model names are only revealed in the final results.