# Model Comparison Tool
A standalone CLI tool to compare the tagging performance of AI models using your existing Karakeep bookmarks.
## Features

- **Two comparison modes**:
  - **Model vs Model**: Compare two AI models against each other
  - **Model vs Existing**: Compare a new model against existing AI-generated tags on your bookmarks
- Fetches existing bookmarks from your Karakeep instance
- Runs tagging inference with AI models
- **Random shuffling**: Models/tags are randomly assigned to "Model A" or "Model B" for each bookmark to eliminate bias (see the sketch after this list)
- **Blind comparison**: Model names are hidden during voting (shown only as "Model A" and "Model B")
- Interactive voting interface
- Shows final results with the winner
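The shuffling behind the blind comparison amounts to a per-bookmark coin flip. Here is a minimal TypeScript sketch; `TagResult` and `shuffleAssignment` are assumed names for illustration, not the tool's actual identifiers:

```typescript
// Illustrative sketch of per-bookmark blind shuffling.
interface TagResult {
  source: string; // the real model name, hidden until the final results
  tags: string[];
}

function shuffleAssignment(
  first: TagResult,
  second: TagResult,
): { modelA: TagResult; modelB: TagResult } {
  // A fresh coin flip for every bookmark removes position bias:
  // neither model is consistently presented as "Model A".
  return Math.random() < 0.5
    ? { modelA: first, modelB: second }
    : { modelA: second, modelB: first };
}
```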
## Setup

### Environment Variables

Required environment variables:
```bash
# Karakeep API configuration
KARAKEEP_API_KEY=your_api_key_here
KARAKEEP_SERVER_ADDR=https://your-karakeep-instance.com

# Comparison mode (default: model-vs-model)
# - "model-vs-model": Compare two models against each other
# - "model-vs-existing": Compare a model against existing AI tags
COMPARISON_MODE=model-vs-model

# Models to compare
# MODEL1_NAME: The new model to test (always required)
# MODEL2_NAME: The second model to compare against (required only for model-vs-model mode)
MODEL1_NAME=gpt-4o-mini
MODEL2_NAME=claude-3-5-sonnet

# OpenAI/OpenRouter API configuration (for running inference)
OPENAI_API_KEY=your_openai_or_openrouter_key
OPENAI_BASE_URL=https://openrouter.ai/api/v1 # Optional, defaults to OpenAI

# Optional: Number of bookmarks to test (default: 10)
COMPARE_LIMIT=10
```
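How the tool loads this configuration is not shown here, but reading and validating these variables might look roughly like the following sketch (the `requireEnv` helper is an assumption for illustration; the exit-on-missing behavior is described under "Error Handling" below):

```typescript
// Hypothetical config loader for the variables above.
function requireEnv(name: string): string {
  const value = process.env[name];
  if (!value) {
    console.error(`Missing required environment variable: ${name}`);
    process.exit(1);
  }
  return value;
}

const config = {
  karakeepApiKey: requireEnv("KARAKEEP_API_KEY"),
  karakeepServerAddr: requireEnv("KARAKEEP_SERVER_ADDR"),
  comparisonMode: process.env.COMPARISON_MODE ?? "model-vs-model",
  model1Name: requireEnv("MODEL1_NAME"),
  model2Name: process.env.MODEL2_NAME, // required only for model-vs-model
  compareLimit: Number(process.env.COMPARE_LIMIT ?? "10"),
};
```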
### Using OpenRouter

For OpenRouter, set:

```bash
OPENAI_BASE_URL=https://openrouter.ai/api/v1
OPENAI_API_KEY=your_openrouter_key
```
### Using OpenAI Directly

For OpenAI directly:

```bash
OPENAI_API_KEY=your_openai_key
# OPENAI_BASE_URL can be omitted for direct OpenAI
```
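In both cases the tool talks to an OpenAI-compatible API at the configured base URL. The actual client comes from `@karakeep/shared/inference`, but the underlying configuration is roughly equivalent to this sketch using the standard `openai` package:

```typescript
import OpenAI from "openai";

// One client construction covers both providers: OpenRouter exposes an
// OpenAI-compatible API, and leaving baseURL undefined falls back to
// the default OpenAI endpoint.
const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: process.env.OPENAI_BASE_URL, // undefined => api.openai.com
});
```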
## Usage

### Run with pnpm (Recommended)

```bash
cd tools/compare-models
pnpm install
pnpm run
```
### Run with an environment file

Create a `.env` file:

```bash
KARAKEEP_API_KEY=your_api_key
KARAKEEP_SERVER_ADDR=https://your-karakeep-instance.com
MODEL1_NAME=gpt-4o-mini
MODEL2_NAME=claude-3-5-sonnet
OPENAI_API_KEY=your_openai_key
COMPARE_LIMIT=10
```

Then run:

```bash
pnpm run
```
### Run directly with Node

If you prefer to run the compiled JavaScript directly:

```bash
pnpm build
export KARAKEEP_API_KEY=your_api_key
export KARAKEEP_SERVER_ADDR=https://your-karakeep-instance.com
export MODEL1_NAME=gpt-4o-mini
export MODEL2_NAME=claude-3-5-sonnet
export OPENAI_API_KEY=your_openai_key
node dist/index.js
```
## Comparison Modes

### Model vs Model Mode

Compare two different AI models against each other:

```bash
COMPARISON_MODE=model-vs-model
MODEL1_NAME=gpt-4o-mini
MODEL2_NAME=claude-3-5-sonnet
```
This mode runs inference with both models on each bookmark and lets you choose which tags are better.
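Conceptually, the per-bookmark work in this mode looks like the sketch below; `generateTags` and the `Bookmark` shape are stand-ins for the shared inference helper and SDK types, not real exports:

```typescript
// Stand-ins for the shared inference helper and the bookmark shape.
type Bookmark = { title: string; url: string; content: string };
declare function generateTags(model: string, bookmark: Bookmark): Promise<string[]>;

// Both models tag identical content; the two results are then shuffled
// behind the blind "Model A" / "Model B" labels before voting.
async function tagWithBothModels(bookmark: Bookmark) {
  const model1Tags = await generateTags(process.env.MODEL1_NAME!, bookmark);
  const model2Tags = await generateTags(process.env.MODEL2_NAME!, bookmark);
  return { model1Tags, model2Tags };
}
```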
### Model vs Existing Mode

Compare a new model against existing AI-generated tags on your bookmarks:

```bash
COMPARISON_MODE=model-vs-existing
MODEL1_NAME=gpt-4o-mini
# MODEL2_NAME is not required in this mode
```
This mode is useful for:

- Testing whether a new model produces better tags than your current model
- Evaluating whether to switch from one model to another
- Quality assurance on existing AI tags

**Note**: This mode only compares bookmarks that already have AI-generated tags (tags with `attachedBy: "ai"`). Bookmarks without AI tags are automatically filtered out.
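The existing-tag side of the comparison can be pictured as a simple filter over each bookmark's tags. The tag shape below is an assumption based on the `attachedBy` field mentioned above:

```typescript
// Assumed tag shape for illustration; only the attachedBy field matters here.
type BookmarkTag = { name: string; attachedBy: "ai" | "human" };

// Collects the tags that were attached by AI; bookmarks where this
// comes back empty are filtered out of the comparison.
function existingAiTags(tags: BookmarkTag[]): string[] {
  return tags.filter((tag) => tag.attachedBy === "ai").map((tag) => tag.name);
}
```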
## Usage Flow

1. The tool fetches your latest link bookmarks from Karakeep.
   - In model-vs-existing mode, only bookmarks with existing AI tags are included.
2. For each bookmark, it randomly assigns the options to "Model A" or "Model B" and runs tagging.
3. You'll see a side-by-side comparison (randomly shuffled each time):

   ```
   === Bookmark 1/10 ===
   How to Build Better AI Systems
   https://example.com/article
   This article explores modern approaches to...
   ─────────────────────────────────────
   Model A (blind):
     • ai
     • machine-learning
     • engineering

   Model B (blind):
     • artificial-intelligence
     • ML
     • software-development
   ─────────────────────────────────────
   Which tags do you prefer? [1=Model A, 2=Model B, s=skip, q=quit]
   >
   ```

4. Choose your preference:
   - `1` - Vote for Model A
   - `2` - Vote for Model B
   - `s` or `skip` - Skip this comparison
   - `q` or `quit` - Exit early and show current results
5. After completing all comparisons (or quitting early), results are displayed:

   ```
   ───────────────────────────────────────
   === FINAL RESULTS ===
   ───────────────────────────────────────
   gpt-4o-mini: 6 votes
   claude-3-5-sonnet: 3 votes
   Skipped: 1
   Errors: 0
   ───────────────────────────────────────
   Total bookmarks tested: 10

   🏆 WINNER: gpt-4o-mini
   ───────────────────────────────────────
   ```

6. The actual model names are only shown in the final results; during voting you see only "Model A" and "Model B".
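Because the blind labels are shuffled per bookmark, each vote has to be mapped back to the underlying model before it is tallied. A sketch of that accounting, with assumed names:

```typescript
// Hypothetical vote accounting; "assignment" carries the real model names
// hidden behind the blind labels for the current bookmark.
type Tally = Record<string, number>;

function recordVote(
  tally: Tally,
  vote: "1" | "2" | "s" | "q",
  assignment: { modelA: string; modelB: string },
): boolean {
  if (vote === "q") return false; // quit early; caller shows current results
  if (vote === "1") tally[assignment.modelA] = (tally[assignment.modelA] ?? 0) + 1;
  if (vote === "2") tally[assignment.modelB] = (tally[assignment.modelB] ?? 0) + 1;
  return true; // "s" falls through: nothing is counted for a skip
}
```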
## Bookmark Filtering

The tool currently tests only:

- Link-type bookmarks (not text notes or assets)
- Non-archived bookmarks
- The latest N bookmarks (where N is `COMPARE_LIMIT`)
- In model-vs-existing mode: only bookmarks with existing AI tags (tags with `attachedBy: "ai"`)
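These rules amount to a small eligibility predicate. The field names below are assumptions modeled on the bookmark shape, not exact SDK types:

```typescript
// Assumed bookmark shape for illustration.
interface BookmarkLike {
  type: "link" | "text" | "asset";
  archived: boolean;
  tags: { attachedBy: "ai" | "human" }[];
}

function isEligible(bookmark: BookmarkLike, mode: string): boolean {
  if (bookmark.type !== "link" || bookmark.archived) return false;
  if (mode === "model-vs-existing") {
    // Only bookmarks that already carry AI-attached tags can be compared.
    return bookmark.tags.some((tag) => tag.attachedBy === "ai");
  }
  return true;
}
```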
## Architecture

This tool leverages Karakeep's shared infrastructure:

- **API Client**: Uses `@karakeep/sdk` for type-safe API interactions with proper authentication
- **Inference**: Reuses `@karakeep/shared/inference` for the OpenAI client with structured output support
- **Prompts**: Uses `@karakeep/shared/prompts` for consistent tagging prompt generation with token management
- No code duplication: all core functionality is shared with the main Karakeep application
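For orientation, the SDK call boils down to an authenticated request against the Karakeep API. The sketch below shows the rough shape of that request; the endpoint path and auth scheme are assumptions about what `@karakeep/sdk` wraps, while `includeContent=true` is the query flag noted under "Notes" below:

```typescript
// Illustrative only: the /api/v1/bookmarks path and Bearer auth are
// assumptions; the typed SDK client hides these details.
const response = await fetch(
  `${process.env.KARAKEEP_SERVER_ADDR}/api/v1/bookmarks?limit=10&includeContent=true`,
  { headers: { Authorization: `Bearer ${process.env.KARAKEEP_API_KEY}` } },
);
const { bookmarks } = await response.json();
```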
## Error Handling

- If a model fails to generate tags for a bookmark, an error is shown and the comparison continues
- Errors are counted separately in the final results
- Missing required environment variables will cause the tool to exit with a clear error message
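A minimal sketch of that per-bookmark error handling, with `compareBookmark` as a stand-in for the real per-bookmark work:

```typescript
// Stand-ins for the real loop inputs.
declare const bookmarks: { url: string }[];
declare function compareBookmark(bookmark: { url: string }): Promise<void>;

let errors = 0;

for (const bookmark of bookmarks) {
  try {
    await compareBookmark(bookmark);
  } catch (err) {
    errors += 1; // counted separately and reported in the final results
    console.error(`Tagging failed for ${bookmark.url}:`, err);
    // The comparison simply continues with the next bookmark.
  }
}
```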
## Build

To build the tool:

```bash
pnpm build
```

The compiled output will be in `dist/index.js`.
## Notes

- The tool is designed for manual, human-in-the-loop evaluation
- No results are persisted; they are only displayed in the console
- Content is fetched with `includeContent=true` from the Karakeep API
- Uses the Karakeep SDK (`@karakeep/sdk`) for type-safe API interactions
- Inference runs sequentially to keep state management simple (see the sketch below)
- Using `pnpm run` is recommended for the best experience (it uses tsx for development)
- **Random shuffling**: For each bookmark, models are randomly assigned to "Model A" or "Model B" to eliminate position bias. The actual model names are only revealed in the final results.
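On the sequential-inference point above, the processing loop awaits each bookmark in turn rather than fanning out with `Promise.all`. A sketch of the trade-off, with assumed names:

```typescript
declare const bookmarks: { url: string }[];
declare function processOne(bookmark: { url: string }): Promise<void>;

// Sequential: one bookmark at a time, so vote/error state updates never
// interleave and the interactive prompt stays in order.
for (const bookmark of bookmarks) {
  await processOne(bookmark);
}

// The concurrent alternative would be faster but would complicate the
// interactive, stateful voting flow:
// await Promise.all(bookmarks.map(processOne));
```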