63 Commits

Author SHA1 Message Date
Nick Sweeting
0a2ac11b01
more binary fixes 2026-01-05 02:26:33 -08:00
Nick Sweeting
b80e80439d
more binary fixes 2026-01-05 02:18:38 -08:00
Nick Sweeting
7ceaeae2d9
rename archive_org to archivedotorg, add BinaryWorker, fix config pass-through 2026-01-04 22:38:15 -08:00
Nick Sweeting
456aaee287
more migration id/uuid and config propagation fixes 2026-01-04 16:16:26 -08:00
Nick Sweeting
839ae744cf
simplify entrypoints for orchestrator and workers 2026-01-04 13:17:07 -08:00
Nick Sweeting
5449971777
better kill tree 2026-01-02 04:33:41 -08:00
Nick Sweeting
dd77511026
unified Process source of truth and better screenshot tests 2026-01-02 04:20:34 -08:00
Nick Sweeting
65ee09ceab
move tests into subfolder, add missing install hooks 2026-01-02 00:22:07 -08:00
Nick Sweeting
876feac522
actually working migration path from 0.7.2 and 0.8.6 + renames and test coverage 2026-01-01 15:50:00 -08:00
Nick Sweeting
a04e4a7345
cleanup migrations, json, jsonl 2025-12-31 15:36:43 -08:00
Nick Sweeting
f12c3b4b55
less healthstats 2025-12-31 12:34:32 -08:00
Nick Sweeting
bd757188e4
keep stripping healthstats from iface and other things 2025-12-31 12:34:31 -08:00
Nick Sweeting
73fde81fce
more migration tweaks 2025-12-31 12:34:31 -08:00
Claude
9bf7a520a0
Update tests for new Process model-based architecture
- Remove pid_utils tests (module deleted in dev)
- Update orchestrator tests to use Process model for tracking
- Add tests for Process.current(), cleanup_stale_running(), terminate()
- Add tests for Process hierarchy (parent/child, root, depth)
- Add tests for Process.get_running(), get_running_count()
- Add tests for ProcessMachine state machine
- Update machine model tests to match current API (from_jsonl vs from_json)
2025-12-31 11:51:42 +00:00
Claude
a063d8cd43
Merge remote-tracking branch 'origin/dev' into claude/analyze-test-coverage-mWgwv 2025-12-31 11:45:22 +00:00
claude[bot]
b2132d1f14 Fix cubic review issues: process_type detection, cmd storage, PID cleanup, and migration
- Fix Process.current() to store psutil cmdline instead of sys.argv for accurate validation
- Fix worker process_type detection: explicitly set to WORKER after registration
- Fix ArchiveResultWorker.start() to use Process.TypeChoices.WORKER consistently
- Fix migration to be explicitly irreversible (SQLite doesn't support DROP COLUMN)
- Fix get_running_workers() to return process_id instead of incorrectly named worker_id
- Fix safe_kill_process() to wait for termination and escalate to SIGKILL if needed
- Fix migration to include all indexes in state_operations (parent_id, process_type)
- Fix documentation to use Machine.current() scoping and StatusChoices constants

Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
2025-12-31 11:42:07 +00:00
Claude
0cb5f0712d
Add comprehensive tests for machine/process models, orchestrator, and search backends
This adds new test coverage for previously untested areas:

Machine module (archivebox/machine/tests/):
- Machine, NetworkInterface, Binary, Process model tests
- BinaryMachine and ProcessMachine state machine tests
- JSONL serialization/deserialization tests
- Manager method tests

Workers module (archivebox/workers/tests/):
- PID file utility tests (write, read, cleanup)
- Orchestrator lifecycle and queue management tests
- Worker spawning logic tests
- Idle detection and exit condition tests

Search backends:
- SQLite FTS5 search tests with real indexed content
- Phrase search, stemming, and unicode support
- Ripgrep search tests with archive directory structure
- Environment variable configuration tests

Binary provider plugins:
- pip provider hook tests
- npm provider hook tests with PATH updates
- apt provider hook tests
2025-12-31 11:33:27 +00:00
claude[bot]
5121b0e5f9 Merge branch 'dev' into claude/refactor-process-management-WcQyZ
Resolved conflicts by keeping Process model changes and accepting dev changes for unrelated files. Ensured pid_utils.py remains deleted as intended by this PR.

Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
2025-12-31 11:28:47 +00:00
claude[bot]
ee201a0f83 Fix code review issues in process management refactor
- Add pwd validation in Process.launch() to prevent crashes
- Fix psutil returncode handling (use wait() return value, not returncode attr)
- Add None check for proc.pid in cleanup_stale_running()
- Add stale process cleanup in Orchestrator.is_running()
- Ensure orchestrator process_type is correctly set to ORCHESTRATOR
- Fix KeyboardInterrupt handling (exit code 0 for graceful shutdown)
- Throttle cleanup_stale_running() to once per 30 seconds for performance
- Fix worker process_type to use TypeChoices.WORKER consistently
- Fix get_running_workers() API to return list of dicts (not Process objects)
- Only delete PID files after successful kill or confirmed stale
- Fix migration index names to match between SQL and Django state
- Remove db_index=True from process_type (index created manually)
- Update documentation to reflect actual implementation
- Add explanatory comments to empty except blocks
- Fix exit codes to use Unix convention (128 + signal number)

Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
2025-12-31 11:14:47 +00:00
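The Unix exit-code convention named in this commit (128 + signal number) can be illustrated with a minimal sketch. The helper name `exit_code_for_signal` is hypothetical and is not taken from the ArchiveBox codebase; it only shows the arithmetic.

```python
import signal


def exit_code_for_signal(signum: int) -> int:
    """Return the shell-style exit code for a process ended by signal `signum`.

    Hypothetical helper for illustration: shells report 128 + N for a
    process that was terminated by signal number N.
    """
    return 128 + int(signum)


if __name__ == "__main__":
    # SIGTERM is signal 15 and SIGINT is signal 2.
    print(exit_code_for_signal(signal.SIGTERM))
    print(exit_code_for_signal(signal.SIGINT))
```

Note that the same commit treats a KeyboardInterrupt as a graceful shutdown with exit code 0, so a mapping like this would presumably apply only to the non-graceful paths.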
Claude
b822352fc3
Delete pid_utils.py and migrate to Process model
DELETED:
- workers/pid_utils.py (-192 lines) - replaced by Process model methods

SIMPLIFIED:
- crawls/models.py Crawl.cleanup() (80 lines -> 10 lines)
- hooks.py: deleted process_is_alive() and kill_process() (-45 lines)

UPDATED to use Process model:
- core/models.py: Snapshot.cleanup() and has_running_background_hooks()
- machine/models.py: Binary.cleanup()
- workers/worker.py: Worker.on_startup/shutdown, get_running_workers, start
- workers/orchestrator.py: Orchestrator.on_startup/shutdown, is_running

All subprocess management now uses:
- Process.current() for registering current process
- Process.get_running() / get_running_count() for querying
- Process.cleanup_stale_running() for cleanup
- safe_kill_process() for validated PID killing

Total line reduction: ~250 lines
2025-12-31 10:15:22 +00:00
Claude
2d3a2fec57
Add terminate, kill_tree, and query methods to Process model
This consolidates scattered subprocess management logic into the Process model:

- terminate(): Graceful SIGTERM → wait → SIGKILL (replaces stop_worker, etc.)
- kill_tree(): Kill process and all OS children (replaces os.killpg logic)
- kill_children_db(): Kill DB-tracked child processes
- get_running(): Query running processes by type (replaces get_all_worker_pids)
- get_running_count(): Count running processes (replaces get_running_worker_count)
- stop_all(): Stop all processes of a type
- get_next_worker_id(): Get next worker ID for spawning

Added Phase 8 to TODO documenting ~390 lines that can be deleted after
consolidation, including workers/pid_utils.py which becomes obsolete.

Also includes migration 0002 for parent FK and process_type fields.
2025-12-31 10:08:45 +00:00
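The graceful SIGTERM → wait → SIGKILL escalation described for terminate() might look roughly like the sketch below. It uses a plain `subprocess.Popen` handle as a stand-in for the Process model, and the `grace_seconds` parameter is an assumption made for illustration, not the actual signature.

```python
import subprocess
import sys


def terminate(proc: subprocess.Popen, grace_seconds: float = 5.0) -> int:
    """Ask `proc` to exit, then force-kill it if it ignores the request.

    Illustrative stand-in only; the real method lives on the Process model.
    """
    if proc.poll() is not None:
        # The process has already exited, nothing to do.
        return proc.returncode
    proc.terminate()  # sends SIGTERM on POSIX
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # sends SIGKILL on POSIX
        return proc.wait()


if __name__ == "__main__":
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    terminate(child, grace_seconds=2.0)
    print("child exited:", child.poll() is not None)
```

Waiting after the final kill reaps the child, so no zombie process is left behind.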
Nick Sweeting
95d61b001e
fix migrations 2025-12-31 01:40:59 -08:00
Nick Sweeting
3d8c62ffb1
fix extensions dir paths, add personas migration 2025-12-31 01:40:59 -08:00
Nick Sweeting
dd2302ad92
new jsonl cli interface 2025-12-30 16:12:53 -08:00
Nick Sweeting
ba8c28a866
use process_set for related name not processes 2025-12-30 12:55:23 -08:00
Claude
bc273c5a7f
feat: add schema_version to JSONL outputs and remove dead code
- Add schema_version (archivebox.VERSION) to all to_jsonl() outputs:
  - Snapshot.to_jsonl()
  - ArchiveResult.to_jsonl()
  - Binary.to_jsonl()
  - Process.to_jsonl()

- Update CLI commands to use model methods directly:
  - archivebox_snapshot.py: snapshot.to_jsonl()
  - archivebox_extract.py: result.to_jsonl()

- Remove dead wrapper functions from misc/jsonl.py:
  - snapshot_to_jsonl()
  - archiveresult_to_jsonl()
  - binary_to_jsonl()
  - process_to_jsonl()
  - machine_to_jsonl()

- Update tests to use model methods directly
2025-12-30 19:24:53 +00:00
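As a rough illustration of the schema_version change, the sketch below stamps a version onto a serialized record. The `SnapshotSketch` class, its fields, the `type` key, the dict return type, and the placeholder `VERSION` value are all assumptions; the commit itself only states that to_jsonl() outputs carry schema_version set to archivebox.VERSION.

```python
import json
from dataclasses import dataclass

# Placeholder value; the commit uses archivebox.VERSION here.
VERSION = "0.0.0"


@dataclass
class SnapshotSketch:
    """Hypothetical stand-in for a model with a to_jsonl() method."""

    id: str
    url: str

    def to_jsonl(self) -> dict:
        # Every record carries the version of the schema that produced it.
        return {
            "type": "Snapshot",
            "schema_version": VERSION,
            "id": self.id,
            "url": self.url,
        }


if __name__ == "__main__":
    record = SnapshotSketch(id="example-id", url="https://example.com").to_jsonl()
    # One JSON object per line is the JSONL convention.
    print(json.dumps(record))
```

Stamping each record lets a reader decide how to parse a line without guessing which release wrote it.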
Claude
a5206e7648
refactor: move to_jsonl() methods to models
Move JSONL serialization from standalone functions to model methods
to mirror the from_jsonl() pattern:

- Add Binary.to_jsonl() method
- Add Process.to_jsonl() method
- Add ArchiveResult.to_jsonl() method
- Add Snapshot.to_jsonl() method
- Update write_index_jsonl() to use model methods
- Update jsonl.py functions to be thin wrappers
2025-12-30 18:35:22 +00:00
Nick Sweeting
95beddc5fc
more migration fixes 2025-12-29 22:12:57 -08:00
Nick Sweeting
2e350d317d
fix initial migrations 2025-12-29 21:27:31 -08:00
Nick Sweeting
147d567d3f
fix migrations 2025-12-29 19:25:26 -08:00
Nick Sweeting
30c60eef76
much better tests and add page ui 2025-12-29 04:02:11 -08:00
Nick Sweeting
f4e7820533
use full dotted paths for all archivebox imports, add migrations and more fixes 2025-12-29 00:47:08 -08:00
Nick Sweeting
f0aa19fa7d
wip 2025-12-28 17:51:54 -08:00
Claude
1b5a816022
Implement hook step-based concurrency system
This implements the hook concurrency plan from TODO_hook_concurrency.md:

## Schema Changes
- Add Snapshot.current_step (IntegerField 0-9, default=0)
- Create migration 0034_snapshot_current_step.py
- Fix uuid_compat imports in migrations 0032 and 0003

## Core Logic
- Add extract_step(hook_name) utility - extracts step from __XX_ pattern
- Add is_background_hook(hook_name) utility - checks for .bg. suffix
- Update Snapshot.create_pending_archiveresults() to create one AR per hook
- Update ArchiveResult.run() to handle hook_name field
- Add Snapshot.advance_step_if_ready() method for step advancement
- Integrate with SnapshotMachine.is_finished() to call advance_step_if_ready()

## Worker Coordination
- Update ArchiveResultWorker.get_queue() for step-based filtering
- ARs are only claimable when their step <= snapshot.current_step

## Hook Renumbering
- Step 5 (DOM extraction): singlefile→50, screenshot→51, pdf→52, dom→53,
  title→54, readability→55, headers→55, mercury→56, htmltotext→57
- Step 6 (post-DOM): wget→61, git→62, media→63.bg, gallerydl→64.bg,
  forumdl→65.bg, papersdl→66.bg
- Step 7 (URL extraction): parse_* hooks moved to 70-75

Background hooks (.bg suffix) don't block step advancement, enabling
long-running downloads to continue while other hooks proceed.
2025-12-28 13:47:25 +00:00
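The step utilities described in this commit could be sketched as follows. The hook filenames used in the example are hypothetical, as is the `is_claimable` helper and the choice to return None when no `__XX_` prefix is found; the commit only states that the step is taken from the `__XX_` pattern, that background hooks are marked with `.bg.`, and that ARs are claimable when their step <= snapshot.current_step.

```python
import re
from typing import Optional

# Matches a two-digit hook number such as "__50_"; the first digit is the step.
_STEP_RE = re.compile(r"__(\d)\d_")


def extract_step(hook_name: str) -> Optional[int]:
    """Return the step digit (0-9) from a name containing __XX_, else None."""
    match = _STEP_RE.search(hook_name)
    return int(match.group(1)) if match else None


def is_background_hook(hook_name: str) -> bool:
    """Background hooks carry a .bg. marker in their filename."""
    return ".bg." in hook_name


def is_claimable(hook_name: str, current_step: int) -> bool:
    """Hypothetical helper: a hook is claimable once its step has been reached."""
    step = extract_step(hook_name)
    return step is not None and step <= current_step


if __name__ == "__main__":
    # Hypothetical filenames chosen to match the renumbering in this commit.
    for name in ("on_Snapshot__50_singlefile.py", "on_Snapshot__63_media.bg.py"):
        print(name, extract_step(name), is_background_hook(name), is_claimable(name, 5))
```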
Nick Sweeting
50e527ec65
way better plugin hooks system wip 2025-12-28 03:39:59 -08:00
Claude
741c098a2b
Merge remote-tracking branch 'origin/dev' into claude/improve-test-suite-xm6Bh 2025-12-27 05:53:06 +00:00
Nick Sweeting
2f81c0cc76
add overrides options to binproviders 2025-12-26 20:39:56 -08:00
Claude
ae2ab5b273
Add Python 3.13 support with uuid7 backport compatibility
- Create uuid_compat.py module that provides uuid7 for Python <3.14
  using uuid_extensions package, and native uuid.uuid7 for Python 3.14+
- Update all model files and migrations to use archivebox.uuid_compat
- Add uuid7 conditional dependency in pyproject.toml for Python <3.14
- Update requires-python to >=3.13 (from >=3.14)
- Update GitHub workflows, lock_pkgs.sh to use Python 3.13
- Update tool configs (ruff, pyright, uv) for Python 3.13

This enables running ArchiveBox on Python 3.13 while maintaining
forward compatibility with Python 3.14's native uuid7 support.
2025-12-27 01:07:30 +00:00
Nick Sweeting
bb53228ebf
remove Seed model in favor of Crawl as template 2025-12-25 01:52:41 -08:00
Nick Sweeting
866f993f26
logging and admin ui improvements 2025-12-25 01:10:41 -08:00
Nick Sweeting
d95f0dc186
remove huey 2025-12-24 23:40:18 -08:00
Nick Sweeting
6c769d831c
wip 2 2025-12-24 21:46:14 -08:00
Nick Sweeting
1915333b81
wip major changes 2025-12-24 20:10:38 -08:00
Nick Sweeting
c1335fed37
Remove ABID system and KVTag model - use UUIDv7 IDs exclusively
This commit completes the simplification of the ID system by:

- Removing the ABID (ArchiveBox ID) system entirely
- Removing the base_models/abid.py file
- Removing KVTag model in favor of the existing Tag model in core/models.py
- Simplifying all models to use standard UUIDv7 primary keys
- Removing ABID-related admin functionality
- Cleaning up commented-out ABID code from views and statemachines
- Deleting migration files for ABID field removal (no longer needed)

All models now use simple UUIDv7 ids via `id = models.UUIDField(primary_key=True, default=uuid7)`

Note: Old migrations containing ABID references are preserved for database
migration history compatibility.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-24 06:13:49 -08:00
Nick Sweeting
651ba0b11c
add new Process model to Machine models 2024-12-12 21:45:55 -08:00
Nick Sweeting
569081a9eb
rename abid_utils to base_models 2024-11-18 19:40:05 -08:00
Nick Sweeting
c8e186f21b
fix plugin loading order, admin, abx-pkg 2024-11-16 06:44:12 -08:00
Nick Sweeting
b3c1cb716e
move abx plugins inside vendor dir 2024-10-28 04:07:35 -07:00
Nick Sweeting
80d8a6b667
split archivebox.use into archivebox.reads and archivebox.writes 2024-10-15 01:03:01 -07:00
Nick Sweeting
aaf069fab0
remove tags field from Machine admin 2024-10-15 01:02:13 -07:00