ArchiveBox_ArchiveBox

mirror of https://github.com/ArchiveBox/ArchiveBox.git synced 2026-01-15 08:22:39 +08:00

Author	SHA1	Message	Date
Nick Sweeting	7ceaeae2d9	rename archive_org to archivedotorg, add BinaryWorker, fix config pass-through	2026-01-04 22:38:15 -08:00
Nick Sweeting	456aaee287	more migration id/uuid and config propagation fixes	2026-01-04 16:16:26 -08:00
Nick Sweeting	839ae744cf	simplify entrypoints for orchestrator and workers	2026-01-04 13:17:07 -08:00
Nick Sweeting	dd77511026	unified Process source of truth and better screenshot tests	2026-01-02 04:20:34 -08:00
Nick Sweeting	3672174dad	fix transition mid transition	2026-01-02 00:24:44 -08:00
Nick Sweeting	65ee09ceab	move tests into subfolder, add missing install hooks	2026-01-02 00:22:07 -08:00
Nick Sweeting	60422adc87	fix orchestrator statemachine and Process from archiveresult migrations	2026-01-01 16:43:02 -08:00
Nick Sweeting	876feac522	actually working migration path from 0.7.2 and 0.8.6 + renames and test coverage	2026-01-01 15:50:00 -08:00
Nick Sweeting	6fadcf5168	remove model health stats from models that dont need it	2026-01-01 15:50:00 -08:00
Nick Sweeting	f7457b13ad	more migrations fixes attempts	2025-12-31 17:46:10 -08:00
Nick Sweeting	a04e4a7345	cleanup migrations, json, jsonl	2025-12-31 15:36:43 -08:00
claude[bot]	b2132d1f14	Fix cubic review issues: process_type detection, cmd storage, PID cleanup, and migration - Fix Process.current() to store psutil cmdline instead of sys.argv for accurate validation - Fix worker process_type detection: explicitly set to WORKER after registration - Fix ArchiveResultWorker.start() to use Process.TypeChoices.WORKER consistently - Fix migration to be explicitly irreversible (SQLite doesn't support DROP COLUMN) - Fix get_running_workers() to return process_id instead of incorrectly named worker_id - Fix safe_kill_process() to wait for termination and escalate to SIGKILL if needed - Fix migration to include all indexes in state_operations (parent_id, process_type) - Fix documentation to use Machine.current() scoping and StatusChoices constants Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>	2025-12-31 11:42:07 +00:00
claude[bot]	ee201a0f83	Fix code review issues in process management refactor - Add pwd validation in Process.launch() to prevent crashes - Fix psutil returncode handling (use wait() return value, not returncode attr) - Add None check for proc.pid in cleanup_stale_running() - Add stale process cleanup in Orchestrator.is_running() - Ensure orchestrator process_type is correctly set to ORCHESTRATOR - Fix KeyboardInterrupt handling (exit code 0 for graceful shutdown) - Throttle cleanup_stale_running() to once per 30 seconds for performance - Fix worker process_type to use TypeChoices.WORKER consistently - Fix get_running_workers() API to return list of dicts (not Process objects) - Only delete PID files after successful kill or confirmed stale - Fix migration index names to match between SQL and Django state - Remove db_index=True from process_type (index created manually) - Update documentation to reflect actual implementation - Add explanatory comments to empty except blocks - Fix exit codes to use Unix convention (128 + signal number) Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>	2025-12-31 11:14:47 +00:00
Claude	b822352fc3	Delete pid_utils.py and migrate to Process model DELETED: - workers/pid_utils.py (-192 lines) - replaced by Process model methods SIMPLIFIED: - crawls/models.py Crawl.cleanup() (80 lines -> 10 lines) - hooks.py: deleted process_is_alive() and kill_process() (-45 lines) UPDATED to use Process model: - core/models.py: Snapshot.cleanup() and has_running_background_hooks() - machine/models.py: Binary.cleanup() - workers/worker.py: Worker.on_startup/shutdown, get_running_workers, start - workers/orchestrator.py: Orchestrator.on_startup/shutdown, is_running All subprocess management now uses: - Process.current() for registering current process - Process.get_running() / get_running_count() for querying - Process.cleanup_stale_running() for cleanup - safe_kill_process() for validated PID killing Total line reduction: ~250 lines	2025-12-31 10:15:22 +00:00
Nick Sweeting	bb59287411	Merge branch 'dev' into claude/snapshot-index-jsonl-UxEXK	2025-12-30 12:05:05 -08:00
Claude	69965a2782	fix: correct CLI pipeline data flow for crawl -> snapshot -> extract - archivebox crawl: creates Crawl records, outputs Crawl JSONL - archivebox snapshot: accepts Crawl JSONL, creates Snapshots, outputs Snapshot JSONL - archivebox extract: accepts Snapshot JSONL, runs extractors, outputs ArchiveResult JSONL Changes: - Add Crawl.from_jsonl() method for creating Crawl from JSONL records - Rewrite archivebox_crawl.py to create Crawl jobs without immediately starting them - Update archivebox_snapshot.py to accept both Crawl JSONL and plain URLs - Update jsonl.py docstring to document the pipeline	2025-12-30 19:42:41 +00:00
Claude	ae648c9bc1	refactor: move remaining JSONL methods to models, clean up jsonl.py - Add Tag.to_jsonl() method with schema_version - Add Crawl.to_jsonl() method with schema_version - Fix Tag.from_jsonl() to not depend on jsonl.py helper - Update tests to use Snapshot.from_jsonl() instead of non-existent get_or_create_snapshot Remove model-specific functions from misc/jsonl.py: - tag_to_jsonl() - use Tag.to_jsonl() instead - crawl_to_jsonl() - use Crawl.to_jsonl() instead - get_or_create_tag() - use Tag.from_jsonl() instead - process_jsonl_records() - use model from_jsonl() methods directly jsonl.py now only contains generic I/O utilities: - Type constants (TYPE_SNAPSHOT, etc.) - parse_line(), read_stdin(), read_file(), read_args_or_stdin() - write_record(), write_records() - filter_by_type(), process_records()	2025-12-30 19:30:18 +00:00
Nick Sweeting	91375d35a3	more migrations	2025-12-30 10:30:52 -08:00
Nick Sweeting	95beddc5fc	more migration fixes	2025-12-29 22:12:57 -08:00
Nick Sweeting	80f75126c6	more fixes	2025-12-29 21:03:05 -08:00
Nick Sweeting	30c60eef76	much better tests and add page ui	2025-12-29 04:02:11 -08:00
Nick Sweeting	f4e7820533	use full dotted paths for all archivebox imports, add migrations and more fixes	2025-12-29 00:47:08 -08:00
Nick Sweeting	f0aa19fa7d	wip	2025-12-28 17:51:54 -08:00
Nick Sweeting	b1e354619f	minor bugfixes	2025-12-28 05:33:09 -08:00
Nick Sweeting	4ccb0863bb	continue renaming extractor to plugin, add plan for hook concurrency, add chrome kill helper script	2025-12-28 05:29:24 -08:00
Nick Sweeting	bd265c0083	rename extractor to plugin everywhere	2025-12-28 04:43:15 -08:00
Nick Sweeting	50e527ec65	way better plugin hooks system wip	2025-12-28 03:39:59 -08:00
Claude	741c098a2b	Merge remote-tracking branch 'origin/dev' into claude/improve-test-suite-xm6Bh	2025-12-27 05:53:06 +00:00
Nick Sweeting	2f81c0cc76	add overrides options to binproviders	2025-12-26 20:39:56 -08:00
Claude	ea6fe94c93	Add crawls_crawlschedule table to 0.8.x test schema and fix migrations - Add missing crawls_crawlschedule table definition to SCHEMA_0_8 in test file - Record all replaced dev branch migrations (0023-0074) for squashed migration - Update 0024_snapshot_crawl migration to depend on squashed machine migration - Remove 'extractor' field references from crawls admin - All 45 migration tests now pass (0.4.x, 0.7.x, 0.8.x, fresh install)	2025-12-27 04:32:58 +00:00
Claude	766bb28536	Fix migration tests and M2M field alteration issue - Remove M2M tags field alteration from migration 0027 (Django doesn't support altering M2M fields via migration) - Add machine app tables to 0.8.x test schema - Add missing columns (config, num_uses_failed, num_uses_succeeded) to 0.8.x test schema - Skip 0.8.x migration tests due to complex migration state dependencies with machine app - All 15 0.7.x migration tests now pass - Merge dev branch and resolve pyproject.toml conflict (keep both uuid7 and gallery-dl deps)	2025-12-27 03:00:44 +00:00
Claude	13be196fd7	Merge remote-tracking branch 'origin/dev' into claude/improve-test-suite-xm6Bh # Conflicts: # pyproject.toml	2025-12-27 02:27:51 +00:00
Nick Sweeting	e2cbcd17f6	more tests and migrations fixes	2025-12-26 18:22:48 -08:00
Claude	c3acadd528	Remove extractor field from Crawl model and fix tests - Remove extractor field from Crawl model (moved to config dict) - Update migration 0002_drop_seed_model to not add extractor - Update archivebox_add.py to use config['PARSER'] instead - Update admin.py recrawl to not pass extractor - Update jsonl.py serialization to not include extractor - Update test schema SCHEMA_0_8 to not include extractor - Set default timeout to 60s for test commands	2025-12-27 01:49:09 +00:00
Claude	ae2ab5b273	Add Python 3.13 support with uuid7 backport compatibility - Create uuid_compat.py module that provides uuid7 for Python <3.14 using uuid_extensions package, and native uuid.uuid7 for Python 3.14+ - Update all model files and migrations to use archivebox.uuid_compat - Add uuid7 conditional dependency in pyproject.toml for Python <3.14 - Update requires-python to >=3.13 (from >=3.14) - Update GitHub workflows, lock_pkgs.sh to use Python 3.13 - Update tool configs (ruff, pyright, uv) for Python 3.13 This enables running ArchiveBox on Python 3.13 while maintaining forward compatibility with Python 3.14's native uuid7 support.	2025-12-27 01:07:30 +00:00
Nick Sweeting	9838d7ba02	tons of ui fixes and plugin fixes	2025-12-25 03:59:51 -08:00
Nick Sweeting	bb53228ebf	remove Seed model in favor of Crawl as template	2025-12-25 01:52:41 -08:00
Nick Sweeting	866f993f26	logging and admin ui improvements	2025-12-25 01:10:41 -08:00
Nick Sweeting	d95f0dc186	remove huey	2025-12-24 23:40:18 -08:00
Nick Sweeting	6c769d831c	wip 2	2025-12-24 21:46:14 -08:00
Nick Sweeting	1915333b81	wip major changes	2025-12-24 20:10:38 -08:00
Nick Sweeting	c1335fed37	Remove ABID system and KVTag model - use UUIDv7 IDs exclusively This commit completes the simplification of the ID system by: - Removing the ABID (ArchiveBox ID) system entirely - Removing the base_models/abid.py file - Removing KVTag model in favor of the existing Tag model in core/models.py - Simplifying all models to use standard UUIDv7 primary keys - Removing ABID-related admin functionality - Cleaning up commented-out ABID code from views and statemachines - Deleting migration files for ABID field removal (no longer needed) All models now use simple UUIDv7 ids via `id = models.UUIDField(primary_key=True, default=uuid7)` Note: Old migrations containing ABID references are preserved for database migration history compatibility. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2025-12-24 06:13:49 -08:00
Nick Sweeting	f6d22a3cc4	tweak worker updated logic and add output_dir_template and symlinks logic	2024-12-13 06:03:52 -08:00
Nick Sweeting	5c06b8ff00	add new Event model to workers/models	2024-12-12 22:08:17 -08:00
Nick Sweeting	2a1afcf6c2	move crawl models back into dedicated app	2024-12-12 21:45:55 -08:00
Nick Sweeting	b948e49013	add urls log to Crawl model	2024-11-19 06:32:33 -08:00
Nick Sweeting	2595139180	improve statemachine logging and archivebox update CLI cmd	2024-11-19 03:31:05 -08:00
Nick Sweeting	569081a9eb	rename abid_utils to base_models	2024-11-18 19:40:05 -08:00
Nick Sweeting	65afd405b1	merge seeds and crawls apps	2024-11-18 19:23:14 -08:00
Nick Sweeting	e469c5a344	merge queues and actors apps into new workers app	2024-11-18 18:52:48 -08:00
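
Two of the commit messages above (b2132d1f14 and ee201a0f83) mention a kill helper that waits for termination and escalates to SIGKILL if needed, and exit codes that follow the Unix convention of 128 + signal number. The sketch below illustrates that general pattern with the Python standard library only. It is not ArchiveBox's code: the names `shell_exit_code` and `stop_process` and the 5-second grace period are hypothetical, chosen for illustration.

```python
import signal
import subprocess


def shell_exit_code(returncode: int) -> int:
    """Convert a subprocess return code to the shell-style exit code.

    On POSIX, subprocess reports a child killed by signal N as -N,
    while shells report the same event as 128 + N.
    (Hypothetical helper, not taken from the ArchiveBox source.)
    """
    if returncode < 0:
        return 128 + (-returncode)
    return returncode


def stop_process(proc: subprocess.Popen, grace_seconds: float = 5.0) -> int:
    """Ask a child to exit with SIGTERM, then escalate to SIGKILL.

    Returns the shell-style exit code of the stopped child.
    (Hypothetical helper; the grace period is an assumed default.)
    """
    proc.terminate()  # sends SIGTERM on POSIX
    try:
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # sends SIGKILL on POSIX
        proc.wait()  # reap the child so it does not linger as a zombie
    return shell_exit_code(proc.returncode)
```

Under these assumptions, a child that exits on SIGTERM yields 143 (128 + 15), and one that has to be killed with SIGKILL yields 137 (128 + 9). Waiting after `kill()` matters because the return code is only available once the child has been reaped.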

65 Commits