ArchiveBox_ArchiveBox

mirror of https://github.com/ArchiveBox/ArchiveBox.git synced 2026-02-20 00:56:07 +08:00

Author	SHA1	Message	Date
Claude	1a86789523	Move Chrome default args to config.json CHROME_ARGS - Add comprehensive default CHROME_ARGS in config.json with 55+ flags for deterministic rendering, security, performance, and UI suppression - Update chrome_utils.js launchChromium() to read CHROME_ARGS and CHROME_ARGS_EXTRA from environment variables (set by get_config()) - Add getEnvArray() helper to parse JSON arrays or comma-separated strings from environment variables - Separate args into three categories: 1. baseArgs: Static flags from CHROME_ARGS config (configurable) 2. dynamicArgs: Runtime-computed flags (port, sandbox, headless, etc.) 3. extraArgs: User overrides from CHROME_ARGS_EXTRA - Add CHROME_SANDBOX config option to control --no-sandbox flag Args are now configurable via: - config.json defaults - ArchiveBox.conf file - Environment variables - Per-crawl/snapshot config overrides	2025-12-31 00:57:29 +00:00
Claude	877b5f91c2	Derive CHROME_USER_DATA_DIR from ACTIVE_PERSONA in config system - Add _derive_persona_paths() in configset.py to automatically derive CHROME_USER_DATA_DIR and CHROME_EXTENSIONS_DIR from ACTIVE_PERSONA when not explicitly set. This allows plugins to use these paths without knowing about the persona system. - Update chrome_utils.js launchChromium() to accept userDataDir option and pass --user-data-dir to Chrome. Also cleans up SingletonLock before launch. - Update killZombieChrome() to clean up SingletonLock files from all persona chrome_user_data directories after killing zombies. - Update chrome_cleanup() in misc/util.py to handle persona-based user data directories when cleaning up stale Chrome state. - Simplify on_Crawl__20_chrome_launch.bg.js to use CHROME_USER_DATA_DIR and CHROME_EXTENSIONS_DIR from env (derived by get_config()). Config priority flow: ACTIVE_PERSONA=WorkAccount (set on crawl/snapshot) -> get_config() derives: CHROME_USER_DATA_DIR = PERSONAS_DIR/WorkAccount/chrome_user_data CHROME_EXTENSIONS_DIR = PERSONAS_DIR/WorkAccount/chrome_extensions -> hooks receive these as env vars without needing persona logic	2025-12-31 00:21:07 +00:00
Nick Sweeting	08366cfa46	document chrome configs	2025-12-30 12:42:50 -08:00
Nick Sweeting	80f75126c6	more fixes	2025-12-29 21:03:05 -08:00
Nick Sweeting	64dccb7a19	passing	2025-12-29 18:55:57 -08:00
Nick Sweeting	5549a79869	more speed fixes	2025-12-29 18:55:37 -08:00
Nick Sweeting	abf5f44134	more debug logging	2025-12-29 18:53:52 -08:00
Nick Sweeting	bcf0513d05	more debug logging	2025-12-29 18:50:04 -08:00
Nick Sweeting	7e6e3be9e7	messing with chrome install process to reuse cached chromium with pinned version	2025-12-29 18:49:36 -08:00
Nick Sweeting	b670612685	centralize chrome pid and zombie logic in chrome_utils	2025-12-29 17:57:23 -08:00
Nick Sweeting	4ba3e8d120	fix extension loading and consolidate chromium logic	2025-12-29 17:47:37 -08:00
Nick Sweeting	638b3ba774	add modalcloser plugin	2025-12-29 14:36:15 -08:00
Nick Sweeting	8c69124935	make infiniscroll plugin also expand details and comments sections	2025-12-29 13:55:27 -08:00
Nick Sweeting	b649db5294	fix infiniscroll plugin	2025-12-29 13:55:26 -08:00
Nick Sweeting	690f0669cd	remove unneeded test	2025-12-29 13:30:25 -08:00
Nick Sweeting	73e977ea97	ytdlp fixes	2025-12-29 13:26:50 -08:00
Nick Sweeting	967c5d53e0	make plugin config more consistent	2025-12-29 13:21:46 -08:00
Nick Sweeting	8d76b2b0c6	add infiniscroll plugin	2025-12-29 13:14:40 -08:00
Claude	ac64c77341	move default yt-dlp args to config.json YTDLP_ARGS for user override - Move hardcoded default args from Python to config.json YTDLP_ARGS - Add get_ytdlp_args() function to read from YTDLP_ARGS env var - Keep format arg with max_size in code (depends on YTDLP_MAX_SIZE) - YTDLP_ARGS can be overridden as JSON array in environment	2025-12-29 19:38:37 +00:00
Claude	a5654e877f	rename media plugin to ytdlp with backwards-compatible aliases - Rename archivebox/plugins/media/ → archivebox/plugins/ytdlp/ - Rename hook script on_Snapshot__63_media.bg.py → on_Snapshot__63_ytdlp.bg.py - Update config.json: YTDLP_* as primary keys, MEDIA_* as x-aliases - Update templates CSS classes: media-* → ytdlp-* - Fix gallerydl bug: remove incorrect dependency on media plugin output - Update all codebase references to use YTDLP_* and SAVE_YTDLP - Add backwards compatibility test for MEDIA_ENABLED alias	2025-12-29 19:09:05 +00:00
Nick Sweeting	30c60eef76	much better tests and add page ui	2025-12-29 04:02:11 -08:00
Nick Sweeting	f4e7820533	use full dotted paths for all archivebox imports, add migrations and more fixes	2025-12-29 00:47:08 -08:00
Nick Sweeting	1e4d3ffd11	improve plugin tests and config	2025-12-29 00:45:23 -08:00
Nick Sweeting	f0aa19fa7d	wip	2025-12-28 17:51:54 -08:00
Claude	1b5a816022	Implement hook step-based concurrency system This implements the hook concurrency plan from TODO_hook_concurrency.md: ## Schema Changes - Add Snapshot.current_step (IntegerField 0-9, default=0) - Create migration 0034_snapshot_current_step.py - Fix uuid_compat imports in migrations 0032 and 0003 ## Core Logic - Add extract_step(hook_name) utility - extracts step from __XX_ pattern - Add is_background_hook(hook_name) utility - checks for .bg. suffix - Update Snapshot.create_pending_archiveresults() to create one AR per hook - Update ArchiveResult.run() to handle hook_name field - Add Snapshot.advance_step_if_ready() method for step advancement - Integrate with SnapshotMachine.is_finished() to call advance_step_if_ready() ## Worker Coordination - Update ArchiveResultWorker.get_queue() for step-based filtering - ARs are only claimable when their step <= snapshot.current_step ## Hook Renumbering - Step 5 (DOM extraction): singlefile→50, screenshot→51, pdf→52, dom→53, title→54, readability→55, headers→55, mercury→56, htmltotext→57 - Step 6 (post-DOM): wget→61, git→62, media→63.bg, gallerydl→64.bg, forumdl→65.bg, papersdl→66.bg - Step 7 (URL extraction): parse_* hooks moved to 70-75 Background hooks (.bg suffix) don't block step advancement, enabling long-running downloads to continue while other hooks proceed.	2025-12-28 13:47:25 +00:00
Nick Sweeting	4ccb0863bb	continue renaming extractor to plugin, add plan for hook concurrency, add chrome kill helper script	2025-12-28 05:29:24 -08:00
Nick Sweeting	bd265c0083	rename extractor to plugin everywhere	2025-12-28 04:43:15 -08:00
Nick Sweeting	50e527ec65	way better plugin hooks system wip	2025-12-28 03:39:59 -08:00
Claude	b632894bc9	Update views, API, and exports for new ArchiveResult output fields Replace old `output` field with new fields across the codebase: - output_str: Human-readable output summary - output_json: Structured metadata (optional) - output_files: Dict of output files with metadata - output_size: Total size in bytes - output_mimetypes: CSV of file mimetypes Files updated: - api/v1_core.py: Update MinimalArchiveResultSchema to expose new fields - api/v1_core.py: Update ArchiveResultFilterSchema to search output_str - cli/archivebox_extract.py: Use output_str in CLI output - core/admin_archiveresults.py: Update admin fields, search, and fieldsets - core/admin_archiveresults.py: Fix output_html variable name bug in output_summary - misc/jsonl.py: Update archiveresult_to_jsonl() to include new fields - plugins/extractor_utils.py: Update ExtractorResult helper class The embed_path() method already uses output_files and output_str, so snapshot detail page and template tags work correctly.	2025-12-27 20:28:22 +00:00
Claude	e3ba599812	Update install hooks to respect XYZ_BINARY env vars - All install hooks now respect their respective XYZ_BINARY env vars (e.g., WGET_BINARY, CHROME_BINARY, YTDLP_BINARY, etc.) - Support both absolute paths (/usr/bin/wget2) and binary names (wget2) - Dynamic bin_name used in Dependency JSONL output - Updated 11 install hooks to follow the new pattern - Mark checklist items as complete in TODO_hook_architecture.md	2025-12-27 10:12:45 +00:00
Claude	8c846b7d1c	Rename validate hooks to install hooks - Rename 13 on_Crawl__00_validate_* hooks to on_Crawl__00_install_* - This better reflects what these hooks actually do (check/install binaries) - Update TODO_hook_architecture.md to reflect renamed hooks	2025-12-27 10:06:34 +00:00
Claude	2623c6cc11	Complete JS hooks to clean JSONL format + rename background hooks - Update 12 remaining JS snapshot hooks to output clean JSONL - Remove RESULT_JSON= prefix, START_TS=, END_TS=, STATUS= output - Rename 3 background hooks with .bg. suffix: - consolelog -> on_Snapshot__21_consolelog.bg.js - ssl -> on_Snapshot__23_ssl.bg.js - responses -> on_Snapshot__24_responses.bg.js - Update TODO_hook_architecture.md with completion status	2025-12-27 09:46:59 +00:00
Claude	c52eef1459	Update Python/JS hooks to clean JSONL format + add audit report Phase 4 Plugin Audit Progress: - Audited all 6 Dependency hooks (all already compliant) - Audited all 11 Crawl Validate hooks (all already compliant) - Updated 8 Python Snapshot hooks to clean JSONL format - Updated 1 JS Snapshot hook (title.js) to clean JSONL format Snapshot hooks updated to remove: - RESULT_JSON= prefix - Extra output lines (START_TS=, END_TS=, DURATION=, VERSION=, OUTPUT=, STATUS=) Now output clean JSONL: {"type": "ArchiveResult", "status": "...", "output_str": "..."} Added implementation report to TODO_hook_architecture.md documenting: - All completed phases (1, 3, 6, 7) - Plugin audit results with status tables - Remaining 13 JS hooks that need updating - Files modified list	2025-12-27 09:31:03 +00:00
Claude	741c098a2b	Merge remote-tracking branch 'origin/dev' into claude/improve-test-suite-xm6Bh	2025-12-27 05:53:06 +00:00
Nick Sweeting	2f81c0cc76	add overrides options to binproviders	2025-12-26 20:39:56 -08:00
Nick Sweeting	9bc5d99488	add overrides options to binproviders	2025-12-26 20:16:58 -08:00
Claude	13be196fd7	Merge remote-tracking branch 'origin/dev' into claude/improve-test-suite-xm6Bh # Conflicts: # pyproject.toml	2025-12-27 02:27:51 +00:00
Nick Sweeting	6fdc52cc57	add papersdl plugin	2025-12-26 18:25:52 -08:00
Nick Sweeting	e2cbcd17f6	more tests and migrations fixes	2025-12-26 18:22:48 -08:00
Claude	0941aca4a3	Improve test suite: remove mocks and add 0.8.x migration tests - Remove mock-based tests from plugin tests (headers, singlefile, ublock, captcha2) - Replace fake cache tests with real double-install tests that verify cache behavior - Add SCHEMA_0_8 and seed_0_8_data() for testing 0.8.x data directory migrations - Add TestMigrationFrom08x class with comprehensive migration tests: - Snapshot count preservation - Crawl record preservation - Snapshot-to-crawl relationship preservation - Tag preservation - ArchiveResult status preservation - CLI command verification after migration - Add more CLI tests for add command (tags, multiple URLs, file input) - All tests now use real functionality without mocking	2025-12-26 23:01:49 +00:00
Nick Sweeting	0fbcbd2616	gallerydl template	2025-12-26 11:55:19 -08:00
Nick Sweeting	4fd7fcdbcf	new gallerydl plugin and more	2025-12-26 11:55:03 -08:00
Nick Sweeting	9838d7ba02	tons of ui fixes and plugin fixes	2025-12-25 03:59:51 -08:00
Nick Sweeting	866f993f26	logging and admin ui improvements	2025-12-25 01:10:41 -08:00
Nick Sweeting	d95f0dc186	remove huey	2025-12-24 23:40:18 -08:00
Nick Sweeting	6c769d831c	wip 2	2025-12-24 21:46:14 -08:00
Nick Sweeting	1915333b81	wip major changes	2025-12-24 20:10:38 -08:00

47 Commits
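
Commit 1b5a816022 above describes two hook-name utilities, extract_step(hook_name) and is_background_hook(hook_name), plus the rule that an ArchiveResult is only claimable when its step <= snapshot.current_step. The following is a minimal sketch of how such helpers might look, not the repository's actual code. It assumes the step is the first digit of the two-digit __XX_ prefix (inferred from the renumbering, where step 5 covers hooks 50-57 and step 6 covers 61-66). The is_claimable() helper and the ValueError on names without a prefix are hypothetical, added only for illustration.

```python
import re

# Assumed pattern: a double underscore, two digits, then an underscore,
# as in "on_Snapshot__63_ytdlp.bg.py".
_HOOK_NUMBER_RE = re.compile(r"__(\d)(\d)_")


def extract_step(hook_name: str) -> int:
    """Return the step (0-9) encoded as the first digit of the __XX_ prefix."""
    match = _HOOK_NUMBER_RE.search(hook_name)
    if match is None:
        # Assumption: how the real code handles unnumbered hooks is not
        # stated in the commit message, so this sketch simply rejects them.
        raise ValueError(f"no __XX_ prefix in hook name: {hook_name!r}")
    return int(match.group(1))


def is_background_hook(hook_name: str) -> bool:
    """Return True when the hook filename carries the .bg. marker."""
    return ".bg." in hook_name


def is_claimable(hook_name: str, current_step: int) -> bool:
    """Hypothetical helper: a hook may run once its step has been reached."""
    return extract_step(hook_name) <= current_step


if __name__ == "__main__":
    for name in (
        "on_Snapshot__21_consolelog.bg.js",
        "on_Snapshot__63_ytdlp.bg.py",
    ):
        print(name, extract_step(name), is_background_hook(name))
```

With these definitions, a snapshot at current_step 5 would not yet hand out on_Snapshot__63_ytdlp.bg.py to a worker, while on_Snapshot__21_consolelog.bg.js would already be claimable.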