ArchiveBox_ArchiveBox

mirror of https://github.com/ArchiveBox/ArchiveBox.git synced 2026-02-04 18:37:26 +08:00

Author	SHA1	Message	Date
Nick Sweeting	ec4b27056e	wip	2026-01-21 03:19:56 -08:00
Nick Sweeting	86e7973334	cleanup tui, startup, card templates, and more	2026-01-19 14:33:20 -08:00
Nick Sweeting	bef67760db	working singlefile	2026-01-19 03:05:49 -08:00
Nick Sweeting	b5bbc3b549	better tui	2026-01-19 01:53:32 -08:00
Nick Sweeting	1cb2d5070e	bump version	2026-01-19 01:11:59 -08:00
Nick Sweeting	c7b2217cd6	tons of fixes with codex	2026-01-19 01:00:53 -08:00
Nick Sweeting	0a2ac11b01	more binary fixes	2026-01-05 02:26:33 -08:00
Nick Sweeting	b80e80439d	more binary fixes	2026-01-05 02:18:38 -08:00
Nick Sweeting	7ceaeae2d9	rename archive_org to archivedotorg, add BinaryWorker, fix config pass-through	2026-01-04 22:38:15 -08:00
Nick Sweeting	456aaee287	more migration id/uuid and config propagation fixes	2026-01-04 16:16:26 -08:00
Nick Sweeting	839ae744cf	simplify entrypoints for orchestrator and workers	2026-01-04 13:17:07 -08:00
Nick Sweeting	dd77511026	unified Process source of truth and better screenshot tests	2026-01-02 04:20:34 -08:00
Nick Sweeting	3672174dad	fix transition mid transition	2026-01-02 00:24:44 -08:00
Nick Sweeting	65ee09ceab	move tests into subfolder, add missing install hooks	2026-01-02 00:22:07 -08:00
Nick Sweeting	c2afb40350	fix lib bin dir and archivebox add hanging	2026-01-01 16:58:47 -08:00
Nick Sweeting	9008cefca2	codecov, migrations, orchestrator fixes	2026-01-01 16:57:04 -08:00
Nick Sweeting	60422adc87	fix orchestrator statemachine and Process from archiveresult migrations	2026-01-01 16:43:02 -08:00
Nick Sweeting	876feac522	actually working migration path from 0.7.2 and 0.8.6 + renames and test coverage	2026-01-01 15:50:00 -08:00
Nick Sweeting	6fadcf5168	remove model health stats from models that don't need it	2026-01-01 15:50:00 -08:00
Nick Sweeting	e903fa1d2b	Fix: Make SingleFile use SINGLEFILE_CHROME_ARGS with fallback to CHROME_ARGS (#1754) Fixes #1445 This PR resolves the issue where SingleFile was not respecting Chrome user data directory and other Chrome launch options that work for other Chrome-based extractors (PDF, Screenshot, etc.). ## Changes - Added `SINGLEFILE_CHROME_ARGS` config option with fallback to `CHROME_ARGS` - Updated SingleFile extractor to pass Chrome arguments via `--browser-args` - Updated documentation This ensures SingleFile respects the same Chrome configuration as other Chrome-based extractors. Generated with [Claude Code](https://claude.ai/code)	2026-01-01 14:34:05 -08:00
Claude	09a1ca3134	Fix hook priority conflicts and standardize on_Binary naming on_Snapshot priority fixes: - redirects.bg.js stays at 31, staticfile.bg.js → 32 - headers.js stays at 55, readability.py → 56 - mercury.py → 57, htmltotext.py → 58 on_Binary hooks now have numeric priorities: - 10: npm_install.py - 11: pip_install.py - 12: brew_install.py - 13: apt_install.py - 14: custom_install.py - 15: env_install.py	2026-01-01 01:31:52 +00:00
Claude	4d33084496	Remove redundant chrome_validate hook, rename wget_validate to wget_install - Delete chrome/on_Crawl__10_chrome_validate.py (duplicates chrome_install) - Rename wget/on_Crawl__11_wget_validate.py → on_Crawl__06_wget_install.py All hooks now follow consistent naming: install, launch, or config	2025-12-31 23:41:40 +00:00
Nick Sweeting	a04e4a7345	cleanup migrations, json, jsonl	2025-12-31 15:36:43 -08:00
Claude	4c77949197	Clean up on_Crawl hooks: remove duplicates and standardize naming Deleted dead/duplicate hooks: - wget/on_Crawl__10_install_wget.py (duplicate of __10_wget_validate_config.py) - chrome/on_Crawl__00_chrome_install.py (simpler version, kept full one) - chrome/on_Crawl__20_chrome_launch.bg.js (legacy, kept __30 version) - singlefile/on_Crawl__20_install_singlefile_extension.js (disabled/dead) - istilldontcareaboutcookies/on_Crawl__20_install_*.js (legacy) - ublock/on_Crawl__03_ublock.js (legacy, kept __20 version) - Entire captcha2/ plugin (legacy version of twocaptcha/) Renamed hooks to follow consistent pattern: on_Crawl__XX_<plugin>_<action>.<ext> Priority bands: 00-09: Binary/extension installation 10-19: Config validation 20-29: Browser launch and post-launch config Final hooks: 00 ripgrep_install.py, 01 chrome_install.py 02 istilldontcareaboutcookies_install.js 03 ublock_install.js, 04 singlefile_install.js 05 twocaptcha_install.js 10 chrome_validate.py, 11 wget_validate.py 20 chrome_launch.bg.js, 25 twocaptcha_config.js	2025-12-31 22:47:36 +00:00
Nick Sweeting	73fde81fce	more migrations tweaks	2025-12-31 12:34:31 -08:00
Nick Sweeting	469932b469	more	2025-12-31 12:34:31 -08:00
Nick Sweeting	72f6a91b31	more progress bar and migrations fixes	2025-12-31 12:34:31 -08:00
Nick Sweeting	d5c0c64dcd	fix progress bars	2025-12-31 12:34:29 -08:00
Nick Sweeting	cb97f6651b	Add DNS traffic recorder plugin (#1748)	2025-12-31 11:02:43 -08:00
Nick Sweeting	60a4581ed8	Add tests for accessibility, parse_dom_outlinks, and consolelog plugins (#1749)	2025-12-31 11:01:56 -08:00
claude[bot]	1f84d1b467	Fix test assertions to fail when data is missing - Add assertIsNotNone for accessibility_data to ensure test fails if no data generated - Capture and report JSON decode errors in parse_dom_outlinks test - Add assertIsNotNone for outlinks_data with error details - Removes conditional checks that allowed tests to pass without verifying functionality Addresses review comments from cubic-dev-ai Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>	2025-12-31 19:00:30 +00:00
claude[bot]	483929391d	Fix test assertions to fail properly and add NXDOMAIN deduplication - test_seo.py: Add assertIsNotNone before conditional to catch SEO extraction failures - test_ssl.py: Add assertIsNotNone to ensure SSL data is captured from HTTPS URLs - test_pip_provider.py: Assert jsonl_found variable to verify binary discovery - dns plugin: Deduplicate NXDOMAIN records using seenResolutions map Tests now fail when functionality doesn't work (no cheating). Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>	2025-12-31 19:00:28 +00:00
Nick Sweeting	edc83bfac6	Add persona CLI command with browser cookie import (#1747)	2025-12-31 10:56:40 -08:00
Claude	2a68248602	Update all Chrome plugins to use shared chrome_utils.js Refactored 8 plugins to import shared utilities instead of duplicating code locally: - consolelog, redirects: Complete rewrite using shared utils - modalcloser, staticfile: Use readCdpUrl, readTargetId, parseArgs - dom, screenshot, pdf: Remove local parseArgs/getCdpUrl - headers: Import getEnv, getEnvBool, getEnvInt, parseArgs Removes ~380 lines of duplicated boilerplate code.	2025-12-31 18:35:25 +00:00
Claude	263335dc6d	Add tests for merkletree and custom binary provider plugins - merkletree: Tests merkle tree generation with real files, empty directory handling, and disabled mode - custom: Tests custom bash command execution and binary discovery	2025-12-31 18:30:04 +00:00
Claude	9703a8e88c	Add tests for responses, staticfile, and env provider plugins - responses: Tests network response capture during page load - staticfile: Tests static file detection and download skip for HTML - env: Tests PATH-based binary discovery (python3, bash)	2025-12-31 18:28:01 +00:00
Claude	cfa5edb160	Add tests for accessibility, parse_dom_outlinks, and consolelog plugins Real integration tests using Chrome sessions with example.com: - accessibility: Tests page outline and accessibility tree extraction - parse_dom_outlinks: Tests link extraction and categorization - consolelog: Tests console output capture	2025-12-31 18:25:48 +00:00
Claude	47d9874c1f	Merge remote-tracking branch 'origin/dev' into claude/dns-traffic-recorder-plugin-dNbxC	2025-12-31 18:24:56 +00:00
claude[bot]	08383c4d83	Fix tautological assertion in SEO test The assertion was checking 'has_seo_data or seo_data' inside an 'if seo_data:' block, making it always truthy. Changed to just check 'has_seo_data' to properly verify that expected SEO keys were extracted. Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>	2025-12-31 18:19:47 +00:00
Claude	5d8c93eaf4	Consolidate CDP connection logic into chrome_utils.js Add shared snapshot hook utilities to chrome_utils.js: - parseArgs(): CLI argument parsing - waitForChromeSession(): Wait for CDP session files - readCdpUrl(): Read CDP WebSocket URL - readTargetId(): Read target page ID - connectToPage(): High-level browser/page connection - waitForPageLoaded(): Wait for navigation completion Refactor ssl, responses, and dns plugins to use shared utilities, eliminating ~100 lines of duplicated code across plugins.	2025-12-31 12:15:30 +00:00
Claude	73425fa984	Add persona CLI command with browser cookie import - Add `archivebox persona create/list/update/delete` commands - Support `--import=chrome|firefox|brave` to copy browser profile - Extract cookies via CDP to generate cookies.txt for non-browser tools - Fix JSDoc comment parsing issue in chrome_utils.js	2025-12-31 12:13:07 +00:00
Claude	f2c20f141c	Refactor dns plugin to use chrome_utils.js Import shared utilities (getEnv, getEnvBool, getEnvInt) from chrome_utils.js instead of duplicating them locally. Also use DNS_TIMEOUT config for dynamic timeout calculations.	2025-12-31 12:08:28 +00:00
Claude	13148fd6b5	Add DNS traffic recorder plugin Records hostname → IP resolutions during page load using Chrome CDP. Uses Network.responseReceived events to capture DNS resolution data and writes one JSON line per record to dns.jsonl. Features: - Captures hostname to IP address mappings (A/AAAA records) - Records failed DNS lookups (NXDOMAIN) - Deduplicates resolution records per page load - Integrates with existing Chrome plugin infrastructure	2025-12-31 12:05:02 +00:00
Claude	8a0acdebcd	Add SSL, redirects, SEO plugin tests and fix fake test issues - Add real integration tests for SSL, redirects, and SEO plugins using Chrome session helpers for live URL testing - Remove fake "format" tests that just created dicts and asserted on them (apt, pip, npm provider output format tests) - Remove npm integration test that created dirs then checked they existed - Fix SQLite search test to use SQLITEFTS_DB constant instead of hardcoded value	2025-12-31 12:00:00 +00:00
Claude	a063d8cd43	Merge remote-tracking branch 'origin/dev' into claude/analyze-test-coverage-mWgwv	2025-12-31 11:45:22 +00:00
Claude	0cb5f0712d	Add comprehensive tests for machine/process models, orchestrator, and search backends This adds new test coverage for previously untested areas: Machine module (archivebox/machine/tests/): - Machine, NetworkInterface, Binary, Process model tests - BinaryMachine and ProcessMachine state machine tests - JSONL serialization/deserialization tests - Manager method tests Workers module (archivebox/workers/tests/): - PID file utility tests (write, read, cleanup) - Orchestrator lifecycle and queue management tests - Worker spawning logic tests - Idle detection and exit condition tests Search backends: - SQLite FTS5 search tests with real indexed content - Phrase search, stemming, and unicode support - Ripgrep search tests with archive directory structure - Environment variable configuration tests Binary provider plugins: - pip provider hook tests - npm provider hook tests with PATH updates - apt provider hook tests	2025-12-31 11:33:27 +00:00
claude[bot]	5121b0e5f9	Merge branch 'dev' into claude/refactor-process-management-WcQyZ Resolved conflicts by keeping Process model changes and accepting dev changes for unrelated files. Ensured pid_utils.py remains deleted as intended by this PR. Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>	2025-12-31 11:28:47 +00:00
Nick Sweeting	1d15901304	fix process health stats	2025-12-31 01:40:59 -08:00
Nick Sweeting	3d8c62ffb1	fix extensions dir paths add personas migration	2025-12-31 01:40:59 -08:00
Nick Sweeting	8dab2966cc	Consolidate Chrome test helpers across all plugin tests (#1738)	2025-12-31 01:25:39 -08:00

114 Commits