ArchiveBox_ArchiveBox

mirror of https://github.com/ArchiveBox/ArchiveBox.git synced 2026-01-18 18:12:51 +08:00

Author	SHA1	Message	Date
Nick Sweeting	dd2302ad92	new jsonl cli interface	2025-12-30 16:12:53 -08:00
Claude	69965a2782	fix: correct CLI pipeline data flow for crawl -> snapshot -> extract - archivebox crawl: creates Crawl records, outputs Crawl JSONL - archivebox snapshot: accepts Crawl JSONL, creates Snapshots, outputs Snapshot JSONL - archivebox extract: accepts Snapshot JSONL, runs extractors, outputs ArchiveResult JSONL Changes: - Add Crawl.from_jsonl() method for creating Crawl from JSONL records - Rewrite archivebox_crawl.py to create Crawl jobs without immediately starting them - Update archivebox_snapshot.py to accept both Crawl JSONL and plain URLs - Update jsonl.py docstring to document the pipeline	2025-12-30 19:42:41 +00:00
Claude	ae648c9bc1	refactor: move remaining JSONL methods to models, clean up jsonl.py - Add Tag.to_jsonl() method with schema_version - Add Crawl.to_jsonl() method with schema_version - Fix Tag.from_jsonl() to not depend on jsonl.py helper - Update tests to use Snapshot.from_jsonl() instead of non-existent get_or_create_snapshot Remove model-specific functions from misc/jsonl.py: - tag_to_jsonl() - use Tag.to_jsonl() instead - crawl_to_jsonl() - use Crawl.to_jsonl() instead - get_or_create_tag() - use Tag.from_jsonl() instead - process_jsonl_records() - use model from_jsonl() methods directly jsonl.py now only contains generic I/O utilities: - Type constants (TYPE_SNAPSHOT, etc.) - parse_line(), read_stdin(), read_file(), read_args_or_stdin() - write_record(), write_records() - filter_by_type(), process_records()	2025-12-30 19:30:18 +00:00
Claude	bc273c5a7f	feat: add schema_version to JSONL outputs and remove dead code - Add schema_version (archivebox.VERSION) to all to_jsonl() outputs: - Snapshot.to_jsonl() - ArchiveResult.to_jsonl() - Binary.to_jsonl() - Process.to_jsonl() - Update CLI commands to use model methods directly: - archivebox_snapshot.py: snapshot.to_jsonl() - archivebox_extract.py: result.to_jsonl() - Remove dead wrapper functions from misc/jsonl.py: - snapshot_to_jsonl() - archiveresult_to_jsonl() - binary_to_jsonl() - process_to_jsonl() - machine_to_jsonl() - Update tests to use model methods directly	2025-12-30 19:24:53 +00:00
Claude	a5206e7648	refactor: move to_jsonl() methods to models Move JSONL serialization from standalone functions to model methods to mirror the from_jsonl() pattern: - Add Binary.to_jsonl() method - Add Process.to_jsonl() method - Add ArchiveResult.to_jsonl() method - Add Snapshot.to_jsonl() method - Update write_index_jsonl() to use model methods - Update jsonl.py functions to be thin wrappers	2025-12-30 18:35:22 +00:00
Claude	d36079829b	feat: replace index.json with index.jsonl flat JSONL format Switch from hierarchical index.json to flat index.jsonl format for snapshot metadata storage. Each line is a self-contained JSON record with a 'type' field (Snapshot, ArchiveResult, Binary, Process). Changes: - Add JSONL_INDEX_FILENAME constant to constants.py - Add TYPE_PROCESS and TYPE_MACHINE to jsonl.py type constants - Add binary_to_jsonl(), process_to_jsonl(), machine_to_jsonl() converters - Add Snapshot.write_index_jsonl() to write new format - Add Snapshot.read_index_jsonl() to read new format - Add Snapshot.convert_index_json_to_jsonl() for migration - Update Snapshot.reconcile_with_index() to handle both formats - Update fs_migrate to convert during filesystem migration - Update load_from_directory/create_from_directory for both formats - Update legacy.py parse_json_links_details for JSONL support The new format is easier to parse, extend, and mix record types.	2025-12-30 18:21:06 +00:00
Nick Sweeting	f0aa19fa7d	wip	2025-12-28 17:51:54 -08:00
Nick Sweeting	bd265c0083	rename extractor to plugin everywhere	2025-12-28 04:43:15 -08:00
Nick Sweeting	50e527ec65	way better plugin hooks system wip	2025-12-28 03:39:59 -08:00
Claude	b632894bc9	Update views, API, and exports for new ArchiveResult output fields Replace old `output` field with new fields across the codebase: - output_str: Human-readable output summary - output_json: Structured metadata (optional) - output_files: Dict of output files with metadata - output_size: Total size in bytes - output_mimetypes: CSV of file mimetypes Files updated: - api/v1_core.py: Update MinimalArchiveResultSchema to expose new fields - api/v1_core.py: Update ArchiveResultFilterSchema to search output_str - cli/archivebox_extract.py: Use output_str in CLI output - core/admin_archiveresults.py: Update admin fields, search, and fieldsets - core/admin_archiveresults.py: Fix output_html variable name bug in output_summary - misc/jsonl.py: Update archiveresult_to_jsonl() to include new fields - plugins/extractor_utils.py: Update ExtractorResult helper class The embed_path() method already uses output_files and output_str, so snapshot detail page and template tags work correctly.	2025-12-27 20:28:22 +00:00
Claude	c3acadd528	Remove extractor field from Crawl model and fix tests - Remove extractor field from Crawl model (moved to config dict) - Update migration 0002_drop_seed_model to not add extractor - Update archivebox_add.py to use config['PARSER'] instead - Update admin.py recrawl to not pass extractor - Update jsonl.py serialization to not include extractor - Update test schema SCHEMA_0_8 to not include extractor - Set default timeout to 60s for test commands	2025-12-27 01:49:09 +00:00
Nick Sweeting	bb53228ebf	remove Seed model in favor of Crawl as template	2025-12-25 01:52:41 -08:00
Nick Sweeting	1915333b81	wip major changes	2025-12-24 20:10:38 -08:00

13 Commits