176 Commits

Author SHA1 Message Date
claude[bot]
c2bb4b25cb
Implement native LDAP authentication support
- Create archivebox/config/ldap.py with LDAPConfig class
- Create archivebox/ldap/ Django app with custom auth backend
- Update core/settings.py to conditionally load LDAP when enabled
- Add LDAP_CREATE_SUPERUSER support to auto-grant superuser privileges
- Add comprehensive tests in test_auth_ldap.py (no mocks, no skips)
- LDAP only activates if django-auth-ldap is installed and LDAP_ENABLED=True
- Helpful error messages when LDAP libraries are missing or config is incomplete

Fixes #1664

Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
2026-01-05 21:30:26 +00:00
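The conditional activation described in c2bb4b25cb (LDAP only loads when django-auth-ldap is installed and LDAP_ENABLED=True) could look roughly like the sketch below. This is not the code from the commit: `select_auth_backends` and its `library_installed` parameter are hypothetical names, and the stock `django_auth_ldap.backend.LDAPBackend` stands in for the custom backend the commit adds.

```python
import importlib.util

DEFAULT_BACKENDS = ["django.contrib.auth.backends.ModelBackend"]
# Assumption: the stock django-auth-ldap backend stands in for ArchiveBox's custom one.
LDAP_BACKEND = "django_auth_ldap.backend.LDAPBackend"


def select_auth_backends(ldap_enabled, library_installed=None):
    """Return the auth backend list, adding LDAP only when it is enabled and importable."""
    if library_installed is None:
        # find_spec returns None when the top-level package is not installed.
        library_installed = importlib.util.find_spec("django_auth_ldap") is not None
    if not ldap_enabled:
        return list(DEFAULT_BACKENDS)
    if not library_installed:
        raise RuntimeError(
            "LDAP_ENABLED=True but django-auth-ldap is not installed; "
            "install it or set LDAP_ENABLED=False"
        )
    # LDAP is tried first, then the regular Django model backend.
    return [LDAP_BACKEND] + DEFAULT_BACKENDS
```

Passing `library_installed` explicitly keeps the check testable without installing or uninstalling the library.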
Nick Sweeting
7ceaeae2d9
rename archive_org to archivedotorg, add BinaryWorker, fix config pass-through 2026-01-04 22:38:15 -08:00
Nick Sweeting
456aaee287
more migration id/uuid and config propagation fixes 2026-01-04 16:16:26 -08:00
Nick Sweeting
c2afb40350
fix lib bin dir and archivebox add hanging 2026-01-01 16:58:47 -08:00
Nick Sweeting
60422adc87
fix orchestrator statemachine and Process from archiveresult migrations 2026-01-01 16:43:02 -08:00
Nick Sweeting
876feac522
actually working migration path from 0.7.2 and 0.8.6 + renames and test coverage 2026-01-01 15:50:00 -08:00
Claude
04c23badc2
Fix output path structure for 0.9.x data directory
- Update Crawl.output_dir_parent to use username instead of user_id
  for consistency with Snapshot paths
- Add domain from first URL to Crawl path structure for easier debugging:
  users/{username}/crawls/YYYYMMDD/{domain}/{crawl_id}/
- Add CRAWL_OUTPUT_DIR to config passed to Snapshot hooks so chrome_tab
  can find the shared Chrome session from the Crawl
- Update comment in chrome_tab hook to reflect new config source
2025-12-31 08:18:24 +00:00
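A minimal sketch of the Crawl path layout named in 04c23badc2, users/{username}/crawls/YYYYMMDD/{domain}/{crawl_id}/. The function name, its parameters, and the "unknown" fallback for URLs without a hostname are assumptions for illustration, not taken from the commit.

```python
from datetime import date
from pathlib import PurePosixPath
from urllib.parse import urlparse


def crawl_output_dir(username, created, first_url, crawl_id):
    """Build users/{username}/crawls/YYYYMMDD/{domain}/{crawl_id} as a relative path."""
    # Assumption: fall back to a placeholder when the URL has no hostname.
    domain = urlparse(first_url).hostname or "unknown"
    return (
        PurePosixPath("users")
        / username
        / "crawls"
        / created.strftime("%Y%m%d")
        / domain
        / crawl_id
    )


path = crawl_output_dir("alice", date(2025, 12, 31), "https://example.com/page", "abc123")
print(path)
```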
Claude
b8a66c4a84
Convert Persona to Django ModelWithConfig, add to get_config()
- Convert Persona from plain Python class to Django model with ModelWithConfig
- Add config JSONField for persona-specific config overrides
- Add get_derived_config() method that returns config with derived paths:
  - CHROME_USER_DATA_DIR, CHROME_EXTENSIONS_DIR, COOKIES_FILE, ACTIVE_PERSONA

- Update get_config() to accept persona parameter in merge chain:
  get_config(persona=crawl.persona, crawl=crawl, snapshot=snapshot)

- Remove _derive_persona_paths() - derivation now happens in Persona model

- Merge order (highest to lowest priority):
  1. snapshot.config
  2. crawl.config
  3. user.config
  4. persona.get_derived_config()  <- NEW
  5. environment variables
  6. ArchiveBox.conf file
  7. plugin defaults
  8. core defaults

Usage:
  config = get_config(persona=crawl.persona, crawl=crawl)
  config['CHROME_USER_DATA_DIR']  # derived from persona
2025-12-31 01:07:29 +00:00
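The eight-level merge order listed in b8a66c4a84 can be sketched with `collections.ChainMap`, which returns the value from the first mapping that contains a key. `merge_config` and its keyword names are hypothetical; the real `get_config()` takes model objects rather than plain dicts, and the config keys in the usage lines are only examples.

```python
from collections import ChainMap


def merge_config(snapshot=None, crawl=None, user=None, persona=None,
                 env=None, conf_file=None, plugin_defaults=None, core_defaults=None):
    """Layer config dicts so that earlier layers override later ones."""
    # Order mirrors the commit message: highest priority first.
    layers = [snapshot, crawl, user, persona, env, conf_file, plugin_defaults, core_defaults]
    return ChainMap(*[dict(layer or {}) for layer in layers])


config = merge_config(
    crawl={"TIMEOUT": 30},
    persona={"CHROME_USER_DATA_DIR": "/personas/Work/chrome_user_data"},
    env={"TIMEOUT": 60, "CHROME_BINARY": "chromium"},
    core_defaults={"TIMEOUT": 10, "SAVE_WGET": True},
)
print(config["TIMEOUT"])  # crawl-level value wins over env and core defaults
```

A `ChainMap` keeps each layer intact, so it stays possible to inspect which layer supplied a value when debugging.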
Claude
877b5f91c2
Derive CHROME_USER_DATA_DIR from ACTIVE_PERSONA in config system
- Add _derive_persona_paths() in configset.py to automatically derive
  CHROME_USER_DATA_DIR and CHROME_EXTENSIONS_DIR from ACTIVE_PERSONA
  when not explicitly set. This allows plugins to use these paths
  without knowing about the persona system.

- Update chrome_utils.js launchChromium() to accept userDataDir option
  and pass --user-data-dir to Chrome. Also cleans up SingletonLock
  before launch.

- Update killZombieChrome() to clean up SingletonLock files from all
  persona chrome_user_data directories after killing zombies.

- Update chrome_cleanup() in misc/util.py to handle persona-based
  user data directories when cleaning up stale Chrome state.

- Simplify on_Crawl__20_chrome_launch.bg.js to use CHROME_USER_DATA_DIR
  and CHROME_EXTENSIONS_DIR from env (derived by get_config()).

Config priority flow:
  ACTIVE_PERSONA=WorkAccount (set on crawl/snapshot)
  -> get_config() derives:
     CHROME_USER_DATA_DIR = PERSONAS_DIR/WorkAccount/chrome_user_data
     CHROME_EXTENSIONS_DIR = PERSONAS_DIR/WorkAccount/chrome_extensions
  -> hooks receive these as env vars without needing persona logic
2025-12-31 00:21:07 +00:00
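The derivation flow in 877b5f91c2 (fill in CHROME_USER_DATA_DIR and CHROME_EXTENSIONS_DIR from ACTIVE_PERSONA unless they are set explicitly) might be sketched as follows. This is an illustration operating on a plain dict; the function body and the example paths are assumptions, and only the key names and directory suffixes come from the commit message.

```python
from pathlib import PurePosixPath


def derive_persona_paths(config):
    """Return a copy of config with persona-based Chrome paths filled in when missing."""
    derived = dict(config)
    persona = derived.get("ACTIVE_PERSONA")
    personas_dir = derived.get("PERSONAS_DIR")
    if not persona or not personas_dir:
        return derived  # nothing to derive from
    base = PurePosixPath(personas_dir) / persona
    # setdefault leaves explicitly configured values untouched.
    derived.setdefault("CHROME_USER_DATA_DIR", str(base / "chrome_user_data"))
    derived.setdefault("CHROME_EXTENSIONS_DIR", str(base / "chrome_extensions"))
    return derived


derived = derive_persona_paths({"PERSONAS_DIR": "/data/personas", "ACTIVE_PERSONA": "WorkAccount"})
print(derived["CHROME_USER_DATA_DIR"])
```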
claude[bot]
762cddc8c5
fix: address PR review comments from cubic-dev-ai
- Add JSONL_INDEX_FILENAME to ALLOWED_IN_DATA_DIR for consistency
- Fix fallback logic in legacy.py to try JSON when JSONL parsing fails
- Replace bare except clauses with specific exception types
- Fix stdin double-consumption in archivebox_crawl.py
- Merge CLI --tag option with crawl tags in archivebox_snapshot.py
- Remove tautological mock tests (covered by integration tests)

Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
2025-12-30 20:09:51 +00:00
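Two of the fixes in 762cddc8c5, falling back to JSON when JSONL parsing fails and catching a specific exception instead of a bare `except`, can be illustrated together. `parse_index_text` is a hypothetical helper, not the function in legacy.py.

```python
import json


def parse_index_text(text):
    """Parse text as JSONL; if a line is not valid JSON, retry the text as one JSON document."""
    try:
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    except json.JSONDecodeError:
        # Pretty-printed legacy JSON spans several lines, so per-line parsing fails.
        # A second JSONDecodeError here propagates to the caller on purpose.
        return [json.loads(text)]


jsonl_text = '{"type": "Snapshot"}\n{"type": "ArchiveResult"}\n'
legacy_text = '{\n  "url": "https://example.com"\n}'
print(len(parse_index_text(jsonl_text)), len(parse_index_text(legacy_text)))
```

Catching only `json.JSONDecodeError` means unrelated failures (for example a `KeyboardInterrupt` or an I/O error) are not swallowed by the fallback.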
Claude
d36079829b
feat: replace index.json with index.jsonl flat JSONL format
Switch from hierarchical index.json to flat index.jsonl format for
snapshot metadata storage. Each line is a self-contained JSON record
with a 'type' field (Snapshot, ArchiveResult, Binary, Process).

Changes:
- Add JSONL_INDEX_FILENAME constant to constants.py
- Add TYPE_PROCESS and TYPE_MACHINE to jsonl.py type constants
- Add binary_to_jsonl(), process_to_jsonl(), machine_to_jsonl() converters
- Add Snapshot.write_index_jsonl() to write new format
- Add Snapshot.read_index_jsonl() to read new format
- Add Snapshot.convert_index_json_to_jsonl() for migration
- Update Snapshot.reconcile_with_index() to handle both formats
- Update fs_migrate to convert during filesystem migration
- Update load_from_directory/create_from_directory for both formats
- Update legacy.py parse_json_links_details for JSONL support

The new format is easier to parse, extend, and mix record types.
2025-12-30 18:21:06 +00:00
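The flat format introduced in d36079829b (one self-contained JSON record per line, each with a 'type' field) can be sketched as below. The commit adds `write_index_jsonl()` and `read_index_jsonl()` as Snapshot methods; the standalone functions here reuse those names with hypothetical signatures, and the record fields other than 'type' are invented for the example.

```python
import io
import json


def write_index_jsonl(fp, records):
    """Write one JSON object per line; every record must carry a 'type' field."""
    for record in records:
        if "type" not in record:
            raise ValueError("each JSONL record needs a 'type' field")
        fp.write(json.dumps(record, sort_keys=True) + "\n")


def read_index_jsonl(fp):
    """Read JSONL and group the records by their 'type' field."""
    by_type = {}
    for line in fp:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        by_type.setdefault(record["type"], []).append(record)
    return by_type


buf = io.StringIO()
write_index_jsonl(buf, [
    {"type": "Snapshot", "url": "https://example.com"},
    {"type": "ArchiveResult", "plugin": "wget"},
    {"type": "ArchiveResult", "plugin": "pdf"},
])
buf.seek(0)
grouped = read_index_jsonl(buf)
```

Because each line parses independently, new record types can be appended to the file without touching or re-reading the existing lines.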
claude[bot]
329d185d95
Fix: Make CUSTOM_TEMPLATES_DIR configurable again
Resolves issue #1484 where CUSTOM_TEMPLATES_DIR configuration was
being ignored. The setting was previously removed from ServerConfig
and hardcoded as a constant, preventing users from customizing the
templates directory location.

Changes:
- Added CUSTOM_TEMPLATES_DIR field to StorageConfig in common.py
- Updated settings.py to use STORAGE_CONFIG.CUSTOM_TEMPLATES_DIR
- Updated paths.py to use configurable value in version output

Users can now configure the custom templates directory via:
- ArchiveBox.conf: CUSTOM_TEMPLATES_DIR = ./custom_templates
- Environment variable: export CUSTOM_TEMPLATES_DIR=/path/to/templates
- Defaults to DATA_DIR/user_templates if not configured

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
2025-12-29 21:50:21 +00:00
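The three ways of setting CUSTOM_TEMPLATES_DIR listed in 329d185d95 amount to a lookup with a default. The sketch below assumes the environment variable takes precedence over ArchiveBox.conf, which the commit message does not state; `resolve_custom_templates_dir` and its dict parameters are hypothetical.

```python
from pathlib import PurePosixPath


def resolve_custom_templates_dir(data_dir, env=None, conf=None):
    """Pick CUSTOM_TEMPLATES_DIR from env, then the conf file, else DATA_DIR/user_templates."""
    key = "CUSTOM_TEMPLATES_DIR"
    # Assumption: environment variables override values from ArchiveBox.conf.
    for source in (env or {}, conf or {}):
        value = source.get(key)
        if value:
            return PurePosixPath(value)
    return PurePosixPath(data_dir) / "user_templates"


print(resolve_custom_templates_dir("/data"))
```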
Claude
88d7906033
Add MAX_URL_ATTEMPTS config option to stop retries after too many failures
Adds a new MAX_URL_ATTEMPTS configuration option (default: 50) that stops
retrying ArchiveResult hooks for a snapshot once that many failures have
been recorded. This prevents infinite retry loops for problematic URLs.

When the limit is reached, any pending ArchiveResults for that snapshot
are marked as SKIPPED with an explanatory message.
2025-12-29 20:20:50 +00:00
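The retry cutoff described in 88d7906033 could work along these lines: count recorded failures for a snapshot and, once the count reaches MAX_URL_ATTEMPTS, mark whatever is still pending as skipped with an explanation. The `Result` class, the status strings, and `enforce_max_attempts` are stand-ins invented for this sketch, not ArchiveBox's models.

```python
from dataclasses import dataclass


@dataclass
class Result:
    status: str        # hypothetical values: "pending", "failed", "succeeded", "skipped"
    message: str = ""


def enforce_max_attempts(results, max_url_attempts=50):
    """Mark pending results as skipped once the failure count reaches the limit."""
    failures = sum(1 for r in results if r.status == "failed")
    if failures < max_url_attempts:
        return False  # limit not reached, retries may continue
    for r in results:
        if r.status == "pending":
            r.status = "skipped"
            r.message = (
                f"Skipped after {failures} failed attempts "
                f"(MAX_URL_ATTEMPTS={max_url_attempts})"
            )
    return True


results = [Result("failed"), Result("failed"), Result("pending")]
below_limit = enforce_max_attempts(results, max_url_attempts=3)
status_before = results[2].status
at_limit = enforce_max_attempts(results, max_url_attempts=2)
```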
Nick Sweeting
30c60eef76
much better tests and add page ui 2025-12-29 04:02:11 -08:00
Nick Sweeting
f4e7820533
use full dotted paths for all archivebox imports, add migrations and more fixes 2025-12-29 00:47:08 -08:00
Nick Sweeting
f0aa19fa7d
wip 2025-12-28 17:51:54 -08:00
Nick Sweeting
50e527ec65
way better plugin hooks system wip 2025-12-28 03:39:59 -08:00
Nick Sweeting
9838d7ba02
tons of ui fixes and plugin fixes 2025-12-25 03:59:51 -08:00
Nick Sweeting
bb53228ebf
remove Seed model in favor of Crawl as template 2025-12-25 01:52:41 -08:00
Nick Sweeting
866f993f26
logging and admin ui improvements 2025-12-25 01:10:41 -08:00
Nick Sweeting
d95f0dc186
remove huey 2025-12-24 23:40:18 -08:00
Nick Sweeting
6c769d831c
wip 2 2025-12-24 21:46:14 -08:00
Nick Sweeting
1915333b81
wip major changes 2025-12-24 20:10:38 -08:00
Nick Sweeting
ac53fdf677
make chrome binary and configs directly runnable and make extractor use external bin 2024-12-06 02:06:39 -08:00
Nick Sweeting
c9a05c9d94
working archivebox update CLI cmd 2024-11-19 02:32:05 -08:00
Nick Sweeting
328eb98a38
move main funcs into cli files and switch to using click for CLI 2024-11-19 00:18:51 -08:00
Nick Sweeting
4a5d607296
move logging_util into archivebox.misc subfolder 2024-11-18 19:08:49 -08:00
Nick Sweeting
e469c5a344
merge queues and actors apps into new workers app 2024-11-18 18:52:48 -08:00
Nick Sweeting
67c22b2df0
fix config set not working with constants 2024-11-18 04:27:37 -08:00
Nick Sweeting
c8e186f21b
fix plugin loading order, admin, abx-pkg 2024-11-16 06:44:12 -08:00
Nick Sweeting
684a394cba
add HOSTNAME to config.permissions 2024-11-16 02:45:58 -08:00
Nick Sweeting
9b24fe7390
merge dev 2024-11-02 17:34:33 -07:00
Nick Sweeting
721427a484
hide progress bar on startup 2024-10-31 07:11:15 -07:00
Nick Sweeting
d93aa46949
fix django.forms.JSONField does not exist 500 error 2024-10-28 18:47:45 -07:00
Nick Sweeting
b3c1cb716e
move abx plugins inside vendor dir 2024-10-28 04:07:35 -07:00
Nick Sweeting
60f0458c77
rename configfile to collection 2024-10-24 15:40:24 -07:00
Nick Sweeting
9e40dd69a4
more config improvements, move away from settings GLOBALS to getters 2024-10-24 14:50:07 -07:00
Nick Sweeting
312e40b95b
finally get rid of config/legacy in favor of configfile.py and django.py 2024-10-21 03:06:19 -07:00
Nick Sweeting
b3107ab830
move final legacy config to plugins and fix archivebox config cmd and add search opt 2024-10-21 02:56:00 -07:00
Nick Sweeting
7a6f1f36d2
trigger abx.pm.hook.ready from core.AppConfig.ready 2024-10-21 01:31:02 -07:00
Nick Sweeting
a211461ffc
fix LIB_DIR and TMP_DIR loading when primary option isn't available 2024-10-21 00:35:56 -07:00
Nick Sweeting
80d8a6b667
split archivebox.use into archivebox.reads and archivebox.writes 2024-10-15 01:03:01 -07:00
Nick Sweeting
df79b8e038
rename config sections to match old sections 2024-10-15 01:01:34 -07:00
Nick Sweeting
01ba6d49d3
new vastly simplified plugin spec without pydantic 2024-10-14 21:50:47 -07:00
Nick Sweeting
86380a1ef2
fix .archivebox_id being created outside collection dir 2024-10-14 17:35:43 -07:00
Nick Sweeting
6e7071bd19
add new binproviders and binaries args to install and version, bump pydantic-pkgr version 2024-10-11 00:45:59 -07:00
Nick Sweeting
0c29e08f73
avoid creating collection id file on every startup since it's not needed 2024-10-09 19:12:08 -07:00
Nick Sweeting
de7ab65f11
ignore errors when chowning at initial startup 2024-10-09 04:48:09 -07:00
Nick Sweeting
ad675a8e7c
properly handle chowning DATA_DIR on init when using sudo 2024-10-09 04:39:09 -07:00
Nick Sweeting
1b7aca130b
properly detect sudo UID 2024-10-09 04:02:46 -07:00