feat: adds crawl resource cache #25

Merged
admin merged 1 commit from feature/crawl-resource-cache into main 2026-05-14 09:27:07 +00:00

Summary

Adds resource-level memoization for full crawl runs so unchanged employee profiles can be skipped without creating redundant snapshots.

What changed

  • Added parse_resource_cache storage for crawled profile resources.
  • Added conditional HTTP requests with ETag / Last-Modified support.
  • Reuses cached response bodies on 304 Not Modified.
  • Tracks profile resources used during parsing: main HTML, publications API, and graduation theses API.
  • Skips new EmployeeSnapshot creation when parsed checksum and parser version are unchanged.
  • Added crawl_runs.skipped_count and surfaced it in admin/API/MCP progress payloads.
  • Normalized date-dependent experience text for checksums, so automatic “years of experience” drift does not count as profile changes.
  • Updated README MCP tools list and parsing progress docs.
  • Bumped service version to 0.6.0.
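
The conditional-request flow described above could look roughly like the following. This is a minimal sketch, not the code from this PR: `CachedResource`, `conditional_headers`, and `apply_response` are hypothetical names, the dataclass only stands in for a `parse_resource_cache` row, and response headers are assumed to arrive in a plain dict with canonical capitalization.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class CachedResource:
    """Hypothetical stand-in for one parse_resource_cache row."""
    url: str
    body: bytes
    etag: Optional[str] = None
    last_modified: Optional[str] = None


def conditional_headers(entry: Optional[CachedResource]) -> dict:
    """Build validator headers for a conditional GET from a cached entry."""
    headers = {}
    if entry is None:
        return headers  # nothing cached yet: plain unconditional request
    if entry.etag:
        headers["If-None-Match"] = entry.etag
    if entry.last_modified:
        headers["If-Modified-Since"] = entry.last_modified
    return headers


def apply_response(
    url: str,
    status: int,
    headers: dict,
    body: bytes,
    entry: Optional[CachedResource],
) -> Tuple[CachedResource, bool]:
    """Return (resource, reused); reused is True when the cached body is kept."""
    if status == 304 and entry is not None:
        # 304 Not Modified carries no body, so the cached one is reused as-is.
        return entry, True
    fresh = CachedResource(
        url=url,
        body=body,
        etag=headers.get("ETag"),
        last_modified=headers.get("Last-Modified"),
    )
    return fresh, False
```

Keeping the header construction and the response handling free of any HTTP client makes both easy to unit-test; the real crawler would pass `conditional_headers(entry)` to whatever client it uses and feed the result into the second function.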

Testing

  • Added coverage for conditional cache reuse via ETag.
  • Added coverage for skipping snapshots when checksum is unchanged.
  • Added checksum regression tests for widget changes and date-dependent experience text.
  • Updated progress/version tests.
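
The checksum behaviour these tests cover can be illustrated with a small sketch. Everything here is an assumption for illustration: the regular expression, the placeholder text, the use of SHA-256 over sorted JSON, and the function names are hypothetical and may differ from what the PR actually does.

```python
import hashlib
import json
import re

# Hypothetical pattern for date-dependent text such as "12 years of experience".
_EXPERIENCE_RE = re.compile(r"\b\d+\s+years?\s+of\s+experience\b", re.IGNORECASE)


def normalize_for_checksum(text: str) -> str:
    """Replace the drifting year count with a fixed placeholder."""
    return _EXPERIENCE_RE.sub("<N> years of experience", text)


def profile_checksum(parsed: dict) -> str:
    """Hash a parsed profile after normalizing its string fields."""
    normalized = {
        key: normalize_for_checksum(value) if isinstance(value, str) else value
        for key, value in parsed.items()
    }
    # sort_keys gives a stable serialization regardless of dict ordering.
    payload = json.dumps(normalized, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def should_skip_snapshot(
    prev_checksum: str,
    prev_parser_version: str,
    checksum: str,
    parser_version: str,
) -> bool:
    """Skip a new snapshot only if both checksum and parser version match."""
    return prev_checksum == checksum and prev_parser_version == parser_version
```

Including the parser version in the comparison means a parser upgrade still produces a fresh snapshot even when the source page has not changed.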

Validation run:

.venv\Scripts\python.exe -m pytest
admin added 1 commit 2026-05-14 09:26:58 +00:00
admin merged commit 4b91effee3 into main 2026-05-14 09:27:07 +00:00
Reference: admin/miem_workers#25