Compare commits

...

2 Commits

Author SHA1 Message Date
4b91effee3 Merge pull request 'feat: adds crawl resource cache' (#25) from feature/crawl-resource-cache into main
Reviewed-on: #25
2026-05-14 09:27:06 +00:00
Anton
6724b3f369 feat: adds crawl resource cache 2026-05-14 12:21:44 +03:00
20 changed files with 1192 additions and 73 deletions

592
MCP_DESCRIPTION.md Normal file
View File

@@ -0,0 +1,592 @@
# MCP: описание работы, структуры и тулзов
Документ описывает MCP endpoint сервиса `miem-employees` по текущей реализации в `app/mcp.py`.
## Где находится MCP
- FastAPI router: `app.mcp.router`
- Подключение к приложению: `app/main.py`
- HTTP endpoint: `POST /mcp`
- Локально при обычном запуске API: `http://localhost:8000/mcp`
- В Docker Compose через отдельный сервис `mcp`: `http://localhost:8001/mcp`
- Авторизация на уровне приложения: отсутствует. Заголовок `Authorization` не проверяется и не влияет на ответ.
Если доступ к MCP нужно ограничить, это должно делаться внешним контуром: bind на localhost, VPN, firewall, reverse proxy или отдельная сетевая политика.
## Протокол
Endpoint принимает JSON-RPC 2.0 over HTTP.
Общий формат запроса:
```json
{
"jsonrpc": "2.0",
"id": 1,
"method": "tools/list",
"params": {}
}
```
Общий формат успешного ответа:
```json
{
"jsonrpc": "2.0",
"id": 1,
"result": {}
}
```
Общий формат ошибки:
```json
{
"jsonrpc": "2.0",
"id": 1,
"error": {
"code": -32601,
"message": "Method not found"
}
}
```
Поддерживаемая версия MCP-протокола:
```text
2024-11-05
```
Имя сервиса:
```text
miem-employees
```
Версия сервера берется из `app.version.BACKEND_VERSION`.
## Поддерживаемые JSON-RPC методы
### initialize
Возвращает метаданные MCP-сервера и capabilities.
Запрос:
```json
{
"jsonrpc": "2.0",
"id": 1,
"method": "initialize",
"params": {}
}
```
Ответ:
```json
{
"jsonrpc": "2.0",
"id": 1,
"result": {
"protocolVersion": "2024-11-05",
"serverInfo": {
"name": "miem-employees",
"version": "0.5.0"
},
"capabilities": {
"tools": {}
}
}
}
```
### tools/list
Возвращает список доступных tools с JSON Schema для аргументов.
Запрос:
```json
{
"jsonrpc": "2.0",
"id": 1,
"method": "tools/list",
"params": {}
}
```
Ответ содержит массив `result.tools`.
### tools/call
Вызывает один tool по имени.
Запрос:
```json
{
"jsonrpc": "2.0",
"id": 1,
"method": "tools/call",
"params": {
"name": "search_employees",
"arguments": {
"query": "Сергеев",
"limit": 20
}
}
}
```
Ответ tool всегда заворачивается в MCP content-массив:
```json
{
"jsonrpc": "2.0",
"id": 1,
"result": {
"content": [
{
"type": "text",
"text": "{\"items\":[]}"
}
]
}
}
```
Поле `text` содержит сериализованный JSON с `ensure_ascii=false`. Клиент должен распарсить это поле как JSON, если ему нужна структурированная нагрузка.
## Ошибки
- Неизвестный JSON-RPC метод: `code = -32601`, `message = "Method not found"`.
- Исключения при обработке tool: `code = -32000`, `message` содержит текст исключения.
- Если сущность не найдена внутри отдельных tools, HTTP и JSON-RPC ответ остаются успешными, а полезная нагрузка содержит `{"error": "not_found"}`.
## Источники данных
MCP читает данные из основной базы через SQLAlchemy session из `app.db.get_db`.
Основные таблицы и модели:
- `employees`: текущая карточка сотрудника, статус, профиль, `current_data`, checksum.
- `crawl_runs`: история запусков парсинга.
- `crawl_run_employee_changes`: детальные изменения сотрудников в рамках запуска.
- `crawl_errors`: ошибки парсинга в рамках запуска.
- `dataset_versions`: версии полного набора сотрудников.
- `dataset_version_items`: состав конкретной версии набора сотрудников.
## Общая структура employee payload
Краткая карточка сотрудника:
```json
{
"profile_key": "staff:avsergeev",
"profile_id": "avsergeev",
"full_name": "Сергеев Алексей Викторович",
"status": "active",
"canonical_url": "https://www.hse.ru/staff/avsergeev",
"last_seen_at": "2026-05-14T10:00:00+00:00",
"dismissed_at": null
}
```
В sync payload дополнительно отдается `checksum`.
Полная карточка дополнительно содержит:
```json
{
"data": {
"contacts": {},
"sections": []
}
}
```
`data` соответствует распарсенному JSON профиля сотрудника. Внутри `sections` могут быть секции с публикациями, курсами, ВКР, таблицами, ссылками и произвольными текстовыми блоками.
## Tools
### get_service_info
Назначение: вернуть метаданные сервиса, список tools и текущую версию набора сотрудников.
Аргументы: отсутствуют.
Возвращает:
```json
{
"service_name": "miem-employees",
"backend_version": "0.5.0",
"protocolVersion": "2024-11-05",
"tools": [],
"dataset": {
"hash": "sha256",
"previous_hash": "sha256 или null",
"created_at": "2026-05-14T10:00:00+00:00",
"crawl_run_id": 123,
"employee_count": 100,
"active_count": 95,
"dismissed_count": 5
}
}
```
Особенность: перед ответом сервис создает актуальную `dataset_version`, если текущий набор сотрудников еще не имеет версии.
### sync_employees
Назначение: синхронизировать клиентский кэш сотрудников по hash набора данных.
Аргументы:
```json
{
"client_hash": "sha256 или null",
"include_data": true
}
```
- `client_hash`: hash версии, которая уже есть у клиента. Если не передан, отдается полный snapshot.
- `include_data`: управляет включением полного `data` в карточки сотрудников. По умолчанию `true`.
Полный ответ без `client_hash`:
```json
{
"mode": "full",
"from_hash": null,
"to_hash": "current-sha256",
"dataset": {},
"items": []
}
```
Если клиентский hash совпадает с текущим:
```json
{
"mode": "delta",
"from_hash": "current-sha256",
"to_hash": "current-sha256",
"dataset": {},
"changes": {
"added": [],
"updated": [],
"dismissed": [],
"removed": []
}
}
```
Если `client_hash` неизвестен серверу:
```json
{
"mode": "full",
"from_hash": "missing",
"to_hash": "current-sha256",
"dataset": {},
"items": [],
"reason": "unknown_client_hash"
}
```
Если `client_hash` найден и отличается от текущего:
```json
{
"mode": "delta",
"from_hash": "old-sha256",
"to_hash": "current-sha256",
"dataset": {},
"changes": {
"added": [],
"updated": [],
"dismissed": [],
"removed": []
}
}
```
Логика delta:
- `added`: сотрудник появился в новой версии.
- `updated`: изменился checksum или статус, и сотрудник активен.
- `dismissed`: сотрудник есть в новой версии, но получил статус `dismissed`.
- `removed`: `profile_key` был в старой версии, но отсутствует в новой.
Hash набора считается по отсортированному списку `{profile_key, status, checksum}`.
### search_employees
Назначение: найти сотрудников по ФИО или canonical URL.
Аргументы:
```json
{
"query": "Сергеев",
"status": "active",
"limit": 20
}
```
- `query`: обязательный по schema, но в коде пустая строка означает поиск без текстового фильтра.
- `status`: опционально, только `active` или `dismissed`.
- `limit`: максимум 100, по умолчанию 20.
Возвращает массив кратких employee payload без `data`:
```json
[
{
"profile_key": "staff:avsergeev",
"profile_id": "avsergeev",
"full_name": "Сергеев Алексей Викторович",
"status": "active",
"canonical_url": "https://www.hse.ru/staff/avsergeev",
"last_seen_at": "2026-05-14T10:00:00+00:00",
"dismissed_at": null
}
]
```
### get_employee
Назначение: получить одну карточку сотрудника.
Аргументы:
```json
{
"profile_id_or_url": "avsergeev"
}
```
Поиск выполняется по:
- `profile_key`
- `profile_id`
- точному `canonical_url`
- частичному совпадению `canonical_url`
Возвращает полный employee payload с `data`.
Если сотрудник не найден:
```json
{
"error": "not_found"
}
```
### list_employee_publications
Назначение: вернуть публикации сотрудника из распарсенных секций профиля.
Аргументы:
```json
{
"profile_id_or_url": "avsergeev"
}
```
Сервис ищет секции `current_data.sections` с `type = "publications"` и объединяет массивы `publications`.
Ответ:
```json
{
"employee": {},
"items": [
{
"title": "Название публикации",
"text": "Полное описание",
"url": "https://..."
}
]
}
```
Если сотрудник или данные профиля отсутствуют:
```json
{
"items": []
}
```
### list_employee_courses
Назначение: вернуть курсы преподавания сотрудника из распарсенных секций профиля.
Аргументы:
```json
{
"profile_id_or_url": "avsergeev"
}
```
Сервис ищет секции `current_data.sections` с `type = "courses_by_year"` и объединяет массивы `courses`.
Ответ:
```json
{
"employee": {},
"items": [
{
"title": "Название курса",
"url": "https://..."
}
]
}
```
Если сотрудник или данные профиля отсутствуют:
```json
{
"items": []
}
```
### get_crawl_status
Назначение: вернуть последний запуск парсинга.
Аргументы: отсутствуют.
Ответ:
```json
{
"id": 123,
"status": "completed",
"source_url": "https://miem.hse.ru/persons",
"started_at": "2026-05-14T10:00:00+00:00",
"finished_at": "2026-05-14T10:10:00+00:00",
"found_count": 100,
"parsed_count": 98,
"error_count": 2,
"dismissed_count": 1
}
```
Если запусков еще не было:
```json
{
"status": "never_run"
}
```
### get_crawl_run_details
Назначение: вернуть детальную информацию по конкретному запуску парсинга: summary, изменения сотрудников и ошибки.
Аргументы:
```json
{
"run_id": 123
}
```
Ответ:
```json
{
"id": 123,
"source_url": "https://miem.hse.ru/persons",
"status": "completed",
"status_display": "Завершен",
"started_at": "2026-05-14T10:00:00+00:00",
"finished_at": "2026-05-14T10:10:00+00:00",
"started_display": "14.05.2026 13:00",
"finished_display": "14.05.2026 13:10",
"found_count": 100,
"parsed_count": 98,
"new_count": 3,
"error_count": 2,
"dismissed_count": 1,
"processed_count": 100,
"progress_percent": 100.0,
"message": null,
"changes_detail_available": true,
"changes": {
"new": [],
"missing_from_source": [],
"dismissed": []
},
"errors": []
}
```
Если запуск не найден:
```json
{
"error": "not_found"
}
```
## Примеры curl
Список tools:
```bash
curl http://localhost:8001/mcp \
-H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}'
```
Поиск сотрудника:
```bash
curl http://localhost:8001/mcp \
-H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","id":2,"method":"tools/call","params":{"name":"search_employees","arguments":{"query":"Сергеев","limit":5}}}'
```
Полная синхронизация:
```bash
curl http://localhost:8001/mcp \
-H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","id":3,"method":"tools/call","params":{"name":"sync_employees","arguments":{"include_data":false}}}'
```
Delta-синхронизация:
```bash
curl http://localhost:8001/mcp \
-H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","id":4,"method":"tools/call","params":{"name":"sync_employees","arguments":{"client_hash":"known-sha256","include_data":true}}}'
```
## Как MCP используется клиентом
1. Клиент вызывает `initialize` и проверяет `protocolVersion`.
2. Клиент вызывает `tools/list`, чтобы получить актуальный список tools и input schemas.
3. Для поиска и точечных запросов клиент вызывает `tools/call` с `search_employees`, `get_employee`, `list_employee_publications`, `list_employee_courses`, `get_crawl_status` или `get_crawl_run_details`.
4. Для локального кэша клиент вызывает `get_service_info` или `sync_employees`.
5. Клиент хранит последний `dataset.hash`.
6. При следующей синхронизации клиент передает hash как `client_hash`.
7. Сервер возвращает пустую delta, delta с изменениями или полный snapshot, если hash неизвестен.
## Важные особенности реализации
- MCP endpoint read-only: tools не запускают парсинг и не меняют сотрудников напрямую.
- `get_service_info` и `sync_employees` могут создать новую запись `dataset_versions`, если состояние сотрудников изменилось и новой версии еще нет.
- Все tool payloads возвращаются как JSON-строка внутри `content[0].text`.
- `search_employees` ищет через `ilike` по `full_name` и `canonical_url`.
- `get_employee` допускает частичный URL, поэтому строка `133709486` может найти `https://www.hse.ru/org/persons/133709486`.
- Временные значения сериализуются через `isoformat()`, display-поля для админских payload формируются в часовом поясе `Europe/Moscow`.

View File

@@ -44,7 +44,6 @@ uvicorn app.main:app --reload
- `Dashboard`: общая статистика, последний добавленный сотрудник, прогресс текущего/последнего парсинга и ручной запуск. - `Dashboard`: общая статистика, последний добавленный сотрудник, прогресс текущего/последнего парсинга и ручной запуск.
- `Directory`: настраиваемая таблица сотрудников с фильтрами, сортировкой, пагинацией и выбором колонок. - `Directory`: настраиваемая таблица сотрудников с фильтрами, сортировкой, пагинацией и выбором колонок.
- `Employees`: простая legacy-таблица сотрудников.
- `Runs`: история запусков, ошибки и progress bar. - `Runs`: история запусков, ошибки и progress bar.
## Docker Compose ## Docker Compose
@@ -75,9 +74,10 @@ curl -X POST http://localhost:8000/api/crawl-runs --cookie "miem_admin_session=.
- новые сотрудники добавляются в `employees`; - новые сотрудники добавляются в `employees`;
- количество новых сотрудников за запуск сохраняется в `crawl_runs.new_count`; - количество новых сотрудников за запуск сохраняется в `crawl_runs.new_count`;
- активные сотрудники, исчезнувшие из текущего списка источника, получают статус `dismissed` и `dismissed_at`; - активные сотрудники, исчезнувшие из текущего списка источника, получают статус `dismissed` и `dismissed_at`;
- каждый успешный разбор сохраняет запись в `employee_snapshots`. - каждый успешный новый или измененный разбор сохраняет запись в `employee_snapshots`;
- неизмененные профили учитываются в `crawl_runs.skipped_count` и не получают новый snapshot.
Во время выполнения парсинга `found_count`, `parsed_count` и `error_count` обновляются в базе. Админка опрашивает `/api/crawl-runs/latest` и показывает прогресс как `parsed_count + error_count / found_count`. Во время выполнения парсинга `found_count`, `parsed_count`, `skipped_count` и `error_count` обновляются в базе. Админка опрашивает `/api/crawl-runs/latest` и показывает прогресс как `(parsed_count + skipped_count + error_count) / found_count`.
## MCP ## MCP
@@ -85,13 +85,18 @@ Endpoint: `POST /mcp`, без авторизации на уровне прил
Поддерживаемые tools: Поддерживаемые tools:
- `get_service_info()`
- `sync_employees(client_hash?, include_data?)`
- `search_employees(query, status?, limit?)` - `search_employees(query, status?, limit?)`
- `get_employee(profile_id_or_url)` - `get_employee(profile_id_or_url)`
- `list_employee_publications(profile_id_or_url)` - `list_employee_publications(profile_id_or_url)`
- `list_employee_courses(profile_id_or_url)` - `list_employee_courses(profile_id_or_url)`
- `get_crawl_status()` - `get_crawl_status()`
- `get_crawl_run_details(run_id)`
Пример локального legacy-режима со статическим токеном: `get_service_info` возвращает метаданные сервиса, список tools и текущую версию набора сотрудников. `sync_employees` отдает полный snapshot или delta по `client_hash`; checksum набора строится по сотрудникам, их статусам и текущим checksums. Ответы tools возвращаются как JSON-строка внутри MCP `content[0].text`.
Пример локального запроса списка tools:
```bash ```bash
curl http://localhost:8001/mcp \ curl http://localhost:8001/mcp \
@@ -110,4 +115,4 @@ docker compose exec postgres pg_dump -U miem miem_workers > backup.sql
docker compose down docker compose down
``` ```
Версия сервиса: `0.4.5`. Админка всегда показывает версии backend и frontend в footer. Версия сервиса: `0.6.0`. Админка всегда показывает версии backend и frontend в footer.

View File

@@ -208,6 +208,7 @@ def _run_payload(run: CrawlRun) -> dict:
"finished_at": run.finished_at.isoformat() if run.finished_at else None, "finished_at": run.finished_at.isoformat() if run.finished_at else None,
"found_count": run.found_count, "found_count": run.found_count,
"parsed_count": run.parsed_count, "parsed_count": run.parsed_count,
"skipped_count": run.skipped_count,
"error_count": run.error_count, "error_count": run.error_count,
"dismissed_count": run.dismissed_count, "dismissed_count": run.dismissed_count,
} }

View File

@@ -70,6 +70,7 @@ class CrawlRun(Base):
finished_at: Mapped[datetime | None] = mapped_column(DateTime(timezone=True)) finished_at: Mapped[datetime | None] = mapped_column(DateTime(timezone=True))
found_count: Mapped[int] = mapped_column(Integer, default=0, nullable=False) found_count: Mapped[int] = mapped_column(Integer, default=0, nullable=False)
parsed_count: Mapped[int] = mapped_column(Integer, default=0, nullable=False) parsed_count: Mapped[int] = mapped_column(Integer, default=0, nullable=False)
skipped_count: Mapped[int] = mapped_column(Integer, default=0, nullable=False)
new_count: Mapped[int] = mapped_column(Integer, default=0, nullable=False) new_count: Mapped[int] = mapped_column(Integer, default=0, nullable=False)
error_count: Mapped[int] = mapped_column(Integer, default=0, nullable=False) error_count: Mapped[int] = mapped_column(Integer, default=0, nullable=False)
dismissed_count: Mapped[int] = mapped_column(Integer, default=0, nullable=False) dismissed_count: Mapped[int] = mapped_column(Integer, default=0, nullable=False)
@@ -137,6 +138,27 @@ class ParserSource(Base):
created_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), default=utcnow, nullable=False) created_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), default=utcnow, nullable=False)
class ParseResourceCache(Base):
__tablename__ = "parse_resource_cache"
__table_args__ = (
UniqueConstraint("profile_key", "resource_key", "request_fingerprint", name="uq_parse_resource_cache_resource"),
Index("ix_parse_resource_cache_profile_key", "profile_key"),
)
id: Mapped[int] = mapped_column(Integer, primary_key=True)
profile_key: Mapped[str] = mapped_column(String(255), nullable=False)
resource_key: Mapped[str] = mapped_column(String(255), nullable=False)
method: Mapped[str] = mapped_column(String(16), nullable=False)
url: Mapped[str] = mapped_column(Text, nullable=False)
request_fingerprint: Mapped[str] = mapped_column(String(64), nullable=False)
etag: Mapped[str | None] = mapped_column(Text)
last_modified: Mapped[str | None] = mapped_column(Text)
body_hash: Mapped[str] = mapped_column(String(64), nullable=False)
body_snapshot: Mapped[bytes] = mapped_column(LargeBinary, nullable=False)
parser_version: Mapped[str | None] = mapped_column(String(32))
fetched_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), default=utcnow, nullable=False)
class DatasetVersion(Base): class DatasetVersion(Base):
__tablename__ = "dataset_versions" __tablename__ = "dataset_versions"
__table_args__ = ( __table_args__ = (

View File

@@ -1,3 +1,5 @@
import hashlib
import json
import re import re
from urllib.parse import urljoin from urllib.parse import urljoin
@@ -149,22 +151,42 @@ def parse_person_profile(
headers: dict[str, str], headers: dict[str, str],
timeout: int, timeout: int,
use_playwright: bool = False, use_playwright: bool = False,
resource_cache=None,
) -> dict | None: ) -> dict | None:
normalized_url = normalize_profile_url(source_url) normalized_url = normalize_profile_url(source_url)
if not normalized_url: if not normalized_url:
return None return None
response = session.get(normalized_url, headers=headers, timeout=timeout) profile_type, profile_id = parse_profile_identity(normalized_url)
response.raise_for_status() cache_profile_key = f"{profile_type}:{profile_id}"
html = response.text resource_manifest = []
html = _fetch_text(
session,
normalized_url,
headers,
timeout,
resource_cache=resource_cache,
profile_key=cache_profile_key,
resource_key="main-html",
resource_manifest=resource_manifest,
)
if use_playwright: if use_playwright:
html = _render_with_playwright(normalized_url, html) html = _render_with_playwright(normalized_url, html)
soup = BeautifulSoup(html, "html.parser") soup = BeautifulSoup(html, "html.parser")
profile_type, profile_id = parse_profile_identity(normalized_url)
header = extract_person_header(soup, normalized_url) header = extract_person_header(soup, normalized_url)
tabs = extract_person_tabs(soup, normalized_url) tabs = extract_person_tabs(soup, normalized_url)
sections = extract_sections(soup, normalized_url) sections = extract_sections(soup, normalized_url)
sections = enrich_sections_from_hse_widgets(session, soup, normalized_url, headers, timeout, sections) sections = enrich_sections_from_hse_widgets(
session,
soup,
normalized_url,
headers,
timeout,
sections,
resource_cache=resource_cache,
profile_key=cache_profile_key,
resource_manifest=resource_manifest,
)
internal_links = [tab["href"] for tab in tabs if tab.get("href")] internal_links = [tab["href"] for tab in tabs if tab.get("href")]
return { return {
@@ -181,6 +203,7 @@ def parse_person_profile(
"employee_internal_links": internal_links, "employee_internal_links": internal_links,
"parser_version": BACKEND_VERSION, "parser_version": BACKEND_VERSION,
"_html": html, "_html": html,
"_resource_manifest": resource_manifest,
} }
@@ -191,13 +214,33 @@ def enrich_sections_from_hse_widgets(
headers: dict[str, str], headers: dict[str, str],
timeout: int, timeout: int,
sections: list[dict], sections: list[dict],
resource_cache=None,
profile_key: str | None = None,
resource_manifest: list[dict] | None = None,
) -> list[dict]: ) -> list[dict]:
enriched = list(sections) enriched = list(sections)
publications = _load_widget_publications(session, soup, headers, timeout) publications = _load_widget_publications(
session,
soup,
headers,
timeout,
resource_cache=resource_cache,
profile_key=profile_key,
resource_manifest=resource_manifest,
)
if publications: if publications:
enriched = _upsert_publications_section(enriched, publications) enriched = _upsert_publications_section(enriched, publications)
theses = _load_widget_graduation_theses(session, soup, source_url, headers, timeout) theses = _load_widget_graduation_theses(
session,
soup,
source_url,
headers,
timeout,
resource_cache=resource_cache,
profile_key=profile_key,
resource_manifest=resource_manifest,
)
if theses: if theses:
enriched = _upsert_graduation_theses_section(enriched, theses) enriched = _upsert_graduation_theses_section(enriched, theses)
return enriched return enriched
@@ -226,7 +269,16 @@ def _render_with_playwright(source_url: str, fallback_html: str) -> str:
return fallback_html return fallback_html
def _load_widget_publications(session: Session, soup: BeautifulSoup, headers: dict[str, str], timeout: int) -> list[dict]: def _load_widget_publications(
session: Session,
soup: BeautifulSoup,
headers: dict[str, str],
timeout: int,
*,
resource_cache=None,
profile_key: str | None = None,
resource_manifest: list[dict] | None = None,
) -> list[dict]:
script = soup.select_one('script[data-widget-name="AuthorSearch"][data-author]') script = soup.select_one('script[data-widget-name="AuthorSearch"][data-author]')
if not script: if not script:
return [] return []
@@ -251,14 +303,29 @@ def _load_widget_publications(session: Session, soup: BeautifulSoup, headers: di
}, },
} }
try: try:
response = session.post( if resource_cache and profile_key:
"https://publications.hse.ru/api/searchPubs", text = _fetch_text(
json=payload, session,
headers=headers, "https://publications.hse.ru/api/searchPubs",
timeout=timeout, headers,
) timeout,
response.raise_for_status() resource_cache=resource_cache,
data = response.json() profile_key=profile_key,
resource_key=f"publications-page-{page_id}",
resource_manifest=resource_manifest,
method="POST",
json_payload=payload,
)
data = json.loads(text)
else:
response = session.post(
"https://publications.hse.ru/api/searchPubs",
json=payload,
headers=headers,
timeout=timeout,
)
response.raise_for_status()
data = response.json()
except Exception: except Exception:
return publications return publications
@@ -309,6 +376,10 @@ def _load_widget_graduation_theses(
source_url: str, source_url: str,
headers: dict[str, str], headers: dict[str, str],
timeout: int, timeout: int,
*,
resource_cache=None,
profile_key: str | None = None,
resource_manifest: list[dict] | None = None,
) -> list[dict]: ) -> list[dict]:
script = soup.select_one('script[src*="/n/stat/vkr/app.js"][data-person-id]') script = soup.select_one('script[src*="/n/stat/vkr/app.js"][data-person-id]')
if not script: if not script:
@@ -320,14 +391,30 @@ def _load_widget_graduation_theses(
request_headers = {**headers, "x-portal-language": "ru"} request_headers = {**headers, "x-portal-language": "ru"}
try: try:
response = session.get( url = urljoin(source_url, api_url)
urljoin(source_url, api_url), params = {"supervisorId": person_id}
params={"supervisorId": person_id}, if resource_cache and profile_key:
headers=request_headers, text = _fetch_text(
timeout=timeout, session,
) url,
response.raise_for_status() request_headers,
data = response.json() timeout,
resource_cache=resource_cache,
profile_key=profile_key,
resource_key="graduation-theses",
resource_manifest=resource_manifest,
params=params,
)
data = json.loads(text)
else:
response = session.get(
url,
params=params,
headers=request_headers,
timeout=timeout,
)
response.raise_for_status()
data = response.json()
except Exception: except Exception:
return [] return []
@@ -629,3 +716,62 @@ def _dedupe_dicts(items: list[dict]) -> list[dict]:
seen.add(key) seen.add(key)
unique.append(item) unique.append(item)
return unique return unique
def _fetch_text(
session: Session,
url: str,
headers: dict[str, str],
timeout: int,
*,
resource_cache=None,
profile_key: str | None = None,
resource_key: str,
resource_manifest: list[dict] | None,
method: str = "GET",
json_payload: object | None = None,
params: dict | None = None,
) -> str:
if resource_cache and profile_key:
cached = resource_cache.fetch_text(
session,
profile_key=profile_key,
resource_key=resource_key,
method=method,
url=url,
headers=headers,
timeout=timeout,
json_payload=json_payload,
params=params,
)
if resource_manifest is not None:
resource_manifest.append(
{
"resource_key": resource_key,
"method": method,
"url": url,
"body_hash": cached.body_hash,
"from_cache": cached.from_cache,
"status_code": cached.status_code,
}
)
return cached.text
if method.upper() == "POST":
response = session.post(url, json=json_payload, headers=headers, timeout=timeout, params=params)
else:
response = session.get(url, headers=headers, timeout=timeout, params=params)
response.raise_for_status()
text = response.text
if resource_manifest is not None:
resource_manifest.append(
{
"resource_key": resource_key,
"method": method,
"url": url,
"body_hash": hashlib.sha256(text.encode("utf-8")).hexdigest(),
"from_cache": False,
"status_code": response.status_code,
}
)
return text

View File

@@ -153,7 +153,7 @@ def stats_payload(db: Session) -> dict[str, Any]:
def run_payload(run: CrawlRun | None) -> dict[str, Any] | None: def run_payload(run: CrawlRun | None) -> dict[str, Any] | None:
if not run: if not run:
return None return None
processed = run.parsed_count + run.error_count processed = run.parsed_count + run.skipped_count + run.error_count
percent = round((processed / run.found_count) * 100, 1) if run.found_count else 0 percent = round((processed / run.found_count) * 100, 1) if run.found_count else 0
return { return {
"id": run.id, "id": run.id,
@@ -166,6 +166,7 @@ def run_payload(run: CrawlRun | None) -> dict[str, Any] | None:
"finished_display": format_admin_datetime(run.finished_at), "finished_display": format_admin_datetime(run.finished_at),
"found_count": run.found_count, "found_count": run.found_count,
"parsed_count": run.parsed_count, "parsed_count": run.parsed_count,
"skipped_count": run.skipped_count,
"new_count": run.new_count, "new_count": run.new_count,
"error_count": run.error_count, "error_count": run.error_count,
"dismissed_count": run.dismissed_count, "dismissed_count": run.dismissed_count,

View File

@@ -1,6 +1,7 @@
import gzip import gzip
import hashlib import hashlib
import json import json
import re
import time import time
from datetime import datetime, timezone from datetime import datetime, timezone
@@ -14,6 +15,7 @@ from app.parser.collector import collect_profile_links
from app.parser.profile import parse_person_profile from app.parser.profile import parse_person_profile
from app.parser.profile_url import profile_key from app.parser.profile_url import profile_key
from app.services.dataset_versions import get_or_create_current_version from app.services.dataset_versions import get_or_create_current_version
from app.services.resource_cache import ResourceCache
HEADERS = { HEADERS = {
"User-Agent": "Mozilla/5.0 (compatible; MIEMEmployeesBot/0.1.0; +https://miem.hse.ru/)" "User-Agent": "Mozilla/5.0 (compatible; MIEMEmployeesBot/0.1.0; +https://miem.hse.ru/)"
@@ -29,8 +31,10 @@ def run_crawl(db: Session, settings: Settings) -> CrawlRun:
found_keys: set[str] = set() found_keys: set[str] = set()
parsed_count = 0 parsed_count = 0
skipped_count = 0
try: try:
with requests.Session() as session: with requests.Session() as session:
resource_cache = ResourceCache(db)
urls = collect_profile_links(session, source.source_url, HEADERS, settings.request_timeout) urls = collect_profile_links(session, source.source_url, HEADERS, settings.request_timeout)
if settings.crawl_limit: if settings.crawl_limit:
urls = urls[: settings.crawl_limit] urls = urls[: settings.crawl_limit]
@@ -48,12 +52,17 @@ def run_crawl(db: Session, settings: Settings) -> CrawlRun:
HEADERS, HEADERS,
settings.request_timeout, settings.request_timeout,
settings.parser_use_playwright, settings.parser_use_playwright,
resource_cache=resource_cache,
) )
if not parsed: if not parsed:
continue continue
_upsert_employee(db, run, parsed) _, changed = _upsert_employee(db, run, parsed)
parsed_count += 1 if changed:
parsed_count += 1
else:
skipped_count += 1
run.parsed_count = parsed_count run.parsed_count = parsed_count
run.skipped_count = skipped_count
db.commit() db.commit()
except Exception as exc: except Exception as exc:
run.error_count += 1 run.error_count += 1
@@ -69,7 +78,7 @@ def run_crawl(db: Session, settings: Settings) -> CrawlRun:
finally: finally:
time.sleep(settings.request_delay_seconds) time.sleep(settings.request_delay_seconds)
run.dismissed_count = _mark_dismissed(db, run, found_keys, session, settings.request_timeout) run.dismissed_count = _mark_dismissed(db, run, found_keys, session, settings.request_timeout)
run.status = "completed" run.status = "completed"
get_or_create_current_version(db, crawl_run_id=run.id) get_or_create_current_version(db, crawl_run_id=run.id)
except Exception as exc: except Exception as exc:
@@ -90,20 +99,25 @@ def refresh_employee(db: Session, employee: Employee, settings: Settings) -> Cra
try: try:
with requests.Session() as session: with requests.Session() as session:
resource_cache = ResourceCache(db)
parsed = parse_person_profile( parsed = parse_person_profile(
session, session,
employee.canonical_url, employee.canonical_url,
HEADERS, HEADERS,
settings.request_timeout, settings.request_timeout,
settings.parser_use_playwright, settings.parser_use_playwright,
resource_cache=resource_cache,
) )
if not parsed: if not parsed:
raise ValueError("Профиль не удалось распарсить.") raise ValueError("Профиль не удалось распарсить.")
if _parsed_profile_key(parsed) != employee.profile_key: if _parsed_profile_key(parsed) != employee.profile_key:
raise ValueError("Распарсенный профиль не совпадает с обновляемым сотрудником.") raise ValueError("Распарсенный профиль не совпадает с обновляемым сотрудником.")
_upsert_employee(db, run, parsed) _, changed = _upsert_employee(db, run, parsed)
run.parsed_count = 1 if changed:
run.parsed_count = 1
else:
run.skipped_count = 1
run.status = "completed" run.status = "completed"
get_or_create_current_version(db, crawl_run_id=run.id) get_or_create_current_version(db, crawl_run_id=run.id)
except Exception as exc: except Exception as exc:
@@ -140,8 +154,9 @@ def _parsed_profile_key(parsed: dict) -> str:
return f"{parsed.get('profile_type')}:{parsed.get('profile_id')}" return f"{parsed.get('profile_type')}:{parsed.get('profile_id')}"
def _upsert_employee(db: Session, run: CrawlRun, parsed: dict) -> Employee: def _upsert_employee(db: Session, run: CrawlRun, parsed: dict) -> tuple[Employee, bool]:
html = parsed.pop("_html", None) html = parsed.pop("_html", None)
parsed.pop("_resource_manifest", None)
checksum = _checksum(parsed) checksum = _checksum(parsed)
key = _parsed_profile_key(parsed) key = _parsed_profile_key(parsed)
employee = db.scalar(select(Employee).where(Employee.profile_key == key)) employee = db.scalar(select(Employee).where(Employee.profile_key == key))
@@ -160,12 +175,15 @@ def _upsert_employee(db: Session, run: CrawlRun, parsed: dict) -> Employee:
else: else:
is_new = False is_new = False
parser_version = parsed.get("parser_version")
changed = is_new or employee.current_checksum != checksum or employee.parser_version != parser_version
employee.full_name = parsed.get("full_name") employee.full_name = parsed.get("full_name")
employee.status = "active" employee.status = "active"
employee.last_seen_at = now employee.last_seen_at = now
employee.dismissed_at = None employee.dismissed_at = None
employee.parser_version = parsed.get("parser_version") employee.parser_version = parser_version
employee.current_data = parsed if changed:
employee.current_data = parsed
employee.current_checksum = checksum employee.current_checksum = checksum
db.flush() db.flush()
@@ -179,28 +197,29 @@ def _upsert_employee(db: Session, run: CrawlRun, parsed: dict) -> Employee:
message="Сотрудник впервые найден в источнике.", message="Сотрудник впервые найден в источнике.",
) )
db.query(ProfileTab).filter(ProfileTab.employee_id == employee.id).delete() if changed:
for tab in parsed.get("tabs") or []: db.query(ProfileTab).filter(ProfileTab.employee_id == employee.id).delete()
for tab in parsed.get("tabs") or []:
db.add(
ProfileTab(
employee_id=employee.id,
title=tab.get("title") or "",
href=tab.get("href") or "",
data_index=tab.get("data_index"),
)
)
db.add( db.add(
ProfileTab( EmployeeSnapshot(
employee_id=employee.id, employee_id=employee.id,
title=tab.get("title") or "", crawl_run_id=run.id,
href=tab.get("href") or "", parsed_data=parsed,
data_index=tab.get("data_index"), html_snapshot=gzip.compress(html.encode("utf-8")) if html else None,
checksum=checksum,
parser_version=parser_version,
) )
) )
return employee, changed
db.add(
EmployeeSnapshot(
employee_id=employee.id,
crawl_run_id=run.id,
parsed_data=parsed,
html_snapshot=gzip.compress(html.encode("utf-8")) if html else None,
checksum=checksum,
parser_version=parsed.get("parser_version"),
)
)
return employee
def _mark_dismissed(db: Session, run: CrawlRun, found_keys: set[str], session: requests.Session, timeout: int) -> int: def _mark_dismissed(db: Session, run: CrawlRun, found_keys: set[str], session: requests.Session, timeout: int) -> int:
@@ -268,5 +287,23 @@ def _record_employee_change(
def _checksum(data: dict) -> str: def _checksum(data: dict) -> str:
payload = json.dumps(data, ensure_ascii=False, sort_keys=True, separators=(",", ":")) payload = json.dumps(_stable_checksum_payload(data), ensure_ascii=False, sort_keys=True, separators=(",", ":"))
return hashlib.sha256(payload.encode("utf-8")).hexdigest() return hashlib.sha256(payload.encode("utf-8")).hexdigest()
def _stable_checksum_payload(value):
if isinstance(value, dict):
return {key: _stable_checksum_payload(item) for key, item in value.items()}
if isinstance(value, list):
return [_stable_checksum_payload(item) for item in value]
if isinstance(value, str):
return _normalize_date_dependent_experience(value)
return value
def _normalize_date_dependent_experience(value: str) -> str:
return re.sub(
r"(?i)(стаж(?:\s+работы)?(?:\s+в\s+ниу\s+вшэ|\s+в\s+вшэ)?\s*:?\s*)\d+\s*(?:год(?:а|ов)?|лет)",
r"\1<experience-years>",
value,
)

View File

@@ -0,0 +1,147 @@
from __future__ import annotations
import gzip
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any
import requests
from sqlalchemy import select
from sqlalchemy.orm import Session
from app.models import ParseResourceCache
from app.version import BACKEND_VERSION
@dataclass(frozen=True)
class CachedResource:
text: str
body_hash: str
from_cache: bool
status_code: int
class ResourceCache:
def __init__(self, db: Session):
self.db = db
def fetch_text(
self,
session: requests.Session,
*,
profile_key: str,
resource_key: str,
method: str,
url: str,
headers: dict[str, str],
timeout: int,
json_payload: Any | None = None,
params: dict[str, Any] | None = None,
) -> CachedResource:
method = method.upper()
fingerprint = _request_fingerprint(method=method, url=url, json_payload=json_payload, params=params)
cached = self.db.scalar(
select(ParseResourceCache).where(
ParseResourceCache.profile_key == profile_key,
ParseResourceCache.resource_key == resource_key,
ParseResourceCache.request_fingerprint == fingerprint,
)
)
request_headers = dict(headers)
if cached:
if cached.etag:
request_headers["If-None-Match"] = cached.etag
if cached.last_modified:
request_headers["If-Modified-Since"] = cached.last_modified
response = _send(
session,
method=method,
url=url,
headers=request_headers,
timeout=timeout,
json_payload=json_payload,
params=params,
)
if response.status_code == 304 and cached:
cached.fetched_at = datetime.now(timezone.utc)
self.db.flush()
return CachedResource(
text=gzip.decompress(cached.body_snapshot).decode("utf-8"),
body_hash=cached.body_hash,
from_cache=True,
status_code=response.status_code,
)
response.raise_for_status()
text = response.text
body_hash = _body_hash(text)
etag = response.headers.get("ETag") if hasattr(response, "headers") else None
last_modified = response.headers.get("Last-Modified") if hasattr(response, "headers") else None
if cached:
cached.method = method
cached.url = url
cached.etag = etag
cached.last_modified = last_modified
cached.body_hash = body_hash
cached.body_snapshot = gzip.compress(text.encode("utf-8"))
cached.parser_version = BACKEND_VERSION
cached.fetched_at = datetime.now(timezone.utc)
else:
self.db.add(
ParseResourceCache(
profile_key=profile_key,
resource_key=resource_key,
method=method,
url=url,
request_fingerprint=fingerprint,
etag=etag,
last_modified=last_modified,
body_hash=body_hash,
body_snapshot=gzip.compress(text.encode("utf-8")),
parser_version=BACKEND_VERSION,
fetched_at=datetime.now(timezone.utc),
)
)
self.db.flush()
return CachedResource(text=text, body_hash=body_hash, from_cache=False, status_code=response.status_code)
def _send(
session: requests.Session,
*,
method: str,
url: str,
headers: dict[str, str],
timeout: int,
json_payload: Any | None,
params: dict[str, Any] | None,
) -> requests.Response:
if method == "POST":
return session.post(url, json=json_payload, headers=headers, timeout=timeout, params=params)
return session.get(url, headers=headers, timeout=timeout, params=params)
def _request_fingerprint(
*,
method: str,
url: str,
json_payload: Any | None,
params: dict[str, Any] | None,
) -> str:
payload = {
"method": method,
"url": url,
"json": json_payload,
"params": params,
}
encoded = json.dumps(payload, ensure_ascii=False, sort_keys=True, separators=(",", ":"))
return hashlib.sha256(encoded.encode("utf-8")).hexdigest()
def _body_hash(text: str) -> str:
return hashlib.sha256(text.encode("utf-8")).hexdigest()

View File

@@ -89,12 +89,14 @@
const status = document.querySelector("[data-progress-status]"); const status = document.querySelector("[data-progress-status]");
const processed = document.querySelector("[data-progress-processed]"); const processed = document.querySelector("[data-progress-processed]");
const found = document.querySelector("[data-progress-found]"); const found = document.querySelector("[data-progress-found]");
const skipped = document.querySelector("[data-progress-skipped]");
const errors = document.querySelector("[data-progress-errors]"); const errors = document.querySelector("[data-progress-errors]");
const fill = document.querySelector("[data-progress-fill]"); const fill = document.querySelector("[data-progress-fill]");
const percent = document.querySelector("[data-progress-percent]"); const percent = document.querySelector("[data-progress-percent]");
if (status) status.textContent = run.status_display || run.status; if (status) status.textContent = run.status_display || run.status;
if (processed) processed.textContent = run.processed_count; if (processed) processed.textContent = run.processed_count;
if (found) found.textContent = run.found_count; if (found) found.textContent = run.found_count;
if (skipped) skipped.textContent = run.skipped_count;
if (errors) errors.textContent = run.error_count; if (errors) errors.textContent = run.error_count;
if (fill) fill.style.width = `${run.progress_percent}%`; if (fill) fill.style.width = `${run.progress_percent}%`;
if (percent) percent.textContent = run.progress_percent; if (percent) percent.textContent = run.progress_percent;

View File

@@ -37,6 +37,7 @@
<div class="progress-panel__meta"> <div class="progress-panel__meta">
<span data-progress-status>{{ run.status_display if run else "Ожидание" }}</span> <span data-progress-status>{{ run.status_display if run else "Ожидание" }}</span>
<span>обработано: <span data-progress-processed>{{ run.processed_count if run else 0 }}</span> / <span data-progress-found>{{ run.found_count if run else 0 }}</span></span> <span>обработано: <span data-progress-processed>{{ run.processed_count if run else 0 }}</span> / <span data-progress-found>{{ run.found_count if run else 0 }}</span></span>
<span>без изменений: <span data-progress-skipped>{{ run.skipped_count if run else 0 }}</span></span>
<span>ошибок: <span data-progress-errors>{{ run.error_count if run else 0 }}</span></span> <span>ошибок: <span data-progress-errors>{{ run.error_count if run else 0 }}</span></span>
</div> </div>
<div class="progress-bar" aria-label="Parsing progress"> <div class="progress-bar" aria-label="Parsing progress">
@@ -48,10 +49,10 @@
<section class="panel"> <section class="panel">
<h2 class="panel__title">Последние запуски</h2> <h2 class="panel__title">Последние запуски</h2>
<table class="table"> <table class="table">
<thead><tr><th class="table__head">ID</th><th class="table__head">Статус</th><th class="table__head">Обработано</th><th class="table__head">Ошибки</th><th class="table__head">Старт</th></tr></thead> <thead><tr><th class="table__head">ID</th><th class="table__head">Статус</th><th class="table__head">Обработано</th><th class="table__head">Без изменений</th><th class="table__head">Ошибки</th><th class="table__head">Старт</th></tr></thead>
<tbody> <tbody>
{% for run in runs %} {% for run in runs %}
<tr class="table__row" onclick="window.location.href='/admin/runs/{{ run.id }}'" onkeydown="if (event.key === 'Enter' || event.key === ' ') { event.preventDefault(); window.location.href='/admin/runs/{{ run.id }}'; }" role="link" tabindex="0"><td class="table__cell">{{ run.id }}</td><td class="table__cell">{{ run.status_display }}</td><td class="table__cell">{{ run.parsed_count }}</td><td class="table__cell">{{ run.error_count }}</td><td class="table__cell">{{ run.started_display }}</td></tr> <tr class="table__row" onclick="window.location.href='/admin/runs/{{ run.id }}'" onkeydown="if (event.key === 'Enter' || event.key === ' ') { event.preventDefault(); window.location.href='/admin/runs/{{ run.id }}'; }" role="link" tabindex="0"><td class="table__cell">{{ run.id }}</td><td class="table__cell">{{ run.status_display }}</td><td class="table__cell">{{ run.parsed_count }}</td><td class="table__cell">{{ run.skipped_count }}</td><td class="table__cell">{{ run.error_count }}</td><td class="table__cell">{{ run.started_display }}</td></tr>
{% endfor %} {% endfor %}
</tbody> </tbody>
</table> </table>

View File

@@ -12,6 +12,7 @@
<div class="stats-strip"> <div class="stats-strip">
<div class="stats-strip__item"><span class="stats-strip__label">Найдено</span><span class="stats-strip__value">{{ run.found_count }}</span></div> <div class="stats-strip__item"><span class="stats-strip__label">Найдено</span><span class="stats-strip__value">{{ run.found_count }}</span></div>
<div class="stats-strip__item"><span class="stats-strip__label">Обработано</span><span class="stats-strip__value">{{ run.parsed_count }}</span></div> <div class="stats-strip__item"><span class="stats-strip__label">Обработано</span><span class="stats-strip__value">{{ run.parsed_count }}</span></div>
<div class="stats-strip__item"><span class="stats-strip__label">Без изменений</span><span class="stats-strip__value">{{ run.skipped_count }}</span></div>
<div class="stats-strip__item"><span class="stats-strip__label">Новые</span><span class="stats-strip__value">{{ run.new_count }}</span></div> <div class="stats-strip__item"><span class="stats-strip__label">Новые</span><span class="stats-strip__value">{{ run.new_count }}</span></div>
<div class="stats-strip__item"><span class="stats-strip__label">Потеряшки</span><span class="stats-strip__value">{{ run.changes.missing_from_source | length }}</span></div> <div class="stats-strip__item"><span class="stats-strip__label">Потеряшки</span><span class="stats-strip__value">{{ run.changes.missing_from_source | length }}</span></div>
<div class="stats-strip__item"><span class="stats-strip__label">Уволены</span><span class="stats-strip__value">{{ run.dismissed_count }}</span></div> <div class="stats-strip__item"><span class="stats-strip__label">Уволены</span><span class="stats-strip__value">{{ run.dismissed_count }}</span></div>

View File

@@ -8,12 +8,13 @@
</div> </div>
{% set run = runs[0] if runs else none %} {% set run = runs[0] if runs else none %}
{% if run %} {% if run %}
{% set processed = run.parsed_count + run.error_count %} {% set processed = run.parsed_count + run.skipped_count + run.error_count %}
{% set percent = ((processed / run.found_count) * 100) | round(1) if run.found_count else 0 %} {% set percent = ((processed / run.found_count) * 100) | round(1) if run.found_count else 0 %}
<div class="progress-panel" data-progress-panel> <div class="progress-panel" data-progress-panel>
<div class="progress-panel__meta"> <div class="progress-panel__meta">
<span data-progress-status>{{ run.status_display }}</span> <span data-progress-status>{{ run.status_display }}</span>
<span>обработано: <span data-progress-processed>{{ processed }}</span> / <span data-progress-found>{{ run.found_count }}</span></span> <span>обработано: <span data-progress-processed>{{ processed }}</span> / <span data-progress-found>{{ run.found_count }}</span></span>
<span>без изменений: <span data-progress-skipped>{{ run.skipped_count }}</span></span>
<span>ошибок: <span data-progress-errors>{{ run.error_count }}</span></span> <span>ошибок: <span data-progress-errors>{{ run.error_count }}</span></span>
</div> </div>
<div class="progress-bar" aria-label="Parsing progress"> <div class="progress-bar" aria-label="Parsing progress">
@@ -26,6 +27,7 @@
<div class="progress-panel__meta"> <div class="progress-panel__meta">
<span data-progress-status>Ожидание</span> <span data-progress-status>Ожидание</span>
<span>обработано: <span data-progress-processed>0</span> / <span data-progress-found>0</span></span> <span>обработано: <span data-progress-processed>0</span> / <span data-progress-found>0</span></span>
<span>без изменений: <span data-progress-skipped>0</span></span>
<span>ошибок: <span data-progress-errors>0</span></span> <span>ошибок: <span data-progress-errors>0</span></span>
</div> </div>
<div class="progress-bar" aria-label="Parsing progress"> <div class="progress-bar" aria-label="Parsing progress">
@@ -35,10 +37,10 @@
</div> </div>
{% endif %} {% endif %}
<table class="table"> <table class="table">
<thead><tr><th class="table__head">ID</th><th class="table__head">Статус</th><th class="table__head">Найдено</th><th class="table__head">Обработано</th><th class="table__head">Новые</th><th class="table__head">Ошибки</th><th class="table__head">Уволены</th><th class="table__head">Старт</th></tr></thead> <thead><tr><th class="table__head">ID</th><th class="table__head">Статус</th><th class="table__head">Найдено</th><th class="table__head">Обработано</th><th class="table__head">Без изменений</th><th class="table__head">Новые</th><th class="table__head">Ошибки</th><th class="table__head">Уволены</th><th class="table__head">Старт</th></tr></thead>
<tbody> <tbody>
{% for run in runs %} {% for run in runs %}
<tr class="table__row" onclick="window.location.href='/admin/runs/{{ run.id }}'" onkeydown="if (event.key === 'Enter' || event.key === ' ') { event.preventDefault(); window.location.href='/admin/runs/{{ run.id }}'; }" role="link" tabindex="0"><td class="table__cell">{{ run.id }}</td><td class="table__cell">{{ run.status_display }}</td><td class="table__cell">{{ run.found_count }}</td><td class="table__cell">{{ run.parsed_count }}</td><td class="table__cell">{{ run.new_count }}</td><td class="table__cell">{{ run.error_count }}</td><td class="table__cell">{{ run.dismissed_count }}</td><td class="table__cell">{{ run.started_display }}</td></tr> <tr class="table__row" onclick="window.location.href='/admin/runs/{{ run.id }}'" onkeydown="if (event.key === 'Enter' || event.key === ' ') { event.preventDefault(); window.location.href='/admin/runs/{{ run.id }}'; }" role="link" tabindex="0"><td class="table__cell">{{ run.id }}</td><td class="table__cell">{{ run.status_display }}</td><td class="table__cell">{{ run.found_count }}</td><td class="table__cell">{{ run.parsed_count }}</td><td class="table__cell">{{ run.skipped_count }}</td><td class="table__cell">{{ run.new_count }}</td><td class="table__cell">{{ run.error_count }}</td><td class="table__cell">{{ run.dismissed_count }}</td><td class="table__cell">{{ run.started_display }}</td></tr>
{% endfor %} {% endfor %}
</tbody> </tbody>
</table> </table>

View File

@@ -1,3 +1,3 @@
APP_VERSION = "0.5.0" APP_VERSION = "0.6.0"
FRONTEND_VERSION = "0.5.0" FRONTEND_VERSION = "0.6.0"
BACKEND_VERSION = "0.5.0" BACKEND_VERSION = "0.6.0"

View File

@@ -17,7 +17,14 @@ def crawl_once() -> None:
settings = get_settings() settings = get_settings()
with SessionLocal() as db: with SessionLocal() as db:
run = run_crawl(db, settings) run = run_crawl(db, settings)
logger.info("crawl finished: id=%s status=%s parsed=%s errors=%s", run.id, run.status, run.parsed_count, run.error_count) logger.info(
"crawl finished: id=%s status=%s parsed=%s skipped=%s errors=%s",
run.id,
run.status,
run.parsed_count,
run.skipped_count,
run.error_count,
)
def main() -> None: def main() -> None:

View File

@@ -13,6 +13,7 @@ CREATE TABLE IF NOT EXISTS crawl_runs (
finished_at TIMESTAMPTZ, finished_at TIMESTAMPTZ,
found_count INTEGER NOT NULL DEFAULT 0, found_count INTEGER NOT NULL DEFAULT 0,
parsed_count INTEGER NOT NULL DEFAULT 0, parsed_count INTEGER NOT NULL DEFAULT 0,
skipped_count INTEGER NOT NULL DEFAULT 0,
new_count INTEGER NOT NULL DEFAULT 0, new_count INTEGER NOT NULL DEFAULT 0,
error_count INTEGER NOT NULL DEFAULT 0, error_count INTEGER NOT NULL DEFAULT 0,
dismissed_count INTEGER NOT NULL DEFAULT 0, dismissed_count INTEGER NOT NULL DEFAULT 0,
@@ -73,3 +74,22 @@ CREATE TABLE IF NOT EXISTS profile_tabs (
); );
CREATE INDEX IF NOT EXISTS ix_profile_tabs_employee_id ON profile_tabs (employee_id); CREATE INDEX IF NOT EXISTS ix_profile_tabs_employee_id ON profile_tabs (employee_id);
CREATE TABLE IF NOT EXISTS parse_resource_cache (
id SERIAL PRIMARY KEY,
profile_key VARCHAR(255) NOT NULL,
resource_key VARCHAR(255) NOT NULL,
method VARCHAR(16) NOT NULL,
url TEXT NOT NULL,
request_fingerprint VARCHAR(64) NOT NULL,
etag TEXT,
last_modified TEXT,
body_hash VARCHAR(64) NOT NULL,
body_snapshot BYTEA NOT NULL,
parser_version VARCHAR(32),
fetched_at TIMESTAMPTZ NOT NULL DEFAULT now(),
CONSTRAINT uq_parse_resource_cache_resource UNIQUE (profile_key, resource_key, request_fingerprint)
);
CREATE INDEX IF NOT EXISTS ix_parse_resource_cache_profile_key
ON parse_resource_cache (profile_key);

View File

@@ -0,0 +1,21 @@
ALTER TABLE crawl_runs
ADD COLUMN IF NOT EXISTS skipped_count INTEGER NOT NULL DEFAULT 0;
CREATE TABLE IF NOT EXISTS parse_resource_cache (
id SERIAL PRIMARY KEY,
profile_key VARCHAR(255) NOT NULL,
resource_key VARCHAR(255) NOT NULL,
method VARCHAR(16) NOT NULL,
url TEXT NOT NULL,
request_fingerprint VARCHAR(64) NOT NULL,
etag TEXT,
last_modified TEXT,
body_hash VARCHAR(64) NOT NULL,
body_snapshot BYTEA NOT NULL,
parser_version VARCHAR(32),
fetched_at TIMESTAMPTZ NOT NULL DEFAULT now(),
CONSTRAINT uq_parse_resource_cache_resource UNIQUE (profile_key, resource_key, request_fingerprint)
);
CREATE INDEX IF NOT EXISTS ix_parse_resource_cache_profile_key
ON parse_resource_cache (profile_key);

View File

@@ -1,6 +1,6 @@
[project] [project]
name = "miem-workers" name = "miem-workers"
version = "0.5.0" version = "0.6.0"
description = "MIEM employees parser, admin API, and MCP server" description = "MIEM employees parser, admin API, and MCP server"
requires-python = ">=3.11" requires-python = ">=3.11"
dependencies = [ dependencies = [

View File

@@ -200,13 +200,14 @@ def test_run_payload_calculates_progress():
status="running", status="running",
found_count=10, found_count=10,
parsed_count=4, parsed_count=4,
skipped_count=2,
error_count=1, error_count=1,
) )
payload = run_payload(run) payload = run_payload(run)
assert payload["processed_count"] == 5 assert payload["processed_count"] == 7
assert payload["progress_percent"] == 50.0 assert payload["progress_percent"] == 70.0
assert payload["status_display"] == "Выполняется" assert payload["status_display"] == "Выполняется"

View File

@@ -20,7 +20,7 @@ def test_health_returns_versions():
response = client.get("/api/health") response = client.get("/api/health")
assert response.status_code == 200 assert response.status_code == 200
assert response.json()["backend_version"] == "0.5.0" assert response.json()["backend_version"] == "0.6.0"
def test_mcp_lists_tools_without_auth_and_ignores_auth_header(): def test_mcp_lists_tools_without_auth_and_ignores_auth_header():
@@ -154,7 +154,7 @@ def test_mcp_service_info_returns_tools_and_dataset_hash():
assert response.status_code == 200 assert response.status_code == 200
payload = json.loads(response.json()["result"]["content"][0]["text"]) payload = json.loads(response.json()["result"]["content"][0]["text"])
assert payload["service_name"] == "miem-employees" assert payload["service_name"] == "miem-employees"
assert payload["backend_version"] == "0.5.0" assert payload["backend_version"] == "0.6.0"
assert payload["dataset"]["hash"] assert payload["dataset"]["hash"]
assert any(tool["name"] == "sync_employees" for tool in payload["tools"]) assert any(tool["name"] == "sync_employees" for tool in payload["tools"])

View File

@@ -1,7 +1,9 @@
import gzip
from datetime import datetime, timezone from datetime import datetime, timezone
from app.models import CrawlRun, CrawlRunEmployeeChange, Employee from app.models import CrawlRun, CrawlRunEmployeeChange, Employee, EmployeeSnapshot, ParseResourceCache
from app.services.crawler import _mark_dismissed, _upsert_employee from app.services.crawler import _checksum, _mark_dismissed, _upsert_employee
from app.services.resource_cache import ResourceCache
class FakeResponse: class FakeResponse:
@@ -17,6 +19,34 @@ class FakeSession:
return FakeResponse(self.statuses[url]) return FakeResponse(self.statuses[url])
class ConditionalResponse:
def __init__(self, status_code, text="", headers=None):
self.status_code = status_code
self._text = text
self.headers = headers or {}
self.text_read = False
@property
def text(self):
self.text_read = True
return self._text
def raise_for_status(self):
return None
class ConditionalSession:
def __init__(self):
self.requests = []
self.not_modified_response = ConditionalResponse(304)
def get(self, url, **kwargs):
self.requests.append((url, kwargs))
if kwargs["headers"].get("If-None-Match") == '"cached"':
return self.not_modified_response
return ConditionalResponse(200, "fresh", {"ETag": '"fresh"'})
def test_mark_dismissed_records_missing_source_when_profile_is_available(db_session): def test_mark_dismissed_records_missing_source_when_profile_is_available(db_session):
run = CrawlRun(source_url="https://miem.hse.ru/persons", status="running") run = CrawlRun(source_url="https://miem.hse.ru/persons", status="running")
db_session.add(run) db_session.add(run)
@@ -111,3 +141,86 @@ def test_upsert_employee_increments_new_count_and_records_change_for_new_employe
change = db_session.query(CrawlRunEmployeeChange).one() change = db_session.query(CrawlRunEmployeeChange).one()
assert change.change_type == "new" assert change.change_type == "new"
assert change.full_name == "New Person" assert change.full_name == "New Person"
def test_resource_cache_uses_etag_and_reuses_cached_body_on_304(db_session):
db_session.add(
ParseResourceCache(
profile_key="staff:cached",
resource_key="main-html",
method="GET",
url="https://www.hse.ru/staff/cached",
request_fingerprint="020d59db7b358d9023d0f185bcbf5a9c085d3cf2bf91d92d48eee9147e8d0f01",
etag='"cached"',
body_hash="cached-hash",
body_snapshot=gzip.compress("cached body".encode("utf-8")),
parser_version="0.6.0",
)
)
db_session.commit()
session = ConditionalSession()
result = ResourceCache(db_session).fetch_text(
session,
profile_key="staff:cached",
resource_key="main-html",
method="GET",
url="https://www.hse.ru/staff/cached",
headers={"User-Agent": "test"},
timeout=10,
)
assert session.requests[0][1]["headers"]["If-None-Match"] == '"cached"'
assert result.text == "cached body"
assert result.from_cache is True
assert session.not_modified_response.text_read is False
def test_upsert_employee_skips_snapshot_when_checksum_is_unchanged(db_session):
first_run = CrawlRun(source_url="https://miem.hse.ru/persons", status="running")
second_run = CrawlRun(source_url="https://miem.hse.ru/persons", status="running")
db_session.add_all([first_run, second_run])
db_session.commit()
_, first_changed = _upsert_employee(db_session, first_run, _parsed_employee("same"))
_, second_changed = _upsert_employee(db_session, second_run, _parsed_employee("same"))
db_session.commit()
assert first_changed is True
assert second_changed is False
assert db_session.query(EmployeeSnapshot).count() == 1
def test_checksum_changes_when_widget_data_changes():
base = _parsed_employee("widgets")
changed = _parsed_employee("widgets")
changed["sections"] = [
{
"type": "publications",
"publications": [{"id": "1", "title": "New publication"}],
}
]
assert _checksum(base) != _checksum(changed)
def test_checksum_ignores_date_dependent_experience_text():
first = _parsed_employee("experience")
second = _parsed_employee("experience")
first["sections"] = [{"raw_text": "Стаж работы в НИУ ВШЭ: 5 лет"}]
second["sections"] = [{"raw_text": "Стаж работы в НИУ ВШЭ: 6 лет"}]
assert _checksum(first) == _checksum(second)
def _parsed_employee(profile_id: str) -> dict:
return {
"source_url": f"https://www.hse.ru/staff/{profile_id}",
"profile_type": "staff",
"profile_id": profile_id,
"full_name": "Same Person",
"tabs": [],
"sections": [],
"parser_version": "0.6.0",
"_html": "<html></html>",
}