Compare commits

...

25 Commits

Author SHA1 Message Date
Anton
8e19dc9f35 fix: separate news from publications and add employee refresh 2026-05-13 16:11:13 +03:00
5b9d71426d Merge pull request 'fix: support grouped HSE publication API responses' (#21) from fix/grouped-publications-parser into main
Reviewed-on: #21
2026-05-13 09:46:48 +00:00
Anton
efa7192e45 fix: support grouped HSE publication API responses 2026-05-13 12:46:07 +03:00
b27d613143 Merge pull request 'fix: remove mcp-auth from yml-file' (#20) from fix/remove-mcp-auth-compose into main
Reviewed-on: #20
2026-05-08 09:33:17 +00:00
Anton
a1ab1c0319 fix: remove mcp-auth from yml-file 2026-05-08 12:32:40 +03:00
0b4e04544d Merge pull request 'fix: remove MCP application-level authorization' (#19) from fix/remove-mcp-auth into main
Reviewed-on: #19
2026-05-08 09:15:18 +00:00
Anton
7593a460c7 fix: remove MCP application-level authorization 2026-05-08 12:14:19 +03:00
a4e7388bcf Merge pull request 'fix: use direct onclick handlers for run rows' (#18) from fix/direct-run-row-click-handler into main
Reviewed-on: #18
2026-05-07 15:25:26 +00:00
Anton
ac319b3ee5 fix: use direct onclick handlers for run rows 2026-05-07 18:23:14 +03:00
8e004c46ef Merge pull request 'fix: move run navigation from id link to table row' (#17) from fix/run-row-link-target into main
Reviewed-on: #17
2026-05-07 14:04:07 +00:00
Anton
7fa28e8e47 fix: move run navigation from id link to table row 2026-05-07 17:03:36 +03:00
1c4ad0bd9d Merge pull request 'fix: make run rows clickable and limit dashboard runs' (#16) from fix/dashboard-run-row-clicks into main
Reviewed-on: #16
2026-05-07 13:24:25 +00:00
Anton
52c5cc1af1 fix: make run rows clickable and limit dashboard runs 2026-05-07 16:23:39 +03:00
c97ced52b4 Merge pull request 'feat: make dashboard metrics and run rows clickable' (#15) from feature/dashboard-clickable-metrics into main
Reviewed-on: #15
2026-05-07 06:36:27 +00:00
Anton
deaecd8d3b feat: make dashboard metrics and run rows clickable 2026-05-07 09:35:44 +03:00
e4d4271e32 Merge pull request 'feat: track crawl run employee changes and verify dismissals' (#14) from feature/crawl-run-change-details into main
Reviewed-on: #14
2026-05-06 12:14:51 +00:00
Anton
d0459a2c30 feat: track crawl run employee changes and verify dismissals 2026-05-06 15:13:15 +03:00
Anton
2331c7a28d chore: removes sensitive data from docker file 2026-04-29 16:16:06 +03:00
064c34ea32 Merge pull request 'feat: adds oauth server to docker' (#13) from feature/add-oauth-server into main
Reviewed-on: #13
2026-04-29 12:59:55 +00:00
Anton
6a98ae4246 feat: adds oauth server to docker 2026-04-29 15:59:18 +03:00
a6f2883091 Merge pull request 'feat: requires OAuth-only auth mode for MCP agents' (#12) from feature/mcp-oauth-oidc into main
Reviewed-on: #12
2026-04-29 12:22:25 +00:00
Anton
d20b4f396b feat: requires OAuth-only auth mode for MCP agents 2026-04-29 15:08:18 +03:00
c7027bb503 Merge pull request 'feat: adds OAuth/OIDC authentication for MCP' (#11) from feature/mcp-oauth-oidc into main
Reviewed-on: #11
2026-04-29 11:35:00 +00:00
Anton
ad0b15cc6e feat: adds OAuth/OIDC authentication for MCP 2026-04-29 14:33:29 +03:00
af864ecb44 Merge pull request 'fix: enrich HSE profile parsing with publications and theses' (#10) from fix/hse-profile-parser-publications-vkr-pagination into main
Reviewed-on: #10
2026-04-29 11:16:17 +00:00
28 changed files with 941 additions and 77 deletions

View File

@@ -14,7 +14,5 @@ PARSER_USE_PLAYWRIGHT=false
ADMIN_USERNAME=admin ADMIN_USERNAME=admin
ADMIN_PASSWORD=change-me ADMIN_PASSWORD=change-me
SESSION_SECRET=change-me-session-secret SESSION_SECRET=change-me-session-secret
MCP_TOKEN=change-me-mcp-token
API_PORT=8000 API_PORT=8000
MCP_PORT=8001 MCP_PORT=8001

View File

@@ -6,7 +6,7 @@
- `api`: FastAPI, REST API, HTML-админка, healthcheck. - `api`: FastAPI, REST API, HTML-админка, healthcheck.
- `worker`: weekly scheduler, который запускает парсинг по `CRAWL_CRON`. - `worker`: weekly scheduler, который запускает парсинг по `CRAWL_CRON`.
- `mcp`: HTTP MCP endpoint с bearer token. - `mcp`: открытый HTTP MCP endpoint для ИИ-агентов.
- `postgres`: основная БД. - `postgres`: основная БД.
Парсер использует фиксированный источник сотрудников, по умолчанию `https://miem.hse.ru/persons`. Для каждой карточки сохраняются ФИО, должности, год начала работы, контакты, идентификаторы, вкладки профиля, секции, публикации, курсы, ВКР, JSON-снапшот и сжатый HTML-снапшот. Ссылки обходятся только из меню профиля самого сотрудника (`person-menu`), например `#sci`, `#teaching`, `#main`. Парсер использует фиксированный источник сотрудников, по умолчанию `https://miem.hse.ru/persons`. Для каждой карточки сохраняются ФИО, должности, год начала работы, контакты, идентификаторы, вкладки профиля, секции, публикации, курсы, ВКР, JSON-снапшот и сжатый HTML-снапшот. Ссылки обходятся только из меню профиля самого сотрудника (`person-menu`), например `#sci`, `#teaching`, `#main`.
@@ -27,7 +27,6 @@ cp .env.example .env
- `CRAWL_LIMIT`: опциональный лимит профилей для тестового запуска. - `CRAWL_LIMIT`: опциональный лимит профилей для тестового запуска.
- `ADMIN_USERNAME`, `ADMIN_PASSWORD`: логин и пароль админки. - `ADMIN_USERNAME`, `ADMIN_PASSWORD`: логин и пароль админки.
- `SESSION_SECRET`: секрет подписи cookie. - `SESSION_SECRET`: секрет подписи cookie.
- `MCP_TOKEN`: bearer token для `/mcp`.
- `PARSER_USE_PLAYWRIGHT`: включение Playwright-рендера динамических вкладок. - `PARSER_USE_PLAYWRIGHT`: включение Playwright-рендера динамических вкладок.
## Локальный запуск ## Локальный запуск
@@ -82,7 +81,7 @@ curl -X POST http://localhost:8000/api/crawl-runs --cookie "miem_admin_session=.
## MCP ## MCP
Endpoint: `POST /mcp`, авторизация `Authorization: Bearer <MCP_TOKEN>`. Endpoint: `POST /mcp`, без авторизации на уровне приложения.
Поддерживаемые tools: Поддерживаемые tools:
@@ -92,15 +91,16 @@ Endpoint: `POST /mcp`, авторизация `Authorization: Bearer <MCP_TOKEN>
- `list_employee_courses(profile_id_or_url)` - `list_employee_courses(profile_id_or_url)`
- `get_crawl_status()` - `get_crawl_status()`
Пример: Пример локального legacy-режима со статическим токеном:
```bash ```bash
curl http://localhost:8001/mcp \ curl http://localhost:8001/mcp \
-H "Authorization: Bearer change-me-mcp-token" \
-H "Content-Type: application/json" \ -H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}' -d '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}'
``` ```
Если MCP нужно ограничить, делайте это на сетевом уровне: localhost binding, VPN, firewall, reverse proxy или другой внешний контур доступа.
## Обслуживание ## Обслуживание
```bash ```bash
@@ -110,4 +110,4 @@ docker compose exec postgres pg_dump -U miem miem_workers > backup.sql
docker compose down docker compose down
``` ```
Версия сервиса: `0.2.8`. Админка всегда показывает версии backend и frontend в footer. Версия сервиса: `0.4.5`. Админка всегда показывает версии backend и frontend в footer.

View File

@@ -8,8 +8,16 @@ from app.config import Settings, get_settings
from app.db import SessionLocal, get_db from app.db import SessionLocal, get_db
from app.models import CrawlError, CrawlRun, Employee from app.models import CrawlError, CrawlRun, Employee
from app.security import SESSION_COOKIE, require_admin, sign_session, verify_admin from app.security import SESSION_COOKIE, require_admin, sign_session, verify_admin
from app.services.admin_data import employee_detail_payload, format_admin_datetime, list_employees_page, run_payload, stats_payload from app.services.admin_data import (
employee_detail_payload,
format_admin_datetime,
list_employees_page,
run_detail_payload,
run_payload,
stats_payload,
)
from app.services.crawl_control import get_running_run, run_crawl_if_idle from app.services.crawl_control import get_running_run, run_crawl_if_idle
from app.services.crawler import refresh_employee
from app.version import BACKEND_VERSION, FRONTEND_VERSION from app.version import BACKEND_VERSION, FRONTEND_VERSION
router = APIRouter(prefix="/admin") router = APIRouter(prefix="/admin")
@@ -22,7 +30,7 @@ def dashboard(request: Request, db: Session = Depends(get_db), settings: Setting
counts = stats_payload(db) counts = stats_payload(db)
counts["runs"] = db.scalar(select(func.count()).select_from(CrawlRun)) or 0 counts["runs"] = db.scalar(select(func.count()).select_from(CrawlRun)) or 0
counts["errors"] = db.scalar(select(func.count()).select_from(CrawlError)) or 0 counts["errors"] = db.scalar(select(func.count()).select_from(CrawlError)) or 0
run_models = db.scalars(select(CrawlRun).order_by(desc(CrawlRun.started_at)).limit(10)).all() run_models = db.scalars(select(CrawlRun).order_by(desc(CrawlRun.started_at)).limit(5)).all()
runs = [run_payload(run) for run in run_models] runs = [run_payload(run) for run in run_models]
return _render(request, "dashboard.html", {"counts": counts, "runs": runs, "latest_run": runs[0] if runs else None}) return _render(request, "dashboard.html", {"counts": counts, "runs": runs, "latest_run": runs[0] if runs else None})
@@ -137,10 +145,31 @@ def employee_detail(
return _render( return _render(
request, request,
"employee_detail.html", "employee_detail.html",
{"employee": employee, "employee_view": employee_detail_payload(employee), "snapshots": snapshots}, {
"employee": employee,
"employee_view": employee_detail_payload(employee),
"snapshots": snapshots,
"refresh_status": request.query_params.get("refresh_status"),
},
) )
@router.post("/employees/{employee_id}/refresh")
def refresh_employee_detail(
employee_id: int,
request: Request,
db: Session = Depends(get_db),
settings: Settings = Depends(get_settings),
):
require_admin(request, settings)
employee = db.get(Employee, employee_id)
if not employee:
return RedirectResponse("/admin/directory", status_code=303)
run = refresh_employee(db, employee, settings)
status = "success" if run.status == "completed" else "error"
return RedirectResponse(f"/admin/employees/{employee_id}?refresh_status={status}", status_code=303)
@router.get("/runs", response_class=HTMLResponse) @router.get("/runs", response_class=HTMLResponse)
def runs(request: Request, db: Session = Depends(get_db), settings: Settings = Depends(get_settings)): def runs(request: Request, db: Session = Depends(get_db), settings: Settings = Depends(get_settings)):
require_admin(request, settings) require_admin(request, settings)
@@ -150,6 +179,20 @@ def runs(request: Request, db: Session = Depends(get_db), settings: Settings = D
return _render(request, "runs.html", {"runs": items, "errors": errors}) return _render(request, "runs.html", {"runs": items, "errors": errors})
@router.get("/runs/{run_id}", response_class=HTMLResponse)
def run_detail(
run_id: int,
request: Request,
db: Session = Depends(get_db),
settings: Settings = Depends(get_settings),
):
require_admin(request, settings)
run = db.get(CrawlRun, run_id)
if not run:
return RedirectResponse("/admin/runs", status_code=303)
return _render(request, "run_detail.html", {"run": run_detail_payload(db, run)})
@router.post("/runs") @router.post("/runs")
def trigger_run( def trigger_run(
request: Request, request: Request,

View File

@@ -8,7 +8,7 @@ from app.config import Settings, get_settings
from app.db import SessionLocal, get_db from app.db import SessionLocal, get_db
from app.models import CrawlRun, Employee from app.models import CrawlRun, Employee
from app.security import require_admin from app.security import require_admin
from app.services.admin_data import employee_display_payload, list_employees_page, run_payload, stats_payload from app.services.admin_data import employee_display_payload, list_employees_page, run_detail_payload, run_payload, stats_payload
from app.services.crawl_control import get_running_run, run_crawl_if_idle from app.services.crawl_control import get_running_run, run_crawl_if_idle
from app.version import BACKEND_VERSION, FRONTEND_VERSION from app.version import BACKEND_VERSION, FRONTEND_VERSION
@@ -88,6 +88,20 @@ def latest_crawl_run(
return {"running": run_payload(running), "latest": run_payload(latest)} return {"running": run_payload(running), "latest": run_payload(latest)}
@router.get("/crawl-runs/{run_id}")
def get_crawl_run(
run_id: int,
request: Request,
db: Session = Depends(get_db),
settings: Settings = Depends(get_settings),
) -> dict:
require_admin(request, settings)
run = db.get(CrawlRun, run_id)
if not run:
return {"error": "not_found"}
return run_detail_payload(db, run) or {"error": "not_found"}
@router.post("/crawl-runs") @router.post("/crawl-runs")
def trigger_crawl( def trigger_crawl(
request: Request, request: Request,

View File

@@ -17,7 +17,6 @@ class Settings(BaseSettings):
admin_username: str = "admin" admin_username: str = "admin"
admin_password: str = "admin" admin_password: str = "admin"
session_secret: str = Field(default="dev-session-secret", min_length=8) session_secret: str = Field(default="dev-session-secret", min_length=8)
mcp_token: str = "dev-mcp-token"
@field_validator("crawl_limit", mode="before") @field_validator("crawl_limit", mode="before")
@classmethod @classmethod
@@ -26,7 +25,6 @@ class Settings(BaseSettings):
return None return None
return value return value
@lru_cache @lru_cache
def get_settings() -> Settings: def get_settings() -> Settings:
return Settings() return Settings()

View File

@@ -4,10 +4,10 @@ from fastapi import APIRouter, Depends, Request
from sqlalchemy import desc, or_, select from sqlalchemy import desc, or_, select
from sqlalchemy.orm import Session from sqlalchemy.orm import Session
from app.config import Settings, get_settings
from app.db import get_db from app.db import get_db
from app.models import CrawlRun, Employee from app.models import CrawlRun, Employee
from app.security import require_mcp_token from app.services.admin_data import run_detail_payload
from app.version import BACKEND_VERSION
router = APIRouter(prefix="/mcp") router = APIRouter(prefix="/mcp")
@@ -46,6 +46,15 @@ TOOLS = [
"description": "Return the latest crawl run status.", "description": "Return the latest crawl run status.",
"inputSchema": {"type": "object", "properties": {}}, "inputSchema": {"type": "object", "properties": {}},
}, },
{
"name": "get_crawl_run_details",
"description": "Return detailed employee changes and errors for one crawl run.",
"inputSchema": {
"type": "object",
"properties": {"run_id": {"type": "integer"}},
"required": ["run_id"],
},
},
] ]
@@ -53,9 +62,7 @@ TOOLS = [
async def mcp_http( async def mcp_http(
request: Request, request: Request,
db: Session = Depends(get_db), db: Session = Depends(get_db),
settings: Settings = Depends(get_settings),
) -> dict: ) -> dict:
require_mcp_token(request, settings)
payload = await request.json() payload = await request.json()
method = payload.get("method") method = payload.get("method")
request_id = payload.get("id") request_id = payload.get("id")
@@ -65,7 +72,7 @@ async def mcp_http(
if method == "initialize": if method == "initialize":
result = { result = {
"protocolVersion": "2024-11-05", "protocolVersion": "2024-11-05",
"serverInfo": {"name": "miem-employees", "version": "0.1.0"}, "serverInfo": {"name": "miem-employees", "version": BACKEND_VERSION},
"capabilities": {"tools": {}}, "capabilities": {"tools": {}},
} }
elif method == "tools/list": elif method == "tools/list":
@@ -94,6 +101,9 @@ def _call_tool(db: Session, name: str, arguments: dict) -> dict:
if name == "get_crawl_status": if name == "get_crawl_status":
run = db.scalar(select(CrawlRun).order_by(desc(CrawlRun.started_at)).limit(1)) run = db.scalar(select(CrawlRun).order_by(desc(CrawlRun.started_at)).limit(1))
return _tool_response(_run_payload(run) if run else {"status": "never_run"}) return _tool_response(_run_payload(run) if run else {"status": "never_run"})
if name == "get_crawl_run_details":
run = db.get(CrawlRun, int(arguments["run_id"]))
return _tool_response(run_detail_payload(db, run) if run else {"error": "not_found"})
raise ValueError(f"Unknown tool: {name}") raise ValueError(f"Unknown tool: {name}")

View File

@@ -41,6 +41,7 @@ class Employee(Base):
snapshots: Mapped[list["EmployeeSnapshot"]] = relationship(back_populates="employee") snapshots: Mapped[list["EmployeeSnapshot"]] = relationship(back_populates="employee")
tabs: Mapped[list["ProfileTab"]] = relationship(back_populates="employee", cascade="all, delete-orphan") tabs: Mapped[list["ProfileTab"]] = relationship(back_populates="employee", cascade="all, delete-orphan")
crawl_run_changes: Mapped[list["CrawlRunEmployeeChange"]] = relationship(back_populates="employee")
class EmployeeSnapshot(Base): class EmployeeSnapshot(Base):
@@ -74,6 +75,31 @@ class CrawlRun(Base):
dismissed_count: Mapped[int] = mapped_column(Integer, default=0, nullable=False) dismissed_count: Mapped[int] = mapped_column(Integer, default=0, nullable=False)
message: Mapped[str | None] = mapped_column(Text) message: Mapped[str | None] = mapped_column(Text)
employee_changes: Mapped[list["CrawlRunEmployeeChange"]] = relationship(back_populates="crawl_run")
class CrawlRunEmployeeChange(Base):
__tablename__ = "crawl_run_employee_changes"
__table_args__ = (
Index("ix_crawl_run_employee_changes_run_id", "crawl_run_id"),
Index("ix_crawl_run_employee_changes_employee_id", "employee_id"),
Index("ix_crawl_run_employee_changes_change_type", "change_type"),
)
id: Mapped[int] = mapped_column(Integer, primary_key=True)
crawl_run_id: Mapped[int] = mapped_column(ForeignKey("crawl_runs.id"), nullable=False)
employee_id: Mapped[int | None] = mapped_column(ForeignKey("employees.id"))
profile_key: Mapped[str] = mapped_column(String(255), nullable=False)
profile_url: Mapped[str] = mapped_column(Text, nullable=False)
full_name: Mapped[str | None] = mapped_column(Text)
change_type: Mapped[str] = mapped_column(String(32), nullable=False)
profile_available: Mapped[bool | None] = mapped_column()
message: Mapped[str | None] = mapped_column(Text)
created_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), default=utcnow, nullable=False)
crawl_run: Mapped[CrawlRun] = relationship(back_populates="employee_changes")
employee: Mapped[Employee | None] = relationship(back_populates="crawl_run_changes")
class CrawlError(Base): class CrawlError(Base):
__tablename__ = "crawl_errors" __tablename__ = "crawl_errors"

View File

@@ -263,10 +263,10 @@ def _load_widget_publications(session: Session, soup: BeautifulSoup, headers: di
return publications return publications
result = data.get("result") if isinstance(data, dict) else {} result = data.get("result") if isinstance(data, dict) else {}
items = result.get("items") if isinstance(result, dict) else [] items = _extract_publication_items(result)
if not isinstance(items, list) or not items: if not items:
break break
publications.extend(_normalize_publication_item(item) for item in items if isinstance(item, dict)) publications.extend(_normalize_publication_item(item) for item in items)
total = int(result.get("total") or 0) total = int(result.get("total") or 0)
if not result.get("more") and len(publications) >= total: if not result.get("more") and len(publications) >= total:
@@ -275,6 +275,34 @@ def _load_widget_publications(session: Session, soup: BeautifulSoup, headers: di
return _dedupe_publications(publications) return _dedupe_publications(publications)
def _extract_publication_items(result: object) -> list[dict]:
if not isinstance(result, dict):
return []
return _flatten_publication_items(result.get("items"))
def _flatten_publication_items(value: object) -> list[dict]:
if isinstance(value, list):
return [item for item in value if _is_publication_item(item)]
if not isinstance(value, dict):
return []
nested_items = value.get("items")
if isinstance(nested_items, list):
return [item for item in nested_items if _is_publication_item(item)]
if isinstance(nested_items, dict):
return _flatten_publication_items(nested_items)
publications = []
for child in value.values():
publications.extend(_flatten_publication_items(child))
return publications
def _is_publication_item(value: object) -> bool:
return isinstance(value, dict) and ("id" in value or "title" in value)
def _load_widget_graduation_theses( def _load_widget_graduation_theses(
session: Session, session: Session,
soup: BeautifulSoup, soup: BeautifulSoup,
@@ -359,7 +387,7 @@ def _infer_section_type(title: str, nodes: list) -> str:
lowered = title.lower() lowered = title.lower()
if _has_table(nodes): if _has_table(nodes):
return "table" return "table"
if "публикац" in lowered: if _is_publications_title(lowered):
return "publications" return "publications"
if "учебные курсы" in lowered: if "учебные курсы" in lowered:
return "courses_by_year" return "courses_by_year"
@@ -370,6 +398,10 @@ def _infer_section_type(title: str, nodes: list) -> str:
return "generic" return "generic"
def _is_publications_title(lowered_title: str) -> bool:
return lowered_title.startswith("публикац")
def _has_table(nodes: list) -> bool: def _has_table(nodes: list) -> bool:
return any(isinstance(node, Tag) and (node.name == "table" or node.find("table")) for node in nodes) return any(isinstance(node, Tag) and (node.name == "table" or node.find("table")) for node in nodes)

View File

@@ -44,9 +44,3 @@ def require_admin(request: Request, settings: Settings) -> str:
if not username: if not username:
raise HTTPException(status_code=status.HTTP_303_SEE_OTHER, headers={"Location": "/admin/login"}) raise HTTPException(status_code=status.HTTP_303_SEE_OTHER, headers={"Location": "/admin/login"})
return username return username
def require_mcp_token(request: Request, settings: Settings) -> None:
auth = request.headers.get("authorization", "")
if not auth.startswith("Bearer ") or not hmac.compare_digest(auth.removeprefix("Bearer ").strip(), settings.mcp_token):
raise HTTPException(status_code=status.HTTP_401_UNAUTHORIZED, detail="Invalid MCP token")

View File

@@ -8,7 +8,7 @@ from zoneinfo import ZoneInfo
from sqlalchemy import Select, Text, and_, desc, func, or_, select from sqlalchemy import Select, Text, and_, desc, func, or_, select
from sqlalchemy.orm import Session from sqlalchemy.orm import Session
from app.models import CrawlRun, Employee from app.models import CrawlError, CrawlRun, CrawlRunEmployeeChange, Employee
EMPLOYEE_SORTS = { EMPLOYEE_SORTS = {
"full_name": Employee.full_name, "full_name": Employee.full_name,
@@ -175,6 +175,26 @@ def run_payload(run: CrawlRun | None) -> dict[str, Any] | None:
} }
def run_detail_payload(db: Session, run: CrawlRun | None) -> dict[str, Any] | None:
if not run:
return None
changes = db.scalars(
select(CrawlRunEmployeeChange)
.where(CrawlRunEmployeeChange.crawl_run_id == run.id)
.order_by(CrawlRunEmployeeChange.created_at, CrawlRunEmployeeChange.id)
).all()
errors = db.scalars(select(CrawlError).where(CrawlError.crawl_run_id == run.id).order_by(CrawlError.created_at)).all()
grouped_changes = {"new": [], "missing_from_source": [], "dismissed": []}
for change in changes:
grouped_changes.setdefault(change.change_type, []).append(_change_payload(change))
return {
**(run_payload(run) or {}),
"changes_detail_available": bool(changes),
"changes": grouped_changes,
"errors": [_crawl_error_payload(error) for error in errors],
}
def format_admin_datetime(value: Any) -> str: def format_admin_datetime(value: Any) -> str:
if not value: if not value:
return "Не указано" return "Не указано"
@@ -200,6 +220,52 @@ def _run_status_display(status: str | None) -> str:
return labels.get(status or "", status or "Не указано") return labels.get(status or "", status or "Не указано")
def _change_payload(change: CrawlRunEmployeeChange) -> dict[str, Any]:
return {
"id": change.id,
"employee_id": change.employee_id,
"profile_key": change.profile_key,
"profile_url": change.profile_url,
"full_name": change.full_name,
"change_type": change.change_type,
"change_type_display": _change_type_display(change.change_type),
"profile_available": change.profile_available,
"profile_available_display": _profile_available_display(change.profile_available),
"message": change.message,
"created_at": change.created_at.isoformat() if change.created_at else None,
"created_display": format_admin_datetime(change.created_at),
}
def _crawl_error_payload(error: CrawlError) -> dict[str, Any]:
return {
"id": error.id,
"crawl_run_id": error.crawl_run_id,
"profile_url": error.profile_url,
"error_type": error.error_type,
"message": error.message,
"created_at": error.created_at.isoformat() if error.created_at else None,
"created_display": format_admin_datetime(error.created_at),
}
def _change_type_display(change_type: str | None) -> str:
labels = {
"new": "Новый",
"missing_from_source": "Потеряшка",
"dismissed": "Уволен",
}
return labels.get(change_type or "", change_type or "Не указано")
def _profile_available_display(value: bool | None) -> str:
if value is True:
return "Профиль доступен"
if value is False:
return "Профиль недоступен"
return "Не проверялось"
def _count_section_items(sections: list[dict[str, Any]], section_type: str) -> int: def _count_section_items(sections: list[dict[str, Any]], section_type: str) -> int:
total = 0 total = 0
for section in sections: for section in sections:

View File

@@ -9,7 +9,7 @@ from sqlalchemy import select
from sqlalchemy.orm import Session from sqlalchemy.orm import Session
from app.config import Settings from app.config import Settings
from app.models import CrawlError, CrawlRun, Employee, EmployeeSnapshot, ParserSource, ProfileTab from app.models import CrawlError, CrawlRun, CrawlRunEmployeeChange, Employee, EmployeeSnapshot, ParserSource, ProfileTab
from app.parser.collector import collect_profile_links from app.parser.collector import collect_profile_links
from app.parser.profile import parse_person_profile from app.parser.profile import parse_person_profile
from app.parser.profile_url import profile_key from app.parser.profile_url import profile_key
@@ -68,7 +68,7 @@ def run_crawl(db: Session, settings: Settings) -> CrawlRun:
finally: finally:
time.sleep(settings.request_delay_seconds) time.sleep(settings.request_delay_seconds)
run.dismissed_count = _mark_dismissed(db, found_keys) run.dismissed_count = _mark_dismissed(db, run, found_keys, session, settings.request_timeout)
run.status = "completed" run.status = "completed"
except Exception as exc: except Exception as exc:
run.status = "failed" run.status = "failed"
@@ -80,6 +80,48 @@ def run_crawl(db: Session, settings: Settings) -> CrawlRun:
return run return run
def refresh_employee(db: Session, employee: Employee, settings: Settings) -> CrawlRun:
run = CrawlRun(source_url=employee.canonical_url, status="running", found_count=1)
db.add(run)
db.commit()
db.refresh(run)
try:
with requests.Session() as session:
parsed = parse_person_profile(
session,
employee.canonical_url,
HEADERS,
settings.request_timeout,
settings.parser_use_playwright,
)
if not parsed:
raise ValueError("Профиль не удалось распарсить.")
if _parsed_profile_key(parsed) != employee.profile_key:
raise ValueError("Распарсенный профиль не совпадает с обновляемым сотрудником.")
_upsert_employee(db, run, parsed)
run.parsed_count = 1
run.status = "completed"
except Exception as exc:
run.status = "failed"
run.error_count = 1
run.message = str(exc)
db.add(
CrawlError(
crawl_run_id=run.id,
profile_url=employee.canonical_url,
error_type=type(exc).__name__,
message=str(exc),
)
)
finally:
run.finished_at = datetime.now(timezone.utc)
db.commit()
db.refresh(run)
return run
def _ensure_source(db: Session, source_url: str) -> ParserSource: def _ensure_source(db: Session, source_url: str) -> ParserSource:
source = db.scalar(select(ParserSource).where(ParserSource.source_url == source_url)) source = db.scalar(select(ParserSource).where(ParserSource.source_url == source_url))
if source: if source:
@@ -91,10 +133,14 @@ def _ensure_source(db: Session, source_url: str) -> ParserSource:
return source return source
def _parsed_profile_key(parsed: dict) -> str:
return f"{parsed.get('profile_type')}:{parsed.get('profile_id')}"
def _upsert_employee(db: Session, run: CrawlRun, parsed: dict) -> Employee: def _upsert_employee(db: Session, run: CrawlRun, parsed: dict) -> Employee:
html = parsed.pop("_html", None) html = parsed.pop("_html", None)
checksum = _checksum(parsed) checksum = _checksum(parsed)
key = f"{parsed.get('profile_type')}:{parsed.get('profile_id')}" key = _parsed_profile_key(parsed)
employee = db.scalar(select(Employee).where(Employee.profile_key == key)) employee = db.scalar(select(Employee).where(Employee.profile_key == key))
now = datetime.now(timezone.utc) now = datetime.now(timezone.utc)
if not employee: if not employee:
@@ -107,6 +153,9 @@ def _upsert_employee(db: Session, run: CrawlRun, parsed: dict) -> Employee:
) )
db.add(employee) db.add(employee)
run.new_count += 1 run.new_count += 1
is_new = True
else:
is_new = False
employee.full_name = parsed.get("full_name") employee.full_name = parsed.get("full_name")
employee.status = "active" employee.status = "active"
@@ -117,6 +166,16 @@ def _upsert_employee(db: Session, run: CrawlRun, parsed: dict) -> Employee:
employee.current_checksum = checksum employee.current_checksum = checksum
db.flush() db.flush()
if is_new:
_record_employee_change(
db,
run,
employee,
"new",
profile_available=True,
message="Сотрудник впервые найден в источнике.",
)
db.query(ProfileTab).filter(ProfileTab.employee_id == employee.id).delete() db.query(ProfileTab).filter(ProfileTab.employee_id == employee.id).delete()
for tab in parsed.get("tabs") or []: for tab in parsed.get("tabs") or []:
db.add( db.add(
@@ -141,20 +200,70 @@ def _upsert_employee(db: Session, run: CrawlRun, parsed: dict) -> Employee:
return employee return employee
def _mark_dismissed(db: Session, found_keys: set[str]) -> int: def _mark_dismissed(db: Session, run: CrawlRun, found_keys: set[str], session: requests.Session, timeout: int) -> int:
dismissed = 0 dismissed = 0
active = db.scalars(select(Employee).where(Employee.status == "active")).all() active = db.scalars(select(Employee).where(Employee.status == "active")).all()
now = datetime.now(timezone.utc) now = datetime.now(timezone.utc)
for employee in active: for employee in active:
if employee.profile_key in found_keys: if employee.profile_key in found_keys:
continue continue
profile_available = _profile_is_available(session, employee.canonical_url, timeout)
if profile_available:
_record_employee_change(
db,
run,
employee,
"missing_from_source",
profile_available=True,
message="Профиль доступен, но ссылка отсутствует в исходном списке.",
)
continue
employee.status = "dismissed" employee.status = "dismissed"
employee.dismissed_at = now employee.dismissed_at = now
_record_employee_change(
db,
run,
employee,
"dismissed",
profile_available=False,
message="Сотрудник отсутствует в исходном списке, профиль не подтвердился как доступный.",
)
dismissed += 1 dismissed += 1
db.commit() db.commit()
return dismissed return dismissed
def _profile_is_available(session: requests.Session, url: str, timeout: int) -> bool:
try:
response = session.get(url, headers=HEADERS, timeout=timeout, allow_redirects=True)
return response.status_code < 400
except requests.RequestException:
return False
def _record_employee_change(
db: Session,
run: CrawlRun,
employee: Employee,
change_type: str,
*,
profile_available: bool | None,
message: str,
) -> None:
db.add(
CrawlRunEmployeeChange(
crawl_run_id=run.id,
employee_id=employee.id,
profile_key=employee.profile_key,
profile_url=employee.canonical_url,
full_name=employee.full_name,
change_type=change_type,
profile_available=profile_available,
message=message,
)
)
def _checksum(data: dict) -> str: def _checksum(data: dict) -> str:
payload = json.dumps(data, ensure_ascii=False, sort_keys=True, separators=(",", ":")) payload = json.dumps(data, ensure_ascii=False, sort_keys=True, separators=(",", ":"))
return hashlib.sha256(payload.encode("utf-8")).hexdigest() return hashlib.sha256(payload.encode("utf-8")).hexdigest()

View File

@@ -1,6 +1,8 @@
.admin { .admin {
margin: 0; margin: 0;
min-height: 100vh; min-height: 100vh;
display: flex;
flex-direction: column;
color: #1f2937; color: #1f2937;
background: #f6f7f9; background: #f6f7f9;
font-family: Arial, sans-serif; font-family: Arial, sans-serif;
@@ -21,6 +23,11 @@
font-size: 20px; font-size: 20px;
} }
.admin__brand-link {
color: inherit;
text-decoration: none;
}
.admin__nav { .admin__nav {
display: flex; display: flex;
align-items: center; align-items: center;
@@ -34,6 +41,7 @@
} }
.admin__main { .admin__main {
flex: 1;
width: min(1180px, calc(100% - 32px)); width: min(1180px, calc(100% - 32px));
margin: 28px auto; margin: 28px auto;
} }
@@ -52,18 +60,30 @@
} }
.metric { .metric {
display: block;
padding: 18px; padding: 18px;
background: #ffffff; background: #ffffff;
border: 1px solid #d9dee7; border: 1px solid #d9dee7;
border-radius: 8px; border-radius: 8px;
} }
.metric--link {
color: inherit;
text-decoration: none;
}
.metric--link:hover {
border-color: #0f766e;
}
.metric__label { .metric__label {
display: block;
color: #6b7280; color: #6b7280;
font-size: 13px; font-size: 13px;
} }
.metric__value { .metric__value {
display: block;
margin-top: 8px; margin-top: 8px;
font-size: 28px; font-size: 28px;
font-weight: 700; font-weight: 700;
@@ -87,6 +107,14 @@
border-collapse: collapse; border-collapse: collapse;
} }
.table__row {
cursor: pointer;
}
.table__row:hover {
background: #f0fdfa;
}
.table__cell, .table__cell,
.table__head { .table__head {
padding: 10px 8px; padding: 10px 8px;
@@ -143,6 +171,10 @@
background: transparent; background: transparent;
} }
.button--compact {
padding: 8px 12px;
}
.code { .code {
overflow-x: auto; overflow-x: auto;
padding: 14px; padding: 14px;
@@ -173,11 +205,34 @@
gap: 10px; gap: 10px;
} }
.employee-card__actions {
display: grid;
justify-items: end;
gap: 10px;
}
.employee-card__title { .employee-card__title {
margin: 0; margin: 0;
font-size: 24px; font-size: 24px;
} }
.employee-card__notice {
margin: 0;
padding: 12px 14px;
border-radius: 8px;
font-weight: 700;
}
.employee-card__notice--success {
color: #065f46;
background: #d1fae5;
}
.employee-card__notice--error {
color: #991b1b;
background: #fee2e2;
}
.employee-card__section { .employee-card__section {
padding: 20px; padding: 20px;
background: #ffffff; background: #ffffff;
@@ -331,12 +386,22 @@
} }
.stats-strip__item { .stats-strip__item {
display: block;
padding: 14px 16px; padding: 14px 16px;
background: #ffffff; background: #ffffff;
border: 1px solid #d9dee7; border: 1px solid #d9dee7;
border-radius: 8px; border-radius: 8px;
} }
.stats-strip__item--link {
color: inherit;
text-decoration: none;
}
.stats-strip__item--link:hover {
border-color: #0f766e;
}
.stats-strip__label { .stats-strip__label {
display: block; display: block;
color: #6b7280; color: #6b7280;

View File

@@ -59,10 +59,23 @@
applyColumns(columns); applyColumns(columns);
}); });
}); });
}
function setupClickableRows() {
const openRow = (row) => {
window.location.href = row.dataset.rowHref;
};
document.querySelectorAll("[data-row-href]").forEach((row) => { document.querySelectorAll("[data-row-href]").forEach((row) => {
row.addEventListener("click", (event) => { row.addEventListener("click", (event) => {
if (event.target.closest("a, button, input, select, label")) return; if (event.target.closest("a, button, input, select, label")) return;
window.location.href = row.dataset.rowHref; openRow(row);
});
row.addEventListener("keydown", (event) => {
if (!["Enter", " "].includes(event.key)) return;
if (event.target.closest("a, button, input, select, label")) return;
event.preventDefault();
openRow(row);
}); });
}); });
} }
@@ -107,5 +120,6 @@
} }
setupColumns(); setupColumns();
setupClickableRows();
setupProgress(); setupProgress();
})(); })();

View File

@@ -8,7 +8,7 @@
</head> </head>
<body class="admin"> <body class="admin">
<header class="admin__header"> <header class="admin__header">
<h1 class="admin__brand">MIEM Employees</h1> <h1 class="admin__brand"><a class="admin__brand-link" href="/admin">MIEM Employees</a></h1>
<nav class="admin__nav"> <nav class="admin__nav">
<a class="admin__link" href="/admin">Обзор</a> <a class="admin__link" href="/admin">Обзор</a>
<a class="admin__link" href="/admin/directory">Сотрудники</a> <a class="admin__link" href="/admin/directory">Сотрудники</a>

View File

@@ -2,10 +2,10 @@
{% block title %}Обзор · MIEM Employees{% endblock %} {% block title %}Обзор · MIEM Employees{% endblock %}
{% block content %} {% block content %}
<section class="admin__grid"> <section class="admin__grid">
<div class="metric"><div class="metric__label">Всего в базе</div><div class="metric__value">{{ counts.total }}</div></div> <a class="metric metric--link" href="/admin/directory"><span class="metric__label">Всего в базе</span><span class="metric__value">{{ counts.total }}</span></a>
<div class="metric"><div class="metric__label">Работают</div><div class="metric__value">{{ counts.active }}</div></div> <a class="metric metric--link" href="/admin/directory?status=active"><span class="metric__label">Работают</span><span class="metric__value">{{ counts.active }}</span></a>
<div class="metric"><div class="metric__label">Новые за запуск</div><div class="metric__value">{{ counts.new_in_last_run }}</div></div> <a class="metric metric--link" href="{% if latest_run %}/admin/runs/{{ latest_run.id }}#new-employees{% else %}/admin/runs{% endif %}"><span class="metric__label">Новые за запуск</span><span class="metric__value">{{ counts.new_in_last_run }}</span></a>
<div class="metric"><div class="metric__label">Уволены</div><div class="metric__value">{{ counts.dismissed }}</div></div> <a class="metric metric--link" href="/admin/directory?status=dismissed"><span class="metric__label">Уволены</span><span class="metric__value">{{ counts.dismissed }}</span></a>
</section> </section>
<section class="stats-strip"> <section class="stats-strip">
<div class="stats-strip__item"> <div class="stats-strip__item">
@@ -16,10 +16,10 @@
<span class="stats-strip__value">Сотрудников пока нет</span> <span class="stats-strip__value">Сотрудников пока нет</span>
{% endif %} {% endif %}
</div> </div>
<div class="stats-strip__item"> <a class="stats-strip__item stats-strip__item--link" href="/admin/runs">
<span class="stats-strip__label">Запуски</span> <span class="stats-strip__label">Запуски</span>
<span class="stats-strip__value">{{ counts.runs }}</span> <span class="stats-strip__value">{{ counts.runs }}</span>
</div> </a>
<div class="stats-strip__item"> <div class="stats-strip__item">
<span class="stats-strip__label">Ошибки</span> <span class="stats-strip__label">Ошибки</span>
<span class="stats-strip__value">{{ counts.errors }}</span> <span class="stats-strip__value">{{ counts.errors }}</span>
@@ -51,7 +51,7 @@
<thead><tr><th class="table__head">ID</th><th class="table__head">Статус</th><th class="table__head">Обработано</th><th class="table__head">Ошибки</th><th class="table__head">Старт</th></tr></thead> <thead><tr><th class="table__head">ID</th><th class="table__head">Статус</th><th class="table__head">Обработано</th><th class="table__head">Ошибки</th><th class="table__head">Старт</th></tr></thead>
<tbody> <tbody>
{% for run in runs %} {% for run in runs %}
<tr><td class="table__cell">{{ run.id }}</td><td class="table__cell">{{ run.status_display }}</td><td class="table__cell">{{ run.parsed_count }}</td><td class="table__cell">{{ run.error_count }}</td><td class="table__cell">{{ run.started_display }}</td></tr> <tr class="table__row" onclick="window.location.href='/admin/runs/{{ run.id }}'" onkeydown="if (event.key === 'Enter' || event.key === ' ') { event.preventDefault(); window.location.href='/admin/runs/{{ run.id }}'; }" role="link" tabindex="0"><td class="table__cell">{{ run.id }}</td><td class="table__cell">{{ run.status_display }}</td><td class="table__cell">{{ run.parsed_count }}</td><td class="table__cell">{{ run.error_count }}</td><td class="table__cell">{{ run.started_display }}</td></tr>
{% endfor %} {% endfor %}
</tbody> </tbody>
</table> </table>

View File

@@ -7,8 +7,18 @@
<h2 class="employee-card__title">{{ employee_view.full_name or employee.profile_key }}</h2> <h2 class="employee-card__title">{{ employee_view.full_name or employee.profile_key }}</h2>
<span class="badge {% if employee_view.status == "dismissed" %}badge--dismissed{% endif %}">{{ employee_view.status_display }}</span> <span class="badge {% if employee_view.status == "dismissed" %}badge--dismissed{% endif %}">{{ employee_view.status_display }}</span>
</div> </div>
<div class="employee-card__actions">
<form method="post" action="/admin/employees/{{ employee.id }}/refresh">
<button class="button button--compact" type="submit">Обновить данные</button>
</form>
<a class="admin__link" href="{{ employee_view.canonical_url }}">{{ employee_view.canonical_url }}</a> <a class="admin__link" href="{{ employee_view.canonical_url }}">{{ employee_view.canonical_url }}</a>
</div> </div>
</div>
{% if refresh_status == "success" %}
<p class="employee-card__notice employee-card__notice--success">Данные сотрудника обновлены.</p>
{% elif refresh_status == "error" %}
<p class="employee-card__notice employee-card__notice--error">Не удалось обновить данные сотрудника.</p>
{% endif %}
<section class="employee-card__section"> <section class="employee-card__section">
<h3 class="employee-section__title">Основная информация</h3> <h3 class="employee-section__title">Основная информация</h3>

View File

@@ -0,0 +1,64 @@
{% extends "base.html" %}
{% block title %}Запуск {{ run.id }} · MIEM Employees{% endblock %}
{% block content %}
<section class="panel">
<div class="progress-panel__header">
<div>
<h2 class="panel__title">Запуск {{ run.id }}</h2>
<p class="progress-panel__empty">{{ run.started_display }} · {{ run.status_display }}</p>
</div>
<a class="admin__link" href="/admin/runs">Все запуски</a>
</div>
<div class="stats-strip">
<div class="stats-strip__item"><span class="stats-strip__label">Найдено</span><span class="stats-strip__value">{{ run.found_count }}</span></div>
<div class="stats-strip__item"><span class="stats-strip__label">Обработано</span><span class="stats-strip__value">{{ run.parsed_count }}</span></div>
<div class="stats-strip__item"><span class="stats-strip__label">Новые</span><span class="stats-strip__value">{{ run.new_count }}</span></div>
<div class="stats-strip__item"><span class="stats-strip__label">Потеряшки</span><span class="stats-strip__value">{{ run.changes.missing_from_source | length }}</span></div>
<div class="stats-strip__item"><span class="stats-strip__label">Уволены</span><span class="stats-strip__value">{{ run.dismissed_count }}</span></div>
<div class="stats-strip__item"><span class="stats-strip__label">Ошибки</span><span class="stats-strip__value">{{ run.error_count }}</span></div>
</div>
{% if not run.changes_detail_available %}
<p class="progress-panel__empty">Детализация сотрудников для этого запуска недоступна. Она сохраняется только для новых запусков после обновления.</p>
{% endif %}
</section>
{% for group, title in [("new", "Новые сотрудники"), ("missing_from_source", "Потеряшки"), ("dismissed", "Уволенные")] %}
<section class="panel"{% if group == "new" %} id="new-employees"{% endif %}>
<h2 class="panel__title">{{ title }}</h2>
{% set items = run.changes[group] %}
{% if items %}
<table class="table">
<thead><tr><th class="table__head">ФИО</th><th class="table__head">Профиль</th><th class="table__head">Проверка</th><th class="table__head">Комментарий</th></tr></thead>
<tbody>
{% for item in items %}
<tr>
<td class="table__cell">{% if item.employee_id %}<a class="admin__link" href="/admin/employees/{{ item.employee_id }}">{{ item.full_name or item.profile_key }}</a>{% else %}{{ item.full_name or item.profile_key }}{% endif %}</td>
<td class="table__cell"><a class="admin__link" href="{{ item.profile_url }}">{{ item.profile_url }}</a></td>
<td class="table__cell">{{ item.profile_available_display }}</td>
<td class="table__cell">{{ item.message or "" }}</td>
</tr>
{% endfor %}
</tbody>
</table>
{% else %}
<p class="progress-panel__empty">Нет записей.</p>
{% endif %}
</section>
{% endfor %}
<section class="panel">
<h2 class="panel__title">Ошибки запуска</h2>
{% if run.errors %}
<table class="table">
<thead><tr><th class="table__head">Профиль</th><th class="table__head">Ошибка</th><th class="table__head">Время</th></tr></thead>
<tbody>
{% for error in run.errors %}
<tr><td class="table__cell">{{ error.profile_url or "" }}</td><td class="table__cell">{{ error.error_type }}: {{ error.message }}</td><td class="table__cell">{{ error.created_display }}</td></tr>
{% endfor %}
</tbody>
</table>
{% else %}
<p class="progress-panel__empty">Ошибок нет.</p>
{% endif %}
</section>
{% endblock %}

View File

@@ -38,7 +38,7 @@
<thead><tr><th class="table__head">ID</th><th class="table__head">Статус</th><th class="table__head">Найдено</th><th class="table__head">Обработано</th><th class="table__head">Новые</th><th class="table__head">Ошибки</th><th class="table__head">Уволены</th><th class="table__head">Старт</th></tr></thead> <thead><tr><th class="table__head">ID</th><th class="table__head">Статус</th><th class="table__head">Найдено</th><th class="table__head">Обработано</th><th class="table__head">Новые</th><th class="table__head">Ошибки</th><th class="table__head">Уволены</th><th class="table__head">Старт</th></tr></thead>
<tbody> <tbody>
{% for run in runs %} {% for run in runs %}
<tr><td class="table__cell">{{ run.id }}</td><td class="table__cell">{{ run.status_display }}</td><td class="table__cell">{{ run.found_count }}</td><td class="table__cell">{{ run.parsed_count }}</td><td class="table__cell">{{ run.new_count }}</td><td class="table__cell">{{ run.error_count }}</td><td class="table__cell">{{ run.dismissed_count }}</td><td class="table__cell">{{ run.started_display }}</td></tr> <tr class="table__row" onclick="window.location.href='/admin/runs/{{ run.id }}'" onkeydown="if (event.key === 'Enter' || event.key === ' ') { event.preventDefault(); window.location.href='/admin/runs/{{ run.id }}'; }" role="link" tabindex="0"><td class="table__cell">{{ run.id }}</td><td class="table__cell">{{ run.status_display }}</td><td class="table__cell">{{ run.found_count }}</td><td class="table__cell">{{ run.parsed_count }}</td><td class="table__cell">{{ run.new_count }}</td><td class="table__cell">{{ run.error_count }}</td><td class="table__cell">{{ run.dismissed_count }}</td><td class="table__cell">{{ run.started_display }}</td></tr>
{% endfor %} {% endfor %}
</tbody> </tbody>
</table> </table>

View File

@@ -1,3 +1,3 @@
APP_VERSION = "0.2.8" APP_VERSION = "0.4.7"
FRONTEND_VERSION = "0.2.8" FRONTEND_VERSION = "0.4.7"
BACKEND_VERSION = "0.2.8" BACKEND_VERSION = "0.4.7"

View File

@@ -20,7 +20,7 @@ services:
environment: environment:
DATABASE_URL: postgresql+psycopg://${POSTGRES_USER:-miem}:${POSTGRES_PASSWORD:-miem_password}@postgres:5432/${POSTGRES_DB:-miem_workers} DATABASE_URL: postgresql+psycopg://${POSTGRES_USER:-miem}:${POSTGRES_PASSWORD:-miem_password}@postgres:5432/${POSTGRES_DB:-miem_workers}
ports: ports:
- "127.0.0.1:8000:8000" - "127.0.0.1:${API_PORT:-8000}:8000"
depends_on: depends_on:
postgres: postgres:
condition: service_healthy condition: service_healthy
@@ -42,7 +42,7 @@ services:
environment: environment:
DATABASE_URL: postgresql+psycopg://${POSTGRES_USER:-miem}:${POSTGRES_PASSWORD:-miem_password}@postgres:5432/${POSTGRES_DB:-miem_workers} DATABASE_URL: postgresql+psycopg://${POSTGRES_USER:-miem}:${POSTGRES_PASSWORD:-miem_password}@postgres:5432/${POSTGRES_DB:-miem_workers}
ports: ports:
- "127.0.0.1:8001:8000" - "127.0.0.1:${MCP_PORT:-8001}:8000"
depends_on: depends_on:
postgres: postgres:
condition: service_healthy condition: service_healthy

View File

@@ -0,0 +1,21 @@
CREATE TABLE IF NOT EXISTS crawl_run_employee_changes (
id SERIAL PRIMARY KEY,
crawl_run_id INTEGER NOT NULL REFERENCES crawl_runs(id),
employee_id INTEGER REFERENCES employees(id),
profile_key VARCHAR(255) NOT NULL,
profile_url TEXT NOT NULL,
full_name TEXT,
change_type VARCHAR(32) NOT NULL,
profile_available BOOLEAN,
message TEXT,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX IF NOT EXISTS ix_crawl_run_employee_changes_run_id
ON crawl_run_employee_changes (crawl_run_id);
CREATE INDEX IF NOT EXISTS ix_crawl_run_employee_changes_employee_id
ON crawl_run_employee_changes (employee_id);
CREATE INDEX IF NOT EXISTS ix_crawl_run_employee_changes_change_type
ON crawl_run_employee_changes (change_type);

View File

@@ -1,6 +1,6 @@
[project] [project]
name = "miem-workers" name = "miem-workers"
version = "0.2.8" version = "0.4.7"
description = "MIEM employees parser, admin API, and MCP server" description = "MIEM employees parser, admin API, and MCP server"
requires-python = ">=3.11" requires-python = ">=3.11"
dependencies = [ dependencies = [

View File

@@ -1,11 +1,12 @@
from datetime import datetime, timezone from datetime import datetime, timezone
from app.models import CrawlRun, Employee from app.models import CrawlError, CrawlRun, CrawlRunEmployeeChange, Employee
from app.services.admin_data import ( from app.services.admin_data import (
employee_detail_payload, employee_detail_payload,
employee_display_payload, employee_display_payload,
format_admin_datetime, format_admin_datetime,
list_employees_page, list_employees_page,
run_detail_payload,
run_payload, run_payload,
stats_payload, stats_payload,
) )
@@ -207,3 +208,43 @@ def test_run_payload_calculates_progress():
assert payload["processed_count"] == 5 assert payload["processed_count"] == 5
assert payload["progress_percent"] == 50.0 assert payload["progress_percent"] == 50.0
assert payload["status_display"] == "Выполняется" assert payload["status_display"] == "Выполняется"
def test_run_detail_payload_groups_changes_and_handles_old_runs(db_session):
old_run = CrawlRun(source_url="https://miem.hse.ru/persons", status="completed")
run = CrawlRun(source_url="https://miem.hse.ru/persons", status="completed", new_count=1)
employee = Employee(
profile_key="staff:new",
canonical_url="https://www.hse.ru/staff/new",
full_name="New Person",
status="active",
first_seen_at=datetime.now(timezone.utc),
last_seen_at=datetime.now(timezone.utc),
)
db_session.add_all([old_run, run, employee])
db_session.commit()
db_session.add(
CrawlRunEmployeeChange(
crawl_run_id=run.id,
employee_id=employee.id,
profile_key=employee.profile_key,
profile_url=employee.canonical_url,
full_name=employee.full_name,
change_type="new",
profile_available=True,
message="added",
)
)
db_session.add(
CrawlError(crawl_run_id=run.id, profile_url=employee.canonical_url, error_type="ValueError", message="bad")
)
db_session.commit()
payload = run_detail_payload(db_session, run)
old_payload = run_detail_payload(db_session, old_run)
assert payload["changes_detail_available"] is True
assert payload["changes"]["new"][0]["full_name"] == "New Person"
assert payload["errors"][0]["error_type"] == "ValueError"
assert old_payload["changes_detail_available"] is False
assert old_payload["changes"]["new"] == []

View File

@@ -8,6 +8,7 @@ def test_base_navigation_is_russian_and_has_no_legacy_employees_link():
assert "Сотрудники" in template assert "Сотрудники" in template
assert "Запуски" in template assert "Запуски" in template
assert "Выйти" in template assert "Выйти" in template
assert '<a class="admin__brand-link" href="/admin">MIEM Employees</a>' in template
assert ">Employees<" not in template assert ">Employees<" not in template
assert "/admin/employees" not in template assert "/admin/employees" not in template
@@ -32,3 +33,61 @@ def test_admin_employees_route_redirects_to_directory():
source = Path("app/admin.py").read_text(encoding="utf-8") source = Path("app/admin.py").read_text(encoding="utf-8")
assert 'RedirectResponse("/admin/directory", status_code=303)' in source assert 'RedirectResponse("/admin/directory", status_code=303)' in source
def test_dashboard_limits_latest_runs_to_five():
source = Path("app/admin.py").read_text(encoding="utf-8")
assert "order_by(desc(CrawlRun.started_at)).limit(5)" in source
assert "order_by(desc(CrawlRun.started_at)).limit(10)" not in source
def test_runs_template_links_to_run_detail():
template = Path("app/templates/runs.html").read_text(encoding="utf-8")
assert 'onclick="window.location.href=\'/admin/runs/{{ run.id }}\'"' in template
assert "onkeydown=\"if (event.key === 'Enter' || event.key === ' ')" in template
assert 'role="link"' in template
assert 'tabindex="0"' in template
assert 'data-row-href="/admin/runs/{{ run.id }}"' not in template
assert '<a class="admin__link" href="/admin/runs/{{ run.id }}">' not in template
def test_run_detail_template_extends_base_and_shows_change_groups():
template = Path("app/templates/run_detail.html").read_text(encoding="utf-8")
assert '{% extends "base.html" %}' in template
assert 'id="new-employees"' in template
assert "Новые сотрудники" in template
assert "Потеряшки" in template
assert "Уволенные" in template
assert "Детализация сотрудников для этого запуска недоступна" in template
def test_dashboard_metric_cards_link_to_admin_targets():
template = Path("app/templates/dashboard.html").read_text(encoding="utf-8")
assert 'href="/admin/directory"' in template
assert 'href="/admin/directory?status=active"' in template
assert '/admin/runs/{{ latest_run.id }}#new-employees' in template
assert 'href="/admin/directory?status=dismissed"' in template
assert 'href="/admin/runs"' in template
def test_dashboard_latest_run_rows_link_to_run_detail():
template = Path("app/templates/dashboard.html").read_text(encoding="utf-8")
assert 'onclick="window.location.href=\'/admin/runs/{{ run.id }}\'"' in template
assert "onkeydown=\"if (event.key === 'Enter' || event.key === ' ')" in template
assert 'role="link"' in template
assert 'tabindex="0"' in template
assert 'data-row-href="/admin/runs/{{ run.id }}"' not in template
assert '<a class="admin__link" href="/admin/runs/{{ run.id }}">' not in template
def test_admin_js_supports_keyboard_activation_for_clickable_rows():
source = Path("app/static/admin.js").read_text(encoding="utf-8")
assert 'addEventListener("keydown"' in source
assert '"Enter"' in source
assert '" "' in source

View File

@@ -1,14 +1,15 @@
from datetime import datetime, timezone from datetime import datetime, timezone
from types import SimpleNamespace
from fastapi.testclient import TestClient from fastapi.testclient import TestClient
from sqlalchemy import create_engine from sqlalchemy import create_engine, select
from sqlalchemy.orm import sessionmaker from sqlalchemy.orm import sessionmaker
from sqlalchemy.pool import StaticPool from sqlalchemy.pool import StaticPool
from app.config import Settings, get_settings from app.config import Settings, get_settings
from app.db import Base, get_db from app.db import Base, get_db
from app.main import app from app.main import app
from app.models import CrawlRun, Employee from app.models import CrawlRun, CrawlRunEmployeeChange, Employee
from app.security import SESSION_COOKIE, sign_session from app.security import SESSION_COOKIE, sign_session
@@ -18,10 +19,10 @@ def test_health_returns_versions():
response = client.get("/api/health") response = client.get("/api/health")
assert response.status_code == 200 assert response.status_code == 200
assert response.json()["backend_version"] == "0.2.8" assert response.json()["backend_version"] == "0.4.7"
def test_mcp_requires_token_and_lists_tools(): def test_mcp_lists_tools_without_auth_and_ignores_auth_header():
engine = create_engine( engine = create_engine(
"sqlite:///:memory:", "sqlite:///:memory:",
connect_args={"check_same_thread": False}, connect_args={"check_same_thread": False},
@@ -38,19 +39,20 @@ def test_mcp_requires_token_and_lists_tools():
session.close() session.close()
app.dependency_overrides[get_db] = override_db app.dependency_overrides[get_db] = override_db
app.dependency_overrides[get_settings] = lambda: Settings(mcp_token="secret", session_secret="session-secret")
client = TestClient(app) client = TestClient(app)
unauthorized = client.post("/mcp", json={"jsonrpc": "2.0", "id": 1, "method": "tools/list", "params": {}}) without_auth = client.post("/mcp", json={"jsonrpc": "2.0", "id": 1, "method": "tools/list", "params": {}})
authorized = client.post( with_auth = client.post(
"/mcp", "/mcp",
headers={"Authorization": "Bearer secret"}, headers={"Authorization": "Bearer anything"},
json={"jsonrpc": "2.0", "id": 1, "method": "tools/list", "params": {}}, json={"jsonrpc": "2.0", "id": 1, "method": "tools/list", "params": {}},
) )
assert unauthorized.status_code == 401 assert without_auth.status_code == 200
assert authorized.status_code == 200 assert with_auth.status_code == 200
assert authorized.json()["result"]["tools"][0]["name"] == "search_employees" assert without_auth.json()["result"]["tools"][0]["name"] == "search_employees"
assert any(tool["name"] == "get_crawl_run_details" for tool in without_auth.json()["result"]["tools"])
assert with_auth.json()["result"]["tools"] == without_auth.json()["result"]["tools"]
app.dependency_overrides.clear() app.dependency_overrides.clear()
@@ -88,12 +90,10 @@ def test_mcp_search_employees_returns_matching_employee():
db.close() db.close()
app.dependency_overrides[get_db] = override_db app.dependency_overrides[get_db] = override_db
app.dependency_overrides[get_settings] = lambda: Settings(mcp_token="secret", session_secret="session-secret")
client = TestClient(app) client = TestClient(app)
response = client.post( response = client.post(
"/mcp", "/mcp",
headers={"Authorization": "Bearer secret"},
json={ json={
"jsonrpc": "2.0", "jsonrpc": "2.0",
"id": 1, "id": 1,
@@ -108,6 +108,80 @@ def test_mcp_search_employees_returns_matching_employee():
app.dependency_overrides.clear() app.dependency_overrides.clear()
def test_mcp_get_crawl_run_details_returns_changes():
engine = create_engine(
"sqlite:///:memory:",
connect_args={"check_same_thread": False},
poolclass=StaticPool,
)
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)
session = Session()
run = CrawlRun(source_url="https://miem.hse.ru/persons", status="completed", new_count=1)
employee = Employee(
profile_key="staff:new",
profile_type="staff",
profile_id="new",
canonical_url="https://www.hse.ru/staff/new",
full_name="New Person",
status="active",
first_seen_at=datetime.now(timezone.utc),
last_seen_at=datetime.now(timezone.utc),
)
session.add_all([run, employee])
session.commit()
session.add(
CrawlRunEmployeeChange(
crawl_run_id=run.id,
employee_id=employee.id,
profile_key=employee.profile_key,
profile_url=employee.canonical_url,
full_name=employee.full_name,
change_type="new",
profile_available=True,
message="added",
)
)
session.commit()
run_id = run.id
session.close()
def override_db():
db = Session()
try:
yield db
finally:
db.close()
app.dependency_overrides[get_db] = override_db
client = TestClient(app)
response = client.post(
"/mcp",
json={
"jsonrpc": "2.0",
"id": 1,
"method": "tools/call",
"params": {"name": "get_crawl_run_details", "arguments": {"run_id": run_id}},
},
)
assert response.status_code == 200
text = response.json()["result"]["content"][0]["text"]
assert "New Person" in text
assert "changes_detail_available" in text
app.dependency_overrides.clear()
def test_mcp_protected_resource_metadata_route_is_removed():
client = TestClient(app)
response = client.get("/.well-known/oauth-protected-resource")
assert response.status_code == 404
def test_api_employees_and_stats_require_admin_session(): def test_api_employees_and_stats_require_admin_session():
engine = create_engine( engine = create_engine(
"sqlite:///:memory:", "sqlite:///:memory:",
@@ -130,8 +204,23 @@ def test_api_employees_and_stats_require_admin_session():
current_data={"contacts": {"emails": ["alpha@hse.ru"]}, "sections": []}, current_data={"contacts": {"emails": ["alpha@hse.ru"]}, "sections": []},
) )
) )
db.add(CrawlRun(source_url="https://miem.hse.ru/persons", status="completed", new_count=1)) run = CrawlRun(source_url="https://miem.hse.ru/persons", status="completed", new_count=1)
db.add(run)
db.commit() db.commit()
db.add(
CrawlRunEmployeeChange(
crawl_run_id=run.id,
employee_id=1,
profile_key="staff:alpha",
profile_url="https://www.hse.ru/staff/alpha",
full_name="Alpha Person",
change_type="new",
profile_available=True,
message="added",
)
)
db.commit()
run_id = run.id
db.close() db.close()
settings = Settings(admin_username="admin", admin_password="password", session_secret="session-secret") settings = Settings(admin_username="admin", admin_password="password", session_secret="session-secret")
@@ -150,10 +239,66 @@ def test_api_employees_and_stats_require_admin_session():
employees = client.get("/api/employees", params={"q": "Alpha", "has_email": True}) employees = client.get("/api/employees", params={"q": "Alpha", "has_email": True})
stats = client.get("/api/stats") stats = client.get("/api/stats")
run_details = client.get(f"/api/crawl-runs/{run_id}")
assert employees.status_code == 200 assert employees.status_code == 200
assert employees.json()["total"] == 1 assert employees.json()["total"] == 1
assert stats.status_code == 200 assert stats.status_code == 200
assert stats.json()["new_in_last_run"] == 1 assert stats.json()["new_in_last_run"] == 1
assert run_details.status_code == 200
assert run_details.json()["changes"]["new"][0]["full_name"] == "Alpha Person"
app.dependency_overrides.clear()
def test_admin_refresh_employee_route_updates_only_requested_employee(monkeypatch):
engine = create_engine(
"sqlite:///:memory:",
connect_args={"check_same_thread": False},
poolclass=StaticPool,
)
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)
db = Session()
db.add(
Employee(
profile_key="org_person:133709486",
profile_type="org_person",
profile_id="133709486",
canonical_url="https://www.hse.ru/org/persons/133709486",
full_name="Будков Юрий Алексеевич",
status="active",
)
)
db.commit()
employee_id = db.scalar(select(Employee.id))
db.close()
settings = Settings(admin_username="admin", admin_password="password", session_secret="session-secret")
def override_db():
session = Session()
try:
yield session
finally:
session.close()
calls = []
def fake_refresh_employee(db, refreshed_employee, route_settings):
calls.append((refreshed_employee.id, route_settings))
return SimpleNamespace(status="completed")
app.dependency_overrides[get_db] = override_db
app.dependency_overrides[get_settings] = lambda: settings
monkeypatch.setattr("app.admin.refresh_employee", fake_refresh_employee)
client = TestClient(app)
client.cookies.set(SESSION_COOKIE, sign_session("admin", settings))
response = client.post(f"/admin/employees/{employee_id}/refresh", follow_redirects=False)
assert response.status_code == 303
assert response.headers["location"] == f"/admin/employees/{employee_id}?refresh_status=success"
assert calls == [(employee_id, settings)]
app.dependency_overrides.clear() app.dependency_overrides.clear()

View File

@@ -1,10 +1,25 @@
from datetime import datetime, timezone from datetime import datetime, timezone
from app.models import CrawlRun, Employee from app.models import CrawlRun, CrawlRunEmployeeChange, Employee
from app.services.crawler import _mark_dismissed, _upsert_employee from app.services.crawler import _mark_dismissed, _upsert_employee
def test_mark_dismissed_only_marks_missing_active_employees(db_session): class FakeResponse:
def __init__(self, status_code):
self.status_code = status_code
class FakeSession:
def __init__(self, statuses):
self.statuses = statuses
def get(self, url, **_kwargs):
return FakeResponse(self.statuses[url])
def test_mark_dismissed_records_missing_source_when_profile_is_available(db_session):
run = CrawlRun(source_url="https://miem.hse.ru/persons", status="running")
db_session.add(run)
db_session.add( db_session.add(
Employee( Employee(
profile_key="staff:kept", profile_key="staff:kept",
@@ -16,8 +31,8 @@ def test_mark_dismissed_only_marks_missing_active_employees(db_session):
) )
db_session.add( db_session.add(
Employee( Employee(
profile_key="staff:gone", profile_key="staff:missing",
canonical_url="https://www.hse.ru/staff/gone", canonical_url="https://www.hse.ru/staff/missing",
status="active", status="active",
first_seen_at=datetime.now(timezone.utc), first_seen_at=datetime.now(timezone.utc),
last_seen_at=datetime.now(timezone.utc), last_seen_at=datetime.now(timezone.utc),
@@ -25,16 +40,53 @@ def test_mark_dismissed_only_marks_missing_active_employees(db_session):
) )
db_session.commit() db_session.commit()
dismissed = _mark_dismissed(db_session, {"staff:kept"}) dismissed = _mark_dismissed(
db_session,
run,
{"staff:kept"},
FakeSession({"https://www.hse.ru/staff/missing": 200}),
30,
)
assert dismissed == 0
assert db_session.query(Employee).filter_by(profile_key="staff:kept").one().status == "active"
missing = db_session.query(Employee).filter_by(profile_key="staff:missing").one()
assert missing.status == "active"
assert missing.dismissed_at is None
change = db_session.query(CrawlRunEmployeeChange).one()
assert change.change_type == "missing_from_source"
assert change.profile_available is True
def test_mark_dismissed_marks_missing_employee_when_profile_is_unavailable(db_session):
run = CrawlRun(source_url="https://miem.hse.ru/persons", status="running")
employee = Employee(
profile_key="staff:gone",
canonical_url="https://www.hse.ru/staff/gone",
status="active",
first_seen_at=datetime.now(timezone.utc),
last_seen_at=datetime.now(timezone.utc),
)
db_session.add_all([run, employee])
db_session.commit()
dismissed = _mark_dismissed(
db_session,
run,
set(),
FakeSession({"https://www.hse.ru/staff/gone": 404}),
30,
)
assert dismissed == 1 assert dismissed == 1
assert db_session.query(Employee).filter_by(profile_key="staff:kept").one().status == "active" assert employee.status == "dismissed"
gone = db_session.query(Employee).filter_by(profile_key="staff:gone").one() assert employee.dismissed_at is not None
assert gone.status == "dismissed" change = db_session.query(CrawlRunEmployeeChange).one()
assert gone.dismissed_at is not None assert change.change_type == "dismissed"
assert change.profile_available is False
def test_upsert_employee_increments_new_count_for_new_employee(db_session): def test_upsert_employee_increments_new_count_and_records_change_for_new_employee(db_session):
run = CrawlRun(source_url="https://miem.hse.ru/persons", status="running") run = CrawlRun(source_url="https://miem.hse.ru/persons", status="running")
db_session.add(run) db_session.add(run)
db_session.commit() db_session.commit()
@@ -56,3 +108,6 @@ def test_upsert_employee_increments_new_count_for_new_employee(db_session):
db_session.commit() db_session.commit()
assert run.new_count == 1 assert run.new_count == 1
change = db_session.query(CrawlRunEmployeeChange).one()
assert change.change_type == "new"
assert change.full_name == "New Person"

View File

@@ -27,4 +27,6 @@ def test_employee_detail_template_is_human_readable():
assert "Дата увольнения" in template assert "Дата увольнения" in template
assert "Тип профиля" in template assert "Тип профиля" in template
assert "ID профиля" in template assert "ID профиля" in template
assert "Обновить данные" in template
assert 'action="/admin/employees/{{ employee.id }}/refresh"' in template
assert "Снапшоты" in template assert "Снапшоты" in template

View File

@@ -1,6 +1,6 @@
from bs4 import BeautifulSoup from bs4 import BeautifulSoup
from app.parser.profile import enrich_sections_from_hse_widgets, extract_person_tabs from app.parser.profile import enrich_sections_from_hse_widgets, extract_person_tabs, extract_sections
from app.parser.profile_url import normalize_profile_url, parse_profile_identity from app.parser.profile_url import normalize_profile_url, parse_profile_identity
@@ -64,6 +64,47 @@ class FakeSession:
) )
class GroupedPublicationsSession(FakeSession):
def post(self, url, **kwargs):
self.posts.append((url, kwargs))
return FakeResponse(
{
"status": "ok",
"result": {
"more": False,
"total": 1,
"groupType": 2,
"items": {
"year": {
"header": {"ru": "по году", "en": "by year"},
"criteria": {"year": []},
"items": {
"2011": [
{
"id": "146366790",
"type": "ARTICLE",
"title": "Развитие теории самосогласованного поля",
"year": 2011,
"description": {"short": {"ru": "Журнал физической химии 2011."}},
}
],
"2012": [
{
"id": "146367323",
"type": "ARTICLE",
"title": "Self-consistent field theory investigation",
"year": 2012,
"description": {"short": {"en": "Russian Journal of Physical Chemistry A 2012."}},
}
],
},
}
},
},
}
)
def test_normalize_profile_url_supports_staff_and_org_persons(): def test_normalize_profile_url_supports_staff_and_org_persons():
assert normalize_profile_url("/staff/avsergeev#sci") == "https://www.hse.ru/staff/avsergeev" assert normalize_profile_url("/staff/avsergeev#sci") == "https://www.hse.ru/staff/avsergeev"
assert normalize_profile_url("https://www.hse.ru/org/persons/123/") == "https://www.hse.ru/org/persons/123" assert normalize_profile_url("https://www.hse.ru/org/persons/123/") == "https://www.hse.ru/org/persons/123"
@@ -117,3 +158,60 @@ def test_enrich_sections_from_hse_widgets_loads_publications_and_vkr():
assert theses["theses"][0]["project_url"] == "https://www.hse.ru/edu/vkr/1045750164" assert theses["theses"][0]["project_url"] == "https://www.hse.ru/edu/vkr/1045750164"
assert session.posts[0][0] == "https://publications.hse.ru/api/searchPubs" assert session.posts[0][0] == "https://publications.hse.ru/api/searchPubs"
assert session.gets[0][1]["params"] == {"supervisorId": "803294906"} assert session.gets[0][1]["params"] == {"supervisorId": "803294906"}
def test_enrich_sections_from_hse_widgets_loads_grouped_publications():
soup = BeautifulSoup(
"""
<script src="/n/stat/publications/dist-w/publs.js" data-author="133709486" data-widget-name="AuthorSearch"></script>
""",
"html.parser",
)
session = GroupedPublicationsSession()
sections = enrich_sections_from_hse_widgets(
session,
soup,
"https://www.hse.ru/org/persons/133709486",
{"User-Agent": "test"},
10,
[],
)
publications = next(section for section in sections if section["type"] == "publications")
assert publications["publications_count"] == 2
assert [item["id"] for item in publications["publications"]] == ["146366790", "146367323"]
assert publications["publications"][0]["url"] == "https://publications.hse.ru/view/146366790"
assert publications["publications"][1]["url"] == "https://publications.hse.ru/view/146367323"
def test_news_heading_with_publications_word_does_not_absorb_widget_publications():
soup = BeautifulSoup(
"""
<h2>Статья профессора МИЭМ вошла в число самых популярных публикаций на портале SpringerLink</h2>
<div class="post__text">
<p>Первоначально статья профессора вышла в российском журнале.</p>
</div>
<script src="/n/stat/publications/dist-w/publs.js" data-author="133709486" data-widget-name="AuthorSearch"></script>
""",
"html.parser",
)
session = FakeSession()
sections = extract_sections(soup, "https://www.hse.ru/org/persons/133709486")
sections = enrich_sections_from_hse_widgets(
session,
soup,
"https://www.hse.ru/org/persons/133709486",
{"User-Agent": "test"},
10,
sections,
)
assert sections[0]["type"] == "paragraphs"
assert sections[0]["title"].startswith("Статья профессора")
publications = [section for section in sections if section["type"] == "publications"]
assert len(publications) == 1
assert publications[0]["title"] == "Публикации и исследования"
assert publications[0]["publications_count"] == 1