Compare commits


8 Commits

22 changed files with 1344 additions and 34 deletions

.gitignore

@@ -8,3 +8,4 @@ pytest-cache-files-*/
.coverage
htmlcov/
postgres_data/
MCP_DESCRIPTION.md


@@ -92,7 +92,7 @@ miem-employees
"protocolVersion": "2024-11-05",
"serverInfo": {
"name": "miem-employees",
"version": "0.5.0"
"version": "0.7.0"
},
"capabilities": {
"tools": {}
@@ -171,6 +171,8 @@ MCP читает данные из основной базы через SQLAlche
Main tables and models:
- `employees`: the employee's current record, status, profile, `current_data`, checksum.
- `employee_publications`: normalized employee publications with authors, DOI, annotation, description, citation text, and raw JSON from HSE Publications.
- `employee_news_links`: normalized links to news items from the profile's "В новостях" ("In the news") block, with title, URL, short summary, date, publication year, and raw JSON of the card.
- `crawl_runs`: history of parsing runs.
- `crawl_run_employee_changes`: detailed employee changes within a run.
- `crawl_errors`: parsing errors within a run.
@@ -206,7 +208,29 @@ MCP читает данные из основной базы через SQLAlche
}
```
`data` corresponds to the parsed JSON of the employee profile. `sections` may contain sections with publications, courses, graduation theses (ВКР), tables, links, and arbitrary text blocks.
`data` corresponds to the parsed JSON of the employee profile. `sections` may contain sections with publications, courses, graduation theses (ВКР), news, tables, links, and arbitrary text blocks.
Example of a news section inside `data.sections`:
```json
{
"title": "В новостях",
"slug": "v_novostyah",
"type": "news",
"news_count": 1,
"news_links": [
{
"title": "News title",
"url": "https://www.hse.ru/news/edu/1153850518.html",
"summary": "Short summary of the news item.",
"published_at": "2026-04-28T00:00:00+00:00",
"published_year": 2026
}
]
}
```
There is currently no dedicated MCP tool for news: it is available through `get_employee(...).data.sections` or through a full sync with `sync_employees(include_data=true)`.
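
Since news is only reachable through `data.sections`, a client has to filter the sections itself. The following is a minimal sketch of that filtering, assuming the payload shape shown in the example above; `extract_news_links` and the sample `payload` are hypothetical and not part of the service.

```python
from typing import Any


def extract_news_links(employee_payload: dict[str, Any]) -> list[dict[str, Any]]:
    """Collect news_links from every section whose type is "news"."""
    data = employee_payload.get("data") or {}
    links: list[dict[str, Any]] = []
    for section in data.get("sections") or []:
        if isinstance(section, dict) and section.get("type") == "news":
            links.extend(section.get("news_links") or [])
    return links


# Hypothetical payload shaped like the documented news section.
payload = {
    "data": {
        "sections": [
            {"type": "publications", "publications": []},
            {
                "type": "news",
                "news_count": 1,
                "news_links": [
                    {
                        "title": "News title",
                        "url": "https://www.hse.ru/news/edu/1153850518.html",
                        "published_year": 2026,
                    }
                ],
            },
        ]
    }
}
news = extract_news_links(payload)
```

The helper returns an empty list when the profile has no `data` or no news section, so callers do not need a separate existence check.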
## Tools
@@ -221,7 +245,7 @@ MCP читает данные из основной базы через SQLAlche
```json
{
"service_name": "miem-employees",
"backend_version": "0.5.0",
"backend_version": "0.7.0",
"protocolVersion": "2024-11-05",
"tools": [],
"dataset": {
@@ -387,7 +411,7 @@ Hash набора считается по отсортированному сп
### list_employee_publications
Purpose: return the employee's publications from the parsed profile sections.
Purpose: return the employee's publications. If normalized rows exist in `employee_publications`, the tool returns detailed publication data: authors, DOI, annotation, description, citation text, year, type, language, status, and links. If the detailed table has not been populated yet, the tool uses the old fallback from `employees.current_data.sections[].publications`.
Arguments:
@@ -397,24 +421,70 @@ Hash набора считается по отсортированному сп
}
```
The service looks up `current_data.sections` entries with `type = "publications"` and merges their `publications` arrays.
The employee lookup works the same way as in `get_employee`: by `profile_key`, `profile_id`, or an exact or partial `canonical_url`.
Source order:
- first `employee_publications`, sorted by year, title, and internal id;
- if there are no rows, the `current_data.sections` entries with `type = "publications"` and their `publications` arrays.
Response:
```json
{
"employee": {},
"employee": {
"profile_key": "org_person:803294906",
"profile_id": "803294906",
"full_name": "Борисов Сергей Петрович",
"status": "active",
"canonical_url": "https://www.hse.ru/org/persons/803294906",
"last_seen_at": "2026-05-14T10:00:00+00:00",
"dismissed_at": null
},
"items": [
{
"id": "888959076",
"publication_id": "888959076",
"title": "Publication title",
"text": "Full description",
"url": "https://..."
"text": "Short description or citation",
"url": "https://publications.hse.ru/view/888959076",
"year": 2023,
"type": "ARTICLE",
"publication_type": "ARTICLE",
"language": "ru",
"status": 1,
"doi_url": "https://doi.org/10.53921/18195822_2023_23_4_624",
"other_url": "https://example.test",
"document_url": "https://example.test/file.pdf",
"citation_text": "Авторы. Название публикации // Журнал. 2023.",
"annotation": {
"ru": "Аннотация",
"en": "Abstract"
},
"description": {
"main": "Авторы. Название публикации // Журнал. 2023."
},
"authors": [
{
"id": "803294906",
"href": "https://www.hse.ru/org/persons/803294906",
"title_ru": "Борисов С. П.",
"title_en": "",
"reverse_title_ru": "С. П. Борисов",
"reverse_title_en": "",
"alt_name": "S. P. Borisov",
"other_name": null,
"is_current_employee": true
}
]
}
]
}
```
If the employee or the profile data is missing:
In fallback mode from `current_data`, old items may contain only the basic fields `title`, `text`, `url`, and `id`.
If the employee is not found:
```json
{
@@ -422,6 +492,15 @@ Hash набора считается по отсортированному сп
}
```
If the employee is found but has no publications:
```json
{
"employee": {},
"items": []
}
```
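
Because items may carry either the detailed fields or only the basic fallback fields, a client should treat everything beyond `title`, `text`, `url`, and `id` as optional. Below is a minimal sketch of such tolerant handling; `summarize_publications` and the three sample responses are hypothetical illustrations, not part of the service.

```python
from typing import Any


def summarize_publications(response: dict[str, Any]) -> list[str]:
    """Return one display line per publication, tolerating fallback items."""
    lines = []
    for item in response.get("items") or []:
        title = item.get("title") or item.get("id") or "(untitled)"
        parts = [title]
        if item.get("year"):  # absent in fallback mode
            parts.append(str(item["year"]))
        if item.get("doi_url"):  # absent in fallback mode
            parts.append(item["doi_url"])
        lines.append(" | ".join(parts))
    return lines


# Hypothetical responses: detailed rows, fallback rows, and no publications.
detailed = {
    "employee": {},
    "items": [
        {
            "id": "888959076",
            "title": "Publication title",
            "year": 2023,
            "doi_url": "https://doi.org/10.53921/18195822_2023_23_4_624",
        }
    ],
}
fallback = {"employee": {}, "items": [{"id": "1", "title": "Old item", "text": "Old item", "url": None}]}
empty = {"employee": {}, "items": []}
```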
### list_employee_courses
Purpose: return the employee's teaching courses from the parsed profile sections.


@@ -9,7 +9,7 @@
- `mcp`: open HTTP MCP endpoint for AI agents.
- `postgres`: main database.
The parser uses a fixed employee source, `https://miem.hse.ru/persons` by default. For each profile it stores the full name, positions, start year, contacts, identifiers, profile tabs, sections, publications, courses, graduation theses (ВКР), a JSON snapshot, and a compressed HTML snapshot. Links are followed only from the employee's own profile menu (`person-menu`), for example `#sci`, `#teaching`, `#main`.
The parser uses a fixed employee source, `https://miem.hse.ru/persons` by default. For each profile it stores the full name, positions, start year, contacts, identifiers, profile tabs, sections, publications, courses, graduation theses (ВКР), news, a JSON snapshot, and a compressed HTML snapshot. Detailed publications are additionally normalized into a separate `employee_publications` table, and news from the "В новостях" ("In the news") block into `employee_news_links`. Links are followed only from the employee's own profile menu (`person-menu`), for example `#sci`, `#teaching`, `#main`.
## Environment variables
@@ -58,7 +58,27 @@ docker compose up --build
- MCP: `http://localhost:8001/mcp`
- Postgres: `localhost:5432`
Tables are created by the application at startup. The SQL migration for manual application is in `migrations/001_init.sql`.
Tables are created by the application at startup. When upgrading an existing database, the application also adds missing runtime columns, for example `crawl_runs.skipped_count`. SQL migrations for manual application are in `migrations/`.
## Populating the database
The main employee record is stored in `employees`: profile, status, discovery and dismissal dates, the current JSON `current_data`, checksum, and parser version. The history of successful changes is stored in `employee_snapshots` together with the JSON snapshot and the compressed profile HTML.
Publications are now stored in two forms:
- a short list remains inside `employees.current_data.sections[].publications` for backward compatibility;
- detailed records are stored in `employee_publications` and linked to the employee through `employee_id`.
`employee_publications` contains `publication_id`, title, year, publication type, language, status, the link to the HSE Publications card, DOI, external and document links, citation text, annotation, description, authors, the raw JSON of the `searchPubs` response, and `source_hash` for safe repeated upserts. Uniqueness is enforced on `(employee_id, publication_id)` and `(employee_id, source_hash)`, so a repeated crawl should not create duplicates.
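
The idea of a `source_hash` that makes repeated upserts safe can be sketched as follows. This is only an illustration under stated assumptions: the diff does not show how the hash is actually computed, so SHA-256 over canonical JSON is an assumption (its 64-character hex digest happens to fit a 64-character column), and `source_hash`, `upsert`, and the in-memory `store` are hypothetical stand-ins for the real table logic.

```python
import hashlib
import json
from typing import Any


def source_hash(payload: dict[str, Any]) -> str:
    """Hash a payload deterministically: sorted keys, compact separators (assumed scheme)."""
    canonical = json.dumps(payload, sort_keys=True, ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def upsert(store: dict[tuple[int, str], dict[str, Any]], employee_id: int, payload: dict[str, Any]) -> str:
    """Insert or update an in-memory row keyed by (employee_id, publication_id)."""
    key = (employee_id, payload["publication_id"])
    digest = source_hash(payload)
    existing = store.get(key)
    if existing is None:
        store[key] = {**payload, "source_hash": digest}
        return "inserted"
    if existing["source_hash"] == digest:
        return "unchanged"  # same content: a repeated crawl writes nothing
    store[key] = {**payload, "source_hash": digest}
    return "updated"


store: dict[tuple[int, str], dict[str, Any]] = {}
first = upsert(store, 1, {"publication_id": "888959076", "title": "Publication title"})
# Same content with a different key order hashes identically.
second = upsert(store, 1, {"title": "Publication title", "publication_id": "888959076"})
third = upsert(store, 1, {"publication_id": "888959076", "title": "Renamed"})
```

Sorting keys before hashing keeps the digest stable when the upstream JSON reorders its fields, which is what lets an unchanged record be skipped.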
`list_employee_publications` reads `employee_publications` first; if there are no detailed rows yet, it returns the old publications from `current_data`.
Employee news is also stored in two forms:
- a short list remains inside `employees.current_data.sections[].news_links`;
- normalized cards from the "В новостях" ("In the news") tab are stored in `employee_news_links`.
`employee_news_links` contains the news title, link, short summary, publication date, publication year, raw JSON of the card, and `source_hash`. Uniqueness is enforced on `(employee_id, url)` and `(employee_id, source_hash)`, so a repeated crawl does not create duplicates.
## Parsing
@@ -73,6 +93,8 @@ curl -X POST http://localhost:8000/api/crawl-runs --cookie "miem_admin_session=.
- found employees get the `active` status and an updated `last_seen_at`;
- new employees are added to `employees`;
- the number of new employees per run is stored in `crawl_runs.new_count`;
- publications from HSE Publications are written to `employee_publications`, while the short list stays in the profile JSON;
- news from the "В новостях" ("In the news") block is written to `employee_news_links`, while the short list stays in the profile JSON;
- active employees that disappeared from the current source list get the `dismissed` status and `dismissed_at`;
- every successful parse of a new or changed profile saves a record in `employee_snapshots`;
- unchanged profiles are counted in `crawl_runs.skipped_count` and do not get a new snapshot.
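
The rules in the list above can be sketched as a small reconciliation step over plain dictionaries. This is a simplified illustration, not the service's code: `RunCounters` and `reconcile` are hypothetical names, timestamps such as `last_seen_at` and `dismissed_at` are omitted, and change detection is assumed to be a plain checksum comparison.

```python
from dataclasses import dataclass, field


@dataclass
class RunCounters:
    new_count: int = 0
    skipped_count: int = 0
    dismissed: list[str] = field(default_factory=list)
    snapshots: list[str] = field(default_factory=list)


def reconcile(existing: dict[str, dict], crawled: dict[str, str]) -> RunCounters:
    """existing: profile_key -> {"status", "checksum"}; crawled: profile_key -> checksum."""
    counters = RunCounters()
    for key, checksum in crawled.items():
        row = existing.get(key)
        if row is None:
            # New employee: add it, count it, and take a snapshot.
            existing[key] = {"status": "active", "checksum": checksum}
            counters.new_count += 1
            counters.snapshots.append(key)
        elif row["checksum"] == checksum:
            # Unchanged profile: still active, counted as skipped, no snapshot.
            row["status"] = "active"
            counters.skipped_count += 1
        else:
            # Changed profile: update and take a snapshot.
            row.update(status="active", checksum=checksum)
            counters.snapshots.append(key)
    for key, row in existing.items():
        # Active employees missing from the source list become dismissed.
        if key not in crawled and row["status"] == "active":
            row["status"] = "dismissed"
            counters.dismissed.append(key)
    return counters


existing = {
    "a": {"status": "active", "checksum": "c1"},
    "b": {"status": "active", "checksum": "c2"},
    "c": {"status": "active", "checksum": "c3"},
}
result = reconcile(existing, {"a": "c1", "b": "c2-new", "d": "c4"})
```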
@@ -89,13 +111,15 @@ Endpoint: `POST /mcp`, без авторизации на уровне прил
- `sync_employees(client_hash?, include_data?)`
- `search_employees(query, status?, limit?)`
- `get_employee(profile_id_or_url)`
- `list_employee_publications(profile_id_or_url)`
- `list_employee_publications(profile_id_or_url)`: employee publications; when data from `employee_publications` is available, returns authors, DOI, annotation, description, citation text, year, type, language, status, and the HSE Publications link.
- `list_employee_courses(profile_id_or_url)`
- `get_crawl_status()`
- `get_crawl_run_details(run_id)`
`get_service_info` returns service metadata, the list of tools, and the current version of the employee dataset. `sync_employees` returns a full snapshot or a delta by `client_hash`; the dataset checksum is built from the employees, their statuses, and their current checksums. Tool responses are returned as a JSON string inside MCP `content[0].text`.
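
Because the tool payload arrives as a JSON string inside `content[0].text`, a client needs a second decoding step after parsing the JSON-RPC envelope. A minimal sketch, assuming a standard `tools/call` result envelope; `decode_tool_result` and the sample `rpc_response` are hypothetical.

```python
import json
from typing import Any


def decode_tool_result(rpc_response: dict[str, Any]) -> Any:
    """Parse the JSON string carried in result.content[0].text."""
    content = rpc_response["result"]["content"]
    return json.loads(content[0]["text"])


# Hypothetical JSON-RPC response to a tools/call request.
rpc_response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "content": [
            {"type": "text", "text": json.dumps({"employee": {}, "items": []})}
        ]
    },
}
decoded = decode_tool_result(rpc_response)
```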
Employee news has no dedicated MCP tool: it is available in `get_employee(...).data.sections` and `sync_employees(include_data=true)` as a section with `type = "news"` and a `news_links` array.
Example of a local request for the tool list:
```bash
@@ -115,4 +139,4 @@ docker compose exec postgres pg_dump -U miem miem_workers > backup.sql
docker compose down
```
Service version: `0.6.0`. The admin panel always shows the backend and frontend versions in the footer.
Service version: `0.7.0`. The admin panel always shows the backend and frontend versions in the footer.


@@ -1,6 +1,6 @@
from collections.abc import Generator
from sqlalchemy import create_engine
from sqlalchemy import create_engine, inspect, text
from sqlalchemy.orm import DeclarativeBase, Session, sessionmaker
from app.config import get_settings
@@ -25,6 +25,28 @@ def init_db() -> None:
import app.models # noqa: F401
Base.metadata.create_all(bind=engine)
_ensure_runtime_schema()


def _ensure_runtime_schema() -> None:
    import app.models as models

    inspector = inspect(engine)
    table_names = set(inspector.get_table_names())
    if "employees" in table_names and "employee_publications" not in table_names:
        models.EmployeePublication.__table__.create(bind=engine, checkfirst=True)
        inspector = inspect(engine)
        table_names = set(inspector.get_table_names())
    if "employees" in table_names and "employee_news_links" not in table_names:
        models.EmployeeNewsLink.__table__.create(bind=engine, checkfirst=True)
        inspector = inspect(engine)
        table_names = set(inspector.get_table_names())
    if "crawl_runs" not in table_names:
        return
    crawl_run_columns = {column["name"] for column in inspector.get_columns("crawl_runs")}
    if "skipped_count" not in crawl_run_columns:
        with engine.begin() as connection:
            connection.execute(text("ALTER TABLE crawl_runs ADD COLUMN skipped_count INTEGER NOT NULL DEFAULT 0"))


def get_db() -> Generator[Session, None, None]:
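
The add-missing-column step in `_ensure_runtime_schema` can be illustrated without SQLAlchemy. The following is a minimal sketch using the standard-library `sqlite3` module; `ensure_skipped_count` is a hypothetical helper, and the `PRAGMA table_info` lookup stands in for the SQLAlchemy inspector used in the diff.

```python
import sqlite3


def ensure_skipped_count(connection: sqlite3.Connection) -> bool:
    """Add crawl_runs.skipped_count if missing; return True when a column was added."""
    tables = {row[0] for row in connection.execute("SELECT name FROM sqlite_master WHERE type = 'table'")}
    if "crawl_runs" not in tables:
        return False
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    columns = {row[1] for row in connection.execute("PRAGMA table_info(crawl_runs)")}
    if "skipped_count" in columns:
        return False
    connection.execute("ALTER TABLE crawl_runs ADD COLUMN skipped_count INTEGER NOT NULL DEFAULT 0")
    return True


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE crawl_runs (id INTEGER PRIMARY KEY, new_count INTEGER)")
connection.execute("INSERT INTO crawl_runs (new_count) VALUES (3)")
added_first = ensure_skipped_count(connection)   # column is missing, so it is added
added_second = ensure_skipped_count(connection)  # second call is a no-op
backfilled = connection.execute("SELECT skipped_count FROM crawl_runs").fetchone()[0]
```

The `NOT NULL DEFAULT 0` clause matters for existing rows: they receive the default, so the new column can be non-nullable without a separate backfill.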


@@ -5,7 +5,7 @@ from sqlalchemy import desc, or_, select
from sqlalchemy.orm import Session
from app.db import get_db
from app.models import CrawlRun, Employee
from app.models import CrawlRun, Employee, EmployeePublication
from app.services.admin_data import run_detail_payload
from app.services.dataset_versions import service_info_payload, sync_employees_payload
from app.version import BACKEND_VERSION
@@ -52,7 +52,10 @@ TOOLS = [
},
{
"name": "list_employee_publications",
"description": "List publications parsed from an employee profile.",
"description": (
"List employee publications with detailed fields when available: authors, DOI URL, annotation, "
"description, citation text, year, publication type, language, status, and HSE Publications URL."
),
"inputSchema": {"type": "object", "properties": {"profile_id_or_url": {"type": "string"}}, "required": ["profile_id_or_url"]},
},
{
@@ -171,8 +174,14 @@ def _find_employee(db: Session, value: str) -> Employee | None:
def _collect_section_items(employee: Employee | None, section_type: str) -> dict:
if not employee or not employee.current_data:
if not employee:
return {"items": []}
if section_type == "publications":
publications = _stored_publications(employee)
if publications:
return {"employee": _employee_payload(employee, include_data=False), "items": publications}
if not employee.current_data:
return {"employee": _employee_payload(employee, include_data=False), "items": []}
items = []
for section in employee.current_data.get("sections") or []:
if section.get("type") != section_type:
@@ -184,6 +193,41 @@ def _collect_section_items(employee: Employee | None, section_type: str) -> dict
return {"employee": _employee_payload(employee, include_data=False), "items": items}


def _stored_publications(employee: Employee) -> list[dict]:
    return [_publication_payload(publication) for publication in sorted(employee.publications, key=_publication_sort_key)]


def _publication_sort_key(publication: EmployeePublication) -> tuple:
    return (publication.year or 0, publication.title or "", publication.id)


def _publication_payload(publication: EmployeePublication) -> dict:
    text = publication.citation_text or publication.title
    payload = {
        "id": publication.publication_id,
        "publication_id": publication.publication_id,
        "title": publication.title,
        "text": text,
        "url": publication.url,
    }
    optional = {
        "year": publication.year,
        "type": publication.publication_type,
        "publication_type": publication.publication_type,
        "language": publication.language,
        "status": publication.status,
        "doi_url": publication.doi_url,
        "other_url": publication.other_url,
        "document_url": publication.document_url,
        "citation_text": publication.citation_text,
        "annotation": publication.annotation,
        "description": publication.description,
        "authors": publication.authors,
    }
    payload.update({key: value for key, value in optional.items() if value not in (None, [], {})})
    return payload


def _employee_payload(employee: Employee, include_data: bool = True) -> dict:
payload = {
"profile_key": employee.profile_key,


@@ -41,6 +41,8 @@ class Employee(Base):
snapshots: Mapped[list["EmployeeSnapshot"]] = relationship(back_populates="employee")
tabs: Mapped[list["ProfileTab"]] = relationship(back_populates="employee", cascade="all, delete-orphan")
publications: Mapped[list["EmployeePublication"]] = relationship(back_populates="employee", cascade="all, delete-orphan")
news_links: Mapped[list["EmployeeNewsLink"]] = relationship(back_populates="employee", cascade="all, delete-orphan")
crawl_run_changes: Mapped[list["CrawlRunEmployeeChange"]] = relationship(back_populates="employee")
@@ -60,6 +62,68 @@ class EmployeeSnapshot(Base):
employee: Mapped[Employee] = relationship(back_populates="snapshots")
class EmployeePublication(Base):
__tablename__ = "employee_publications"
__table_args__ = (
UniqueConstraint("employee_id", "publication_id", name="uq_employee_publications_employee_publication"),
UniqueConstraint("employee_id", "source_hash", name="uq_employee_publications_employee_source_hash"),
Index("ix_employee_publications_employee_id", "employee_id"),
Index("ix_employee_publications_publication_id", "publication_id"),
Index("ix_employee_publications_doi_url", "doi_url"),
Index("ix_employee_publications_year", "year"),
Index("ix_employee_publications_publication_type", "publication_type"),
)
id: Mapped[int] = mapped_column(Integer, primary_key=True)
employee_id: Mapped[int] = mapped_column(ForeignKey("employees.id", ondelete="CASCADE"), nullable=False)
publication_id: Mapped[str | None] = mapped_column(String(64))
title: Mapped[str] = mapped_column(Text, nullable=False)
year: Mapped[int | None] = mapped_column(Integer)
publication_type: Mapped[str | None] = mapped_column(String(64))
language: Mapped[str | None] = mapped_column(String(16))
status: Mapped[int | None] = mapped_column(Integer)
url: Mapped[str | None] = mapped_column(Text)
doi_url: Mapped[str | None] = mapped_column(Text)
other_url: Mapped[str | None] = mapped_column(Text)
document_url: Mapped[str | None] = mapped_column(Text)
citation_text: Mapped[str | None] = mapped_column(Text)
annotation: Mapped[dict | None] = mapped_column(json_type)
description: Mapped[dict | None] = mapped_column(json_type)
authors: Mapped[list | None] = mapped_column(json_type)
raw_data: Mapped[dict | None] = mapped_column(json_type)
source_hash: Mapped[str] = mapped_column(String(64), nullable=False)
created_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), default=utcnow, nullable=False)
updated_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), default=utcnow, onupdate=utcnow, nullable=False)
employee: Mapped[Employee] = relationship(back_populates="publications")
class EmployeeNewsLink(Base):
__tablename__ = "employee_news_links"
__table_args__ = (
UniqueConstraint("employee_id", "url", name="uq_employee_news_links_employee_url"),
UniqueConstraint("employee_id", "source_hash", name="uq_employee_news_links_employee_source_hash"),
Index("ix_employee_news_links_employee_id", "employee_id"),
Index("ix_employee_news_links_url", "url"),
Index("ix_employee_news_links_published_at", "published_at"),
Index("ix_employee_news_links_published_year", "published_year"),
)
id: Mapped[int] = mapped_column(Integer, primary_key=True)
employee_id: Mapped[int] = mapped_column(ForeignKey("employees.id", ondelete="CASCADE"), nullable=False)
title: Mapped[str] = mapped_column(Text, nullable=False)
url: Mapped[str | None] = mapped_column(Text)
summary: Mapped[str | None] = mapped_column(Text)
published_at: Mapped[datetime | None] = mapped_column(DateTime(timezone=True))
published_year: Mapped[int | None] = mapped_column(Integer)
source_hash: Mapped[str] = mapped_column(String(64), nullable=False)
raw_data: Mapped[dict | None] = mapped_column(json_type)
created_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), default=utcnow, nullable=False)
updated_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), default=utcnow, onupdate=utcnow, nullable=False)
employee: Mapped[Employee] = relationship(back_populates="news_links")
class CrawlRun(Base):
__tablename__ = "crawl_runs"


@@ -1,6 +1,7 @@
import hashlib
import json
import re
from datetime import datetime, timezone
from urllib.parse import urljoin
from bs4 import BeautifulSoup, NavigableString, Tag
@@ -101,6 +102,8 @@ def extract_person_header(soup: BeautifulSoup, source_url: str) -> dict:
def extract_sections(soup: BeautifulSoup, source_url: str) -> list[dict]:
sections = []
for h2 in soup.select("h2"):
if h2.find_parent(class_="post") or h2.find_parent(attrs={"data-tab": "press_links_news"}):
continue
title = normalize_ws(h2.get_text(" ", strip=True))
if not title or "расписание занятий" in title.lower():
continue
@@ -142,6 +145,21 @@ def extract_sections(soup: BeautifulSoup, source_url: str) -> list[dict]:
if section_type in {"generic", "paragraphs"}:
section["type"] = "year_blocks"
sections.append(section)
news_links = _parse_news_links(soup, source_url)
if news_links:
sections.append(
{
"title": "В новостях",
"slug": "v_novostyah",
"type": "news",
"raw_text": "",
"paragraphs": [],
"items": [item["title"] for item in news_links if item.get("title")],
"links": [{"text": item["title"], "url": item["url"]} for item in news_links if item.get("title") and item.get("url")],
"news_count": len(news_links),
"news_links": news_links,
}
)
return sections
@@ -333,7 +351,7 @@ def _load_widget_publications(
items = _extract_publication_items(result)
if not items:
break
publications.extend(_normalize_publication_item(item) for item in items)
publications.extend(_normalize_publication_item(item, author_id) for item in items)
total = int(result.get("total") or 0)
if not result.get("more") and len(publications) >= total:
@@ -575,20 +593,126 @@ def _parse_vkr_items(nodes: list) -> list[str]:
return [item for item in dict.fromkeys(items) if item]
def _normalize_publication_item(item: dict) -> dict:
def _parse_news_links(soup: BeautifulSoup, source_url: str) -> list[dict]:
news = []
for post in soup.select('[data-tab="press_links_news"] .post'):
if not isinstance(post, Tag):
continue
anchor = post.select_one(".post__content h2 a[href], h2 a[href], a[href]")
title = normalize_ws(anchor.get_text(" ", strip=True)) if anchor else ""
href = normalize_ws(anchor.get("href")) if anchor else ""
summary_node = post.select_one(".post__text")
summary = normalize_ws(summary_node.get_text(" ", strip=True)) if summary_node else ""
published_at = _parse_post_date(post)
if not title and not href:
continue
item = {
"title": title or href,
"url": urljoin(source_url, href) if href else None,
"summary": summary or None,
"published_at": published_at.isoformat() if published_at else None,
"published_year": published_at.year if published_at else _int_or_none(normalize_ws(_select_text(post, ".post-meta__year"))),
"raw_data": {
"title": title or href,
"url": href or None,
"summary": summary or None,
"date_text": normalize_ws(_select_text(post, ".post-meta__date")),
},
}
news.append(item)
return _dedupe_news_links(news)
def _select_text(node: Tag, selector: str) -> str:
selected = node.select_one(selector)
return selected.get_text(" ", strip=True) if selected else ""
def _parse_post_date(post: Tag) -> datetime | None:
day = _int_or_none(normalize_ws(_select_text(post, ".post-meta__day")))
month = _month_number(normalize_ws(_select_text(post, ".post-meta__month")))
year = _int_or_none(normalize_ws(_select_text(post, ".post-meta__year")))
if not day or not month or not year:
return None
try:
return datetime(year, month, day, tzinfo=timezone.utc)
except ValueError:
return None
def _month_number(value: str) -> int | None:
lowered = value.lower().strip(".")
months = {
"янв": 1,
"январь": 1,
"января": 1,
"фев": 2,
"февр": 2,
"февраль": 2,
"февраля": 2,
"март": 3,
"мар": 3,
"марта": 3,
"апр": 4,
"апрель": 4,
"апреля": 4,
"май": 5,
"мая": 5,
"июнь": 6,
"июня": 6,
"июль": 7,
"июля": 7,
"авг": 8,
"август": 8,
"августа": 8,
"сент": 9,
"сен": 9,
"сентябрь": 9,
"сентября": 9,
"окт": 10,
"октябрь": 10,
"октября": 10,
"нояб": 11,
"ноябрь": 11,
"ноября": 11,
"дек": 12,
"декабрь": 12,
"декабря": 12,
}
return months.get(lowered)
def _normalize_publication_item(item: dict, current_author_id: str | None = None) -> dict:
publication_id = str(item.get("id") or "").strip()
title = _html_to_text(item.get("title"))
year = item.get("year")
year = _int_or_none(item.get("year"))
publication_type = str(item.get("type") or "").strip() or None
description = item.get("description") if isinstance(item.get("description"), dict) else {}
short_description = _localized_value(description.get("short")) or _localized_value(description.get("shortLeft"))
documents = item.get("documents") if isinstance(item.get("documents"), dict) else {}
language = item.get("language") if isinstance(item.get("language"), dict) else {}
annotation = _localized_text_map(item.get("annotation"))
authors = _normalize_publication_authors(item.get("authorsByType"), current_author_id)
citation_text = normalize_ws(str(description.get("main") or "")) or _build_publication_citation(title, authors, year)
text = normalize_ws(" ".join(part for part in [title, str(year or ""), short_description] if part))
return {
"id": publication_id or None,
"publication_id": publication_id or None,
"title": title or publication_id,
"year": year,
"type": publication_type,
"publication_type": publication_type,
"language": normalize_ws(language.get("name")) or None,
"status": _int_or_none(item.get("status")),
"url": f"https://publications.hse.ru/view/{publication_id}" if publication_id else None,
"doi_url": _document_href(documents, "DOI"),
"other_url": _document_href(documents, "OTHER_URL"),
"document_url": _document_href(documents, "DOCUMENT"),
"citation_text": citation_text or None,
"annotation": annotation,
"description": description or None,
"authors": authors,
"raw_data": item,
"text": text or title or publication_id,
}
@@ -681,16 +805,84 @@ def _dedupe_publications(items: list[dict]) -> list[dict]:
return unique
def _dedupe_news_links(items: list[dict]) -> list[dict]:
seen = set()
unique = []
for item in items:
key = item.get("url") or item.get("title")
if key and key not in seen:
seen.add(key)
unique.append(item)
return unique
def _html_to_text(value: object) -> str:
return normalize_ws(BeautifulSoup(str(value or ""), "html.parser").get_text(" ", strip=True))
def _localized_text_map(value: object) -> dict[str, str]:
if not isinstance(value, dict):
return {}
localized = {}
for key in ("ru", "en", "publ"):
text = _html_to_text(value.get(key))
if text:
localized[key] = text
return localized
def _localized_value(value: object) -> str:
if isinstance(value, dict):
return normalize_ws(value.get("ru") or value.get("publ") or value.get("en"))
return normalize_ws(str(value or ""))
def _normalize_publication_authors(value: object, current_author_id: str | None) -> list[dict]:
if not isinstance(value, dict):
return []
authors = []
for author in value.get("author") or []:
if not isinstance(author, dict):
continue
title = author.get("title") if isinstance(author.get("title"), dict) else {}
reverse_title = author.get("reverseTitle") if isinstance(author.get("reverseTitle"), dict) else {}
author_id = normalize_ws(author.get("id"))
href = normalize_ws(author.get("href"))
authors.append(
{
"id": author_id or None,
"href": urljoin("https://www.hse.ru", href) if href else None,
"title_ru": _html_to_text(title.get("ru")),
"title_en": _html_to_text(title.get("en")),
"reverse_title_ru": _html_to_text(reverse_title.get("ru")),
"reverse_title_en": _html_to_text(reverse_title.get("en")),
"alt_name": normalize_ws(author.get("altName")) or None,
"other_name": normalize_ws(author.get("otherName")) or None,
"is_current_employee": bool(current_author_id and author_id == current_author_id),
}
)
return authors
def _document_href(documents: dict, key: str) -> str | None:
document = documents.get(key)
if not isinstance(document, dict):
return None
return normalize_ws(document.get("href")) or None
def _build_publication_citation(title: str, authors: list[dict], year: int | None) -> str:
author_names = [author.get("title_ru") or author.get("title_en") or author.get("alt_name") for author in authors]
return normalize_ws(". ".join(part for part in [", ".join(filter(None, author_names)), title, str(year or "")] if part))


def _int_or_none(value: object) -> int | None:
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def _slugify(value: str) -> str:
    cleaned = re.sub(r"[^\w\s-]", "", value.lower(), flags=re.UNICODE)
    return re.sub(r"[-\s]+", "_", cleaned).strip("_") or "section"
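
The behaviour of the date helpers in this file (`_month_number`, `_parse_post_date`) can be illustrated with a standalone sketch. It is a simplified re-implementation for demonstration only: `MONTHS` is a reduced, hypothetical month table, and `parse_post_date` takes the three text fragments directly instead of reading them from a `.post` element.

```python
from datetime import datetime, timezone
from typing import Optional

# Reduced month table for illustration; the parser's table covers all months and abbreviations.
MONTHS = {"января": 1, "февраля": 2, "апреля": 4, "апр": 4}


def parse_post_date(day: str, month: str, year: str) -> Optional[datetime]:
    """Build a UTC midnight datetime from text fragments, or None if any part is unusable."""
    try:
        day_number, year_number = int(day), int(year)
    except ValueError:
        return None
    month_number = MONTHS.get(month.lower().strip("."))
    if not day_number or not month_number or not year_number:
        return None
    try:
        return datetime(year_number, month_number, day_number, tzinfo=timezone.utc)
    except ValueError:
        # Calendar-invalid combinations such as 31 February.
        return None


valid = parse_post_date("28", "апр.", "2026")
invalid_day = parse_post_date("31", "февраля", "2026")
missing_day = parse_post_date("", "апреля", "2026")
```

Returning `None` instead of raising keeps a single malformed card from aborting the whole profile; the caller can still fall back to the year alone.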


@@ -8,7 +8,7 @@ from zoneinfo import ZoneInfo
from sqlalchemy import Select, Text, and_, desc, func, or_, select
from sqlalchemy.orm import Session
from app.models import CrawlError, CrawlRun, CrawlRunEmployeeChange, Employee
from app.models import CrawlError, CrawlRun, CrawlRunEmployeeChange, Employee, EmployeeNewsLink
EMPLOYEE_SORTS = {
"full_name": Employee.full_name,
@@ -24,6 +24,7 @@ def employee_display_payload(employee: Employee) -> dict[str, Any]:
data = _as_dict(employee.current_data)
contacts = _as_dict(data.get("contacts"))
sections = _as_list(data.get("sections"))
stored_news_links = _stored_news_links(employee)
positions = _clean_list(data.get("positions"))
emails = _clean_list(contacts.get("emails"))
phones = _clean_list(contacts.get("phones"))
@@ -43,6 +44,7 @@ def employee_display_payload(employee: Employee) -> dict[str, Any]:
"address": contacts.get("address"),
"publications_count": _count_section_items(sections, "publications"),
"courses_count": _count_section_items(sections, "courses_by_year"),
"news_count": len(stored_news_links) or _count_section_items(sections, "news"),
"first_seen_at": employee.first_seen_at.isoformat() if employee.first_seen_at else None,
"last_seen_at": employee.last_seen_at.isoformat() if employee.last_seen_at else None,
"dismissed_at": employee.dismissed_at.isoformat() if employee.dismissed_at else None,
@@ -67,6 +69,7 @@ def employee_detail_payload(employee: Employee) -> dict[str, Any]:
"contact_items": _normalize_contact_items(contacts.get("items")),
},
"external_ids": _normalize_external_ids(data.get("external_ids")),
"news_links": _detail_news_links(employee, data),
"sections": [_normalize_section(section) for section in _as_list(data.get("sections"))],
}
@@ -276,6 +279,8 @@ def _count_section_items(sections: list[dict[str, Any]], section_type: str) -> i
total += len(section.get("publications") or section.get("items") or [])
elif section_type == "courses_by_year":
total += len(section.get("courses") or [])
elif section_type == "news":
total += len(section.get("news_links") or section.get("items") or [])
return total
@@ -348,6 +353,8 @@ def _normalize_section(section: Any) -> dict[str, Any]:
"year_entries": _normalize_year_entries(section.get("year_entries")),
"publications": _normalize_publications(section.get("publications")),
"publications_count": section.get("publications_count"),
"news_links": _normalize_news_links(section.get("news_links")),
"news_count": section.get("news_count"),
"theses": _normalize_theses(section.get("theses")),
"theses_count": section.get("theses_count"),
"academic_year": section.get("academic_year"),
@@ -370,6 +377,77 @@ def _normalize_links(items: Any) -> list[dict[str, str | None]]:
return normalized
def _stored_news_links(employee: Employee) -> list[dict[str, Any]]:
return [_stored_news_link_payload(item) for item in sorted(employee.news_links, key=_news_link_sort_key)]
def _news_link_sort_key(item: EmployeeNewsLink) -> tuple:
timestamp = item.published_at.timestamp() if item.published_at else 0
return (-timestamp, item.title or "", item.id)
def _stored_news_link_payload(item: EmployeeNewsLink) -> dict[str, Any]:
return {
"title": item.title,
"url": item.url,
"summary": item.summary,
"published_at": item.published_at.isoformat() if item.published_at else None,
"published_year": item.published_year,
"published_display": format_admin_date(item.published_at) if item.published_at else str(item.published_year or ""),
}
def _detail_news_links(employee: Employee, data: dict[str, Any]) -> list[dict[str, Any]]:
stored = _stored_news_links(employee)
if stored:
return stored
for section in _as_list(data.get("sections")):
if isinstance(section, dict) and section.get("type") == "news":
return _normalize_news_links(section.get("news_links"))
return []
def format_admin_date(value: Any) -> str:
if not value:
return ""
if isinstance(value, str):
try:
value = datetime.fromisoformat(value.replace("Z", "+00:00"))
except ValueError:
return value
if not isinstance(value, datetime):
return str(value)
if value.tzinfo:
value = value.astimezone(ZoneInfo("Europe/Moscow"))
return value.strftime("%d.%m.%Y")
def _normalize_news_links(items: Any) -> list[dict[str, Any]]:
    normalized = []
    if not isinstance(items, list):
        return normalized
    for item in items:
        if not isinstance(item, dict):
            continue
        title = str(item.get("title") or item.get("url") or "").strip()
        url = str(item.get("url") or "").strip()
        summary = str(item.get("summary") or "").strip()
        published_at = str(item.get("published_at") or "").strip()
        published_year = item.get("published_year")
        if title or url:
            normalized.append(
                {
                    "title": title or url,
                    "url": url or None,
                    "summary": summary or None,
                    "published_at": published_at or None,
                    "published_year": published_year,
                    "published_display": format_admin_date(published_at) if published_at else str(published_year or ""),
                }
            )
    return normalized
def _normalize_year_entries(items: Any) -> list[dict[str, Any]]:
normalized = []
if not isinstance(items, list):

View File

@@ -6,11 +6,21 @@ import time
from datetime import datetime, timezone
import requests
-from sqlalchemy import select
+from sqlalchemy import inspect, select
from sqlalchemy.orm import Session
from app.config import Settings
-from app.models import CrawlError, CrawlRun, CrawlRunEmployeeChange, Employee, EmployeeSnapshot, ParserSource, ProfileTab
+from app.models import (
+    CrawlError,
+    CrawlRun,
+    CrawlRunEmployeeChange,
+    Employee,
+    EmployeeNewsLink,
+    EmployeePublication,
+    EmployeeSnapshot,
+    ParserSource,
+    ProfileTab,
+)
from app.parser.collector import collect_profile_links
from app.parser.profile import parse_person_profile
from app.parser.profile_url import profile_key
@@ -219,9 +229,230 @@ def _upsert_employee(db: Session, run: CrawlRun, parsed: dict) -> tuple[Employee
parser_version=parser_version,
)
)
db.flush()
_try_sync_employee_publications(db, run, employee, parsed)
_try_sync_employee_news_links(db, run, employee, parsed)
return employee, changed
def _try_sync_employee_publications(db: Session, run: CrawlRun, employee: Employee, parsed: dict) -> None:
    try:
        if not _publication_payloads(parsed):
            return
        if not _employee_publications_table_exists(db):
            return
        with db.begin_nested():
            _sync_employee_publications(db, employee, parsed)
    except Exception as exc:
        db.add(
            CrawlError(
                crawl_run_id=run.id,
                profile_url=employee.canonical_url,
                error_type=type(exc).__name__,
                message=f"Не удалось сохранить публикации сотрудника: {exc}",
            )
        )
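The `db.begin_nested()` call above wraps the publication sync in a savepoint, so a failure rolls back only the publication rows and leaves the employee upsert intact. The same idea can be sketched with the standard-library `sqlite3` module and explicit `SAVEPOINT` statements. This is a swapped-in illustration, not the SQLAlchemy code path, and the table and function names are hypothetical.

```python
import sqlite3


def save_employee_with_news(conn: sqlite3.Connection, name: str, titles: list) -> list:
    """Insert an employee, then try to insert news rows inside a savepoint."""
    errors = []
    conn.execute("BEGIN")
    conn.execute("INSERT INTO employees (name) VALUES (?)", (name,))
    conn.execute("SAVEPOINT news_sync")
    try:
        for title in titles:
            conn.execute("INSERT INTO news (title) VALUES (?)", (title,))
        conn.execute("RELEASE SAVEPOINT news_sync")
    except sqlite3.Error as exc:
        # Undo only the news inserts; the employee row stays in the transaction.
        conn.execute("ROLLBACK TO SAVEPOINT news_sync")
        conn.execute("RELEASE SAVEPOINT news_sync")
        errors.append(f"{type(exc).__name__}: {exc}")
    conn.execute("COMMIT")
    return errors


# isolation_level=None puts sqlite3 in autocommit mode so BEGIN/COMMIT are explicit.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("CREATE TABLE news (id INTEGER PRIMARY KEY, title TEXT NOT NULL)")

# The second title violates NOT NULL, so the whole news batch is rolled back.
errors = save_employee_with_news(conn, "Same Person", ["News Title", None])
employee_count = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]
news_count = conn.execute("SELECT COUNT(*) FROM news").fetchone()[0]
```

The employee survives while the partial news batch does not. This matches what the tests in this PR check when `_sync_employee_publications` is patched to raise.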
def _employee_publications_table_exists(db: Session) -> bool:
    return inspect(db.connection()).has_table(EmployeePublication.__tablename__)


def _sync_employee_publications(db: Session, employee: Employee, parsed: dict) -> None:
    publications = _publication_payloads(parsed)
    seen_hashes = set()
    for publication in publications:
        source_hash = _publication_hash(publication)
        seen_hashes.add(source_hash)
        publication_id = _clean_optional(publication.get("publication_id") or publication.get("id"))
        existing = None
        if publication_id:
            existing = db.scalar(
                select(EmployeePublication).where(
                    EmployeePublication.employee_id == employee.id,
                    EmployeePublication.publication_id == publication_id,
                )
            )
        if not existing:
            existing = db.scalar(
                select(EmployeePublication).where(
                    EmployeePublication.employee_id == employee.id,
                    EmployeePublication.source_hash == source_hash,
                )
            )
        if not existing:
            existing = EmployeePublication(employee_id=employee.id, source_hash=source_hash, title=_publication_title(publication))
            db.add(existing)
        _apply_publication(existing, publication, source_hash)
    if seen_hashes:
        stale = db.scalars(
            select(EmployeePublication).where(
                EmployeePublication.employee_id == employee.id,
                EmployeePublication.source_hash.not_in(seen_hashes),
            )
        ).all()
        for item in stale:
            db.delete(item)


def _publication_payloads(parsed: dict) -> list[dict]:
    publications = []
    for section in parsed.get("sections") or []:
        if not isinstance(section, dict) or section.get("type") != "publications":
            continue
        for publication in section.get("publications") or []:
            if isinstance(publication, dict):
                publications.append(publication)
    return publications


def _apply_publication(target: EmployeePublication, publication: dict, source_hash: str) -> None:
    target.publication_id = _clean_optional(publication.get("publication_id") or publication.get("id"))
    target.title = _publication_title(publication)
    target.year = _int_or_none(publication.get("year"))
    target.publication_type = _clean_optional(publication.get("publication_type") or publication.get("type"))
    target.language = _clean_optional(publication.get("language"))
    target.status = _int_or_none(publication.get("status"))
    target.url = _clean_optional(publication.get("url"))
    target.doi_url = _clean_optional(publication.get("doi_url"))
    target.other_url = _clean_optional(publication.get("other_url"))
    target.document_url = _clean_optional(publication.get("document_url"))
    target.citation_text = _clean_optional(publication.get("citation_text") or publication.get("text"))
    target.annotation = publication.get("annotation") if isinstance(publication.get("annotation"), dict) else None
    target.description = publication.get("description") if isinstance(publication.get("description"), dict) else None
    target.authors = publication.get("authors") if isinstance(publication.get("authors"), list) else None
    target.raw_data = publication.get("raw_data") if isinstance(publication.get("raw_data"), dict) else publication
    target.source_hash = source_hash


def _publication_hash(publication: dict) -> str:
    return _payload_hash(publication.get("raw_data") if isinstance(publication.get("raw_data"), dict) else publication)


def _payload_hash(value: object) -> str:
    payload = json.dumps(_stable_checksum_payload(value), ensure_ascii=False, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
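`_payload_hash` serializes the payload to canonical JSON (sorted keys, compact separators) before hashing, so key order does not affect the result. A minimal sketch of that property follows. It leaves out `_stable_checksum_payload`, whose body is not shown in this diff, and `payload_hash` is a hypothetical name.

```python
import hashlib
import json


def payload_hash(value: object) -> str:
    # sort_keys plus fixed separators give one canonical string per logical payload.
    payload = json.dumps(value, ensure_ascii=False, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


first = payload_hash({"title": "News Title", "url": "https://www.hse.ru/news/1.html"})
reordered = payload_hash({"url": "https://www.hse.ru/news/1.html", "title": "News Title"})
edited = payload_hash({"title": "News Title (updated)", "url": "https://www.hse.ru/news/1.html"})
```

`first` and `reordered` are equal, `edited` differs, and each digest is 64 hex characters, which fits the `VARCHAR(64)` `source_hash` column.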
def _publication_title(publication: dict) -> str:
    return _clean_optional(publication.get("title") or publication.get("text") or publication.get("id")) or "Untitled publication"


def _clean_optional(value: object) -> str | None:
    text = str(value or "").strip()
    return text or None


def _int_or_none(value: object) -> int | None:
    try:
        return int(value)
    except (TypeError, ValueError):
        return None
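The two coercion helpers above are small, but their edge cases matter when mapping loosely typed parser output onto typed columns. The sketch below re-implements them under hypothetical names to show those cases.

```python
from typing import Optional


def clean_optional(value: object) -> Optional[str]:
    # Falsy inputs (None, "", 0) collapse to "" and then to None.
    text = str(value or "").strip()
    return text or None


def int_or_none(value: object) -> Optional[int]:
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


trimmed = clean_optional("  ARTICLE  ")
blank = clean_optional("   ")
zero = clean_optional(0)
year = int_or_none("2023")
bad_year = int_or_none("n/a")
missing_year = int_or_none(None)
```

Note that `clean_optional(0)` returns `None` rather than `"0"`, because `0 or ""` evaluates to the empty string. That is harmless for URLs and titles but worth keeping in mind for numeric identifiers.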
def _try_sync_employee_news_links(db: Session, run: CrawlRun, employee: Employee, parsed: dict) -> None:
    try:
        if not _news_link_payloads(parsed):
            return
        if not _employee_news_links_table_exists(db):
            return
        with db.begin_nested():
            _sync_employee_news_links(db, employee, parsed)
    except Exception as exc:
        db.add(
            CrawlError(
                crawl_run_id=run.id,
                profile_url=employee.canonical_url,
                error_type=type(exc).__name__,
                message=f"Не удалось сохранить новости сотрудника: {exc}",
            )
        )


def _employee_news_links_table_exists(db: Session) -> bool:
    return inspect(db.connection()).has_table(EmployeeNewsLink.__tablename__)


def _sync_employee_news_links(db: Session, employee: Employee, parsed: dict) -> None:
    news_links = _news_link_payloads(parsed)
    seen_hashes = set()
    for news_link in news_links:
        source_hash = _news_link_hash(news_link)
        seen_hashes.add(source_hash)
        url = _clean_optional(news_link.get("url"))
        existing = None
        if url:
            existing = db.scalar(
                select(EmployeeNewsLink).where(
                    EmployeeNewsLink.employee_id == employee.id,
                    EmployeeNewsLink.url == url,
                )
            )
        if not existing:
            existing = db.scalar(
                select(EmployeeNewsLink).where(
                    EmployeeNewsLink.employee_id == employee.id,
                    EmployeeNewsLink.source_hash == source_hash,
                )
            )
        if not existing:
            existing = EmployeeNewsLink(employee_id=employee.id, source_hash=source_hash, title=_news_link_title(news_link))
            db.add(existing)
        _apply_news_link(existing, news_link, source_hash)
    if seen_hashes:
        stale = db.scalars(
            select(EmployeeNewsLink).where(
                EmployeeNewsLink.employee_id == employee.id,
                EmployeeNewsLink.source_hash.not_in(seen_hashes),
            )
        ).all()
        for item in stale:
            db.delete(item)
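The sync above matches an incoming news card first by URL, then by content hash, creates a row only when neither matches, and finally deletes rows whose hash was not seen in this crawl. A minimal in-memory sketch of that flow, using plain dicts instead of ORM rows (all names here are hypothetical):

```python
def sync_news_rows(rows: list, incoming: list) -> list:
    """Update `rows` in place so it mirrors `incoming`; both are lists of dicts."""
    seen_hashes = set()
    for item in incoming:
        seen_hashes.add(item["source_hash"])
        match = None
        if item.get("url"):
            # First choice: the same URL means the same news card, even if its text changed.
            match = next((row for row in rows if row.get("url") == item["url"]), None)
        if match is None:
            # Second choice: identical content hash (covers cards without a URL).
            match = next((row for row in rows if row["source_hash"] == item["source_hash"]), None)
        if match is None:
            match = {}
            rows.append(match)
        match.update(item)
    if seen_hashes:
        # Rows whose hash was not seen in this crawl are treated as stale and dropped.
        rows[:] = [row for row in rows if row["source_hash"] in seen_hashes]
    return rows


stored = [
    {"url": "https://example.test/a", "source_hash": "h1", "title": "Old title"},
    {"url": "https://example.test/b", "source_hash": "h2", "title": "Removed from profile"},
]
crawled = [{"url": "https://example.test/a", "source_hash": "h3", "title": "New title"}]
result = sync_news_rows(stored, crawled)
```

The first row is updated in place and the second is removed. As in the code above, an empty incoming list deletes nothing, so a profile that loses its whole news block keeps its previously stored rows.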
def _news_link_payloads(parsed: dict) -> list[dict]:
    news_links = []
    for section in parsed.get("sections") or []:
        if not isinstance(section, dict) or section.get("type") != "news":
            continue
        for item in section.get("news_links") or []:
            if isinstance(item, dict):
                news_links.append(item)
    return news_links


def _apply_news_link(target: EmployeeNewsLink, news_link: dict, source_hash: str) -> None:
    target.title = _news_link_title(news_link)
    target.url = _clean_optional(news_link.get("url"))
    target.summary = _clean_optional(news_link.get("summary"))
    target.published_at = _datetime_or_none(news_link.get("published_at"))
    target.published_year = _int_or_none(news_link.get("published_year"))
    target.raw_data = news_link.get("raw_data") if isinstance(news_link.get("raw_data"), dict) else news_link
    target.source_hash = source_hash


def _news_link_hash(news_link: dict) -> str:
    return _payload_hash(news_link.get("raw_data") if isinstance(news_link.get("raw_data"), dict) else news_link)


def _news_link_title(news_link: dict) -> str:
    return _clean_optional(news_link.get("title") or news_link.get("url")) or "Untitled news"


def _datetime_or_none(value: object) -> datetime | None:
    if isinstance(value, datetime):
        return value
    if not value:
        return None
    try:
        parsed = datetime.fromisoformat(str(value).replace("Z", "+00:00"))
    except ValueError:
        return None
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)
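`_datetime_or_none` accepts ISO 8601 strings, treats a trailing `Z` as UTC, assumes UTC for naive values, and returns `None` for anything it cannot parse. The sketch below copies that logic under a hypothetical name to show the three cases.

```python
from datetime import datetime, timezone
from typing import Optional


def datetime_or_none(value: object) -> Optional[datetime]:
    if isinstance(value, datetime):
        return value
    if not value:
        return None
    try:
        parsed = datetime.fromisoformat(str(value).replace("Z", "+00:00"))
    except ValueError:
        return None
    # Naive timestamps are assumed to be UTC so the column always stores aware values.
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


aware = datetime_or_none("2026-04-28T00:00:00Z")
naive = datetime_or_none("2026-04-28T12:00:00")
unparseable = datetime_or_none("28 апр. 2026")
```

A display-style date such as `28 апр. 2026` is dropped rather than guessed, so the parser is expected to emit ISO strings in `published_at`.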
def _mark_dismissed(db: Session, run: CrawlRun, found_keys: set[str], session: requests.Session, timeout: int) -> int:
dismissed = 0
active = db.scalars(select(Employee).where(Employee.status == "active")).all()

View File

@@ -55,6 +55,7 @@
<th class="directory-table__head" data-column="address">Адрес</th>
<th class="directory-table__head" data-column="publications_count">Публикации</th>
<th class="directory-table__head" data-column="courses_count">Курсы</th>
<th class="directory-table__head" data-column="news_count">Новости</th>
<th class="directory-table__head" data-column="first_seen_at">Впервые найден</th>
<th class="directory-table__head" data-column="last_seen_at">Последний раз найден</th>
<th class="directory-table__head" data-column="dismissed_at">Дата увольнения</th>
@@ -73,13 +74,14 @@
<td class="directory-table__cell" data-column="address">{{ employee.address or "" }}</td>
<td class="directory-table__cell" data-column="publications_count">{{ employee.publications_count }}</td>
<td class="directory-table__cell" data-column="courses_count">{{ employee.courses_count }}</td>
<td class="directory-table__cell" data-column="news_count">{{ employee.news_count }}</td>
<td class="directory-table__cell" data-column="first_seen_at">{{ employee.first_seen_display }}</td>
<td class="directory-table__cell" data-column="last_seen_at">{{ employee.last_seen_display }}</td>
<td class="directory-table__cell" data-column="dismissed_at">{{ employee.dismissed_display }}</td>
<td class="directory-table__cell" data-column="profile"><a class="admin__link" href="{{ employee.canonical_url }}">Открыть</a></td>
</tr>
{% else %}
-<tr><td class="directory-table__empty" colspan="13">По этим фильтрам сотрудники не найдены.</td></tr>
+<tr><td class="directory-table__empty" colspan="14">По этим фильтрам сотрудники не найдены.</td></tr>
{% endfor %}
</tbody>
</table>
@@ -106,7 +108,7 @@
<button class="button button--ghost" type="button" data-columns-close>Закрыть</button>
</div>
<div class="columns-modal__grid">
-{% for key, label in [("full_name", "ФИО"), ("status", "Статус"), ("positions", "Должности"), ("hse_start_year", "Год начала"), ("email", "Email"), ("phone", "Телефон"), ("address", "Адрес"), ("publications_count", "Публикации"), ("courses_count", "Курсы"), ("first_seen_at", "Впервые найден"), ("last_seen_at", "Последний раз найден"), ("dismissed_at", "Дата увольнения"), ("profile", "Профиль")] %}
+{% for key, label in [("full_name", "ФИО"), ("status", "Статус"), ("positions", "Должности"), ("hse_start_year", "Год начала"), ("email", "Email"), ("phone", "Телефон"), ("address", "Адрес"), ("publications_count", "Публикации"), ("courses_count", "Курсы"), ("news_count", "Новости"), ("first_seen_at", "Впервые найден"), ("last_seen_at", "Последний раз найден"), ("dismissed_at", "Дата увольнения"), ("profile", "Профиль")] %}
<label class="columns-modal__option"><input class="columns-modal__checkbox" type="checkbox" value="{{ key }}" data-column-toggle> {{ label }}</label>
{% endfor %}
</div>

View File

@@ -104,6 +104,25 @@
</section>
{% endif %}
{% if employee_view.news_links %}
<section class="employee-card__section">
<h3 class="employee-section__title">В новостях</h3>
<ul class="employee-card__list">
{% for news in employee_view.news_links %}
<li class="employee-card__list-item">
{% if news.published_display %}<div class="employee-section__meta"><span class="employee-section__meta-item">{{ news.published_display }}</span></div>{% endif %}
{% if news.url %}
<a class="admin__link" href="{{ news.url }}">{{ news.title }}</a>
{% else %}
{{ news.title }}
{% endif %}
{% if news.summary %}<div class="employee-section__text">{{ news.summary }}</div>{% endif %}
</li>
{% endfor %}
</ul>
</section>
{% endif %}
<section class="employee-card__section">
<h3 class="employee-section__title">Разделы профиля</h3>
{% if employee_view.sections %}

View File

@@ -1,3 +1,3 @@
-APP_VERSION = "0.6.0"
-FRONTEND_VERSION = "0.6.0"
-BACKEND_VERSION = "0.6.0"
+APP_VERSION = "0.7.0"
+FRONTEND_VERSION = "0.7.0"
+BACKEND_VERSION = "0.7.0"

View File

@@ -0,0 +1,39 @@
CREATE TABLE IF NOT EXISTS employee_publications (
    id SERIAL PRIMARY KEY,
    employee_id INTEGER NOT NULL REFERENCES employees(id) ON DELETE CASCADE,
    publication_id VARCHAR(64),
    title TEXT NOT NULL,
    year INTEGER,
    publication_type VARCHAR(64),
    language VARCHAR(16),
    status INTEGER,
    url TEXT,
    doi_url TEXT,
    other_url TEXT,
    document_url TEXT,
    citation_text TEXT,
    annotation JSONB,
    description JSONB,
    authors JSONB,
    raw_data JSONB,
    source_hash VARCHAR(64) NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    CONSTRAINT uq_employee_publications_employee_publication UNIQUE (employee_id, publication_id),
    CONSTRAINT uq_employee_publications_employee_source_hash UNIQUE (employee_id, source_hash)
);

CREATE INDEX IF NOT EXISTS ix_employee_publications_employee_id
    ON employee_publications (employee_id);

CREATE INDEX IF NOT EXISTS ix_employee_publications_publication_id
    ON employee_publications (publication_id);

CREATE INDEX IF NOT EXISTS ix_employee_publications_doi_url
    ON employee_publications (doi_url);

CREATE INDEX IF NOT EXISTS ix_employee_publications_year
    ON employee_publications (year);

CREATE INDEX IF NOT EXISTS ix_employee_publications_publication_type
    ON employee_publications (publication_type);

View File

@@ -0,0 +1,27 @@
CREATE TABLE IF NOT EXISTS employee_news_links (
    id SERIAL PRIMARY KEY,
    employee_id INTEGER NOT NULL REFERENCES employees(id) ON DELETE CASCADE,
    title TEXT NOT NULL,
    url TEXT,
    summary TEXT,
    published_at TIMESTAMPTZ,
    published_year INTEGER,
    source_hash VARCHAR(64) NOT NULL,
    raw_data JSONB,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    CONSTRAINT uq_employee_news_links_employee_url UNIQUE (employee_id, url),
    CONSTRAINT uq_employee_news_links_employee_source_hash UNIQUE (employee_id, source_hash)
);

CREATE INDEX IF NOT EXISTS ix_employee_news_links_employee_id
    ON employee_news_links (employee_id);

CREATE INDEX IF NOT EXISTS ix_employee_news_links_url
    ON employee_news_links (url);

CREATE INDEX IF NOT EXISTS ix_employee_news_links_published_at
    ON employee_news_links (published_at);

CREATE INDEX IF NOT EXISTS ix_employee_news_links_published_year
    ON employee_news_links (published_year);
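Because `url` is nullable, the `UNIQUE (employee_id, url)` constraint does not stop one employee from having several news rows without a link: PostgreSQL by default treats NULLs as distinct in unique constraints. The sketch below shows the same behavior with SQLite, which shares that rule, using a simplified copy of the table. It is an illustration with the standard-library `sqlite3` module, not a test against PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the employee_news_links DDL above (SQLite types, fewer columns).
conn.execute(
    """
    CREATE TABLE employee_news_links (
        id INTEGER PRIMARY KEY,
        employee_id INTEGER NOT NULL,
        title TEXT NOT NULL,
        url TEXT,
        source_hash VARCHAR(64) NOT NULL,
        UNIQUE (employee_id, url),
        UNIQUE (employee_id, source_hash)
    )
    """
)
insert = "INSERT INTO employee_news_links (employee_id, title, url, source_hash) VALUES (?, ?, ?, ?)"

# Two rows without a URL are accepted because NULL values never compare equal.
conn.execute(insert, (1, "No link A", None, "a" * 64))
conn.execute(insert, (1, "No link B", None, "b" * 64))
conn.execute(insert, (1, "Linked", "https://example.test/news", "c" * 64))

duplicate_rejected = False
try:
    # Same employee and same URL: this violates UNIQUE (employee_id, url).
    conn.execute(insert, (1, "Linked again", "https://example.test/news", "d" * 64))
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM employee_news_links").fetchone()[0]
```

For rows without a URL, deduplication therefore rests on the `(employee_id, source_hash)` constraint alone.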

View File

@@ -1,6 +1,6 @@
[project]
name = "miem-workers"
-version = "0.6.0"
+version = "0.7.0"
description = "MIEM employees parser, admin API, and MCP server"
requires-python = ">=3.11"
dependencies = [

View File

@@ -1,6 +1,6 @@
from datetime import datetime, timezone
-from app.models import CrawlError, CrawlRun, CrawlRunEmployeeChange, Employee
+from app.models import CrawlError, CrawlRun, CrawlRunEmployeeChange, Employee, EmployeeNewsLink
from app.services.admin_data import (
employee_detail_payload,
employee_display_payload,
@@ -35,6 +35,7 @@ def test_employee_display_payload_extracts_common_fields(db_session):
"sections": [
{"type": "publications", "publications": [{"title": "Paper"}]},
{"type": "courses_by_year", "courses": [{"title": "Course"}]},
{"type": "news", "news_links": [{"title": "News", "url": "https://example.test/news"}]},
],
},
)
@@ -46,6 +47,7 @@ def test_employee_display_payload_extracts_common_fields(db_session):
assert payload["email_text"] == "person@hse.ru"
assert payload["publications_count"] == 1
assert payload["courses_count"] == 1
assert payload["news_count"] == 1
assert payload["first_seen_display"] != "Не указано"
@@ -104,6 +106,19 @@ def test_employee_detail_payload_normalizes_human_readable_sections(db_session):
"type": "generic",
"raw_text": "Fallback text",
},
{
"title": "В новостях",
"type": "news",
"news_links": [
{
"title": "News title",
"url": "https://example.test/news",
"summary": "News summary",
"published_at": "2026-04-28T00:00:00+00:00",
"published_year": 2026,
}
],
},
],
},
)
@@ -118,6 +133,41 @@ def test_employee_detail_payload_normalizes_human_readable_sections(db_session):
assert payload["sections"][2]["courses"][0]["title"] == "Course"
assert payload["sections"][3]["theses"][0]["student"] == "Student Name"
assert payload["sections"][4]["paragraphs"] == ["Fallback text"]
assert payload["sections"][5]["news_links"][0]["title"] == "News title"
assert payload["news_links"][0]["published_display"] == "28.04.2026"
def test_employee_payload_prefers_stored_news_links(db_session):
employee = Employee(
profile_key="staff:news",
canonical_url="https://www.hse.ru/staff/news",
full_name="News Person",
status="active",
first_seen_at=datetime.now(timezone.utc),
last_seen_at=datetime.now(timezone.utc),
current_data={"sections": [{"type": "news", "news_links": [{"title": "Old news"}]}]},
)
db_session.add(employee)
db_session.commit()
db_session.add(
EmployeeNewsLink(
employee_id=employee.id,
title="Stored news",
url="https://example.test/stored",
summary="Stored summary",
published_at=datetime(2026, 4, 28, tzinfo=timezone.utc),
published_year=2026,
source_hash="b" * 64,
)
)
db_session.commit()
display = employee_display_payload(employee)
detail = employee_detail_payload(employee)
assert display["news_count"] == 1
assert detail["news_links"][0]["title"] == "Stored news"
assert detail["news_links"][0]["published_display"] == "28.04.2026"
def test_employee_payloads_tolerate_malformed_current_data(db_session):

View File

@@ -22,6 +22,8 @@ def test_directory_template_is_russian_and_uses_display_dates():
assert "На странице: {{ value }}" in template
assert "{% for value in [25, 50, 100] %}" in template
assert "Найдено:" in template
assert "Новости" in template
assert "employee.news_count" in template
assert "employee.first_seen_display" in template
assert "employee.last_seen_display" in template
assert "employee.dismissed_display" in template

View File

@@ -10,7 +10,7 @@ from sqlalchemy.pool import StaticPool
from app.config import Settings, get_settings
from app.db import Base, get_db
from app.main import app
-from app.models import CrawlRun, CrawlRunEmployeeChange, Employee
+from app.models import CrawlRun, CrawlRunEmployeeChange, Employee, EmployeePublication
from app.security import SESSION_COOKIE, sign_session
@@ -20,7 +20,7 @@ def test_health_returns_versions():
response = client.get("/api/health")
assert response.status_code == 200
-    assert response.json()["backend_version"] == "0.6.0"
+    assert response.json()["backend_version"] == "0.7.0"
def test_mcp_lists_tools_without_auth_and_ignores_auth_header():
@@ -154,13 +154,115 @@ def test_mcp_service_info_returns_tools_and_dataset_hash():
assert response.status_code == 200
payload = json.loads(response.json()["result"]["content"][0]["text"])
assert payload["service_name"] == "miem-employees"
-    assert payload["backend_version"] == "0.6.0"
+    assert payload["backend_version"] == "0.7.0"
assert payload["dataset"]["hash"]
assert any(tool["name"] == "sync_employees" for tool in payload["tools"])
app.dependency_overrides.clear()
def test_mcp_list_employee_publications_prefers_stored_publications_with_fallback():
engine = create_engine(
"sqlite:///:memory:",
connect_args={"check_same_thread": False},
poolclass=StaticPool,
)
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)
session = Session()
stored_employee = Employee(
profile_key="staff:stored",
profile_type="staff",
profile_id="stored",
canonical_url="https://www.hse.ru/staff/stored",
full_name="Stored Person",
status="active",
current_data={
"sections": [
{
"type": "publications",
"publications": [{"title": "Old JSON Publication", "url": "https://example.test/old"}],
}
]
},
)
fallback_employee = Employee(
profile_key="staff:fallback",
profile_type="staff",
profile_id="fallback",
canonical_url="https://www.hse.ru/staff/fallback",
full_name="Fallback Person",
status="active",
current_data={
"sections": [
{
"type": "publications",
"publications": [{"title": "Fallback Publication", "url": "https://example.test/fallback"}],
}
]
},
)
session.add_all([stored_employee, fallback_employee])
session.commit()
session.add(
EmployeePublication(
employee_id=stored_employee.id,
publication_id="pub-1",
title="Stored Publication",
year=2024,
publication_type="ARTICLE",
url="https://publications.hse.ru/view/pub-1",
doi_url="https://doi.org/10.1/test",
citation_text="Stored Citation",
annotation={"ru": "Аннотация", "en": "Abstract"},
description={"main": "Stored Citation"},
authors=[{"id": "1", "title_ru": "Автор", "is_current_employee": True}],
source_hash="a" * 64,
)
)
session.commit()
session.close()
def override_db():
db = Session()
try:
yield db
finally:
db.close()
app.dependency_overrides[get_db] = override_db
client = TestClient(app)
stored_response = client.post(
"/mcp",
json={
"jsonrpc": "2.0",
"id": 1,
"method": "tools/call",
"params": {"name": "list_employee_publications", "arguments": {"profile_id_or_url": "stored"}},
},
)
fallback_response = client.post(
"/mcp",
json={
"jsonrpc": "2.0",
"id": 2,
"method": "tools/call",
"params": {"name": "list_employee_publications", "arguments": {"profile_id_or_url": "fallback"}},
},
)
stored_payload = json.loads(stored_response.json()["result"]["content"][0]["text"])
fallback_payload = json.loads(fallback_response.json()["result"]["content"][0]["text"])
assert stored_payload["items"][0]["title"] == "Stored Publication"
assert stored_payload["items"][0]["doi_url"] == "https://doi.org/10.1/test"
assert stored_payload["items"][0]["annotation"] == {"ru": "Аннотация", "en": "Abstract"}
assert stored_payload["items"][0]["authors"] == [{"id": "1", "title_ru": "Автор", "is_current_employee": True}]
assert fallback_payload["items"][0]["title"] == "Fallback Publication"
app.dependency_overrides.clear()
def test_mcp_sync_employees_full_empty_and_unknown_hash_modes():
engine = create_engine(
"sqlite:///:memory:",

View File

@@ -1,7 +1,16 @@
import gzip
from datetime import datetime, timezone
-from app.models import CrawlRun, CrawlRunEmployeeChange, Employee, EmployeeSnapshot, ParseResourceCache
+from app.models import (
+    CrawlError,
+    CrawlRun,
+    CrawlRunEmployeeChange,
+    Employee,
+    EmployeeNewsLink,
+    EmployeePublication,
+    EmployeeSnapshot,
+    ParseResourceCache,
+)
from app.services.crawler import _checksum, _mark_dismissed, _upsert_employee
from app.services.resource_cache import ResourceCache
@@ -191,6 +200,106 @@ def test_upsert_employee_skips_snapshot_when_checksum_is_unchanged(db_session):
assert db_session.query(EmployeeSnapshot).count() == 1
def test_upsert_employee_saves_publications_and_reuses_existing_rows(db_session):
first_run = CrawlRun(source_url="https://miem.hse.ru/persons", status="running")
second_run = CrawlRun(source_url="https://miem.hse.ru/persons", status="running")
db_session.add_all([first_run, second_run])
db_session.commit()
parsed = _parsed_employee("published")
parsed["sections"] = [
{
"type": "publications",
"publications": [
{
"id": "888959076",
"publication_id": "888959076",
"title": "Detailed Publication",
"year": 2023,
"publication_type": "ARTICLE",
"language": "ru",
"status": 1,
"url": "https://publications.hse.ru/view/888959076",
"doi_url": "https://doi.org/10.1/test",
"citation_text": "Detailed citation",
"annotation": {"ru": "Аннотация"},
"description": {"main": "Detailed citation"},
"authors": [{"id": "1", "title_ru": "Автор"}],
"raw_data": {"id": "888959076", "title": "Detailed Publication"},
}
],
}
]
employee, _ = _upsert_employee(db_session, first_run, parsed)
db_session.commit()
_upsert_employee(db_session, second_run, _parsed_employee_with_publication("published"))
db_session.commit()
publications = db_session.query(EmployeePublication).filter_by(employee_id=employee.id).all()
assert len(publications) == 1
assert publications[0].doi_url == "https://doi.org/10.1/test"
assert publications[0].authors == [{"id": "1", "title_ru": "Автор"}]
def test_upsert_employee_records_publication_errors_without_failing_employee(monkeypatch, db_session):
run = CrawlRun(source_url="https://miem.hse.ru/persons", status="running")
db_session.add(run)
db_session.commit()
def broken_sync(*_args, **_kwargs):
raise RuntimeError("boom")
monkeypatch.setattr("app.services.crawler._sync_employee_publications", broken_sync)
employee, changed = _upsert_employee(db_session, run, _parsed_employee_with_publication("error-safe"))
db_session.commit()
assert changed is True
assert employee.full_name == "Same Person"
assert db_session.query(Employee).filter_by(profile_key="staff:error-safe").one()
error = db_session.query(CrawlError).one()
assert "публикации" in error.message.lower()
def test_upsert_employee_saves_news_links_and_reuses_existing_rows(db_session):
first_run = CrawlRun(source_url="https://miem.hse.ru/persons", status="running")
second_run = CrawlRun(source_url="https://miem.hse.ru/persons", status="running")
db_session.add_all([first_run, second_run])
db_session.commit()
employee, _ = _upsert_employee(db_session, first_run, _parsed_employee_with_news("news-person"))
db_session.commit()
_upsert_employee(db_session, second_run, _parsed_employee_with_news("news-person"))
db_session.commit()
news_links = db_session.query(EmployeeNewsLink).filter_by(employee_id=employee.id).all()
assert len(news_links) == 1
assert news_links[0].title == "News Title"
assert news_links[0].url == "https://www.hse.ru/news/1.html"
assert news_links[0].published_year == 2026
def test_upsert_employee_records_news_errors_without_failing_employee(monkeypatch, db_session):
run = CrawlRun(source_url="https://miem.hse.ru/persons", status="running")
db_session.add(run)
db_session.commit()
def broken_sync(*_args, **_kwargs):
raise RuntimeError("boom")
monkeypatch.setattr("app.services.crawler._sync_employee_news_links", broken_sync)
employee, changed = _upsert_employee(db_session, run, _parsed_employee_with_news("news-error-safe"))
db_session.commit()
assert changed is True
assert employee.full_name == "Same Person"
assert db_session.query(Employee).filter_by(profile_key="staff:news-error-safe").one()
error = db_session.query(CrawlError).one()
assert "новости" in error.message.lower()
def test_checksum_changes_when_widget_data_changes():
base = _parsed_employee("widgets")
changed = _parsed_employee("widgets")
@@ -224,3 +333,51 @@ def _parsed_employee(profile_id: str) -> dict:
"parser_version": "0.6.0",
"_html": "<html></html>",
}
def _parsed_employee_with_publication(profile_id: str) -> dict:
parsed = _parsed_employee(profile_id)
parsed["sections"] = [
{
"type": "publications",
"publications": [
{
"id": "888959076",
"publication_id": "888959076",
"title": "Detailed Publication",
"year": 2023,
"publication_type": "ARTICLE",
"language": "ru",
"status": 1,
"url": "https://publications.hse.ru/view/888959076",
"doi_url": "https://doi.org/10.1/test",
"citation_text": "Detailed citation",
"annotation": {"ru": "Аннотация"},
"description": {"main": "Detailed citation"},
"authors": [{"id": "1", "title_ru": "Автор"}],
"raw_data": {"id": "888959076", "title": "Detailed Publication"},
}
],
}
]
return parsed
def _parsed_employee_with_news(profile_id: str) -> dict:
parsed = _parsed_employee(profile_id)
parsed["sections"] = [
{
"type": "news",
"news_links": [
{
"title": "News Title",
"url": "https://www.hse.ru/news/1.html",
"summary": "News summary",
"published_at": "2026-04-28T00:00:00+00:00",
"published_year": 2026,
"raw_data": {"title": "News Title", "url": "https://www.hse.ru/news/1.html"},
}
],
}
]
return parsed

tests/test_db_schema.py (new file, 115 lines)
View File

@@ -0,0 +1,115 @@
from sqlalchemy import create_engine, inspect, text
from app.db import _ensure_runtime_schema
def test_runtime_schema_adds_skipped_count_to_existing_crawl_runs_table(monkeypatch):
engine = create_engine("sqlite:///:memory:")
with engine.begin() as connection:
connection.execute(
text(
"""
CREATE TABLE crawl_runs (
id INTEGER PRIMARY KEY,
source_url TEXT NOT NULL,
status VARCHAR(32) NOT NULL DEFAULT 'running',
found_count INTEGER NOT NULL DEFAULT 0,
parsed_count INTEGER NOT NULL DEFAULT 0
)
"""
)
)
monkeypatch.setattr("app.db.engine", engine)
_ensure_runtime_schema()
columns = {column["name"] for column in inspect(engine).get_columns("crawl_runs")}
assert "skipped_count" in columns
def test_runtime_schema_creates_employee_publications_table_when_employees_exist(monkeypatch):
engine = create_engine("sqlite:///:memory:")
with engine.begin() as connection:
connection.execute(
text(
"""
CREATE TABLE employees (
id INTEGER PRIMARY KEY,
profile_key VARCHAR(255) NOT NULL UNIQUE,
canonical_url TEXT NOT NULL,
status VARCHAR(32) NOT NULL DEFAULT 'active',
first_seen_at DATETIME NOT NULL,
last_seen_at DATETIME NOT NULL,
created_at DATETIME NOT NULL,
updated_at DATETIME NOT NULL
)
"""
)
)
connection.execute(
text(
"""
CREATE TABLE crawl_runs (
id INTEGER PRIMARY KEY,
source_url TEXT NOT NULL,
status VARCHAR(32) NOT NULL DEFAULT 'running',
found_count INTEGER NOT NULL DEFAULT 0,
parsed_count INTEGER NOT NULL DEFAULT 0,
skipped_count INTEGER NOT NULL DEFAULT 0
)
"""
)
)
monkeypatch.setattr("app.db.engine", engine)
_ensure_runtime_schema()
_ensure_runtime_schema()
inspector = inspect(engine)
assert "employee_publications" in inspector.get_table_names()
columns = {column["name"] for column in inspector.get_columns("employee_publications")}
assert {"employee_id", "publication_id", "doi_url", "authors", "raw_data", "source_hash"}.issubset(columns)
def test_runtime_schema_creates_employee_news_links_table_when_employees_exist(monkeypatch):
engine = create_engine("sqlite:///:memory:")
with engine.begin() as connection:
connection.execute(
text(
"""
CREATE TABLE employees (
id INTEGER PRIMARY KEY,
profile_key VARCHAR(255) NOT NULL UNIQUE,
canonical_url TEXT NOT NULL,
status VARCHAR(32) NOT NULL DEFAULT 'active',
first_seen_at DATETIME NOT NULL,
last_seen_at DATETIME NOT NULL,
created_at DATETIME NOT NULL,
updated_at DATETIME NOT NULL
)
"""
)
)
connection.execute(
text(
"""
CREATE TABLE crawl_runs (
id INTEGER PRIMARY KEY,
source_url TEXT NOT NULL,
status VARCHAR(32) NOT NULL DEFAULT 'running',
found_count INTEGER NOT NULL DEFAULT 0,
parsed_count INTEGER NOT NULL DEFAULT 0,
skipped_count INTEGER NOT NULL DEFAULT 0
)
"""
)
)
monkeypatch.setattr("app.db.engine", engine)
_ensure_runtime_schema()
_ensure_runtime_schema()
inspector = inspect(engine)
assert "employee_news_links" in inspector.get_table_names()
columns = {column["name"] for column in inspector.get_columns("employee_news_links")}
assert {"employee_id", "title", "url", "summary", "published_at", "published_year", "source_hash", "raw_data"}.issubset(columns)

View File

@@ -13,6 +13,9 @@ def test_employee_detail_template_is_human_readable():
assert "section.list_items" in template
assert "Основная информация" in template
assert "Контакты" in template
assert "В новостях" in template
assert "employee_view.news_links" in template
assert "news.summary" in template
assert "Разделы профиля" in template
assert "graduation_theses" in template
assert "Год защиты" in template

View File

@@ -34,7 +34,21 @@ class FakeSession:
                        "type": "ARTICLE",
                        "title": "Дублирование пакетов",
                        "year": 2023,
                        "language": {"name": "ru"},
                        "status": 1,
                        "authorsByType": {
                            "author": [
                                {
                                    "id": "568398853",
                                    "href": "/org/persons/568398853",
                                    "title": {"ru": "Левицкий И. А.", "en": ""},
                                    "reverseTitle": {"ru": "И. А. Левицкий", "en": ""},
                                }
                            ]
                        },
                        "description": {"short": {"ru": "Информационные процессы. 2023."}},
                        "annotation": {"ru": "<p>Русская аннотация</p>"},
                        "documents": {"DOI": {"href": "https://doi.org/10.1/test"}},
                    }
                ],
            },
@@ -153,6 +167,9 @@ def test_enrich_sections_from_hse_widgets_loads_publications_and_vkr():
    assert publications["publications_count"] == 1
    assert publications["publications"][0]["url"] == "https://publications.hse.ru/view/888959076"
    assert publications["publications"][0]["doi_url"] == "https://doi.org/10.1/test"
    assert publications["publications"][0]["annotation"] == {"ru": "Русская аннотация"}
    assert publications["publications"][0]["authors"][0]["is_current_employee"] is True
    assert theses["theses_count"] == 1
    assert theses["theses"][0]["student"] == "Лесняк Владислав Евгеньевич"
    assert theses["theses"][0]["project_url"] == "https://www.hse.ru/edu/vkr/1045750164"
@@ -215,3 +232,45 @@ def test_news_heading_with_publications_word_does_not_absorb_widget_publications
    assert len(publications) == 1
    assert publications[0]["title"] == "Публикации и исследования"
    assert publications[0]["publications_count"] == 1


def test_extract_sections_parses_employee_news_links():
    soup = BeautifulSoup(
        """
        <div class="b-person-data posts hidden printable" data-tab="press_links_news" tab-node="press_links_news">
          <div class="post f8">
            <div class="post__extra">
              <div class="post-meta">
                <div class="post-meta__date">
                  <div class="post-meta__day">28</div>
                  <div class="post-meta__month">апр.</div>
                  <div class="post-meta__year">2026</div>
                </div>
              </div>
            </div>
            <div class="post__content">
              <h2 class="first_child"><a class="link" href="/news/edu/1153850518.html">Как финал ВсОШ формирует кадры</a></h2>
              <div class="post__text"><p class="with-indent">Краткое описание новости.</p></div>
            </div>
          </div>
          <div class="post f8">
            <div class="post__content">
              <h2><a href="https://miem.hse.ru/news/1123589375.html">Партнер магистратуры</a></h2>
            </div>
          </div>
        </div>
        """,
        "html.parser",
    )

    sections = extract_sections(soup, "https://www.hse.ru/staff/avsergeev")

    assert len(sections) == 1
    news = sections[0]
    assert news["type"] == "news"
    assert news["news_count"] == 2
    assert news["news_links"][0]["title"] == "Как финал ВсОШ формирует кадры"
    assert news["news_links"][0]["url"] == "https://www.hse.ru/news/edu/1153850518.html"
    assert news["news_links"][0]["summary"] == "Краткое описание новости."
    assert news["news_links"][0]["published_at"] == "2026-04-28T00:00:00+00:00"
    assert news["news_links"][0]["published_year"] == 2026