Cron job for an Anthropic knowledge base via RSS and sitemap

Date: 2026-05-04 03:59 UTC

Key takeaway

A promising future scenario for Trip2G: build a dedicated knowledge base on Anthropic and give an agent a cron task that regularly checks for new posts, extracts them as markdown, stores the raw sources, updates the LLM Wiki pages and syncs the knowledge base.

Recommended pipeline:

RSS/sitemap watcher
  → URL diff
  → URL-to-markdown extractor
  → raw source storage
  → LLM Wiki update pass
  → index.md/log.md update
  → Trip2G sync
  → MCP search verification

Key point: use RSS as the signal for new posts, and the sitemap as the canonical backfill and as a check for missed posts.

What has been checked so far

Official Anthropic RSS

No official RSS feed was found for the news, research or engineering sections of www.anthropic.com.

URLs checked:

https://www.anthropic.com/rss.xml
https://www.anthropic.com/feed.xml
https://www.anthropic.com/news/rss.xml
https://www.anthropic.com/news/feed.xml

Result:

404 Not Found

The news, research and engineering pages do not expose an RSS/Atom discovery link either.

Official sitemap

Working source:

https://www.anthropic.com/sitemap.xml

It contains URLs and lastmod values. The content knowledge base needs these sections:

/news/
/research/
/engineering/

An earlier survey found approximately:

/news/       ~205 URL
/research/   ~115 URL
/engineering/ ~24 URL
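
The section filter can be sketched in a few lines of Python. This is a minimal sketch, not a tested integration: it assumes the sitemap is a flat urlset in the standard sitemaps.org namespace (not a sitemap index), and the names scoped_entries, SECTIONS and BASE_URL are hypothetical.

```python
import xml.etree.ElementTree as ET

# Standard sitemaps.org namespace; assumed, not confirmed against the live file.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
BASE_URL = "https://www.anthropic.com/"
SECTIONS = ("news", "research", "engineering")


def scoped_entries(sitemap_xml):
    """Map each in-scope article URL to its section and lastmod value."""
    root = ET.fromstring(sitemap_xml)
    entries = {}
    for node in root.findall("sm:url", SITEMAP_NS):
        loc = (node.findtext("sm:loc", default="", namespaces=SITEMAP_NS) or "").strip()
        lastmod = node.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        for section in SECTIONS:
            prefix = f"{BASE_URL}{section}/"
            # Require a slug after the prefix so section index pages are skipped.
            if loc.startswith(prefix) and len(loc) > len(prefix):
                entries[loc] = {
                    "section": section,
                    "lastmod": lastmod.strip() if lastmod else None,
                }
    return entries
```

The function takes the sitemap text as input, so fetching stays separate and the filter can be checked offline against a saved copy.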

Unofficial RSS

A GitHub project was found:

https://github.com/taobojlen/anthropic-rss-feed

Feeds:

https://raw.githubusercontent.com/taobojlen/anthropic-rss-feed/main/anthropic_news_rss.xml
https://raw.githubusercontent.com/taobojlen/anthropic-rss-feed/main/anthropic_engineering_rss.xml

The project README states:

Unofficial RSS feeds for Anthropic's website, updated every 6 hours via GitHub Actions.

Limitations:

unofficial;
no separate research feed;
the feed provides title/link/date/description but not the full text;
the article body still has to be fetched and extracted.
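
Reading such a feed as a signal could look like the sketch below. It assumes the feeds are plain RSS 2.0 with item, title, link and pubDate elements, which has not been verified here; feed_items is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def feed_items(rss_xml):
    """Return title/link/published for each <item>; the body is not in the feed."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        if not link:
            continue  # an item without a link cannot be ingested
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": link,
            "published": (item.findtext("pubDate") or "").strip(),
        })
    return items
```

The result is only a list of candidate URLs; each link still goes through the extractor before anything is stored.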

URL → Markdown extractor

A quick path via Jina Reader was tested:

https://r.jina.ai/https://www.anthropic.com/engineering/building-effective-agents

Result: Jina returned clean markdown with the title, source URL and content.

Example of the start of the response:

Title: Building Effective AI Agents

URL Source: https://www.anthropic.com/engineering/building-effective-agents

Markdown Content:
Over the past year, we've worked with dozens of teams building large language model...

Conclusion:

For the first cron prototype, Jina Reader can serve as the URL-to-markdown service.
For production, it is better to have a fallback extractor: trafilatura/readability/Playwright.
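
Splitting a Reader response into its parts could be done roughly as follows. The sketch relies only on the three labels visible in the sample above (Title, URL Source, Markdown Content) and assumes other pages come back in the same layout; parse_reader_response is a hypothetical helper, not part of any Jina client.

```python
def parse_reader_response(text):
    """Split a Reader response into title, source URL and markdown body."""
    head, marker, body = text.partition("Markdown Content:")
    if not marker:
        raise ValueError("no 'Markdown Content:' marker in response")
    if not body.strip():
        raise ValueError("empty markdown body")

    meta = {}
    for line in head.splitlines():
        # Split on the first colon only, so URLs in the value stay intact.
        key, colon, value = line.partition(":")
        if colon and value.strip():
            meta[key.strip()] = value.strip()

    return {
        "title": meta.get("Title", ""),
        "source_url": meta.get("URL Source", ""),
        "markdown": body.strip(),
    }
```

Raising on a missing marker or an empty body gives the caller a clear signal to switch to a fallback extractor.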

Knowledge base structure

anthropic-kb/
├── AGENTS.md
├── SCHEMA.md
├── _mcp_initialize.md
├── index.md
├── log.md
├── _meta/
│   ├── anthropic-source-index.json
│   ├── ingestion-policy.md
│   └── source-quality.md
├── raw/
│   ├── news/
│   ├── research/
│   └── engineering/
├── entities/
│   ├── anthropic.md
│   ├── claude.md
│   ├── claude-code.md
│   └── constitutional-ai.md
├── concepts/
│   ├── building-effective-agents.md
│   ├── contextual-retrieval.md
│   ├── model-context-protocol.md
│   ├── responsible-scaling-policy.md
│   └── agentic-misalignment.md
├── timelines/
│   ├── claude-releases.md
│   └── safety-policy-updates.md
├── comparisons/
└── queries/
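
A deterministic mapping from article URL to raw file path helps reruns overwrite the same file instead of creating duplicates. The sketch below assumes a `<slug>.md` naming convention, which this note does not fix anywhere; raw_path and its root argument are hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

SECTIONS = ("news", "research", "engineering")


def raw_path(url, root="anthropic-kb"):
    """Map an article URL to its file under raw/<section>/ (assumed naming)."""
    parts = [part for part in urlparse(url).path.split("/") if part]
    if len(parts) < 2 or parts[0] not in SECTIONS:
        raise ValueError(f"URL is out of scope: {url}")
    section = parts[0]
    # Join nested path segments so the slug is always a single file name.
    slug = "-".join(parts[1:])
    return PurePosixPath(root, "raw", section, f"{slug}.md")
```

For example, the building-effective-agents article would land at anthropic-kb/raw/engineering/building-effective-agents.md.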

MCP instruction files

`AGENTS.md`

---
free: true
mcp_method: instructions
mcp_description: "How agents should read, cite and update the Anthropic public knowledge base."
---

It should explain:

read index.md first;
for recent facts, check raw/ and the source URL;
distinguish official Anthropic content from agent synthesis;
do not mix docs/API knowledge with blog/research knowledge;
update log.md after every ingest/update;
cite both the raw source and the synthesized page.

`SCHEMA.md`

---
free: true
mcp_method: schema
mcp_description: "Schema for the Anthropic LLM Wiki: source sections, page types, frontmatter and update rules."
---

It should define:

page types: raw_source, entity, concept, timeline, comparison, query;
required frontmatter;
source sections: news/research/engineering/docs;
confidence rules;
contradiction/update rules;
naming conventions.

`_mcp_initialize.md`

---
free: true
mcp_method: initialize
mcp_description: "Start a session with the Anthropic knowledge base."
---

Startup protocol:

1. Read index.md.
2. Read latest log.md entries.
3. Call instructions().
4. Call schema() before writing.
5. Use search for broad questions.
6. Verify current claims against raw sources.

Raw article frontmatter

---
title: Building Effective AI Agents
source_url: https://www.anthropic.com/engineering/building-effective-agents
source_domain: anthropic.com
source_section: engineering
published: 2024-12-19
lastmod: 2026-04-13T17:46:47.000Z
ingested: 2026-05-04
extractor: jina-reader
sha256: <body-hash>
type: raw_source
---
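
Producing this frontmatter together with the body hash might look like the sketch below. Hashing the stripped body and quoting only risky values are assumptions of the sketch, not rules from SCHEMA.md; body_sha256, yaml_scalar and raw_source_page are hypothetical names.

```python
import hashlib
import json


def body_sha256(markdown):
    """Hash the stripped body so trailing whitespace does not look like a change."""
    return hashlib.sha256(markdown.strip().encode("utf-8")).hexdigest()


def yaml_scalar(value):
    """Quote values a YAML parser could misread; leave simple ones bare."""
    text = str(value)
    risky = ": " in text or " #" in text or text[:1] in "'\"[{&*!|>%@`#"
    # A JSON string literal is also a valid YAML double-quoted scalar.
    return json.dumps(text) if risky else text


def raw_source_page(meta, markdown):
    """Render frontmatter (meta plus sha256 and type) followed by the body."""
    fields = dict(meta)
    fields["sha256"] = body_sha256(markdown)
    fields["type"] = "raw_source"
    header = [f"{key}: {yaml_scalar(value)}" for key, value in fields.items()]
    return "\n".join(["---", *header, "---", "", markdown.strip(), ""])
```

The caller passes title, source_url, published and the other fields in the order they should appear; sha256 and type are appended last, as in the sample above.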

Cron task prompt for the agent

You maintain the Anthropic public knowledge base in Trip2G.

Goal:
Keep the Anthropic LLM Wiki up to date from public Anthropic content.

Sources:
- Canonical backfill/reconciliation: https://www.anthropic.com/sitemap.xml
- Optional RSS signal:
  - https://raw.githubusercontent.com/taobojlen/anthropic-rss-feed/main/anthropic_news_rss.xml
  - https://raw.githubusercontent.com/taobojlen/anthropic-rss-feed/main/anthropic_engineering_rss.xml

Scope:
Only ingest URLs under:
- https://www.anthropic.com/news/
- https://www.anthropic.com/research/
- https://www.anthropic.com/engineering/

Process:
1. Read AGENTS.md, SCHEMA.md, index.md and last 30 lines of log.md.
2. Load _meta/anthropic-source-index.json if it exists.
3. Fetch RSS feeds if available.
4. Fetch sitemap.xml and filter scoped URLs.
5. Detect new or changed URLs by URL + lastmod + sha256.
6. For each new/changed URL:
   - convert URL to markdown using Jina Reader first;
   - if Jina fails, fallback to HTML extraction;
   - save raw markdown under raw/news, raw/research or raw/engineering;
   - add raw source frontmatter;
   - update source index.
7. For each ingested source:
   - decide whether it creates/updates entity/concept/timeline pages;
   - do not create pages for passing mentions;
   - update existing pages before creating duplicates;
   - add wikilinks;
   - update confidence/provenance.
8. Update index.md.
9. Append log.md entry with URLs ingested, pages created/updated, errors.
10. Sync Trip2G.
11. Verify MCP search can find at least one newly ingested page.
12. Final response: summarize new URLs, updated pages, skipped URLs, errors, sync status.

Safety:
- Do not ingest unrelated Anthropic pages like careers/legal unless explicitly requested.
- Do not claim a source changed unless lastmod or sha256 changed.
- Do not overwrite raw sources without preserving previous hash in log.
- Separate official Anthropic text from agent synthesis.
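
The change detection in step 5 and the rule about preserving the previous hash can be sketched as two stages: lastmod decides what to fetch, the body hash decides what actually changed. The index layout used here ({url: {"lastmod": ..., "sha256": ...}}) is an assumption, since the format of anthropic-source-index.json is not defined in this note, and all function names are hypothetical.

```python
def urls_to_fetch(sitemap_entries, index):
    """Stage 1: URLs that are new or whose lastmod differs from the index."""
    return [
        url
        for url, entry in sitemap_entries.items()
        if url not in index or index[url].get("lastmod") != entry.get("lastmod")
    ]


def classify(url, lastmod, body_hash, index):
    """Stage 2: label a fetched URL as new, changed or unchanged."""
    known = index.get(url)
    if known is None:
        return "new"
    if known.get("sha256") != body_hash or known.get("lastmod") != lastmod:
        return "changed"
    return "unchanged"


def record(url, lastmod, body_hash, index):
    """Update the index and return the previous hash so it can go into log.md."""
    previous = index.get(url, {}).get("sha256")
    index[url] = {"lastmod": lastmod, "sha256": body_hash}
    return previous
```

Stage 1 never refetches a URL whose lastmod stays the same, so silent body edits would only be caught by a separate pass that rehashes everything.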

Cron schedule options

Conservative

Daily at 07:00 UTC

Good for first version.

More active

RSS check every 6 hours
Sitemap reconciliation daily
Full source drift audit weekly

Recommended start

0 7 * * *

URL-to-markdown options

Option 1 — Jina Reader

https://r.jina.ai/https://www.anthropic.com/engineering/building-effective-agents

Pros:

fast;
no browser needed;
returns markdown;
worked well on the tested Anthropic article.

Cons:

external dependency;
may cache;
may fail on some pages;
not under our control.

Option 2 — Python extractor

Potential libraries:

trafilatura
readability-lxml
beautifulsoup4
markdownify

Current environment check:

trafilatura: not installed
readability: not installed
bs4: not installed
markdownify: not installed

So if using this environment, install dependencies or vendor a small extractor.

Option 3 — Playwright/Steel/browser render

Use when:

HTML is heavily client-side rendered;
Jina fails;
article body is missing from raw HTML;
need screenshots/visual evidence.

For cron, though, browser rendering should be the fallback, not the default.
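
The ordering of the three options can be expressed as a simple fallback chain. In this sketch each extractor is just a callable that takes a URL and returns markdown; the real Jina, Python and browser extractors are not implemented here, and extract_with_fallback and the min_chars threshold are hypothetical.

```python
def extract_with_fallback(url, extractors, min_chars=200):
    """Try (name, extractor) pairs in order; return the first usable result.

    A result counts as usable when its stripped body has at least min_chars
    characters; the threshold is an arbitrary guard against empty pages.
    """
    errors = []
    for name, extract in extractors:
        try:
            markdown = extract(url)
        except Exception as exc:  # any extractor failure moves on to the next one
            errors.append(f"{name}: {exc}")
            continue
        if len(markdown.strip()) >= min_chars:
            return name, markdown
        errors.append(f"{name}: body too short")
    raise RuntimeError("all extractors failed: " + "; ".join(errors))
```

The returned name can be written into the extractor field of the raw source frontmatter, so each page records which path produced it.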

Why give the agent a cron task rather than just a feed parser

A feed parser only delivers URLs.

An agent can add a second layer:

raw article → concept/entity updates → index/log → citations → MCP-ready wiki

This is exactly what turns a stream of posts into an LLM Wiki:

RAG retrieves. LLM Wiki compounds.

Acceptance criteria for the first version

The first version is considered done when:

a separate Trip2G knowledge base or an anthropic-kb/ subfolder exists;
AGENTS.md, SCHEMA.md, _mcp_initialize.md, index.md and log.md are present;
10–20 Anthropic posts have been ingested;
raw markdown is saved with its source URL and hash;
at least 5 synthesized pages exist;
the cron task can be rerun without creating duplicates;
changes are recorded in log.md;
Trip2G sync succeeds;
MCP search finds the new knowledge base;
the agent answers the demo questions with citations.

Demo questions

What does Anthropic recommend for building effective agents?

How does Contextual Retrieval differ from a normal RAG pipeline?

What has Anthropic written about Claude Code best practices?

Which Anthropic posts are most relevant to Trip2G's MCP knowledge-base design?

What changed in Anthropic's safety policy over time?

How this relates to Trip2G positioning

This is a very strong demo case:

Trip2G can turn any public blog into an agent-readable, self-maintaining knowledge base.

For the landing page:

Give your agents a living knowledge base over Anthropic's research, engineering notes and product updates.

For the product:

Sitemap/RSS adapters → markdown raw sources → LLM Wiki → MCP → federation.

For GTM:

Not just “RSS reader”.
A compounding knowledge base that agents can query, cite and maintain.

Next step

If we decide to build a live demo:

Create anthropic-kb/ in the Trip2G vault.
Take 10–20 starting URLs from the sitemap.
Run them through Jina Reader.
Create the raw sources plus 5 concept pages.
Set up the daily cron.
Verify via MCP.