Cron for an Anthropic Knowledge Base via RSS and sitemap
Date: 2026-05-04 03:59 UTC
Key takeaway
A good future scenario for Trip2G: build a separate knowledge base on Anthropic and give the agent a cron task that regularly checks for new posts, extracts markdown, saves raw sources, updates the LLM Wiki pages, and syncs the base.
Recommended pipeline:
RSS/sitemap watcher
→ URL diff
→ URL-to-markdown extractor
→ raw source storage
→ LLM Wiki update pass
→ index.md/log.md update
→ Trip2G sync
→ MCP search verification
Key point: use RSS as the signal for new posts, and the sitemap as the canonical backfill and the check for missed posts.
What has been verified so far
Official Anthropic RSS
No official RSS feed was found for the news, research, or engineering sections of www.anthropic.com.
URLs checked:
https://www.anthropic.com/rss.xml
https://www.anthropic.com/feed.xml
https://www.anthropic.com/news/rss.xml
https://www.anthropic.com/news/feed.xml
Result:
404 Not Found
The news, research, and engineering pages also expose no RSS/Atom discovery link.
Official sitemap
Working source:
https://www.anthropic.com/sitemap.xml
It contains URLs and lastmod values. The knowledge base needs these sections:
/news/
/research/
/engineering/
The previous survey found approximately:
/news/ ~205 URL
/research/ ~115 URL
/engineering/ ~24 URL
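A minimal sketch of the sitemap filter, assuming the sitemap is a flat urlset in the standard sitemaps.org namespace (not a sitemap index). This has not been checked against the live file. The function name parse_scoped_sitemap and the constants are illustrative, not part of any existing tool.

```python
import xml.etree.ElementTree as ET

# Assumption: the sitemap uses the standard sitemaps.org 0.9 namespace.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Sections the knowledge base cares about (taken from the list above).
SCOPED_PREFIXES = (
    "https://www.anthropic.com/news/",
    "https://www.anthropic.com/research/",
    "https://www.anthropic.com/engineering/",
)


def parse_scoped_sitemap(xml_text) -> dict:
    """Return {url: lastmod or None} for URLs under the scoped sections."""
    root = ET.fromstring(xml_text)
    entries = {}
    for url_el in root.findall("sm:url", SITEMAP_NS):
        loc = (url_el.findtext("sm:loc", default="", namespaces=SITEMAP_NS) or "").strip()
        lastmod = url_el.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        if loc.startswith(SCOPED_PREFIXES):
            entries[loc] = lastmod.strip() if lastmod else None
    return entries
```

Fetching the file is deliberately left out; the function accepts whatever text or bytes the HTTP client returns, so it can be tested offline against a saved copy.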
Unofficial RSS
A GitHub project was found:
https://github.com/taobojlen/anthropic-rss-feed
Feeds:
https://raw.githubusercontent.com/taobojlen/anthropic-rss-feed/main/anthropic_news_rss.xml
https://raw.githubusercontent.com/taobojlen/anthropic-rss-feed/main/anthropic_engineering_rss.xml
The project's README states:
Unofficial RSS feeds for Anthropic's website, updated every 6 hours via GitHub Actions.
Limitations:
- unofficial;
- no separate research feed;
- the feed provides title/link/date/description but not the full text;
- the article body still has to be fetched and extracted.
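A hedged sketch of reading such a feed with the standard library only. It assumes plain RSS 2.0 (un-namespaced item elements with title, link, pubDate and description); the unofficial feeds have not been inspected for this, so the element names are assumptions. parse_rss_items is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def parse_rss_items(xml_text) -> list:
    """Return one dict per <item> with title, link, date and description."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            # Assumption: dates live in the RSS 2.0 <pubDate> element.
            "date": (item.findtext("pubDate") or "").strip(),
            "description": (item.findtext("description") or "").strip(),
        })
    return items
```

Only the link field matters for the pipeline; the rest is useful for the log entry. The article body still comes from the extractor step.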
URL → Markdown extractor
A quick path via Jina Reader was tested:
https://r.jina.ai/http://https://www.anthropic.com/engineering/building-effective-agents
Result: Jina returned clean markdown with the title, source URL, and content.
Start of the output:
Title: Building Effective AI Agents
URL Source: https://www.anthropic.com/engineering/building-effective-agents
Markdown Content:
Over the past year, we've worked with dozens of teams building large language model...
Conclusion:
For the first cron prototype, Jina Reader can serve as the URL-to-markdown service.
For production, a fallback extractor is advisable: trafilatura/readability/Playwright.
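A small sketch of splitting that output into fields, assuming every response follows the Title / URL Source / Markdown Content layout seen in the one tested article. parse_jina_output is a hypothetical helper; it does no network access and only parses text that has already been fetched.

```python
def parse_jina_output(text: str) -> dict:
    """Split Jina Reader text output into title, source URL and markdown body."""
    marker = "Markdown Content:"
    header, sep, body = text.partition(marker)
    if not sep:
        # Treat a missing marker as an extraction failure so a fallback can run.
        raise ValueError("no 'Markdown Content:' marker found")
    fields = {"title": "", "source_url": "", "markdown": body.strip()}
    for line in header.splitlines():
        if line.startswith("Title:"):
            fields["title"] = line[len("Title:"):].strip()
        elif line.startswith("URL Source:"):
            fields["source_url"] = line[len("URL Source:"):].strip()
    return fields
```

Raising on a missing marker gives the cron task a clear signal to switch to the fallback extractor instead of saving a malformed raw source.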
Recommended cron operating mode
Frequency
To start:
Once a day
Later:
RSS: every 6 hours
sitemap reconciliation: daily or weekly
Sources
Minimum:
- Anthropic sitemap.xml
Practical:
- unofficial RSS: news
- unofficial RSS: engineering
- sitemap.xml: news/research/engineering
State file
A local state file is needed, for example:
_meta/anthropic-source-index.json
Structure:
{
  "https://www.anthropic.com/engineering/building-effective-agents": {
    "lastmod": "2026-04-13T17:46:47.000Z",
    "raw_path": "raw/engineering/building-effective-agents.md",
    "sha256": "...",
    "last_ingested": "2026-05-04T03:59:00Z",
    "status": "ok"
  }
}
Why:
- to avoid downloading the same content on every run;
- to detect changed pages by lastmod or hash;
- to re-index only what changed;
- to log errors.
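The change check against the state file could look roughly like this. It is a sketch that assumes the state dict has the shape shown above; body_sha256 and classify_url are illustrative names. lastmod is compared first because it is available before any download; the hash is compared only when a freshly extracted body is at hand.

```python
import hashlib


def body_sha256(markdown: str) -> str:
    """Hash the extracted markdown body (UTF-8) for change detection."""
    return hashlib.sha256(markdown.encode("utf-8")).hexdigest()


def classify_url(state: dict, url: str, lastmod, sha256=None) -> str:
    """Return 'new', 'changed' or 'unchanged' for one URL."""
    entry = state.get(url)
    if entry is None:
        return "new"
    # A differing sitemap lastmod is enough to schedule a re-fetch.
    if lastmod is not None and entry.get("lastmod") != lastmod:
        return "changed"
    # If the body was already re-extracted, compare hashes as well.
    if sha256 is not None and entry.get("sha256") != sha256:
        return "changed"
    return "unchanged"
```

Only URLs classified as new or changed need to go through the extractor, which keeps repeated runs cheap and free of duplicates.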
Knowledge base structure
anthropic-kb/
├── AGENTS.md
├── SCHEMA.md
├── _mcp_initialize.md
├── index.md
├── log.md
├── _meta/
│   ├── anthropic-source-index.json
│   ├── ingestion-policy.md
│   └── source-quality.md
├── raw/
│   ├── news/
│   ├── research/
│   └── engineering/
├── entities/
│   ├── anthropic.md
│   ├── claude.md
│   ├── claude-code.md
│   └── constitutional-ai.md
├── concepts/
│   ├── building-effective-agents.md
│   ├── contextual-retrieval.md
│   ├── model-context-protocol.md
│   ├── responsible-scaling-policy.md
│   └── agentic-misalignment.md
├── timelines/
│   ├── claude-releases.md
│   └── safety-policy-updates.md
├── comparisons/
└── queries/
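One possible way to map a source URL onto the raw/ layout above, as a sketch. raw_path_for is a hypothetical helper, and joining nested path segments with a hyphen is an assumption made here, since the note only shows single-segment slugs.

```python
from urllib.parse import urlparse

SECTIONS = ("news", "research", "engineering")


def raw_path_for(url: str) -> str:
    """Map a scoped article URL to its raw markdown path inside the base."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 2 or parts[0] not in SECTIONS:
        raise ValueError(f"URL is outside the scoped sections: {url}")
    # Assumption: deeper paths are flattened into one hyphen-joined slug.
    slug = "-".join(parts[1:])
    return f"raw/{parts[0]}/{slug}.md"
```

Deriving the path from the URL alone keeps re-runs deterministic: the same article always lands in the same file.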
MCP instruction files
AGENTS.md
---
free: true
mcp_method: instructions
mcp_description: "How agents should read, cite and update the Anthropic public knowledge base."
---
It should tell agents to:
- read index.md first;
- check raw/ and the source URL for recent facts;
- distinguish official Anthropic content from agent synthesis;
- not mix docs API knowledge with blog/research knowledge;
- update log.md after every ingest/update;
- cite both the raw source and the synthesized page.
SCHEMA.md
---
free: true
mcp_method: schema
mcp_description: "Schema for the Anthropic LLM Wiki: source sections, page types, frontmatter and update rules."
---
It should define:
- page types: raw_source, entity, concept, timeline, comparison, query;
- required frontmatter;
- source sections: news/research/engineering/docs;
- confidence rules;
- contradiction/update rules;
- naming conventions.
_mcp_initialize.md
---
free: true
mcp_method: initialize
mcp_description: "Start a session with the Anthropic knowledge base."
---
Startup protocol:
1. Read index.md.
2. Read latest log.md entries.
3. Call instructions().
4. Call schema() before writing.
5. Use search for broad questions.
6. Verify current claims against raw sources.
Raw article frontmatter
---
title: Building Effective AI Agents
source_url: https://www.anthropic.com/engineering/building-effective-agents
source_domain: anthropic.com
source_section: engineering
published: 2024-12-19
lastmod: 2026-04-13T17:46:47.000Z
ingested: 2026-05-04
extractor: jina-reader
sha256: <body-hash>
type: raw_source
---
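A sketch of writing that frontmatter from a metadata dict. render_raw_source and yaml_scalar are hypothetical helpers; the quoting rule is a conservative heuristic (JSON-style double quotes for values that could confuse a YAML parser), not a full YAML emitter.

```python
import json

# Field order follows the frontmatter example above.
FIELD_ORDER = [
    "title", "source_url", "source_domain", "source_section", "published",
    "lastmod", "ingested", "extractor", "sha256", "type",
]


def yaml_scalar(value) -> str:
    """Quote a value only when plain YAML could misread it (heuristic)."""
    text = str(value)
    risky = (
        ": " in text
        or " #" in text
        or text != text.strip()
        or text[:1] in "\"'[]{}#&*!|>%@`-?:,"
    )
    return json.dumps(text) if risky else text


def render_raw_source(meta: dict, body: str) -> str:
    """Prefix a markdown body with the raw-source frontmatter block."""
    lines = ["---"]
    for key in FIELD_ORDER:
        lines.append(f"{key}: {yaml_scalar(meta[key])}")
    lines.append("---")
    return "\n".join(lines) + "\n\n" + body.strip() + "\n"
```

A missing key raises KeyError on purpose, so an incomplete record fails loudly instead of producing a raw source without provenance.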
Cron task prompt for the agent
You maintain the Anthropic public knowledge base in Trip2G.
Goal:
Keep the Anthropic LLM Wiki up to date from public Anthropic content.
Sources:
- Canonical backfill/reconciliation: https://www.anthropic.com/sitemap.xml
- Optional RSS signal:
  - https://raw.githubusercontent.com/taobojlen/anthropic-rss-feed/main/anthropic_news_rss.xml
  - https://raw.githubusercontent.com/taobojlen/anthropic-rss-feed/main/anthropic_engineering_rss.xml
Scope:
Only ingest URLs under:
- https://www.anthropic.com/news/
- https://www.anthropic.com/research/
- https://www.anthropic.com/engineering/
Process:
1. Read AGENTS.md, SCHEMA.md, index.md and last 30 lines of log.md.
2. Load _meta/anthropic-source-index.json if it exists.
3. Fetch RSS feeds if available.
4. Fetch sitemap.xml and filter scoped URLs.
5. Detect new or changed URLs by URL + lastmod + sha256.
6. For each new/changed URL:
   - convert URL to markdown using Jina Reader first;
   - if Jina fails, fallback to HTML extraction;
   - save raw markdown under raw/news, raw/research or raw/engineering;
   - add raw source frontmatter;
   - update source index.
7. For each ingested source:
   - decide whether it creates/updates entity/concept/timeline pages;
   - do not create pages for passing mentions;
   - update existing pages before creating duplicates;
   - add wikilinks;
   - update confidence/provenance.
8. Update index.md.
9. Append log.md entry with URLs ingested, pages created/updated, errors.
10. Sync Trip2G.
11. Verify MCP search can find at least one newly ingested page.
12. Final response: summarize new URLs, updated pages, skipped URLs, errors, sync status.
Safety:
- Do not ingest unrelated Anthropic pages like careers/legal unless explicitly requested.
- Do not claim a source changed unless lastmod or sha256 changed.
- Do not overwrite raw sources without preserving previous hash in log.
- Separate official Anthropic text from agent synthesis.
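The overwrite rule from the Safety list could be enforced in code rather than left to the prompt alone. A minimal sketch follows; the log.md line format is invented for illustration (the note does not define one) and overwrite_log_entry is a hypothetical helper.

```python
def overwrite_log_entry(url: str, old_sha256: str, new_sha256: str, when: str) -> str:
    """Build one log.md line for a raw source that is about to be overwritten."""
    if old_sha256 == new_sha256:
        # Same body hash: nothing to overwrite, so nothing may be reported as changed.
        raise ValueError("sha256 unchanged; do not report this source as changed")
    return (
        f"- {when} updated {url} "
        f"(previous sha256: {old_sha256}, new sha256: {new_sha256})"
    )
```

Calling this before the file write means the previous hash is always recorded first, and an unchanged body cannot be logged as an update by mistake.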
Cron schedule options
Conservative
Daily at 07:00 UTC
Good for first version.
More active
RSS check every 6 hours
Sitemap reconciliation daily
Full source drift audit weekly
Recommended start
0 7 * * *
URL-to-markdown options
Option 1 — Jina Reader
https://r.jina.ai/http://https://www.anthropic.com/engineering/building-effective-agents
Pros:
- fast;
- no browser needed;
- returns markdown;
- worked well on the one Anthropic article tested.
Cons:
- external dependency;
- may cache;
- may fail on some pages;
- not under our control.
Option 2 — Python extractor
Potential libraries:
trafilatura
readability-lxml
beautifulsoup4
markdownify
Current environment check:
trafilatura: not installed
readability: not installed
bs4: not installed
markdownify: not installed
So if using this environment, install dependencies or vendor a small extractor.
Option 3 — Playwright/Steel/browser render
Use when:
- HTML is heavily client-side rendered;
- Jina fails;
- article body is missing from raw HTML;
- need screenshots/visual evidence.
For cron, though, the browser should be the fallback, not the default.
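The ordering of the three options can be expressed as a simple fallback chain. This is a sketch only: the extractors are passed in as plain callables, so nothing here depends on Jina, trafilatura or Playwright being installed, and extract_with_fallback is an illustrative name.

```python
def extract_with_fallback(url: str, extractors: list) -> tuple:
    """Try (name, callable) extractors in order; return (name, markdown).

    Each callable takes a URL and returns markdown text, or raises on failure.
    """
    errors = []
    for name, extract in extractors:
        try:
            markdown = extract(url)
        except Exception as exc:  # any failure just moves on to the next option
            errors.append(f"{name}: {exc}")
            continue
        if markdown and markdown.strip():
            return name, markdown
        errors.append(f"{name}: empty result")
    raise RuntimeError("all extractors failed: " + "; ".join(errors))
```

The returned name can go straight into the extractor frontmatter field, so each raw source records which option actually produced it. The cheapest extractor goes first in the list and the browser-based one last.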
Why give the agent a cron task rather than just a feed parser
A feed parser only delivers URLs.
An agent can add a second layer:
raw article → concept/entity updates → index/log → citations → MCP-ready wiki
This is what turns a stream of posts into an LLM Wiki:
RAG retrieves. LLM Wiki compounds.
Acceptance criteria for the first version
The first version is considered done when:
- a separate Trip2G knowledge base or an anthropic-kb/ subfolder has been created;
- AGENTS.md, SCHEMA.md, _mcp_initialize.md, index.md, and log.md exist;
- 10–20 Anthropic posts have been ingested;
- raw markdown is saved with its source URL and hash;
- there are at least 5 synthesized pages;
- the cron task can be re-run without creating duplicates;
- changes are written to log.md;
- Trip2G sync succeeds;
- MCP search finds the new knowledge base;
- the agent answers the demo questions with citations.
Demo questions
What does Anthropic recommend for building effective agents?
How does Contextual Retrieval differ from a normal RAG pipeline?
What has Anthropic written about Claude Code best practices?
Which Anthropic posts are most relevant to Trip2G's MCP knowledge-base design?
What changed in Anthropic's safety policy over time?
How this relates to Trip2G positioning
This is a very strong demo case:
Trip2G can turn any public blog into an agent-readable, self-maintaining knowledge base.
For the landing page:
Give your agents a living knowledge base over Anthropic's research, engineering notes and product updates.
For the product:
Sitemap/RSS adapters → markdown raw sources → LLM Wiki → MCP → federation.
For GTM:
Not just “RSS reader”.
A compounding knowledge base that agents can query, cite and maintain.
Next step
If we decide to do a live demo:
- Create anthropic-kb/ in the Trip2G vault.
- Take 10–20 starter URLs from the sitemap.
- Run them through Jina Reader.
- Create raw sources plus 5 concept pages.
- Set up the daily cron.
- Verify via MCP.