How to index the Anthropic blog as a knowledge base

Date: 2026-05-04 02:53 UTC

Key takeaway

No official RSS feed was found for www.anthropic.com/news, research, or engineering: the standard candidates rss.xml, feed.xml, /news/rss.xml, and /news/feed.xml return 404, and the news, research, and engineering pages contain no RSS/Atom <link> discovery tag.

The content can still be indexed reliably:

Canonical source: https://www.anthropic.com/sitemap.xml, the best official source of URLs and lastmod values.
Unofficial RSS: the GitHub project taobojlen/anthropic-rss-feed generates RSS feeds for News and Engineering every 6 hours.
Engineering-only RSS: conoro/anthropic-engineering-rss-feed exists, but it already looks less fresh than the taobojlen feed.
Docs: the Anthropic docs provide llms.txt and llms-full.txt, but these cover the developer docs, not blog/news/research.
I did not find a ready-made public vector search over the Anthropic blog; only an old experimental retrieval demo and third-party articles/examples about contextual retrieval turned up.

Recommendation for Trip2G: build an Anthropic Knowledge Base adapter based on the sitemap plus optional unofficial RSS. Do not rely on RSS alone.

What was checked

Official Anthropic RSS

URLs checked:

https://www.anthropic.com/rss.xml
https://www.anthropic.com/feed.xml
https://www.anthropic.com/news/rss.xml
https://www.anthropic.com/news/feed.xml

Result:

404 Not Found

Pages checked:

https://www.anthropic.com/news
https://www.anthropic.com/research
https://www.anthropic.com/engineering

Result:

HTML is served, but no RSS/Atom discovery link was found.

Conclusion:

No official RSS feed for Anthropic's public site was visible at the time of the check.
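
The discovery-link check described above can be reproduced with a small parser. This is a minimal sketch using only the standard library; the class and function names are illustrative, and the sample HTML is made up for the example rather than copied from any Anthropic page.

```python
from html.parser import HTMLParser

# MIME types that conventionally advertise a feed in <link rel="alternate">.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collects href values of <link> tags that advertise an RSS/Atom feed."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {key: (value or "") for key, value in attrs}
        rel_tokens = attr.get("rel", "").lower().split()
        if "alternate" in rel_tokens and attr.get("type", "").lower() in FEED_TYPES:
            self.found.append(attr.get("href", ""))


def find_feed_links(html):
    """Return the list of feed URLs advertised in the given HTML string."""
    finder = FeedLinkFinder()
    finder.feed(html)
    return finder.found


# Illustrative input only, not real page content.
sample = (
    '<html><head>'
    '<link rel="alternate" type="application/rss+xml" href="/rss.xml">'
    '</head></html>'
)
print(find_feed_links(sample))
```

An empty list for a fetched page corresponds to the "no discovery link" result reported here.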

Anthropic sitemap

Checked:

https://www.anthropic.com/sitemap.xml

Result:

robots.txt allows crawling and points to the sitemap:

User-Agent: *
Allow: /
Sitemap: https://www.anthropic.com/sitemap.xml

the sitemap is accessible;
it contains URLs and lastmod values;
about 406 URLs found in total;
after filtering:
- news/: about 205 URLs;
- research/: about 115 URLs;
- engineering/: about 24 URLs;
- total for the initial content corpus: about 344 URLs.
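
The per-section counts above can be recomputed from a list of sitemap URLs. The following is a rough sketch under the assumption that an article URL has the form /section/slug; the helper names and the sample list are illustrative, and the sample is far smaller than the real sitemap.

```python
from collections import Counter
from urllib.parse import urlparse

SECTIONS = ("news", "research", "engineering")


def section_of(url):
    """Return the section name for an article URL, or None for anything else.

    Assumption: an article has at least two path segments (/section/slug),
    so bare section index pages such as /news are not counted.
    """
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) >= 2 and parts[0] in SECTIONS:
        return parts[0]
    return None


def count_sections(urls):
    """Count article URLs per section."""
    return Counter(s for s in map(section_of, urls) if s is not None)


# Small illustrative sample; the real sitemap had about 406 URLs.
sample_urls = [
    "https://www.anthropic.com/engineering/building-effective-agents",
    "https://www.anthropic.com/engineering/claude-code-best-practices",
    "https://www.anthropic.com/research/agentic-misalignment",
    "https://www.anthropic.com/news",
    "https://www.anthropic.com/careers",
]
counts = count_sections(sample_urls)
print(dict(counts), sum(counts.values()))
```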

Example sitemap entries:

https://www.anthropic.com/engineering/building-effective-agents
lastmod: 2026-04-13T17:46:47.000Z

https://www.anthropic.com/engineering/claude-code-best-practices
lastmod: 2026-01-26T23:24:56.000Z

https://www.anthropic.com/research/agentic-misalignment
lastmod: 2025-06-23T08:51:14.000Z

Conclusion:

The sitemap is the best official incremental index: URL + lastmod.
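
Before relying on the sitemap, an adapter can confirm programmatically what robots.txt permits. Here is a minimal sketch with the standard library's `urllib.robotparser`, fed the robots.txt lines quoted in this section instead of a live fetch; the user agent string is the one used in the ingestion sketch later in this report.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt content as quoted in this report (no network call here).
ROBOTS_TXT = """\
User-Agent: *
Allow: /
Sitemap: https://www.anthropic.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Is our crawler allowed to fetch a content section?
allowed = parser.can_fetch("Trip2G research bot", "https://www.anthropic.com/news/")

# Sitemap URLs declared in robots.txt (site_maps() requires Python 3.8+).
sitemaps = parser.site_maps()

print(allowed, sitemaps)
```

In a live adapter the same parser could load the file with `set_url()` and `read()`; re-checking on every run keeps the crawler aligned with any later policy change.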

Third-party RSS feeds and indexes found

`taobojlen/anthropic-rss-feed`

GitHub:

https://github.com/taobojlen/anthropic-rss-feed

The README says:

Unofficial RSS feeds for Anthropic's website, updated every 6 hours via GitHub Actions.

Feeds:

https://raw.githubusercontent.com/taobojlen/anthropic-rss-feed/main/anthropic_news_rss.xml
https://raw.githubusercontent.com/taobojlen/anthropic-rss-feed/main/anthropic_engineering_rss.xml

The repository contains:

anthropic_news_rss.xml
anthropic_engineering_rss.xml
anthropic_rss.py
requirements.txt

Verified RSS fragment:

<channel>
  <title>Anthropic News</title>
  <description>Latest news and announcements from Anthropic</description>
  <lastBuildDate>Sun, 03 May 2026 18:24:47 +0000</lastBuildDate>
  <item>
    <title>Claude is a space to think</title>
    <link>https://www.anthropic.com/news/claude-is-a-space-to-think</link>
  </item>
</channel>

Pros:

a ready-made RSS feed already exists;
updated via GitHub Actions;
covers news and engineering;
easy to plug into any feed watcher.

Cons:

unofficial;
no separate feed for research;
the RSS carries metadata (link, title, description) but not the full article content;
the article itself still has to be fetched for full-text indexing.
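
Since the feed carries only metadata, a watcher mostly needs the item links to know which articles to fetch. The sketch below parses the fragment shown above with `xml.etree.ElementTree`; the enclosing `<rss>` root element is an assumption based on the RSS 2.0 layout, because the verified fragment starts at `<channel>`, and `parse_feed` is an illustrative name.

```python
import xml.etree.ElementTree as ET

# The verified fragment, wrapped in an <rss> root (assumed, per RSS 2.0).
SAMPLE_RSS = """<rss version="2.0">
<channel>
  <title>Anthropic News</title>
  <description>Latest news and announcements from Anthropic</description>
  <lastBuildDate>Sun, 03 May 2026 18:24:47 +0000</lastBuildDate>
  <item>
    <title>Claude is a space to think</title>
    <link>https://www.anthropic.com/news/claude-is-a-space-to-think</link>
  </item>
</channel>
</rss>"""


def parse_feed(rss_xml):
    """Extract channel title, lastBuildDate and (title, link) pairs from RSS text."""
    channel = ET.fromstring(rss_xml).find("channel")
    return {
        "title": channel.findtext("title"),
        "last_build": channel.findtext("lastBuildDate"),
        "items": [
            (item.findtext("title"), item.findtext("link"))
            for item in channel.findall("item")
        ],
    }


feed = parse_feed(SAMPLE_RSS)
for title, link in feed["items"]:
    print(title, link)
```

Each link would then go through the same article fetch and extraction path as URLs taken from the sitemap.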

`conoro/anthropic-engineering-rss-feed`

GitHub:

https://github.com/conoro/anthropic-engineering-rss-feed

Feed:

https://raw.githubusercontent.com/conoro/anthropic-engineering-rss-feed/main/anthropic_engineering_rss.xml

README:

RSS Feed for the Anthropic Engineering Blog
Uses Playwright to scrape the client-side rendered content.
GitHub Action runs hourly.

Pros:

engineering-specific;
includes a Playwright scraper;
useful as a reference implementation.

Cons:

engineering only;
the lastBuildDate found was older than the one in the taobojlen feed;
does not cover news/research.
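
Feed freshness can be compared mechanically from the lastBuildDate values. This is a small sketch using `email.utils.parsedate_to_datetime`, which parses the RFC 822 style dates used in RSS. The taobojlen value is the one quoted earlier; the conoro value is a made-up placeholder, because the observed date is not recorded in this report.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def feed_age_hours(last_build_date, now):
    """Hours elapsed between a feed's lastBuildDate and the given moment."""
    built = parsedate_to_datetime(last_build_date)
    return (now - built).total_seconds() / 3600


def freshest(feeds):
    """Return the name of the feed with the most recent lastBuildDate."""
    return max(feeds, key=lambda name: parsedate_to_datetime(feeds[name]))


feeds = {
    "taobojlen": "Sun, 03 May 2026 18:24:47 +0000",  # value quoted in this report
    "conoro": "Wed, 01 Jan 2025 00:00:00 +0000",     # PLACEHOLDER, not the observed value
}
report_time = datetime(2026, 5, 4, 2, 53, tzinfo=timezone.utc)  # date of this report

print(freshest(feeds))
print(round(feed_age_hours(feeds["taobojlen"], report_time), 2))
```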

RSSHub route

Search turned up:

https://rsshub.bestblogs.dev/anthropic/news

However, the endpoint check hung or returned nothing in the current environment, so I do not consider this a verified source.

Recommendation:

RSSHub can be kept as a fallback or experiment, but not as the canonical ingestion path.
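
One way to keep an unreliable endpoint as a fallback without letting it hang the pipeline is to give every request a timeout and try sources in priority order. A minimal sketch follows; `fetch` and `first_available` are illustrative names, and the ordering of sources is a design choice rather than something this report prescribes.

```python
from urllib.request import Request, urlopen


def fetch(url, timeout=10.0):
    """Fetch a URL with an explicit timeout so a hanging endpoint fails fast."""
    request = Request(url, headers={"User-Agent": "Trip2G research bot"})
    with urlopen(request, timeout=timeout) as response:
        return response.read()


def first_available(urls, fetcher=fetch):
    """Try URLs in order; return (url, body) for the first success, else None.

    URLError, HTTPError and socket timeouts all derive from OSError,
    so a single except clause covers the usual network failures.
    """
    for url in urls:
        try:
            return url, fetcher(url)
        except OSError:
            continue
    return None
```

Passing the fetcher as a parameter makes the fallback logic testable without network access; in production the default `fetch` would be used.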

Anthropic docs as a separate knowledge base

The Anthropic developer docs provide:

https://docs.anthropic.com/llms.txt
https://docs.anthropic.com/llms-full.txt

Check:

status 200
content-type: text/plain

However:

https://www.anthropic.com/llms.txt → 404
https://www.anthropic.com/.well-known/llms.txt → 404

Conclusion:

The Anthropic developer docs can be connected separately as a docs knowledge base via llms.txt/llms-full.txt.
This does not replace an index of blog/news/research/engineering on www.anthropic.com.
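
To turn an llms.txt file into a list of pages for the docs knowledge base, the markdown links can be pulled out with a regular expression. This sketch assumes the file follows the common llms.txt convention of markdown link lists; the sample text is invented for illustration and is not the content of Anthropic's actual file.

```python
import re

# Matches markdown links of the form [title](url).
LINK_RE = re.compile(r"\[([^\]]+)\]\((\S+?)\)")


def extract_links(llms_txt):
    """Return (title, url) pairs for every markdown link in the text."""
    return LINK_RE.findall(llms_txt)


# Illustrative sample only; real files will differ.
SAMPLE_LLMS_TXT = """\
# Example Docs

> Illustrative file for this sketch.

## Docs

- [Getting started](https://example.com/docs/getting-started.md): First steps
- [API reference](https://example.com/docs/api.md)
"""

for title, url in extract_links(SAMPLE_LLMS_TXT):
    print(title, url)
```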

Ready-made vector search over Anthropic materials

Search queries tried:

Anthropic blog vector search
Anthropic blog dataset github
Anthropic website scraper
Anthropic sitemap scraper

Found:

anthropics/anthropic-retrieval-demo: an experimental Claude Search and Retrieval Demo, but not a ready-made index of the Anthropic blog;
anthropics/anthropic-cookbook and DeepWiki pages: about vector DB integrations, not about an index of the Anthropic blog;
seanGSISG/Anthropic-Documentation-Scraper: a mirror/scraper of the Anthropic docs website, not a blog corpus;
various articles about Anthropic Contextual Retrieval.

Not found:

a public, maintained vector search over all Anthropic news/research/engineering articles.

Conclusion:

Building a dedicated Trip2G index is a better option than looking for a ready-made third-party vector DB.

Proposed structure of the Trip2G LLM Wiki

anthropic-blog/
├── AGENTS.md
├── SCHEMA.md
├── _mcp_initialize.md
├── index.md
├── log.md
├── raw/
│   ├── news/
│   ├── research/
│   └── engineering/
├── entities/
│   ├── claude.md
│   ├── claude-code.md
│   ├── anthropic.md
│   └── constitutional-ai.md
├── concepts/
│   ├── contextual-retrieval.md
│   ├── building-effective-agents.md
│   ├── model-context-protocol.md
│   ├── responsible-scaling-policy.md
│   └── agentic-misalignment.md
├── timelines/
│   ├── claude-releases.md
│   └── safety-policy-updates.md
├── comparisons/
│   ├── claude-code-vs-cursor.md
│   └── rag-vs-contextual-retrieval.md
└── queries/

Frontmatter for a raw article

---
title: Building effective agents
source_url: https://www.anthropic.com/engineering/building-effective-agents
source_domain: anthropic.com
source_section: engineering
published: 2024-12-19
lastmod: 2026-04-13T17:46:47.000Z
ingested: 2026-05-04
sha256: <body hash>
type: raw_source
---
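
The frontmatter above can be generated at ingest time, with the sha256 field computed from the article body so that later runs can detect content changes. This is a sketch under the assumption that the hash covers the UTF-8 encoded markdown body; `raw_frontmatter` is an illustrative helper, not part of Trip2G.

```python
import hashlib


def raw_frontmatter(title, source_url, section, published, lastmod, ingested, body):
    """Build the frontmatter block for a raw article.

    Assumption: sha256 is taken over the UTF-8 encoded markdown body.
    Values are written as-is; titles containing ':' would need YAML quoting.
    """
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    lines = [
        "---",
        f"title: {title}",
        f"source_url: {source_url}",
        "source_domain: anthropic.com",
        f"source_section: {section}",
        f"published: {published}",
        f"lastmod: {lastmod}",
        f"ingested: {ingested}",
        f"sha256: {digest}",
        "type: raw_source",
        "---",
    ]
    return "\n".join(lines) + "\n"


frontmatter = raw_frontmatter(
    title="Building effective agents",
    source_url="https://www.anthropic.com/engineering/building-effective-agents",
    section="engineering",
    published="2024-12-19",
    lastmod="2026-04-13T17:46:47.000Z",
    ingested="2026-05-04",
    body="placeholder article body",  # stand-in text, not the real article
)
print(frontmatter)
```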

Frontmatter for a synthesized concept page

---
title: Building Effective Agents
type: concept
tags: [agents, engineering, claude, workflows]
sources:
  - raw/engineering/building-effective-agents.md
confidence: high
created: 2026-05-04
updated: 2026-05-04
---

MCP methods for the knowledge base

AGENTS.md:

---
free: true
mcp_method: instructions
mcp_description: "How agents should read and update this Anthropic knowledge base."
---

SCHEMA.md:

---
free: true
mcp_method: schema
mcp_description: "Schema for the Anthropic blog/research LLM Wiki: sources, sections, entities, concepts and update rules."
---

_mcp_initialize.md:

---
free: true
mcp_method: initialize
mcp_description: "Start a session with the Anthropic knowledge base."
---

How an agent should answer from the Anthropic knowledge base

First determine the question type:
- API docs?
- product announcement?
- research/safety?
- engineering practice?
If it is about API docs, go to the anthropic-docs base.
If it is about blog/research/engineering, go to anthropic-blog.
For recent facts, check lastmod and the raw source.
For synthesis, use the concept/entity pages.
Answer with citations to Trip2G pages and source URLs.
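
The routing rules above can be written down as a small lookup plus a step list. This is only a sketch: the question-type keys and the `plan_answer` helper are hypothetical, and how the question type gets detected (by the agent or by a classifier) is left open.

```python
# Hypothetical question-type keys; the base names come from the rules above.
ROUTES = {
    "api_docs": "anthropic-docs",
    "product_announcement": "anthropic-blog",
    "research_safety": "anthropic-blog",
    "engineering_practice": "anthropic-blog",
}


def plan_answer(question_type, needs_fresh_facts=False, needs_synthesis=False):
    """Return the ordered steps an agent would follow for a question."""
    steps = [f"search {ROUTES[question_type]}"]
    if needs_fresh_facts:
        steps.append("check lastmod and read the raw source")
    if needs_synthesis:
        steps.append("read concept/entity pages")
    steps.append("cite Trip2G pages and source URLs")
    return steps


print(plan_answer("engineering_practice", needs_synthesis=True))
```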

Minimal ingestion script sketch

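import xml.etree.ElementTree as ET
from urllib.request import Request, urlopen

SITEMAP = "https://www.anthropic.com/sitemap.xml"
# Only article sections are indexed; the trailing slash skips the section index pages.
PREFIXES = (
    "https://www.anthropic.com/news/",
    "https://www.anthropic.com/research/",
    "https://www.anthropic.com/engineering/",
)

# Fetch the sitemap with an identifying User-Agent and a timeout so the call cannot hang.
request = Request(SITEMAP, headers={"User-Agent": "Trip2G research bot"})
with urlopen(request, timeout=30) as response:
    sitemap_xml = response.read()

root = ET.fromstring(sitemap_xml)
# Sitemap elements live in the sitemaps.org namespace.
ns = {"s": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Collect (url, lastmod) pairs for the content sections.
urls = []
for item in root.findall("s:url", ns):
    loc = item.findtext("s:loc", namespaces=ns)
    lastmod = item.findtext("s:lastmod", namespaces=ns)
    if loc and loc.startswith(PREFIXES):
        urls.append((loc, lastmod))

print(len(urls))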

Next script steps:

- compare with local source-index.json;
- fetch changed URLs;
- extract clean article markdown;
- save raw files;
- update index/log;
- sync Trip2G.
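
The first two steps, comparing against a local source-index.json and selecting changed URLs, might look like the sketch below. It assumes the index is a flat JSON object mapping URL to lastmod; that file layout and the helper names are assumptions for illustration, and the "current" lastmod for the second URL is invented to show a change.

```python
import json
from pathlib import Path


def load_index(path):
    """Load the previous {url: lastmod} index, or an empty one on first run."""
    p = Path(path)
    return json.loads(p.read_text(encoding="utf-8")) if p.exists() else {}


def save_index(path, index):
    """Persist the index with stable key order so diffs stay readable."""
    Path(path).write_text(json.dumps(index, indent=2, sort_keys=True), encoding="utf-8")


def diff_index(current, previous):
    """Return (new, changed, removed) URL lists between two {url: lastmod} maps."""
    new = sorted(u for u in current if u not in previous)
    changed = sorted(u for u in current if u in previous and current[u] != previous[u])
    removed = sorted(u for u in previous if u not in current)
    return new, changed, removed


previous = {
    "https://www.anthropic.com/engineering/claude-code-best-practices": "2026-01-26T23:24:56.000Z",
    "https://www.anthropic.com/research/agentic-misalignment": "2025-06-23T08:51:14.000Z",
}
current = {
    "https://www.anthropic.com/engineering/building-effective-agents": "2026-04-13T17:46:47.000Z",
    "https://www.anthropic.com/engineering/claude-code-best-practices": "2026-02-01T00:00:00.000Z",  # invented new lastmod
    "https://www.anthropic.com/research/agentic-misalignment": "2025-06-23T08:51:14.000Z",
}
new, changed, removed = diff_index(current, previous)
print(new, changed, removed)
```

Only the URLs in `new` and `changed` need to be fetched; the index would be saved after a successful run so that a failed fetch is retried next time.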

Trip2G as a product

This is a good demo use case:

Turn Anthropic's public research, engineering and product posts into an agent-readable MCP knowledge base.

Why this is a strong case:

Anthropic writes specifically for a developer/agent audience;
the material covers Claude Code, agents, contextual retrieval, and MCP-adjacent topics;
it is obvious to the user why they would query such a base;
it can showcase federation: trip2g-docs + anthropic-blog + user-project-docs.

Example demo question:

What does Anthropic recommend for building effective agents, and how does that relate to Trip2G's MCP knowledge-base design?

Example value prop:

Don't just bookmark Anthropic posts. Compile them into an LLM Wiki your agents can query, cite and update.

Acceptance criteria for the next task

If a real adapter is built, done means:

the sitemap is loaded;
the list of /news, /research, /engineering URLs is built;
at least 10 articles are downloaded as markdown raw sources;
AGENTS.md, SCHEMA.md, _mcp_initialize.md, index.md, and log.md are created;
5 concept/entity pages are created;
Trip2G sync succeeds;
MCP search finds the articles;
note_html opens the raw and synthesized pages;
there is a demo query with citations.

Next steps

Decide whether to deliver only a report or to create a live demo Anthropic Knowledge Base in Trip2G.
For a live demo:
- start with sitemap-first ingest;
- pick the 10–20 most relevant engineering/research posts;
- build a small LLM Wiki rather than trying to index all 344 URLs at once.
Then add a periodic watcher:
- daily sitemap diff;
- optional unofficial RSS for a fast signal;
- update log.md and the changed raw pages.
