Make Your Blog AI-Readable: llms.txt, robots.txt, and JSON-LD

A practical, copy-paste guide to making a blog legible to AI crawlers and language models: the llms.txt convention, an AI-welcoming robots.txt, and Schema.org JSON-LD that actually matters in 2026.

Terapep · Jun 15, 2026

TL;DR: Three low-effort files make your site easier for AI to read and cite: a /llms.txt index of your best content, a robots.txt that explicitly welcomes AI crawlers and points to your sitemap, and JSON-LD structured data on every page (Organization site-wide, BlogPosting and BreadcrumbList on posts). None of them is magic, but together they remove friction. This very blog ships all three.

Search isn't the only thing reading your website anymore. ChatGPT, Claude, Perplexity, and Google's AI answers all fetch, parse, and sometimes quote web pages. If your content is buried in cluttered HTML with no machine-readable signals, you're making their job harder and probably getting cited less. Here's the practical stack, with the honest caveats.

1. llms.txt — a map for language models

llms.txt is a convention proposed by Jeremy Howard of Answer.AI. It's a single Markdown file at your site root (/llms.txt) that gives an LLM a clean, curated index of your most important content — so it doesn't have to crawl and de-clutter your whole site to understand it.

The format is simple and strict, in this order:

An # H1 with the site name (the only required line)
A > blockquote summary
Optional prose
## H2 sections, each a list of links: - [name](url): note

A minimal example:

# Terapep

> A blog about technology and good food — practical notes, honest takes, and recipes worth keeping.

## Posts

- [How Korean Ramyeon Took Over the World](https://terapep.com/blog/korean-ramyeon-global-boom/): The export data behind K-ramen's rise.
- [Does GEO Actually Work?](https://terapep.com/blog/does-geo-actually-work/): What the research really says about AI-search optimization.
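
The same structure can be generated from post metadata instead of being written by hand. Below is a minimal sketch in TypeScript, assuming hypothetical `LlmsIndex`, `LlmsSection`, and `LlmsLink` shapes that are not part of the convention itself; it skips the optional prose block for brevity.

```typescript
// Hypothetical shapes for illustration; map them onto your own content model.
interface LlmsLink {
  name: string;
  url: string;
  note?: string; // optional text after the colon
}

interface LlmsSection {
  title: string;
  links: LlmsLink[];
}

interface LlmsIndex {
  siteName: string; // becomes the required H1
  summary?: string; // becomes the blockquote
  sections: LlmsSection[];
}

// Emit the parts in the order listed above: H1, blockquote, then H2 link lists.
function renderLlmsTxt(index: LlmsIndex): string {
  const lines: string[] = [`# ${index.siteName}`, ""];
  if (index.summary) {
    lines.push(`> ${index.summary}`, "");
  }
  for (const section of index.sections) {
    lines.push(`## ${section.title}`, "");
    for (const link of section.links) {
      const note = link.note ? `: ${link.note}` : "";
      lines.push(`- [${link.name}](${link.url})${note}`);
    }
    lines.push("");
  }
  // Collapse trailing blank lines into a single final newline.
  return lines.join("\n").replace(/\s+$/, "") + "\n";
}

const exampleLlmsTxt = renderLlmsTxt({
  siteName: "Terapep",
  summary: "A blog about technology and good food.",
  sections: [
    {
      title: "Posts",
      links: [
        {
          name: "Does GEO Actually Work?",
          url: "https://terapep.com/blog/does-geo-actually-work/",
          note: "What the research really says about AI-search optimization.",
        },
      ],
    },
  ],
});
console.log(exampleLlmsTxt);
```

Keeping the renderer a pure function makes it easy to call from a build script or a static endpoint.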

There's also llms-full.txt, which inlines your entire content into one big Markdown file so a model can ingest everything in a single fetch. (Watch the size — it can blow a context window on a large site.)
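
To keep an eye on that size, a build step could estimate the token count before publishing. This is a rough sketch: the layout of the combined file and the `FullTextPost` shape are assumptions, and the four-characters-per-token figure is only a common rule of thumb for English text, not a real tokenizer.

```typescript
// Rule-of-thumb estimate only; real tokenizers vary by model and language.
const CHARS_PER_TOKEN_ESTIMATE = 4;

// Hypothetical post shape for illustration.
interface FullTextPost {
  title: string;
  url: string;
  markdown: string;
}

// Concatenate every post into one Markdown document (assumed layout).
function renderLlmsFullTxt(siteName: string, posts: FullTextPost[]): string {
  const parts: string[] = [`# ${siteName}`];
  for (const post of posts) {
    parts.push(`## ${post.title}\n\nSource: ${post.url}\n\n${post.markdown.trim()}`);
  }
  return parts.join("\n\n") + "\n";
}

// Approximate token count from character length.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN_ESTIMATE);
}

// True when the estimated size stays within the given token budget.
function fitsTokenBudget(text: string, tokenBudget: number): boolean {
  return estimateTokens(text) <= tokenBudget;
}
```

If the estimate exceeds whatever budget you pick, one option is to ship only the curated /llms.txt index and skip the full file.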

The honest caveat: llms.txt is "mainly useful for inference" — i.e., when an LLM is actively helping a user — not training, and no major AI vendor has committed to using it as a ranking input. Google has explicitly said you don't need it for its AI features. So treat it as a low-cost convenience that some tools read, not a guaranteed win.

2. robots.txt — welcome the AI crawlers on purpose

AI bots come in three flavors, and the difference matters:

Class	What it does	Examples
Training	Collects content to train models	GPTBot, ClaudeBot, CCBot, Bytespider
Search index	Builds the index behind AI answers (sends citations back to you)	OAI-SearchBot, PerplexityBot, Claude-SearchBot
Live fetch	Reads a page because a user asked right now	ChatGPT-User, Perplexity-User

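If you want AI visibility (and most blogs do, because the search-index bots drive citations and referral traffic), name them explicitly. The wildcard group already allows everyone, so the named groups are technically redundant, but they state your intent and give you a per-bot switch to flip later: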

User-agent: *
Allow: /

User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

Sitemap: https://terapep.com/sitemap.xml

Two things worth knowing. Google-Extended and Applebot-Extended aren't crawlers — they're control tokens that toggle whether already-crawled content can be used for AI training, with no effect on Search ranking. And robots.txt is advisory: reputable bots honor it, but if you genuinely need to block one (Bytespider has a reputation for ignoring rules), enforce it at the edge/WAF, not just here.
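
Edge enforcement usually comes down to checking the User-Agent header before the request reaches your site. The sketch below shows only that decision as a pure function; `BLOCKED_AGENT_TOKENS`, `isBlockedUserAgent`, and `statusForUserAgent` are hypothetical names, and wiring them into a particular CDN or WAF is left out because it depends on the provider.

```typescript
// Substrings to look for in the User-Agent header. Bytespider is the example
// named above; the list itself is an assumption to adapt.
const BLOCKED_AGENT_TOKENS: string[] = ["Bytespider"];

// Case-insensitive substring match against the blocklist.
function isBlockedUserAgent(
  userAgent: string | null | undefined,
  blockedTokens: string[] = BLOCKED_AGENT_TOKENS,
): boolean {
  if (!userAgent) {
    return false; // a missing header is not treated as a blocked bot here
  }
  const ua = userAgent.toLowerCase();
  return blockedTokens.some((token) => ua.indexOf(token.toLowerCase()) !== -1);
}

// HTTP status an edge handler might return: 403 for blocked bots, 200 otherwise.
function statusForUserAgent(userAgent: string | null | undefined): number {
  return isBlockedUserAgent(userAgent) ? 403 : 200;
}
```

User-Agent strings can be spoofed, so a check like this only stops bots that identify themselves honestly.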

If you'd rather opt out of training but keep AI-search citations, that three-class table is your lever: Disallow: / the training bots, Allow: / the search-index ones.
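
That policy can be expressed as data and rendered into robots.txt. A minimal sketch, using the bot names from the table above and a hypothetical `buildRobotsTxt` helper; the class assignments are copied from the table and should be checked against each vendor's current documentation.

```typescript
type BotClass = "training" | "search" | "live";

interface AiBot {
  name: string;
  botClass: BotClass;
}

// Names and classes taken from the three-class table above.
const AI_BOTS: AiBot[] = [
  { name: "GPTBot", botClass: "training" },
  { name: "ClaudeBot", botClass: "training" },
  { name: "CCBot", botClass: "training" },
  { name: "Bytespider", botClass: "training" },
  { name: "OAI-SearchBot", botClass: "search" },
  { name: "PerplexityBot", botClass: "search" },
  { name: "Claude-SearchBot", botClass: "search" },
  { name: "ChatGPT-User", botClass: "live" },
  { name: "Perplexity-User", botClass: "live" },
];

// Allow every bot whose class is listed in allowedClasses; disallow the rest.
function buildRobotsTxt(allowedClasses: BotClass[], sitemapUrl: string): string {
  const groups: string[] = ["User-agent: *\nAllow: /"];
  for (const bot of AI_BOTS) {
    const rule = allowedClasses.indexOf(bot.botClass) !== -1 ? "Allow: /" : "Disallow: /";
    groups.push(`User-agent: ${bot.name}\n${rule}`);
  }
  groups.push(`Sitemap: ${sitemapUrl}`);
  return groups.join("\n\n") + "\n";
}

// Opt out of training, keep search-index and live-fetch bots.
console.log(buildRobotsTxt(["search", "live"], "https://terapep.com/sitemap.xml"));
```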

3. JSON-LD — structured data that still earns results

Structured data tells machines what your page is. Embed it as <script type="application/ld+json">. For a blog, the high-value types in 2026 are BlogPosting, Organization/WebSite, and BreadcrumbList.

A trimmed BlogPosting:

{
  "@context": "https://schema.org",
  "@type": "BlogPosting",
  "headline": "Make Your Blog AI-Readable",
  "description": "A practical guide to llms.txt, robots.txt, and JSON-LD.",
  "image": "https://terapep.com/og.jpg",
  "datePublished": "2026-06-15T09:00:00+09:00",
  "dateModified": "2026-06-15T09:00:00+09:00",
  "author": { "@type": "Organization", "name": "Terapep" },
  "publisher": {
    "@type": "Organization",
    "name": "Terapep",
    "logo": { "@type": "ImageObject", "url": "https://terapep.com/favicon.svg" }
  },
  "mainEntityOfPage": "https://terapep.com/blog/make-your-blog-ai-readable/"
}
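
An object like this is usually built from post metadata rather than typed out per page. Here is a sketch under stated assumptions: the `PostFrontmatter` field names are hypothetical, the publisher logo from the example is omitted for brevity, and `toJsonLdScript` is one way to serialize the result safely into a script tag.

```typescript
// Hypothetical frontmatter shape; rename fields to match your own posts.
interface PostFrontmatter {
  title: string;
  description: string;
  image: string;
  url: string;
  publishedAt: string; // ISO 8601 with offset, e.g. 2026-06-15T09:00:00+09:00
  updatedAt?: string;
}

// Build a trimmed BlogPosting object mirroring the example above.
function blogPostingJsonLd(post: PostFrontmatter, siteName: string): Record<string, unknown> {
  return {
    "@context": "https://schema.org",
    "@type": "BlogPosting",
    headline: post.title,
    description: post.description,
    image: post.image,
    datePublished: post.publishedAt,
    // Fall back to the publish date when the post has never been edited.
    dateModified: post.updatedAt ?? post.publishedAt,
    author: { "@type": "Organization", name: siteName },
    publisher: { "@type": "Organization", name: siteName },
    mainEntityOfPage: post.url,
  };
}

// Escape "<" so text such as "</script>" inside a field cannot close the tag early.
function toJsonLdScript(data: Record<string, unknown>): string {
  const json = JSON.stringify(data).replace(/</g, "\\u003c");
  return `<script type="application/ld+json">${json}</script>`;
}
```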

Two 2026 reality checks. Always include a timezone offset in your dates (+09:00) — without it, Google assumes Googlebot's timezone. And don't bother chasing FAQPage rich results: Google is sunsetting FAQ rich results for most sites. The markup is still valid and may help AI, but you won't get the snippet. (Writing a real FAQ section with question-shaped headings still helps — more on that in the GEO post.)
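
The timezone rule is easy to enforce at build time. A small sketch, assuming dates are stored as ISO 8601 strings; `hasTimezoneOffset` is a hypothetical helper that accepts either a trailing `Z` or a numeric offset such as `+09:00`.

```typescript
// Time part (hh:mm, optional seconds and fraction) followed by "Z" or a +hh:mm / -hh:mm offset.
const ISO_WITH_OFFSET = /T\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:\d{2})$/;

// True only when the string carries an explicit timezone.
function hasTimezoneOffset(isoDateTime: string): boolean {
  return ISO_WITH_OFFSET.test(isoDateTime);
}

// Fail loudly so a date without an offset never reaches production markup.
function assertTimezoneOffset(fieldName: string, isoDateTime: string): void {
  if (!hasTimezoneOffset(isoDateTime)) {
    throw new Error(`${fieldName} is missing a timezone offset: ${isoDateTime}`);
  }
}

assertTimezoneOffset("datePublished", "2026-06-15T09:00:00+09:00"); // passes silently
```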

Always validate with Google's Rich Results Test before shipping.

Doing it in Astro

Astro makes this pleasant because it ships static HTML by default — your content is in the markup, no hydration required, which is exactly what AI fetchers want to see.

robots.txt: drop a static file in public/, or generate it from a src/pages/robots.txt.ts endpoint so you can inject your site URL.
llms.txt: a src/pages/llms.txt.ts endpoint that iterates your posts and returns text/plain.
JSON-LD: a small component fed by frontmatter, rendered once per page. Site-wide Organization/WebSite in your base head; per-post BlogPosting + BreadcrumbList in the post layout.
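
The per-post BreadcrumbList mentioned above has not been shown yet, so here is a minimal sketch of the data a component could render. The `Crumb` shape and the "Blog" trail are assumptions for illustration; the property names (`itemListElement`, `position`, `name`, `item`) follow Schema.org's BreadcrumbList and ListItem types.

```typescript
// Hypothetical input shape: one entry per level, from the home page down to the post.
interface Crumb {
  name: string;
  url: string;
}

interface BreadcrumbListItem {
  "@type": string;
  position: number;
  name: string;
  item: string;
}

interface BreadcrumbListJsonLd {
  "@context": string;
  "@type": string;
  itemListElement: BreadcrumbListItem[];
}

// Positions are 1-based and follow the order of the crumbs array.
function breadcrumbJsonLd(crumbs: Crumb[]): BreadcrumbListJsonLd {
  return {
    "@context": "https://schema.org",
    "@type": "BreadcrumbList",
    itemListElement: crumbs.map((crumb, index) => ({
      "@type": "ListItem",
      position: index + 1,
      name: crumb.name,
      item: crumb.url,
    })),
  };
}

const postBreadcrumbs = breadcrumbJsonLd([
  { name: "Home", url: "https://terapep.com/" },
  { name: "Blog", url: "https://terapep.com/blog/" },
  { name: "Make Your Blog AI-Readable", url: "https://terapep.com/blog/make-your-blog-ai-readable/" },
]);
console.log(JSON.stringify(postBreadcrumbs, null, 2));
```

In an Astro layout, the resulting object would typically be serialized into a `<script type="application/ld+json">` tag, for example with the `set:html` directive.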

FAQ

What is llms.txt and do I need it?

llms.txt is a Markdown index of your site's key content, served at /llms.txt, that helps language models understand your site at inference time. It's a low-cost convenience that some AI tools read, but it is not a ranking signal, and Google says you don't need it.

How do I let AI crawlers read my site?

Add explicit User-agent / Allow: / blocks in robots.txt for the bots you want (GPTBot, ClaudeBot, PerplexityBot, OAI-SearchBot, etc.) and include a Sitemap: line. Remember robots.txt is advisory — enforce hard blocks at the edge.

Does FAQ schema still work in 2026?

FAQ rich results are being retired by Google for most sites, so you won't get the snippet. The JSON-LD is still valid markup, and a real FAQ section (question-shaped headings + self-contained answers) still helps AI comprehension.
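
If you keep the markup anyway, it can be generated from the same question and answer pairs that appear on the page. A minimal sketch with a hypothetical `FaqItem` shape; the structure (`FAQPage`, `Question`, `acceptedAnswer`, `Answer`) follows Schema.org.

```typescript
// Hypothetical input shape: the visible questions and their plain-text answers.
interface FaqItem {
  question: string;
  answer: string;
}

// Build FAQPage JSON-LD with one Question entity per item.
function faqPageJsonLd(items: FaqItem[]) {
  return {
    "@context": "https://schema.org",
    "@type": "FAQPage",
    mainEntity: items.map((item) => ({
      "@type": "Question",
      name: item.question,
      acceptedAnswer: { "@type": "Answer", text: item.answer },
    })),
  };
}

const faqJsonLd = faqPageJsonLd([
  {
    question: "What is llms.txt and do I need it?",
    answer: "A Markdown index of your site's key content, served at /llms.txt.",
  },
]);
console.log(JSON.stringify(faqJsonLd, null, 2));
```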

Sources: llmstxt.org, Google AI optimization guide, Google: web publisher controls, Schema.org Article docs, Search Engine Land.

Image: Markus Spiske, CC0 / public domain, via Wikimedia Commons.

