Right now, AI bots might be reading your website.
They're extracting information, learning from your content, and using it to answer user questions. Whether you want them to or not.
That's LLM scraping. And you have some choices to make about it.
Two types of LLM scraping
Training data scraping happens before an AI model is deployed. Companies like OpenAI scraped massive amounts of web content to train their models. Your content from years ago might be in there.
You can't influence this retroactively. It's baked in. And it's nearly impossible to know what was included.
Real-time scraping happens when AI tools search the web to answer current queries. Perplexity does this for every question. ChatGPT does it when it needs current information. Google's AI Overviews pull from web sources.
This real-time scraping is where your current content matters. It's also where you have control.
You can control AI access
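If you want to block AI crawlers, robots.txt directives work for bots that honor them. Compliance is voluntary, but the major AI vendors publish their bot names and say they respect these rules: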
User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Disallow: /
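Different AI tools have different bot names. You can block some and allow others. Some vendors also run separate bots for training and for live retrieval: OpenAI, for example, documents GPTBot for training and OAI-SearchBot and ChatGPT-User for search and user-requested browsing. Check each vendor's documentation before you decide which names to list.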
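Before you deploy rules like these, it can help to confirm they do what you expect. Here is a minimal sketch using Python's standard-library robots.txt parser. The function name `allowed_agents`, the example.com URL, and the choice of Googlebot as the "allowed" bot are illustrative assumptions, and the parser's matching logic is Python's, not necessarily identical to what any given crawler implements.

```python
from urllib.robotparser import RobotFileParser

# The rules from the example above, as they would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Disallow: /
"""


def allowed_agents(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Return, for each user agent, whether the rules let it fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


# Hypothetical URL; any path on the site is covered by "Disallow: /".
result = allowed_agents(
    ROBOTS_TXT,
    ["GPTBot", "PerplexityBot", "Googlebot"],
    "https://example.com/pricing",
)
print(result)
# {'GPTBot': False, 'PerplexityBot': False, 'Googlebot': True}
```

Bots with no matching rule are allowed by default, which is why Googlebot comes back as True here.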
Why you might block
Maybe you want to protect premium content. If your business model depends on users paying for access, having AI give away your insights for free doesn't help.
Maybe you're concerned about AI using your content without attribution. When AI synthesizes information from multiple sources, your original work might not get credited.
Maybe you want control over how your brand appears. If AI misrepresents your content or takes things out of context, that's a problem.
Why you might allow
Here's the trade-off: blocking LLM scraping removes you from AI-mediated discovery.
If AI can't read your site, it can't recommend you. It can't cite you in answers. It can't tell users about you. You become invisible to a growing channel.
For most businesses seeking visibility, allowing (and optimizing for) LLM scraping makes sense. The exposure is worth the loss of control.
Making your content scrape-friendly
If you want AI to scrape and cite your content effectively, help it out.
Structure clearly. AI extracts information more easily from well-organized content with clear headings and logical flow.
Provide direct answers. Content that directly answers questions is more likely to be cited. Don't make AI dig through paragraphs.
Keep it accessible. Content behind paywalls or heavy JavaScript may not be scraped effectively. AI bots aren't going to log in, and many won't wait for your React app to render.
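One rough way to test the accessibility point is to look at what a bot sees if it reads your raw HTML and never runs JavaScript. The sketch below strips script and style content and returns the remaining text, using only Python's standard library. The class and function names and both sample pages are hypothetical, and real crawlers differ in how they process pages, so treat it as a sanity check rather than a simulation of any specific bot.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text outside <script>, <style>, and <template> elements."""

    SKIP = {"script", "style", "template"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    """Return the text a non-rendering reader would find in `html`."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical pages: one server-rendered, one an empty shell filled in by JavaScript.
SERVER_RENDERED = "<html><body><h1>Pricing</h1><p>Plans start at $10 per month.</p></body></html>"
JS_SHELL = '<html><body><div id="root"></div><script>renderApp("Plans start at $10 per month.")</script></body></html>'

print("Plans start" in visible_text(SERVER_RENDERED))  # True
print("Plans start" in visible_text(JS_SHELL))         # False
```

If the key sentences on a page only show up in the second form, a bot that skips rendering has nothing to cite.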
The ethics debate continues
LLM scraping raises real questions. Who owns content? Is training on scraped data fair use? Should AI companies pay publishers? What about attribution?
These debates aren't settled. Different countries are taking different approaches. Some publishers are suing. Others are striking deals.
But the practical reality is clear: AI is scraping the web. Whether that's right or wrong, being part of what it scrapes affects your visibility. You need to decide how to play the game as it exists, not as you wish it were.