Beyond Robots.txt: Implementing AI.txt and LLMs.txt for Purpose-Based Scraping Control

Robots.txt is a voluntary, non-enforceable protocol designed in 1994 for polite search engine crawlers.

In 2026, many modern AI scrapers harvest data aggressively for model training. While reputable AI bots may honor robots.txt, many others ignore it entirely, use residential IPs to bypass rate limits, or spoof their identity to pass as normal web browsers.

Thus, robots.txt alone is no longer sufficient for AI scraping control: against non-compliant crawlers, it simply doesn't work.

Read about alternative solutions for AI scraping control and how to implement robots.txt alternatives (ai.txt and llms.txt) for purpose-based scraping control.

What Are AI.txt and LLMs.txt?

AI.txt and LLMs.txt are emerging attempts to standardize the rules for how AI systems may (and may not) use websites' content.

Robots.txt was built for search engines. It tells crawlers what they can index.

But AI systems don't just index: they train on, summarize, remix, and generate new content from what they crawl. Until recently, this use was practically unregulated.

AI.txt and LLMs.txt (robots.txt alternatives) aim to set rules on how to use content:

  • AI.txt focuses on how content can be used (training, summarization, etc.).
  • LLMs.txt defines which AI systems (LLMs) can access your content.

 

It’s better to use AI.txt and LLMs.txt instead of robots.txt for these reasons:

  • Improved accuracy
    AI.txt and LLMs.txt reduce errors by giving AI direct access to cleaned content.
  • Better AI SEO
    Ensures that tools like ChatGPT or Perplexity, when answering user queries, extract from the most relevant and updated text-based sources on your site.
  • Content curation
    Allows website owners to guide AI crawlers on which information should be presented first.

AI.txt vs LLMs.txt: What’s the Difference?

llms.txt and ai.txt are both emerging, unofficial, text-based protocols designed to help AI systems understand and interact with websites, but they serve different purposes. Basically, AI.txt defines how content can be used, while LLMs.txt defines who can access your site.

Although AI.txt and LLMs.txt may sound interchangeable, they are not.

 

LLMs.txt

llms.txt is a proposed standard— a plain text/Markdown file placed at a website's root (yoursite.com/llms.txt) that defines who can access your site.

LLMs.txt targets specific AI providers or models. It is designed to provide LLMs with a low-noise, summarized overview of key content and documentation.

Its purpose is to assist Large Language Models (LLMs) in finding, understanding, and accurately citing a website's key content.

Target audience: AI tools such as ChatGPT, Claude, and Gemini that answer user questions in real-time.

LLMs.txt is best suited to content-rich sites (such as blogs, technical documentation, or SaaS sites) seeking accurate AI citations.

It simplifies content digestion for AI, avoiding complex HTML and reducing hallucinations.
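
To make this concrete, here is a minimal sketch of what an llms.txt file might look like, following the Markdown-overview form described above. The site name, section headings, and URLs are placeholders rather than part of any finalized specification:

```markdown
# Example Docs

> Example Docs is a documentation site for a cookie consent tool. The links below are the pages we consider most useful for AI assistants answering questions about the product.

## Documentation

- [Getting started](https://example.com/docs/getting-started): installation and initial setup
- [Banner configuration](https://example.com/docs/banner): full settings reference

## Policies

- [Terms of service](https://example.com/terms): conditions for reusing site content
```

Because the file is plain Markdown, an LLM can ingest it directly, without parsing navigation menus, scripts, or other HTML noise.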

 

AI.txt

ai.txt is a proposed, plain-text standard placed in a website's root directory that defines how to use content. It allows website owners to manage how AI crawlers, LLMs, and synthetic data trainers use their content.

ai.txt targets AI crawlers, such as GPTBot or ClaudeBot, that scrape sites for data ingestion.

Its purpose is to give site owners control over how their content is used, especially regarding AI training.

ai.txt is best used in sites that want to restrict their content from being used to train third-party AI models while still allowing AI interaction.

Its format is a plain text file (yourdomain.com/ai.txt) that sets rules for content use in training, summarization, or other purposes.

Key advantage: ai.txt is simpler than llms.txt and offers a more direct opt-out or opt-in mechanism.
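
Since ai.txt is still a proposal rather than a finalized standard, the exact syntax varies between implementations. The sketch below is illustrative only: it borrows robots.txt-style User-Agent grouping and the purpose tags discussed later in this article (No-Training, No-Inference, Allow-RAG), and the bot names and paths are placeholders:

```
# ai.txt - illustrative sketch, not a finalized specification
User-Agent: GPTBot
No-Training: /            # do not use any page to train or update models
Allow-RAG: /docs/         # documentation may be cited in answers that link back to us

User-Agent: CCBot
No-Training: /
No-Inference: /           # do not use any page, even for real-time answers

User-Agent: *
No-Training: /blog/
```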

Most sites use both LLMs.txt and ai.txt together.

AI.txt vs Robots.txt: Key Differences at a Glance

In 2026, the distinction between AI.txt and Robots.txt has become a significant dividing line between content creators and Artificial Intelligence. Robots.txt was designed to help search engines surface your content, while AI.txt was designed to prevent AI from taking your content without permission.

Here are the key differences at a glance.

Comparing Robots.txt (the legacy standard) with AI.txt (the 2026 initiative), feature by feature:

  • Primary goal. Robots.txt: SEO and indexing control; it tells search engines which pages to show in results. AI.txt: usage rights; it tells AI models whether they can use your data and for what purposes.
  • Designed for. Robots.txt: search engines. AI.txt: AI systems.
  • Granularity. Robots.txt: directory-based (Disallow: /private/). AI.txt: permission-based (No-Training, No-Inference).
  • Legal status. Robots.txt: a de facto standard, but not obligatory. AI.txt: backed by the EU AI Act when combined with licensing and Text and Data Mining (TDM) rules.
  • Implementation. Robots.txt: root directory (yoursite.com/robots.txt). AI.txt: usually /.well-known/ai.txt or the root directory.
  • User benefit. Robots.txt: helps your site get traffic from search. AI.txt: protects your intellectual property from AI scraping.

 

In conclusion, AI.txt and Robots.txt provide a completely different level of control. Robots.txt tells search engines whether they can access your page, while AI.txt informs AI what it is allowed to do with the content.

Use CookieScript Cookie Scanner to scan your website and see what website cookies and other trackers are used.

Why Robots.txt Falls Short for AI Crawlers

Robots.txt falls short for AI crawlers because it is a voluntary, non-enforceable guideline in an increasingly hostile data environment where many AI agents simply ignore it. Robots.txt isn't broken; it's just outdated for AI control.

Robots.txt has long been the standard for managing web crawlers. Even now, it performs its function well, telling search engines whether they can access the page or not. Search engine bots, such as Googlebot, respect robots.txt and maintain a mutually beneficial relationship.

Unlike traditional search engine bots, many AI crawlers were designed for massive data ingestion to train large language models (LLMs). Studies show that up to 72% of AI crawlers violate robots.txt rules.

Here is why robots.txt fails for AI crawlers:

  • It is voluntary and non-enforceable
    Robots.txt relies on the honor system, which many AI crawlers do not respect.
  • Widespread violations
    AI crawlers frequently violate robots.txt. One study found an average of 156 violating requests per site over a three-week period in 2025.
  • Ambiguity in AI agents
    It is difficult for website owners to identify and block every single AI crawler, as new ones are constantly emerging, and many operate anonymously.
  • Confused purposes (Crawling vs. Training)
    Robots.txt cannot distinguish between crawling for search and crawling for training, and the line between the two is increasingly blurred.
  • Ineffectiveness against malicious scrapers
    Robots.txt does not protect against malicious scrapers that explicitly ignore these rules to steal content.

What Is Purpose-Based Scraping Control?

In 2026, purpose-based scraping control is the technical and legal framework used by website owners to grant or deny access to bots based on how the data will be used, rather than just who is collecting it.

Purpose-based scraping control replaced traditional blocking. Previously, webmasters used robots.txt to block specific bots, such as GPTBot. However, blocking by bot identity proved an inefficient tool. In 2026, a single company might scrape your site for two completely different reasons:

  • Search/attribution: to show a link to your site in an AI answer. This activity generates traffic for you.
  • Training: to ingest your content into a training dataset. This gives you no benefit.

 

With purpose-based control, you can allow AI bots to use your site for search or attribution but block them for training purposes.

Thus, you can define purposes for AI scraping control of your site:

  • Indexing (search engines)
  • Training (LLM datasets)
  • Summarization (AI answers, snippets)
  • Commercial reuse (paid AI products)

 

Each of these has very different implications. Some may be beneficial to you, while others offer no benefit or even harm you by taking your content.

This is a strategic shift in your website’s management. Instead of just traffic control, now you perform data rights management.

Technical implementation: AI.txt & TDMRep

In 2026, the ai.txt file has become the standard way to set permissions for how your site may be used. It uses specific tags to define the allowed purposes:

  • No-Training: Prohibits using data to train or update LLM models.
  • No-Inference: Prohibits using data to generate a real-time answer.
  • Allow-RAG: Allows the bot to access your page to provide an answer if the bot references back to you.
  • TDMRep (Text and Data Mining Reservation Protocol): This is the high-integrity W3C standard that embeds these permissions into the HTTP headers of every page, making them legally binding in the EU.
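
As an illustration of the TDMRep part, the protocol lets a server announce its reservation in HTTP response headers (it also defines an HTML meta tag and a /.well-known/tdmrep.json file). A minimal sketch, assuming the policy URL points to your own machine-readable licensing terms:

```
HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
tdm-reservation: 1
tdm-policy: https://yourdomain.com/licenses/tdm-policy.json
```

Here tdm-reservation: 1 declares that text and data mining rights are reserved, and tdm-policy tells compliant crawlers where to fetch the licensing conditions under which mining would be allowed.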

Legal requirements

The EU AI Act (Article 53) sets legal requirements for General Purpose AI (GPAI) providers. GPAI providers are legally required to put in place a copyright policy and to respect machine-readable signals that define what they are allowed to do on a site and what is prohibited.

Thus, purpose-based control is legally binding.

If you set a purpose-based control (no-training, no-inference), an AI company must honor it. If it ignores the signal, it is violating the EU AI Act, regardless of whether the data is publicly accessible or not.

Economic impact

Purpose-based scraping control allows websites to implement a bot paywall.

Websites can use platforms like TollBit or Human Security, which can detect the intent of a scraper and route it accordingly:

  • When a bot is a search engine: access is free.
  • When a bot tries to reach your site for AI training: the bot is redirected to a paywall where it must pay a licensing fee per megabyte of data scraped.

 

Purpose-based control differs from legacy control (robots.txt) in several respects. Robots.txt is a binary signal (allow or block) and carries no legal weight. Purpose-based control allows granular selection (yes to search, no to training), and it is a legal requirement regulated by the EU AI Act and the TDM Directive.

How to Implement LLMs.txt on Your Website

LLMs.txt implementation is quite simple. The tricky part is setting the right policy and ensuring enforcement.

To implement LLMs.txt on your website, follow these steps:

1. Create the file

Just like robots.txt, you place it at the root of your domain:

https://yourdomain.com/llms.txt

 

2. Define allowed and disallowed AI agents

Define AI provider identifiers and what is allowed or disallowed for each of them.

For example (simplified):

User-Agent: OpenAI
Allow: /

User-Agent: *
Disallow: /

This means that OpenAI is allowed to access your site, while everyone else is blocked.

 

3. Combine with AI.txt (if used)

Many websites use both LLMs.txt and AI.txt:

LLMs.txt defines who can access your content.

AI.txt regulates the use of content (training, summarization, etc.).

This layered approach gives you detailed control of your content.

 

4. Test and monitor

LLMs.txt enforcement could be the most difficult part; many webmasters don’t test it correctly or don’t monitor implementation at all.

To test LLMs.txt enforcement, you need to:

  • Check server logs for AI crawler user agents (see the sketch after this list).
  • Monitor unusual traffic patterns.
  • Adjust rules over time.
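
As a starting point for the log check, here is a minimal Python sketch that counts requests from known AI crawler user agents in a standard access log. The log path and the user-agent substrings are assumptions; adjust them to your server and to the crawlers listed later in this article:

```python
import re
from collections import Counter

# User-agent substrings of AI crawlers to look for (adjust to your needs).
AI_BOTS = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot",
           "Google-Extended", "CCBot", "Bytespider", "Meta-ExternalAgent"]

# Assumed path to a combined-format access log (nginx/Apache default layout).
LOG_PATH = "/var/log/nginx/access.log"

hits = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        # In the combined log format, the user agent is the last quoted field.
        quoted = re.findall(r'"([^"]*)"', line)
        if not quoted:
            continue
        user_agent = quoted[-1]
        for bot in AI_BOTS:
            if bot in user_agent:
                hits[bot] += 1

for bot, count in hits.most_common():
    print(f"{bot}: {count} requests")
```

Run a check like this periodically and compare the counts against the rules you published: a crawler you disallowed that keeps appearing in the log is a candidate for server-level blocking or rate limiting.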

AI Crawlers to Know in 2026

In 2026, the crawler landscape has shifted from search engines to AI trainers and answer engines. While Google and Bing bots are still present, AI crawlers are consuming your bandwidth as well, and their share is rapidly increasing.

Here are the most popular and active crawlers in 2026, categorized by their primary function.

1. The Top Three crawlers (traditional + hybrid search)

These bots represent the largest volume of traffic. Their primary mission is mixed: they index websites for traditional search but also feed their respective AI models (Gemini, Copilot, and Meta AI).

  1. Googlebot/ Google-Extended (Gemini)
    It’s the #1 crawler globally, taking approximately 31.6% of all bandwidth, though its dominance is slightly slipping. Google-Extended is used specifically to crawl web content to help train, refine, and improve the capabilities of Gemini and Vertex AI, not for search ranking.
  2. Meta-ExternalAgent
    This is the #2 most active AI crawler in 2026, taking 16.7% bandwidth share. It scrapes data specifically to train Meta’s Llama models and power Meta AI across Instagram and WhatsApp.
  3. Bingbot (Microsoft Copilot)
    Controlled by Microsoft, it feeds both the Bing search index and the data for Microsoft Copilot.

2. The most popular AI crawlers in 2026

These are crawlers used specifically for LLM training and real-time search. They are often more aggressive than traditional search bots.

The most popular AI crawlers in 2026 include:

  1. GPTBot & OAI-SearchBot (OpenAI)
    GPTBot is used for massive offline training, while OAI-SearchBot handles real-time ChatGPT search queries. Together, they account for about 14% of AI crawler traffic.
  2. ClaudeBot (Anthropic)
    A very high-activity bot used to train the Claude models. At the start of 2026, its volume increased by 800% as Anthropic scaled its web search API.
  3. Applebot / Applebot-Extended
    The activity of Applebot increased significantly in 2026. With the rollout of Apple Intelligence, Applebot has surged to 5.8% of all crawl traffic, surpassing Amazon and ByteDance. Applebot-Extended is a secondary user agent from Apple that allows website owners to control whether their content could be used to train Apple’s generative AI foundation models.
  4. PerplexityBot
    The primary bot for Perplexity’s answer engine. It is known for high-frequency, targeted crawling of news and high-authority blogs.

3. The aggressive data harvesters

These bots are often blocked by default by security firewalls like Cloudflare because they are used only for training and provide zero referral traffic back to your site.

The most widespread include:

  1. Bytespider
    Owned by ByteDance (TikTok), Bytespider is notoriously aggressive and often consumes more bandwidth than Googlebot on smaller sites.
  2. Amazonbot
    It scrapes for Alexa, Echo, and Amazon’s internal AI shopping assistants.
  3. CCBot (Common Crawl)
    It's the crawler behind the open Common Crawl dataset. Most AI models, including those from smaller startups, use the data CCBot collects.

Best Practices for Controlling AI Scraping

So, how do you protect content from AI training, and how do you control AI scraping without blocking search engines?

Read these best practices for AI scraping control and LLM crawler control:

  1. Don’t rely on one method
    To restrict AI training data, use a layered approach: robots.txt for baseline control, LLMs.txt for access control, and AI.txt for usage control (see the robots.txt sketch after this list).
    Also, put Terms of Service in place, implement rate limiting, and monitor access to your website.
  2. Define your priorities first
    It’s not good practice to block everything, because you will lose traffic on your site. Before implementing any rule for controlling AI scraping, decide:
    Do you want AI visibility?
    Do you allow training usage?
    Where’s your revenue risk?
  3. Monitor your website access
    To prevent AI scraping on your website, you need to know exactly what's happening on your site. Don't assume; rely on real data. Monitor this activity on your site:
    Server logs
    Bot activity patterns
    Traffic anomalies
  4. Set realistic goals
    Not all AI scrapers will respect your rules. Actually, there are many aggressive scrapers that don’t respect AI.txt and LLMs.txt. Thus, try to manage AI bots on your website as much as possible, but do not expect full compliance with your rules. Set rules to reduce risk, not to prevent it entirely.
  5. Keep policies flexible
    AI crawlers are evolving fast. Thus, AI crawler blocking should be flexible. What works today might be irrelevant in several months. Build rules now but revisit them regularly. AI scraping control must be up to date.
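
As a baseline for the layered approach in point 1, here is a robots.txt sketch that keeps search indexing open while disallowing the training-focused crawlers named earlier in this article. User-agent tokens change over time and this list is not exhaustive, so verify the current names before deploying:

```
# Allow classic search crawlers
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# Opt out of Gemini / Vertex AI training without affecting Google Search
User-agent: Google-Extended
Disallow: /

# Block AI training and dataset crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /
```

Remember the caveat from point 4: this only stops crawlers that choose to comply, so pair it with LLMs.txt, AI.txt, rate limiting, and log monitoring.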

Use CookieScript CMP to scan your website for website cookies and other trackers, deliver a Cookie Banner, and obtain and store Cookie Consent.

CookieScript CMP has the following features:

 

It also offers a 14-day free trial.

Frequently Asked Questions

What is ai.txt?

AI.txt is a proposed plain-text standard for managing how AI models and crawlers interact with website content. It allows publishers to specify whether their content can be used for AI training, summarization, or commercial purposes.

What is llms.txt?

llms.txt is a proposed plain text file standard placed at a website's root (yoursite.com/llms.txt) that defines who can access your site. LLMs.txt targets specific AI providers or models. It is designed to provide LLMs with a low-noise, summarized overview of key content and documentation.

How to block AI training but allow indexing?

To block AI training but allow search engines to index your site, create a targeted robots.txt file that allows search crawlers, such as Googlebot, while explicitly prohibiting AI-specific crawlers, such as GPTBot or Google-Extended. Use CookieScript Cookie Scanner to scan your website for cookies and other trackers.

How to stop AI bots from using my content?

Use a multi-layered approach, focusing on instructing compliant bots to stay away and implementing technical blocks for aggressive bots that ignore rules. First, use robots.txt as a first filter to block known AI crawlers. Second, add LLMs.txt to control who can access your site. Third, implement AI.txt to control how to use your content.

What’s the difference between purpose-based control and Robots.txt?

Purpose-based control differs from legacy control (Robots.txt) in several aspects. Robots.txt is a binary signal (allow or block) and is a voluntary standard without legal weight. Purpose-based control allows granular bot selection (Yes to search, No to training), and it is a legal requirement regulated by the EU AI Act and TDM Directive.

How to implement LLMs.txt on your website?

To implement LLMs.txt on your site, create the file (https://yourdomain.com/llms.txt), define allowed and disallowed AI agents, and test its implementation: check server logs for AI crawlers and monitor unusual traffic patterns. You could also combine LLMs.txt with AI.txt.

What is purpose-based scraping control?

In 2026, purpose-based scraping control is the technical and legal framework used by website owners to grant or deny access to bots based on what the data will be used for, rather than just who is collecting it. With purpose-based control, you can allow AI bots to use your site for search or attribution but block them for training purposes. It is implemented using Ai.txt and LLMs.txt standards.

What is the best way to control AI crawlers?

To prevent AI scraping your website, use these best practices: implement a layered approach, define your priorities (what to block and what to allow), monitor your website access, and keep policies flexible, since AI crawlers are evolving fast. Do not expect full compliance; thus, set the rules to reduce risk, not prevent it totally. 

New to CookieScript?

CookieScript helps make your website ePrivacy and GDPR compliant.

We have all the necessary tools to comply with the latest privacy policy regulations: third-party script management, consent recording, monthly website scans, automatic cookie categorization, cookie declaration automatic update, translations to 34 languages, and much more.