# Robots.txt
## When to Use
Configure `robots.txt` on every Drupal site before launch. It controls which paths search engines crawl. The default Drupal core `robots.txt` covers common admin paths; extend it for site-specific private areas, staging environments, and AI crawler policies.
## Decision
| Situation | Choice | Why |
|---|---|---|
| Standard Drupal site | Edit core robots.txt directly | Simplest — it ships with sensible defaults |
| Drupal CMS with seo_tools recipe applied | Use robots.append.txt | Recipe manages the base; append.txt for custom rules |
| Staging / dev environment | Disallow all crawlers | Prevent indexing of non-production content |
| Need AI crawler-specific rules | robots.txt + per-agent Disallow | Training bots vs search bots need separate rules |
| Composer-managed project overwriting robots.txt | robots.txt via scaffolding config | Prevent composer from resetting your edits |
## Pattern
### Core robots.txt Location
Drupal ships robots.txt in the webroot. Its default rules:
```
User-agent: *
Crawl-delay: 10
# Drupal admin and system paths
Disallow: /admin/
Disallow: /comment/reply/
Disallow: /filter/tips
Disallow: /node/add/
Disallow: /search/
Disallow: /user/register
Disallow: /user/password
Disallow: /user/login
Disallow: /user/logout
# Files directories
Disallow: /sites/*/files/private/
# Sitemap reference (add this)
Sitemap: https://example.com/sitemap.xml
```
### drupal_cms_seo_tools — robots.append.txt
When using the drupal_cms_seo_tools recipe, custom rules go in robots.append.txt (project webroot). The recipe merges this file with core robots.txt during deployment:
```
# robots.append.txt — project-specific additions
# Block access to private staging areas
Disallow: /stage/
Disallow: /preview/
# AI crawler policies — see ai-crawler-policy.md for full list
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: ClaudeBot
Disallow: /
```
### Essential Rules to Always Include
```
# Admin UI — always block
Disallow: /admin/
Disallow: /node/*/edit
Disallow: /node/*/delete
# System paths
Disallow: /batch
Disallow: /cron/
Disallow: /update.php
Disallow: /install.php
# Private files
Disallow: /sites/*/files/private/
# Views exposed filters (prevent crawl traps)
Disallow: *?*
# Sitemap — always include
Sitemap: https://example.com/sitemap.xml
```
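The prefix-based rules above can be sanity-checked locally with Python's standard `urllib.robotparser` before deploying. A minimal sketch follows; note that `robotparser` matches `Disallow` values as plain path prefixes and does not implement Google-style `*` wildcards, so wildcard rules like `/node/*/edit` are honored by real crawlers but cannot be verified this way:

```python
from urllib.robotparser import RobotFileParser

# A prefix-only subset of the essential rules above
# (urllib.robotparser ignores Google-style * wildcards).
RULES = """\
User-agent: *
Disallow: /admin/
Disallow: /update.php
Disallow: /install.php
"""

rp = RobotFileParser()
rp.parse(RULES.splitlines())

print(rp.can_fetch("*", "/admin/config"))  # False: admin UI blocked
print(rp.can_fetch("*", "/update.php"))    # False: system path blocked
print(rp.can_fetch("*", "/node/42"))       # True: regular content allowed
```

Run this against your full ruleset whenever you edit robots.txt; it catches typos like a missing leading slash before they reach production.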
### Staging Environment — Block Everything
```
# robots.txt for staging.example.com
User-agent: *
Disallow: /
```
Set this via an environment-specific file or server configuration; do not rely solely on robots.txt, as it is advisory and not enforced.
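For the server-configuration route, one sketch (nginx assumed; adjust the vhost to your setup) both serves a disallow-all robots.txt and sends an `X-Robots-Tag: noindex` header, which, unlike robots.txt, asks engines to drop pages they have already indexed:

```nginx
# Staging vhost sketch (nginx assumed; adapt listen/server_name to your setup)
server {
    listen 80;
    server_name staging.example.com;

    # Ask engines to de-index pages they already crawled.
    # Note: an add_header inside a location block replaces, rather than
    # extends, headers added at server level in nginx.
    add_header X-Robots-Tag "noindex, nofollow" always;

    location = /robots.txt {
        default_type text/plain;
        return 200 "User-agent: *\nDisallow: /\n";
    }
}
```

HTTP basic auth on the staging vhost is an even stronger option, since it blocks crawlers and humans alike regardless of robots directives.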
### Protecting robots.txt from Composer Scaffold Overwriting
Add to composer.json to prevent Drupal scaffold from resetting your robots.txt:
```json
"extra": {
    "drupal-scaffold": {
        "file-mapping": {
            "[webroot-dir]/robots.txt": false
        }
    }
}
```
## Common Mistakes
- Wrong: Disallowing `/sites/default/files/` entirely → Right: Only disallow `/sites/*/files/private/`; public media should be crawlable
- Wrong: Using robots.txt as a security mechanism → Right: Robots.txt is advisory only; use access control for real security
- Wrong: Forgetting the `Sitemap:` directive → Right: Add it so crawlers auto-discover the sitemap without Search Console submission
- Wrong: Wildcard `Disallow: *?*` on sites using query parameters for canonical content → Right: Only block faceted navigation query parameters, not all query strings
- Wrong: Same robots.txt on production and staging → Right: Staging must disallow all crawlers to prevent duplicate content indexing
## See Also
- AI Crawler Policy — full table of AI crawler user agents and recommended rules
- XML Sitemap — generating the sitemap to reference here
- Canonical URLs — duplicate content prevention at the URL level
- Reference: Google robots.txt specification
- Reference: Drupal core robots.txt