robots.txt

By default, WordPress generates the response to requests made to a site’s /robots.txt. This output can be modified with WordPress actions and filters.

If a WordPress environment is still accessed by its convenience domain (that is, a custom domain is not yet assigned as the primary domain), the platform applies default settings for /robots.txt that prevent search engines from indexing the environment’s content.

The default output of /robots.txt for environments with a convenience domain:

/robots.txt

User-agent: *
Disallow: /

In addition to an environment’s /robots.txt, an x-robots-tag: noindex, nofollow HTTP response header is returned for all requests made to:

Production environments that do not yet have a custom domain assigned as the primary domain.
Non-production environments at all times regardless of the domains assigned to them.
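
One way to confirm the header described above is to inspect the response headers of an environment. The following is a minimal sketch, not a platform utility: the function name example_has_noindex_header is hypothetical, and it assumes the headers are supplied as an array of raw header lines.

```php
<?php
// Hypothetical helper (not part of WordPress or the platform): returns true
// when a list of raw response header lines includes an X-Robots-Tag header
// that carries the noindex directive.
function example_has_noindex_header( array $header_lines ): bool {
    foreach ( $header_lines as $line ) {
        // Split "Name: value" once; lines without a colon (such as the
        // status line) are skipped.
        $parts = explode( ':', $line, 2 );
        if ( 2 !== count( $parts ) ) {
            continue;
        }

        // Header names are case-insensitive, so compare in lowercase.
        if ( 'x-robots-tag' !== strtolower( trim( $parts[0] ) ) ) {
            continue;
        }

        // The header value is a comma-separated list of directives.
        $directives = array_map( 'trim', explode( ',', strtolower( $parts[1] ) ) );
        if ( in_array( 'noindex', $directives, true ) ) {
            return true;
        }
    }

    return false;
}
```

The header lines could come from PHP’s get_headers() for an environment URL. That function returns false on failure, so its result should be checked before it is passed to the helper.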

Modify /robots.txt output

Prerequisite

The output of a site’s /robots.txt can only be modified if the site’s environment (production or non-production) has a custom domain set as the primary domain.

To modify /robots.txt for a site, hook into the do_robotstxt action or filter the output by hooking into the robots_txt filter.

In most cases, custom code to override the default output of /robots.txt can be added to a theme’s functions.php. For sites that require more tailored search engine crawling directives, custom code can be selectively added and enabled with a site-specific plugin.

Action

do_robotstxt

In this code example, the do_robotstxt action is used to disallow crawling of a specific directory for all user agents:

/themes/example-theme/functions.php

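// The do_robotstxt action fires before WordPress prints its own default rules,
// so the lines echoed here appear first in the response.
function my_robotstxt_disallow_directory() {
    echo 'User-agent: *' . PHP_EOL;
    echo 'Disallow: /path/to/your/directory/' . PHP_EOL;
}
add_action( 'do_robotstxt', 'my_robotstxt_disallow_directory' );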

Filter

robots_txt

In this code example, the output of /robots.txt is modified using the robots_txt filter:

/themes/example-theme/functions.php

function my_robots_txt_disallow_private_directory( $output, $public ) {
    $output .= 'Disallow: /wp-admin/' . PHP_EOL;
    $output .= 'Allow: /wp-admin/admin-ajax.php' . PHP_EOL;

    // Add custom rules here
    $output .= 'Disallow: /private-directory/' . PHP_EOL;
    $output .= 'Allow: /public-directory/' . PHP_EOL;

    return $output;
}
add_filter( 'robots_txt', 'my_robots_txt_disallow_private_directory', 10, 2 );
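
The second argument passed to the robots_txt filter, $public, reflects the site’s “Search engine visibility” setting. As a minimal sketch, a callback could use it to add custom rules only when the site is public. The function name and the directory path are hypothetical, and the function_exists() guard is included only so that the snippet can also be loaded outside WordPress.

```php
<?php
// Hypothetical callback: append a custom rule only when the site is public.
function example_robots_txt_public_only( $output, $public ) {
    // WordPress passes the stored visibility setting ('0' or '1'), so a
    // loose truthiness check is used here. When search engines are
    // discouraged, the generated output is returned unchanged.
    if ( ! $public ) {
        return $output;
    }

    $output .= 'Disallow: /private-directory/' . PHP_EOL;

    return $output;
}

// Guarded so the file can also be loaded outside WordPress (for example,
// in a quick standalone test).
if ( function_exists( 'add_filter' ) ) {
    add_filter( 'robots_txt', 'example_robots_txt_public_only', 10, 2 );
}
```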

Disallow AI crawlers

Use the robots_txt filter to configure a site’s /robots.txt to disallow crawling by artificial intelligence (AI) crawlers.

Note

Additional restrictions on access to a site’s content can be put in place for AI crawlers with the VIP_Request_Block utility class.

In this code example, a site’s /robots.txt is configured to disallow requests from User-Agents of well-known AI crawlers (e.g. OpenAI’s GPTBot).

Only four AI crawlers are included in this code example, though far more exist. Customers should research which AI crawler User-Agents should be disallowed for their site and include them in a modified version of this code example.

vip-config/vip-config.php

function my_robots_txt_block_ai_crawlers( $output, $public ) {
    $output .= '
## OpenAI GPTBot crawler (https://platform.openai.com/docs/gptbot)
User-agent: GPTBot
Disallow: /

## OpenAI ChatGPT service (https://platform.openai.com/docs/plugins/bot)
User-agent: ChatGPT-User
Disallow: /

## Common Crawl crawler (https://commoncrawl.org/faq)
User-agent: CCBot
Disallow: /

## Google Bard / Gemini crawler (https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers)
User-agent: Google-Extended
Disallow: /
';

    return $output;
}
add_filter( 'robots_txt', 'my_robots_txt_block_ai_crawlers', 10, 2 );
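
When the list of crawlers is expected to grow, the same rules could be generated from an array instead of a single string. The following is a sketch under that assumption: the function name is hypothetical, the User-Agent tokens are the four from the example above, and the function_exists() guard is included only so that the snippet can also be loaded outside WordPress.

```php
<?php
// Hypothetical callback: append one "Disallow: /" group per listed User-Agent.
function example_robots_txt_block_user_agents( $output, $public ) {
    // User-Agent tokens from the example above; extend this list after
    // researching which crawlers should be disallowed for the site.
    $user_agents = array( 'GPTBot', 'ChatGPT-User', 'CCBot', 'Google-Extended' );

    foreach ( $user_agents as $user_agent ) {
        // A blank line separates each group from the preceding rules.
        $output .= PHP_EOL;
        $output .= 'User-agent: ' . $user_agent . PHP_EOL;
        $output .= 'Disallow: /' . PHP_EOL;
    }

    return $output;
}

// Guarded so the file can also be loaded outside WordPress.
if ( function_exists( 'add_filter' ) ) {
    add_filter( 'robots_txt', 'example_robots_txt_block_user_agents', 10, 2 );
}
```

Keeping the tokens in an array means that adding or removing a crawler is a one-line change, at the cost of losing the per-crawler comments in the generated file.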

Discourage search engines

If the content of a WordPress environment with an assigned primary domain should not be accessible for indexing by search engines, the output of /robots.txt can be programmatically modified.
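
As a minimal sketch of such a modification, a robots_txt filter callback could replace the generated output with the same rules that are applied to environments with a convenience domain. The function name is hypothetical, and the function_exists() guard is included only so that the snippet can also be loaded outside WordPress.

```php
<?php
// Hypothetical callback: replace the generated output so that all user
// agents are disallowed from crawling the entire site.
function example_robots_txt_disallow_all( $output, $public ) {
    // The incoming $output is intentionally discarded rather than appended to.
    return 'User-agent: *' . PHP_EOL . 'Disallow: /' . PHP_EOL;
}

// Guarded so the file can also be loaded outside WordPress.
if ( function_exists( 'add_filter' ) ) {
    add_filter( 'robots_txt', 'example_robots_txt_disallow_all', 10, 2 );
}
```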

In addition, a setting in a site’s WordPress Admin dashboard can be enabled to discourage search engines.

In the WP Admin, select Settings -> Reading from the left-hand navigation menu.
Toggle the setting labeled “Search engine visibility” and enable the option “Discourage search engines from indexing this site”.
Select the button labeled “Save Changes” to save the setting.

Test modifications

Modifications made to /robots.txt should be tested on a non-production environment first. If the non-production environment is mapped to a convenience domain (i.e. a custom primary domain is not assigned to the environment), the environment’s default /robots.txt must be temporarily overridden to allow for testing.

The following code for overriding the environment’s default /robots.txt output can be added to a file in /plugins or /client-mu-plugins:

remove_filter( 'robots_txt', 'Automattic\VIP\Core\Privacy\vip_convenience_domain_robots_txt' );

Caching

A site’s /robots.txt is cached for long periods of time by the page cache. After changes are made to /robots.txt, the cached version can be purged by using the VIP Dashboard or VIP-CLI.

The cached version of /robots.txt can also be cleared from within the WordPress Admin dashboard.

In the WP Admin, select Settings -> Reading from the left-hand navigation menu.
Toggle the “Search engine visibility” setting, and select the button labeled “Save Changes” each time the setting is changed.

Last updated: April 18, 2024