From time to time you will need to block search engines from accessing an entire WordPress Multisite network.
Scenario 1: A staging site that is an exact replica of the live site. It's a great idea to have one, because it lets you safely experiment with new functionality and/or design.
Scenario 2: An agency has set up a staging site for their client so the client can see the progress during development.
Scenario 3: An agency is about to hire a developer/designer to work on the site. Before access is given, the site is cleaned of any client orders and data.
Password Protection
Password protecting a site (basic authentication) is a valid option. However, if the site is an ecommerce site that uses PayPal, the password will likely cause problems: after each transaction PayPal calls your server to notify it about the transaction. If PayPal can't access the site (because of the password), the orders stay pending/waiting for payment.
How you password protect a folder depends on your control panel (cPanel or Plesk), so search for instructions for the one you use.
Telling search engines not to index the site
The following examples assume that you're using the Apache web server.
This example tells bots not to index the site (using .htaccess).
[code]
<IfModule mod_headers.c>
# https://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag
Header set X-App-Env "Staging"
Header set X-Robots-Tag "noindex, nofollow"
</IfModule>
[/code]
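Blocking all bots (using .htaccess)
This example returns 403 Forbidden to any client whose User-Agent contains one of the listed keywords. It only stops bots that identify themselves in the User-Agent header.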
[code]
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^.*(bot|googlebot|bingbot|Baiduspider|Yandex|HTTrack|crawl|index|download|extract|stripper|sucker|ninja|clshttp|spider|leacher|collector|grabber|webpictures).*$ [NC]
RewriteRule .* - [R=403,L]
</IfModule>
[/code]
Telling bots not to index the site (using a system / must-use (MU) WordPress plugin).
Create the file wp-content/mu-plugins/staging-noindex.php
If wp-content/mu-plugins doesn't exist, create it.
The PHP solution works a little differently. It checks whether the WordPress site's domain contains one of the keywords below. If it does, the system plugin assumes that it's running in a staging environment, and only then is the noindex sent to the browser. That way, when the site is finally moved to a production server, the noindex will not be sent. Pretty smart :)
staging, test, development, dev, sandbox, new, example, sample, testing, clients
Examples: staging.example.com, dev.example.com, client-staging.com
[code]
<?php
/////////////////////////////////////////////////////////////////////////////////////////////////////
// Example 1
/**
 * Appends a robots meta tag to the HTML head to stop search engines from indexing the (staging) site.
 * @author Svetoslav (Slavi) Marinov | http://orbisius.com
 */
function qsandbox_staging_noindex() {
    // Skip CLI runs. The site counts as staging when the host name contains one of the
    // keywords, optionally followed by digits, right before a dot (e.g. dev2.example.com).
    $output_no_index = php_sapi_name() != 'cli'
        && ! empty( $_SERVER['SERVER_NAME'] )
        && preg_match( '#(staging|test|development|dev|sandbox|new|example|sample|testing|clients)\d*\.#si', $_SERVER['SERVER_NAME'] );

    if ( $output_no_index ) {
        echo "\n<!-- Staging -->\n<meta name='robots' content='noindex,nofollow' />\n<!-- /Staging -->\n";
    }
}

add_action( 'wp_head', 'qsandbox_staging_noindex', 0 );
/////////////////////////////////////////////////////////////////////////////////////////////////////
/////////////////////////////////////////////////////////////////////////////////////////////////////
// Example 2
/**
 * Outputs an HTTP header using PHP to stop search engines from indexing the (staging) site.
 * @author Svetoslav (Slavi) Marinov | http://orbisius.com
 */
function qsandbox_staging_noindex_http_headers() {
    // Skip CLI runs. The site counts as staging when the host name contains one of the
    // keywords, optionally followed by digits, right before a dot (e.g. dev2.example.com).
    $output_no_index = php_sapi_name() != 'cli'
        && ! empty( $_SERVER['SERVER_NAME'] )
        && preg_match( '#(staging|test|development|dev|sandbox|new|example|sample|testing|clients)\d*\.#si', $_SERVER['SERVER_NAME'] );

    if ( $output_no_index && ! headers_sent() ) {
        header( 'X-Robots-Tag: noindex, nofollow', true );
    }
}

// Hooked to 'send_headers' rather than 'wp_head': by the time wp_head runs the theme has
// usually started printing HTML, so the header could only be set if output buffering is on.
add_action( 'send_headers', 'qsandbox_staging_noindex_http_headers', 0 );
/////////////////////////////////////////////////////////////////////////////////////////////////////
/////////////////////////////////////////////////////////////////////////////////////////////////////
// Example 3
/**
 * When the robots.txt file is accessed (and no such file exists on disk, WordPress takes control).
 * Tells the search engines via robots.txt not to index the (staging) site.
 * @author Svetoslav (Slavi) Marinov | http://orbisius.com
 * @see http://www.robotstxt.org/faq/prevent.html
 */
function qsandbox_staging_noindex_robots_txt() {
    // Skip CLI runs. The site counts as staging when the host name contains one of the
    // keywords, optionally followed by digits, right before a dot (e.g. dev2.example.com).
    $output_no_index = php_sapi_name() != 'cli'
        && ! empty( $_SERVER['SERVER_NAME'] )
        && preg_match( '#(staging|test|development|dev|sandbox|new|example|sample|testing|clients)\d*\.#si', $_SERVER['SERVER_NAME'] );

    if ( $output_no_index ) {
        // This action fires before WordPress prints its own "User-agent" line,
        // so print a complete group; a bare Disallow line would not belong to any user agent.
        echo "User-agent: *" . PHP_EOL;
        echo "Disallow: /" . PHP_EOL . PHP_EOL;
    }
}

add_action( 'do_robotstxt', 'qsandbox_staging_noindex_robots_txt', 0 );
/////////////////////////////////////////////////////////////////////////////////////////////////////
[/code]
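To get a feel for which host names the keyword check treats as staging, the pattern can be tried outside WordPress. This is only a minimal sketch: the function name `qs_demo_is_staging_host` and the sample host names are made up for illustration and are not part of the plugin. Note that the pattern is not anchored to the start of a label, so a production domain such as `latest.com` (which contains `test.`) would also match. Check your live domain before relying on it.

```php
<?php
/**
 * Hypothetical helper (not part of the plugin): returns true when the host
 * name matches the same keyword pattern used in the three examples.
 */
function qs_demo_is_staging_host( $host ) {
    $pattern = '#(staging|test|development|dev|sandbox|new|example|sample|testing|clients)\d*\.#si';

    return preg_match( $pattern, $host ) === 1;
}

// Sample host names (assumptions, for illustration only).
$hosts = array(
    'staging.example.com',
    'dev2.mysite.com',
    'client-staging.com',
    'www.mysite.com',
    'latest.com', // production-looking domain that still matches because of "test."
);

foreach ( $hosts as $host ) {
    echo $host . ' => ' . ( qs_demo_is_staging_host( $host ) ? 'staging' : 'production' ) . PHP_EOL;
}
```

Of these, only `www.mysite.com` is treated as production.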
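Example 1: the meta tag appears in the page source, inside the head section.
Example 2: the header can be seen using the browser's developer tools. In Chrome you can open them by pressing F12.
Example 3: the Disallow rule appears when you open /robots.txt on the staging site.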
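Besides the browser's developer tools, the header from Example 2 can also be checked from a script. The following is a rough sketch under a few assumptions: the helper name `qs_demo_has_noindex_header` and the URL in the usage comment are placeholders, and the usage relies on PHP's built-in `get_headers()`, which needs `allow_url_fopen` to be enabled.

```php
<?php
/**
 * Hypothetical helper: returns true if any raw response header line is an
 * X-Robots-Tag header whose value contains "noindex".
 */
function qs_demo_has_noindex_header( array $header_lines ) {
    foreach ( $header_lines as $line ) {
        // Header names are case-insensitive, so compare with stripos().
        if ( stripos( $line, 'X-Robots-Tag:' ) === 0 && stripos( $line, 'noindex' ) !== false ) {
            return true;
        }
    }

    return false;
}

// Usage (placeholder URL). get_headers() returns the response header lines,
// or false when the request fails.
// $lines = get_headers( 'https://staging.example.com/' );
// var_dump( is_array( $lines ) && qs_demo_has_noindex_header( $lines ) );
```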
Related
- https://premium.wpmudev.org/forums/topic/how-to-prevent-google-from-indexing-entire-multisite-network
- http://www.robotstxt.org/faq/prevent.html
- https://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag