ROBOTS.TXT RULES:
Here are the robots.txt commands you can use: create a plain text file and add the directives below if you would like to block a page, a file, a directory, or the whole site (a quick way to test such rules is sketched after the list).
List of ROBOTS.TXT COMMANDS
- To block the entire site, use a forward slash.
Disallow: /
- To block a directory and everything in it, follow the directory name with a forward slash.
Disallow: /junk-directory/
- To block a page, list the page.
Disallow: /private_file.html
- To remove a specific image from Google Images, add the following:
User-agent: Googlebot-Image
Disallow: /images/dogs.jpg
- To remove all images on your site from Google Images:
User-agent: Googlebot-Image
Disallow: /
- To block files of a specific file type (for example, .gif), use the following:
User-agent: Googlebot
Disallow: /*.gif$
- To prevent pages on your site from being crawled, while still displaying AdSense ads on those pages, disallow all bots other than Mediapartners-Google. This keeps the pages from appearing in search results, but allows the Mediapartners-Google robot to analyze the pages to determine the ads to show. The Mediapartners-Google robot doesn't share pages with the other Google user-agents. For example:
User-agent: *
Disallow: /
User-agent: Mediapartners-Google
Allow: /
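If you want to sanity-check simple rules like these before publishing them, Python's standard urllib.robotparser module can evaluate plain Allow/Disallow rules (note that it does not understand Google's * and $ wildcards). A minimal sketch, reusing the example paths from the list above:

from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow: /junk-directory/
Disallow: /private_file.html
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Blocked by the Disallow rules above:
print(rp.can_fetch("Googlebot", "http://www.example.com/junk-directory/page.html"))  # False
print(rp.can_fetch("Googlebot", "http://www.example.com/private_file.html"))         # False
# Not blocked:
print(rp.can_fetch("Googlebot", "http://www.example.com/index.html"))                # True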
Note that directives are case-sensitive. For instance,
Disallow: /junk_file.asp would block http://www.example.com/junk_file.asp, but would allow http://www.example.com/Junk_file.asp. Googlebot will ignore whitespace (in particular, empty lines) and unknown directives in robots.txt. Googlebot supports submission of Sitemap files through the robots.txt file.
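For example, you can point crawlers to your sitemap by adding a line like the following to robots.txt (the sitemap URL here is just a placeholder):
Sitemap: http://www.example.com/sitemap.xml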
Pattern matching
Googlebot (but not all search engines) respects some pattern matching.
- To match a sequence of characters, use an asterisk (*). For instance, to block access to all subdirectories that begin with private:
User-agent: Googlebot
Disallow: /private*/
- To block access to all URLs that include a question mark (?) (more specifically, any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string):
User-agent: Googlebot
Disallow: /*?
- To specify matching the end of a URL, use $. For instance, to block any URLs that end with .xls:
User-agent: Googlebot
Disallow: /*.xls$
You can use this pattern matching in combination with the Allow directive. For instance, if a ? indicates a session ID, you may want to exclude all URLs that contain one to ensure Googlebot doesn't crawl duplicate pages. But URLs that end with a ? may be the version of the page that you do want included. For this situation, you can set up your robots.txt file as follows:
User-agent: *
Allow: /*?$
Disallow: /*?
The Disallow: /*? directive will block any URL that includes a ? (more specifically, it will block any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string). The Allow: /*?$ directive will allow any URL that ends in a ? (more specifically, it will allow any URL that begins with your domain name, followed by a string, followed by a ?, with no characters after the ?).
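The * and $ behaviour described above can be approximated with a short Python sketch that converts a robots.txt pattern into a regular expression. This is only an illustration of the matching logic, not Googlebot's actual implementation:

import re

def pattern_to_regex(pattern):
    # Escape regex metacharacters, then restore the two wildcards Googlebot understands.
    escaped = re.escape(pattern)
    escaped = escaped.replace(r"\*", ".*")
    if escaped.endswith(r"\$"):
        escaped = escaped[:-2] + "$"   # $ anchors the pattern to the end of the URL path
    return re.compile(escaped)

def matches(pattern, path):
    return pattern_to_regex(pattern).match(path) is not None

# The session-ID example above: Allow: /*?$ together with Disallow: /*?
print(matches("/*?$", "/page?"))        # True  -> allowed, the URL ends with ?
print(matches("/*?",  "/page?id=123"))  # True  -> disallowed, the URL contains ?
print(matches("/*?$", "/page?id=123"))  # False -> the Allow rule does not apply here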
Save your robots.txt file by downloading the file or by copying the contents into a text file and saving it as robots.txt. Save the file to the highest-level directory of your site. The robots.txt file must reside in the root of the domain and must be named "robots.txt". A robots.txt file located in a subdirectory isn't valid, as bots only check for this file in the root of the domain. For instance, http://www.example.com/robots.txt is a valid location, but http://www.example.com/mysite/robots.txt is not.
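Because the file must sit at the root of the domain, the robots.txt location for any page can be derived from the page's scheme and host alone. A small Python sketch using the example.com URL above:

from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url):
    # robots.txt always lives at the root of the host, regardless of the page's path.
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("http://www.example.com/mysite/page.html"))
# -> http://www.example.com/robots.txt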
REF: https://support.google.com/webmasters/answer/156449?hl=en