The robots.txt file is a plain-text file that gives crawling and indexing recommendations to crawlers and search engine robots (note: recommendations, not obligations). Crawlers want to index as much information as possible, so when they reach your website they crawl everything they can find.
The problem arises when you want to prevent certain pages from appearing in their indexes. What do you do then? You have two options: the first is to use a special tag on each page (see Meta Robots); the second is to use a centralized file to control access. This second option is robots.txt, which is what we are going to look at in depth.
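As a rough sketch, the two approaches look like this (the paths and values shown are hypothetical examples, not recommendations for any particular site):

```html
<!-- Option 1: a per-page meta robots tag, placed in the <head>
     of each individual page you want to keep out of the index -->
<meta name="robots" content="noindex, nofollow">
```

```
# Option 2: a centralized robots.txt file at the site root
# (hypothetical paths for illustration)
User-agent: *
Disallow: /private/
Disallow: /tmp/
```

The meta tag must be repeated on every page, while robots.txt controls whole groups of URLs from one place.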
What is a robots.txt file?
The Robots Exclusion Protocol, or REP, is a series of web standards that regulate robot behavior and search engine indexing. The REP consists of the following:
The original REP dates from 1994 and was extended in 1997, defining the robots.txt crawling directives. Some search engines support extensions such as URI patterns (wildcards).
In 1996, the indexing directives (REP tags) were defined for use in robots meta elements, also known as the meta robots tag. Search engines also support additional REP tags via the "X-Robots-Tag" HTTP header. Webmasters can use these REP tags in the HTTP headers of non-HTML resources such as PDF documents or images.
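For instance, an HTTP response serving a PDF could carry the directive in its headers like this (a sketch; the directive values are illustrative):

```
HTTP/1.1 200 OK
Content-Type: application/pdf
X-Robots-Tag: noindex, noarchive
```

This achieves for a PDF what the meta robots tag achieves for an HTML page, since a PDF has no HTML head to put a meta element in.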
The "rel-nofollow" microformat appeared in 2005 to define how search engines should handle links whose A element carries a rel attribute containing the value "nofollow".
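In markup, such a link looks like this (the URL and anchor text are placeholders):

```html
<!-- The rel="nofollow" attribute tells search engines
     not to follow this link or pass authority through it -->
<a href="https://example.com/" rel="nofollow">example link</a>
```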
Robot exclusion tags
Applied to a URI, the REP tags (noindex, nofollow, unavailable_after) direct certain indexer tasks, and in some cases (nosnippet, noarchive, noodp) even direct the query engine when a search is executed. Beyond the crawler directives, each search engine interprets these REP tags differently.
For example, Google removes unique URL listings and ODP references from its SERPs when a resource is tagged "noindex", whereas Bing shows those external references to forbidden URLs in its search results. Since REP tags can be implemented both in the META elements of (X)HTML content and in the HTTP headers of any web object, the consensus is that an "X-Robots-Tag" header should invalidate or override any conflicting directives found in the META elements.
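To illustrate that conflict rule, suppose the same page sends contradictory signals (a hypothetical example); under the consensus above, the header's "noindex" would win over the meta element:

```
# HTTP response headers for the page
HTTP/1.1 200 OK
Content-Type: text/html
X-Robots-Tag: noindex

# ...while the page body contains a conflicting meta element:
# <meta name="robots" content="index, follow">
# The X-Robots-Tag header takes precedence, so the page is not indexed.
```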
Indexer directives implemented as microformats override the page-wide settings for particular HTML elements. For example, when a page's "X-Robots-Tag" says "follow" (i.e., there is no "nofollow" value), the rel-nofollow directive on an individual A (link) element still prevails for that link.
Although robots.txt lacks directives for indexers, it is possible to set those directives for groups of URIs with server-side scripts or configuration acting at the web-server level, applying "X-Robots-Tag" headers to the requested resources. This method requires programming knowledge and a good understanding of web servers and the HTTP protocol.
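One common way to do this, assuming an Apache server with mod_headers enabled (the file pattern is an illustrative assumption), is a configuration block like:

```apache
# Send an X-Robots-Tag header for every PDF the server delivers,
# covering a whole group of URIs without touching each file
<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex, noarchive"
</FilesMatch>
```

Equivalent rules exist for other servers (e.g. via add_header in nginx); the point is that the header is attached per URI pattern rather than per page.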
Google and Bing both understand two special characters that can be used to identify the pages or subfolders that an SEO consultant wants to exclude from a website: the asterisk (*) and the dollar sign ($).
* - a wildcard that matches any sequence of characters
$ - matches the end of the URL
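In a robots.txt, the two characters can be combined like this (the paths are hypothetical examples):

```
User-agent: *
# * matches any sequence of characters:
# blocks /print/, /printable/, /print/docs/, etc.
Disallow: /print*/
# $ anchors the match to the end of the URL:
# blocks every URL ending in .pdf
Disallow: /*.pdf$
```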