robots.txt
What is robots.txt?
robots.txt: A text file served at the site root that tells well-behaved web crawlers which paths they may or may not fetch, formalized in IETF RFC 9309.
robots.txt is a plain-text file placed at the root of a website (for example https://example.com/robots.txt) that implements the Robots Exclusion Protocol, standardized as IETF RFC 9309. It contains User-agent groups with Disallow and Allow directives that compliant crawlers consult before fetching URLs, and it can advertise the sitemap location with a Sitemap line. robots.txt is an advisory mechanism, not an access control: malicious bots routinely ignore it, and listing sensitive paths there can act as a roadmap for attackers. Defenders should pair robots.txt with proper authentication, authorization, rate limiting, and noindex tags, and should not use it to hide secret URLs.
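As a rough illustration of how a compliant crawler consults these directives, the sketch below parses a small robots.txt body with Python's standard `urllib.robotparser` module. The file contents, the `ExampleBot` user-agent name, and the `load_rules` helper are assumptions made for this example, and `site_maps()` requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; the paths and sitemap URL are illustrative.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Allow: /

Sitemap: https://example.com/sitemap.xml
"""


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt text directly, without making a network request."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# A well-behaved crawler asks before each fetch; nothing enforces the answer.
admin_ok = rules.can_fetch("ExampleBot", "https://example.com/admin/login")
blog_ok = rules.can_fetch("ExampleBot", "https://example.com/blog/post")
sitemaps = rules.site_maps()

print(admin_ok, blog_ok, sitemaps)
```

The check runs entirely on the crawler's side, which is why the file is advisory: a client that never calls something like `can_fetch` is not stopped by anything in robots.txt.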
● Examples
- 01
A Disallow: /admin/ entry that compliant crawlers respect but that attackers can read as a hint to probe the admin path.
- 02
A Sitemap: https://example.com/sitemap.xml line that helps search engines index public content.
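To see why a Disallow entry can double as a hint, a defender can review their own robots.txt the way an attacker would. The sketch below is one minimal way to do that, assuming a hypothetical keyword list (`SENSITIVE_HINTS`) and hypothetical helper names; it only does simple line parsing and is not a full RFC 9309 parser.

```python
# Hypothetical keyword list; tune it to your own application.
SENSITIVE_HINTS = ("admin", "backup", "internal", "private", "secret")


def disallowed_paths(robots_txt: str) -> list[str]:
    """Return every non-empty Disallow path, in file order."""
    paths = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


def revealing_paths(robots_txt: str) -> list[str]:
    """Return Disallow paths whose names suggest something sensitive."""
    return [
        path
        for path in disallowed_paths(robots_txt)
        if any(hint in path.lower() for hint in SENSITIVE_HINTS)
    ]


sample = """\
User-agent: *
Disallow: /admin/
Disallow: /search
Sitemap: https://example.com/sitemap.xml
"""

print(revealing_paths(sample))
```

Any path this flags is publicly readable by anyone who requests /robots.txt, so it should be protected by real access control rather than by omission from search results.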
● Frequently asked questions
What is robots.txt?
A text file served at the site root that tells well-behaved web crawlers which paths they may or may not fetch, formalized in IETF RFC 9309. It belongs to the Application Security category of cybersecurity.
How does robots.txt work?
A compliant crawler requests /robots.txt from the site root (for example https://example.com/robots.txt) before fetching other URLs, then applies the User-agent group that matches it: Disallow lines name path prefixes it should skip, Allow lines carve out exceptions, and a Sitemap line can point to the sitemap. The format is the Robots Exclusion Protocol, standardized as IETF RFC 9309. Compliance is voluntary: robots.txt is advisory, not an access control, so malicious bots routinely ignore it, and a list of sensitive paths can act as a roadmap for attackers. Pair it with authentication, authorization, rate limiting, and noindex tags, and do not use it to hide secret URLs.
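How do you defend against misuse of robots.txt?
robots.txt is not an attack in itself; the risk lies in relying on it for protection or in revealing sensitive paths through it. Defences combine technical controls and operational practices: enforce authentication and authorization on sensitive paths, apply rate limiting, use noindex tags for content that should stay out of search results, and keep secret URLs out of the file, as detailed in the full definition above.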
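The pairing of robots.txt with real access control can be sketched as a server-side decision that never looks at robots.txt at all. The example below is a simplified, framework-free illustration: the `respond` function, the `PROTECTED_PREFIXES` tuple, and the Basic realm are hypothetical, the noindex signal is sent as an `X-Robots-Tag` response header (an assumption about how the noindex tags are delivered), and rate limiting is left out for brevity.

```python
from http import HTTPStatus

# Hypothetical list of path prefixes that need real protection.
PROTECTED_PREFIXES = ("/admin/",)


def respond(path: str, authenticated: bool) -> tuple[int, dict[str, str]]:
    """Decide status and headers for a request, independent of robots.txt."""
    headers: dict[str, str] = {}
    if path.startswith(PROTECTED_PREFIXES):
        # Ask search engines not to index the page even if they reach it.
        headers["X-Robots-Tag"] = "noindex"
        if not authenticated:
            # Access is denied by the server, not by crawler goodwill.
            headers["WWW-Authenticate"] = 'Basic realm="admin"'
            return int(HTTPStatus.UNAUTHORIZED), headers
    return int(HTTPStatus.OK), headers


print(respond("/admin/users", authenticated=False))
print(respond("/admin/users", authenticated=True))
print(respond("/blog/post", authenticated=False))
```

With a check like this in place, a bot that ignores Disallow: /admin/ still receives a 401, so the robots.txt entry only affects crawling by clients that choose to honor it.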
What are other names for robots.txt?
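robots.txt is the name of the file itself. The protocol it implements is the Robots Exclusion Protocol, which is specified in RFC 9309; both names are commonly used to refer to it.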
● Related terms
- appsec № 1195
User-Agent Spoofing
Forging the User-Agent header or related client hints so that a request appears to come from a different browser, device, or operating system than it actually does.
- appsec № 468
Headless Browser
A web browser that runs without a graphical user interface and is driven programmatically, commonly used for testing, scraping, and security automation.