robots.txt Analysis.
Disallow entries can inadvertently reveal sensitive paths to attackers
What does this check test?
The `robots.txt` file is a publicly accessible file at the root of a website that tells web crawlers which paths they may or may not crawl. Although it is intended to manage crawler traffic, its `Disallow` directives often reveal the existence of sensitive paths, admin panels, internal APIs, and staging environments. This check examines the file for entries that disclose paths an attacker would want to probe.
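As a rough illustration of this kind of check, the sketch below scans the text of a `robots.txt` file and flags `Disallow` paths that contain a suspicious keyword. The function name `find_sensitive_disallows` and the keyword list are assumptions made for this example, not the scanner's actual rules.

```python
import re

# Hypothetical keyword list; a real check would likely use a broader, tuned set.
SENSITIVE_KEYWORDS = ("admin", "internal", "backup", "staging")


def find_sensitive_disallows(robots_txt: str) -> list[str]:
    """Return Disallow paths that contain any of the sensitive keywords."""
    flagged = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        # Field names are case-insensitive; an empty Disallow value is skipped.
        match = re.match(r"(?i)disallow\s*:\s*(\S+)", line)
        if not match:
            continue
        path = match.group(1)
        if any(keyword in path.lower() for keyword in SENSITIVE_KEYWORDS):
            flagged.append(path)
    return flagged


sample = """\
User-agent: *
Disallow: /admin/dashboard
Disallow: /images/
Disallow: /backup/db-dump-2024.sql
"""
print(find_sensitive_disallows(sample))
```

Keyword matching is deliberately crude: it can miss sensitive paths with unusual names and can flag harmless ones, so the results are a starting point for manual review.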
Why does it matter?
Developers often add paths to `robots.txt` to prevent them from appearing in search results, not realizing that this creates a public map of sensitive endpoints. Entries like `Disallow: /admin`, `Disallow: /api/internal`, `Disallow: /backup`, or `Disallow: /staging` tell attackers exactly where to look. The `robots.txt` file is one of the first things reconnaissance tools check, and its contents are used to build target lists for further scanning. Search engines respect `robots.txt`, but attackers do not.
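To show how little effort this takes, here is a minimal sketch of turning `Disallow` entries into a list of URLs to probe. The helper name `candidate_urls` and the base URL are hypothetical, and wildcard patterns are simply skipped rather than expanded.

```python
from urllib.parse import urljoin


def candidate_urls(base_url: str, robots_txt: str) -> list[str]:
    """Build absolute URLs from the literal Disallow paths in robots.txt."""
    urls: list[str] = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if not sep or field.strip().lower() != "disallow":
            continue
        path = value.strip()
        # Skip empty rules and wildcard patterns, which are not literal paths.
        if not path or "*" in path or path.endswith("$"):
            continue
        url = urljoin(base_url, path)
        if url not in urls:
            urls.append(url)
    return urls


rules = "User-agent: *\nDisallow: /admin\nDisallow: /api/internal\n"
for url in candidate_urls("https://example.com", rules):
    print(url)
```

Anyone can do this with a single unauthenticated request, which is why the listed paths need real access control rather than obscurity.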
Who is affected?
Any website that serves a `robots.txt` file is in scope. This check is relevant for all web applications, and especially for those that list sensitive paths in `Disallow` rules as a misguided security measure. Organizations that treat `robots.txt` as an access control mechanism are most at risk, because the file is purely advisory and crawlers are free to ignore it. Even well-secured applications benefit from reviewing their `robots.txt` to ensure it does not reveal internal architecture.
Where does this apply?
The file is served from the root of each host, for example `https://yourdomain.com/robots.txt`, and every subdomain has its own copy. Review all `Disallow` entries for paths that reveal admin interfaces, API endpoints, backup directories, staging environments, or internal tools. Also check `Allow` entries, which might indicate recently exposed paths, and `Sitemap` entries, which may point to XML sitemaps that enumerate the site's URLs.
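One way to organize such a review is to group the file's entries by directive, so that `Disallow`, `Allow`, and `Sitemap` values can each be inspected in turn. The sketch below assumes the file contents are already available as a string; `group_directives` and the sample path `/api/public/` are hypothetical names chosen for illustration.

```python
from collections import defaultdict

# Directives this review cares about; field names are case-insensitive.
REVIEWED_FIELDS = {"disallow", "allow", "sitemap"}


def group_directives(robots_txt: str) -> dict[str, list[str]]:
    """Group non-empty Disallow, Allow, and Sitemap values by directive."""
    groups: dict[str, list[str]] = defaultdict(list)
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        # Split on the first colon only, so URLs in Sitemap values stay intact.
        field, sep, value = line.partition(":")
        field = field.strip().lower()
        value = value.strip()
        if sep and field in REVIEWED_FIELDS and value:
            groups[field].append(value)
    return dict(groups)


sample_file = """\
User-agent: *
Disallow: /api/
Allow: /api/public/
Sitemap: https://example.com/sitemap.xml
"""
for directive, values in group_directives(sample_file).items():
    print(directive, values)
```

Any sitemap URLs found this way can then be fetched and reviewed separately, since they may list far more paths than `robots.txt` itself.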
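How to fix it
Remove specific sensitive paths from `robots.txt` and protect them with real access controls such as authentication, because the file is advisory only. Keep any remaining rules generic so they do not describe the site's internal structure.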
# Good: Generic rules that don't reveal path structure
User-agent: *
Disallow: /api/
Sitemap: https://example.com/sitemap.xml
# Bad: Reveals specific sensitive paths
User-agent: *
Disallow: /admin/dashboard
Disallow: /api/internal/users
Disallow: /backup/db-dump-2024.sql
Disallow: /staging/
References
- Google: robots.txt Introduction
- OWASP: Review Webserver Metafiles for Information Leakage
- RFC 9309: Robots Exclusion Protocol