robots.txt Analysis

Security Sensitive File Exposure

Disallow entries can inadvertently reveal sensitive paths to attackers

What does this check test?

The `robots.txt` file is a publicly accessible text file at the root of a website that tells web crawlers which paths they may or may not crawl. Although it is intended to manage crawler traffic, its `Disallow` directives often reveal the existence of sensitive paths such as admin panels, internal APIs, and staging environments. This check examines the file for entries that disclose paths an attacker would want to probe.
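
As a rough illustration of this kind of check, the sketch below scans the text of a `robots.txt` file and flags `Disallow` paths that contain suspicious keywords. The keyword list and the function name `find_revealing_disallows` are assumptions made for this example; the actual check may use different heuristics.

```python
# Hypothetical keyword list; tune it to your own application.
SENSITIVE_HINTS = ("admin", "internal", "backup", "staging", "private", "config")


def find_revealing_disallows(robots_txt: str) -> list[str]:
    """Return Disallow paths that contain any of the hint keywords."""
    flagged = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        if field.strip().lower() != "disallow":
            continue
        path = value.strip()
        if any(hint in path.lower() for hint in SENSITIVE_HINTS):
            flagged.append(path)
    return flagged


sample = """\
User-agent: *
Disallow: /admin/dashboard
Disallow: /images/
Disallow: /backup/db-dump-2024.sql  # nightly dump
"""
print(find_revealing_disallows(sample))
# → ['/admin/dashboard', '/backup/db-dump-2024.sql']
```

Keyword matching is deliberately crude: it will miss sensitive paths with innocuous names and may flag harmless ones, so treat the output as a list to review by hand.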

Why does it matter?

Developers often add paths to `robots.txt` to keep them out of search results, not realizing that this creates a public map of sensitive endpoints. Entries like `Disallow: /admin`, `Disallow: /api/internal`, `Disallow: /backup`, or `Disallow: /staging` tell attackers exactly where to look. The `robots.txt` file is one of the first things reconnaissance tools fetch, and its contents are used to build target lists for further scanning. Well-behaved search engine crawlers respect `robots.txt`, but attackers do not.

Who is affected?

Most websites serve a `robots.txt` file, so this check is relevant for all web applications. It matters most for those that have added sensitive paths to their disallow list as a misguided security measure. Organizations that treat `robots.txt` as an access control mechanism are most at risk: the file is advisory only and enforces nothing. Even well-secured applications can benefit from reviewing their `robots.txt` to ensure it does not reveal internal architecture.
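
To make "advisory only" concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. It shows that honoring a `Disallow` rule is a decision made by the client. The helper name `compliant_crawler_may_fetch` and the `ExampleBot` user agent are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def compliant_crawler_may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Answer the question a well-behaved crawler asks before requesting a URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


rules = """\
User-agent: *
Disallow: /admin
"""

# A compliant crawler sees False here and voluntarily skips the path.
print(compliant_crawler_may_fetch(rules, "ExampleBot", "https://example.com/admin"))
# → False
```

Nothing in this flow involves the server: a client that never performs the check can still request `/admin`, and only authentication or network controls decide what it gets back.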

Where does this apply?

Crawlers look for the file only at the root of each host, for example `https://yourdomain.com/robots.txt`, so every subdomain has its own file to review. Check all `Disallow` entries for paths that reveal admin interfaces, API endpoints, backup directories, staging environments, or internal tools. Also check `Allow` entries, which might indicate recently exposed paths, and `Sitemap` entries, which may point to XML sitemaps that list the site's URL structure.
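
The review described above can be sketched as two small helpers: one that derives the `robots.txt` location from any page URL, and one that groups the `Disallow`, `Allow`, and `Sitemap` values for inspection. Both function names are hypothetical, and the parsing is simplified (it ignores `User-agent` grouping).

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Build the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def collect_entries(robots_txt: str) -> dict[str, list[str]]:
    """Group Disallow, Allow, and Sitemap values, ignoring user-agent groups."""
    entries = {"disallow": [], "allow": [], "sitemap": []}
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        # Split on the first colon only, so URLs in Sitemap values stay intact.
        field, sep, value = line.partition(":")
        key = field.strip().lower()
        if sep and key in entries and value.strip():
            entries[key].append(value.strip())
    return entries


print(robots_url("https://example.com/shop/cart?id=1"))
# → https://example.com/robots.txt
```

Fetching the file is left out on purpose so the sketch stays free of network calls; pass in the file's text however you retrieve it.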

How to fix it

Review your `robots.txt` and remove entries that reveal sensitive paths. Instead of hiding sensitive paths via `robots.txt`, protect them with proper authentication and access controls. Compare a generic rule set with one that leaks specific paths:

    # Good: Generic rules that don't reveal path structure
    User-agent: *
    Disallow: /api/
    Sitemap: https://example.com/sitemap.xml

    # Bad: Reveals specific sensitive paths
    User-agent: *
    Disallow: /admin/dashboard
    Disallow: /api/internal/users
    Disallow: /backup/db-dump-2024.sql
    Disallow: /staging/

If a path should not be publicly accessible, block it with authentication or network-level access controls, not `robots.txt`. For individual pages you want excluded from search results without listing the path publicly, use the `noindex` robots meta tag or the `X-Robots-Tag` response header. Crawlers must be able to fetch a page to see either signal, so do not also disallow that page in `robots.txt`.
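
One way to apply the `X-Robots-Tag` header is at the application layer. The sketch below is a minimal WSGI middleware, assuming a Python WSGI application. The names `add_noindex` and `demo_app` and the `/reports/` prefix are hypothetical, and your framework or web server may offer a more direct way to set response headers.

```python
def add_noindex(app, noindex_prefixes):
    """Wrap a WSGI app so responses for matching paths carry X-Robots-Tag: noindex."""

    def wrapped(environ, start_response):
        path = environ.get("PATH_INFO", "")

        def patched_start_response(status, headers, exc_info=None):
            # Append the header only for paths under one of the given prefixes.
            if any(path.startswith(prefix) for prefix in noindex_prefixes):
                headers = list(headers) + [("X-Robots-Tag", "noindex")]
            return start_response(status, headers, exc_info)

        return app(environ, patched_start_response)

    return wrapped


def demo_app(environ, start_response):
    """Stand-in application used only for this example."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


# "/reports/" is a hypothetical prefix for pages to keep out of search results.
application = add_noindex(demo_app, ["/reports/"])
```

The prefix list lives in server-side code rather than in a public file, so it excludes pages from search results without advertising their paths. It is not a substitute for authentication on anything sensitive.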
