
robots.txt vs noindex: what each one actually controls

A practical breakdown of crawling, indexing, and the mistakes that make pages disappear from search.

Published Jun 10, 2026 | Updated Jun 18, 2026

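The distinction is simple but easy to blur in practice: robots.txt controls crawling, while noindex controls indexing and is delivered either as a robots meta tag in the page's HTML or as an X-Robots-Tag HTTP response header. A URL blocked in robots.txt can still be indexed if search engines discover it through links on other pages, typically with little or no description because the content was never fetched. The reverse trap follows from the same logic: a crawler that is not allowed to fetch a page never sees a noindex on it, so blocking a page in robots.txt and adding noindex to it works against itself. This is why people often conclude that robots.txt failed when the real issue is a mismatch in expectations.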
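To make the crawling half concrete, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The robots.txt content, the `example.com` URLs, and the helper name `is_crawl_allowed` are illustrative assumptions, not taken from any real site. The check answers only "may a crawler fetch this URL?" and says nothing about whether the URL ends up in an index.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration: one narrow, explicit rule.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""


def is_crawl_allowed(robots_txt: str, url: str, user_agent: str = "ExampleBot") -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    This is a crawling decision only. A disallowed URL can still be
    indexed if search engines learn about it from links elsewhere.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for candidate in ("https://example.com/admin/users", "https://example.com/blog/post"):
        print(candidate, is_crawl_allowed(ROBOTS_TXT, candidate))
```

Parsing from a string rather than fetching a live file keeps the sketch self-contained; against a real site, `RobotFileParser.set_url()` followed by `read()` would load the published file instead.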

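For a small website, the safest default is usually to keep robots.txt narrow and explicit. Use it to keep crawlers out of sections that have no search value, such as admin or internal search paths, not as a blanket instrument for every page you feel unsure about. It is not access control: the file is publicly readable and compliant crawlers follow it voluntarily, so anything genuinely private needs authentication. Then use noindex deliberately for pages that should remain accessible but stay out of search results.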
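When applying noindex deliberately, it helps to confirm the signal is actually present in one of the two places it can live. The following is a rough sketch, assuming you already have the page's HTML and response headers in hand. The function name `has_noindex` is hypothetical, and the check is deliberately simplified: it ignores crawler-specific variants such as `<meta name="googlebot">` or user-agent-prefixed header values, and it does not treat `none` as equivalent to noindex.

```python
from html.parser import HTMLParser


class _RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() == "robots":
            for token in attr.get("content", "").split(","):
                self.directives.add(token.strip().lower())


def has_noindex(html: str, headers: dict[str, str]) -> bool:
    """Return True if noindex appears in X-Robots-Tag or a robots meta tag."""
    header_tokens: set[str] = set()
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == "x-robots-tag":
            header_tokens.update(t.strip().lower() for t in value.split(","))

    meta = _RobotsMetaParser()
    meta.feed(html)
    return "noindex" in header_tokens or "noindex" in meta.directives
```

The header route is the only option for non-HTML resources such as PDFs, which is why the sketch checks both places rather than the meta tag alone.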

The practical rule is to decide which problem you are solving before choosing a mechanism. To reduce crawler access, use robots.txt. To keep a page out of search results while still allowing crawlers to see it, use noindex, which is the more direct signal.
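The rule above can be written out as a small lookup over the two signals. This is only a sketch of the reasoning in this guide, not a model of how any particular search engine behaves; the function name `diagnose` and its wording are made up for illustration.

```python
def diagnose(blocked_by_robots_txt: bool, page_has_noindex: bool) -> str:
    """Describe the likely outcome for one URL given its two signals."""
    if blocked_by_robots_txt and page_has_noindex:
        # The crawler is not allowed to fetch the page, so it cannot read the noindex.
        return "conflict: crawlers cannot fetch the page, so the noindex is never seen"
    if blocked_by_robots_txt:
        return "not crawled, but the URL can still be indexed if linked from elsewhere"
    if page_has_noindex:
        return "crawlable and kept out of search results"
    return "crawlable and eligible for indexing"


if __name__ == "__main__":
    for blocked in (True, False):
        for noindex in (True, False):
            print(f"blocked={blocked} noindex={noindex}: {diagnose(blocked, noindex)}")
```

The first branch is the one worth auditing for: it is the combination that most often explains a page that was meant to disappear from search but did not.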

Why this guide matters

Use this guide when you want a little more context before publishing, need a quick refresher on best practices, or want to avoid the mistakes that commonly lead to crawl or indexing issues later.

Use this with the matching tool

Robots.txt Generator

If you want to apply this advice immediately, use the related tool and compare the output against the points covered in this guide.
