Before you can optimize for search engines, you need to understand what they do. Search engines have three jobs: discover pages, understand them, and rank them.
The Three Stages
1. Crawling → Discover pages by following links
2. Indexing → Read and store page content
3. Ranking → Decide which pages answer a query
Crawling
Google uses automated programs called crawlers (or Googlebot) that follow links from page to page across the web.
When Googlebot visits your site, it:
- Fetches your page's HTML
- Finds all the links on that page
- Adds those links to a queue
- Visits the next URL in the queue
- Repeats billions of times per day
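The loop above is essentially a breadth-first traversal of the web's link graph. A minimal sketch in Python, using a made-up in-memory link graph in place of real HTTP fetches:

```python
from collections import deque

# Hypothetical link graph standing in for the web: page -> pages it links to.
LINKS = {
    "/": ["/about", "/blog"],
    "/about": ["/"],
    "/blog": ["/blog/post-1", "/blog/post-2"],
    "/blog/post-1": ["/blog"],
    "/blog/post-2": ["/blog", "/about"],
}

def crawl(start):
    """Breadth-first crawl: fetch a page, queue its links, repeat."""
    queue = deque([start])
    seen = {start}
    order = []
    while queue:
        url = queue.popleft()            # visit the next URL in the queue
        order.append(url)                # "fetch" the page
        for link in LINKS.get(url, []):  # find all the links on that page
            if link not in seen:         # skip already-discovered URLs
                seen.add(link)
                queue.append(link)       # add new links to the queue
    return order

print(crawl("/"))  # → ['/', '/about', '/blog', '/blog/post-1', '/blog/post-2']
```

A real crawler adds politeness delays, robots.txt checks, and URL normalization on top of this loop, but the queue-and-visit structure is the same.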
Crawl budget is the number of pages Google will crawl on your site in a given timeframe. For most small-to-medium sites, crawl budget isn't a concern. But for large sites (100k+ pages), you need to make sure Google spends its budget on your important pages.
Things that waste crawl budget:
| Problem | Example |
|---|---|
| Duplicate pages | Same content at multiple URLs |
| Infinite URL parameters | /products?sort=price&page=1&color=red&... |
| Redirect chains | A → B → C → D (fix to A → D) |
| Broken links (404s) | Links to pages that don't exist |
| Blocked resources | CSS/JS blocked by robots.txt |
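The redirect-chain fix in the table (rewriting A → B → C → D as A → D) amounts to flattening a redirect map so every source points straight at its final destination. A small sketch, with made-up URLs:

```python
# Hypothetical redirect map: source URL -> the URL it redirects to.
REDIRECTS = {"/a": "/b", "/b": "/c", "/c": "/d"}

def final_target(url, redirects, max_hops=10):
    """Follow redirects until we reach a URL that doesn't redirect."""
    hops = 0
    while url in redirects:
        url = redirects[url]
        hops += 1
        if hops > max_hops:
            raise RuntimeError("possible redirect loop at " + url)
    return url

def flatten(redirects):
    """Point every source directly at its final destination (A -> D)."""
    return {src: final_target(src, redirects) for src in redirects}

print(flatten(REDIRECTS))  # → {'/a': '/d', '/b': '/d', '/c': '/d'}
```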
Indexing
After crawling, Google processes the page:
- Renders the page — executes JavaScript to see the final HTML
- Extracts content — reads text, headings, images, links
- Understands meaning — determines what the page is about
- Stores it — adds the page to Google's index (a massive database)
Not every crawled page gets indexed. Google may skip pages that are:
- Too similar to other pages (duplicate content)
- Too thin (very little useful content)
- Blocked by noindex tags
- Low quality or spammy
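One way to check a page for a robots noindex meta tag is with Python's standard-library HTML parser. This is a minimal sketch; it does not cover the X-Robots-Tag HTTP header, which Google also honors:

```python
from html.parser import HTMLParser

class NoindexDetector(HTMLParser):
    """Flags pages containing <meta name="robots" content="...noindex...">."""
    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            if "noindex" in a.get("content", "").lower():
                self.noindex = True

def has_noindex(html):
    detector = NoindexDetector()
    detector.feed(html)
    return detector.noindex

print(has_noindex('<meta name="robots" content="noindex, nofollow">'))  # → True
print(has_noindex('<meta name="robots" content="index, follow">'))      # → False
```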
Ranking
When someone searches, Google:
- Finds all indexed pages that match the query
- Scores each page on hundreds of ranking factors
- Returns results in order of relevance and quality
The main ranking factors:
| Factor | What It Means |
|---|---|
| Relevance | Does the content match the search query? |
| Quality | Is the content comprehensive, accurate, and useful? |
| Authority | Do other reputable sites link to this page? |
| User experience | Is the page fast, mobile-friendly, and easy to use? |
| Freshness | Is the content up to date? |
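To make "scores each page on hundreds of factors" concrete, here is a toy weighted scorer. The factor names mirror the table above, but the weights and page scores are invented for illustration; Google's actual weighting is not public:

```python
# Invented weights for illustration only -- not Google's real weighting.
WEIGHTS = {"relevance": 0.35, "quality": 0.25, "authority": 0.20,
           "user_experience": 0.10, "freshness": 0.10}

def score(page):
    """Weighted sum of per-factor scores (each assumed to be 0.0-1.0)."""
    return sum(WEIGHTS[f] * page.get(f, 0.0) for f in WEIGHTS)

# Two hypothetical pages competing for the same query.
pages = {
    "/deep-guide": {"relevance": 0.9, "quality": 0.9, "authority": 0.7,
                    "user_experience": 0.8, "freshness": 0.6},
    "/thin-page":  {"relevance": 0.8, "quality": 0.3, "authority": 0.2,
                    "user_experience": 0.5, "freshness": 0.9},
}

ranked = sorted(pages, key=lambda p: score(pages[p]), reverse=True)
print(ranked)  # → ['/deep-guide', '/thin-page']
```

Note that the thin page is fresher yet still loses: no single factor dominates, which is why chasing one signal rarely works.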
How Google Finds Your Pages
Google discovers pages through:
- Links from other sites — the primary discovery method
- Your sitemap — an XML file listing all your pages
- Google Search Console — you can manually submit URLs
- Internal links — links between pages on your own site
Sitemaps
A sitemap tells Google about all the pages on your site:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2026-03-20</lastmod>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/blog/my-post</loc>
    <lastmod>2026-03-18</lastmod>
    <priority>0.7</priority>
  </url>
</urlset>
```
Submit your sitemap in Google Search Console at Sitemaps > Add a new sitemap.
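If you generate your sitemap programmatically rather than by hand, Python's standard xml.etree module is enough. A sketch that produces a sitemap in the format shown above (the page list is made up):

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(pages):
    """pages: list of (loc, lastmod, priority) tuples -> sitemap XML string."""
    ET.register_namespace("", NS)  # emit the sitemap namespace as the default
    urlset = ET.Element(f"{{{NS}}}urlset")
    for loc, lastmod, priority in pages:
        url = ET.SubElement(urlset, f"{{{NS}}}url")
        ET.SubElement(url, f"{{{NS}}}loc").text = loc
        ET.SubElement(url, f"{{{NS}}}lastmod").text = lastmod
        ET.SubElement(url, f"{{{NS}}}priority").text = priority
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body

sitemap_xml = build_sitemap([
    ("https://example.com/", "2026-03-20", "1.0"),
    ("https://example.com/blog/my-post", "2026-03-18", "0.7"),
])
print(sitemap_xml)
```

In practice you would write this to a file such as sitemap.xml in your site root and regenerate it whenever pages are added or updated.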
robots.txt
The robots.txt file tells crawlers which pages they can and cannot access:
```
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /api/
Sitemap: https://example.com/sitemap.xml
```
- Allow — pages crawlers can visit
- Disallow — pages crawlers should skip
- This is a suggestion, not enforcement — well-behaved bots follow it, malicious ones ignore it
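You can check rules like these against specific URLs with Python's standard urllib.robotparser. One caveat: Python's parser applies rules in file order, so the blanket Allow: / line is omitted in this sketch (crawling is allowed by default when no rule matches):

```python
from urllib.robotparser import RobotFileParser

# Inline robots.txt content for the sketch; normally this is fetched
# from https://example.com/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /api/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("Googlebot", "https://example.com/blog/post"))    # → True
print(rp.can_fetch("Googlebot", "https://example.com/admin/users"))  # → False
```

This is handy for verifying that a robots.txt change doesn't accidentally block pages you want crawled, before you deploy it.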
Google Search Console
Google Search Console (GSC) is your direct line to Google. It shows you:
- Which pages are indexed (and which aren't)
- What queries bring traffic to your site
- Crawl errors and issues
- Mobile usability problems
- Core Web Vitals scores
Every website owner should set up GSC. It's free and essential for understanding how Google sees your site.
Key Reports
| Report | What It Shows |
|---|---|
| Performance | Clicks, impressions, CTR, and average position |
| Coverage | Which pages are indexed, excluded, or errored |
| Sitemaps | Sitemap submission status |
| URL Inspection | How Google sees a specific URL |
| Core Web Vitals | Page speed and experience scores |
Summary
- Search engines crawl (discover), index (understand), and rank (order) pages
- Googlebot follows links to discover new pages — internal linking matters
- Not every crawled page gets indexed — content quality and uniqueness matter
- Ranking depends on relevance, quality, authority, and user experience
- Use Google Search Console to monitor how Google sees your site
- Submit a sitemap to help Google find all your pages