[SEO Tuning] What is robots.txt, and why set up a sitemap?

What is robots.txt

A file that tells search engine crawlers which pages they may crawl and which they should skip.

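Take Google Search as an example. We can type a few keywords into the search box and get back a list of web pages because Google has already done two things in advance:

Crawling + indexing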

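When we Google something, we are not really searching the whole vast web. We are searching Google's index of web pages, so anything that shows up in the results must be:

Already crawled and indexed by Google
Recommended (ranked) by Google's algorithm

robots.txt and sitemaps, today's topics, both concern the first point.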

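Why robots.txt is needed

At first I was genuinely curious: why would anyone want a page not to be crawled by Google? Most of us want to be crawled, since that is how other people find our articles.

Later I learned that there are in fact situations where you do not want a page crawled, such as: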

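Pages that are still under construction
A web server that may not be able to handle too much crawler traffic
Any other reason for not wanting Google to crawl a page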

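However, robots.txt only stops Google from crawling certain pages. It does not guarantee that those pages stay out of Google Search results, since a URL can still be indexed when other pages link to it.

To keep a page out of Google Search entirely, add the following tag inside the page's <head>, and leave the page crawlable so that Googlebot can actually read the tag: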

<meta name="robots" content="noindex">
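To double-check that a page really carries the tag, something like the following could be used. This is only a rough sketch built on Python's standard html.parser; the names RobotsMetaParser and has_noindex, and the sample markup, are made up for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        # content may hold several comma-separated directives
        content = attr.get("content") or ""
        self.directives.update(
            part.strip().lower() for part in content.split(",") if part.strip()
        )


def has_noindex(html: str) -> bool:
    """Return True if the markup contains a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Hypothetical page markup, for illustration only
page = '<html><head><meta name="robots" content="noindex"></head><body></body></html>'
print(has_noindex(page))
```

It only looks at the HTML it is given, so it says nothing about whether Google has already indexed the page.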

How to set up robots.txt

First, add a file named robots.txt. In my project it lives under the public folder. Taking my blog as an example:

https://kenyucode.vercel.app

Once robots.txt is added, opening

https://kenyucode.vercel.app/robots.txt

shows:

User-agent: Googlebot
Disallow: /nogooglebot/

User-agent: *
Allow: /

Sitemap: https://kenyucode.vercel.app/sitemap.xml

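This is exactly what was typed into robots.txt.

User-agent names the client the rules apply to. For a person browsing the web the user agent is the browser, but in robots.txt it is the crawler, so the value is a crawler name such as Googlebot, and * matches every crawler.

The first group means that the crawler named Googlebot may not crawl any URL that starts with https://kenyucode.vercel.app/nogooglebot/. The second group lets every other crawler crawl the whole site.

The Sitemap line is covered below.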
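The rules above can be tried out locally with Python's standard urllib.robotparser. This is a small sketch, not Google's own parser, so treat the result as an approximation; the path /nogooglebot/page and the crawler name SomeOtherBot are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# The same content that the blog serves at /robots.txt
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /nogooglebot/

User-agent: *
Allow: /

Sitemap: https://kenyucode.vercel.app/sitemap.xml
"""

BASE = "https://kenyucode.vercel.app"

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot is blocked only under /nogooglebot/ (hypothetical page path)
googlebot_blocked = parser.can_fetch("Googlebot", BASE + "/nogooglebot/page")
googlebot_other = parser.can_fetch("Googlebot", BASE + "/js")

# Any other crawler falls through to the "*" group (hypothetical bot name)
other_bot = parser.can_fetch("SomeOtherBot", BASE + "/nogooglebot/page")

# Sitemap lines are exposed separately (Python 3.8+)
sitemaps = parser.site_maps()

print(googlebot_blocked, googlebot_other, other_bot, sitemaps)
```

Parsing a string instead of fetching the live URL keeps the check offline and repeatable.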

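What is a sitemap

A file that provides information about the pages on a site.

It can even describe specific kinds of content on those pages, such as a video's running time or an image's type.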
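As a rough illustration of that extra information, the sketch below builds a sitemap entry that carries a video's running time, using Google's video sitemap extension namespace. The example.com URLs, the title and description, and the helper names q and build_video_sitemap are all invented for the example; the blog's real sitemap further down has no video entries.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
VIDEO_NS = "http://www.google.com/schemas/sitemap-video/1.1"

# Serialize with the conventional prefixes instead of ns0/ns1
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("video", VIDEO_NS)


def q(ns, tag):
    """Build a namespace-qualified tag name in ElementTree's {uri}tag form."""
    return f"{{{ns}}}{tag}"


def build_video_sitemap(page_url, video):
    """Return a one-entry sitemap whose <url> describes a single video."""
    urlset = ET.Element(q(SITEMAP_NS, "urlset"))
    url = ET.SubElement(urlset, q(SITEMAP_NS, "url"))
    ET.SubElement(url, q(SITEMAP_NS, "loc")).text = page_url
    entry = ET.SubElement(url, q(VIDEO_NS, "video"))
    for tag in ("thumbnail_loc", "title", "description", "content_loc", "duration"):
        ET.SubElement(entry, q(VIDEO_NS, tag)).text = str(video[tag])
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical video metadata; duration is given in seconds
video_xml = build_video_sitemap(
    "https://example.com/videos/intro",
    {
        "thumbnail_loc": "https://example.com/thumbs/intro.jpg",
        "title": "Intro video",
        "description": "A short introduction.",
        "content_loc": "https://example.com/media/intro.mp4",
        "duration": 95,
    },
)
print(video_xml)
```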

Why a sitemap is needed

Google's documentation describes a few situations in which a sitemap is typically useful:

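The site is very large (more than about 500 pages)
Pages are not well linked to one another (no buttons or links lead from one page to another)
The site is new and has few external links pointing to it
The site has a lot of rich media content, such as videos or images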

How to set up a sitemap

There are many online tools that generate a sitemap automatically, such as:

https://www.xml-sitemaps.com/

Add the generated XML file to the public folder so that https://kenyucode.vercel.app/sitemap.xml serves the following:

<urlset
      xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
            http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
    <url>
        <loc>https://kenyucode.vercel.app/</loc>
        <lastmod>2021-10-28T06:48:36+00:00</lastmod>
    </url>
    <url>
        <loc>https://kenyucode.vercel.app/js</loc>
        <lastmod>2021-10-28T06:48:36+00:00</lastmod>
    </url>
    <url>
        <loc>https://kenyucode.vercel.app/react</loc>
        <lastmod>2021-10-28T06:48:36+00:00</lastmod>
    </url>
</urlset>
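As an alternative to an online generator, a file like this could also be produced by a short script. The following is a minimal sketch using Python's standard xml.etree.ElementTree; the helper names build_sitemap and read_locs are invented here, and the script leaves out the optional xsi:schemaLocation attributes that the generated file includes.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """pages: iterable of (url, lastmod) pairs. Returns the sitemap as a string."""
    # Use the sitemap namespace as the default one, so tags have no prefix
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
        ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


def read_locs(xml_text):
    """Parse a sitemap string and return every <loc> value in document order."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [el.text for el in root.iter(f"{{{SITEMAP_NS}}}loc")]


# The three pages and the timestamp listed in the sitemap above
LASTMOD = "2021-10-28T06:48:36+00:00"
pages = [
    ("https://kenyucode.vercel.app/", LASTMOD),
    ("https://kenyucode.vercel.app/js", LASTMOD),
    ("https://kenyucode.vercel.app/react", LASTMOD),
]

sitemap_xml = build_sitemap(pages)
print(sitemap_xml)
print(read_locs(sitemap_xml))
```

The returned string would then be saved as sitemap.xml in the public folder, the same place the generated file goes.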