Resolving a hostname takes 5 seconds

November 26, 2021

I have a master bind9 DNS server and 2 slave servers running on IPv4 (Debian Jessie), with the following in /etc/bind/named.conf:

listen-on-v6 { none; };

When I try to connect from a different server, each connection takes at least 5 seconds (I'm using Joseph's timing info for debugging):

$ curl -w "@curl-format.txt" -o /dev/null -s https://example.com
           time_namelookup:  5.512
              time_connect:  5.512
           time_appconnect:  5.529
          time_pretransfer:  5.529
             time_redirect:  0.000
        time_starttransfer:  5.531
                           ----------
                time_total:  5.531

According to curl, the lookup takes most of the time, yet a standard nslookup is very fast:

$ time nslookup example.com > /dev/null 2>&1

real    0m0.018s
user    0m0.016s
sys     0m0.000s

After forcing curl to use IPv4, it gets much better:

$ curl -4 -w "@curl-format.txt" -o /dev/null -s https://example.com

           time_namelookup:  0.004
              time_connect:  0.005
           time_appconnect:  0.020
          time_pretransfer:  0.020
             time_redirect:  0.000
        time_starttransfer:  0.022
                           ----------
                time_total:  0.022

I've disabled IPv6 on the host:

echo 1 > /proc/sys/net/ipv6/conf/eth0/disable_ipv6

The problem persists, though. I've tried running strace to see what causes the timeouts:

write(2, "*", 1*)                        = 1
write(2, " ", 1 )                        = 1
write(2, "Hostname was NOT found in DNS ca"..., 36Hostname was NOT found in DNS cache
) = 36
socket(PF_INET6, SOCK_DGRAM, IPPROTO_IP) = 4
close(4)                                = 0
mmap(NULL, 8392704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f220bcf8000
mprotect(0x7f220bcf8000, 4096, PROT_NONE) = 0
clone(child_stack=0x7f220c4f7fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f220c4f89d0, tls=0x7f220c4f8700, child_tidptr=0x7f220c4f89d0) = 2004
rt_sigaction(SIGPIPE, {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART, 0x7f22102e08d0}, NULL, 8) = 0
rt_sigaction(SIGPIPE, NULL, {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART, 0x7f22102e08d0}, 8) = 0
rt_sigaction(SIGPIPE, {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART, 0x7f22102e08d0}, NULL, 8) = 0
rt_sigaction(SIGPIPE, {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART, 0x7f22102e08d0}, NULL, 8) = 0
poll(0, 0, 4)                           = 0 (Timeout)
rt_sigaction(SIGPIPE, NULL, {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART, 0x7f22102e08d0}, 8) = 0
rt_sigaction(SIGPIPE, {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART, 0x7f22102e08d0}, NULL, 8) = 0
rt_sigaction(SIGPIPE, {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART, 0x7f22102e08d0}, NULL, 8) = 0
poll(0, 0, 8)                           = 0 (Timeout)
rt_sigaction(SIGPIPE, NULL, {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART, 0x7f22102e08d0}, 8) = 0
rt_sigaction(SIGPIPE, {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART, 0x7f22102e08d0}, NULL, 8) = 0
rt_sigaction(SIGPIPE, {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART, 0x7f22102e08d0}, NULL, 8) = 0
poll(0, 0, 16)                          = 0 (Timeout)
rt_sigaction(SIGPIPE, NULL, {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART, 0x7f22102e08d0}, 8) = 0
rt_sigaction(SIGPIPE, {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART, 0x7f22102e08d0}, NULL, 8) = 0
rt_sigaction(SIGPIPE, {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART, 0x7f22102e08d0}, NULL, 8) = 0
poll(0, 0, 32)                          = 0 (Timeout)
rt_sigaction(SIGPIPE, NULL, {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART, 0x7f22102e08d0}, 8) = 0
rt_sigaction(SIGPIPE, {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART, 0x7f22102e08d0}, NULL, 8) = 0
rt_sigaction(SIGPIPE, {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART, 0x7f22102e08d0}, NULL, 8) = 0
poll(0, 0, 64)                          = 0 (Timeout)

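It doesn't seem to be a firewall issue, since nslookup (or curl -4) uses the same DNS servers. Any idea what could be wrong?

Here is the tcpdump from the host, captured with tcpdump -vvv -s 0 -l -n port 53: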

tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
20:14:52.542526 IP (tos 0x0, ttl 64, id 35839, offset 0, flags [DF], proto UDP (17), length 63)
   192.168.1.1.59163 > 192.168.1.2.53: [bad udp cksum 0xf9f3 -> 0x96c7!] 39535+ A? example.com. (35)
20:14:52.542540 IP (tos 0x0, ttl 64, id 35840, offset 0, flags [DF], proto UDP (17), length 63)
   192.168.1.1.59163 > 192.168.1.2.53: [bad udp cksum 0xf9f3 -> 0x6289!] 45997+ AAAA? example.com. (35)
20:14:52.543281 IP (tos 0x0, ttl 61, id 63674, offset 0, flags [none], proto UDP (17), length 158)
   192.168.1.2.53 > 192.168.1.1.59163: [udp sum ok] 45997* q: AAAA? example.com. 1/1/0 example.com. [1h] CNAME s01.example.com. ns: example.com. [10m] SOA ns01.example.com. ns51.domaincontrol.com. 2016062008 28800 7200 1209600 600 (130)
20:14:57.547439 IP (tos 0x0, ttl 64, id 36868, offset 0, flags [DF], proto UDP (17), length 63)
   192.168.1.1.59163 > 192.168.1.2.53: [bad udp cksum 0xf9f3 -> 0x96c7!] 39535+ A? example.com. (35)
20:14:57.548188 IP (tos 0x0, ttl 61, id 64567, offset 0, flags [none], proto UDP (17), length 184)
   192.168.1.2.53 > 192.168.1.1.59163: [udp sum ok] 39535* q: A? example.com. 2/2/2 example.com. [1h] CNAME s01.example.com., s01.example.com. [1h] A 136.243.154.168 ns: example.com. [30m] NS ns01.example.com., example.com. [30m] NS ns02.example.com. ar: ns01.example.com. [1h] A 136.243.154.168, ns02.example.com. [1h] A 192.168.1.2 (156)
20:14:57.548250 IP (tos 0x0, ttl 64, id 36869, offset 0, flags [DF], proto UDP (17), length 63)
   192.168.1.1.59163 > 192.168.1.2.53: [bad udp cksum 0xf9f3 -> 0x6289!] 45997+ AAAA? example.com. (35)
20:14:57.548934 IP (tos 0x0, ttl 61, id 64568, offset 0, flags [none], proto UDP (17), length 158)
   192.168.1.2.53 > 192.168.1.1.59163: [udp sum ok] 45997* q: AAAA? example.com. 1/1/0 example.com. [1h] CNAME s01.example.com. ns: example.com. [10m] SOA ns01.example.com. ns51.domaincontrol.com. 2016062008 28800 7200 1209600 600 (130)

EDIT: This message appears frequently in the bind logs:

error sending response: host unreachable

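However, every query is eventually answered (it just takes 5 seconds). All machines are physical servers (so NAT is not to blame); it's more likely that packets are being blocked by a router. Here is a question that is very likely related: DNS lookups sometimes take 5 seconds.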

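Short answer:
A workaround is to make glibc close the socket and open a new one between the A and AAAA lookups, instead of sending both from the same socket, by adding a line to /etc/resolv.conf: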
options single-request-reopen
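The real cause of this issue might be:
a misconfigured firewall or router (e.g. the Juniper firewall configuration described here) that drops AAAA DNS packets
a bug in the DNS server
Long answer:
Programs like curl or wget use glibc's getaddrinfo() function, which tries to be compatible with both IPv4 and IPv6 by looking up both DNS records in parallel. It doesn't return a result until both records have been received (there are several issues related to this behaviour), which explains the strace above. When IPv4 is forced, as with curl -4, gethostbyname() is used internally, and it queries for the A record only.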
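The two lookup paths can be compared outside curl. This is a minimal sketch, assuming Python 3 on a glibc system, where socket.getaddrinfo() wraps the C library call. The helper name time_lookup and the host localhost are illustrative and not from the original post. AF_UNSPEC requests both A and AAAA records, while AF_INET requests A only, which is roughly what curl -4 does.

```python
import socket
import time


def time_lookup(host, family):
    """Return (seconds, address_count) for one getaddrinfo() call.

    Hypothetical helper for illustration; not part of curl or glibc.
    """
    start = time.monotonic()
    try:
        results = socket.getaddrinfo(host, 443, family, socket.SOCK_STREAM)
    except socket.gaierror:
        results = []  # resolution failed; the elapsed time is still informative
    return time.monotonic() - start, len(results)


if __name__ == "__main__":
    host = "localhost"  # assumption: replace with the name that resolves slowly
    for label, family in (("AF_UNSPEC (A + AAAA)", socket.AF_UNSPEC),
                          ("AF_INET (A only)", socket.AF_INET)):
        elapsed, count = time_lookup(host, family)
        print(f"{label:22s} {elapsed:.3f}s, {count} address(es)")
```

If the AF_UNSPEC call takes about 5 seconds and the AF_INET call returns immediately, the behaviour matches what curl and curl -4 show in the question.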
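From the tcpdump we can see that:
-> A? two requests are sent at the beginning
-> AAAA? (requesting the IPv6 address)
<- AAAA reply
-> A? requesting the IPv4 address again
<- A reply received
-> AAAA? requesting IPv6 again
<- AAAA reply
One A reply gets dropped for some reason, which is what this error message refers to: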
error sending response: host unreachable
However, it's unclear to me why a second AAAA query is needed.
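The same reading of the capture can be automated. This is a rough sketch, assuming the two-lines-per-packet layout that tcpdump -vvv prints in the question. The function name answer_delays and both regular expressions are illustrative. SAMPLE is abridged from the capture in the question, with response lines truncated after the first answer record.

```python
import re
from datetime import datetime

# Header line of each packet starts with the timestamp.
HEADER = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{6}) IP ")
# Detail line: "... 39535+ A? ..." is a query, "... 39535* q: A? ..." a reply.
DETAIL = re.compile(r"\] (\d+)\S* (q: )?(A|AAAA)\? ")

SAMPLE = """\
20:14:52.542526 IP (tos 0x0, ttl 64, id 35839, offset 0, flags [DF], proto UDP (17), length 63)
   192.168.1.1.59163 > 192.168.1.2.53: [bad udp cksum 0xf9f3 -> 0x96c7!] 39535+ A? example.com. (35)
20:14:52.542540 IP (tos 0x0, ttl 64, id 35840, offset 0, flags [DF], proto UDP (17), length 63)
   192.168.1.1.59163 > 192.168.1.2.53: [bad udp cksum 0xf9f3 -> 0x6289!] 45997+ AAAA? example.com. (35)
20:14:52.543281 IP (tos 0x0, ttl 61, id 63674, offset 0, flags [none], proto UDP (17), length 158)
   192.168.1.2.53 > 192.168.1.1.59163: [udp sum ok] 45997* q: AAAA? example.com. 1/1/0 example.com. [1h] CNAME s01.example.com.
20:14:57.547439 IP (tos 0x0, ttl 64, id 36868, offset 0, flags [DF], proto UDP (17), length 63)
   192.168.1.1.59163 > 192.168.1.2.53: [bad udp cksum 0xf9f3 -> 0x96c7!] 39535+ A? example.com. (35)
20:14:57.548188 IP (tos 0x0, ttl 61, id 64567, offset 0, flags [none], proto UDP (17), length 184)
   192.168.1.2.53 > 192.168.1.1.59163: [udp sum ok] 39535* q: A? example.com. 2/2/2 example.com. [1h] CNAME s01.example.com.
"""


def answer_delays(lines):
    """Map (query id, record type) -> seconds from first query to first reply."""
    first_query, delays, stamp = {}, {}, None
    for line in lines:
        header = HEADER.match(line)
        if header:
            stamp = datetime.strptime(header.group(1), "%H:%M:%S.%f")
            continue
        detail = DETAIL.search(line)
        if not detail or stamp is None:
            continue
        key = (int(detail.group(1)), detail.group(3))
        if detail.group(2) is None:      # no "q: " marker: an outgoing query
            first_query.setdefault(key, stamp)
        elif key in first_query and key not in delays:
            delays[key] = (stamp - first_query[key]).total_seconds()
    return delays


if __name__ == "__main__":
    for (query_id, rtype), seconds in answer_delays(SAMPLE.splitlines()).items():
        print(f"{rtype:4s} id={query_id}: answered after {seconds:.6f}s")
```

On this sample the AAAA query is answered in under a millisecond, while the A query only gets its reply after the retransmission, about 5 seconds later. That is consistent with the first A reply having been lost.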
To verify that you're hitting the same issue, you can change the timeout in /etc/resolv.conf:
options timeout:3
First create a text file with the custom timing report format:
cat >./curl-format.txt <<-EOF
   time_namelookup: %{time_namelookup}\n
      time_connect: %{time_connect}\n
   time_appconnect: %{time_appconnect}\n
     time_redirect: %{time_redirect}\n
  time_pretransfer: %{time_pretransfer}\n
time_starttransfer: %{time_starttransfer}\n
                   ----------\n
        time_total: %{time_total}\n
EOF
Then send the request:
$ curl -w "@curl-format.txt" -o /dev/null -s https://example.com

           time_namelookup:  3.511
              time_connect:  3.511
           time_appconnect:  3.528
          time_pretransfer:  3.528
             time_redirect:  0.000
        time_starttransfer:  3.531
                           ----------
                time_total:  3.531
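There are two other related options in man resolv.conf:
single-request (since glibc 2.10) Sets RES_SNGLKUP in _res.options. By default, glibc performs IPv4 and IPv6 lookups in parallel since version 2.9. Some appliance DNS servers cannot handle these queries properly and make the requests time out. This option disables the behavior and makes glibc perform the IPv6 and IPv4 requests sequentially (at the cost of some slowdown of the resolving process).
single-request-reopen (since glibc 2.9) The resolver uses the same socket for the A and AAAA requests. Some hardware mistakenly sends back only one reply. When that happens the client system will sit and wait for the second reply. Turning this option on changes this behavior so that if two requests from the same port are not handled correctly it will close the socket and open a new one before sending the second request.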
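To check which of these options a machine actually has set, the options lines can be collected with a few lines of code. This is only a sketch: the function name resolver_options and the EXAMPLE text are made up for illustration (the nameserver address is the one seen in the capture), and only the comment rule from man resolv.conf (a ; or # in the first column) is handled.

```python
EXAMPLE = """\
# illustrative resolv.conf contents, not taken from the original post
nameserver 192.168.1.2
options single-request-reopen timeout:3
"""


def resolver_options(text):
    """Collect the tokens from every `options` line of resolv.conf-style text."""
    options = {}
    for line in text.splitlines():
        if line.startswith(("#", ";")):
            continue  # comment line
        fields = line.split()
        if not fields or fields[0] != "options":
            continue
        for token in fields[1:]:
            # "timeout:3" -> ("timeout", "3"); bare flags are stored as True
            name, _, value = token.partition(":")
            options[name] = value or True
    return options


if __name__ == "__main__":
    print(resolver_options(EXAMPLE))
```

To inspect a real system, pass the contents of /etc/resolv.conf instead of EXAMPLE.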
Related questions:
DNS lookups sometimes take 5 seconds
Delay associated with AAAA requests

Source: https://unix.stackexchange.com/questions/290987
