How to find out why bash exits with signal 11 (segmentation fault)

January 24, 2018

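On my production server running Red Hat Linux (v6), I frequently get core dumps from bash. This happens anywhere from a few times a day to a few dozen times a day.
TL;DR
Solution: install bash-debuginfo to get more detail out of the core and find the statement that causes the crash.
Cause: in this case, it was a bug that is not fixed in my old version of bash, lists.gnu.org/archive/html/bug-bash/2010-04/msg00038.html, reported in April 2010 against 4.1 and fixed in 4.2 (released in early 2011).
Details
This server runs a single web application (apache + cgi-bin) and many batch jobs. The webapp CGIs (C programs) make system calls quite often.
There is not much interactive shell use, so the core dumps are probably caused by some service or by the webapp, and I have to find out what triggers the error.
The core dump backtrace is rather terse (see below).
How can I get more details about the error? I would like to know the parent process chain (in full detail), the current variables and environment, and which script and/or command was being executed…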
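One way to capture the parent process chain is to record it while the process is still alive, for example from a logging hook placed at the top of the suspect scripts. The following is a minimal sketch under that assumption; it relies on the Linux /proc filesystem, and the function name `parent_chain` is hypothetical. It cannot recover anything from a process that has already crashed.

```shell
# Print "PID<TAB>NAME" for a process and each of its ancestors.
# Assumes Linux with /proc mounted; parent_chain is an illustrative name.
parent_chain() (
    pid=${1:-$$}
    while [ "${pid:-0}" -gt 0 ] && [ -r "/proc/$pid/status" ]; do
        # Name: and PPid: are fields of /proc/<pid>/status
        name=$(sed -n 's/^Name:[[:space:]]*//p' "/proc/$pid/status")
        ppid=$(sed -n 's/^PPid:[[:space:]]*//p' "/proc/$pid/status")
        printf '%s\t%s\n' "$pid" "$name"
        pid=$ppid
    done
)

# Example: log the chain of the current shell
parent_chain "$$"
```

The environment of a live process owned by the same user can be read in a similar way with `tr '\0' '\n' < /proc/PID/environ`.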
I have enabled the audit system, but the audit lines about this are also rather terse. Here is an example:
type=ANOM_ABEND msg=audit(1516626710.805:413350): auid=1313 uid=1313 gid=22107 ses=64579 pid=8655 comm="bash" sig=11
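Even a terse record like this carries the PID, the user, and an epoch timestamp, which can be matched against cron or web server logs. As a rough sketch (the helper names `audit_field` and `audit_epoch` are made up for illustration, and values are assumed to contain no spaces), the fields could be pulled out like this:

```shell
# Extract the value of one key=value field from an audit record.
# $1 = field name, $2 = record line. Assumes values contain no spaces.
audit_field() {
    printf '%s\n' "$2" | tr ' ' '\n' | sed -n "s/^$1=//p" | head -n 1
}

# Extract the epoch seconds from the msg=audit(SECONDS.MILLIS:SERIAL) part.
audit_epoch() {
    printf '%s\n' "$1" | sed -n 's/.*msg=audit(\([0-9]*\)\..*/\1/p'
}

record='type=ANOM_ABEND msg=audit(1516626710.805:413350): auid=1313 uid=1313 gid=22107 ses=64579 pid=8655 comm="bash" sig=11'

audit_field pid "$record"    # prints 8655
audit_field sig "$record"    # prints 11
audit_epoch "$record"        # prints 1516626710
```

With GNU date, `date -d @1516626710` turns the epoch value into a readable timestamp.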
Here is the core backtrace:
   Core was generated by `bash'.
Program terminated with signal 11, Segmentation fault.
#0  0x000000370487b8ec in free () from /lib64/libc.so.6
#0  0x000000370487b8ec in free () from /lib64/libc.so.6
#1  0x000000000044f0b0 in hash_flush ()
#2  0x0000000000458870 in assoc_dispose ()
#3  0x0000000000434f55 in dispose_variable ()
#4  0x000000000044f0a7 in hash_flush ()
#5  0x0000000000433ef3 in pop_var_context ()
#6  0x0000000000434375 in pop_context ()
#7  0x0000000000451fb1 in ?? ()
#8  0x0000000000451c84 in run_unwind_frame ()
#9  0x000000000043200f in ?? ()
#10 0x000000000042fa18 in ?? ()
#11 0x0000000000430463 in execute_command_internal ()
#12 0x000000000046b86b in parse_and_execute ()
#13 0x0000000000444a01 in command_substitute ()
#14 0x000000000044e38e in ?? ()
#15 0x0000000000448d4e in ?? ()
#16 0x000000000044a1b7 in ?? ()
#17 0x0000000000457ac8 in expand_compound_array_assignment ()
#18 0x0000000000445e79 in ?? ()
#19 0x000000000044a264 in ?? ()
#20 0x000000000042ee9f in ?? ()
#21 0x0000000000430463 in execute_command_internal ()
#22 0x000000000043110e in execute_command ()
#23 0x000000000043357e in ?? ()
#24 0x00000000004303bd in execute_command_internal ()
#25 0x0000000000430362 in execute_command_internal ()
#26 0x0000000000432169 in ?? ()
#27 0x000000000042fa18 in ?? ()
#28 0x0000000000430463 in execute_command_internal ()
#29 0x000000000043110e in execute_command ()
#30 0x000000000041d6d6 in reader_loop ()
#31 0x000000000041cebc in main ()
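Update: the system runs in a virtual machine managed by VMware.
Which version of bash? GNU bash, version 4.1.2(1)-release (x86_64-redhat-linux-gnu)
Which versions of libc and of the other libraries linked to bash?
ldd (GNU libc) 2.12
(What are the other libraries linked to bash? Is there a command that lists them all at once?)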
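For the question about linked libraries, `ldd` run against the bash binary lists the shared objects it loads. A small sketch follows; the wrapper name `linked_libs` and the default path /bin/bash are assumptions.

```shell
# List the shared libraries a binary is linked against.
# linked_libs is an illustrative wrapper; /bin/bash is an assumed default path.
linked_libs() {
    local binary=${1:-/bin/bash}
    if [ ! -x "$binary" ]; then
        echo "linked_libs: $binary is not an executable file" >&2
        return 1
    fi
    ldd "$binary"
}
```

Calling `linked_libs /bin/bash` prints one line per library. On an RPM-based system, `rpm -q bash glibc` should additionally report the installed package versions.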
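Does this happen when running scripts, interactive shells, or both? If it is scripts, does it happen with only one script, with several, or with any script? In general, what do your bash scripts do? Do you get segfaults from other programs? Have you run a memory test on the server? Does it have ECC RAM?
As stated in my question: I don't know, but it must be caused by some scheduled script or by some system call made from within the interactive webapp. It could also be a "script within a script", as in this kind of construct: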
myVar=$($(some command here $(and here too)))
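But I don't think this is a physical RAM problem, because there are no other random crashes, only this one, and we also see it on 2 separate VMs running on 2 separate physical machines.
Update 2:
From the stack, I suspect the problem is related to associative arrays: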
#1  0x000000000044f0b0 in hash_flush ()
#2  0x0000000000458870 in assoc_dispose ()
#3  0x0000000000434f55 in dispose_variable ()
#4  0x000000000044f0a7 in hash_flush ()
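Such variables are present in almost all of our custom scripts: there is a main script that uses a library containing the variables and functions common to our system.
This script is sourced by almost every one of our scripts.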

As gdb suggested, I installed the debuginfo package, and then I got the expression that caused the crash:
#20 0x0000000000457ac8 in expand_compound_array_assignment (
   var=<value optimized out>, 
   value=0x150c660 "$(logPath \"$@\")", flags=<value optimized out>
)
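A backtrace with arguments like the one above can be reproduced non-interactively once debug symbols are installed. This is only a sketch: the function name `core_backtrace` is hypothetical, and it assumes the symbols were installed beforehand, for example with `debuginfo-install bash` from yum-utils.

```shell
# Print a full backtrace from a core file without entering the gdb prompt.
# Assumption: debug symbols are already installed, for example with
#   debuginfo-install bash      # provided by yum-utils on RHEL/CentOS
core_backtrace() {
    local binary=$1 core=$2
    if [ ! -r "$binary" ] || [ ! -r "$core" ]; then
        echo "usage: core_backtrace BINARY COREFILE (both must be readable)" >&2
        return 1
    fi
    # -batch exits after the commands have run; 'bt full' also prints locals
    gdb -batch -ex 'bt full' "$binary" "$core"
}
```

A call would look like `core_backtrace /bin/bash /path/to/core`, where the core path depends on the system's core pattern.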
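So now I know where the problem is. In my case, it was in a function located in .bashrc, and the root cause was an incorrect redefinition of a map (associative array) variable in Bash: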
declare -A myMap
local myMap=""

...
for key in "${!myMap[@]}"; do 
 echo ${myMap[$key]}
done    
This function was called in a subshell, which caused the "Segmentation fault" error output to be hidden.
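A possible way to avoid the double declaration is to declare the map once, as both local and associative, and to check the exit status of the subshell so that a death by signal does not go unnoticed. The sketch below requires bash 4 or later and has not been verified against the affected 4.1.2 build; `print_map`, `describe_status`, and the map entries are illustrative, not taken from the original .bashrc.

```shell
#!/usr/bin/env bash
# Requires bash 4+ (associative arrays).

# Hypothetical stand-in for the .bashrc function: the map is declared once.
print_map() {
    local -A myMap=()   # single declaration: local and associative
    local key
    myMap[alpha]=1      # illustrative entries, not from the original
    myMap[beta]=2
    for key in "${!myMap[@]}"; do
        printf '%s=%s\n' "$key" "${myMap[$key]}"
    done
}

# Interpret an exit status: bash reports 128+N for death by signal N.
describe_status() {
    local status=$1
    if [ "$status" -gt 128 ]; then
        printf 'killed by signal %d\n' "$((status - 128))"
    else
        printf 'exited with status %d\n' "$status"
    fi
}

# Run the function in a subshell, then inspect how the subshell ended.
output=$(print_map)
status=$?
printf '%s\n' "$output"
describe_status "$status" >&2
```

A status of 139 would correspond to signal 11, although a script can also return such a value deliberately with `exit`, so this check is a hint rather than proof.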

Source: https://unix.stackexchange.com/questions/419118
