
Understanding Machine Check Exceptions (MCE)

  • June 8, 2017

While trying to debug frequent freezes on a new laptop (Kaby Lake architecture) running Ubuntu 16.04, I came across the following entry in kern.log:

kernel: [    0.041634] mce: [Hardware Error]: Machine check events logged

Since then I have installed mcelog, but I don't know what to do with its log. The contents of /var/log/mcelog are:

mcelog: Family 6 Model 8e CPU: only decoding architectural errors
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 6 
MISC 3880018086 ADDR fef1cf00 
TIME 1479298799 Wed Nov 16 13:19:59 2016
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee2000000040110a MCGSTATUS 0
MCGCAP c08 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 142
mcelog: Family 6 Model 8e CPU: only decoding architectural errors
Hardware event. This is not a software error.
MCE 1
CPU 0 BANK 7 
MISC 43880018086 ADDR fef1ff00 
TIME 1479298799 Wed Nov 16 13:19:59 2016
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee2000000040110a MCGSTATUS 0
MCGCAP c08 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 142
mcelog: Family 6 Model 8e CPU: only decoding architectural errors
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 6 
MISC 3880018086 ADDR fef1cf00 
TIME 1479321645 Wed Nov 16 19:40:45 2016
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee2000000040110a MCGSTATUS 0
MCGCAP c08 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 142
mcelog: Family 6 Model 8e CPU: only decoding architectural errors
Hardware event. This is not a software error.
MCE 1
CPU 0 BANK 7 
MISC 43880018086 ADDR fef1ff00 
TIME 1479321645 Wed Nov 16 19:40:45 2016
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee2000000040110a MCGSTATUS 0
MCGCAP c08 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 142
mcelog: Family 6 Model 8e CPU: only decoding architectural errors
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 6 
MISC 43880000086 ADDR fef1db80 
TIME 1479328438 Wed Nov 16 21:33:58 2016
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee2000000040110a MCGSTATUS 0
MCGCAP c08 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 142
mcelog: Family 6 Model 8e CPU: only decoding architectural errors
Hardware event. This is not a software error.
MCE 1
CPU 0 BANK 7 
MISC 13880000086 ADDR fef1dc00 
TIME 1479328438 Wed Nov 16 21:33:58 2016
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee2000000040110a MCGSTATUS 0
MCGCAP c08 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 142
mcelog: Family 6 Model 8e CPU: only decoding architectural errors
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 6 
MISC 43880000086 ADDR fef1db80 
TIME 1479333991 Wed Nov 16 23:06:31 2016
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee2000000040110a MCGSTATUS 0
MCGCAP c08 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 142
mcelog: Family 6 Model 8e CPU: only decoding architectural errors
Hardware event. This is not a software error.
MCE 1
CPU 0 BANK 7 
MISC 13880000086 ADDR fef1dc00 
TIME 1479333991 Wed Nov 16 23:06:31 2016
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee2000000040110a MCGSTATUS 0
MCGCAP c08 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 142
mcelog: Family 6 Model 8e CPU: only decoding architectural errors
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 6 
MISC 43880000086 ADDR fef1db80 
TIME 1479373350 Thu Nov 17 10:02:30 2016
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee2000000040110a MCGSTATUS 0
MCGCAP c08 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 142
mcelog: Family 6 Model 8e CPU: only decoding architectural errors
Hardware event. This is not a software error.
MCE 1
CPU 0 BANK 7 
MISC 13880000086 ADDR fef1dc00 
TIME 1479373350 Thu Nov 17 10:02:30 2016
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee2000000040110a MCGSTATUS 0
MCGCAP c08 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 142
mcelog: Family 6 Model 8e CPU: only decoding architectural errors
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 6 
MISC 3880018086 ADDR fef1cf00 
TIME 1479373810 Thu Nov 17 10:10:10 2016
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee0000000040110a MCGSTATUS 0
MCGCAP c08 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 142
mcelog: Family 6 Model 8e CPU: only decoding architectural errors
Hardware event. This is not a software error.
MCE 1
CPU 0 BANK 7 
MISC 43880018086 ADDR fef1ff00 
TIME 1479373810 Thu Nov 17 10:10:10 2016
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee0000000040110a MCGSTATUS 0
MCGCAP c08 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 142
mcelog: Family 6 Model 8e CPU: only decoding architectural errors
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 6 
MISC 3880018086 ADDR fef1cf00 
TIME 1479375712 Thu Nov 17 10:41:52 2016
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee2000000040110a MCGSTATUS 0
MCGCAP c08 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 142
mcelog: Family 6 Model 8e CPU: only decoding architectural errors
Hardware event. This is not a software error.
MCE 1
CPU 0 BANK 7 
MISC 43880018086 ADDR fef1ff00 
TIME 1479375712 Thu Nov 17 10:41:52 2016
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee2000000040110a MCGSTATUS 0
MCGCAP c08 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 142
mcelog: Family 6 Model 8e CPU: only decoding architectural errors
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 6 
MISC 3880018086 ADDR fef1cf00 
TIME 1479385932 Thu Nov 17 13:32:12 2016
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee2000000040110a MCGSTATUS 0
MCGCAP c08 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 142
mcelog: Family 6 Model 8e CPU: only decoding architectural errors
Hardware event. This is not a software error.
MCE 1
CPU 0 BANK 7 
MISC 43880018086 ADDR fef1ff00 
TIME 1479385932 Thu Nov 17 13:32:12 2016
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee2000000040110a MCGSTATUS 0
MCGCAP c08 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 142
mcelog: Family 6 Model 8e CPU: only decoding architectural errors
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 6 
MISC 3880018086 ADDR fef1cf00 
TIME 1479387666 Thu Nov 17 14:01:06 2016
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee2000000040110a MCGSTATUS 0
MCGCAP c08 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 142
mcelog: Family 6 Model 8e CPU: only decoding architectural errors
Hardware event. This is not a software error.
MCE 1
CPU 0 BANK 7 
MISC 43880018086 ADDR fef1ff00 
TIME 1479387666 Thu Nov 17 14:01:06 2016
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee2000000040110a MCGSTATUS 0
MCGCAP c08 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 142
mcelog: Family 6 Model 8e CPU: only decoding architectural errors
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 6 
MISC 43880000086 ADDR fef1db80 
TIME 1479456710 Fri Nov 18 09:11:50 2016
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee2000000040110a MCGSTATUS 0
MCGCAP c08 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 142
mcelog: Family 6 Model 8e CPU: only decoding architectural errors
Hardware event. This is not a software error.
MCE 1
CPU 0 BANK 7 
MISC 13880000086 ADDR fef1dc00 
TIME 1479456710 Fri Nov 18 09:11:50 2016
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee2000000040110a MCGSTATUS 0
MCGCAP c08 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 142
mcelog: Family 6 Model 8e CPU: only decoding architectural errors
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 6 
MISC 43880000086 ADDR fef1db80 
TIME 1479459374 Fri Nov 18 09:56:14 2016
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee2000000040110a MCGSTATUS 0
MCGCAP c08 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 142
mcelog: Family 6 Model 8e CPU: only decoding architectural errors
Hardware event. This is not a software error.
MCE 1
CPU 0 BANK 7 
MISC 13880000086 ADDR fef1dc00 
TIME 1479459374 Fri Nov 18 09:56:14 2016
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee2000000040110a MCGSTATUS 0
MCGCAP c08 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 142

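Some observations (please correct me if any of them are wrong):

  • Almost all errors seem to occur on the same page (ADDR fef1xxx).
  • Only banks 6 and 7 seem to be affected.
  • All entries contain "Error overflow" and "Uncorrected error".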
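The observations above can be checked mechanically rather than by eye. The following is a minimal sketch, assuming the plain-text record layout shown in the log above; `summarize_mcelog` is a hypothetical helper, not part of mcelog, and grouping addresses by their upper bits (the `fef1` prefix, a 64 KiB-aligned region) is an assumption chosen to match the first observation.

```python
import re
from collections import Counter

# Each record in the dump above starts with this line.
RECORD_START = "Hardware event. This is not a software error."

def summarize_mcelog(text):
    """Tally banks, address regions and status flags in an mcelog text dump."""
    banks, regions, flags = Counter(), Counter(), Counter()
    for record in text.split(RECORD_START)[1:]:
        bank = re.search(r"\bBANK (\d+)", record)
        addr = re.search(r"\bADDR ([0-9a-f]+)", record)
        if bank:
            banks[int(bank.group(1))] += 1
        if addr:
            # Assumption: group by 64 KiB-aligned region (drop the low 16 bits),
            # which corresponds to the "fef1" prefix noted in the post.
            regions[int(addr.group(1), 16) >> 16] += 1
        for flag in ("Error overflow", "Uncorrected error"):
            if flag in record:
                flags[flag] += 1
    return banks, regions, flags

# Two records abbreviated from the log above.
sample = """\
mcelog: Family 6 Model 8e CPU: only decoding architectural errors
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 6
MISC 3880018086 ADDR fef1cf00
MCi status:
Error overflow
Uncorrected error
MCi_ADDR register valid
STATUS ee2000000040110a MCGSTATUS 0
mcelog: Family 6 Model 8e CPU: only decoding architectural errors
Hardware event. This is not a software error.
MCE 1
CPU 0 BANK 7
MISC 43880018086 ADDR fef1ff00
MCi status:
Error overflow
Uncorrected error
MCi_ADDR register valid
STATUS ee2000000040110a MCGSTATUS 0
"""

banks, regions, flags = summarize_mcelog(sample)
print(dict(banks), {hex(r): n for r, n in regions.items()}, dict(flags))
```

Run against the full /var/log/mcelog, the same function would show whether any record falls outside banks 6 and 7 or outside the `fef1` region.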

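The mcelog FAQ mentions that a low rate of corrected memory errors is expected and does not require hardware replacement or other action. My log entries, however, contain the phrase "Uncorrected error", which suggests that I actually should take some action.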
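The flag lines that mcelog prints ("Error overflow", "Uncorrected error", and so on) can be read directly off the raw STATUS value. Here is a minimal sketch, assuming the architectural bit layout of the IA32_MCi_STATUS register as described in the Intel SDM; the names `STATUS_FLAGS`, `decode_status` and `mca_error_code` are hypothetical and not part of mcelog.

```python
# Architectural flag bits of IA32_MCi_STATUS (bit positions per the Intel SDM).
STATUS_FLAGS = {
    63: "VAL (register valid)",
    62: "OVER (error overflow)",
    61: "UC (uncorrected error)",
    60: "EN (error reporting enabled)",
    59: "MISCV (MCi_MISC register valid)",
    58: "ADDRV (MCi_ADDR register valid)",
    57: "PCC (processor context corrupt)",
}

def decode_status(status):
    """Return the names of the flag bits set in a raw MCi_STATUS value."""
    return [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]

def mca_error_code(status):
    """Return the architectural MCA error code, bits 15:0 of MCi_STATUS."""
    return status & 0xFFFF

# STATUS value taken from the log above.
status = 0xEE2000000040110A
for name in decode_status(status):
    print(name)
print(hex(mca_error_code(status)))
```

This only restates what mcelog has already decoded; it says nothing about whether the reported errors are real or spurious.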

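My questions are:

  1. What do these errors mean, and should I be worried about them?
  2. Could these hardware errors be causing the whole system to freeze?
  3. Should I ask the manufacturer to replace the laptop (or a component)?
  4. Is there anything else I should do?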

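First of all, I'm afraid I can't answer your questions very well. I also own a Dell XPS 13 (9360) and see the same MCE messages. Because of them, I have been in contact with Dell support. They replaced the motherboard, but that did not help: the same messages still appear in the log. At some point they concluded that this might be a false positive. They don't know what causes it, though (an mcelog/kernel/Intel problem?). Communication with support is still ongoing.

<rant> By the way, talking to Dell support is a very unpleasant experience. They seem to suggest only "standard" solutions such as resetting the firmware, running the self-test, and so on. I did not get the impression that I was talking to anyone with real technical insight. </rant>

To add more detail: I see the same problem on Fedora 24, so it does not seem to be specific to Ubuntu.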

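Regarding your questions:

What do these errors mean, and should I be worried about them?

I don't know. Dell support believes they are false positives.

Could these hardware errors be causing the whole system to freeze?

Apart from the messages, my system works fine. My guess is that the freezes are a separate problem.

Should I ask the manufacturer to replace the laptop (or a component)?

Replacing the motherboard did not fix the MCE messages. It might fix the freezes, although those seem to have been fixed by a kernel update.

Is there anything else I should do?

If you have not contacted support yet, please do. Once they see that this affects more customers, maybe they will come up with a real solution.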

Source: https://unix.stackexchange.com/questions/324237