Linux

Writing triggers for mcelog

  • May 18, 2013

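I'm just starting to look into mcelog for the first time (I've had it enabled and have seen the syslog output before, but this is the first time I've tried to do anything non-default with it). I'm looking for information on how to write triggers for it. Specifically, I want to know what kinds of events mcelog can react to, how it decides which scripts to run, and so on. The best I can get from the example triggers is that it sets a bunch of environment variables before calling the script. So does it just try to execute everything in the trigger directory (/etc/mcelog on RHEL) and leave it to each script to decide what it wants to do?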

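I've seen other trigger scripts with names that look like MCE events. Is that a convention, or does the name have a special function? I created a trigger called /etc/mcelog/joel.sh that just sends a basic email to my Gmail account. A few days ago the trigger apparently went off, because I received the email from the script without having run it by hand. I hadn't thought to pipe the output of env into the mailx command in joel.sh, so I don't know what hardware event triggered the script, or why mcelog chose joel.sh as the script to run for it.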

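Basically, I'm looking for an answer that gives me a basic orientation on mcelog, its trigger system, and how I can use it to monitor my hardware's health. I'm sure that once I have my bearings I'll be able to figure out the more advanced stuff.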

Looking at the sample mcelog.conf configuration file, it appears to contain most, if not all, of the trigger types that mcelog can handle.

DIMM

[dimm]
#
# execute these triggers when the rate of corrected or uncorrected
# errors per DIMM exceeds the threshold
# Note when the hardware does not report DIMMs this might also
# be per channel
# The default of 10/24h is reasonable for server quality
# DDR3 DIMMs as of 2009/10
#uc-error-trigger = dimm-error-trigger
uc-error-threshold = 1 / 24h
#ce-error-trigger = dimm-error-trigger
ce-error-threshold = 10 / 24h

Socket

[socket]
# Threshold and trigger for uncorrected memory errors on a socket
# mem-uc-error-trigger = socket-memory-error-trigger
mem-uc-error-threshold = 100 / 24h
# Threshold and trigger for corrected memory errors on a socket
mem-ce-error-trigger = socket-memory-error-trigger
mem-ce-error-threshold = 100 / 24h

Cache and memory pages

[cache]
# Processing of cache error thresholds reported by Intel CPUs
cache-threshold-trigger = cache-error-trigger

[page]
# Memory error accounting per 4K memory page
# Threshold for the corrected memory errors trigger script
memory-ce-threshold = 10 / 24h
# Trigger script for corrected errors
# memory-ce-trigger = page-error-trigger

Triggers

The triggers themselves can be controlled in this section.

[trigger]
# Maximum number of running triggers
children-max = 2
# execute triggers in this directory
directory = /etc/mcelog
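
The excerpts above suggest that a script is tied to an event through a `*-trigger` key in the configuration, and that a commented-out key leaves that trigger disabled. To check which keys are active in a given file, something like the following could be used. This is a minimal sketch and not part of mcelog: the function name `list_triggers` is hypothetical, and the path to the configuration file has to be supplied by the caller.

```shell
#!/bin/sh
# list_triggers CONFIG
# Print "key = value" for every uncommented *-trigger setting in CONFIG.
# Hypothetical helper for illustration; it is not shipped with mcelog.
list_triggers() {
    # The pattern is anchored at the start of the line, so lines that
    # begin with '#' (commented-out settings) never match.
    sed -n 's/^[[:space:]]*\([a-z-]*-trigger\)[[:space:]]*=[[:space:]]*\(.*\)$/\1 = \2/p' "$1"
}
```

Run against the excerpts quoted above, this would print only `mem-ce-error-trigger` and `cache-threshold-trigger`, since the other trigger keys are commented out in the sample.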

Example triggers

There are some example triggers on the mcelog GitHub page.

Example trigger script, dimm-error-trigger

#!/bin/sh
#  This shell script can be executed by mcelog in daemon mode when a DIMM
#  exceeds a pre-configured error threshold
# 
# environment:
# THRESHOLD       human readable threshold status
# MESSAGE         Human readable consolidated error message
# TOTALCOUNT      total count of errors for current DIMM of CE/UC depending on
#                 what triggered the event
# LOCATION        Consolidated location as a single string
# DMI_LOCATION    DIMM location from DMI/SMBIOS if available
# DMI_NAME        DIMM identifier from DMI/SMBIOS if available
# DIMM            DIMM number reported by hardware
# CHANNEL         Channel number reported by hardware
# SOCKETID        Socket ID of CPU that includes the memory controller with the DIMM
# CECOUNT         Total corrected error count for DIMM
# UCCOUNT         Total uncorrected error count for DIMM
# LASTEVENT       Time stamp of event that triggered threshold (in time_t format, seconds)
# THRESHOLD_COUNT Total number of events in current threshold time period of specific type
#
# note: will run as mcelog configured user
# this can be changed in mcelog.conf

logger -s -p daemon.err -t mcelog "$MESSAGE"
logger -s -p daemon.err -t mcelog "Location: $LOCATION"

[ -x ./dimm-error-trigger.local ] && . ./dimm-error-trigger.local

exit 0
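
The example script sources `./dimm-error-trigger.local` when that file exists and is executable, which looks like the natural place to record what caused a trigger to run. A minimal sketch of such a file might look like this. The function name `dump_trigger_env` and the log path in the comment are assumptions for illustration, and the variable list is copied from the comment block of the example script rather than from any other mcelog documentation.

```shell
#!/bin/sh
# Hypothetical contents for dimm-error-trigger.local.

# dump_trigger_env FILE
# Append a timestamp and the variables documented in the example trigger
# to FILE, so a later look at the file shows what caused the run.
dump_trigger_env() {
    {
        echo "=== trigger run at $(date)"
        for name in THRESHOLD MESSAGE TOTALCOUNT LOCATION DMI_LOCATION \
                    DMI_NAME DIMM CHANNEL SOCKETID CECOUNT UCCOUNT \
                    LASTEVENT THRESHOLD_COUNT; do
            # eval is applied only to the fixed names listed above;
            # unset variables are written as empty values.
            eval "value=\${$name:-}"
            echo "$name=$value"
        done
    } >> "$1"
}

# Example call, to be placed at the end of the sourced file
# (the path is an assumption and must be writable by the trigger's user):
# dump_trigger_env /var/log/mcelog-trigger-env.log
```

As the example script notes, triggers run as the user mcelog is configured with, so the chosen log file has to be writable by that user. The same function could be called from a custom script such as joel.sh before it sends its email.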

References

Source: https://unix.stackexchange.com/questions/76307