Ubuntu
System freeze: Input/output error
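I set up a machine to use as a server, running Ubuntu 20.04. It had been working well, but recently it started behaving very strangely. Once, while I was working on it remotely, everything suddenly became unusable. None of the binaries could be accessed, and whenever I tried to invoke one by its full path, for example /usr/bin/echo "Test", I got an error of the form cannot <command>: Input/Output error. After searching online, I found that this may be a hard disk problem. My question is: how do I fix it? The system is clearly in an unstable state, and this has to be resolved. Any suggestions?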
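Here is a sample from the log, taken with dmesg -T --level=warn,err after a colleague rebooted the machine, although I cannot see anything in it that links the problem to the hard disk.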
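When scanning a long kernel log for disk-related trouble, it can help to narrow it down first. The following is a minimal sketch, not a definitive diagnostic: the helper name filter_storage_msgs is hypothetical, and the pattern list is only an assumption about message prefixes that commonly accompany storage problems (libata port messages, block-layer I/O errors, ext4 errors). It reads log text on stdin, so it can be tried on a saved log as well as on live dmesg output.

```shell
#!/usr/bin/env bash
# filter_storage_msgs (hypothetical helper): keep only kernel log lines that
# commonly accompany storage trouble. The pattern list is an assumption and
# is not exhaustive.
filter_storage_msgs() {
    # "|| true" keeps the exit status at 0 when nothing matches.
    grep -Ei 'I/O error|(^|[^[:alnum:]])ata[0-9]+(\.[0-9]+)?:|EXT4-fs error|blk_update_request' || true
}

# Typical use on the affected machine:
#   sudo dmesg -T --level=warn,err | filter_storage_msgs
```

An empty result does not prove the disk path is healthy. It only means none of these particular patterns appeared at the selected log levels.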
smartmontools log (sudo smartctl -a /dev/sda)
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-47-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Toshiba X300
Device Model:     TOSHIBA HDWE140
Serial Number:    69F9K2YWFBBG
LU WWN Device Id: 5 000039 95bb0145e
Firmware Version: FP1R
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Sep 25 15:35:54 2020 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 120) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 479) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0027   100   100   001    Pre-fail  Always       -       4092
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       12
  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   050    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   050    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       1886
 10 Spin_Retry_Count        0x0033   100   100   030    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       12
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       6
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       8
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       12
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       48 (Min/Max 26/55)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   253   000    Old_age   Always       -       0
220 Disk_Shift              0x0002   100   100   000    Old_age   Always       -       0
222 Loaded_Hours            0x0032   096   096   000    Old_age   Always       -       1886
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
224 Load_Friction           0x0022   100   100   000    Old_age   Always       -       0
226 Load-in_Time            0x0026   100   100   000    Old_age   Always       -       573
240 Head_Flying_Hours       0x0001   100   100   001    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
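A report this long is easier to judge when reduced to the few attributes that usually matter for an I/O error. The sketch below is an illustration, not part of smartmontools: the helper name smart_key_attrs is hypothetical, and the choice of attribute IDs 5, 197, 198 (sector reallocation and pending/uncorrectable sectors) and 199 (interface CRC errors, often associated with cabling) is an assumption about what is worth checking first. It expects normal, line-per-attribute smartctl output on stdin.

```shell
#!/usr/bin/env bash
# smart_key_attrs (hypothetical helper): print "ID NAME RAW_VALUE" for a few
# attributes often checked first when a disk is suspected.
# Reads `smartctl -a` or `smartctl -A` output on stdin.
smart_key_attrs() {
    # NF >= 10 skips unrelated rows that also start with a number,
    # such as the selective self-test SPAN table.
    awk 'NF >= 10 && ($1 == 5 || $1 == 197 || $1 == 198 || $1 == 199) {
        print $1, $2, $10
    }'
}

# Typical use on the affected machine:
#   sudo smartctl -a /dev/sda | smart_key_attrs
```

In the log above, all four of these raw values are 0 and no errors are logged, which is consistent with the drive itself not being the source of the I/O errors, though it does not rule out every disk fault.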
So, in the end, I got the chance to look at the actual machine and make some changes.
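The problem was solved (or at least it appears to be) as follows. SATA port #0 on the motherboard had a dangling cable plugged into it that was not connected to any HDD or SSD, while my HDD was attached through a different cable on SATA port #1. In many cases, a motherboard prioritizes its SATA ports by ID (0 > 1 > 2 > 3 > …).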
So I removed the dangling cable (honestly, I have no idea who put it there) and booted the machine.
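Since I made this silly change, that is, unplugging a dangling cable from the SATA hub on the motherboard, the problem has not recurred. Evidently this was not a disk or partition failure, since all the disks are new and in good shape.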
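For anyone who wants to see how the kernel views each SATA port without opening the case, the link messages in the kernel log are one place to look. This is only a sketch under assumptions: the helper name sata_link_states is hypothetical, it relies on libata's "ataN: SATA link up/down" message wording, and the kernel's ataN numbers do not necessarily match the port labels printed on the motherboard. It does not prove that a dangling cable caused the errors; it only lists the link state the kernel reported.

```shell
#!/usr/bin/env bash
# sata_link_states (hypothetical helper): print "<ataN> <up|down>" for each
# SATA link message found in kernel log text read from stdin.
sata_link_states() {
    sed -nE 's/.*(ata[0-9]+): SATA link (up|down).*/\1 \2/p'
}

# Typical use:
#   sudo dmesg | sata_link_states
```

Running this before and after a cabling change gives a simple way to compare what the kernel detects on each port.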