How do you know if your hard disk is about to die?
A couple of months ago, a writer friend called me about a problem with her computer. The sort of problem that drives you nuts: an intermittent one.
Sometimes the machine would be slow to boot. V-e-r-y slow. Sometimes it would freeze while she was working, then resume. Other times it would behave perfectly normally. But the bad times were getting more frequent and she’d reached the stage where she no longer trusted the machine.
What was it? What could she do?
She’d called in a geek – the wheeled variety (Kiwis will know who I mean) – who performed some tests, did some checks, presented a bill and declared the machine was fine. Only it wasn’t.
Someone recommended “their guy” who charged in, did some stuff, uttered some techno-babble and charged out again. (As well as charging, in the other sense.)
He went away, but the problems didn’t.
So she called me.
Oh great. Two “experts” had failed. What chance did I have?
But in talking to her and her husband about the problems – something neither of my predecessors had done – I began to see a pattern in the randomness, booted the machine, hit F2, and within two minutes had the solution.
The machine was an HP. Like many “brand” computers, HPs contain a set of hardware diagnostic tools available from the boot menu. All I did was kick them off.
A typical short hard drive check takes around two minutes. And, as I’d guessed, two minutes later the diagnostics reported the hard drive was failing.
The machine was a little over a year old, still under warranty, and the faulty drive was replaced within a week.
Behind the scenes
Hard drives die in one of two ways. Around 40% go suddenly and without warning. The remainder suffer lingering deaths from mechanical wear and drive surface degradation, sometimes giving out warnings – like my friend’s – in the form of sluggish response and erratic performance. And, if you know where to look, you can see and even log their decline.
Behind the scenes, that HP diagnostics program ran a SMART analysis of the hard disk. SMART stands for Self-Monitoring, Analysis, and Reporting Technology, and is built into almost all modern hard disks and solid-state drives. It tries to anticipate failure by running a series of electrical and mechanical tests and recording the results. Some tests are more useful than others, but by looking at past failures and their frequency, it can provide you with a vital clue that a drive’s on its way out.
Some motherboards display a SMART drive status when they boot. Some don’t. Plus, there are many different types of drive and types of connection – USB, Firewire, ATA, SATA, SCSI, SSA, RAID, etc. That “low-levelness” is something operating systems like Windows struggle with. What’s more, SMART is only a “sort of” standard. Most drive manufacturers follow the basic implementation, but only some aspects are cross-compatible.
As usual, Linux users have the edge here. Installing SMART is simply a matter of installing Smartmontools:
sudo apt-get install smartmontools
This provides two utilities: smartctl, a monitoring and control program, and smartd, a disk monitoring daemon.
To get information about the disk and see whether it supports SMART:
sudo smartctl -i /dev/sda
where sda is the drive concerned. (Use lsblk to see what drives are attached to the machine.)
This will give you a summary of your drive. Look for the lines:
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
If SMART’s not enabled, enable it with:
sudo smartctl -s on /dev/sda
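If you have several drives to check, those two lines are easy to test for in a script. Here is a minimal sketch, assuming the output is worded as shown above. The function name smart_support_state is my own invention for illustration, not part of smartmontools.

```shell
#!/bin/sh
# smart_support_state (hypothetical helper): reads `smartctl -i` output on
# stdin and prints "enabled", "disabled" or "unavailable".
# Assumes the "SMART support is:" wording shown in the article.
smart_support_state() {
    info=$(cat)
    if printf '%s\n' "$info" | grep -q '^SMART support is:[[:space:]]*Enabled'; then
        echo enabled        # SMART is present and switched on
    elif printf '%s\n' "$info" | grep -q '^SMART support is:[[:space:]]*Available'; then
        echo disabled       # present, but needs `smartctl -s on`
    else
        echo unavailable    # no sign of SMART support in the output
    fi
}

# Usage (needs a real drive and root):
#   sudo smartctl -i /dev/sda | smart_support_state
```

A result of "disabled" is the cue to run the enable command above.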
To get a quick health status report:
sudo smartctl -H /dev/sda
which should show something like this:
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
If this shows FAILED, back up the data immediately!
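To act on that verdict from a script rather than by eye, something like the following could work. It is a rough sketch that only understands the "overall-health" line shown above; some drive types may word their health report differently, in which case it prints UNKNOWN. The name smart_health_verdict is hypothetical.

```shell
#!/bin/sh
# smart_health_verdict (hypothetical helper): reads `smartctl -H` output on
# stdin and prints PASSED, FAILED or UNKNOWN.
smart_health_verdict() {
    # Keep only the text after the "test result:" label, first match only.
    result=$(sed -n 's/^SMART overall-health self-assessment test result:[[:space:]]*//p' | head -n 1)
    case "$result" in
        PASSED*) echo PASSED ;;
        FAILED*) echo FAILED ;;   # time to back up
        *)       echo UNKNOWN ;;  # line missing or worded differently
    esac
}

# Usage (needs a real drive and root):
#   sudo smartctl -H /dev/sda | smart_health_verdict
```

Treating anything unrecognised as UNKNOWN rather than PASSED is deliberate: a script that can't read the verdict shouldn't reassure you.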
To get a full drive report:
sudo smartctl -a /dev/sda
There are two options for testing a drive – short and long. A short test typically takes around two minutes. Long tests take considerably longer – two to six hours is not uncommon – but both tests will tick away in the background and still allow you to use your machine.
To see roughly how long each test will take, run the full report:
sudo smartctl -a /dev/sda
and scroll down to a section under the line
=== START OF READ SMART DATA SECTION ===
where you'll find something like this:
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 333) minutes.
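Scrolling through the full report for two numbers gets old quickly. As a sketch, and assuming the report lays out the polling times as in the sample above (a "routine" line followed by a "recommended polling time" line), a small awk filter can pull them out. The name selftest_minutes is made up for this example.

```shell
#!/bin/sh
# selftest_minutes (hypothetical helper): reads `smartctl -a` output on stdin
# and prints the recommended polling times as "short=N" and "long=N".
selftest_minutes() {
    awk '
        /^Short self-test routine/    { kind = "short" }
        /^Extended self-test routine/ { kind = "long" }
        /recommended polling time:/ && kind != "" {
            # The minutes value sits between the parentheses, e.g. "( 333)".
            if (match($0, /\([ \t]*[0-9]+\)/)) {
                val = substr($0, RSTART + 1, RLENGTH - 2)
                gsub(/[ \t]/, "", val)
                print kind "=" val
            }
            kind = ""   # ignore polling times for any other routine
        }
    '
}

# Usage (needs a real drive and root):
#   sudo smartctl -a /dev/sda | selftest_minutes
```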
To run either test, use the -t option:
sudo smartctl -t short /dev/sda
sudo smartctl -t long /dev/sda
Running a test will give you a completion time:
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
...
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Tue Jun 21 21:33:38 2016
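If you'd rather have a script do the waiting, the "Please wait" line carries the number you need. This is only a sketch: it assumes the message is worded as in the sample above and reports the wait in minutes, and selftest_wait_minutes is a hypothetical name of mine.

```shell
#!/bin/sh
# selftest_wait_minutes (hypothetical helper): reads `smartctl -t` output on
# stdin and prints the number of minutes it asks you to wait.
# Prints nothing if no matching "Please wait N minutes" line is found.
selftest_wait_minutes() {
    sed -n 's/^Please wait \([0-9][0-9]*\) minutes for test to complete\..*$/\1/p' | head -n 1
}

# Usage (needs a real drive and root):
#   mins=$(sudo smartctl -t short /dev/sda | selftest_wait_minutes)
#   [ -n "$mins" ] && sleep "$((mins * 60))"
```

The check for an empty result matters: if the drive refuses the test, there is no wait time to read and the script should not sleep.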
To abort the test use:
sudo smartctl -X /dev/sda
You can see how the time is going with the date command:
date
To see the results of the test:
sudo smartctl -l selftest /dev/sda
or run the full report again.
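The self-test log lists the most recent test first, on a line beginning "# 1", with a status such as "Completed without error". That layout is my recollection of typical ATA drive output rather than something shown in this article, so treat the sketch below as an assumption-laden starting point. The name latest_selftest_status is hypothetical.

```shell
#!/bin/sh
# latest_selftest_status (hypothetical helper): reads `smartctl -l selftest`
# output on stdin and prints "ok", "problem" or "none" for the newest entry.
# Assumes the newest entry is the line starting "# 1" and that a clean run
# is reported as "Completed without error".
latest_selftest_status() {
    awk '
        /^# *1 / {
            found = 1
            if (index($0, "Completed without error")) ok = 1
            exit            # only the newest entry matters; jump to END
        }
        END {
            if (!found)  print "none"      # no tests logged yet
            else if (ok) print "ok"
            else         print "problem"   # anything other than a clean run
        }
    '
}

# Usage (needs a real drive and root):
#   sudo smartctl -l selftest /dev/sda | latest_selftest_status
```

Note that a test still in progress would also show up as "problem" here, so it is best run once the completion time has passed.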
If you prefer a GUI front end for SMART, install GSmartControl:
sudo apt-get install gsmartcontrol
Next time, I’ll show you how to automate drive testing using smartd.