SMARTen up!

How do you know if your hard disk is about to die?

A couple of months ago, a writer friend called me about a problem with her computer. The sort of problem that drives you nuts: an intermittent one.

Sometimes the machine would be slow to boot. V-e-r-y slow. Sometimes it would freeze while she was working, then resume. Other times it would behave perfectly normally. But the bad times were getting more frequent and she’d reached the stage where she no longer trusted the machine.

What was it? What could she do?

She’d called in a geek – the wheeled variety (Kiwis will know who I mean) – who performed some tests, did some checks, presented a bill and declared the machine was fine. Only it wasn’t.

Someone recommended “their guy” who charged in, did some stuff, uttered some techno-babble and charged out again. (As well as charging, in the other sense.)

He went away, but the problems didn’t.

So she called me.

Oh great. Two “experts” had failed. What chance did I have?

But in talking to her and her husband about the problems – something neither of my predecessors had done – I began to see a pattern in the randomness, booted the machine, hit F2, and within two minutes had the solution.

The machine was an HP. Like many “brand” computers, HPs contain a set of hardware diagnostic tools available from the boot menu. All I did was kick them off.

A typical short hard drive check takes around two minutes. And, as I’d guessed, two minutes later the diagnostics reported the hard drive was failing.


The machine was a little over a year old, still under warranty, and the faulty drive was replaced within a week.

 

Behind the scenes

Hard drives die in one of two ways. Around 40% go suddenly and without warning. The remainder suffer lingering deaths from mechanical wear and drive surface degradation, sometimes giving out warnings – like my friend’s – in the form of sluggish response and erratic performance. And, if you know where to look, you can see and even log their decline.

Behind the scenes, that HP diagnostics program ran a SMART analysis of the hard disk. SMART stands for Self-Monitoring, Analysis, and Reporting Technology, and is built into all hard disk and solid-state drives. It tries to anticipate failure by running a series of electrical and mechanical tests and recording the results. Some tests are more useful than others, but by looking at past failures and their frequency, SMART can provide you with a vital clue that a drive’s on its way out.

Some motherboards display a SMART drive status when they boot. Some don’t. Plus, there are many different types of drive and types of connection – USB, FireWire, ATA, SATA, SCSI, SSA, RAID, etc. That “low-levelness” is something operating systems like Windows struggle with. What’s more, SMART is only a “sort of” standard. Most drive manufacturers follow the basic implementation, but only some aspects are cross-compatible.

 

Linux SMARTs

As usual, Linux users have the edge here. SMART itself is built into the drive; getting at it is simply a matter of installing smartmontools:

sudo apt-get install smartmontools

This provides two utilities, smartctl and smartd: a monitoring and control program and a disk-monitoring daemon.

 

To get information about the disk and see whether it supports SMART:

sudo smartctl -i /dev/sda

where sda is the drive concerned. (Use lsblk to see what drives are attached to the machine.)
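For example (the device names and models here are made up for illustration – yours will differ):

lsblk -d -o NAME,SIZE,MODEL

NAME   SIZE MODEL
sda  931.5G ST1000DM003-1CH162
sdb   57.3G DataTraveler_3.0

The -d option skips the partitions and lists only whole drives.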

This will give you a summary of your drive. Look for the lines:

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

 

If SMART’s not enabled, enable it with:

sudo smartctl -s on /dev/sda
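You can confirm the change took by running the -i query again; a quick grep keeps the output short:

sudo smartctl -i /dev/sda | grep 'SMART support'

SMART support is: Available - device has SMART capability.
SMART support is: Enabled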

 

To get a quick health status report:

sudo smartctl -H /dev/sda

which should show something like this:

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

If this shows FAILED, back up the data immediately!

 

To get a full drive report:

sudo smartctl -a /dev/sda

There are two options for testing a drive – short and long. A short test typically takes around two minutes. Long tests take considerably longer – two to six hours is not uncommon – but both tests will tick away in the background and still allow you to use your machine.
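While you’re in the full report, it’s also worth glancing at the drive’s attribute table, which you can pull out on its own with the -A option. Attributes vary by manufacturer, but two are worth watching on almost any spinning drive – Reallocated_Sector_Ct and Current_Pending_Sector. Non-zero raw values there mean the drive is remapping bad sectors. The exact layout differs between drives, but expect something like this (the values shown are illustrative):

sudo smartctl -A /dev/sda

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0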

 

To see roughly how long each test will take, run the full report

sudo smartctl -a /dev/sda

and scroll down to a section under the line

=== START OF READ SMART DATA SECTION ===
where you'll find something like this:

Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 333) minutes.

 

To run either test, use the -t option:

sudo smartctl -t short /dev/sda
sudo smartctl -t long /dev/sda

Running a test will give you a completion time:

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
...
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Tue Jun 21 21:33:38 2016
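
If you’re impatient, you can also poll the drive while the test runs – the self-test execution status turns up in the capabilities section of the report:

sudo smartctl -c /dev/sda

Look for a line saying something like “Self-test routine in progress… 90% of test remaining”.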

 

To abort the test, use the -X option (again naming the drive):

sudo smartctl -X /dev/sda

You can see how the time is going with the date command:

date

To see the results of the test:

sudo smartctl -l selftest /dev/sda

or run the full report again.
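
The self-test log lists recent tests, newest first. A healthy drive’s log looks something like this (the LifeTime and LBA columns will obviously differ on your drive):

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      9377         -
# 2  Extended offline    Completed without error       00%      9102         -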

 

If you prefer a GUI front end for SMART, install GSmartControl:

sudo apt-get install gsmartcontrol

 

Next time, I’ll show you how to automate drive testing using smartd.
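(As a taste: smartd takes its marching orders from /etc/smartd.conf, where a single line like this sketch – illustrative, not a recommendation – schedules a short test every day at 2am and a long test every Saturday at 3am, and emails root if anything looks wrong:

/dev/sda -a -s (S/../.././02|L/../../6/03) -m root

More on that next time.)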

 

 


How to make a million in IT

Want to make a fortune in IT? Simple: sell a system to the New Zealand Police.

Whoa! you’re thinking. The police? Is he nuts?

No, seriously, those guys will buy anything. Back in 1999 they paid $100 million for “little more than a highly expensive e-mail system and a number of terminals”, and last week it was revealed that they’ve blown $56 million on a new human resources and payroll system. What’s more, it’s still not ready, and the delay is costing a further $2 million a month. (Not ready? I hear you say. An IT project not delivered on time? Does that ever happen, anywhere…?) It’s now due for completion in September, by which time it will have cost $64 million.

The INCIS project, for anyone who cares to remember it, has its own Wikipedia entry and occupies a key chapter in the book Dangerous Enthusiasms: E-government, Computer Failure and Information System Development, by Robin Gauld and Shaun Goldfinch (Otago University Press, 2006, reprinted 2012). As an Integrated National Crime Investigation System, it was going to be a world-beater, and the money flowed into it like water for five long years.

I had a former colleague splashing about in those waters back in 1998. I remember him boasting he was earning $85 an hour for reading the project’s spec. It took him months. Why? Because the original specification was 4,000 pages long. Yes, 4,000 pages. I heard there were a further 2,000 pages of amendments.

If you have a problem getting your head around documents of that size, remember that a ream of regular office paper (500 sheets) is about 5cm thick. So placed on your desk, the original spec would have measured 40cm (16 inches) high and weighed around 19kg – without covers or binding. Or the amendments.

How do projects get so ridiculously out of hand? Three words: IT project managers. It’s that simple.

I’ve spent 25 years in IT, and in all that time I’ve worked with two, maybe three, competent project managers both here and overseas, on large projects and little ones. All the other PMs I’ve encountered have been bunnies, ranging from gullible fools unable to read a Gantt chart through to the wilfully malicious, there to parrot whatever senior management wants to hear [1] while padding out their own fat contracts.

But how do you spot a BPM (Bunny Project Manager)? After all, the whole profession is based on the ability to give glib answers and make calming predictions about deliverables based on no data at all.

One simple technique is to employ what I modestly call the Palmer Protocol: just drop the name Fred Brooks into the conversation, or make passing mention of mythical man-months. If you’re met with the question “Who does he play for?” or a straightforward bewildered look, run for it!

More than 40 years ago, Fred Brooks [2] wrote the book on IT project management. Literally. His slim volume, The Mythical Man-Month: Essays on Software Engineering, was first published in 1975 and is still in print. In it, Brooks reflected on his experiences working on a Really Big Project – the programming behind IBM’s OS/360 mainframe operating system – and all the mistakes he observed.

In The Mythical Man-Month, he describes what’s become known as Brooks’ Law, the somewhat counter-intuitive “Adding manpower to a late software project makes it later.” The reason is that complicated programming projects can’t be broken down into discrete units and farmed out to individuals, but most BPMs still operate on the basis that they can.

The rationale goes like this. If one man can cut an acre of wheat in a day, then ten men can clear a ten-acre field in a day. And they can. But cutting wheat isn’t cutting code. Programmers need to communicate with each other. Adding new staff – no matter how competent – means that your existing, already over-taxed staff have to take time out to bring them up to speed and then keep everyone updated with each other’s progress. There’s even a simple equation for calculating the effect, called the group intercommunication formula, where n is the number of staff:

n(n − 1) / 2

That means that a ten-person team will have 45 channels of communication [10 x (10 − 1) / 2], while a twenty-person team will have 190. Thirty people will have 435 channels; 40 people, 780; 50 people, 1,225; and so on.
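If you’d like to check the arithmetic yourself, a quick shell one-liner does it (plain bash arithmetic, nothing to install):

for n in 10 20 30 40 50; do echo "$n people: $(( n * (n - 1) / 2 )) channels"; done

10 people: 45 channels
20 people: 190 channels
30 people: 435 channels
40 people: 780 channels
50 people: 1225 channels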

The book is filled with other observations, still timely 40+ years on. There’s the Second-System Effect – “the tendency of small, elegant, and successful systems to have elephantine, feature-laden monstrosities as their successors due to inflated expectations.” There’s the tendency towards an irreducible number of errors. (All complex systems have bugs, but after a certain point, squashing them simply results in more bugs.)

[Cartoon: All software has a tendency towards an irreducible number of errors.]

Then there’s progress tracking. “How,” Brooks asks, “does a large software project get to be one year late? Answer: One day at a time!” Incremental slippages accumulate, eventually producing a large overall delay. It’s the project manager’s job to set small individual milestones and ensure they’re met.

At the end of the book, Brooks makes a couple of suggestions for lowering software development costs. Both seem obvious, but as we all know, common sense isn’t common, particularly amongst BPMs.

The first says don’t hire implementers until the architecture is complete. (My spec-reading former colleague was a programmer, but the bit he was supposed to be working on wasn’t ready.)

The second – and here one wonders how uniquely complex the NZ Police’s payroll and human resources needs really are – is even more obvious. Don’t develop software at all; just buy it “off the shelf” wherever possible. At $64 million (and it will almost certainly come in higher than that), the NZ Police could have bought the whole damn store.

 


 

 

[1] One report into the INCIS system’s prospects reckoned that Police would save $517 million over 8 years in “efficiency benefits” – whatever the hell they are.

[2] Fred Brooks’ book has been called the Bible of Software Engineering because, he reckons, “Everybody quotes it, some people read it, and a few people go by it”.

 
