
How to make sure a hard disk is dying



thomasramapuram
14th October 2007, 05:17 PM
Hi,
I have an Acer laptop which has problems with ACPI and APIC, so my startup script passes acpi=off and noapic. Recently, while I was downloading a file, the computer switched to the screen saver and I got an "input/output error" message. Now every time I try to copy or read the file, I get the error posted below. How do I make sure that the hard disk really has a problem, such as bad sectors, and that it is not something that can be fixed? I tried fsck, but running it on a logical volume is quite a problem.
Any suggestions would be greatly appreciated.
Thanks in advance,
Thomas.

Error Log:
Oct 14 17:22:27 localhost kernel: sd 0:0:0:0: [sda] 234441648 512-byte hardware sectors (120034 MB)
Oct 14 17:22:27 localhost kernel: sd 0:0:0:0: [sda] Write Protect is off
Oct 14 17:22:27 localhost kernel: sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Oct 14 17:22:27 localhost kernel: sd 0:0:0:0: [sda] 234441648 512-byte hardware sectors (120034 MB)
Oct 14 17:22:27 localhost kernel: sd 0:0:0:0: [sda] Write Protect is off
Oct 14 17:22:27 localhost kernel: sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Oct 14 17:22:27 localhost kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Oct 14 17:22:27 localhost kernel: ata1.00: (BMDMA2 stat 0x686d0009)
Oct 14 17:22:27 localhost kernel: ata1.00: cmd c8/00:08:69:41:b8/00:00:00:00:00/e7 tag 0 cdb 0x0 data 4096 in
Oct 14 17:22:27 localhost kernel: res 51/40:03:6e:41:b8/00:00:00:00:00/e7 Emask 0x9 (media error)
Oct 14 17:22:27 localhost kernel: ata1.00: configured for UDMA/100
Oct 14 17:22:27 localhost kernel: ata1: EH complete
Oct 14 17:22:27 localhost kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Oct 14 17:22:27 localhost kernel: ata1.00: (BMDMA2 stat 0x686d0009)
Oct 14 17:22:27 localhost kernel: ata1.00: cmd c8/00:08:69:41:b8/00:00:00:00:00/e7 tag 0 cdb 0x0 data 4096 in
Oct 14 17:22:27 localhost kernel: res 51/40:03:6e:41:b8/00:00:00:00:00/e7 Emask 0x9 (media error)
Oct 14 17:22:27 localhost kernel: ata1.00: configured for UDMA/100
Oct 14 17:22:27 localhost kernel: ata1: EH complete
Oct 14 17:22:27 localhost kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Oct 14 17:22:27 localhost kernel: ata1.00: (BMDMA2 stat 0x686d0009)
Oct 14 17:22:27 localhost kernel: ata1.00: cmd c8/00:08:69:41:b8/00:00:00:00:00/e7 tag 0 cdb 0x0 data 4096 in
Oct 14 17:22:27 localhost kernel: res 51/40:03:6e:41:b8/00:00:00:00:00/e7 Emask 0x9 (media error)
Oct 14 17:22:27 localhost kernel: ata1.00: configured for UDMA/100
Oct 14 17:22:27 localhost kernel: ata1: EH complete
Oct 14 17:22:27 localhost kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Oct 14 17:22:27 localhost kernel: ata1.00: (BMDMA2 stat 0x686d0009)
Oct 14 17:22:27 localhost kernel: ata1.00: cmd c8/00:08:69:41:b8/00:00:00:00:00/e7 tag 0 cdb 0x0 data 4096 in
Oct 14 17:22:27 localhost kernel: res 51/40:03:6e:41:b8/00:00:00:00:00/e7 Emask 0x9 (media error)
Oct 14 17:22:27 localhost kernel: ata1.00: configured for UDMA/100
Oct 14 17:22:27 localhost kernel: ata1: EH complete
Oct 14 17:22:27 localhost kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Oct 14 17:22:27 localhost kernel: ata1.00: (BMDMA2 stat 0x686d0009)
Oct 14 17:22:27 localhost kernel: ata1.00: cmd c8/00:08:69:41:b8/00:00:00:00:00/e7 tag 0 cdb 0x0 data 4096 in
Oct 14 17:22:27 localhost kernel: res 51/40:03:6e:41:b8/00:00:00:00:00/e7 Emask 0x9 (media error)
Oct 14 17:22:27 localhost kernel: ata1.00: configured for UDMA/100
Oct 14 17:22:27 localhost kernel: ata1: EH complete
Oct 14 17:22:27 localhost kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Oct 14 17:22:27 localhost kernel: ata1.00: (BMDMA2 stat 0x686d0009)
Oct 14 17:22:52 localhost kernel: ata1.00: cmd c8/00:08:69:41:b8/00:00:00:00:00/e7 tag 0 cdb 0x0 data 4096 in
Oct 14 17:22:52 localhost kernel: res 51/40:03:6e:41:b8/00:00:00:00:00/e7 Emask 0x9 (media error)
Oct 14 17:22:52 localhost kernel: ata1.00: configured for UDMA/100
Oct 14 17:22:52 localhost kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Oct 14 17:22:52 localhost kernel: sd 0:0:0:0: [sda] Sense Key : Medium Error [current] [descriptor]
Oct 14 17:22:52 localhost kernel: Descriptor sense data with sense descriptors (in hex):
Oct 14 17:22:52 localhost kernel: 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Oct 14 17:22:52 localhost kernel: 07 b8 41 6e
Oct 14 17:22:52 localhost kernel: sd 0:0:0:0: [sda] Add. Sense: Unrecovered read error - auto reallocate failed
Oct 14 17:22:52 localhost kernel: end_request: I/O error, dev sda, sector 129515886
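The cmd/res pairs in the log can be decoded by hand. Below is a minimal Python sketch, assuming the field order libata uses when it prints these dumps: command (or status), feature (or error), sector count, the three low LBA bytes, five LBA48 "hob" bytes, then the device register. It only handles 28-bit commands such as READ DMA (c8), which is what failed here. Under that assumption the failing address comes out as 129515886, the same sector reported by end_request, five sectors into an 8-sector read that started at 129515881. Error 0x40 is the UNC (uncorrectable data) bit, which fits the "Medium Error" sense key.

```python
# Sketch: decode a libata register dump such as "51/40:03:6e:41:b8/00:00:00:00:00/e7".
# Assumed layout: first/second:nsect:lbal:lbam:lbah/<five LBA48 "hob" bytes>/device.
# For a "cmd" dump, first = command and second = feature; for "res", status and error.


def decode_taskfile(dump: str) -> dict:
    first, low, _hob, device = dump.split("/")
    second, nsect, lbal, lbam, lbah = (int(x, 16) for x in low.split(":"))
    dev = int(device, 16)
    # READ DMA (0xc8) is a 28-bit LBA command: LBA bits 27..24 sit in the
    # low nibble of the device register, the rest in lbah/lbam/lbal.
    lba = ((dev & 0x0F) << 24) | (lbah << 16) | (lbam << 8) | lbal
    return {"first": int(first, 16), "second": second, "nsect": nsect, "lba": lba}


cmd = decode_taskfile("c8/00:08:69:41:b8/00:00:00:00:00/e7")
res = decode_taskfile("51/40:03:6e:41:b8/00:00:00:00:00/e7")

print(f"command 0x{cmd['first']:02x}: read {cmd['nsect']} sectors from LBA {cmd['lba']}")
print(f"status 0x{res['first']:02x}, error 0x{res['second']:02x}, failing LBA {res['lba']}")
print(f"byte offset on the disk: {res['lba'] * 512}")
```

Multiplying by 512 gives the byte offset (66312133632), which is handy when working out which partition or volume holds the bad spot.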

Dan
14th October 2007, 05:26 PM
Morning Thomas.


Oct 14 17:22:52 localhost kernel: sd 0:0:0:0: [sda] Add. Sense: Unrecovered read error - auto reallocate failed
Oct 14 17:22:52 localhost kernel: end_request: I/O error, dev sda, sector 129515886

Purely a pragmatic response here. Order a new drive! Read/Write and seek errors on a hard drive can be ignored only at your own peril. Although some work-arounds may temporarily work ... hard drives do not heal.


Dan

thomasramapuram
14th October 2007, 05:30 PM
Is there any utility that will confirm my belief, like ScanDisk or something?
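For a quick, non-destructive check that the fault is on the disk surface rather than in the filesystem, you can try reading the raw sectors around the failing address. A minimal Python sketch, assuming Linux, root access and the 512-byte sectors reported in the log; unreadable_sectors is a hypothetical helper, not an existing tool. Reads go through the page cache, so a bad sector may also make its neighbours in the same 4 KiB page fail, and each failed read can be slow because the kernel retries (the log above shows about 25 seconds of retries for one request).

```python
import errno
import os

SECTOR = 512  # the kernel log reports 512-byte hardware sectors


def unreadable_sectors(path: str, first: int, count: int) -> list:
    """Read each sector in [first, first + count) and return those that fail with EIO.

    Read-only: nothing is ever written to the device.
    """
    bad = []
    fd = os.open(path, os.O_RDONLY)
    try:
        for lba in range(first, first + count):
            try:
                os.pread(fd, SECTOR, lba * SECTOR)
            except OSError as exc:
                if exc.errno != errno.EIO:
                    raise  # permission problems etc. are not media errors
                bad.append(lba)
    finally:
        os.close(fd)
    return bad
```

Run as root, for example print(unreadable_sectors("/dev/sda", 129515880, 16)); an empty list means every sector in that range read back cleanly this time.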

bob
14th October 2007, 05:40 PM
Every drive manufacturer that I know of has a tool on their site to check the drive. Keep in mind that most have a 3-yr warranty, so don't just dump it.

leigh123linux
14th October 2007, 05:51 PM
Try smartctl and see if anything shows up



[root@localhost Desktop]# smartctl --help
smartctl version 5.37 [x86_64-redhat-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Usage: smartctl [options] device

============================================ SHOW INFORMATION OPTIONS =====

-h, --help, --usage
Display this help and exit

-V, --version, --copyright, --license
Print license, copyright, and version information and exit

-i, --info
Show identity information for device

-a, --all
Show all SMART information for device

================================== SMARTCTL RUN-TIME BEHAVIOR OPTIONS =====

-q TYPE, --quietmode=TYPE (ATA)
Set smartctl quiet mode to one of: errorsonly, silent

-d TYPE, --device=TYPE
Specify device type to one of: ata, scsi, marvell, sat, 3ware,N

-T TYPE, --tolerance=TYPE (ATA)
Tolerance: normal, conservative, permissive, verypermissive

-b TYPE, --badsum=TYPE (ATA)
Set action on bad checksum to one of: warn, exit, ignore

-r TYPE, --report=TYPE
Report transactions (see man page)

-n MODE, --nocheck=MODE (ATA)
No check if: never, sleep, standby, idle (see man page)

============================== DEVICE FEATURE ENABLE/DISABLE COMMANDS =====

-s VALUE, --smart=VALUE
Enable/disable SMART on device (on/off)

-o VALUE, --offlineauto=VALUE (ATA)
Enable/disable automatic offline testing on device (on/off)

-S VALUE, --saveauto=VALUE (ATA)
Enable/disable Attribute autosave on device (on/off)

======================================= READ AND DISPLAY DATA OPTIONS =====

-H, --health
Show device SMART health status

-c, --capabilities (ATA)
Show device SMART capabilities

-A, --attributes
Show device SMART vendor-specific Attributes and values

-l TYPE, --log=TYPE
Show device log. TYPE: error, selftest, selective, directory,
background

-v N,OPTION , --vendorattribute=N,OPTION (ATA)
Set display OPTION for vendor Attribute N (see man page)

-F TYPE, --firmwarebug=TYPE (ATA)
Use firmware bug workaround: none, samsung, samsung2

-P TYPE, --presets=TYPE (ATA)
Drive-specific presets: use, ignore, show, showall

============================================ DEVICE SELF-TEST OPTIONS =====

-t TEST, --test=TEST
Run test. TEST is: offline short long conveyance select,M-N pending,N afterselect,on afterselect,off

-C, --captive
Do test in captive mode (along with -t)

-X, --abort
Abort any non-captive test on device

=================================================== SMARTCTL EXAMPLES =====

smartctl --all /dev/hda (Prints all SMART information)

smartctl --smart=on --offlineauto=on --saveauto=on /dev/hda
(Enables SMART on first disk)

smartctl --test=long /dev/hda (Executes extended disk self-test)

smartctl --attributes --log=selftest --quietmode=errorsonly /dev/hda
(Prints Self-Test & Attribute errors)
smartctl --all --device=3ware,2 /dev/sda
smartctl --all --device=3ware,2 /dev/twe0
smartctl --all --device=3ware,2 /dev/twa0
(Prints all SMART info for 3rd ATA disk on 3ware RAID controller)
smartctl --all --device=hpt,1/1/3 /dev/sda
(Prints all SMART info for the SATA disk attached to the 3rd PMPort
of the 1st channel on the 1st HighPoint RAID controller)
[root@localhost Desktop]#
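The attributes that matter most for this kind of failure are the reallocated, pending and offline-uncorrectable sector counts. A minimal Python sketch that runs smartctl -A and pulls out their raw values, assuming the column layout smartctl prints for ATA drives (RAW_VALUE is the tenth column); a non-zero pending count would be consistent with the unreadable sector in the kernel log. Running smartctl -t long from the help above and then smartctl -l selftest would also show whether the drive's own surface scan trips over the same LBA.

```python
import subprocess

# Attribute IDs that most directly track bad sectors (names as smartctl prints them).
WATCH = {5: "Reallocated_Sector_Ct", 197: "Current_Pending_Sector", 198: "Offline_Uncorrectable"}


def parse_attributes(text: str) -> dict:
    """Return {attribute id: raw value} for the WATCH attributes in `smartctl -A` output."""
    found = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID; RAW_VALUE is the 10th column.
        if len(fields) >= 10 and fields[0].isdigit():
            attr_id = int(fields[0])
            if attr_id in WATCH and fields[9].isdigit():
                found[attr_id] = int(fields[9])
    return found


def check(device: str = "/dev/sda") -> None:
    # Needs root. smartctl's exit status is a bit mask, so a non-zero code is not fatal here.
    out = subprocess.run(["smartctl", "-A", device], capture_output=True, text=True).stdout
    for attr_id, raw in sorted(parse_attributes(out).items()):
        print(f"{attr_id:3d} {WATCH[attr_id]:<24} raw={raw}")
```

Call check("/dev/sda") as root; attributes a particular drive does not report are simply left out.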

Dies
14th October 2007, 06:22 PM
I'm with Tangled on this one, you should definitely get whatever you need off of it in a hurry.

joe.pelayo
14th October 2007, 07:31 PM
My Acer lappy is currently using its 'stock' damaged hard drive. It came with a Hitachi drive, and after I installed Linux in a dual-boot configuration it turned out to have some bad sectors. At first there were random instabilities and lock-ups, and finally Linux reported the problem (although the damage was in a Windows partition, Windows itself never noticed it).

Then I deleted the damaged Windows partition and everything was back to normal...even better, I started to truly enjoy my machine.

The best solution is to get a new drive altogether, but if you cannot (or can, but want to keep using this one), simply stop using the part of the disk where the error was discovered, if you can figure out which part that is. In my case it was quite easy: it just meant disabling one partition.
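To take a damaged area out of use, you first need to know which partition, and which filesystem block, a failing sector belongs to. A minimal Python sketch of the usual arithmetic, block = (LBA - partition start) x 512 / block size, assuming a plain partition with a 4 KiB-block filesystem. The partition layout below is purely hypothetical; substitute the start sectors from fdisk -lu /dev/sda. On LVM, as in Thomas's setup, the logical volume's own start inside the partition (the offset of a linear mapping, as shown by dmsetup table) must also be subtracted, which the hypothetical offset parameter allows for; volumes made of several segments are not handled. On ext2/ext3 the resulting block number can then be given to debugfs (icheck finds the inode, ncheck the file name).

```python
from typing import NamedTuple, Optional

SECTOR = 512


class Partition(NamedTuple):
    name: str
    start: int    # first sector, as listed by `fdisk -lu`
    sectors: int  # length in sectors


def locate(lba: int, parts: list, block_size: int = 4096,
           offset: int = 0) -> Optional[tuple]:
    """Return (partition name, filesystem block) containing `lba`, or None.

    `offset` is a hypothetical extra start offset in sectors inside the
    partition (e.g. a linear LVM mapping's start); 0 for a plain partition.
    """
    for p in parts:
        if p.start <= lba < p.start + p.sectors:
            rel = lba - p.start - offset  # sector index inside the filesystem
            if rel < 0:
                return None  # falls before the filesystem starts
            return p.name, rel * SECTOR // block_size
    return None


# Hypothetical layout for a 120 GB disk, not Thomas's real partition table.
layout = [Partition("sda1", 63, 208782), Partition("sda2", 208845, 234227340)]
print(locate(129515886, layout))
```

With this made-up layout the bad sector lands in sda2; the real answer depends entirely on the actual partition table and, on LVM, the volume mapping.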

As previously mentioned, my lappy is still using its stock, damaged HD, and although it now has less capacity, it has proved rock solid so far. Cautious as I am, I already have a spare drive just in case, and all my sensitive data is safely backed up. In fact, the machine holds no data apart from the OSes themselves and my music collection; everything else I need goes on a USB flash stick.

Good luck.
Joe.

thomasramapuram
15th October 2007, 06:09 AM
SMART Error Log Version: 1
ATA Error Count: 346 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 346 occurred at disk power-on lifetime: 988 hours (41 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 03 6e 41 b8 e7 Error: UNC 3 sectors at LBA = 0x07b8416e = 129515886

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 69 41 b8 e7 00 02:14:34.900 READ DMA
27 00 00 00 00 00 e0 00 02:14:34.900 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 02 02:14:34.900 IDENTIFY DEVICE
ef 03 45 00 00 00 a0 02 02:14:34.900 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 00 02:14:34.900 READ NATIVE MAX ADDRESS EXT

Error 345 occurred at disk power-on lifetime: 988 hours (41 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 03 6e 41 b8 e7 Error: UNC 3 sectors at LBA = 0x07b8416e = 129515886

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 69 41 b8 e7 00 02:14:30.800 READ DMA
27 00 00 00 00 00 e0 00 02:14:30.800 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 02 02:14:30.800 IDENTIFY DEVICE
ef 03 45 00 00 00 a0 02 02:14:30.800 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 00 02:14:30.800 READ NATIVE MAX ADDRESS EXT

Error 344 occurred at disk power-on lifetime: 988 hours (41 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 03 6e 41 b8 e7 Error: UNC 3 sectors at LBA = 0x07b8416e = 129515886

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 69 41 b8 e7 00 02:14:26.800 READ DMA
27 00 00 00 00 00 e0 00 02:14:26.800 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 02 02:14:26.800 IDENTIFY DEVICE
ef 03 45 00 00 00 a0 02 02:14:26.800 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 00 02:14:26.800 READ NATIVE MAX ADDRESS EXT

Error 343 occurred at disk power-on lifetime: 988 hours (41 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 03 6e 41 b8 e7 Error: UNC 3 sectors at LBA = 0x07b8416e = 129515886

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 69 41 b8 e7 00 02:14:22.700 READ DMA
27 00 00 00 00 00 e0 00 02:14:22.700 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 02 02:14:22.700 IDENTIFY DEVICE
ef 03 45 00 00 00 a0 02 02:14:22.700 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 00 02:14:22.700 READ NATIVE MAX ADDRESS EXT

Error 342 occurred at disk power-on lifetime: 988 hours (41 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 03 6e 41 b8 e7 Error: UNC 3 sectors at LBA = 0x07b8416e = 129515886

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 69 41 b8 e7 00 02:14:18.600 READ DMA
27 00 00 00 00 00 e0 00 02:14:18.600 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 02 02:14:18.600 IDENTIFY DEVICE
ef 03 45 00 00 00 a0 02 02:14:18.600 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 00 02:14:18.600 READ NATIVE MAX ADDRESS EXT

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay
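All five entries the drive kept (errors 342 to 346 out of 346) point at the same address, LBA 0x07b8416e = 129515886, roughly four seconds apart, which looks like the same read being retried rather than five separate bad spots; the earlier 341 errors are no longer in the log, so nothing can be said about them. A small Python sketch for tallying the addresses in a saved smartctl -l error report, assuming smartctl's "at LBA = 0x... = ..." wording.

```python
import re
from collections import Counter

# Matches the tail of smartctl error-log lines such as
# "40 51 03 6e 41 b8 e7 Error: UNC 3 sectors at LBA = 0x07b8416e = 129515886"
LBA_RE = re.compile(r"at LBA = 0x([0-9a-fA-F]+) = (\d+)")


def failing_lbas(log_text: str) -> Counter:
    """Count how often each LBA appears in `smartctl -l error` output."""
    counts = Counter()
    for m in LBA_RE.finditer(log_text):
        hex_lba, dec_lba = int(m.group(1), 16), int(m.group(2))
        # smartctl prints the address in hex and decimal; skip lines where they disagree.
        if hex_lba == dec_lba:
            counts[dec_lba] += 1
    return counts
```

For example, print(failing_lbas(open("smartctl-errors.txt").read())), where smartctl-errors.txt is a hypothetical file holding the saved report.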

leigh123linux
15th October 2007, 06:15 AM
This bit doesn't look good :(



40 51 03 6e 41 b8 e7 Error: UNC 3 sectors at LBA = 0x07b8416e = 129515886

DamianS
15th October 2007, 04:02 PM
It sounds like some sectors are unreadable.
Assuming the HD has enough spare sectors, those can be remapped, and the HD should run as good as new from then on.
The absolute best tool for this is SpinRite at www.grc.com.

It saved two 80 GB HDs for me that had been giving me SMART warnings of impending failure every 10 minutes or so. I tried all the smartctl tests, but they all said the drive was working fine.
I put up with this nonsense for 6 months or so before finally cracking and stumping up the cash for SpinRite. Each drive took about 8 hours to do, but since then I haven't had another SMART message.

If you are still running Windows on the laptop, download Speedfan and check its SMART HD diagnostics. It will give you an online evaluation of your drive, and let you know if the problems can be fixed by remapping sectors.

sailor
16th October 2007, 12:30 AM
I was gonna suggest SpinRite (http://www.grc.com) too... I didn't realize it was $89, but I have heard that it is a good tool for saving a drive.
If you are not concerned about the data on the drive...probably not worth it.

thomasramapuram
16th October 2007, 04:56 AM
Guess I will chuck the drive and get a new one. All the data I need can be retrieved.
Thanks everybody.