FC5 freeze + fsck failed


dr400
26th September 2007, 12:35 PM
Hi everybody,

First, as you can guess, I'm quite the newbie when it comes to Linux.
I'm running FC5 on my PC, with two HDs:
- one small, with several partitions containing everything related to my OSs (currently FC5 and WinXP)
- one bigger for the data, with one partition for WinXP and one ext3 partition mounted as /home for my Linux.

Recently FC5 underwent a total freeze while the only program running was Firefox (downloading) :
- gnome GUI utterly frozen (no keyboard, no mouse, etc)
- tried to kill mozilla with a non-graphic shell (as root) -> no result.
- tried to 'shutdown now' (as root) -> no result
I felt all that was left for me to do was hard reboot (wrong :confused: ).

Unfortunately (but understandably) during linux initialization, "/home corrupted file system" was found. It dropped me down to a shell (as root) telling me to run fsck... which I did ("fsck -v /dev/sdb2" in my case).
- fsck.e2fsck version 1.38
- fsck started asking some stuff about fixing corrupted and/or orphan inodes, I answered 'yes' every time.
- after a few fixes I got a whole screen covered with strange output, which I can only describe as "insults" (now that's professional :D ). Things like "irq sequences..." "unknown boot...".
- all activity ceased afterwards (no disk access, no going back to the shell, nothing else happens for 10 min).
- also one 'funny' thing was that all the LEDs on my keyboard (Caps Lock, etc.) were flashing in this state. I found another post describing this behaviour but with no explanation:
http://www.fedoraforum.org/forum/showthread.php?t=165035&highlight=fsck
Does anyone know more about this?

Not knowing what to do I hard-rebooted once again. And got to fsck, which worked this time around.
My PC worked fine this weekend, then yesterday the same problem happened again (mozilla freeze -> fsck failed)... only this time fsck won't miraculously succeed anymore :( It still gives me this strange output and the flashing keyboard.

I'm beginning to suspect a virus or, more likely, a dying hard disk... I didn't have much time yesterday, but this evening I'm thinking about checking /var/log/messages and using the rescue disk.
I'm open to any good suggestion (keep in mind I'm a noob, so please forgive me if I missed some obvious thing).

thanks.

David Becker
26th September 2007, 12:51 PM
You might want to check for 'Drive...' entries in /var/log/messages. Chances are something is (physically) going wrong with your harddrive.

I had a similar situation in a system with 3 hard drives. Apparently, the power supply was insufficient for the power demands. At night, one of the hard drives would spin down due to insufficient power. Once it had spun down, it would consume less power, and with the extra power then available it would spin back up, only to spin down again later; the cycle would repeat itself ad infinitum/nauseam. Eventually the disk was unrecoverable.

Thus, check the capacity of your system's power supply. It also seems time to use the 'smartctl' utilities. Something like 'smartctl -a /dev/sda' (replace /dev/sda with /dev/hda for fc5) and look for attributes like 'Raw_Read_Error_Rate' and 'Seek_Error_Rate'.

David
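
A minimal sketch of the attribute check suggested above, assuming smartctl prints its usual attribute table (name in column 2, raw value in column 10). The function name smart_raw_values and the device /dev/sdb are only illustrative.

```shell
# Print "NAME RAW_VALUE" for the two attributes named above.
# Reads `smartctl -A`-style text on stdin. Assumption: the standard
# 10-column attribute table (column 2 = name, column 10 = raw value).
smart_raw_values() {
    awk '$2 == "Raw_Read_Error_Rate" || $2 == "Seek_Error_Rate" { print $2, $10 }'
}

# Example call (run as root; the device name is an assumption):
#   smartctl -A /dev/sdb | smart_raw_values
```

How to read the raw values is vendor-specific, so it is probably more useful to compare them over time than against a fixed number.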

dr400
26th September 2007, 07:04 PM
Hi,

first thx for your answer.
I checked /var/log/messages and found no Drive error.
I tried smartctl; it recognized my HDD all right (option -a) and found the device was working properly (option --health -> ok). Except it says the device doesn't support auto-save when I try to read the error logs (I can't enable auto-save either).

So, where am I now ?
- checked connections / power supply : everything looks OK. No scratching noise from the HDD like it did with the last dead HDD I encountered :))
- smartctl seems to find everything's OK.
- fsck still freezes after some fixes. I got a snapshot of this, I'll try to post it.
- my keyboard still flashes funkily while fsck is frozen...

I still need to check the Samsung site for an HDD test program, and I'll also check the BIOS, just to be sure... After that, rescue mode?

Any other idea anyone ?

thanks

David Becker
27th September 2007, 07:50 AM
Hi,

first thx for your answer.
I checked /var/log/messages and found no Drive error.


Is there something that suggests a kernel panic?


I tried smartctl; it recognized my HDD all right (option -a) and found the device was working properly (option --health -> ok). Except it says the device doesn't support auto-save when I try to read the error logs (I can't enable auto-save either).

That's fine. auto-save is apparently a feature your drive doesn't support.


So, where am I now ?
- checked connections / power supply : everything looks OK. No scratching noise from the HDD like it did with the last dead HDD I encountered :))

Any rating on the power supply's capacity? 250W? 450W?


- smartctl seems to find everything's OK.
- fsck still freezes after some fixes. I got a snapshot of this, I'll try to post it.
- my keyboard still flashes funkily while fsck is frozen...

The keyboard LEDs flashing is normally an indicator of a kernel panic. If the disk doesn't seem to be faulty, then it could be a faulty memory module. I assume you haven't overclocked your system or anything like that. Normally, I'd try a kernel compile and see if it gets through without any segmentation violations. If there are segfaults, then I return my memory modules to the shop.

I believe the Fedora CDs/DVDs include a memtest86 utility? You could try that, although it doesn't tax the rest of your system while testing the memory. Checking the BIOS may reveal motherboard and CPU temperature and CPU fan speed. You could also try 'lm_sensors' from Linux.

Good luck,

David
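
A rough sketch of how one might look for kernel panic hints in the log, as asked above. The function name panic_hints and the search patterns are assumptions for illustration, not an exhaustive list of what a panic can look like.

```shell
# List log lines (with line numbers) that hint at a kernel panic or
# memory trouble.
# $1: log file to scan (defaults to /var/log/messages).
# The patterns below are an illustrative assumption, not a complete list.
panic_hints() {
    grep -n -i -E 'kernel panic|Oops:|machine check|segfault' "${1:-/var/log/messages}"
}

# Example call (run as root if the log is not world-readable):
#   panic_hints /var/log/messages
```

A panic that freezes the machine may never reach the log on disk, so an empty result does not rule one out.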

dr400
27th September 2007, 08:26 AM
Is there something that suggests a kernel panic?

Actually, I usually run my PC with hyper-threading enabled + the SMP kernel (never really considered this issue before). Yesterday I tried disabling hyper-threading and booting the non-SMP kernel: at the end of the failure screen for fsck 2, more lines appeared showing a kernel panic (interruption failure).
Of course, as I didn't try this when my computer was still working, it doesn't lead me anywhere :( , but that's all I ever saw about a kernel panic on my installation ;)

Any rating on the power supply's capacity? 250W? 450W?

Fortron FSP400-60GLN - 400W , for 1 mother board + 2HDD + 2DVD + 1Floppy...

The keyboard LEDs flashing is normally an indicator of a kernel panic.

Ah, thanks. At least I will have learned something :)

Normally, I'd try a kernel compile and see if it gets through without any segmentation violations.
I believe the Fedora CDs/DVDs include a memtest86 utility? You could try that, although it doesn't tax the rest of your system while testing the memory.

All right, I'll investigate in the direction of a faulty memory module. Maybe I could also try different settings / ports for the memory modules.
I have never compiled the kernel before, so now is a good time to learn how (but can I do this in runlevel 1 or a rescue shell?). The thing is, it'll be hard to know whether a failure comes from my own mistakes or from faulty memory :D

Checking the BIOS may reveal motherboard and CPU temperature and CPU fan speed. You could also try 'lm_sensors' from Linux.

I already checked this when I looked through the BIOS settings yesterday. Motherboard 36 deg, CPU 45 deg, fan 1200~1600 rpm. That didn't seem wrong to me...

Good luck,

David

thanks again for your help.

David Becker
27th September 2007, 10:23 AM
Actually, I usually run my PC with hyper-threading enabled + the SMP kernel (never really considered this issue before). Yesterday I tried disabling hyper-threading and booting the non-SMP kernel: at the end of the failure screen for fsck 2, more lines appeared showing a kernel panic (interruption failure).
Of course, as I didn't try this when my computer was still working, it doesn't lead me anywhere :( , but that's all I ever saw about a kernel panic on my installation ;)

I assume you're back to the hyperthread+smp kernel?



Fortron FSP400-60GLN - 400W , for 1 mother board + 2HDD + 2DVD + 1Floppy...

Seems sufficient.


Ah, thanks. At least I will have learned something :)

All right, I'll investigate in the direction of a faulty memory module. Maybe I could also try different settings / ports for the memory modules.
I have never compiled the kernel before, so now is a good time to learn how (but can I do this in runlevel 1 or a rescue shell?). The thing is, it'll be hard to know whether a failure comes from my own mistakes or from faulty memory :D

I'd first try memtest. I think it's an option when you boot the Fedora install/rescue CD/DVD.


I already checked this when I looked through the BIOS settings yesterday. Motherboard 36 deg, CPU 45 deg, fan 1200~1600 rpm. That didn't seem wrong to me...

Seems fine.

BTW, did you run smartctl on the faulty drive? 'smartctl -a /dev/sda' tests your first drive, 'smartctl -a /dev/sdb' tests the 2nd (apparently faulty) drive. You should get some attributes, such as aforementioned 'Raw_Read_Error_Rate' etc.

David

dr400
27th September 2007, 11:57 AM
I assume you're back to the hyperthread+smp kernel?

yup. This configuration worked for more than a year, and I never ran the not-smp kernel. So I think it would be best to keep with the smp, which at least I know worked at some time.
My only worry was that maybe running fsck from the rescue shell was somewhat incompatible with hyperthread-smp :confused:

I'd first try memtest. I think it's an option when you boot the Fedora install/rescue CD/DVD.

I found a version for a bootable floppy disk, I think I'll try this. It should allow testing with all peripherals unplugged (except the floppy drive) to narrow the error down to the memory modules. As I have 2x512 MB modules, I'll also try different configs.
I can't wait to do all this fun testing tonight :D

BTW, did you run smartctl on the faulty drive? 'smartctl -a /dev/sda' tests your first drive, 'smartctl -a /dev/sdb' tests the 2nd (apparently faulty) drive. You should get some attributes, such as aforementioned 'Raw_Read_Error_Rate' etc.

I ran smartctl on the drive.
- it gave me a perfect ID for the drive (manufacturer, model, etc...)
- it told me the drive was "healthy" (smartctl -H, if I remember correctly).
- it told me the drive doesn't support error logging, tried to activate it with 'smartctl -S on' as it suggested, but failed (not supported).
- when I tried to ask for '--attributes' , I think it said the same as above (some 'feature not supported' message). I'll check this to be sure tonight as I don't remember whether I tried this option alone, or with '--log'.

the thing is, apart from file system being corrupted (which can be explained by hard-reboot during download) and fsck crashes (which I can't explain as of now), I can't seem to find anything wrong with my HDD (not hot, no noise, no problem with smartctl, no "drive ..." in /var/log/messages).
So I guess after a detailed look into BIOS settings (which didn't change for the last year), and the memory mods tests, I'll :
- try to format the HDD and see if the problem is (permanently) solved. But then I'll lose all my (not-so-precious-after-all) data. (Let's be realistic: after crashing fsck on this drive a dozen times, I don't think I'll get my data back anyway :rolleyes: )
- or buy a new one... A good way to justify buying a new HDD twice as big as the previous one :p

dr400

David Becker
27th September 2007, 03:41 PM
yup. This configuration worked for more than a year, and I never ran the not-smp kernel. So I think it would be best to keep with the smp, which at least I know worked at some time.
My only worry was that maybe running fsck from the rescue shell was somewhat incompatible with hyperthread-smp :confused:



I found a version for a bootable floppy disk,

This floppy won't contain the memtest (iirc). It's just to boot a system that can't boot from HD, CD/DVD or network. You'll still be prompted for the rescue/install CD/DVD, so if you can boot from CD/DVD, then you might as well skip the floppy.


I think I'll try this. It should allow testing with all peripherals unplugged (except the floppy drive) to narrow the error down to the memory modules. As I have 2x512 MB modules, I'll also try different configs.
I can't wait to do all this fun testing tonight :D



I ran smartctl on the drive.
- it gave me a perfect ID for the drive (manufacturer, model, etc...)
- it told me the drive was "healthy" (smartctl -H, if I remember correctly).
- it told me the drive doesn't support error logging, tried to activate it with 'smartctl -S on' as it suggested, but failed (not supported).
- when I tried to ask for '--attributes' , I think it said the same as above (some 'feature not supported' message). I'll check this to be sure tonight as I don't remember whether I tried this option alone, or with '--log'.


Doesn't smartctl (-a) give you values like this:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME         FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate    0x000a  253   252   000    Old_age   Always   -           3
  3 Spin_Up_Time           0x0027  226   226   063    Pre-fail  Always   -           8598
  4 Start_Stop_Count       0x0032  253   253   000    Old_age   Always   -           278
  5 Reallocated_Sector_Ct  0x0033  251   251   063    Pre-fail  Always   -           6
  6 Read_Channel_Margin    0x0001  253   253   100    Pre-fail  Offline  -           0
  7 Seek_Error_Rate        0x000a  253   252   000    Old_age   Always   -           0
  8 Seek_Time_Performance  0x0027  247   236   187    Pre-fail  Always   -
...


Then, you may want to perform a longer (surface) test:

smartctl --test=long /dev/sdb


the thing is, apart from file system being corrupted (which can be explained by hard-reboot during download) and fsck crashes (which I can't explain as of now), I can't seem to find anything wrong with my HDD (not hot, no noise, no problem with smartctl, no "drive ..." in /var/log/messages).
So I guess after a detailed look into BIOS settings (which didn't change for the last year), and the memory mods tests, I'll :
- try to format the HDD and see if the problem is (permanently) solved. But then I'll lose all my (not-so-precious-after-all) data. (Let's be realistic: after crashing fsck on this drive a dozen times, I don't think I'll get my data back anyway :rolleyes: )

A lot of data may have been recovered in the 'lost+found' directory at the mount point of the drive/partition.


- or buy a new one... A good way to justify buying a new HDD twice as big as the previous one :p

dr400

The filesystem could be severely corrupted. But it's strange that fsck would stall or fail, which suggests something else (a possible hardware failure) is going on.

Anyway, yet again; good luck,

David

dr400
27th September 2007, 08:01 PM
This floppy won't contain the memtest (iirc). It's just to boot a system that can't boot from HD, CD/DVD or network. You'll still be prompted for the rescue/install CD/DVD, so if you can boot from CD/DVD, then you might as well skip the floppy.

Apparently I didn't explain myself very well ;)
I was talking about a bootable floppy that launches memtest86 without the need for any OS, the simplest / lightest configuration IMO, which I found here:
http://www.memtest.org/download/1.70/memtest86+-1.70.floppy.zip

Doesn't smartctl (-a) give you values like this:
[...]
Then, you may want to perform a longer (surface) test

In fact smartctl says the device doesn't support SMART (I checked that it was enabled in the BIOS, by the way).
The test finishes without displaying anything...

The filesystem could be severely corrupted. But it's strange that fsck would stall or fail, which suggests something else (a possible hardware failure) is going on.

I think I may have found the problem :)
I ran memtest and, to make it short, one of the two memory modules showed 4000+ errors in both DDR ports, while the other one got through a few test passes each time.
I then remembered to enable the full boot test in the BIOS :rolleyes: The second module never reported a consistent memory size (100 or 200 instead of 448 MB), plus half of the time the BIOS displayed a memory R/W error.

I'm now running FC5 on the PC with only the good module. At least fsck was able to finish normally, and I have already restarted a few times and encountered no problems... I think I'll try to burn a few DVDs now :D
I'll see if I can test the faulty module on another motherboard, but I guess it will end with me buying a new one (the lesser evil). I just hope it was an 'accident' and not the consequence of seating the modules wrongly in the DDR ports (I double-checked in the manual the first time but...)

A lot of data may have been recovered in the 'lost+found' directory at the mount point of the drive/partition.

this directory looks like this now :

total 66644
-rw------- 1 auclair auclair 69419008 Sep 20 18:05 #22686051

I don't know what I can/should do with this :confused:

In any case the lesson is learned: in hindsight I think the symptoms pointed to a memory fault (freeze of FC5, crash of fsck with segfault-like messages), so maybe I could have enabled the memory test in the BIOS before hard-rebooting a dozen times... :rolleyes:

Well, David let me thank you again for your help and your good advice on this matter ;)
You taught me a few useful things on the way.
I hope this time the problem is solved :rolleyes:

dr400

David Becker
27th September 2007, 09:51 PM
Apparently I didn't explain myself very well ;)
I was talking about a bootable floppy that launches memtest86 without the need for any OS, the simplest / lightest configuration IMO, which I found here:
http://www.memtest.org/download/1.70/memtest86+-1.70.floppy.zip

Alright

I think I may have found the problem :)
I ran memtest and, to make it short, one of the two memory modules showed 4000+ errors in both DDR ports, while the other one got through a few test passes each time.
I then remembered to enable the full boot test in the BIOS :rolleyes: The second module never reported a consistent memory size (100 or 200 instead of 448 MB), plus half of the time the BIOS displayed a memory R/W error.

I'm now running FC5 on the PC with only the good module.
Less is more.

At least fsck was able to finish normally, and I have already restarted a few times and encountered no problems... I think I'll try to burn a few DVDs now :D

That's great! What a relief.


I'll see if I can test the faulty module on another motherboard, but I guess it will end with me buying a new one (the lesser evil). I just hope it was an 'accident' and not the consequence of seating the modules wrongly in the DDR ports (I double-checked in the manual the first time but...)



this directory looks like this now :

total 66644
-rw------- 1 auclair auclair 69419008 Sep 20 18:05 #22686051

I don't know what I can/should do with this :confused:


Examine the file contents to see whether it's something worth saving or salvaging.

In any case the lesson is learned: in hindsight I think the symptoms pointed to a memory fault (freeze of FC5, crash of fsck with segfault-like messages), so maybe I could have enabled the memory test in the BIOS before hard-rebooting a dozen times... :rolleyes:

That doesn't necessarily safeguard against these situations; maybe it's time to read http://www.bitwizard.nl/sig11/


Well, David let me thank you again for your help and your good advice on this matter ;)
You taught me a few useful things on the way.
I hope this time the problem is solved :rolleyes:

dr400
Fingers crossed.

David
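
One possible way to take a first look at what fsck left in lost+found, as suggested above. This is only a sketch: the function name peek_recovered is made up, and it assumes head, od and wc behave as on a typical Linux system.

```shell
# Show the size and the first 16 bytes (in hex) of each regular file in
# a directory, as a first hint of what a recovered file might be.
# $1: directory to inspect (defaults to ./lost+found).
peek_recovered() {
    for f in "${1:-lost+found}"/*; do
        [ -f "$f" ] || continue
        printf '%s: %s bytes\n' "$f" "$(wc -c < "$f" | tr -d ' ')"
        head -c 16 "$f" | od -An -tx1
    done
}

# Example call (the mount point is taken from this thread; run as root
# if the directory is not readable):
#   peek_recovered /home/lost+found
```

Running file(1) on the same path will often name the format outright, which may be quicker than reading the hex bytes by eye.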
