Server keeps crashing under weird circumstances.



Moku
6th February 2007, 12:45 AM
Hi All;

For the life of me I can't figure this issue out.

It started about a week ago, when we were having power issues inside our cabinet and the server went through about 10 shutdowns before I was able to get it onto a new circuit. At first, every 12 hours or so the system would hang to the point that no one could access it, and we'd need to hard reboot it. There is nothing in the logs and no obvious reason why it should be doing this.

Eventually it degraded, and now it happens every hour or so. We don't have any cron jobs running, and it happens at random times: the gap may be 30 minutes or it may be 8 hours.

But today I was finally able to capture at least a dstat run, which kept going during the slowdown.



10 1 85 3 0 2| 11M 0 | 464k 13M| 0 0 |7482 2324
5 1 84 9 0 1|6268k 488k| 460k 13M| 0 0 |7696 2040
7 2 75 15 0 2| 19M 0 | 480k 14M| 0 0 |8007 2375
12 2 77 7 0 2| 14M 0 | 480k 14M| 0 0 |7797 2926
15 2 77 3 0 2| 17M 0 | 467k 14M| 0 0 |7805 2610
10 1 80 7 0 2| 22M 0 | 447k 14M| 0 0 |7676 2194
7 1 83 7 0 2| 12M 1544k| 462k 14M| 0 0 |8121 2165
11 2 82 4 0 2| 18M 0 | 466k 13M| 0 0 |7767 2387
6 1 87 4 0 2| 15M 0 | 469k 14M| 0 0 |8013 2402
11 2 82 3 0 2| 13M 0 | 473k 14M| 0 0 |8090 2664
9 2 84 4 0 2| 14M 0 | 505k 14M| 0 0 |8096 2522
8 2 78 11 0 2| 15M 756k| 510k 14M| 0 0 |8225 2481
7 1 88 1 0 2|8120k 0 | 546k 14M| 0 0 |8057 3061
8 2 85 3 0 2| 15M 0 | 536k 13M| 0 0 |7859 2373
13 3 78 5 0 2| 18M 0 | 520k 13M| 0 0 |7795 3300
7 1 86 4 0 2| 11M 0 | 554k 14M| 0 0 |7998 2487
12 2 81 4 0 2|8464k 592k| 552k 14M| 0 0 |8092 2676
10 2 83 4 0 2| 21M 0 | 485k 13M| 0 0 |7934 2388
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read writ| recv send| in out | int csw
3 93 1 0 0 3| 13M 320k|1447k 41M| 0 0 | 25k 3484
2 93 1 2 0 2| 111M 532k|5105k 157M| 0 0 | 92k 13k
11 18 1 69 0 1| 35M 0 | 131k 4945k| 0 0 |5465 2959
1 98 0 0 0 1| 83M 596k|3361k 117M| 0 0 | 76k 7754
0 99 0 0 0 0| 45M 212k|1232k 39M| 0 0 | 43k 4370
1 98 0 0 0 0| 51M 708k|1009k 29M| 0 0 | 38k 4057
1 99 0 0 0 0|2084k 236k| 63k 1582k| 0 0 |2074 240
1 99 0 0 0 0| 20M 472k| 250k 5740k| 0 0 |9433 1144
2 97 0 0 0 1| 67M 216k| 454k 14M| 0 4096B| 14k 1974
0 98 0 0 0 1| 52M 0 | 950k 29M| 0 4096B| 23k 2258
1 98 0 0 0 1| 24M 344k| 534k 15M| 0 0 | 17k 1705
1 99 0 0 0 1| 9.8M 4096B| 231k 6205k| 0 4096B|7361 749
1 99 0 0 0 1| 932k 0 | 51k 1438k| 0 0 |2197 226
0 99 0 0 0 0|1420k 0 | 46k 1191k| 0 0 |1792 193
2 98 0 0 0 0|3872k 128k| 45k 1056k| 0 128k|1793 209
1 98 0 0 0 0|4172k 124k| 42k 1112k| 0 124k|2346 227
3 96 0 0 0 0|1444k 372k| 36k 731k| 0 132k|1823 306
0 99 0 0 0 0| 96k 128k| 87k 1858k| 0 128k|1879 179
1 98 0 0 0 0|2280k 0 | 46k 792k| 0 0 |1739 242
0 99 0 0 0 0|1688k 576k| 37k 598k| 0 252k|1691 167
1 99 0 0 0 0|1100k 416k| 45k 970k| 0 112k|1762 195
1 98 0 0 0 0|3036k 268k| 42k 755k| 0 0 |1823 187
0 99 0 0 0 0|2612k 124k| 34k 838k| 0 124k|1708 183
2 97 0 0 0 1|7288k 352k| 178k 3511k| 0 120k|6912 712
0 99 0 0 0 0|1828k 140k| 41k 576k| 0 132k|2418 289
1 99 0 0 0 0|1932k 128k| 35k 838k| 0 128k|1632 223
1 99 0 0 0 0|3292k 0 | 41k 1236k| 0 0 |1753 211
0 99 0 0 0 0| 12k 128k| 51k 1387k| 0 128k|1750 229
0 99 0 0 0 0|1904k 232k| 48k 1259k| 0 232k|1679 187
0 99 0 0 0 0|3080k 84k| 73k 1591k| 0 84k|1471 167
0 100 0 0 0 0| 52k 0 | 29k 387k| 0 0 |1362 137
0 100 0 0 0 0|2236k 56k| 28k 671k| 0 56k|1536 179
0 100 0 0 0 0|1024k 168k| 27k 490k| 0 168k|1464 197
0 99 0 0 0 0|1076k 104k| 49k 1536k| 0 104k|1939 173
0 100 0 0 0 0|1040k 144k| 42k 1105k| 0 144k|1681 200
[root@server ~]# dstat
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read writ| recv send| in out | int csw
7 12 76 3 0 1|8184k 342k| 0 0 | 14k 185k|6114 2024
5 2 38 54 0 1|8808k 0 | 335k 6784k|5584k 0 |5313 2935
3 89 2 5 0 1| 26M 796k|1901k 40M|8404k 0 | 36k 9432
3 81 5 10 0 1| 10M 0 | 89k 2347k| 976k 0 |2645 1210
2 94 0 3 0 1| 26M 816k|1731k 46M|8500k 0 | 41k 8548
1 99 0 0 0 1| 11M 0 | 321k 8654k|1492k 0 | 12k 1814
1 99 0 0 0 1|3316k 0 | 106k 3005k| 856k 0 |3433 532
0 99 0 0 0 1|7120k 0 | 53k 1436k| 468k 0 |2716 344
1 98 0 0 0 1|7268k 720k| 98k 3032k|1396k 0 |2824 529
5 94 0 0 0 1|9900k 264k| 93k 2109k|2428k 0 |2632 628
3 92 0 3 0 1| 17M 1128k| 634k 20M|3036k 0 | 19k 4467
1 98 0 0 0 1|2208k 4096B| 107k 2856k| 544k 0 |1547 237
16 80 0 2 0 1|5992k 1032k| 84k 2599k|2948k 0 |2742 2230
2 97 0 0 0 1|2856k 280k| 124k 3022k| 408k 0 |2755 529
2 96 0 1 0 1|1992k 0 | 101k 2834k|1336k 0 |2569 628
2 96 1 0 0 1|5152k 600k| 104k 2604k| 192k 0 |2694 373
2 97 0 0 0 1|2556k 0 | 104k 3008k| 500k 0 |2584 333
3 96 0 0 0 1|6224k 0 | 102k 3234k| 948k 0 |2846 502
3 95 1 1 0 1|3644k 0 | 95k 3269k|1260k 0 |2711 735


As you can see, idle goes from 83 to 0 in a matter of seconds, hard-drive reads go from 10-20 MB to 111 MB per sample, and the machine believes it sent 157 MB over the network in a single one-second sample. (It's only on a gigabit port, which tops out at roughly 125 MB per second, so it's not possible to send that much.) After that, sys CPU usage hovers around 100%, all the other read/send stats fall to nothing, and the machine is hosed for a good 2-3 minutes.
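
To put a number on the "not possible" part, here is a minimal sketch of the arithmetic in Python. It assumes dstat's default one-second sample interval (the capture was started with a plain `dstat`) and that dstat's "M" means MiB; the function names are made up for illustration.

```python
# Sanity check: can a 1 Gbit/s port really send 157M in one dstat sample?
# Assumptions (not stated in the post): 1-second samples, "M" = MiB.

GIGABIT_BITS_PER_SEC = 1_000_000_000  # nominal line rate, before protocol overhead


def mib_per_sec_to_bits(mib_per_sec: float) -> float:
    """Convert a rate in MiB/s to bits/s."""
    return mib_per_sec * 1024 * 1024 * 8


def exceeds_link(mib_per_sec: float,
                 link_bits_per_sec: int = GIGABIT_BITS_PER_SEC) -> bool:
    """True if the reported rate is higher than the link can physically carry."""
    return mib_per_sec_to_bits(mib_per_sec) > link_bits_per_sec


if __name__ == "__main__":
    # 14 is the typical "send" value before the hang, 157 is the spike.
    for label, rate in [("normal", 14), ("spike", 157)]:
        gbit = mib_per_sec_to_bits(rate) / 1e9
        print(f"{label}: {rate} MiB/s = {gbit:.2f} Gbit/s, "
              f"exceeds link: {exceeds_link(rate)}")
```

The 157M sample works out to about 1.32 Gbit/s, so the counter is reporting more than the port can carry even before protocol overhead.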

Sometimes the machine will recover, most of the time it needs a hard reboot.
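
If the dstat output is saved to a file, the point where a hang starts can be picked out automatically instead of by eye. The following is only a rough sketch: it assumes the column order shown in the dstat header above, and the 90% sys threshold and the function names are arbitrary choices for illustration, not anything dstat provides.

```python
# Column order taken from the dstat header in the capture:
# usr sys idl wai hiq siq| read writ| recv send| in out | int csw
CPU_FIELDS = ("usr", "sys", "idl", "wai", "hiq", "siq")


def parse_cpu(line):
    """Return the CPU columns of one dstat data line as a dict.

    Header lines, shell prompts and anything else that is not six
    integers before the first '|' yield None.
    """
    cpu_part = line.split("|", 1)[0].split()
    if len(cpu_part) != len(CPU_FIELDS):
        return None
    if not all(tok.isdigit() for tok in cpu_part):
        return None
    return dict(zip(CPU_FIELDS, map(int, cpu_part)))


def find_stalls(lines, sys_threshold=90):
    """Yield (index, cpu) for samples with sys CPU at or above the threshold."""
    for i, line in enumerate(lines):
        cpu = parse_cpu(line)
        if cpu is not None and cpu["sys"] >= sys_threshold:
            yield i, cpu


# A few lines copied from the capture, around the start of the hang.
SAMPLE = [
    "usr sys idl wai hiq siq| read writ| recv send| in out | int csw",
    "10 2 83 4 0 2| 21M 0 | 485k 13M| 0 0 |7934 2388",
    "3 93 1 0 0 3| 13M 320k|1447k 41M| 0 0 | 25k 3484",
    "2 93 1 2 0 2| 111M 532k|5105k 157M| 0 0 | 92k 13k",
]

if __name__ == "__main__":
    for idx, cpu in find_stalls(SAMPLE):
        print(f"sample {idx}: sys={cpu['sys']}% idl={cpu['idl']}%")
```

Running it against a full log would give the sample numbers where sys time first pins the CPU, which can then be lined up with timestamps from other logs if the capture was taken with them.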

Does anyone have any idea why this is happening? It's got me stumped! All the packages are the latest available via yum.

Thanks!

Rich

marcrblevins
6th February 2007, 03:01 AM
Was that server protected by a UPS? If not, some damage may have occurred. Test each hardware component to make sure it's not damaged.

Moku
6th February 2007, 07:51 PM
Was that server protected by a UPS? If not, some damage may have occurred. Test each hardware component to make sure it's not damaged.

There was initial damage to the software RAID-0 array, but it was cleaned up. We're still having the problems.