Hello all
This is my first post in these forums, so apologies if I should have posted in a more specific forum instead. I am facing a reproducible problem that I have been unable to resolve, even after doing my own research.
My system is an FC6 installation with a 2.6.22.14-72.fc6 kernel, a Promise EX8350 SATA RAID controller and four RAID-1 arrays. Each array contains a single ext3 filesystem; these are identified as /dev/sdc1 through /dev/sdf1. The system has been functioning problem-free for about two years now.
When copying or moving data from one of the above filesystems to another, say, from /dev/sdc1 to /dev/sdf1, the machine *always* freezes. The freeze comes after an arbitrary amount of time and an arbitrary number of file operations, with the following appearing in the messages log:
Code:
Nov 27 23:38:48 timemachine-s kernel: list_add corruption. next->prev should be prev (f7e4f888), but was f7fbeb80. (next=f7fbeb80).
Nov 27 23:38:48 timemachine-s kernel: ------------[ cut here ]------------
Nov 27 23:38:48 timemachine-s kernel: kernel BUG at lib/list_debug.c:27!
Nov 27 23:38:48 timemachine-s kernel: invalid opcode: 0000 [#1]
Nov 27 23:38:48 timemachine-s kernel: SMP
Nov 27 23:38:48 timemachine-s kernel: last sysfs file: /class/input/input3/event3/dev
Nov 27 23:38:48 timemachine-s kernel: Modules linked in: pcspkr autofs4 hidp rfcomm l2cap bluetooth sunrpc nf_conntrack_netbios_ns ipt_REJECT nf_conntrack_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 xt_state nf_conntrack nfnetlink xt_tcpudp ip6table_filter ip6_tables x_tables dm_mirror dm_multipath dm_mod video sbs button dock battery ac ipv6 lp floppy ata_piix iTCO_wdt ide_cd iTCO_vendor_support sg e1000 cdrom i2c_i801 parport_pc i2c_core parport stex ahci libata sd_mod scsi_mod ext3 jbd mbcache ehci_hcd ohci_hcd uhci_hcd
Nov 27 23:38:48 timemachine-s kernel: CPU: 1
Nov 27 23:38:48 timemachine-s kernel: EIP: 0060:[<c04ef3af>] Not tainted VLI
Nov 27 23:38:48 timemachine-s kernel: EFLAGS: 00010086 (2.6.22.14-72.fc6 #1)
Nov 27 23:38:48 timemachine-s kernel: EIP is at __list_add+0x26/0x5c
Nov 27 23:38:48 timemachine-s kernel: eax: 00000061 ebx: f7fbea10 ecx: 00000082 edx: 00000000
Nov 27 23:38:48 timemachine-s kernel: esi: f7fbea10 edi: 00000000 ebp: f7e4f880 esp: f79ffac8
Nov 27 23:38:48 timemachine-s kernel: ds: 007b es: 007b fs: 00d8 gs: 0033 ss: 0068
Nov 27 23:38:48 timemachine-s kernel: Process fsck.ext3 (pid: 3461, ti=f79ff000 task=f4784000 task.ti=f79ff000)
Nov 27 23:38:48 timemachine-s kernel: Stack: c06dca69 f7e4f888 f7fbeb80 f7fbeb80 f7fd4228 f7fbea10 c04e1728 c0430213
Nov 27 23:38:48 timemachine-s kernel: c22f731c 00000086 c0430213 f7f32bc4 00000000 f7f32b58 f7841c00 f7f22000
Nov 27 23:38:48 timemachine-s kernel: f7f32b58 f8875925 f7fbea10 c04e02f0 f7fbea10 f7fbea10 f7f32b58 f7fbea10
Nov 27 23:38:48 timemachine-s kernel: Call Trace:
Nov 27 23:38:48 timemachine-s kernel: [<c04e1728>] blk_queue_start_tag+0xd2/0xdf
Nov 27 23:38:48 timemachine-s kernel: [<c0430213>] lock_timer_base+0x19/0x35
Nov 27 23:38:48 timemachine-s kernel: [<c0430213>] lock_timer_base+0x19/0x35
Nov 27 23:38:48 timemachine-s kernel: [<f8875925>] scsi_request_fn+0x119/0x31d [scsi_mod]
Nov 27 23:38:48 timemachine-s kernel: [<c04e02f0>] blk_remove_plug+0x58/0x64
Nov 27 23:38:48 timemachine-s kernel: [<c04e0319>] __generic_unplug_device+0x1d/0x1f
Nov 27 23:38:48 timemachine-s kernel: [<c04dd787>] elv_insert+0x146/0x1ec
Nov 27 23:38:48 timemachine-s kernel: [<c04e06bd>] blk_plug_device+0x6c/0xb3
Nov 27 23:38:48 timemachine-s kernel: [<c04e1bcd>] __make_request+0x498/0x4f7
Nov 27 23:38:48 timemachine-s kernel: [<c046095c>] __rmqueue+0x7f/0xac
Nov 27 23:38:48 timemachine-s kernel: [<c04df445>] generic_make_request+0x318/0x346
Nov 27 23:38:48 timemachine-s kernel: [<c04613bd>] __alloc_pages+0x68/0x2a0
Nov 27 23:38:48 timemachine-s kernel: [<c04e136b>] submit_bio+0xca/0xd1
Nov 27 23:38:48 timemachine-s kernel: [<c045fa38>] mempool_alloc+0x37/0xd5
Nov 27 23:38:48 timemachine-s kernel: [<c04989ae>] bio_alloc_bioset+0x9b/0xf3
Nov 27 23:38:48 timemachine-s kernel: [<c0495992>] submit_bh+0xd7/0xf5
Nov 27 23:38:48 timemachine-s kernel: [<c0498136>] block_read_full_page+0x2eb/0x2fc
Nov 27 23:38:48 timemachine-s kernel: [<c049a525>] blkdev_get_block+0x0/0x43
Nov 27 23:38:48 timemachine-s kernel: [<c0462c70>] __do_page_cache_readahead+0x16b/0x1c0
Nov 27 23:38:48 timemachine-s kernel: [<c040847c>] profile_pc+0x21/0x45
Nov 27 23:38:48 timemachine-s kernel: [<c043eca9>] tick_handle_periodic+0x17/0x5a
Nov 27 23:38:48 timemachine-s kernel: [<c043c86d>] getnstimeofday+0x30/0xbf
Nov 27 23:38:48 timemachine-s kernel: [<c042d4d3>] irq_exit+0x53/0x6b
Nov 27 23:38:48 timemachine-s kernel: [<c0462d11>] blockable_page_cache_readahead+0x4c/0x9f
Nov 27 23:38:48 timemachine-s kernel: [<c0462de5>] make_ahead_window+0x81/0x9e
Nov 27 23:38:48 timemachine-s kernel: [<c0462f7c>] page_cache_readahead+0x17a/0x1a5
Nov 27 23:38:48 timemachine-s kernel: [<c045db8e>] do_generic_mapping_read+0x158/0x4a0
Nov 27 23:38:48 timemachine-s kernel: [<c045f89a>] generic_file_aio_read+0x149/0x16f
Nov 27 23:38:48 timemachine-s kernel: [<c045d1c8>] file_read_actor+0x0/0xe0
Nov 27 23:38:48 timemachine-s kernel: [<c047a6f6>] do_sync_read+0xc7/0x10a
Nov 27 23:38:48 timemachine-s kernel: [<c043877d>] autoremove_wake_function+0x0/0x35
Nov 27 23:38:48 timemachine-s kernel: [<c0425912>] scheduler_tick+0x1a1/0x274
Nov 27 23:38:48 timemachine-s kernel: [<c062ec47>] mutex_lock+0x1a/0x29
Nov 27 23:38:48 timemachine-s kernel: [<c049985d>] block_llseek+0xad/0xb9
Nov 27 23:38:48 timemachine-s kernel: [<c047a62f>] do_sync_read+0x0/0x10a
Nov 27 23:38:48 timemachine-s kernel: [<c047af96>] vfs_read+0xa6/0x158
Nov 27 23:38:48 timemachine-s kernel: [<c047b3f4>] sys_read+0x41/0x67
Nov 27 23:38:48 timemachine-s kernel: [<c0404f8e>] syscall_call+0x7/0xb
Nov 27 23:38:48 timemachine-s kernel: =======================
Nov 27 23:38:48 timemachine-s kernel: Code: 83 c4 0c 5b c3 56 53 89 c3 83 ec 10 8b 41 04 39 d0 74 1c 89 4c 24 0c 89 54 24 04 89 44 24 08 c7 04 24 69 ca 6d c0 e8 6c a4 f3 ff <0f> 0b eb fe 8b 32 39 ce 74 1c 89 54 24 0c 89 74 24 08 89 4c 24
Nov 27 23:38:48 timemachine-s kernel: EIP: [<c04ef3af>] __list_add+0x26/0x5c SS:ESP 0068:f79ffac8
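In case it helps anyone skim the trace above, here is a minimal sketch of a Python helper that pulls out the BUG location, the process named in the oops and the functions in the call trace. The function name `summarize_oops` and the regular expressions are my own and only assume the line layout seen in the log above; this is not an official tool, and other kernels may format their oops output differently.

```python
import re

# Call-trace entries in the log above look like:
#   "... kernel: [<c04e1728>] blk_queue_start_tag+0xd2/0xdf"
#   "... kernel: [<f8875925>] scsi_request_fn+0x119/0x31d [scsi_mod]"
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)
# "kernel BUG at lib/list_debug.c:27!"
BUG_RE = re.compile(r"kernel BUG at (?P<where>\S+?)!")
# "Process fsck.ext3 (pid: 3461, ..."
PROC_RE = re.compile(r"Process (?P<name>\S+) \(pid: (?P<pid>\d+)")


def summarize_oops(lines):
    """Summarize one oops: BUG location, process, and call-trace frames.

    Hypothetical helper; it assumes the layout of the log quoted above.
    Frames are returned as (function, module) tuples, where module is
    None for functions that are not attributed to a loadable module.
    """
    summary = {"bug": None, "process": None, "pid": None, "frames": []}
    in_trace = False
    for line in lines:
        bug = BUG_RE.search(line)
        proc = PROC_RE.search(line)
        if bug:
            summary["bug"] = bug.group("where")
        elif proc:
            summary["process"] = proc.group("name")
            summary["pid"] = int(proc.group("pid"))
        elif "Call Trace:" in line:
            in_trace = True
        elif in_trace:
            frame = FRAME_RE.search(line)
            if frame:
                summary["frames"].append(
                    (frame.group("func"), frame.group("module"))
                )
            else:
                # The first non-frame line (the "=====" separator) ends the trace.
                in_trace = False
    return summary
```

To use it, pass any iterable of log lines, for example an open file object for the messages log, and inspect the returned dictionary. Keeping the module name alongside each function makes it easy to see which frames come from loadable modules (here scsi_mod) and which from the core kernel.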
I only noticed the problem today, after installing two extra HDs as a RAID-1 array, setting up /dev/sdf1 on it and starting to move data to it. I have not checked whether the problem also occurs when filesystems other than the RAID-1 ones are involved (i.e. /dev/sda1 and /dev/sdb1), but somehow I don't feel this is a RAID-specific problem. I should probably also say that I have never before attempted copy/move operations on this system as massive as the ones that revealed the problem today.
My knowledge of Fedora, and of Linux in general, doesn't go far enough for me to understand what the above means or how to address it, so I would *dearly* appreciate any help.
Thanks in advance.
mada