Houston, we have a problem!
FedoraForum.org - Fedora Support Forums and Community
  1. #1
    bobx001

    Houston, we have a problem!

    Code:
    Jun 30 13:42:09 boa kernel: general protection fault: 0000 [#1] SMP NOPTI
    Jun 30 13:42:09 boa kernel: Modules linked in: xt_CHECKSUM ip6t_MASQUERADE nf_nat_masquerade_ipv6 ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables macvtap macvlan vhost_net vhost tap nfsv3 rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache devlink tun cfg80211 rfkill ip_set nfnetlink bridge stp llc libcrc32c amd64_edac_mod edac_mce_amd kvm_amd kvm joydev irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ipmi_ssif
    Jun 30 13:42:09 boa kernel: ccp sp5100_tco k10temp i2c_piix4 ipmi_si shpchp ipmi_devintf ipmi_msghandler pinctrl_amd acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc ast drm_kms_helper mpt3sas ttm igb raid_class crc32c_intel drm scsi_transport_sas dca nvme i2c_algo_bit nvme_core [last unloaded: ip6_tables]
    Jun 30 13:42:09 boa kernel: CPU: 21 PID: 2906 Comm: nfsd Not tainted 4.17.2-200.fc28.x86_64 #1
    Jun 30 13:42:09 boa kernel: Hardware name: Supermicro AS -2023US-TR4/H11DSU-iN, BIOS 1.1 02/07/2018
    Jun 30 13:42:09 boa kernel: RIP: 0010:prefetch_freepointer+0x10/0x20
    Jun 30 13:42:09 boa kernel: RSP: 0018:ffffa90492027c58 EFLAGS: 00010206
    Jun 30 13:42:09 boa kernel: RAX: 0000000000000000 RBX: 6ede0f286c1041d1 RCX: 000000000002fba2
    Jun 30 13:42:09 boa kernel: RDX: 000000000002fba1 RSI: 6ede0f286c1041d1 RDI: ffff9365debb56c0
    Jun 30 13:42:09 boa kernel: RBP: ffff9365debb56c0 R08: ffff9371dfa6c2e0 R09: 0000000000000004
    Jun 30 13:42:09 boa kernel: R10: 0000000000000000 R11: 0000000000000014 R12: 00000000014080c0
    Jun 30 13:42:09 boa kernel: R13: ffffffffc0372ac1 R14: ffff93717e0dd6a9 R15: ffff9365debb56c0
    Jun 30 13:42:09 boa kernel: FS:  0000000000000000(0000) GS:ffff9371dfa40000(0000) knlGS:0000000000000000
    Jun 30 13:42:09 boa kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Jun 30 13:42:09 boa kernel: CR2: 000055c5a450d008 CR3: 000000180b6c6000 CR4: 00000000003406e0
    Jun 30 13:42:09 boa kernel: Call Trace:
    Jun 30 13:42:09 boa kernel: kmem_cache_alloc+0xb4/0x1d0
    Jun 30 13:42:09 boa kernel: ? nfsd4_free_file_rcu+0x20/0x20 [nfsd]
    Jun 30 13:42:09 boa kernel: nfs4_alloc_stid+0x21/0xa0 [nfsd]

    Hardcore. The server just took a dump.
    That's an NFS server on an NVMe drive, mounted by the clients with v4, rsize=32768,wsize=32768.
    I guess I will have to put the files somewhere else until I can figure this out.
    "monsters John ... monsters from the ID..."
    "ma vule teva maar gul nol naya"

  2. #2
    bobx001

    Re: Houston, we have a problem!

    Interestingly, the server was working fine until I started to copy a large file from another server into a completely different partition (on a different drive, of course).
    At that exact moment, the server dumped.

    EDIT: the clients mounted the partition as follows:
    Code:
    mount -t nfs -o vers=4,noatime,nodiratime,norelatime,rsize=32768,wsize=32768 boa.internal:/NVME0 /boa/NVME0
    Last edited by bobx001; 30th June 2018 at 03:57 PM.
    "monsters John ... monsters from the ID..."
    "ma vule teva maar gul nol naya"

  3. #3
    bobx001

    Re: Houston, we have a problem!

    I am going to try and run this again but mounted using NFSv3

    EDIT#2: the clients, to copy files into the NFS server, use the following little thingy I wrote:
    Code:
    #include <fcntl.h>          /* open() */
    #include <sys/sendfile.h>   /* sendfile() (Linux-specific) */
    #include <sys/types.h>      /* size_t */
    #include <unistd.h>         /* close() */
    
    /* Copy 'bytes' bytes from 'from' to 'to' in one sendfile(2) call.
       Note: sendfile() may copy fewer bytes than requested. */
    int copy_file(const char *from, const char *to, size_t bytes)
    {
    	int	in_fd, out_fd;
    
    	in_fd = open(from, O_RDONLY);
    	if (in_fd < 0)
    		return -1;
    	out_fd = open(to, O_WRONLY | O_CREAT, 0777);
    	if (out_fd < 0) {
    		close(in_fd);
    		return -1;
    	}
    	sendfile(out_fd, in_fd, NULL, bytes);
    	close(in_fd);
    	close(out_fd);
    	return 0;
    }
    Could it be that it's too fast for the server?
    Last edited by bobx001; 30th June 2018 at 03:59 PM.
    "monsters John ... monsters from the ID..."
    "ma vule teva maar gul nol naya"

  4. #4
    bobx001

    Re: Houston, we have a problem!

    UPDATE: submitted a bug report: https://bugzilla.kernel.org/show_bug.cgi?id=200379
    "monsters John ... monsters from the ID..."
    "ma vule teva maar gul nol naya"

  5. #5
    bobx001

    Re: Houston, we have a problem!

    UPDATE2: I have now remounted the same NVMe using -o vers=3 on the 3 clients (remount line below), and I set up an alarm that will SMS my phone (loudly) if any kernel panic shows up in /var/log/messages. Let's see if I can sleep tonight, unlike last night.
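    For reference, that is the same mount command as in post #2, just with vers=3 (assuming the other options stay put):
    Code:
    mount -t nfs -o vers=3,noatime,nodiratime,norelatime,rsize=32768,wsize=32768 boa.internal:/NVME0 /boa/NVME0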
    Another thing I am planning is to change the C file-copy function into a system("cp -n source target") call (a sketch follows below); it may slow down the clients a bit, but it may be more "nice" to the file system. Also, I have read that some dudes were having similar kernel panics on samba mounts, and they changed O_WRONLY to O_RDWR, which seemed to fix it. I will test that too.
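    A minimal sketch of that cp-based change (untested, and it assumes the from/to paths contain no shell quotes or metacharacters):
    Code:
    #include <stdio.h>      /* snprintf() */
    #include <stdlib.h>     /* system() */
    
    /* Sketch only: shell out to "cp -n" instead of calling sendfile(2).
       Paths are assumed shell-safe; quoting is the caller's problem. */
    int copy_file_cp(const char *from, const char *to)
    {
    	char	cmd[4200];	/* room for two 2048-byte paths plus the command text */
    
    	snprintf(cmd, sizeof(cmd), "cp -n '%s' '%s'", from, to);
    	return system(cmd);	/* shell exit status; 0 means cp succeeded */
    }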
    "monsters John ... monsters from the ID..."
    "ma vule teva maar gul nol naya"

  6. #6
    bobx001

    Re: Houston, we have a problem!

    Update:

    I have now run for more than a day with the clients mounting the same NVME0 partition via NFSv3.
    Also, I modified copy_file to use system("cp -n source target"), which is of course a zillion times slower, but it works.
    And so far so good.

    The file cache hit rate is now about 90%, so I doubt there will be problems, but I will post here if I see anything of interest.
    "monsters John ... monsters from the ID..."
    "ma vule teva maar gul nol naya"

  7. #7
    AdamW

    Re: Houston, we have a problem!

    I saw what looks like the same thing on the Fedora openQA boxes, filed an RHBZ: https://bugzilla.redhat.com/show_bug.cgi?id=1598229. I'll link up the reports. Thanks!
    Adam Williamson | awilliam AT redhat DOT com
    Fedora QA Community Monkey
    IRC: adamw | Twitter: AdamW_Fedora | identi.ca: adamwfedora
    http://www.happyassassin.net

  8. #8
    bobx001

    Re: Houston, we have a problem!

    Quote Originally Posted by AdamW
    I saw what looks like the same thing on the Fedora openQA boxes, filed an RHBZ: https://bugzilla.redhat.com/show_bug.cgi?id=1598229. I'll link up the reports. Thanks!
    I see this was also NFSv4 in that bug report.
    Well, all I can say is that my system is still working fineola with NFSv3 (5 days and counting). However, I am still using the "cp source_file target_file_in_nfs_server" command, which is most definitely not optimal for my taste.
    What I will do today is modify the caching proggy to use my sendfile C proggy, but with the O_RDWR flag (one-line change, sketched below), keep the mounts on v3, and clear the cache to force it up to 1000 file copies/sec. I will report what I see.
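    The change itself would be one line in copy_file; a sketch, assuming the samba workaround carries over directly:
    Code:
    /* Sketch: open the target read-write instead of write-only, as the
       reported samba workaround did; everything else stays the same. */
    out_fd = open(to, O_RDWR | O_CREAT, 0777);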
    "monsters John ... monsters from the ID..."
    "ma vule teva maar gul nol naya"

  9. #9
    bobx001

    Re: Houston, we have a problem!

    Interestingly, in the bug you saw, the second line after kmem_cache_alloc is related to nfs4_stid (mine is nfsd4_free_file_rcu, and then comes the stid).

    In the nfs4state.c source we can see:
    Code:
    static struct nfs4_stid *nfs4_alloc_stid(struct nfs4_client *cl,
    					 struct kmem_cache *slab)
    {
    	struct nfs4_stid *stid;
    	int new_id;
    
    	stid = kmem_cache_zalloc(slab, GFP_KERNEL);
    	if (!stid)
    		return NULL;
    
    	idr_preload(GFP_KERNEL);
    	spin_lock(&cl->cl_lock);
    I guess the stid actually comes back non-zero even though kmem_cache_alloc fails, and execution continues, failing in idr_preload. The main issue is that these programmers pass references to structs as arguments to other functions (lunacy) like it was water, with no clue as to the RAM state of those structs; all it takes is a small traffic congestion, and you have utter memory corruption. This is the C++ syndrome, which I call optimistically lazy programming: the unwillingness to accept the fact that data memory will get scrubbed AT RANDOM, not in a sequential, programmatic way, and if you put calls to functions in that memory, then you have a problem.
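    FWIW, here is a toy userspace sketch (my illustration, not kernel code) of why the fault lands in prefetch_freepointer: the slab allocator keeps the next-free pointer inside each free object, so one stray write into freed memory turns the freelist into a wild pointer, much like the garbage RBX values in these traces.
    Code:
    #include <stdio.h>
    #include <string.h>
    
    /* Toy slab object: the next-free link lives inside the object itself. */
    struct obj { struct obj *next_free; char payload[56]; };
    
    int main(void)
    {
    	struct obj pool[2] = { { .next_free = &pool[1] }, { .next_free = NULL } };
    	struct obj *freelist = &pool[0];
    
    	/* A use-after-free style stray write clobbers the embedded link. */
    	memset(&pool[0], 0x41, sizeof(void *));
    
    	/* The allocator's next "alloc" would now chase a garbage pointer. */
    	freelist = freelist->next_free;
    	printf("next allocation would come from %p\n", (void *)freelist);
    	return 0;
    }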
    "monsters John ... monsters from the ID..."
    "ma vule teva maar gul nol naya"

  10. #10
    alexatkin

    Re: Houston, we have a problem!

    Have you had any hard locks from this or am I hitting a different bug?

    I noticed that on 4.17.3 I get this on the server, but on 4.17.4 and 4.17.5 I get hard lockups on both client and server with nothing to report in messages at all.

    Reverted both back to 4.17.3, which seems to fix the client but obviously not the server. I just haven't gotten around to reinstalling an older kernel on the server yet and am hoping it doesn't die too often in the meantime.

  11. #11
    bobx001

    Re: Houston, we have a problem!

    Quote Originally Posted by alexatkin
    Have you had any hard locks from this or am I hitting a different bug?

    I noticed that on 4.17.3 I get this on the server, but on 4.17.4 and 4.17.5 I get hard lockups on both client and server with nothing to report in messages at all.

    Reverted both back to 4.17.3, which seems to fix the client but obviously not the server. I just haven't gotten around to reinstalling an older kernel on the server yet and am hoping it doesn't die too often in the meantime.
    I have just had another "soft" lockup, meaning the machine kept running, albeit a bit off kilter, with kernel 4.17.3-200.fc28.x86_64.
    Here is the abrt-cli output:
    Code:
    id e795ea5fb0078a965d3c2ec2da85674b4b084143
    reason:         kmem_cache_alloc(): general protection fault in kmem_cache_alloc
    time:           Fri 13 Jul 2018 08:26:51 AM UTC
    cmdline:        BOOT_IMAGE=/boot/vmlinuz-4.17.3-200.fc28.x86_64 root=UUID=19e1c7a5-747e-4ad4-92da-77cfb1417062 ro rhgb quiet selinux=0 intel_iommu=off amd_iommu=on nohpet LANG=en_US.UTF-8
    package:        kernel-core-4.17.3-200.fc28
    count:          11
    Directory:      /var/spool/abrt/oops-2018-07-13-08:26:51-1616-0
    I have now upgraded the server to 4.17.5-200.fc28.x86_64.
    And I am now in the process of remounting all the NFS clients with vers=3, since the backtrace is all nfsv4 paths:
    Code:
    [root@boa ~]# more /var/spool/abrt/oops-2018-07-13-08:26:51-1616-0/backtrace
    general protection fault: 0000 [#1] SMP NOPTI
    Modules linked in: macvtap macvlan vhost_net vhost tap nfsv3 rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xt_CHECKSUM ip6t_MASQUERADE nf_nat_masquerade_ipv6 ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables tun devlink cfg80211 rfkill ip_set nfnetlink bridge stp llc libcrc32c amd64_edac_mod edac_mce_amd kvm_amd kvm irqbypass crct10dif_pclmul joydev crc32_pclmul ghash_clmulni_intel ipmi_ssif sp5100_tco ccp ipmi_si i2c_piix4 k10temp shpchp ipmi_devintf ipmi_msghandler pinctrl_amd acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc ast drm_kms_helper mpt3sas ttm igb raid_class drm scsi_transport_sas crc32c_intel nvme dca i2c_algo_bit nvme_core [last unloaded: ip6_tables]
    CPU: 57 PID: 2940 Comm: nfsd Not tainted 4.17.3-200.fc28.x86_64 #1
    Hardware name: Supermicro AS -2023US-TR4/H11DSU-iN, BIOS 1.1 02/07/2018
    RIP: 0010:prefetch_freepointer+0x10/0x20
    RSP: 0018:ffffb05c51d5fc58 EFLAGS: 00010202
    RAX: 0000000000000000 RBX: 010ea052660ca032 RCX: 0000000000003c43
    RDX: 0000000000003c42 RSI: 010ea052660ca032 RDI: ffff94609eb9f800
    RBP: ffff94609eb9f800 R08: ffff94749fb6c3e0 R09: 0000000000000004
    R10: 0000000000000000 R11: 0000000000000019 R12: 00000000014080c0
    R13: ffffffffc0500ac1 R14: ffff945cd60dccb1 R15: ffff94609eb9f800
    FS:  0000000000000000(0000) GS:ffff94749fb40000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00000000b7a69000 CR3: 000000101975a000 CR4: 00000000003406e0
    Call Trace:
     kmem_cache_alloc+0xb4/0x1d0
     ? nfsd4_free_file_rcu+0x20/0x20 [nfsd]
     nfs4_alloc_stid+0x21/0xa0 [nfsd]
     nfsd4_process_open2+0x1048/0x1360 [nfsd]
     ? nfsd_permission+0x63/0xe0 [nfsd]
     ? fh_verify+0x17a/0x5b0 [nfsd]
     ? nfsd4_process_open1+0x139/0x420 [nfsd]
     nfsd4_open+0x2b1/0x6b0 [nfsd]
     nfsd4_proc_compound+0x33e/0x640 [nfsd]
     nfsd_dispatch+0x9e/0x210 [nfsd]
     svc_process_common+0x46e/0x6c0 [sunrpc]
     ? nfsd_destroy+0x50/0x50 [nfsd]
     svc_process+0xb7/0xf0 [sunrpc]
     nfsd+0xe3/0x140 [nfsd]
    "monsters John ... monsters from the ID..."
    "ma vule teva maar gul nol naya"

  12. #12
    bobx001

    Re: Houston, we have a problem!

    I wonder what would happen if I modify the file /etc/nfs.conf
    and uncomment/change as follows:
    Code:
    [nfsd]
    threads=32
    vers3=y
    vers4=n
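    (If I am reading nfs.conf(5) right, vers4=n should make nfsd stop offering v4 entirely; after a "systemctl restart nfs-server", the contents of /proc/fs/nfsd/versions would confirm whether v4 is really off.)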
    "monsters John ... monsters from the ID..."
    "ma vule teva maar gul nol naya"

  13. #13
    bobx001

    Re: Houston, we have a problem!

    BTW, this issue is only happening on an AMD Epyc. I have a couple more Intel-based systems using NFSv4 as servers, and they do not see this problem. Could it be that the nfs/kernel developers test with an Intel compiler, and not gcc?
    "monsters John ... monsters from the ID..."
    "ma vule teva maar gul nol naya"

  14. #14
    alexatkin

    Re: Houston, we have a problem!

    It's certainly not limited to AMD, as I'm using an i5-4690 on the server and an i5-8600K on the client.

  15. #15
    bobx001

    Re: Houston, we have a problem!

    Quote Originally Posted by alexatkin
    It's certainly not limited to AMD, as I'm using an i5-4690 on the server and an i5-8600K on the client.
    Interesting. The Intel box that works without a hitch is an older one:
    Xeon E5-2643 0s (-MT-MCP-SMP-) arch: Sandy Bridge rev.7 cache: 20480 KB
    "monsters John ... monsters from the ID..."
    "ma vule teva maar gul nol naya"
