[SOLVED] [HDD] hardware problem?
FedoraForum.org - Fedora Support Forums and Community
  1. #1
    Join Date
    Aug 2014
    Location
    Oregon
    Posts
    4

    [HDD] hardware problem?

    For several days now I have been getting error messages while booting Fedora. Because they appear irreproducibly, I suspect my HDD is faulty. However, the "Lenovo Bootable Diagnostics ISO" does not show any errors, and I need reliable proof that the HDD is failing in order to get a replacement paid for under warranty.





    OCR of the screenshot showing the error messages:
    [126.798506] ata1.00: failed command: READ FPDMA QUEUED
    [126.800501] ata1.00: cmd 60/08:60:e0:6f:f8/00:00:35:00:00/40 tag 12 ncq 4096 in
    [126.800501] res 41/84:08:e0:6f:f8/00:00:35:00:00/00 Emask 0x410 (ATA bus error)
    [126.804484] ata1.00: status: { DRDY ERR }
    [126.806355] ata1.00: error: { ICRC ABRT }
    [127.279527] end_request: I/O error, dev sda, sector 905474016
    [127.294777] ata1.00: exception Emask 0x0 SAct 0x7c0000 SErr 0x0 action 0x6
    [127.296798] ata1.00: irq_stat Ox40::c008
    [127.298796] ata1.00: failed command: READ FPDMA QUEUED
    [127.300789] ata1.00: cmd 60/08:b0:d8:6f:f8/00:00:35:00:00/40 tag 22 ncq 4096 in
    [127.300789] res 41/84:08:d8:6f:f8/00:00:35:00:00/00 Emask 0x410 (ATA bus error) <F>
    [127.304770] ata1.00: status: { DRDY ERR }
    [127.306640] ata1.00: error: { ICRC ABRT }
    [127.778700] end_request: I/O error, dev sda, sector 905474608
    [127.814471] ata1.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x6
    [127.816498] ata1.00: irq_stat 0x40090008
    [127.818496] ata1.00: failed command: READ FPDMA QUEUED
    [127.820486] ata1.00: cmd 60/08:d8:00:61:f8/00:00:35:00:00/40 tag 27 ncq 4096 in
    [127.820486] res 41/84:08:02:61:f8/00:00:35:00:00/00 Emask 0x410 (ATA bus error) <F>
    [127.824461] ata1.00: status: { DRDY ERR }
    [127.826326] ata1.00: error: { ICRC ABRT }
    [17€.360152] end_request: I/O error, dev sda, sector 905470208
    [128.316459] ata1.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x6
    [128.318496] ata1.00: irq_stat ex40ee0ee8
    [128.320499] ata1.00: failed command: READ FPDMA QUEUED
    [128.322488] ata1.00: cmd 60/08:d0:08:61:f8/00:00:35:00:00/40 tag 26 ncq 4096 in
    [128.322488] res 41/84:08:0e:61:f8/00:00:35:00:00/00 Emask 0x410 (ATA bus error) <F>
    [128.326473] ata1.00: status: { DRDY ERR }
    [128.328352] ata1.00: error: { ICRC ABRT }
    [128.799371] end_request: I/O error, dev sda, sector 905470216
    Starting Create list of required static device nodes for the current kernel...
    Listening on udev Kernel Socket.
    Listening on udev Control Socket.
    Starting udev Coldplug all Devices...

  2. #2
    stevea Guest

    Re: [HDD] hardware problem?

    You are seeing data scrambled between the disk controller and the system memory - the ICRC errors.
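
    To get a rough tally of how often this happens during a boot, something like the sketch below could be used. It only counts log lines that mention ICRC and lines that report an I/O error; the function name count_ata_errors is made up for illustration, and the exact wording of kernel messages can differ between kernel versions.

```shell
#!/bin/sh
# Tally ATA CRC errors and block-layer I/O errors from kernel log text on stdin.
count_ata_errors() {
    awk '/ICRC/       { crc++ }   # libata "error: { ICRC ABRT }" lines
         /I\/O error/ { io++ }    # "end_request: I/O error, ..." lines
         END { printf "icrc=%d io_errors=%d\n", crc, io }'
}

# Typical use on a live system:
#   dmesg | count_ata_errors
```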

    The most suspect parts are:
    1/ The disk controller or power
    2/ The SATA cable
    3/ The system memory

    You should always run memtest86+ overnight as a system burn-in. That will check memory intensively.

    For the disk problems you'll want to examine the output of -
    sudo smartctl -H /dev/sda
    but also post result of
    sudo smartctl -x /dev/sda

    PLEASE POST THE ENTIRE OUTPUT

    Since this is a transient problem I expect they will show nothing except the ICRC count, but you might catch an error with one of the smartctl disk self-tests (see the "-t " option of smartctl).
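
    A minimal sketch of how the self-test and the CRC count could be checked. It assumes the drive reports the interface CRC counter as attribute 199 (often named UDMA_CRC_Error_Count); not every drive does, and the helper name udma_crc_count is made up for illustration.

```shell
#!/bin/sh
# Print the raw value of SMART attribute 199 from `smartctl -A` output on
# stdin. Prints nothing if the drive does not report that attribute.
udma_crc_count() {
    # Columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    awk '$1 == 199 { print $10 }'
}

# Typical use (needs root and a real drive, so shown as comments only):
#   sudo smartctl -t short /dev/sda       # start a short self-test
#   sudo smartctl -l selftest /dev/sda    # read the self-test log afterwards
#   sudo smartctl -A /dev/sda | udma_crc_count
```

    If that raw count keeps climbing between boots, the errors are happening on the link rather than on the platters.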


    I'd normally say "replace the SATA cable and check the power supply", but since this is from Lenovo I expect it's a laptop. You might try re-seating the drive (electrically disconnect & reconnect it, according to the service manual).

  3. #3
    Join Date
    Aug 2014
    Location
    Oregon
    Posts
    4

    Re: [HDD] hardware problem?

    Quote Originally Posted by stevea
    You should always run memtest86+ overnight as a system burn-in. That will check memory intensively.
    running...

    Quote Originally Posted by stevea
    For the disk problems you'll want to examine the output of -
    sudo smartctl -H /dev/sda
    but also post result of
    sudo smartctl -x /dev/sda
    I did a long test but could not produce any errors, it seems.
    Code:
    $ sudo smartctl -t long /dev/sda
    Code:
    $ sudo smartctl -H /dev/sda
    smartctl 6.2 2014-07-16 r3952 [x86_64-linux-3.15.10-200.fc20.x86_64] (local build)
    Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
    
    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED
    Please note the following marginal Attributes:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
    184 End-to-End_Error        0x0032   090   090   099    Old_age   Always   FAILING_NOW 10
    Code:
    $ sudo smartctl -x /dev/sda
    see http://pastebin.ubuntu.com/8190512/




    Quote Originally Posted by stevea

    You might try re-seating the drive (electrically disconnect & reconnect it, according to the service manual).
    I took it out and put it back in. At first I thought the errors were gone, but just now another boot failed completely.

  4. #4
    dobbi Guest

    Re: [HDD] hardware problem?

    Could you test that drive with another PSU (power supply unit), i.e. in another system?

    I have a USB HDD that malfunctions any time some event disturbs its power brick; I need to replace either the brick or the power strip it is connected to.

    You could also make a video and send it to the help desk; they may accept it.

  5. #5
    Join Date
    Aug 2014
    Location
    Oregon
    Posts
    4

    Re: [HDD] hardware problem?

    Quote Originally Posted by dobbi
    Could you test that drive with another PSU (power supply unit), i.e. in another system?
    No, I only have this Lenovo G700.

    The memtest had 4 errors in the first pass. (see screenshot) Does this mean the HDD is functioning and the RAM is faulty?


  6. #6
    dobbi Guest

    Re: [HDD] hardware problem?

    Yep, you have memory problems.

    But don't take my word for it; read what people have told others about it.

    Code:
    The most important number first: The error count for healthy memory should be 0. Any number above 0 may indicate damaged/faulty sectors.
    source: https://superuser.com/questions/3260...-a-memtest-run

    And the quote that put a smile on my face.

    Code:
    #1 You want to keep using memory with any errors? –  Daniel Beck♦ Aug 21 '11 at 19:38
    source: https://superuser.com/questions/3260...-a-memtest-run

    TST = Test

    #5 = Test 5 [Moving inversions, 8 bit pat]

    source: http://www.memtest86.com/technical.htm#results

    This may or may not be affecting other systems like the HDD controller. But don't be optimistic here; err on the side of doom: the memory is bad and it is the cause of all your malaises.

    Resources:
    http://www.memtest.org/

    ICRC ABRT errors:
    http://support.wdc.com/techinfo/general/errorcodes.asp (#115 - crc error)
    http://tali.admingilde.org/linux-doc...ml#excatDevErr
    https://superuser.com/questions/6412...ut-log-entries (great read)
    https://ata.wiki.kernel.org/index.ph...error_messages (If anybody is interested in decoding the errors messages in dmesg for ATA here is how)
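
    A minimal sketch of that decoding, applied to the "cmd" field of the failed commands in the first post. It assumes the field order that libata prints (command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device) and that, for NCQ commands, the feature bytes hold the sector count while nsect holds the tag shifted left by three bits. The function name decode_ata_cmd is made up for illustration.

```shell
#!/bin/sh
# Decode the "cmd" field that libata logs for a failed NCQ command.
decode_ata_cmd() {
    # Split on '/' and ':' into the twelve hex bytes $1 .. ${12}.
    set -- $(printf '%s\n' "$1" | tr '/:' '  ')
    # NCQ: sector count lives in hob_feature:feature, tag in nsect bits 7..3.
    sectors=$(( (0x$7 << 8) | 0x$2 ))
    tag=$(( 0x$3 >> 3 ))
    # 48-bit LBA, most significant byte first: hob_lbah .. lbal.
    lba=$(( (0x${11} << 40) | (0x${10} << 32) | (0x$9 << 24) | (0x$6 << 16) | (0x$5 << 8) | 0x$4 ))
    echo "tag=$tag sectors=$sectors lba=$lba"
}

decode_ata_cmd '60/08:60:e0:6f:f8/00:00:35:00:00/40'
```

    For the first logged command this prints tag=12 sectors=8 lba=905474016, which agrees with the "sector 905474016" in the end_request line that follows it, and 8 sectors of 512 bytes is the "ncq 4096" in the same line.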

    Just to reiterate what has already been said:

    - This is an interface problem.
    - The usual suspects are memory, cables, controllers, system drivers.
    - S.M.A.R.T. is useless in this case, since it can only detect problems inside the drive, not in the path.

    But in your case:
    Code:
    184 End-to-End_Error        0x0032   090   090   099    Old_age   Always   FAILING_NOW 10
    I didn't check the error code because I don't know what HDD you are using, but "end-to-end error" looks to me like another name for an ICRC or CRC error.

    Nope, I was wrong, apparently.
    SMART attribute 184 (end-to-end error) indicates that the drive's on-PCB cache is going bad. The attribute is used to compare parity data between the original piece of data sent by the controller (in your PC/system) and the drive itself. This is not "bad system RAM", this is not "a bad motherboard", this is not "bad NAND flash on the SSD". It's bad cache on the SSD.
    source: https://communities.intel.com/message/127998
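
    As for why smartctl marks that attribute FAILING_NOW: as far as I know, smartmontools treats an attribute as failed when its normalized VALUE is less than or equal to THRESH, and here 090 is below 099. A rough sketch that applies the same comparison to `smartctl -A` output (the function name failing_attrs is made up for illustration):

```shell
#!/bin/sh
# List SMART attributes whose normalized VALUE is at or below THRESH.
# Reads `smartctl -A` output on stdin.
failing_attrs() {
    # Columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    # "+ 0" forces a numeric comparison, so "090" is read as 90.
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 && ($4 + 0) <= ($6 + 0) {
             print $1, $2, "value=" ($4 + 0), "thresh=" ($6 + 0)
         }'
}

# Typical use:
#   sudo smartctl -A /dev/sda | failing_attrs
```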

    Unless it is a Seagate, which they claim uses proprietary mumbo-jumbo codes that can't be read correctly by any 3rd-party apps.
    Code:
    Here is their answer: "What gave you that SMART error? It seems to me the
    Linux diagnostic? Our SMART values are proprietary and do not conform to
    the industry standard. That is why 3rd party tools cannot correctly
    read our drives.
    
    To check on the condition of the drive, download Seatools for DOS (it
    boots into Free BSD and works with Linux).
    Here is the download link:
    http://www.seagate.com/ww/v/index.jsp?locale=en-US&name=SeaTools&vgnextoid=720bd20cacdec010VgnVCM100000dd04090aRCRD
    
    Seatools for DOS is an ISO image file that is burnt to CD. You boot with
    that CD and run the long test that will examine every sector on that
    drive. If Seatools indicates the drive has a problem, you should
    exchange it (it will generate an error code)."
    source: http://sourceforge.net/p/smartmontoo...sage/24435188/

    In your case memory seems to be the cause. There may be other causes too, but memory is more than likely one of them.
    Last edited by dobbi; 31st August 2014 at 02:04 PM.

  7. #7
    Join Date
    Jan 2011
    Posts
    15

    Re: [HDD] hardware problem?

    Quote Originally Posted by md7sum
    The memtest had 4 errors in the first pass. (see screenshot) Does this mean the HDD is functioning and the RAM is faulty?
    Usually a memory diagnostic run on failing memory shows a characteristic pattern. Your results do not show one of the typical patterns, which implies the memory errors are due to something other than defective memory.

    SMART would usually have identified a defective drive. The pattern of those failures also implies faults caused by something else.

    You need not replace a PSU. A comprehensive reply that identifies or exonerates many suspects requires only some instructions (provided on request), one full minute of labor, and a multimeter. The resulting numbers can quickly identify the reason for your intermittent failures in a follow-up reply, without disconnecting a single connector. With laptops, however, those important numbers are sometimes not easy to obtain.

    Another powerful diagnostic is heat. For example, memory can be completely defective yet work just fine in a 70 degree F room, while memtest would identify one or many defects when run in a 100 degree F room or when the memory is heated by a hairdryer on its highest setting. Those temperatures are ideal for any properly working semiconductors.

    Those are two suggestions for identifying what is defective long before replacing anything, and for obtaining useful recommendations quickly.

  8. #8
    stevea Guest

    Re: [HDD] hardware problem?

    dobbi called it - you have a memory problem. This would explain the other symptoms.

    You might want to try the most current firmware (v3.07 according to Lenovo), but the changelog for that version lists no relevant changes. If you are far behind 3.07 there MIGHT have been changes that are relevant, but it's unlikely.

    The S.M.A.R.T. 184 parameter is well described on Wikipedia - and as dobbi conjectures, it's related to data path errors (parity or CRC).

    Lenovo may or may not accept memtest86+ errors as sufficient for a return, but you'll need to run some intensive memory test to see this problem.

    ===

    Memory is a major subsystem of any computer - it's a lot more than RAM chips. Software memory tests necessarily test the entire path from CPU to controllers (on chip these days) to buses to the memory parts. Errors anywhere along this path and back may show up. It's a bit silly to worry about exactly where the problem is, unless you have access to hardware analyzers, or at least have replaceable components (unlike a laptop).

    There is a clear and obvious pattern from those 4 lines of errors.

    The errors occur pairwise across 64 bits (so 2 pairs of errors).
    The addresses are apparently unrelated wrt dram access.
    The addresses do not align on a cache-line boundary ...
    - the two errors occurred at "addresses mod 64" of 16 and 48 bytes respectively - unaligned wrt cache lines.
    The data written (and expected to be read) was suspiciously identical for both errors.
    The data actually read is unrelated to the data pattern written and flushed, but is oddly identical in both cases:
    - 0xe4fb1a17-88000040. The first word looks suspiciously like the address of a PCI MBAR register or other memory-mapped device (often at 0xExxxxxxx), tho' that's speculative.
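
    The "addresses mod 64" check above is simple arithmetic, assuming 64-byte cache lines. A small sketch follows; the two addresses are made-up placeholders chosen to give the same offsets, since the real ones are only in the memtest screenshot, and the function name cacheline_offset is likewise just for illustration.

```shell
#!/bin/sh
# Offset of an address within a 64-byte cache line.
# 0 means the address sits on a cache-line boundary.
cacheline_offset() {
    echo $(( $1 % 64 ))
}

# Hypothetical failing addresses, NOT the ones from the screenshot.
cacheline_offset 0x10010
cacheline_offset 0x20030
```

    Offsets of 16 and 48, as reported above, mean neither failing address sits at the start of a cache line.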

    This suggests that the bad pattern likely appeared on the data bus while the read was clocked in, but was never in dram at the indicated address. There are many possible explanations, but it's most likely that some device incorrectly enabled its bus outputs when it should not have ("should not" according to the software, not electrical signals).

    This sort of pattern is common when there are timing or noise problems on the buses. If some over-clocker/mobo voltage tweaker observed this, I'd refer them to the default settings and power. On a laptop it's very unlikely that you've tweaked the timing or multipliers.

    --

    Since we generally have neither bus analyzers nor the ability to replace/swap the major components, this sort of memtest pattern review is MOSTLY pointless.

    "It's broken" and occasionally "this dram stick seems broken" is about as much resolution as you should expect from memtest86+.



    This is from the parent app "memtest"

    http://www.memtest.org/download/1.55...st86+-1.55/FAQ

    - What do I do when I get errors?

    Firstly, don't start drawing any conclusions. You only know that memtest86+
    is giving you errors, not what the cause is. Unfortunately it is not a
    straightforward exercise to decisively test the memory in an actual system.
    This is because a computer is not just built up of some memory, but also
    includes many other elements such as a memory controller, cache, a cache
    controller, algorithmic and logic units, etc, all of which contribute to the
    machine. If there are faults in any of these other parts of the computer you
    will likely also see errors showing up in memtest.

    So what to do? First verify that the BIOS settings of your machine are
    correctly configured. Look up the memory timing settings applicable to the
    brand and type of memory modules you have and check they match your BIOS
    settings, correct them if they don't and run memtest again.

    Ok, you have all the settings correctly set and you're still getting errors.
    Well of course a very likely cause are the memory modules and the logical
    course of action is to look into them further.

    If you are well stocked, have a few other machines at your disposal, or just
    want to spend the cash for some new modules the best way to test if the
    cause are your memory modules is just to replace them and test again. If you
    are less fortunate though there is still something you can do.

    If you have more than one module in your system, test them one by one, if
    one is consistently giving errors and another is consistently showing no
    errors it's a pretty good bet that the module giving the errors is simply
    defective. To exclude the possibility that a defective slot is throwing your
    results, use the same slot to test each different module.

    If each module by itself shows no errors, but when you place two or more
    modules into the machine at the same time you do get errors, you are most
    likely stuck with a compatibility issue and unfortunately there isn't a
    whole lot you can do about it. Be sure to check your computer/motherboard
    manual to see if the setup you are trying is allowed, some boards require
    special restrictions in the sizes of modules, the order of modules, the
    placement of double sided and single sided modules and more of such things.

    If you have only one module in your system, or all modules are giving
    errors, there are only very few options left. The only thing you can do
    really is to try the module(s) in another slot. Finally simply try out
    different orders of the memory modules, although your manual might not
    mention anything on the matter sometimes there simply exist timing or other
    issues which can be resolved by changing the order of your modules. And of
    course test each slot by putting a single module into that slot and running
    memtest on it.

    In the end if you still have not been able to localize the problem you will
    have to find a replacement module to establish whether the problem lies in
    your modules. See if you can borrow a module from someone else.

    When you have replaced the memory by new memory and the errors still
    persist, first check if you can rule out any compatibility issues or timing
    issues. If you are sure the memory should work in the system the cause of
    the errors must obviously lie someplace else in the system.

    The only way to find out where, is by trial and error really. Simply start
    replacing and/or removing parts of your computer one by one, running memtest
    each time you changed anything, until the errors are resolved.


    - I'm getting errors in test #x, what does that mean?

    Interpreting memtest results is as scientific an endeavour as testing
    whether a person is a witch by the methods used in Monty Python's Holy
    Grail. In short, don't even start, it's not going to get you anywhere. Just
    interpret any error as you should any other and use the methods described in
    the previous question to determine the cause.

    There are few options on a laptop w/ internal memory.
    Last edited by stevea; 2nd September 2014 at 06:33 PM.

  9. #9
    Join Date
    Apr 2009
    Location
    central NY, USA
    Posts
    1,669

    Re: [HDD] hardware problem?

    FWIW - I've had a couple machines of late where re-seating the memory fixed my 'bad memory' problems.

  10. #10
    Join Date
    Aug 2014
    Location
    Oregon
    Posts
    4

    Re: [HDD] hardware problem?

    Thank you all for your help and additional information.

    I decided to print the memtest result and send it to Lenovo together with the laptop. It seems they replaced the RAM, and everything now works as usual.
