A “bad hard drive” on Linux can mean two very different things, and your recovery plan depends on which monster is hiding under the bed. Monster one is filesystem corruption: the drive itself is basically okay, but ext4, XFS, or another filesystem got scrambled after a crash, sudden power loss, or an unfortunate dance with a flaky USB cable. Monster two is physical media failure: bad sectors, mounting errors, I/O timeouts, SMART warnings, clicking noises, disappearing devices, or the classic Linux love letter: Input/output error.

The trick is not to panic and immediately start firing random repair commands like a sysadmin in an action movie. On Linux, the safest order is simple: stop writing to the disk, identify the device, check SMART health, clone the drive if it is failing physically, and only then repair the filesystem. That order matters. If you skip straight to “repair,” you can turn a recoverable mess into a memorable disaster.

This guide walks through the practical Linux workflow, explains which commands are safe, shows when to use smartctl, ddrescue, e2fsck, xfs_repair, and badblocks, and helps you decide when a drive should be fixed, cloned, retired, or politely escorted into the recycling bin.

What “Fixing” a Bad Drive Actually Means

Let’s clear up a common misconception: software usually cannot “heal” a dying drive the way a bandage helps a scraped knee. What Linux tools can do is:

  • Diagnose whether the problem is physical or logical.
  • Reduce additional damage by stopping unnecessary writes.
  • Rescue readable data from a failing drive.
  • Repair filesystem metadata corruption on a good clone or on a drive that is still physically stable.
  • Mark bad blocks so ext-family filesystems avoid reusing them.

What Linux tools cannot do is magically restore a hard drive with spreading media damage, a failing controller, or heads that are one bad morning away from retirement. If the SMART counters keep climbing, the drive disappears from the bus, or read errors multiply during normal use, your real goal is not “fix the drive forever.” It is “save the data, replace the drive, and pretend this was all part of the maintenance plan.”

Step 1: Stop Writing to the Drive

If the drive is throwing errors, every write can make things worse. Journaling filesystems may also replay their journals automatically when mounted, which sounds helpful until you realize you wanted a frozen scene for recovery, not a surprise rewrite.

Start by finding the device:
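One way to do that, as a minimal sketch assuming lsblk from util-linux is installed: the helper name list_disks is made up for illustration, and wrapping the command in a function means nothing runs until you call it.

```shell
# Hypothetical helper: list block devices with enough detail to tell disks apart.
list_disks() {
    # MODEL and SERIAL help match a /dev name to the physical drive in your hand.
    lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT,MODEL,SERIAL
}
```

Call list_disks, then write down the suspect device name before doing anything else.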

A listing from a tool such as lsblk gives you the layout so you do not accidentally repair the wrong disk. Linux does not judge, but it also does not forgive typos like /dev/sdb versus /dev/sdc.

Unmount any mounted partitions on the suspect drive:
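A small sketch of that step follows. The function name unmount_partition is hypothetical, and /dev/sdX1 in the usage note is a placeholder for your own partition.

```shell
# Hypothetical helper: unmount one partition of the suspect drive.
# $1 is the partition, for example /dev/sdX1 (placeholder).
unmount_partition() {
    sudo umount "$1"
}
```

Usage would look like unmount_partition /dev/sdX1, repeated for each mounted partition on the drive.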

If you need to keep the device as untouched as possible, set it read-only:
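A sketch of one way to do this uses blockdev from util-linux. The function name set_readonly is an assumption for illustration, and the device you pass in is your own.

```shell
# Hypothetical helper: mark a block device read-only at the kernel level.
set_readonly() {
    sudo blockdev --setro "$1"   # ask the kernel to reject writes to this device
    sudo blockdev --getro "$1"   # prints 1 when the read-only flag is set
}
```

Run it as set_readonly /dev/sdX and check that the last line printed is 1.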

If you absolutely must mount an ext3 or ext4 filesystem just to inspect files, use read-only mode with noload so the journal is not replayed:
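As a hedged sketch, the mount could look like the following. The function name and the default mount point /mnt/recovery are assumptions, not something the guide prescribes.

```shell
# Hypothetical helper: mount an ext3/ext4 partition read-only without journal replay.
mount_readonly_noload() {
    part="$1"                    # e.g. /dev/sdX1 (placeholder)
    mnt="${2:-/mnt/recovery}"    # hypothetical mount point
    sudo mkdir -p "$mnt"
    # ro = read-only, noload = do not load (and so do not replay) the journal
    sudo mount -o ro,noload "$part" "$mnt"
}
```

Called as mount_readonly_noload /dev/sdX1, it mounts the partition under /mnt/recovery.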

That is the “look, don’t touch” setting. Think museum glass, but for a panicking disk.

Step 2: Check the Symptoms Before You Swing a Hammer

Now gather clues. A bad drive usually leaves fingerprints:

  • Physical failure signs: I/O errors, bad sectors, rising SMART counters, timeouts, device resets, the drive vanishing after heavy reads, or painfully slow access.
  • Filesystem corruption signs: the disk shows up normally, but a partition will not mount, Linux reports a dirty filesystem, or you see messages like Structure needs cleaning.
  • Connection problems: an external enclosure, weak USB power, a bad SATA cable, or a flaky adapter can imitate a dying disk surprisingly well.

Kernel logs often tell the story:
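Here is a rough sketch for pulling the interesting lines out of the kernel log. The function name and the search patterns are assumptions: they are a starting point rather than a complete list, so read the unfiltered log too if something looks off.

```shell
# Hypothetical helper: show kernel messages that commonly accompany disk trouble.
kernel_disk_errors() {
    # -T prints human-readable timestamps; grep keeps only the suspicious lines.
    sudo dmesg -T | grep -iE 'I/O error|medium error|resetting link|USB disconnect'
}
```

On systemd-based systems, journalctl -k shows the same kernel messages if you prefer the journal.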

If you see repeated link resets, controller errors, or USB disconnects, test a different cable, port, enclosure, or power source before sentencing the drive. Sometimes the villain is not the disk. Sometimes it is the five-dollar adapter that came free with something else and has been lying to you for months.

Step 3: Use SMART Data to Judge the Drive’s Health

The next stop is SMART, the built-in health reporting system for modern drives. On Linux, the standard tool is smartctl from the smartmontools package.
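A minimal sketch of a first look, assuming smartmontools is installed. The function name smart_overview is illustrative only.

```shell
# Hypothetical helper: print identity, health verdict, and the full SMART report.
smart_overview() {
    dev="$1"                  # whole disk, e.g. /dev/sdX (placeholder)
    sudo smartctl -i "$dev"   # identity: model, serial, firmware, SMART support
    sudo smartctl -H "$dev"   # overall health self-assessment
    sudo smartctl -a "$dev"   # full report: attributes, error log, self-test log
}
```

Drives behind some USB bridges may need an extra device-type option such as -d sat before smartctl can talk to them. Whether yours does depends on the enclosure.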

Pay close attention to attributes and logs related to:

  • Reallocated sectors
  • Current pending sectors
  • Uncorrectable sectors
  • Reported uncorrectable errors
  • Command timeouts

A single weird reading does not always mean instant death, but counters that are nonzero and climbing are bad news. In practice, a drive with growing pending or uncorrectable sectors is not a “repair project.” It is a “get your files off now” project.
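To watch just those counters, a sketch like the one below can filter the attribute table. The names are the ones smartctl commonly prints for ATA drives. Vendors differ and NVMe drives report health differently, so treat the pattern as an assumption to adjust.

```shell
# Hypothetical helper: show only the SMART attributes that signal media trouble.
smart_key_attributes() {
    sudo smartctl -A "$1" |
        grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|Reported_Uncorrect|Command_Timeout'
}
```

Run it now, run it again after the next self-test or heavy read, and compare the raw values in the last column.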

Run SMART self-tests too:
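A sketch of starting a test follows. The function name start_selftest and its argument layout are assumptions for illustration.

```shell
# Hypothetical helper: start a SMART self-test (short by default).
start_selftest() {
    dev="$1"
    kind="${2:-short}"    # accepted values here: short or long
    case "$kind" in
        short|long) sudo smartctl -t "$kind" "$dev" ;;
        *) echo "usage: start_selftest DEVICE [short|long]" >&2; return 1 ;;
    esac
}
```

The test runs inside the drive in the background. smartctl prints an estimated completion time and returns immediately.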

Then check the results:
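For example, a short sketch that reads the self-test log once the estimated time has passed. The function name is hypothetical.

```shell
# Hypothetical helper: print the drive's self-test log.
selftest_results() {
    sudo smartctl -l selftest "$1"
}
```

A run that ended in a read failure is listed with the logical block address of the first error, which is worth noting before you move on.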

The short test is a quick screening pass. The long test is more thorough and can take from tens of minutes to several hours depending on capacity and health. If the long test stops with read failures or the drive slows to a crawl, take the hint. The disk is not being dramatic. It is asking for retirement.

Step 4: If the Drive Is Failing Physically, Clone It First

This is the most important step in the whole guide.

If the drive has bad sectors, read timeouts, or rising SMART error counts, do not run aggressive repairs first. Clone the drive or image it first with ddrescue. Unlike old-school dd, GNU ddrescue is designed for failing media. It copies the easy, readable parts first, keeps a mapfile of what was rescued, and lets you resume later without starting over.

Create an image on a healthy destination drive with enough free space:
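A minimal sketch of that first pass, assuming GNU ddrescue is installed. The function name and the paths in the usage note are placeholders. Keep both the image and the mapfile on the healthy destination, never on the failing source.

```shell
# Hypothetical helper: fast first pass with GNU ddrescue.
rescue_first_pass() {
    src="$1"    # failing source, e.g. /dev/sdX
    img="$2"    # image file on a healthy disk
    map="$3"    # mapfile that records what has been rescued
    # -n (--no-scrape) skips the slow scraping of damaged areas on this pass.
    sudo ddrescue -n "$src" "$img" "$map"
}
```

Usage might look like rescue_first_pass /dev/sdX /mnt/rescue/disk.img /mnt/rescue/disk.map.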

That first pass grabs the low-hanging fruit without wasting time fighting every damaged spot. Then make a slower retry pass:
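Sketched under the same assumptions (GNU ddrescue, hypothetical function name, placeholder paths), the retry pass reuses the same image and mapfile so only the missing areas are attempted. Three retry passes is an arbitrary choice here.

```shell
# Hypothetical helper: slower retry pass over the areas the first pass skipped.
rescue_retry_pass() {
    src="$1"; img="$2"; map="$3"    # same three arguments as the first pass
    # -d reads the source with direct disc access, bypassing the kernel cache;
    # -r3 makes up to three retry passes over the bad areas.
    sudo ddrescue -d -r3 "$src" "$img" "$map"
}
```

Because the mapfile carries the state, you can interrupt this with Ctrl-C and run the same command later to continue.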

Why this order works:

  • The fast pass captures good data before the drive gets worse.
  • The mapfile records progress so you can stop and resume safely.
  • The retry pass focuses only on the trouble areas instead of rereading the entire disk.

If the source is a whole disk, image the whole disk. If only one partition matters and the hardware is stable enough, you can work partition by partition. Still, whole-disk imaging is usually smarter because partition tables and boot structures matter during recovery.

After imaging, do your repair work on the image or on a cloned replacement disk, not the failing original. That one has already done enough for the team.

Step 5: Repair ext4, ext3, or ext2 Filesystems Safely

If SMART looks decent and the real issue is logical corruption, the ext-family repair tool is e2fsck. Use it only on an unmounted filesystem. Even a no-write dry run gives unreliable results on a mounted filesystem, so treat that output as a hint at best.

Start with a dry run:
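A sketch of such a dry run, with a hypothetical function name and a placeholder partition:

```shell
# Hypothetical helper: check an ext2/3/4 filesystem without changing it.
ext_dry_run() {
    # -f forces a full check even if the filesystem is marked clean;
    # -n opens it read-only and answers "no" to every question.
    sudo e2fsck -fn "$1"
}
```

Run it against the unmounted partition or a rescued image, for example ext_dry_run /dev/sdX1.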

A dry run with e2fsck -n opens the filesystem read-only and answers “no” to every prompt, so it reports what it would change without modifying anything. If the output shows directory errors, inode problems, journal issues, or orphaned files, you have confirmation that the filesystem is damaged.

To actually repair it:
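One way to sketch the repair step is shown here. The function name and its --yes switch are inventions for this example. Only the e2fsck options themselves come from the real tool.

```shell
# Hypothetical helper: repair an ext2/3/4 filesystem, interactively by default.
ext_repair() {
    part="$1"
    if [ "${2:-}" = "--yes" ]; then
        sudo e2fsck -f -y "$part"   # answer yes to every prompt
    else
        sudo e2fsck -f "$part"      # ask before each fix
    fi
}
```

The default stays interactive on purpose, so auto-accepting is something you have to ask for.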

The -f option forces a full check. You can answer prompts interactively, which is often better than blindly auto-accepting everything. If this is a noncritical filesystem and you have a verified backup or clone, you can use -y to answer yes automatically, but do that with your eyes open.

If you suspect weak sectors on an ext filesystem and the hardware is stable enough, e2fsck can run a read-only bad-block scan and add the bad areas to the bad block inode:
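A sketch of that scan, with a hypothetical function name. Adding -k is an assumption about what you want: it keeps blocks already on the bad-block list instead of replacing the list with the new findings.

```shell
# Hypothetical helper: scan for bad blocks and record them in the filesystem.
ext_scan_bad_blocks() {
    # -c runs a read-only badblocks scan and adds what it finds to the bad block inode;
    # -k preserves entries that were already on the list.
    sudo e2fsck -f -c -k "$1"
}
```

Expect this to take a while, since it reads every block of the partition.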

That is useful when the media problem is limited and you are trying to keep an older disk alive just long enough for migration. It is not a miracle cure for a drive that is actively shedding sectors like autumn leaves.

Step 6: Repair an XFS Filesystem the Right Way

XFS plays by slightly different rules. The main repair tool is xfs_repair, and the filesystem must be unmounted. Also, fsck.xfs is basically a placeholder, not your hero.

First do a dry run:
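As a minimal sketch, with a made-up function name and a placeholder partition:

```shell
# Hypothetical helper: inspect an XFS filesystem without modifying it.
xfs_dry_run() {
    # -n is no-modify mode: report problems, change nothing.
    sudo xfs_repair -n "$1"
}
```

The filesystem must be unmounted first, for example xfs_dry_run /dev/sdX1.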

If the mount failed with Structure needs cleaning, XFS metadata or the log may be corrupted. A normal repair looks like this:
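A sketch of the plain repair, again with a hypothetical function name:

```shell
# Hypothetical helper: repair an unmounted XFS filesystem.
xfs_plain_repair() {
    sudo xfs_repair "$1"
}
```

If xfs_repair stops and complains about a dirty log, it will suggest mounting the filesystem to replay the log and unmounting it again before a second attempt.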

But there is one spicy option you should treat like hot sauce: -L.
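The sketch below puts a seat belt on that option. The confirmation switch --i-have-an-image is entirely made up for this example. xfs_repair itself only knows -L.

```shell
# Hypothetical helper: zero the XFS log, but only after an explicit confirmation.
xfs_zero_log_repair() {
    part="$1"
    if [ "${2:-}" != "--i-have-an-image" ]; then
        echo "refusing: image $part first, then pass --i-have-an-image" >&2
        return 1
    fi
    # -L zeroes the log; metadata updates still in the log are discarded.
    sudo xfs_repair -L "$part"
}
```

Making the dangerous path harder to type than the safe one is the whole point of the wrapper.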

This zeros the log and is a last resort when the log cannot be replayed. It may discard metadata updates that were in flight during the crash, which can cause real data loss. In plain English: use it only after you understand the risk and preferably after imaging the filesystem first.

Step 7: Use badblocks Carefully, Not Recklessly

badblocks is useful, but it is also one of those Linux tools that rewards caution and punishes enthusiasm.

A read-only scan is the least risky option:
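A sketch of a read-only scan follows. The function name and the default output file badblocks.txt are assumptions for illustration.

```shell
# Hypothetical helper: read-only surface scan that saves the list of bad blocks.
scan_read_only() {
    part="$1"                    # unmounted partition, e.g. /dev/sdX1
    out="${2:-badblocks.txt}"    # hypothetical output file
    # Read-only is the default mode; -s shows progress, -v is verbose,
    # -o writes the bad block numbers to a file.
    sudo badblocks -sv -o "$out" "$part"
}
```

One caveat: block numbers only mean something at a given block size, so a list produced here is only usable by e2fsck if badblocks ran with the filesystem's block size (the -b option). Letting e2fsck -c drive the scan avoids that mismatch.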

That can help you identify unreadable areas on an unmounted partition. For ext filesystems, you can later feed the list into e2fsck so the filesystem avoids those blocks.

What you should not do on a disk with valuable data is casually run badblocks in write-mode testing with the -w option.

That option overwrites data. Completely. It is fantastic if your goal is “test an empty device.” It is terrible if your goal is “keep my family photos.” Non-destructive read-write mode (-n) exists too, but on an unstable drive it still adds stress. In most real recovery situations, ddrescue first, badblocks later, and only if there is a good reason.

Step 8: What If the Root Filesystem Is Damaged?

If the broken filesystem is your Linux root partition, repair it from a live USB or rescue environment. Running filesystem repair on the partition you booted from is like trying to replace the floor while standing on it. Linux will object, and for once it is right.

The usual workflow is:

  1. Boot from a live Linux USB.
  2. Identify the root partition with lsblk.
  3. Do not mount it read-write.
  4. Run e2fsck -n or xfs_repair -n first.
  5. Repair only after reviewing the dry-run output.

If the drive itself is physically unstable, image it before you get ambitious. A rescue environment does not magically make bad sectors behave better.

Step 9: Know When to Replace the Drive

Sometimes the most professional fix is a replacement, not a heroic command sequence.

Replace the drive if:

  • SMART reports increasing pending, reallocated, or uncorrectable sectors.
  • Long self-tests fail.
  • The disk repeatedly disappears or resets under load.
  • ddrescue shows large unreadable regions or progress keeps degrading.
  • Filesystem corruption returns soon after repair.
  • The drive makes mechanical noises that sound like a tiny robot learning percussion.

A repaired filesystem on a dying drive is still a filesystem on a dying drive. The command succeeded; the mission did not.

How to Prevent This Problem Next Time

Linux gives you great tools, but prevention is still cheaper than recovery. A few habits go a long way:

  • Enable SMART monitoring with smartd or check smartctl regularly.
  • Keep backups using the 3-2-1 rule: three copies, two media types, one off-site copy.
  • Replace questionable SATA cables, adapters, and external enclosures early.
  • Shut systems down cleanly, especially for USB backup drives and home servers.
  • Test restores from backups, not just backup creation.

The best hard-drive repair story is the one that ends with, “No big deal, I restored from backup in twenty minutes and then made coffee.”

Experience-Based Lessons From Real Linux Drive Recovery

In real-world Linux recovery, the biggest mistake is usually not a wrong command. It is the wrong mindset. People often see a drive mount once, assume it is “mostly okay,” and then start copying files by hand, opening directories, retrying failed reads, rerunning checks, and generally poking the disk until it gets grumpier. A failing drive may give you one or two good reads and then fall apart under sustained access. That is why experienced Linux admins often act calm but move fast: they know the first hour matters more than the fanciest tool.

Another common experience is that the filesystem gets blamed for everything. ext4 gets accused. XFS gets accused. Linux gets accused. The cat may also get accused. But once SMART data and kernel logs are checked, the real problem is often hardware: a weak USB bridge, a half-dead enclosure, underpowered external storage, or a SATA cable that has decided reliability is optional. Swapping the cable or connecting the drive directly to SATA has rescued plenty of “bad drives” that were never truly bad in the first place.

There is also a pattern that repeats with older spinning disks: the first SMART warning is rarely the last one. A drive might show just a few pending sectors, still pass a basic health line, and even let you browse files normally. That false sense of security is dangerous. In practice, once a mechanical drive starts developing new bad sectors, many users discover that the error counts either hold steady for a short while or begin creeping upward. The experienced move is to treat that first warning as a migration signal, not as permission to keep using the drive for another six months because “it still works.”

Linux tools also teach humility. e2fsck and xfs_repair are powerful, but they are not magic wands. If the underlying media is unstable, repair tools may expose more corruption simply because they have to read parts of the disk that ordinary day-to-day use had not touched yet. That is why seasoned users love dry runs. A no-modify pass tells you whether you are looking at mild metadata damage or a much uglier story. It is the difference between needing a broom and needing a contractor.

And then there is ddrescue, which has saved an astonishing number of bad days because it respects reality. It does not pretend the drive is healthy. It assumes errors will happen, records what it has already read, and keeps moving. That approach feels wonderfully Linux: practical, patient, and just cynical enough to be correct. If you spend enough time around broken disks, you learn that recovery is not about finding the most dramatic command. It is about minimizing risk, preserving options, and knowing when the right answer is “clone first, repair later.” That mindset fixes more bad-drive situations than any single utility ever will.

Final Takeaway

If you want the shortest useful answer to “How do I fix a bad hard drive on Linux?”, here it is: do not write to the drive, check SMART, clone failing media with ddrescue, repair the filesystem only on an unmounted device or clone, and replace the drive if the hardware shows signs of deterioration.

Linux gives you excellent recovery tools, but it also assumes you know the difference between bravery and recklessness. Use smartctl to diagnose, ddrescue to rescue, e2fsck or xfs_repair to repair, and badblocks only when you understand exactly what kind of scan you are running. Do that, and you have a real chance of saving both your data and your dignity.

By admin