08.10.2024

smartctl and smartmontools: The Complete Linux Drive Health Monitoring Guide

smartctl is the primary command-line interface of the smartmontools package, designed to query, test, and interpret S.M.A.R.T. (Self-Monitoring, Analysis, and Reporting Technology) data embedded in the firmware of HDDs, SSDs, and NVMe drives. It communicates directly with drive firmware over ATA, SCSI, or NVMe interfaces to surface raw diagnostic telemetry that the operating system itself does not expose through standard I/O paths.

For any Linux administrator managing physical or virtual storage — whether on a bare-metal server, a Dedicated Server, or a locally-attached disk array — smartctl is the single most reliable tool for detecting imminent drive failure before it causes unrecoverable data loss.

What Is S.M.A.R.T. and Why Does It Matter

S.M.A.R.T. is a monitoring system built into virtually every consumer and enterprise storage device manufactured after 1996. It operates at the firmware level, continuously tracking dozens of internal parameters: read/write error rates, mechanical stress indicators, NAND wear levels, reallocated sector counts, and thermal readings.

The critical distinction most guides miss: S.M.A.R.T. data is predictive, not reactive. A drive can pass a filesystem check and serve I/O normally while simultaneously accumulating reallocated sectors at a rate that statistically predicts failure within weeks. smartctl surfaces this hidden degradation state.

S.M.A.R.T. operates across three storage interface families:

  • ATA/SATA — the original S.M.A.R.T. specification, most attribute-rich
  • SCSI/SAS — uses a different attribute model (Informational Exceptions log pages)
  • NVMe — exposes health data through the NVMe Health Information Log (Log Page 0x02), with metrics like available spare capacity, percentage used, and unsafe shutdowns

Installing smartmontools on Linux

The smartmontools package is available in every major Linux distribution's official repositories. Install the version appropriate for your environment:

Debian / Ubuntu:

```bash
sudo apt-get update && sudo apt-get install smartmontools
```

CentOS / RHEL 7:

```bash
sudo yum install smartmontools
```

CentOS Stream / RHEL 8+ / AlmaLinux / Rocky Linux:

```bash
sudo dnf install smartmontools
```

Fedora:

```bash
sudo dnf install smartmontools
```

Arch Linux:

```bash
sudo pacman -S smartmontools
```

openSUSE:

```bash
sudo zypper install smartmontools
```

After installation, verify the version and confirm NVMe support is compiled in:

```bash
smartctl --version
```

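The first line of the output reports the release number. NVMe support first appeared, as an experimental feature, in smartmontools 6.5, so on modern servers running NVMe SSDs, ensure you are running smartmontools 7.x or later. The device types your build accepts (including `nvme`) are listed under the `-d` option in `smartctl --help`.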
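The version check can be scripted. The following is a minimal sketch, assuming the first line of `smartctl --version` has the usual form `smartctl 7.2 2020-12-30 r5155 [...]`; the function name `smartctl_version_ok` is hypothetical and chosen for illustration.

```bash
# Read the smartctl version banner on stdin and report whether the
# major version is at least 7. Assumes the first line starts with
# "smartctl MAJOR.MINOR ...".
smartctl_version_ok() {
    awk 'NR == 1 && $1 == "smartctl" {
             split($2, v, ".")
             if (v[1] + 0 >= 7) print "ok"; else print "too old"
             found = 1
             exit
         }
         END { if (!found) print "unknown" }'
}

# Usage (requires smartmontools to be installed):
#   smartctl --version | smartctl_version_ok
```

Reading from stdin keeps the function usable in provisioning scripts and easy to check against a saved banner.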

Identifying Your Storage Devices

Before running any smartctl command, identify the correct device node. Mixing up device identifiers on a multi-disk system is a common and costly mistake.

```bash
lsblk -d -o NAME,SIZE,MODEL,TRAN
```

This outputs device names alongside the transport type (sata, nvme, usb), which directly informs which smartctl flags you will need. NVMe drives are listed by their namespace block device (`nvme0n1`, `nvme1n1`, and so on) rather than as `sdX`; the matching controller nodes are `/dev/nvme0`, `/dev/nvme1`, etc.
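To turn the names printed by `lsblk` into nodes for smartctl, a small helper can strip the namespace suffix from NVMe names. This is a sketch only; `smart_node_for` is a hypothetical name, and it assumes the standard Linux naming scheme (`nvme<controller>n<namespace>`).

```bash
# Map a block-device name as printed by lsblk to a /dev node.
# NVMe namespaces (nvme0n1) map to their controller (/dev/nvme0);
# every other name is returned as /dev/NAME unchanged.
smart_node_for() {
    case "$1" in
        nvme*n*) printf '/dev/%s\n' "${1%n*}" ;;
        *)       printf '/dev/%s\n' "$1" ;;
    esac
}

# Usage:
#   lsblk -dn -o NAME | while read -r name; do smart_node_for "$name"; done
```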

For hardware RAID controllers (LSI MegaRAID, Adaptec, HP Smart Array), the drives are hidden behind the controller and require explicit pass-through flags, covered in the advanced section below.

Core smartctl Commands

Viewing Device Identity Information

```bash
sudo smartctl -i /dev/sda
```

This queries the device's IDENTIFY DEVICE data and returns the model number, serial number, firmware revision, capacity, sector sizes (logical and physical, for example 512-byte logical on 4096-byte physical, which matters for partition alignment), and whether S.M.A.R.T. is available and enabled. On NVMe devices:

```bash
sudo smartctl -i /dev/nvme0
```

Running a Full Health Assessment

```bash
sudo smartctl -H /dev/sda
```

Returns a single-line verdict: `PASSED` or `FAILED`. A `FAILED` result means the drive's own firmware has determined that one or more critical thresholds have been crossed. A drive reporting FAILED should be treated as failed — do not wait for further confirmation.

However, a `PASSED` result does not mean the drive is healthy. It only means no threshold has been formally breached. This is why raw attribute analysis is essential.
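Scripts do not need to parse the verdict text: smartctl also reports its findings through its exit status, which is a bitmask documented in the EXIT STATUS section of the smartctl man page. The decoder below is a sketch; the function name `decode_smartctl_status` and the message wording are illustrative, while the bit meanings follow the man page.

```bash
# Print one line for every bit set in a smartctl exit status.
decode_smartctl_status() {
    rc=$1
    [ $((rc & 1))   -ne 0 ] && echo "bit 0: command line did not parse"
    [ $((rc & 2))   -ne 0 ] && echo "bit 1: device open failed"
    [ $((rc & 4))   -ne 0 ] && echo "bit 2: a SMART or ATA command failed"
    [ $((rc & 8))   -ne 0 ] && echo "bit 3: SMART status check returned DISK FAILING"
    [ $((rc & 16))  -ne 0 ] && echo "bit 4: prefail attributes at or below threshold"
    [ $((rc & 32))  -ne 0 ] && echo "bit 5: attributes were at or below threshold in the past"
    [ $((rc & 64))  -ne 0 ] && echo "bit 6: error log contains records"
    [ $((rc & 128)) -ne 0 ] && echo "bit 7: self-test log contains errors"
    return 0
}

# Usage:
#   sudo smartctl -H /dev/sda >/dev/null; decode_smartctl_status $?
```

Bits 5 through 7 can be set even when the verdict is `PASSED`, which is one way to surface a drive that has not formally failed but has a history of trouble.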

Displaying All S.M.A.R.T. Attributes

```bash
sudo smartctl -A /dev/sda
```

This is the most information-dense command in routine use. The output table contains several columns that require precise interpretation:

| Column | Meaning |
| --- | --- |
| **ID#** | Attribute identifier (vendor-specific, but many are standardized) |
| **ATTRIBUTE_NAME** | Human-readable name |
| **FLAG** | Bitmask indicating attribute type (pre-failure vs. advisory) |
| **VALUE** | Normalized value (typically 1–253); lower is worse for most attributes |
| **WORST** | Lowest VALUE ever recorded during the drive's lifetime |
| **THRESH** | Threshold at or below which the drive declares the attribute failed |
| **TYPE** | Pre-fail (critical) or Old_age (informational) |
| **RAW_VALUE** | The actual measured quantity in native units |

The RAW_VALUE is what you should primarily analyze. The normalized VALUE/WORST/THRESH system is useful for automated threshold detection but can be misleading — some manufacturers use non-standard normalization curves.
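For scripting, the attribute table can be reduced to the columns that matter. The sketch below assumes the standard ten-column ATA layout printed by `smartctl -A` (ID#, name, flag, VALUE, WORST, THRESH, type, updated, when-failed, raw); `smart_attr_summary` is a hypothetical helper name.

```bash
# Condense `smartctl -A` output (read from stdin) to one line per attribute.
# Only rows whose first field is a numeric ID are treated as attributes.
smart_attr_summary() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 {
             printf "%s %s value=%d worst=%d thresh=%d raw=%s\n",
                    $1, $2, $4, $5, $6, $10
         }'
}

# Usage:
#   sudo smartctl -A /dev/sda | smart_attr_summary
```

Only the first token of RAW_VALUE is kept, so annotations such as `(Min/Max 20/45)` on temperature attributes are dropped.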

Comprehensive Output: Combining Flags

For a complete picture in a single command, combine the information, health, and attributes flags:

```bash
sudo smartctl -a /dev/sda
```

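The lowercase `-a` flag is shorthand for a bundle of options. For ATA drives it is equivalent to `-H -i -c -A -l error -l selftest -l selective`; for SCSI drives, `-H -i -A -l error -l selftest`. This is the standard command to run when diagnosing an unfamiliar drive.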

For an even more verbose output including all log pages:

```bash
sudo smartctl -x /dev/sda
```

Critical S.M.A.R.T. Attributes and How to Interpret Them

Not all S.M.A.R.T. attributes carry equal diagnostic weight. The following are the attributes that experienced storage engineers treat as primary failure indicators:

High-Priority Failure Indicators

Reallocated_Sector_Ct (ID 5)

The count of sectors the drive has remapped to spare area due to read/write/verify errors. Any non-zero value on a drive under two years old warrants immediate attention. A steadily increasing count — even from 1 to 5 over a month — is a strong predictor of imminent failure. On enterprise drives, a small number may be acceptable depending on the manufacturer's specification.

Current_Pending_Sector (ID 197)

Sectors flagged as unstable and awaiting remapping. These sectors have produced errors during reads but have not yet been successfully remapped. A non-zero value here means the drive is actively struggling to read data. If a pending sector is subsequently written successfully, it may be remapped or cleared — but the underlying media is suspect.

Offline_Uncorrectable (ID 198), also reported as Uncorrectable_Sector_Count

The number of sectors found uncorrectable during offline scans or self-tests, meaning ECC could not recover their contents. This is the most severe attribute. A non-zero value here means data in those sectors has most likely already been lost. Immediate backup and drive replacement is the only appropriate response.

Reported_Uncorrect (ID 187)

On modern drives, this counts errors that the drive's internal ECC could not correct. High values indicate serious media degradation.

Spin_Retry_Count (ID 10)

For HDDs, repeated failures to spin up the platters to operating speed. Indicates mechanical stress on the spindle motor or bearings. Any non-zero value on a drive under heavy use is a red flag.

Command_Timeout (ID 188)

The count of commands that aborted due to timeout. Elevated values often indicate interface issues (cable, controller, or power delivery) rather than media failure — but they can also precede total drive failure.
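A quick filter for the media-related indicators above might look like the following. It is a sketch under the same assumption about the ten-column `smartctl -A` layout; the function name `smart_critical_nonzero` and the choice of IDs 5, 187, 197, and 198 are illustrative, not a vendor recommendation.

```bash
# Read `smartctl -A` output on stdin and print a warning for each of the
# selected attributes whose raw value is greater than zero.
smart_critical_nonzero() {
    awk '($1 == 5 || $1 == 187 || $1 == 197 || $1 == 198) &&
         NF >= 10 && $10 + 0 > 0 {
             printf "WARN %s (ID %d) raw=%s\n", $2, $1, $10
         }'
}

# Usage:
#   sudo smartctl -A /dev/sda | smart_critical_nonzero
```

Matching on the numeric ID rather than the name avoids surprises when a vendor labels the same attribute differently.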

Secondary Monitoring Attributes

Raw_Read_Error_Rate (ID 1)

Frequently misread: on Seagate drives, this attribute has a very high raw value by design, as it represents a ratio encoded in a 48-bit field. A raw value of several million on a Seagate drive is normal. On Western Digital and other manufacturers, the raw value should be near zero. Always cross-reference with the manufacturer's documentation before alarming on this attribute.

Power_On_Hours (ID 9)

Total operational hours. Consumer HDDs are typically rated for 20,000–25,000 hours (roughly 2–3 years of continuous operation). Enterprise drives are rated for 55,000+ hours. Use this to contextualize other attribute values.

Temperature_Celsius (ID 194)

The current drive temperature. The optimal operating range for most HDDs is 25–45°C. Sustained temperatures above 55°C accelerate bearing wear and magnetic media degradation. For SSDs, the thermal tolerance is generally higher, but sustained temperatures above 70°C will accelerate NAND wear. On servers without adequate airflow — a common issue in dense rack deployments — thermal throttling can mask itself as I/O latency before the temperature attribute crosses any formal threshold.

Wear_Leveling_Count (ID 177) / Media_Wearout_Indicator (ID 233)

SSD-specific attributes tracking NAND endurance consumption. When the normalized VALUE approaches the THRESH value, the SSD is approaching its rated write endurance. Plan replacement proactively.

Power_Cycle_Count (ID 12)

Frequent power cycling causes mechanical stress on HDDs (head parking/unparking, spindle motor stress) and, to a lesser extent, electrical stress on SSDs. Unusually high counts relative to Power_On_Hours can indicate an unstable power environment.

Running Diagnostic Self-Tests

S.M.A.R.T. self-tests are executed by the drive's own firmware, not by the operating system. The drive continues to serve I/O during testing (with a minor performance impact), making these tests safe to run on production systems.

Short Self-Test

```bash
sudo smartctl -t short /dev/sda
```

Duration: typically 1–5 minutes. Tests the electrical and mechanical components and a small percentage of the disk surface. Useful for a quick sanity check. View results after completion:

```bash
sudo smartctl -l selftest /dev/sda
```

Long Self-Test (Extended Test)

```bash
sudo smartctl -t long /dev/sda
```

Duration: proportional to drive capacity — expect 1 hour per terabyte on a typical 7200 RPM HDD, and 20–60 minutes on most SSDs. Performs a complete surface scan of every sector. This is the definitive test for detecting bad sectors across the entire media surface.

Monitor progress without waiting for completion:

```bash
sudo smartctl -c /dev/sda
```

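On ATA drives, the `Self-test execution status` field in the output shows how much of the test remains as a percentage (for example, `90% of test remaining`). The estimated completion time is printed once, when the test is started.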
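Progress can be polled from a script by extracting that percentage. The sketch below assumes the ATA wording `NN% of test remaining` in the `smartctl -c` output; `selftest_remaining` is a hypothetical helper name.

```bash
# Read `smartctl -c` output on stdin and print the percentage of the
# running self-test that remains, or "none" if no test is in progress.
selftest_remaining() {
    awk '/% of test remaining/ {
             for (i = 1; i <= NF; i++)
                 if ($i ~ /^[0-9]+%$/) {
                     sub(/%/, "", $i)
                     print $i
                     found = 1
                     exit
                 }
         }
         END { if (!found) print "none" }'
}

# Usage:
#   sudo smartctl -c /dev/sda | selftest_remaining
```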

Conveyance Self-Test

```bash
sudo smartctl -t conveyance /dev/sda
```

Available on most ATA drives. Designed to detect damage incurred during shipping or physical handling. Checks for issues caused by mechanical shock. Duration is typically 5 minutes. This test is underused — it is particularly valuable when commissioning new hardware or after a server has been physically relocated.

Selective Self-Test

```bash
sudo smartctl -t select,0-1000000 /dev/sda
```

Allows testing a specific LBA range rather than the entire drive. Invaluable when filesystem checks have flagged errors in a specific region and you need to confirm whether the underlying media is responsible.
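When an error has been traced to a particular LBA, the range argument can be computed rather than typed by hand. This is a minimal sketch; `select_range` and its margin parameter are hypothetical, and the margin to use is a judgment call.

```bash
# Build a "select,START-END" argument covering MARGIN sectors on either
# side of a suspect LBA, clamping the start at sector 0.
select_range() {
    lba=$1
    margin=$2
    start=$((lba - margin))
    [ "$start" -lt 0 ] && start=0
    printf 'select,%d-%d\n' "$start" "$((lba + margin))"
}

# Usage:
#   sudo smartctl -t "$(select_range 12345678 1000)" /dev/sda
```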

NVMe-Specific Health Monitoring

NVMe drives use a fundamentally different attribute model. The primary health command:

```bash
sudo smartctl -a /dev/nvme0
```

Key NVMe-specific fields to monitor:

  • Available Spare: Percentage of spare NAND blocks remaining. Below 10% warrants replacement planning.
  • Available Spare Threshold: The manufacturer-defined threshold below which the drive may report degraded reliability.
  • Percentage Used: Estimated percentage of rated write endurance consumed. At 100%, the drive has reached its rated TBW (Terabytes Written) — it may continue to function, but reliability guarantees no longer apply.
  • Data Units Read / Written: Cumulative I/O in 512,000-byte units. Useful for calculating actual workload vs. rated TBW.
  • Media and Data Integrity Errors: Unrecovered media errors or data integrity errors detected by the NVMe controller. Any non-zero value is serious.
  • Number of Error Information Log Entries: Count of error log entries. A rapidly growing count indicates persistent issues.
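  • Supported Power States: The power states the controller advertises, with their maximum power draw and entry/exit latencies. Relevant for latency-sensitive workloads where the drive may drop into a deep power-saving state. smartctl lists the available states rather than the one currently active.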
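The fields above can be pulled out of the NVMe report and the write volume converted to bytes (one data unit is 512,000 bytes). The sketch assumes the field labels that smartctl prints for NVMe drives, such as `Percentage Used:` and `Data Units Written:`; `nvme_health_summary` is a hypothetical helper name.

```bash
# Read `smartctl -a /dev/nvmeX` output on stdin and print a one-line summary.
nvme_health_summary() {
    awk -F: '
        # Strip everything except digits and return the number.
        function num(s) { gsub(/[^0-9]/, "", s); return s + 0 }
        /^Available Spare:/                 { spare = num($2) }
        /^Percentage Used:/                 { used = num($2) }
        /^Data Units Written:/              { sub(/\[.*/, "", $2)
                                              written = num($2) * 512000 }
        /^Media and Data Integrity Errors:/ { errs = num($2) }
        END {
            printf "spare=%d%% used=%d%% written_bytes=%.0f media_errors=%d\n",
                   spare, used, written, errs
        }'
}

# Usage:
#   sudo smartctl -a /dev/nvme0 | nvme_health_summary
```

Dividing `written_bytes` by the drive's rated TBW gives an independent check on the Percentage Used figure.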

Hardware RAID and Pass-Through Access

When drives are connected through a hardware RAID controller, the OS sees the controller's virtual disk, not the physical drives. smartctl requires explicit pass-through to reach the drive firmware directly.

LSI MegaRAID:

```bash
sudo smartctl -a /dev/sda -d megaraid,0
sudo smartctl -a /dev/sda -d megaraid,1
```

HP Smart Array (hpsa driver):

```bash
sudo smartctl -a /dev/sda -d cciss,0
```

3ware RAID:

```bash
sudo smartctl -a /dev/twa0 -d 3ware,0
```

If you are unsure which pass-through type to use, smartctl can attempt auto-detection:

```bash
sudo smartctl --scan
```

This scans for all detectable storage devices and outputs the recommended device path and type flag for each.
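The scan output can be turned into ready-to-review health checks. The sketch below assumes each scan line has the form `DEVICE -d TYPE # comment`; `scan_to_health_cmds` is a hypothetical helper name, and it only prints the commands rather than running them.

```bash
# Read `smartctl --scan` output on stdin and print one health-check
# command per detected device, dropping the trailing "# ..." comment.
scan_to_health_cmds() {
    sed -e 's/[[:space:]]*#.*$//' -e '/^$/d' -e 's/^/smartctl -H /'
}

# Usage:
#   sudo smartctl --scan | scan_to_health_cmds
```

Printing instead of executing leaves room to inspect the device list first, which matters on multi-disk systems where a wrong node is costly.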

Enabling and Disabling S.M.A.R.T.

S.M.A.R.T. is enabled by default on virtually all modern drives. In rare cases — typically older drives or certain virtualized environments — it may be disabled:

```bash
sudo smartctl -s on /dev/sda
```

To disable (not recommended except for specific testing scenarios):

```bash
sudo smartctl -s off /dev/sda
```

Note: In virtualized environments such as KVM or VMware, S.M.A.R.T. pass-through to guest VMs depends on the hypervisor configuration. On a VPS Hosting environment, the hypervisor typically abstracts the physical drive, and smartctl inside the guest may return limited or no data. For full S.M.A.R.T. access, physical host-level monitoring or a Dedicated Server is required.

Automated Monitoring with smartd

The smartd daemon is the production-grade component of smartmontools. It runs in the background, periodically polling drives and executing scheduled tests, then alerting administrators when thresholds are crossed or test failures occur.

Configuring /etc/smartd.conf

The default configuration monitors all detected drives with basic settings. A production-hardened configuration looks like this:

```
# Monitor all ATA/SCSI/NVMe drives with full attribute checking
# Run short test every day at 02:00, long test every Saturday at 03:00
# Email alert on any failure or attribute change
DEVICESCAN -a -o on -S on \
  -s (S/../.././02|L/../../6/03) \
  -m admin@yourdomain.com \
  -M exec /usr/share/smartmontools/smartd-runner
```

Key directives explained:

| Directive | Function |
| --- | --- |
| `DEVICESCAN` | Auto-detect all supported drives |
| `-a` | Enable all S.M.A.R.T. checks |
| `-o on` | Enable automatic offline data collection |
| `-S on` | Enable attribute autosave |
| `-s (S/../.././HH\|L/../../D/HH)` | Schedule short (S) and long (L) tests |
| `-m email@domain` | Send alert emails to this address |
| `-M exec script` | Execute a script on alert (for custom notifications) |
| `-d removable` | Do not raise an error if the device is absent at startup |
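A script run through `-M exec` receives the alert details in environment variables that smartd exports, including `SMARTD_DEVICE`, `SMARTD_FAILTYPE`, and `SMARTD_MESSAGE` (see the smartd man page). The following sketch formats them into a single line; the function name `format_smartd_alert` and the message layout are illustrative only.

```bash
# Build a one-line alert from the variables smartd exports to an
# "-M exec" script. Defaults are used when a variable is unset.
format_smartd_alert() {
    printf '[smartd] %s on %s: %s\n' \
        "${SMARTD_FAILTYPE:-unknown}" \
        "${SMARTD_DEVICE:-unknown device}" \
        "${SMARTD_MESSAGE:-no message}"
}

# In a real alert script, the line could be handed to logger(1) or to
# a webhook client of your choice:
#   format_smartd_alert | logger -t smartd-alert
```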

Enabling and Starting smartd

```bash
sudo systemctl enable smartd
sudo systemctl start smartd
sudo systemctl status smartd
```

Verify the daemon is actively monitoring:

```bash
sudo journalctl -u smartd -f
```

Email Alerting Configuration

For smartd email alerts to function, the system requires a working MTA (mail transfer agent). On minimal server installations, install and configure `postfix` or `msmtp` for relay. If you are using a dedicated mail infrastructure, consider pairing smartd alerts with a properly configured Email Hosting service to ensure alert delivery is not blocked by spam filters.

S.M.A.R.T. Attribute Comparison: HDD vs. SSD vs. NVMe

| Attribute Category | HDD (SATA) | SSD (SATA) | NVMe SSD |
| --- | --- | --- | --- |
| **Primary failure indicator** | Reallocated_Sector_Ct (ID 5) | Reallocated_Sector_Ct (ID 5) | Media and Data Integrity Errors |
| **Endurance tracking** | Power_On_Hours (ID 9) | Wear_Leveling_Count (ID 177) | Percentage Used |
| **Pending bad sectors** | Current_Pending_Sector (ID 197) | Current_Pending_Sector (ID 197) | N/A (controller-managed) |
| **Thermal monitoring** | Temperature_Celsius (ID 194) | Temperature_Celsius (ID 194) | Temperature Sensor 1/2 |
| **Spare capacity** | N/A | Available_Reservd_Space (ID 232) | Available Spare |
| **Write endurance** | N/A | Total_LBAs_Written (ID 241) | Data Units Written |
| **Mechanical health** | Spin_Retry_Count (ID 10) | N/A | N/A |
| **Test types supported** | Short, Long, Conveyance, Selective | Short, Long, Selective | Short, Long (vendor-dependent) |
| **Interface for data access** | ATA SMART READ DATA | ATA SMART READ DATA | NVMe Log Page 0x02 |

Practical Workflow: Diagnosing a Suspect Drive

When a drive exhibits symptoms — I/O errors in `dmesg`, filesystem corruption, unusual latency — follow this diagnostic sequence:

Step 1: Confirm S.M.A.R.T. availability

```bash
sudo smartctl -i /dev/sda | grep -i "SMART support"
```

Step 2: Check overall health verdict

```bash
sudo smartctl -H /dev/sda
```

Step 3: Inspect the error log

```bash
sudo smartctl -l error /dev/sda
```

The error log records the last 5 ATA errors with timestamps and LBA addresses. Cross-reference the LBA addresses with filesystem block maps using `debugfs` or `xfs_db` to determine if errors affect critical filesystem structures.

Step 4: Analyze critical attributes

```bash
sudo smartctl -A /dev/sda | grep -E "Reallocated|Pending|Uncorrectable|Command_Timeout"
```

Step 5: Run a short test for immediate confirmation

```bash
sudo smartctl -t short /dev/sda
sleep 120  # adjust to the wait time smartctl prints when the test starts
sudo smartctl -l selftest /dev/sda
```

Step 6: If the short test passes and symptoms persist, run a long test during a maintenance window

```bash
sudo smartctl -t long /dev/sda
```

Step 7: Review the self-test log for LBA failure addresses

```bash
sudo smartctl -l selftest /dev/sda
```

Failed tests report the LBA of the first error encountered. This pinpoints the exact location of media damage.
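The failing addresses can be extracted from the self-test log for use in follow-up checks. The sketch below assumes the ATA log layout, in which each entry starts with `#` and ends with the `LBA_of_first_error` column (a dash when no error was found); `failed_selftest_lbas` is a hypothetical helper name.

```bash
# Read `smartctl -l selftest` output on stdin and print the LBA of the
# first error for every logged test that recorded one.
failed_selftest_lbas() {
    awk '/^# *[0-9]+/ && $NF ~ /^[0-9]+$/ { print $NF }'
}

# Usage:
#   sudo smartctl -l selftest /dev/sda | failed_selftest_lbas
```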

Common Pitfalls and Edge Cases

USB-attached drives: smartctl often cannot communicate with drives connected via USB-to-SATA adapters because the USB bridge chip does not forward ATA commands. Most current bridges support SCSI-to-ATA Translation, which `-d sat` selects; some older chips need a vendor-specific type such as `-d usbjmicron`, `-d usbcypress`, `-d usbprolific`, or `-d usbsunplus`. `smartctl --scan` will attempt to identify the correct type automatically.

Virtualized environments: Inside a KVM or VMware guest, `/dev/sda` is a virtual disk. smartctl may return `Device does not support SMART` or fabricated data passed through by the hypervisor. Do not rely on in-guest smartctl output for physical drive health assessment on shared hosting infrastructure.

False positives on Raw_Read_Error_Rate: As noted above, Seagate's encoding of this attribute causes alarming raw values that are entirely normal. Always verify against the manufacturer's attribute documentation before acting on this value.

Drives behind software RAID (mdadm): smartctl accesses the component drives directly (`/dev/sda`, `/dev/sdb`), not the RAID device (`/dev/md0`). Monitor each member drive individually.

NVMe namespace vs. controller: Use `/dev/nvme0` (controller) for health data, not `/dev/nvme0n1` (namespace/block device). Some smartctl versions accept both, but the controller node is authoritative for health log access.

Drives that lie: Some consumer SSDs, particularly lower-tier QLC NAND drives, have been documented to report healthy S.M.A.R.T. attributes until complete and sudden failure. S.M.A.R.T. is a strong indicator but not a guarantee — it does not replace regular backups.

Deploying smartmontools in Server Environments

For administrators managing multiple physical servers — such as those running workloads on Dedicated Servers — centralizing S.M.A.R.T. monitoring is essential. Consider the following architecture:

  • Prometheus + node_exporter: `node_exporter` does not read S.M.A.R.T. data itself, but its textfile collector can ingest metrics written by a script such as the community-maintained `smartmon.sh`, which exports S.M.A.R.T. attributes as Prometheus metrics. This enables alerting via Alertmanager and visualization in Grafana dashboards.
  • Nagios / Icinga2: The `check_smart` plugin provides S.M.A.R.T. monitoring integration for traditional monitoring stacks.
  • Custom smartd scripts: The `-M exec` directive in smartd.conf can trigger any script on alert — useful for integrating with PagerDuty, Slack webhooks, or custom ticketing systems.
  • Logfile aggregation: Configure smartd to write to syslog, then forward to a centralized log aggregator (ELK stack, Loki) for historical trend analysis across a fleet.
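To illustrate the textfile-collector approach, the sketch below converts `smartctl -A` rows into Prometheus exposition lines. It is not the community `smartmon.sh` script: the metric name `smart_attribute_raw_value`, the label names, and the function name `smart_to_prom` are all assumptions made for this example.

```bash
# Read `smartctl -A` output on stdin and emit one Prometheus sample per
# attribute whose raw value is a plain integer. $1 is the disk label.
smart_to_prom() {
    awk -v disk="$1" '$1 ~ /^[0-9]+$/ && NF >= 10 && $10 ~ /^[0-9]+$/ {
        printf "smart_attribute_raw_value{disk=\"%s\",id=\"%d\",name=\"%s\"} %s\n",
               disk, $1, $2, $10
    }'
}

# Usage (the output path is whatever directory node_exporter's textfile
# collector is configured to read):
#   sudo smartctl -A /dev/sda | smart_to_prom sda > /path/to/textfile_dir/smart_sda.prom
```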

For web hosting environments using control panels, VPS with cPanel deployments benefit from host-level S.M.A.R.T. monitoring configured by the infrastructure provider, since cPanel itself does not expose drive health data natively.

Key Takeaway Checklist

Use this as a pre-deployment and ongoing operational reference:

  • Install smartmontools 7.x or later to ensure full NVMe and modern SSD support
  • Run `smartctl --scan` on new hardware to identify all drives and their required interface flags
  • Check `-H` health verdict first — a `FAILED` result requires immediate action regardless of other attributes
  • Treat any non-zero Reallocated_Sector_Ct, Current_Pending_Sector, or Uncorrectable_Sector_Count as a failure signal — do not wait for the count to grow
  • Do not rely solely on Raw_Read_Error_Rate — validate against manufacturer documentation before alarming
  • Schedule automated long tests weekly via smartd on all production drives; short tests daily
  • Configure smartd email alerts and verify delivery before relying on them
  • For hardware RAID, always use the appropriate pass-through flag — monitoring the virtual disk is not equivalent to monitoring the physical drives
  • In virtualized environments, perform S.M.A.R.T. monitoring at the hypervisor/host level, not inside guest VMs
  • Pair S.M.A.R.T. monitoring with a backup strategy — S.M.A.R.T. predicts failure, it does not prevent data loss
  • For NVMe drives, monitor Available Spare and Percentage Used in addition to error counts
  • On multi-drive servers, integrate smartd with a centralized alerting platform rather than relying on local email delivery

Frequently Asked Questions

Does smartctl work inside a VPS or cloud instance?

In most VPS environments, the hypervisor presents a virtual block device to the guest. smartctl inside the guest will either return no data or return data synthesized by the hypervisor, which does not reflect the physical drive's actual health. Meaningful S.M.A.R.T. monitoring requires access to the physical host. For full drive-level visibility, a Dedicated Server is the appropriate solution.

What is the difference between `smartctl -a` and `smartctl -x`?

The `-a` flag outputs the standard set of S.M.A.R.T. data: device info, health verdict, capability flags, all attributes, error log, self-test log, and (on ATA drives) the selective self-test log. The `-x` flag outputs everything `-a` provides plus additional data, including the extended error and self-test logs, the device statistics log, the pending defects log, SCT temperature history, and SATA PHY event counters, giving a more complete picture of drive history. Use `-x` for thorough diagnostics; `-a` for routine checks.

How long does a long self-test take, and is it safe to run on a production drive?

Duration depends on drive capacity and speed: approximately 1 hour per terabyte for a 7200 RPM HDD, and 20–60 minutes for most SSDs. The test runs in the drive's firmware background, and the drive continues to serve I/O during the test. Performance impact is typically minor (5–15% throughput reduction). It is safe to run on production systems during low-traffic periods, but scheduling during a maintenance window is recommended for latency-sensitive workloads.

What does it mean when Current_Pending_Sector is non-zero but Reallocated_Sector_Ct has not increased?

The drive has identified sectors that are producing read errors but has not yet remapped them. Remapping occurs only when the sector is written, either by a normal write operation or during an offline scan. If the count remains static, the affected sectors may simply not have been rewritten yet, or the drive may lack available spare sectors for remapping. A non-zero and increasing Current_Pending_Sector count, with Reallocated_Sector_Ct remaining flat, can indicate that the drive has exhausted its spare sector pool, which is a critical failure condition.

Can smartctl detect SSD wear before the drive fails?

Yes, for SATA SSDs, the Wear_Leveling_Count (ID 177), Media_Wearout_Indicator (ID 233), and Available_Reservd_Space (ID 232) attributes track NAND endurance consumption. For NVMe SSDs, the Percentage Used and Available Spare fields in the NVMe Health Information Log serve the same purpose. When Available Spare drops below its threshold or Percentage Used reaches 100%, the drive has consumed its rated write endurance. Unlike sudden mechanical failures, SSD wear degradation is gradual and highly predictable — making it one of the strongest use cases for proactive S.M.A.R.T. monitoring.
