
Friday, September 16, 2016

Backblaze hard disk drive failure data: Update to Q2 2016

Ross Lazarus, September 2016

This is a Kaplan-Meier analysis of the Backblaze hard drive reliability data, using all available data to the end of the second quarter of 2016 from 

Previous posts are at and .

I reran my scripts and got the plots shown below. It's taking a while to read all the data as there are now a very large number of drives spinning. A total of 41,740,623 rows were processed in about 35 minutes on my home desktop by the Python script in the GitHub repository.

The new 8TB drives are performing the best of all - even better than the HGST and Hitachi drives - and way better than any of the earlier Seagate drives. That is hard to miss here, but not so obvious in the report at Backblaze.

Updated curves:

By Manufacturer:


Once again for me, little change is seen in the KM curves and statistics with a lot more drives and a lot more observation time, suggesting that this statistical approach is reliable and robust, although in general we expect that more data provides better resolution.

In terms of the KM statistical tests, additional data confirms the earlier inference that there are significant differences between the manufacturer and model risk profiles over time.

survdiff(formula = sm ~ model, data = dm, rho = 0)

                                  N Observed Expected (O-E)^2/E (O-E)^2/V

model=HGST HMS5C4040ALE640     7168       85   505.51   349.800   406.826
model=HGST HMS5C4040BLE640     8505       29   269.99   215.103   231.736
model=Hitachi HDS5C3030ALA630  4664      117   466.48   261.826   302.989
model=Hitachi HDS5C4040ALE630  2719       71   268.60   145.365   157.458
model=Hitachi HDS722020ALA330  4774      215   472.27   140.149   161.908
model=Hitachi HDS723030ALA640  1048       55   103.54    22.753    23.459
model=ST3000DM001              4707     1705   246.40  8634.322  9272.385
model=ST31500341AS              787      216    35.74   909.141   917.789
model=ST31500541AS             2188      392   157.42   349.574   363.940
model=ST4000DM000             36089     1123  1500.66    95.042   151.313
model=ST500LM012 HN             801       26    22.42     0.573     0.577
model=ST6000DX000              1915       31    77.14    27.601    28.497
model=ST8000DM002              2754        3     3.74     0.146     0.149
model=WDC WD10EADS              550       60    46.72     3.773     3.818
model=WDC WD30EFRX             1289      136    87.38    27.053    27.637

 Chisq= 11353  on 14 degrees of freedom, p= 0 

survdiff(formula = s ~ manufact, data = ds, rho = 0)

                     N Observed Expected (O-E)^2/E (O-E)^2/V

manufact=HGST    15840      120    821.8   599.348   744.193
manufact=Hitachi 13246      462   1433.5   658.440  1046.810
manufact=HN        801       26     23.6     0.242     0.243
manufact=ST      49900     3792   2255.7  1046.249  2067.849
manufact=TOSHIBA   279       12     13.6     0.181     0.182
manufact=WDC      3920      385    248.7    74.701    78.874

 Chisq= 2493  on 5 degrees of freedom, p= 0 

Wednesday, May 18, 2016

Survival analysis of hard disk drive failure data: Update to Q1 2016

Ross Lazarus, May 2016

This is an update to now that additional data for Q1 2016 has been released from
I reran my scripts and got the plots shown below. Whole process only takes a few minutes.

For me, the interesting thing is that so little really changes in the KM curves and statistics with 10% more data, suggesting that this statistical approach is reliable and robust, although in general we expect that more data provides better resolution. 

With more data, the WD30EFRX and WD10EADS drives have been reordered in terms of failure risk and now sit near the middle of the pack, but the updated model KM curves otherwise suggest the same pattern of risk of failure over time. Hitachi and HGST have reversed their positions at the top of the manufacturer survival curves as a result of the additional data, but the other manufacturers remain largely unchanged.

In terms of the KM statistical tests, additional data confirms the earlier inference that there are significant differences between the manufacturer and model risk profiles over time.

survdiff(formula = sm ~ model, data = dm, rho = 0)

                                  N Observed Expected (O-E)^2/E (O-E)^2/V
model=HGST HMS5C4040ALE640     7168       83    473.2    321.78    376.18
model=HGST HMS5C4040BLE640     3115       21    231.4    191.29    205.14
model=Hitachi HDS5C3030ALA630  4664      106    458.0    270.51    313.62
model=Hitachi HDS5C4040ALE630  2719       70    263.4    141.98    153.89
model=Hitachi HDS722020ALA330  4774      195    466.4    157.94    183.39
model=Hitachi HDS723030ALA640  1048       47    101.7     29.42     30.35
model=ST3000DM001              4707     1705    258.4   8100.00   8753.25
model=ST31500341AS              787      216     37.8    839.35    848.18
model=ST31500541AS             2188      392    166.0    307.66    321.95
model=ST4000DM000             35858      895   1302.4    127.45    195.39
model=ST500LM012 HN             656       24     17.2      2.70      2.71
model=ST6000DX000              1909       26     57.4     17.19     17.76
model=WDC WD10EADS              550       59     47.5      2.78      2.81
model=WDC WD30EFRX             1280      124     82.2     21.26     21.72

 Chisq= 10647  on 13 degrees of freedom, p= 0 

survdiff(formula = s ~ manufact, data = ds, rho = 0)

                     N Observed Expected (O-E)^2/E (O-E)^2/V
manufact=HGST    10449      110    740.5   536.857   668.750
manufact=Hitachi 13246      422   1380.6   665.600  1060.350
manufact=HN        656       24     18.2     1.885     1.896
manufact=ST      46909     3507   2032.7  1069.209  2056.135
manufact=TOSHIBA   255       10     11.3     0.154     0.155
manufact=WDC      3838      342    231.7    52.560    55.539

 Chisq= 2420  on 5 degrees of freedom, p= 0 

Here are the updated curves:

Sunday, February 28, 2016

Survival analysis of hard disk drive failure data.

Ross Lazarus, February 2016

Executive Summary:

Using a well established, objective analysis and data presentation method designed for right censored hard disk drive failure data provides insights which are not provided by simple descriptive statistics or charts. The Kaplan-Meier statistics and plots are recommended for routine use with hard drive failure data and their use is illustrated with 30M data points from the BackBlaze public data.


Hard disk drives are widely used for mass storage in servers, network attached storage devices, laptops and desktop computers. Familiar and convenient as they are, these complex electro-mechanical devices are prone to sudden catastrophic failure, which can lead to very unpleasant consequences such as loss of data which was not securely backed up elsewhere. Selecting drive manufacturers and models for home or for commercial applications is complicated by the problem that objective and reliable measurements of the reliability of specific drive models or manufacturers are hard to find.

Subjective experience of individual consumers who purchase a few drives at a time is readily available in on-line product reviews at the larger retailers like Newegg or Amazon. These reviews are likely to be biased by negative reviews from those unlucky owners of a drive which happened to fail quickly - satisfied owners are less likely to take the time to share their experiences compared to unhappy owners who have just lost precious data.

Large commercial purchasers such as Google or Amazon probably do their own in-house testing, but rarely share their hard won findings or raw data. As drive capacities grow, new models are released on a regular basis but it takes at least 2 or 3 years of observation of a large number of sample drives under typical field operation conditions before robust conclusions can be drawn on the reliability over time for each new model.

The most recent analysis of about 50,000 hard disks, deployed over nearly 3 years in a commercial online storage facility run by Backblaze, is one of the largest published studies and can be viewed at and in other Backblaze blogs. Simple statistics, tables and bar graphs derived from 30,301,566 observations are presented and discussed. That's a lot of data and the Backblaze engineers have done their best to make sense of it. Unfortunately I'm not sure that you can really see what's going on from their presentation. For example, time is split into 3 year-long intervals in the main table, which makes trends over time hard to follow, and the summary bar charts hide an awful lot of interesting detail.

Failure time (or survival) analysis:

Part of the challenge in interpreting this type of data is the problem that at any point in time during the observation period, one or more drives (or patients or more generally, units of analysis) may fail, and one or more drives may be removed from any further study before failure because of firmware diagnostics or planned maintenance. In terms of statistical analysis, this problem is termed right censoring because no further information is available after a drive is removed. Right censoring must be taken into account in order to correctly calculate the instantaneous failure rate: at each point in time, the drives still at risk are those which have neither failed nor been removed from observation.

Epidemiologists and statisticians have established valid and robust methods for handling right censored data in the context of survival analysis, which are applicable to the Backblaze data. Survival is the complement of failure (the fraction surviving to a given time is one minus the fraction which has failed by then), so survival and failure analysis are mathematically equivalent, being two sides of the same technical coin, although failure time analysis predominates in engineering circles whereas the survival analysis paradigm predominates in biology.

One popular method is the Kaplan-Meier (KM) plot and KM statistics, widely used to compare (for example) survival time after diagnosis for patients with the same cancer but different treatments. This kind of data is similar to the hard drive failure data because the reality is that it is almost inevitable that some patients in any clinical study will be lost to further follow up after a visit at which they were clearly alive. Those right censored patients, like the drives removed before failure, contribute no more information to the study, but do contribute useful information for the whole time they are being observed. Some details on where the data came from and how the analysis was performed are provided at the end of this article.

Application of survival analysis to hard disk drive failure data:

Here's a KM plot showing the survival of each drive by the manufacturer.

The vertical axis represents the fraction of drives which survived at any given point in time and the horizontal axis represents days since time zero. Each individual disk drive's history over time is "lined up" so the first day of observation is always at the far left, at time zero - like a race where each competitor starts at the same point, although in the raw data, drives were introduced to the pool continuously over the entire study period. Each manufacturer's drives are grouped together and their survival in service over time is plotted as a single line. Each downward step in a line represents one or more failures at that time. Each cross on a line represents a right censored observation removed from further study. Note that a right censored drive does not produce a step in the curve - it simply reduces the denominator for subsequent failure or, conversely, survival rate calculations.
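To make the step and denominator logic concrete, here is a minimal pure Python sketch of the Kaplan-Meier product-limit estimate. It is not the code used for the plots in this post (those came from R); the function name and the toy numbers are invented for illustration.

```python
from collections import Counter

def kaplan_meier(durations, failed):
    """Return [(t, S(t))] for each time t at which at least one unit failed.

    durations: days under observation for each unit
    failed:    1 if the unit failed at that time, 0 if it was right censored
    """
    failures = Counter(t for t, f in zip(durations, failed) if f)
    leaving = Counter(durations)   # failures plus censorings at each time
    at_risk = len(durations)       # everyone is at risk at time zero
    surviving = 1.0
    curve = []
    for t in sorted(leaving):
        d = failures.get(t, 0)
        if d:
            # A failure produces a downward step in the curve.
            surviving *= 1.0 - d / at_risk
            curve.append((t, surviving))
        # Failed and censored units both leave the risk set, so censoring
        # only shrinks the denominator used for later steps.
        at_risk -= leaving[t]
    return curve

# Toy data (hypothetical): five drives, two of them right censored.
curve = kaplan_meier([1, 2, 2, 3, 4], [1, 0, 1, 0, 1])
```

In the toy data the curve steps down at days 1, 2 and 4; the two censored drives produce no step of their own but make the later steps larger because fewer drives remain at risk.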

Here is the Backblaze summary chart linked from their report at hard drive reliability by manufacturer

To my eye, the KM curve provides a much more detailed and arguably more accurate summary of what happened during the observation period. Note the curve for ST500LM012 which is an obvious anomaly arising from an aberrant manufacturer string in the data ("ST500LM012 HN") where the two space delimited components in the data field are reversed (see below) compared to the majority of the data where the model follows the manufacturer abbreviation. This does not seem to have been noticed in the Backblaze analysis but the KM plot makes it obvious. No attempt has been made to correct this anomaly because it is not clear whether the model number means that the "HN" is wrong and should be replaced by "ST" - I'll leave that for the Backblaze engineers to figure out and fix!

One example of a feature that was not at all obvious from the Backblaze analysis, but is clear from the KM plot, is the crossover in failure rate between ST (Seagate) and WDC (Western Digital). Initially, the WDC family failed slightly faster but the Seagate family of samples failed more quickly after about the first year of operation.

The KM statistical test (the log-rank test, which is what R's survdiff computes with rho = 0) estimates the number of failures expected in each group if all groups shared the same failure rate, given the number of units under observation at each time point. As shown below, it suggests that drive survival is significantly different between manufacturers, with some (eg HGST) having far fewer observed failures than expected and others (eg ST) having far more than expected, with a global chi-squared value of 2535 which is extremely unlikely to have arisen by chance alone:

                        N Observed Expected (O-E)^2/E (O-E)^2/V
manufact=HGST       10424      100   515.21  3.35e+02  4.08e+02
manufact=Hitachi    13244      385  1533.11  8.60e+02  1.53e+03
manufact=ST         32714     3266  1798.14  1.20e+03  2.21e+03
manufact=ST500LM012   377       22     8.89  1.93e+01  1.94e+01
manufact=TOSHIBA      254        9     9.15  2.59e-03  2.59e-03
manufact=WDC         3753      298   215.49  3.16e+01  3.34e+01

 Chisq= 2535  on 5 degrees of freedom, p= 0 
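The Observed and Expected columns in a table like this can be sketched in a few lines of Python. This is an illustrative reimplementation of the standard log-rank bookkeeping, not the R code that produced the numbers above, and it stops short of the variance calculation needed for the overall chi-squared statistic. The function name and toy data are invented.

```python
from collections import Counter, defaultdict

def logrank_observed_expected(durations, failed, groups):
    """Return (observed, expected) failure counts per group.

    At each time point, the failures seen across all groups are shared out
    between groups in proportion to how many units each has at risk.
    """
    at_risk = dict(Counter(groups))
    names = sorted(at_risk)
    observed = {g: 0 for g in names}
    expected = {g: 0.0 for g in names}
    leaving = defaultdict(list)    # time -> [(failed flag, group), ...]
    for t, f, g in zip(durations, failed, groups):
        leaving[t].append((f, g))
    for t in sorted(leaving):
        total_at_risk = sum(at_risk.values())
        failures_now = sum(f for f, _ in leaving[t])
        for g in names:
            expected[g] += failures_now * at_risk[g] / total_at_risk
        for f, g in leaving[t]:
            observed[g] += f
            at_risk[g] -= 1        # failed or censored: leaves the risk set
    return observed, expected

# Toy data (hypothetical): two groups of two drives, all of which fail.
observed, expected = logrank_observed_expected(
    [1, 2, 3, 4], [1, 1, 1, 1], ["A", "B", "A", "B"])
# Per-group contribution, as in the (O-E)^2/E column.
contribution = {g: (observed[g] - expected[g]) ** 2 / expected[g]
                for g in expected}
```

A useful sanity check is that the expected counts always add up to the total number of observed failures, which also holds for each survdiff table in these posts (allowing for rounding).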

The KM plot pattern seems much easier to understand and is not at all obvious from the table or bar graphs shown in the original article.

For individual drive models, the KM curves are complex but even more revealing:

The KM curves show that one particular Seagate model failed at an unusually high rate over the entire period, whereas the curves at the top of the plot show a group of very reliable drive models which had very few failures over the entire period of observation. These individual drive model curves are made from the same data as the manufacturer curves but reveal a great deal of interesting variation within each manufacturer's offerings - again suggesting that descriptive and summary statistics presented in the Backblaze blogs hide a lot of important and interesting complexity.

Again, the KM statistics show that the differences between models seen in the KM plot are statistically significant and unlikely to have arisen by chance alone.

                                  N Observed Expected (O-E)^2/E (O-E)^2/V
model=HGST HMS5C4040ALE640     7168       73    285.7  1.58e+02  1.83e+02
model=HGST HMS5C4040BLE640     3115       21    194.6  1.55e+02  1.67e+02
model=Hitachi HDS5C3030ALA630  4662       98    519.4  3.42e+02  4.09e+02
model=Hitachi HDS5C4040ALE630  2719       63    298.9  1.86e+02  2.05e+02
model=Hitachi HDS722020ALA330  4774      175    530.7  2.38e+02  2.86e+02
model=Hitachi HDS723030ALA640  1048       45    115.3  4.29e+01  4.45e+01
model=ST3000DM001              4707     1705    305.5  6.41e+03  7.06e+03
model=ST31500341AS              787      216     45.1  6.47e+02  6.55e+02
model=ST31500541AS             2188      392    199.1  1.87e+02  1.98e+02
model=ST4000DM000             21671      695   1025.8  1.07e+02  1.52e+02
model=ST6000DX000              1906       26     27.6  9.20e-02  9.51e-02
model=WDC WD10EADS              550       53     54.7  5.38e-02  5.47e-02
model=WDC WD30EFRX             1267      114     73.6  2.22e+01  2.27e+01

 Chisq= 8587  on 12 degrees of freedom, p= 0 

More complex models:

The KM plot is a robust, non-parametric method which is attractive because of the lack of assumptions about the data. More sophisticated methods such as Cox proportional hazards models require distributional or other assumptions, but allow adjustment for additional variables such as the kind of storage pod (see the Backblaze blogs), drive capacity, number of platters or other factors of interest. My view is that this is not going to be at all useful until a lot more data becomes available. 


Other than as a consumer, I don't have any particular expertise on hard disk drives but I have made a successful career out of interpreting large scale data sets using appropriate statistical methods. I find the KM analysis much more clear and easy to interpret compared to the simple descriptive statistics presented by Backblaze and I hope they use more appropriate methods going forward. I'm happy to help if anyone cares to ask.

Technical details and data source:

The Backblaze folk have done a great service to the community by making their data freely available for anyone willing to poke at it. The data release which includes the third quarter of 2015 was downloaded in early February 2016 and is reported here.

Here's a small sample of the 30,301,566 rows of raw data available from Backblaze. There's a separate CSV format file for each day of each year. These are stored under three year (eg 2013) directories. This is from the start of  "2013/2013-04-10.csv"

date serial_number model capacity_bytes failure
2013-04-10 MJ0351YNG9Z0XA Hitachi HDS5C3030ALA630 3000592982016 0
2013-04-10 MJ0351YNG9WJSA Hitachi HDS5C3030ALA630 3000592982016 0
2013-04-10 MJ0351YNG9Z7LA Hitachi HDS5C3030ALA630 3000592982016 0
2013-04-10 MJ0351YNGAD37A Hitachi HDS5C3030ALA630 3000592982016 0

Since I don't trust the SMART stats, I threw all those columns away and split out the manufacturer code and model from the "model" field.
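One plausible way to do that split is sketched below. The rule is an assumption inferred from the manufacturer codes that appear in the tables (for example "ST" for models like ST3000DM001), not a copy of the actual script, and the function name is invented.

```python
from itertools import takewhile

def split_model(model_field):
    """Split a raw "model" value into (manufacturer code, model).

    Assumed rule: if the field has two space delimited parts, the first is
    the manufacturer; otherwise the leading letters of the model number are
    used as the manufacturer code.
    """
    parts = model_field.split()
    if len(parts) >= 2:
        return parts[0], " ".join(parts[1:])
    prefix = "".join(takewhile(str.isalpha, model_field))
    return prefix, model_field

examples = [split_model(m) for m in
            ("Hitachi HDS5C3030ALA630", "ST3000DM001", "ST500LM012 HN")]
```

Under this rule the reversed "ST500LM012 HN" string yields "ST500LM012" as the manufacturer, which is the kind of anomaly that shows up in the manufacturer table above.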

The Kaplan-Meier plot and test statistics are available in most worthwhile statistical packages and I used the npsurv function (from the rms package, which builds on the R survival package) for the plots and statistics reported here. In order to improve the reliability of the model curves, drives with fewer than 500 observations were dropped.
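A sketch of that filtering step follows, reading "fewer than 500 observations" as fewer than 500 drives of a given model. That reading is an assumption, although it fits the tables, where TOSHIBA (254 drives) appears by manufacturer but not by model. The function name and row layout are invented for illustration.

```python
from collections import Counter

def drop_rare_models(drive_rows, min_drives=500):
    """Keep only drives whose model has at least min_drives drives."""
    counts = Counter(row["model"] for row in drive_rows)
    return [row for row in drive_rows if counts[row["model"]] >= min_drives]

# Toy data (hypothetical), with a low threshold so the effect is visible.
rows = [{"model": "ST3000DM001"}, {"model": "ST3000DM001"},
        {"model": "TOSHIBA DT01ACA300"}]
kept = drop_rare_models(rows, min_drives=2)
```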

A Python script was used to read all the files, keeping track of the appearance and disappearance of each unique drive as defined by a combination of model and serial_number, while processing each day's data in sequence. No database needed - Python easily handles this data as an in memory dictionary, after dropping all the SMART columns. After reading all 30 million rows, a summary file containing a single row for each unique drive with the date it first appeared, the number of days it was under observation and a code indicating whether it failed or not was written. That script processed about 30,000 CSV rows a second on my oldish desktop, taking about 17 minutes for the entire dataset. The R script takes only a few seconds to perform the KM analysis and generate plots.
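The approach described above can be sketched roughly as follows. This is a simplified illustration rather than the script in the repository: the function name, the output field names and the choice to count days inclusively are all assumptions, and the serial numbers in the sample are made up.

```python
import csv
import io
from datetime import date

def summarise_drives(daily_files):
    """Collapse daily rows into one summary record per unique drive.

    daily_files: iterable of open CSV files, one per day, in date order,
    with at least the columns date, serial_number, model and failure.
    """
    drives = {}  # (model, serial_number) -> [first_seen, last_seen, failed]
    for handle in daily_files:
        for row in csv.DictReader(handle):
            key = (row["model"], row["serial_number"])
            day = date.fromisoformat(row["date"])
            record = drives.setdefault(key, [day, day, 0])
            record[1] = max(record[1], day)
            if row["failure"] == "1":
                record[2] = 1
    return [
        {"model": model,
         "serial_number": serial,
         "first_seen": first.isoformat(),
         # Assumption: first and last day both count as observed days.
         "days_observed": (last - first).days + 1,
         "failed": failed}
        for (model, serial), (first, last, failed) in drives.items()
    ]

# Tiny in-memory stand-in for three daily files (hypothetical serials).
header = "date,serial_number,model,capacity_bytes,failure\n"
sample_days = [
    io.StringIO(header + "2013-04-10,A1,Hitachi HDS5C3030ALA630,3000592982016,0\n"
                         "2013-04-10,B2,ST3000DM001,3000592982016,0\n"),
    io.StringIO(header + "2013-04-11,A1,Hitachi HDS5C3030ALA630,3000592982016,0\n"
                         "2013-04-11,B2,ST3000DM001,3000592982016,1\n"),
    io.StringIO(header + "2013-04-12,A1,Hitachi HDS5C3030ALA630,3000592982016,0\n"),
]
summary = summarise_drives(sample_days)
```

A drive which simply stops appearing without a failure flag ends up with failed set to 0, which is exactly the right censored case discussed earlier.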