## Survival analysis of hard disk drive failure data.

### Ross Lazarus, February 2016

### Executive Summary:

**Using a well-established, objective analysis and data presentation method designed for right-censored hard disk drive failure data provides insights which are not provided by simple descriptive statistics or charts. The Kaplan-Meier statistics and plots are recommended for routine use with hard drive failure data and their use is illustrated with 30M data points from the Backblaze public data.**

### Introduction:

Subjective experience of individual consumers who purchase a few drives at a time is readily available in on-line product reviews at the larger retailers like Newegg or Amazon. These reviews are likely to be biased by negative reviews from those unlucky owners of a drive which happened to fail quickly - satisfied owners are less likely to take the time to share their experiences compared to unhappy owners who have just lost precious data.

Large commercial purchasers such as Google or Amazon probably do their own in-house testing, but rarely share their hard won findings or raw data. As drive capacities grow, new models are released on a regular basis but it takes at least 2 or 3 years of observation of a large number of sample drives under typical field operation conditions before robust conclusions can be drawn on the reliability over time for each new model.

The most recent analysis of about 50,000 hard disks deployed over nearly 3 years in a commercial online storage facility run by Backblaze is one of the largest published studies and can be viewed at https://www.backblaze.com/blog/hard-drive-reliability-q3-2015/ and in other Backblaze blogs. Simple statistics, tables and bar graphs derived from 30,301,566 observations are presented and discussed. That's a lot of data and the Backblaze engineers have done their best to make sense of it. Unfortunately, I'm not sure that you can really see what's going on from their presentation. For example, time is split into 3 year-long intervals in the main table, which makes it hard to follow how failures develop over time, and the summary bar charts hide an awful lot of interesting detail.

### Failure time (or survival) analysis:

Part of the challenge in interpreting this type of data is that at any point in time during the observation period, one or more drives (or patients or, more generally, units of analysis) may fail, and one or more drives may be removed from any further study before failure because of firmware diagnostics or planned maintenance. In terms of statistical analysis, this problem is termed **right censoring** because no further information is available after a drive is removed. Right censoring must be taken into account in order to correctly calculate the instantaneous failure rate of drives, given that some drives were removed from further observation at some point before they failed while the remaining drives have not failed (yet).

Epidemiologists and statisticians have established valid and robust methods for handling right censored data in the context of survival analysis, which are applicable to the Backblaze data. Survival rates are the complement of failure rates, so survival and failure analysis are more or less mathematically equivalent, being two sides of the same technical coin, although failure time analysis predominates in engineering circles whereas the survival analysis paradigm predominates in biology.

One popular method is the Kaplan-Meier (KM) plot and KM statistics, widely used to compare (for example) survival time after diagnosis for patients with the same cancer but different treatments. This kind of data is similar to the hard drive failure data because the reality is that it is almost inevitable that some patients in any clinical study will be lost to further follow up after a visit at which they were clearly alive. Those right censored patients, like the drives removed before failure, contribute no more information to the study, but do contribute useful information for the whole time they are being observed. Some details on where the data came from and how the analysis was performed are provided at the end of this article.
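For readers who like to see it written down, the usual textbook form of the KM estimate is shown below. The notation is introduced here for illustration only: $t_i$ are the distinct times at which at least one failure was observed, $d_i$ is the number of failures at $t_i$, and $n_i$ is the number of units still under observation (neither failed nor censored) just before $t_i$.

$$\hat{S}(t) = \prod_{i:\, t_i \le t} \left(1 - \frac{d_i}{n_i}\right)$$

Censored units never appear in $d_i$, but they do count towards $n_i$ for as long as they remain under observation, which is how they contribute information up to the point they are removed.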

### Application of survival analysis to hard disk drive failure data:

Here's a KM plot showing the survival of each drive by the manufacturer.

The vertical axis represents the fraction of drives which survived at any given point in time and the horizontal axis represents days since time zero. Each individual disk drive's history over time is "lined up" so the first day of observation is always at the far left, at time zero - like a race where each competitor starts at the same point, although in the raw data, drives were introduced to the pool continuously over the entire study period. Each manufacturer's drives are grouped together and their survival in service over time is plotted as a single line. When one or more drives fail, there is a small vertical step in the curve. Each cross on each line represents a right censored observation removed from further study. Note that right censoring has no effect on the instantaneous survival rate - it simply changes the denominator for failure or conversely, survival rate calculations. Each downward step in each line represents one or more failures at that time.
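The way failures produce steps and censoring only shrinks the denominator can be illustrated with a few lines of Python. This is a minimal sketch on made-up toy data, not the code used for the plots in this article; the function name `kaplan_meier` and the six toy drives are assumptions introduced here.

```python
from collections import Counter


def kaplan_meier(durations, failed):
    """Return [(time, survival)] with one entry per time at which a failure occurred.

    durations: days each unit was observed, counted from its own time zero.
    failed:    True if the unit failed at that time, False if it was right censored.
    """
    failures = Counter(t for t, f in zip(durations, failed) if f)
    exits = Counter(durations)  # failures and censorings both leave the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = failures.get(t, 0)
        if d:
            # A failure produces a downward step in the curve.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        # Censored units cause no step; they only reduce the later denominator.
        at_risk -= exits[t]
    return curve


# Toy data (hypothetical): six drives, three failures, three censored.
days = [5, 8, 8, 12, 15, 20]
failed = [True, False, True, True, False, False]
for t, s in kaplan_meier(days, failed):
    print(t, round(s, 3))
```

Units censored at the same time as a failure are treated as still at risk at that time, which is the usual convention.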

Here is the Backblaze summary chart linked from their report at https://www.backblaze.com/blog/hard-drive-reliability-q3-2015/

To my eye, the KM curve provides a much more detailed and arguably more accurate summary of what happened during the observation period. Note the curve for ST500LM012, which is an obvious anomaly arising from an aberrant manufacturer string in the data ("ST500LM012 HN") where the two space-delimited components in the data field are reversed (see below) compared to the majority of the data, where the model follows the manufacturer abbreviation. This does not seem to have been noticed in the Backblaze analysis but the KM plot makes it obvious. No attempt has been made to correct this anomaly because it is not clear whether the model number means that the "HN" is wrong and should be replaced by "ST" - I'll leave that for the Backblaze engineers to figure out and fix!

One example of a feature that was not at all obvious from the Backblaze analysis, but is clear from the KM plot, is the crossover in failure rate between ST (Seagate) and WDC (Western Digital). Initially, the WDC family failed slightly faster but the Seagate family of samples failed more quickly after about the first year of operation.

The KM statistical test estimates the expected number of failures in each group from the overall failure rate and the number of units under observation at each time point. As shown below, it suggests that drive survival is significantly different between manufacturers, with some (e.g. HGST) having far fewer observed failures than expected and others (e.g. ST) having far more than expected. The global Chi-squared value of 2535 is extremely unlikely to have arisen by chance alone:

| Group | N | Observed | Expected | (O-E)^2/E | (O-E)^2/V |
|---|---|---|---|---|---|
| manufact=HGST | 10424 | 100 | 515.21 | 3.35e+02 | 4.08e+02 |
| manufact=Hitachi | 13244 | 385 | 1533.11 | 8.60e+02 | 1.53e+03 |
| manufact=ST | 32714 | 3266 | 1798.14 | 1.20e+03 | 2.21e+03 |
| manufact=ST500LM012 | 377 | 22 | 8.89 | 1.93e+01 | 1.94e+01 |
| manufact=TOSHIBA | 254 | 9 | 9.15 | 2.59e-03 | 2.59e-03 |
| manufact=WDC | 3753 | 298 | 215.49 | 3.16e+01 | 3.34e+01 |

Chisq= 2535 on 5 degrees of freedom, p= 0
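As a quick sanity check on how to read the table above, the (O-E)^2/E column can be recomputed directly from the Observed and Expected columns. The sketch below simply copies those published values; the names `groups` and `contribution` are introduced here for illustration. Small differences from the printed column are to be expected because the Expected values in the table are rounded.

```python
# (N, observed failures, expected failures) per manufacturer, copied from the table above.
groups = {
    "HGST": (10424, 100, 515.21),
    "Hitachi": (13244, 385, 1533.11),
    "ST": (32714, 3266, 1798.14),
    "ST500LM012": (377, 22, 8.89),
    "TOSHIBA": (254, 9, 9.15),
    "WDC": (3753, 298, 215.49),
}


def contribution(observed, expected):
    """Per-group (O-E)^2/E term, as printed in the table."""
    return (observed - expected) ** 2 / expected


for name, (n, o, e) in groups.items():
    direction = "fewer" if o < e else "more"
    print(f"{name:12s} O={o:5d} E={e:8.2f} (O-E)^2/E={contribution(o, e):.3g} ({direction} than expected)")

# Observed and expected failures should add up to the same total across groups.
total_observed = sum(o for _, o, _ in groups.values())
total_expected = sum(e for _, _, e in groups.values())
print(total_observed, round(total_expected, 2))
```

Note that the global Chisq statistic is not simply the sum of this column; it also uses the variances and covariances of the (O-E) terms, which is why a separate (O-E)^2/V column is reported.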

The KM plot pattern seems much easier to understand and is not at all obvious from the table or bar graphs shown in the original article.

For individual drive models, the KM curves are complex but even more revealing:

The KM curves show that one particular Seagate model failed at an unusually high rate over the entire period, whereas the curves at the top of the plot show a group of very reliable drive models which had very few failures over the entire period of observation. These individual drive model curves are made from the same data as the manufacturer curves but reveal a great deal of interesting variation within each manufacturer's offerings - again suggesting that descriptive and summary statistics presented in the Backblaze blogs hide a lot of important and interesting complexity.

Again, the KM statistics show that the differences between models seen in the KM plot are statistically significant and unlikely to have arisen by chance alone.

| Group | N | Observed | Expected | (O-E)^2/E | (O-E)^2/V |
|---|---|---|---|---|---|
| model=HGST HMS5C4040ALE640 | 7168 | 73 | 285.7 | 1.58e+02 | 1.83e+02 |
| model=HGST HMS5C4040BLE640 | 3115 | 21 | 194.6 | 1.55e+02 | 1.67e+02 |
| model=Hitachi HDS5C3030ALA630 | 4662 | 98 | 519.4 | 3.42e+02 | 4.09e+02 |
| model=Hitachi HDS5C4040ALE630 | 2719 | 63 | 298.9 | 1.86e+02 | 2.05e+02 |
| model=Hitachi HDS722020ALA330 | 4774 | 175 | 530.7 | 2.38e+02 | 2.86e+02 |
| model=Hitachi HDS723030ALA640 | 1048 | 45 | 115.3 | 4.29e+01 | 4.45e+01 |
| model=ST3000DM001 | 4707 | 1705 | 305.5 | 6.41e+03 | 7.06e+03 |
| model=ST31500341AS | 787 | 216 | 45.1 | 6.47e+02 | 6.55e+02 |
| model=ST31500541AS | 2188 | 392 | 199.1 | 1.87e+02 | 1.98e+02 |
| model=ST4000DM000 | 21671 | 695 | 1025.8 | 1.07e+02 | 1.52e+02 |
| model=ST6000DX000 | 1906 | 26 | 27.6 | 9.20e-02 | 9.51e-02 |
| model=WDC WD10EADS | 550 | 53 | 54.7 | 5.38e-02 | 5.47e-02 |
| model=WDC WD30EFRX | 1267 | 114 | 73.6 | 2.22e+01 | 2.27e+01 |

Chisq= 8587 on 12 degrees of freedom, p= 0

### More complex models:

The KM plot is a robust, non-parametric method which is attractive because of the lack of assumptions about the data. More sophisticated methods such as Cox proportional hazards models require distributional or other assumptions, but allow adjustment for additional variables such as the kind of storage pod (see the Backblaze blogs), drive capacity, number of platters or other factors of interest. My view is that this is not going to be at all useful until a lot more data becomes available.
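For reference, the Cox model is usually written in the standard textbook form below. Nothing here was fitted to the Backblaze data, and the symbols are introduced only for illustration: $h_0(t)$ is an unspecified baseline hazard, $x$ is a vector of covariates (for example storage pod type or drive capacity) and $\beta$ holds their coefficients.

$$h(t \mid x) = h_0(t)\,\exp(\beta^\top x)$$

The key assumption is that covariates scale the hazard by a factor that is constant over time, so the hazard ratio between any two drives does not depend on $t$. Curves that cross, like the ST and WDC curves above, are a hint that this assumption may not hold for manufacturer.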

### Conclusions:

Other than as a consumer, I don't have any particular expertise on hard disk drives but I have made a successful career out of interpreting large scale data sets using appropriate statistical methods. I find the KM analysis much clearer and easier to interpret than the simple descriptive statistics presented by Backblaze and I hope they use more appropriate methods going forward. I'm happy to help if anyone cares to ask.

### Technical details and data source:

The Backblaze folk have done a great service to the community by making their data freely available for anyone willing to poke at it at https://www.backblaze.com/hard-drive-test-data.html. The data release which includes the third quarter of 2015 was downloaded in early February 2016 and is reported here. Here's a small sample of the 30,301,566 rows of raw data available from Backblaze. There's a separate CSV format file for each day of each year. These are stored under three per-year directories (e.g. 2013). This is from the start of "2013/2013-04-10.csv"

| date | serial_number | model | capacity_bytes | failure |
|---|---|---|---|---|
| 2013-04-10 | MJ0351YNG9Z0XA | Hitachi HDS5C3030ALA630 | 3000592982016 | 0 |
| 2013-04-10 | MJ0351YNG9WJSA | Hitachi HDS5C3030ALA630 | 3000592982016 | 0 |
| 2013-04-10 | MJ0351YNG9Z7LA | Hitachi HDS5C3030ALA630 | 3000592982016 | 0 |
| 2013-04-10 | MJ0351YNGAD37A | Hitachi HDS5C3030ALA630 | 3000592982016 | 0 |

Since I don't trust the smartdrive stats, I threw all those columns away and split out the manufacturer code and model from the "model" field.
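The exact splitting rule is not spelled out here, so the following is only a guess at one that would reproduce the groups seen in the tables above, including the "ST500LM012" anomaly. The function name `split_model` and the rule itself (first space-delimited token is the manufacturer, bare Seagate-style strings starting with "ST" are assigned to "ST") are assumptions, not the original script.

```python
def split_model(model_field):
    """Split a raw 'model' value into (manufacturer, model).

    Assumed rule: if the field has a space, the first token is the manufacturer;
    otherwise strings starting with 'ST' are treated as Seagate ('ST').
    """
    parts = model_field.split()
    if len(parts) >= 2:
        return parts[0], " ".join(parts[1:])
    if model_field.startswith("ST"):
        return "ST", model_field
    return "unknown", model_field


print(split_model("Hitachi HDS5C3030ALA630"))
print(split_model("ST3000DM001"))
# The reversed field shows how a model number can end up as a "manufacturer".
print(split_model("ST500LM012 HN"))
```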

The Kaplan-Meier plot and test statistics are available in most worthwhile statistical packages and I used the npsurv function (from the R rms package, which builds on the survival package) for the plots and statistics reported here. In order to improve the reliability of the model curves, drives with fewer than 500 observations were dropped.

A Python script was used to read all the files, keeping track of the appearance and disappearance of each unique drive as defined by a combination of model and serial_number, while processing each day's data in sequence. No database needed - Python easily handles this data as an in-memory dictionary, after dropping all the smartdrive columns. After reading all 30 million rows, a summary file was written containing a single row for each unique drive with the date it first appeared, the number of days it was under observation and a code indicating whether it failed or not. That script processed about 30,000 CSV rows a second on my oldish desktop, taking about 17 minutes for the entire dataset. The R script takes only a few seconds to perform the KM analysis and generate plots.
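The bookkeeping described above can be sketched roughly as follows. This is an illustration rather than the original script: it assumes comma-separated daily files with the column names shown in the sample, counts days under observation as last day minus first day plus one, and uses made-up serial numbers ("A1", "B2") in two tiny in-memory "files". The function name `summarise_drives` is also introduced here.

```python
import csv
import io
from datetime import date


def summarise_drives(daily_files):
    """Collapse daily rows into one record per unique (model, serial_number).

    daily_files: iterable of open text files, one per day, in date order.
    Returns {(model, serial): (first_seen, days_observed, failed)}.
    """
    drives = {}
    for handle in daily_files:
        for row in csv.DictReader(handle):
            key = (row["model"], row["serial_number"])
            day = date.fromisoformat(row["date"])
            if key not in drives:
                drives[key] = {"first": day, "last": day, "failed": 0}
            record = drives[key]
            record["last"] = max(record["last"], day)
            if row["failure"] == "1":
                record["failed"] = 1
    # Assumption: days observed = last day seen - first day seen + 1.
    return {
        key: (r["first"], (r["last"] - r["first"]).days + 1, r["failed"])
        for key, r in drives.items()
    }


# Two made-up daily files held in memory (hypothetical serial numbers).
header = "date,serial_number,model,capacity_bytes,failure\n"
day1 = io.StringIO(
    header
    + "2013-04-10,A1,Hitachi HDS5C3030ALA630,3000592982016,0\n"
    + "2013-04-10,B2,ST3000DM001,3000592982016,0\n"
)
day2 = io.StringIO(
    header
    + "2013-04-11,A1,Hitachi HDS5C3030ALA630,3000592982016,0\n"
    + "2013-04-11,B2,ST3000DM001,3000592982016,1\n"
)
summary = summarise_drives([day1, day2])
for key, value in summary.items():
    print(key, value)
```

A drive that simply stops appearing without a failure flag ends up with failed = 0, which is exactly the right censored case the KM analysis is designed to handle.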

I'm interested in this phrase:

"Since I don't trust the smartdrive stats, I threw all those columns away"

Why not? Is there good statistical basis not to trust them? I generally find the uncorrectable error rates and realloc sectors to be good (if fuzzy) indicators of failure, but most of my data points are based on the spectacularly unreliable Seagate Barracudas.

Thanks for the comment.

Four reasons why the smartdrive stats are not used here:

1) I'm not smart enough to understand how they can be used to provide useful insight into this data.

2) I wanted the simplest models that provide interesting insight. Cox and other models incorporating continuous covariates are much harder to interpret.

3) I read some discouraging comments about the smartdrive stats in some of the Backblaze data blogs.

4) The smartdrive stats have far too much missing data for me to trust them or to want to fool around with imputation to make KM or Cox models possible.

Please feel free to share your own findings if they help make the data easier to understand.

Thank you for providing this analysis. I find the legends for the plot very hard to use. I think the usability of this information could be improved by adding a label on each of the lines.

I also found the legend very difficult to read, and the colours too similar. I've just discovered your interesting post due to going through another research/buying cycle for a couple of new drives.

There's a legend which works for me but please feel free to send me a pull request to improve the plots so they're labelled the way you prefer. Source at https://github.com/fubar2/backblazeKM

GGally::ggsurv can add texts along the curves, which would be more readable for large plots IMHO. Thanks for your analysis.

Thank you for this interesting insight!

I fully agree that your analysis is much easier to interpret and reveals more detail than the original analysis done by Backblaze. I have never used or seen KM-plots but it seems to be a very handy tool.

Thanks again and thumbs up.

For Nathan and other people like me who struggle to distinguish the colours in the models plot: https://dl.dropboxusercontent.com/u/242368/km_model_feb2015_rl.png (I only spent time on the top survivors)

Did you see the updates at http://bioinformare.blogspot.com.au/2016/05/survival-analysis-of-hard-disk-drive.html ? I reran the scripts with the Q1 2016 data added. More data = more reliable estimates.

Oh! and thanks for your effort and mostly the fine idea of using KM-plots for such cases Lazarus. Vastly better than the simple statistics we usually see.

Examining the nations of the brands is interesting. The Japanese brands perform better than the USA brands. So WD bought HGST, the best performer. HGST is now totally owned in every way by WD, so will the worst performer and the best performer now move towards the mean?

Outsiders like myself are wondering if and when South Korea and China will enter these charts. Unfortunately these charts do not cover the nations of manufacture of the products, ... yet.

Ownership and brand-origin of the brands seem to show patterns in the above charts. I am guessing that all items are made in factories based in East Asia, including Thailand, Singapore, Vietnam & China? Perhaps the nation of final assembly of the metal units might show interesting patterns?

Developed nations like Australia (where we live now), the USA, etc. have lost most of their factory creativity. Will East Asia be able to better our abilities?

I can say as an end user I agree more with these graphical representations than Backblaze's. Seagates are absolute garbage, lost everything I had.

SMART was done to hide data rather than make it public. In the old days, when you bought drives they came with a defect list which you manually added into a bad sector table. Drive manufacturers realized having a list of how many bad spots were on their products was not a good marketing move. So SMART was born to hide them. It was spun to give "early" warning. But I have never once seen a SMART alert warn of a drive failure. But plenty proclaim a drive fine that had problems. I have been sysadmin and network admin for more years than I care to admit so have seen plenty of drives (nowhere near as many as Backblaze but still enough).

Ross - really interesting analysis. I'll see if I can teach myself some R and replicate your work. I'm a geophysicist with a decent maths background so should be able to get into it.

Quick question - can the plots be altered so the symbols (+) colour matches the line colour?

It'll be interesting to see how the new 8TB Seagate drives behave. First decent numbers just in and the infant mortality rates are showing promise.

Any chance seeing the K-M analysis including the 2016 data?

SVG graphs would make this a lot easier to read :|

Thanks for sharing the article. Keep sharing more with us.