Facebook Study of Failures/Errors in NAND Flash Memory based SSDs

The paper is "A Large-Scale Study of Flash Memory Failures in the Field", published at SIGMETRICS 2015.

This is the first large-scale study of flash-based SSD reliability from the enterprise world. The study is based on data collected from a majority of the flash-based SSDs in Facebook’s server fleet over nearly four years.

The collected data are first stored in registers on the SSDs, much like the SMART data stored by some SSDs. The data are then aggregated in parallel into Hive by MapReduce jobs that run every hour. The analysis is done in R.

While four years is long enough to make statistical observations, it also means the observations may not reflect state-of-the-art SSDs. Most of the observations confirm what is expected: sparse writes consume more DRAM and contribute more to SSD failures, and high temperatures increase the SSD failure rate.

A prior academic study of SSD error patterns appeared at DATE 2012.

SSD Failure and Error Statistics

There are three conclusions about the statistical distribution of SSD failures and errors.

SSD failure rates differ across platforms despite comparable amounts of data written and read, suggesting that other factors play a role in determining the occurrence of uncorrectable errors.
The older platforms have a higher error rate than the younger platforms, suggesting that the incidence of uncorrectable errors increases as SSDs are utilized more.
The total number of errors per SSD is highly skewed, with a small fraction of SSDs accounting for a majority of the errors.

[Figure: ssd_failure_skew]
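
One way to quantify this kind of skew is to ask what fraction of SSDs is needed to cover a given share of all errors. The sketch below is only an illustration: the function name `error_concentration` and the error counts are made up and are not data or code from the paper.

```python
def error_concentration(errors_per_ssd, share=0.5):
    """Return the smallest fraction of SSDs that together account for
    at least `share` of all observed errors."""
    total = sum(errors_per_ssd)
    if total == 0:
        return 0.0
    running = 0
    # Walk the SSDs from most to fewest errors and stop once the
    # cumulative error count reaches the requested share.
    for rank, count in enumerate(sorted(errors_per_ssd, reverse=True), start=1):
        running += count
        if running >= share * total:
            return rank / len(errors_per_ssd)
    return 1.0


# Hypothetical error counts for ten SSDs (illustrative only).
counts = [0, 0, 0, 0, 0, 0, 1, 2, 7, 90]
print(error_concentration(counts))
```

A small return value means a few devices dominate the error count; a value near `share` means errors are spread evenly.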

Failure and Error Correlations

The data suggest that:

An SSD that has had an error in the past is highly likely to continue to have errors in the future.
One SSD failing in a machine increases the probability that the other SSD in that machine fails, suggesting that shared operational conditions contribute to SSD failures.
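
The first correlation can be checked by comparing the chance of an error after an error with the chance of an error after a clean period. The following is a minimal sketch under assumed inputs: each SSD is represented by a hypothetical list of 0/1 flags, one per observation period, and the sample histories are invented for illustration.

```python
def conditional_error_rates(histories):
    """Return (P(error | error in previous period),
               P(error | no error in previous period))."""
    # previous state -> [number of transitions, transitions ending in an error]
    counts = {True: [0, 0], False: [0, 0]}
    for history in histories:
        for prev, curr in zip(history, history[1:]):
            counts[bool(prev)][0] += 1
            counts[bool(prev)][1] += int(bool(curr))

    def rate(prev_state):
        n, k = counts[prev_state]
        return k / n if n else float("nan")

    return rate(True), rate(False)


# Hypothetical per-period error flags for four SSDs (illustrative only).
histories = [
    [0, 0, 0, 0],
    [0, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
]
after_error, after_clean = conditional_error_rates(histories)
print(after_error, after_clean)
```

If the first rate is much larger than the second, past errors are a useful predictor of future ones. The same comparison could be applied to pairs of SSDs in one machine by conditioning on the neighbor's state instead.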

Early Detection Period

As expected, writes are the more significant contributor to SSD failures. The new observation related to data written is the existence of an early detection period. During this period, which occurs when SSDs are young, there is an initially high failure rate among devices. This could be because weaker cells generate uncorrectable errors much more quickly. After this pool of weaker cells is exhausted, the overall error rate starts decreasing.

[Figure: early_detection]
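
A trend like this can be looked for by grouping SSDs by the amount of data written and computing the failure rate in each group. The sketch below assumes a hypothetical input of `(data_written_tb, failed)` pairs; the bin width and the sample values are made up and do not come from the paper.

```python
from collections import defaultdict


def failure_rate_by_data_written(ssds, bin_width_tb):
    """Group SSDs into bins of data written and return
    {bin start in TB: fraction of SSDs in that bin that failed}."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for written_tb, failed in ssds:
        bucket = int(written_tb // bin_width_tb)
        totals[bucket] += 1
        failures[bucket] += int(failed)
    return {
        bucket * bin_width_tb: failures[bucket] / totals[bucket]
        for bucket in sorted(totals)
    }


# Hypothetical fleet sample (illustrative only).
sample = [
    (1, True), (2, False), (3, True), (4, False),
    (12, False), (15, False), (17, True), (18, False),
]
print(failure_rate_by_data_written(sample, bin_width_tb=10))
```

A higher rate in the lowest bins followed by a drop would be consistent with weak cells being weeded out early, although a real analysis would also need to control for device age and platform.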

DRAM Buffer Usage

The SSDs we examine use the DRAM buffer less when data is densely allocated (e.g., contiguous data) and use the DRAM buffer more when data is sparsely allocated (e.g., non-contiguous data).

We conclude that small, sparse writes negatively affect SSD failure rates the most for sparse data mappings (e.g., non-contiguous data).
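
To see why sparse data can need more mapping metadata, consider a simplified model in which each contiguous run of logical blocks costs one mapping entry. This extent-based model is an assumption made for illustration; the actual mapping scheme of the SSDs in the study is not described here.

```python
def mapping_entries(logical_blocks):
    """Count contiguous runs of logical block addresses, assuming
    (hypothetically) one mapping entry per run."""
    blocks = sorted(set(logical_blocks))
    if not blocks:
        return 0
    runs = 1
    for prev, curr in zip(blocks, blocks[1:]):
        if curr != prev + 1:  # a gap starts a new run
            runs += 1
    return runs


dense = range(100, 108)        # 8 adjacent blocks
sparse = range(100, 180, 10)   # 8 blocks, each 10 apart
print(mapping_entries(dense), mapping_entries(sparse))
```

Both layouts hold the same number of blocks, but under this model the sparse one needs one entry per block while the dense one needs a single entry.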

Temperature Sensitivity

SSD failure rates are sensitive to temperature.

In flash cells, higher temperatures have been shown to cause cells to age more quickly due to the temperature-activated Arrhenius effect.
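
The Arrhenius model expresses how much faster a temperature-activated process runs at a higher temperature relative to a reference temperature. The sketch below computes that acceleration factor; the activation energy passed in is an assumed parameter that depends on the failure mechanism, and the value used in the example is illustrative rather than taken from the paper.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration(temp_ref_c, temp_hot_c, activation_energy_ev):
    """Acceleration factor of a temperature-activated process at
    temp_hot_c relative to temp_ref_c (both in degrees Celsius)."""
    t_ref = temp_ref_c + 273.15  # convert to kelvin
    t_hot = temp_hot_c + 273.15
    exponent = activation_energy_ev / BOLTZMANN_EV_PER_K * (1.0 / t_ref - 1.0 / t_hot)
    return math.exp(exponent)


# Assumed activation energy of 1.1 eV, for illustration only.
print(arrhenius_acceleration(40, 55, activation_energy_ev=1.1))
```

A factor above 1 means the process ages faster at the hotter temperature; equal temperatures give a factor of exactly 1.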

The paper observes three phenomena regarding temperature:

Temperature-sensitive with increasing failure rate. In these platforms, few or no machines have been throttled based on temperature.
Less temperature-sensitive. These platforms throttle their SSDs more aggressively across a range of temperatures.
Temperature-sensitive with decreasing failure rate. The SSDs in these platforms are still in the early failure period. Their failure rates may be affected by other factors.