Jon G. Elerath
Network Appliance, Inc.
elerath@netapp.com
Michael Pecht
CALCE
University of Maryland
College Park, MD 20742
Abstract:
A flexible model for estimating the reliability of RAID storage systems is presented. This model corrects errors associated with the common assumption that system times to failure follow a homogeneous Poisson process. Separate generalized failure distributions are used to model catastrophic failures and usage-dependent data corruptions for each hard drive. Catastrophic failure restoration is represented by a three-parameter Weibull distribution, so the model can include a minimum time to restore as a function of data transfer rate and hard drive storage capacity. Data can be scrubbed as a background operation to eliminate corrupted data that, in the event of a simultaneous catastrophic failure, results in double disk failures. Field-based time-to-failure data and mathematical justification for a new model are presented. Model results have been verified and predict between 2 and 1,500 times as many double disk failures as estimated using the current mean time to data loss method.
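As an illustration of the restoration model described in the abstract, the following minimal sketch shows how a three-parameter Weibull can carry a minimum restore time through its location parameter. The function names, the units, and the choice to compute the minimum as capacity divided by transfer rate are assumptions made for this example; they are not taken from the article, and the parameter values in any call are hypothetical.

```python
import math
import random


def min_restore_hours(capacity_gb, rate_mb_per_s):
    """Assumed lower bound on restore time: hours needed to transfer the
    full drive capacity at a sustained rate (1 GB taken as 1000 MB)."""
    seconds = capacity_gb * 1000.0 / rate_mb_per_s
    return seconds / 3600.0


def weibull3_cdf(t, gamma, eta, beta):
    """CDF of a three-parameter Weibull: location gamma, scale eta, shape beta.
    No restore can complete before the location parameter gamma."""
    if t <= gamma:
        return 0.0
    return 1.0 - math.exp(-(((t - gamma) / eta) ** beta))


def sample_restore_hours(gamma, eta, beta, rng):
    """Draw one restore time by shifting a two-parameter Weibull sample.
    random.weibullvariate takes (scale, shape) in that order."""
    return gamma + rng.weibullvariate(eta, beta)
```

With a hypothetical 500 GB drive restored at 50 MB/s, `min_restore_hours(500, 50)` gives a floor of roughly 2.8 hours, and every sampled restore time is at least that long regardless of the scale and shape chosen.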
Complete article is available to CALCE Consortium Members.