The statistical bases for current models of RAID reliability are reviewed, and a highly accurate alternative is provided and justified. The new model corrects the statistical errors that arise from the pervasive assumption that system (RAID group) times to failure follow a homogeneous Poisson process, and those that arise from the assumption that component times to failure and times to restore are exponentially distributed. The statistical justification for the new model draws on the theory of reliability of repairable systems. Four critical component distributions are developed from field data: time to catastrophic failure, time to reconstruction and restoration, time to read error, and time to disk data scrub. The model predicts between 2 and 1,500 times as many double disk failures as estimates made using the mean time-to-data-loss (MTTDL) method. Model results are compared to system-level field data for a RAID group of 14 drives; they show excellent correlation and greater accuracy than either MTTDL or Markov models.
Index Terms: Monte Carlo simulation, redundant systems, reliability modeling, repairable systems
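To make the contrast between the two approaches concrete, the following is a minimal Python sketch, not the authors' model. It compares the textbook MTTDL estimate for a single-parity group with a heavily simplified Monte Carlo simulation that counts double disk failures when times to failure and times to restore are not exponential. Everything in it is an assumption made for illustration: the function names, the choice of the Weibull family (the abstract does not say which distributions were fit), and all parameter values. It also ignores read errors and data scrubs, which the model described above includes.

```python
import random


def mttdl_hours(mttf, mttr, n_drives):
    """Textbook MTTDL for a single-parity group of n_drives drives.

    Assumes constant failure and repair rates (exponential times):
    MTTDL = MTTF^2 / (N * (N - 1) * MTTR).
    """
    return mttf ** 2 / (n_drives * (n_drives - 1) * mttr)


def expected_ddf_mttdl(mttf, mttr, n_drives, mission_hours, n_groups):
    """Expected double disk failures if data loss were a homogeneous Poisson process."""
    return n_groups * mission_hours / mttdl_hours(mttf, mttr, n_drives)


def simulate_group(n_drives, mission_hours, ttf_scale, ttf_shape,
                   ttr_scale, ttr_shape, rng):
    """Count double disk failures in one simulated group (illustrative only).

    Each drive slot fails after a Weibull-distributed time and is restored
    after a Weibull-distributed time. A failure that occurs while another
    drive is still being restored counts as a double disk failure.
    """
    next_fail = [rng.weibullvariate(ttf_scale, ttf_shape) for _ in range(n_drives)]
    restore_end = -1.0  # time at which the latest pending restoration completes
    ddf = 0
    while True:
        slot = min(range(n_drives), key=next_fail.__getitem__)
        t = next_fail[slot]
        if t > mission_hours:
            return ddf
        if t < restore_end:  # another drive is still being reconstructed
            ddf += 1
        done = t + rng.weibullvariate(ttr_scale, ttr_shape)
        restore_end = max(restore_end, done)
        # The replacement drive starts a fresh life once restoration completes.
        next_fail[slot] = done + rng.weibullvariate(ttf_scale, ttf_shape)


def simulate_ddf(n_groups, seed=0, **group_params):
    """Total double disk failures across n_groups independent groups."""
    rng = random.Random(seed)
    return sum(simulate_group(rng=rng, **group_params) for _ in range(n_groups))


if __name__ == "__main__":
    # Hypothetical parameters, chosen only to exercise the code.
    params = dict(n_drives=14, mission_hours=10 * 8760,
                  ttf_scale=500_000.0, ttf_shape=1.2,
                  ttr_scale=12.0, ttr_shape=2.0)
    print("MTTDL estimate:", expected_ddf_mttdl(500_000.0, 12.0, 14, 10 * 8760, 1000))
    print("Simulated DDFs:", simulate_ddf(1000, seed=1, **params))
```

Setting both shape parameters to 1 reduces the Weibull draws to exponential ones, which is the case the MTTDL formula assumes; other shape values show how the two counts can diverge. The numbers this sketch produces depend entirely on the hypothetical parameters and should not be read as reproducing the results reported above.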
Complete article is available to CALCE Consortium Members.
© IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.