Reliability, availability and serviceability

Reliability, availability and serviceability (RAS), also known as reliability, availability, and maintainability (RAM), is a computer hardware engineering term involving reliability engineering, high availability, and serviceability design. The phrase was originally used by IBM to describe the robustness of its mainframe computers.{{cite book|title=Reliable computer systems: design and evaluation|url=https://archive.org/details/reliablecomputer00siew|first1=Daniel P.|last1=Siewiorek|first2=Robert S.|last2=Swarz|authorlink2=Robert S. Swarz|year=1998|page=[https://archive.org/details/reliablecomputer00siew/page/508 508]|publisher=Taylor & Francis |isbn=9781568810928 }} "The acronym RAS (reliability, availability and serviceability) came into widespread acceptance at IBM as the replacement for the subset notion of recovery management."{{cite journal|title=Data processor, Issues 13-17|author=Data Processing Division, International Business Machines Corp., 1970|year=1970}} "The dependability [...] experienced by other System/370 users is the result of a strategy based on RAS (Reliability-Availability-Serviceability)"

Computers designed with higher levels of RAS have many features that protect data integrity and help them stay available for long periods of time without failure.{{cite web|url=http://mercury.pr.erau.edu/%7Esiewerts/extra/papers/IBM-out-of-print/big-iron-2.pdf|first=Sam|last=Siewert|title=Big iron lessons, Part 2: Reliability and availability: What's the difference?|date=Mar 2005}} This data integrity and uptime is a particular selling point for mainframes and fault-tolerant systems.

Definitions

While RAS originated as a hardware-oriented term, systems thinking has extended the concept of reliability-availability-serviceability to systems in general, including software.

For example:

{{cite book|last1=Laros III|first1=James H.|others=et al.|title=Energy-Efficient High Performance Computing: Measurement and Tuning|url=https://books.google.com/books?id=Wq_XngsfAPQC|series=SpringerBriefs in Computer Science|date=4 September 2012|publisher=Springer Science & Business Media|publication-date=2012|page=8|isbn=9781447144922|accessdate=2014-07-08|quote=Historically, Reliability Availability and Serviceability (RAS) systems were commonly provided by vendors on mainframe class systems. [...] The RAS system shall be a systematic union of software and hardware for the purpose of managing and monitoring all hardware and software components of the system to their individual potential.}}

Reliability can be defined as the probability that a system will produce correct outputs up to some given time t.{{cite book|author1=E.J. McClusky |author2=S. Mitra |name-list-style=amp |title="Fault Tolerance" in Computer Science Handbook 2ed. ed. A.B. Tucker. CRC Press|year=2004}} Reliability is enhanced by features that help to avoid, detect and repair hardware faults. A reliable system does not silently continue and deliver results that include uncorrected corrupted data. Instead, it detects and, if possible, corrects the corruption, for example by retrying an operation for transient (soft) or intermittent errors. For uncorrectable errors, it isolates the fault and reports it to higher-level recovery mechanisms (which may fail over to redundant replacement hardware, etc.), or it halts the affected program or the entire system and reports the corruption. Reliability can be characterized in terms of mean time between failures (MTBF), with reliability = exp(−t/MTBF).
Availability means the probability that a system is operational at a given time, i.e. the amount of time a device is actually operating as the percentage of total time it should be operating. High-availability systems may report availability in terms of minutes or hours of downtime per year. Availability features allow the system to stay operational even when faults do occur. A highly available system would disable the malfunctioning portion and continue operating at a reduced capacity. In contrast, a less capable system might crash and become totally nonoperational. Availability is typically given as a percentage of the time a system is expected to be available, e.g., 99.999 percent ("five nines").
Serviceability or maintainability is the simplicity and speed with which a system can be repaired or maintained; if the time to repair a failed system increases, then availability will decrease. Serviceability includes various methods of easily diagnosing the system when problems arise. Early detection of faults can decrease or avoid system downtime. For example, some enterprise systems can automatically call a service center (without human intervention) when they experience a fault. The traditional focus has been on making the correct repairs with as little disruption to normal operations as possible.
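The three definitions can be tied together numerically. The following is a minimal sketch, not taken from the cited sources: it uses the exp(−t/MTBF) relation given above, and it assumes the common steady-state relation availability = MTBF / (MTBF + MTTR), where MTTR (mean time to repair) stands in for serviceability. The function names and the 8760-hour year are illustrative assumptions.

```python
import math

HOURS_PER_YEAR = 365 * 24  # 8760 hours; leap years ignored (assumption)


def reliability(t_hours, mtbf_hours):
    """Probability of running without failure up to time t.

    Assumes a constant failure rate, i.e. reliability = exp(-t / MTBF).
    """
    return math.exp(-t_hours / mtbf_hours)


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability, assuming MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_minutes_per_year(avail):
    """Convert an availability fraction into expected downtime per year."""
    return (1.0 - avail) * HOURS_PER_YEAR * 60


if __name__ == "__main__":
    # "Five nines" corresponds to roughly 5.26 minutes of downtime per year.
    print(downtime_minutes_per_year(0.99999))
    # A longer repair time lowers availability for the same MTBF.
    print(availability(10_000, 1), availability(10_000, 10))
    # After one MTBF has elapsed, reliability has dropped to exp(-1).
    print(reliability(10_000, 10_000))
```

Under these assumptions, the sketch shows why serviceability feeds into availability: with MTBF held fixed, every increase in repair time reduces the fraction of time the system is up.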

Note the distinction between reliability and availability: reliability measures the ability of a system to function correctly, including avoiding data corruption, whereas availability measures how often the system is available for use, even though it may not be functioning correctly. For example, a server may run forever and so have ideal availability, but may be unreliable, with frequent data corruption.

{{cite book|last1=Spencer|first1=Richard H.|last2=Floyd|first2=Raymond E.|title=Perspectives on Engineering|date=11 July 2011|url=https://books.google.com/books?id=dYVHOds2vQoC|location=Bloomington, Indiana|publisher=AuthorHouse|publication-date=2011|page=33|isbn=9781463410919|accessdate=2014-05-05|quote=[...] a system server may have excellent availability (runs forever), but continues to have frequent data corruption (not very reliable).}}

Failure types

Physical faults can be temporary or permanent:

Permanent faults lead to a continuing error and are typically due to some physical failure such as metal electromigration or dielectric breakdown.
Temporary faults include transient and intermittent faults.
Transient (a.k.a. soft) faults lead to independent one-time errors and are not due to permanent hardware faults: examples include alpha particles flipping a memory bit, electromagnetic noise, or power-supply fluctuations.
Intermittent faults occur due to a weak system component, e.g. circuit parameters degrading, leading to errors that are likely to recur.

Failure responses

Transient and intermittent faults can typically be handled by detection and correction, e.g., by error-correcting codes (ECC) or instruction replay (see below). Permanent faults lead to uncorrectable errors, which can be handled by switching to duplicate hardware, e.g., processor sparing, or by passing the uncorrectable error to higher-level recovery mechanisms. A successfully corrected intermittent fault can also be reported to the operating system (OS) to provide information for predictive failure analysis.
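The response policy described above (retry for temporary faults, escalate when the fault persists, and report corrected errors) can be sketched in software. This is an illustrative model only, not how any particular processor implements instruction replay; `run_with_retry`, `UncorrectableError`, the `check` callback and the `log` list are hypothetical names introduced for the example.

```python
class UncorrectableError(Exception):
    """Raised when retries are exhausted; meant for higher-level recovery."""


def run_with_retry(operation, check, max_retries=3, log=None):
    """Run operation() and re-run it while check(result) reports an error.

    A result that passes after one or more retries models a corrected
    transient or intermittent fault; the retry count is appended to `log`
    so that it could feed predictive failure analysis.
    A result that never passes models a permanent fault and is escalated.
    """
    failed_attempts = 0
    for _ in range(max_retries + 1):
        result = operation()
        if check(result):
            if failed_attempts and log is not None:
                log.append(("corrected", failed_attempts))
            return result
        failed_attempts += 1
    raise UncorrectableError(
        f"operation still failing after {max_retries} retries"
    )
```

A caller would catch `UncorrectableError` and apply the higher-level response, such as switching to spare hardware or halting the affected program, rather than continuing with a result known to be corrupt.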

Hardware features


Example hardware features for improving RAS include the following, listed by subsystem:

Processor:
Processor instruction error detection (e.g. residue checking of results{{cite web|url=http://www.acsel-lab.com/arithmetic/papers/ARITH20/ARITH20_Lipetz.pdf|archive-url=https://wayback.archive-it.org/all/20120124194631/http://www.acsel-lab.com/arithmetic/papers/ARITH20/ARITH20_Lipetz.pdf|url-status=dead|archive-date=2012-01-24|title=Self Checking in Current Floating-Point Units. Proceedings of 2011 20th IEEE Symposium on Computer Arithmetic|author1=Daniel Lipetz|author2=Eric Schwarz|name-list-style=amp|year=2011|access-date=2012-05-06}}) with instruction retry e.g. alternative processor recovery in IBM mainframes,{{cite web|citeseerx = 10.1.1.85.5994|url=https://web.stanford.edu/class/ee392c/papers/fault/ibm_faults_mainframe.spainhower.ibmjrd.1999.pdf|title=IBM S/390 parallel enterprise server G5 fault tolerance: a historical perspective. IBM Journal of Research and Development. Volume 43 Issue 5|author1=L. Spainhower |author2=T. A. Gregg |name-list-style=amp |date=September 1999}} or "Instruction replay technology" in Itanium systems.{{cite web|title=Intel Instruction Replay Technology Detects and Corrects Errors|url=http://www.intel.com/content/www/us/en/processors/itanium/itanium-9500-reliability-mission-critical-applications-paper.html|accessdate=2012-12-07}}
Processors running in lock-step to perform master-checker or voting schemes.
Machine Check Architecture and ACPI Platform Error Interface to report errors to the OS.
Memory:
Parity or ECC (including single device correction) protection of memory components (cache and main memory); bad cache line disabling; memory scrubbing; memory sparing, memory mirroring;{{cite web|url=http://h20000.www2.hp.com/bc/docs/support/SupportManual/c00256987/c00256987.pdf|title=Memory technology evolution: an overview of system memory technologies Technology brief, 9th edition (page 8)|author=HP|url-status=dead|archiveurl=https://web.archive.org/web/20110724013507/http://h20000.www2.hp.com/bc/docs/support/SupportManual/c00256987/c00256987.pdf|archivedate=2011-07-24}} bad page offlining; redundant bit steering; redundant array of independent memory (RAIM).
I/O:
Cyclic redundancy check checksums for data transmission/retry and data storage, e.g. PCI Express (PCIe) Advanced Error Reporting (AER),{{cite web|url=http://www.intel.com/content/www/us/en/io/pci-express/pci-express-initiatives-and-technology-article.html|title=PCI Express Provides Enterprise Reliability, Availability, and Serviceability|author=Intel Corp.|year=2003}} redundant I/O paths.
Storage:
RAID configurations for hard disk drive and solid-state drive storage.
Journaling file systems for file repair after crashes.
Checksums on both data and metadata, and background scrubbing.
Self-Monitoring, Analysis, and Reporting Technology for hard disk drive and solid-state drive.
Power/cooling:
Duplicating components to avoid single points of failure, e.g., power-supplies.
Over-designing the system for the specified operating ranges of clock frequency, temperature, voltage, vibration.
Temperature sensors to throttle operating frequency when temperature goes out of specification.
Surge protector, uninterruptible power supply, auxiliary power.
System:
Hot swapping of components: CPUs, memory modules, hard disk drives and solid-state drives.
Predictive failure analysis to predict which intermittent correctable errors will eventually lead to hard uncorrectable errors.
Partitioning/domaining of computer components to allow one large system to act as several smaller systems.
Virtual machines to decrease the severity of operating system software faults.
Redundant I/O domains{{cite web|title=Best Practices for Data Reliability with Oracle VM Server for SPARC|url=http://www.oracle.com/technetwork/articles/systems-hardware-architecture/vmsrvrsparc-reliability-163931.pdf|accessdate=2013-07-02}} or I/O partitions{{cite web|title=IBM Power Redundancy considerations|url=http://pic.dhe.ibm.com/infocenter/powersys/v3r1m5/index.jsp?topic=/iphb1/iphb1_vios_virtualioserverpartition.htm|accessdate=2013-07-02}} for providing virtual I/O to guest virtual machines.
Computer clustering capability with failover capability, for complete redundancy of hardware and software.
Dynamic software updating to avoid the need to reboot the system for a kernel software update, for example Ksplice under Linux.
Independent management processor for serviceability: remote monitoring, alerting and control.
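As an illustration of the ECC protection listed under Memory, the following is a minimal sketch of a textbook Hamming(7,4) code, which corrects any single flipped bit in a 7-bit codeword. It is an assumption-laden teaching example: memory controllers in real systems use wider codes and additional detection capability, and the function names here are hypothetical.

```python
def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword: p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, error_position).

    error_position is 0 if no error was detected, otherwise the 1-based
    position of the single bit that was flipped and has been corrected.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_position = s1 + 2 * s2 + 4 * s3  # syndrome read as a binary number
    if error_position:
        c[error_position - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], error_position


if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    word[4] ^= 1  # simulate a soft error in codeword position 5
    print(hamming74_decode(word))  # data recovered, error position reported
```

The reported error position is what makes the scheme useful beyond correction: a system could log it, in the spirit of the memory scrubbing and predictive failure analysis features above, to spot locations that fail repeatedly.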

Fault-tolerant designs extended the idea by making RAS the defining feature of their computers for applications like stock market exchanges or air traffic control, where system crashes would be catastrophic. Fault-tolerant computers (e.g., see Tandem Computers and Stratus Technologies), which tend to have duplicate components running in lock-step for reliability, have become less popular due to their high cost. High-availability systems, using distributed computing techniques like computer clusters, are often used as cheaper alternatives.{{citation needed|date=December 2012}}

References

External links

[http://labs.hoffmanlabs.com/node/95 Itanium Reliability, Availability and Serviceability (RAS) Features] Overview of RAS features in general and specific features of the Itanium processor.
[http://www-05.ibm.com/de/events/breakfast/pdf/POWER7_RAS_Features_Feb_2012.pdf POWER7 System RAS Key Aspects of Power Systems Reliability, Availability, and Serviceability. Daniel Henderson, Jim Mitchell, and George Ahrens. February 10, 2012] Overview of RAS features in Power processors.
[http://www.intel.com/content/www/us/en/servers/reliability-availability-and-serviceability-for-the-always-on-enterprise-paper.html Intel Corp. Reliability, Availability, and Serviceability for the Always-on Enterprise (appendix B)] and [http://www.intel.com/content/www/us/en/processors/xeon/xeon-e7-family-ras-server-paper.html Intel Xeon Processor E7 Family: supporting next generation RAS servers. White paper.] Overview of RAS features in Xeon processors.
[http://www-01.ibm.com/support/docview.wss?uid=isg2c24bd608371def398525776100545fcb&aid=1 zEnterprise 196 System Overview. IBM Corp. (Chapter 10)] Overview of RAS features of IBM z196 processor and zEnterprise 196 server.
[http://www.oracle.com/technetwork/server-storage/sun-sparc-enterprise/documentation/o13-026-m5-32-ras-1923495.pdf Maximizing Application Reliability and Availability with the SPARC M5-32 Server] RAS features of Oracle’s SPARC M5-32 server
