A Short Survey of Fault-Tolerant Design for Computer Systems
Abstract: Highly reliable computer systems are the foundation of quality of service (QoS) for information services. Since the birth of ENIAC, the first electronic computer, reliability has been one of the major challenges facing computer systems. Fault-tolerant design is an effective route to reliability, and it is also a typical systems discipline that spans multiple design layers of the computing stack. From the devices at the bottom to the application programs at the top, every layer offers design space for improving reliability, and each layer addresses its own specific reliability challenges. Following a bottom-up order through these logical layers, this article briefly surveys the classical design approaches.