Abstract: Highly reliable computer systems are the foundation of quality of service (QoS) in IT services. Since the birth of ENIAC, widely regarded as the first general-purpose electronic computer, reliability has been one of the major challenges in computer design. Fault tolerance is a major approach to achieving high reliability. It is also a systematic discipline that spans multiple logical layers of the classical computing stack. Design opportunities arise at every layer, from the device layer at the bottom to the application layer at the top, and each layer faces its own specific design challenges. Proceeding bottom-up, we briefly survey the classical approaches to design for reliability.