FAULT-TOLERANT

SEMINAR ON FAULT-TOLERANT: Embedded supercomputing is becoming essential for complex scientific and industrial applications, and compute-intensive systems are moving to parallel platforms to replace traditional single-processor ones. Reliability and fault tolerance therefore become critical to the performance of parallel systems. Faults are no longer merely unpleasant; depending on the application, they can be dangerous or even catastrophic. The transition to parallel systems offers application developers new opportunities but also presents new dangers. The very feature that makes a parallel system efficient, many cooperating processors, can also be its fatal weakness. More processors means more faults, and the failure of a single processor can crash the entire system.
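To make the "more processors, more faults" point concrete, here is a minimal sketch. It assumes, purely for illustration, that processors fail independently, that each one survives a run with the same probability, and that the system has no fault tolerance at all, so it works only if every processor works. The function name and the sample values are hypothetical.

```python
def system_reliability(node_reliability: float, node_count: int) -> float:
    """Probability that every node survives.

    Assumption: failures are independent and there is no fault tolerance,
    so one failed node brings the whole system down.
    """
    if not 0.0 <= node_reliability <= 1.0:
        raise ValueError("node_reliability must be in [0, 1]")
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return node_reliability ** node_count


if __name__ == "__main__":
    # Hypothetical per-node reliability of 0.999 for one run
    for nodes in (1, 16, 64, 256):
        print(nodes, round(system_reliability(0.999, nodes), 4))
```

Under these assumptions the reliability of the whole system falls as nodes are added, even though each individual node is very reliable.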
An important factor is communication. Inter-processor communication, which coordinates the processors and multiplies their power, is the key to a successful parallel system. Distributed-memory multiprocessor systems rely on messages passed between nodes. Message-passing applications use synchronous (blocking) or asynchronous (non-blocking) communication to keep parallel tasks coherent. In synchronous mode, problems arise when communication links or communicating threads are in an erroneous state (broken links, threads stuck in infinite loops, and so on). When such errors occur, the communicating threads block because the communication cannot be started or completed.
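One common way to keep a blocking receive from hanging forever is to bound it with a timeout. The sketch below illustrates that idea only; it is an assumption about how such a safeguard might look, not a mechanism described in the text. A `queue.Queue` stands in for a link between two nodes, and the names `LinkTimeout`, `receive`, `stuck_sender`, and `try_receive_from` are hypothetical.

```python
import queue
import threading
import time


class LinkTimeout(Exception):
    """Raised when the peer does not deliver a message before the deadline."""


def receive(link: queue.Queue, timeout_s: float):
    """Blocking receive that gives up after timeout_s seconds."""
    try:
        return link.get(timeout=timeout_s)
    except queue.Empty:
        raise LinkTimeout(f"no message within {timeout_s} s") from None


def healthy_sender(link: queue.Queue) -> None:
    link.put("result")


def stuck_sender(link: queue.Queue) -> None:
    # Simulates a faulty thread that hangs and never sends anything
    time.sleep(1.0)


def try_receive_from(sender) -> str:
    """Start a sender thread and wait for its message with a deadline."""
    link = queue.Queue()
    threading.Thread(target=sender, args=(link,), daemon=True).start()
    try:
        return f"received {receive(link, timeout_s=0.2)}"
    except LinkTimeout:
        return "peer suspected faulty"


if __name__ == "__main__":
    print(try_receive_from(healthy_sender))
    print(try_receive_from(stuck_sender))
```

A timeout cannot tell a slow peer from a dead one, so the receiver can only suspect a fault; what to do next (retry, reroute, restart) is a separate decision.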
Problems likewise arise in asynchronous communication when the communicating threads are in an erroneous state, or when the mailbox mechanisms that support asynchronous communication malfunction. Fault-tolerant communication mechanisms are therefore key to the reliability of a parallel system and to unleashing its full potential.
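For the asynchronous case, a mailbox can at least make its own failures visible instead of losing messages silently. The following is a minimal sketch under that assumption; the `Mailbox` class, its `post` and `poll` methods, and the `dropped` counter are hypothetical names chosen for illustration.

```python
import threading
from collections import deque


class Mailbox:
    """Bounded mailbox for non-blocking message passing between threads."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._capacity = capacity
        self._messages = deque()
        self._lock = threading.Lock()
        self.dropped = 0  # messages rejected because the mailbox was full

    def post(self, message) -> bool:
        """Non-blocking send: returns False instead of blocking when full."""
        with self._lock:
            if len(self._messages) >= self._capacity:
                self.dropped += 1
                return False
            self._messages.append(message)
            return True

    def poll(self):
        """Non-blocking receive: returns None when no message is waiting."""
        with self._lock:
            return self._messages.popleft() if self._messages else None


if __name__ == "__main__":
    box = Mailbox(capacity=2)
    for item in ("a", "b", "c"):
        print(item, "accepted" if box.post(item) else "rejected")
    print("dropped:", box.dropped)
```

Because `poll` uses `None` to signal an empty mailbox, this sketch cannot carry `None` as a message; a real design would need a distinct sentinel or a status flag.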
The usual approach to making systems more reliable is to add fault tolerance (FT) measures at one of two levels: the application or the operating system. Fortunately, there is a middle way. Solutions developed for different applications often stem from common requirements. These requirements can be classified and handled in a layer that sits between the application and the operating system. An application developer can then select the desired FT level and FT mechanism for the application, which reduces development effort and shortens time to market.
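The idea of reusable FT mechanisms that sit between the application and the operating system can be sketched as a small library from which the developer picks what the application needs. The two mechanisms below, retrying a task and majority voting over replicas, are generic textbook techniques chosen as assumptions for illustration; the function names are hypothetical and do not represent the API of any particular framework.

```python
from collections import Counter
from typing import Callable, Sequence


def retry(task: Callable[[], object], attempts: int = 3) -> object:
    """Re-run a task until it succeeds or the attempts are exhausted."""
    last_error = None
    for _ in range(attempts):
        try:
            return task()
        except Exception as error:  # treat any exception as a transient fault
            last_error = error
    raise RuntimeError(f"task failed after {attempts} attempts") from last_error


def majority_vote(replicas: Sequence[Callable[[], object]]) -> object:
    """Run every replica and return the value a strict majority agrees on.

    Results must be hashable so they can be counted.
    """
    if not replicas:
        raise ValueError("at least one replica is required")
    results = [replica() for replica in replicas]
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority among replicas")
    return value


if __name__ == "__main__":
    # One replica returns a corrupted value; the vote masks it.
    print(majority_vote([lambda: 7, lambda: 7, lambda: 9]))
```

The application code stays unchanged apart from the call that wraps it, which is the point of placing such mechanisms in an intermediate layer: the choice of mechanism can change without rewriting the application or the operating system.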
Recent research has studied fault tolerance for embedded applications on distributed systems. This work includes proposals for generic architectures for dependable distributed computing, predictably dependable distributed computing systems, methods for handling hardware and software failures in real-time applications, and software-based FT solutions for massively parallel systems. In addition, extensive research has focused on developing distributed real-time operating systems with fault-tolerant behavior. Meanwhile, complex models and frameworks have been developed to examine the reliability of FT systems.
The ESPRIT project EFTOS (Embedded Fault-Tolerant Supercomputing) developed a framework for integrating fault tolerance flexibly and easily into distributed, embedded, high-performance computing (HPC) applications. The framework consists of reusable FT modules that act at different levels. The overhead and performance cost of generic operating-system-level and hardware-level FT mechanisms is avoided, and application developers are relieved of writing ad hoc FT code. Integrating this functionality into real embedded applications has validated the approach and shown promising results.
