Research on Triple Modular Redundancy Fault Tolerance Technology Based on FPGA

SRAM-based FPGAs are highly sensitive to radiation from energetic particles in space and are prone to soft errors, so FPGA-based electronic systems need fault-tolerance measures to guard against such failures. Applying triple modular redundancy (TMR) to sensitive circuits, combined with the dynamic reconfigurability of the FPGA, can effectively improve the FPGA's resistance to single-event effects and mitigate the soft faults caused by space particle radiation.

Triple Modular Redundancy (TMR) is a widely used technique for tolerating Single-Event Upsets (SEUs) in FPGAs, and it can greatly improve FPGA reliability under SEUs. However, the additional modules and wiring consume substantial hardware resources and power and also reduce operating speed, which limits the use of traditional TMR. With advances in electronic technology, especially partial reconfiguration, a variety of improved TMR techniques have appeared that address the problems of the traditional method and have moved TMR technology forward.

1. Conventional TMR method and its problems

The principle of TMR is simple: the same circuit is replicated three times, the outputs of the three copies are arbitrated by “majority voting”, and the value on which at least two copies agree is taken as the final output.
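The majority vote described above can be illustrated with a minimal software sketch. It models the voter as a bitwise 2-of-3 function over integer words; the function name `majority_vote` and the 8-bit example values are assumptions made for illustration, not part of any FPGA toolchain.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each output bit follows at least two inputs."""
    return (a & b) | (b & c) | (a & c)


# Three copies of the same 8-bit result; the second copy has one bit upset.
good = 0b10110010
upset = good ^ 0b00001000

voted = majority_vote(good, upset, good)
print(f"{voted:08b}")  # the single upset is masked by the other two copies
```

In hardware the same expression is typically a small piece of combinational logic per output bit; the sketch only shows that one disagreeing copy cannot change the voted value.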

TMR is very effective at mitigating SEUs. Simple TMR does fail when a single particle carries enough energy to upset two of the three units simultaneously, but the probability of this is low. TMR is therefore an effective fault-tolerance method that is widely used to protect systems against radiation-induced SEUs.
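To make the "two of three" failure condition concrete, the following sketch computes the textbook reliability of a TMR group. It assumes the three units fail independently, each works with probability `r`, and the voter is perfect; these assumptions and the function name `tmr_reliability` are illustrative and do not come from the article.

```python
def tmr_reliability(r: float) -> float:
    """Probability that at least two of three independent units work.

    All three work: r**3
    Exactly two work: 3 * r**2 * (1 - r)
    Sum: 3*r**2 - 2*r**3
    """
    return 3 * r**2 - 2 * r**3


for r in (0.99, 0.9, 0.5, 0.4):
    print(f"unit reliability {r:.2f} -> TMR reliability {tmr_reliability(r):.4f}")
```

Under these assumptions TMR improves on a single unit only when `r` is above 0.5, which is one way to see why unrepaired errors that accumulate over time erode the benefit of redundancy.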

Basic structure diagram of conventional TMR method
The traditional TMR method can effectively improve the reliability of the design, but it also has many shortcomings:

  1. It cannot repair faulty units.
    When one of the three units fails, the majority voter only masks the fault; the faulty unit is still there. Plain TMR also cannot detect and locate errors so that the system can repair them. If an error is not repaired in time, TMR fails when a second error occurs in another unit.

  2. Resource overhead is high and resource utilization is low.
    Conventional TMR applies triple redundancy to the whole design or to large modules. The granularity is coarse, and the triplicated design occupies roughly three times the resources of the original circuit, before counting the voters. Applying TMR to the whole circuit or to whole modules therefore wastes resources.

  3. Power consumption increases because the circuit is triplicated, and speed decreases because of the voter and the extra wiring.

  4. The voter itself can fail. A plain TMR voter has neither self-checking capability nor radiation hardening.

  5. Interfaces between TMR and non-TMR circuits add SEU-sensitive logic. When a TMR circuit drives a non-redundant circuit, a voter is needed to merge the three signals into one. When a non-redundant circuit drives a TMR circuit, the single signal must be fanned out into three through additional wiring. Both cases reduce system reliability, because logic and routing resources are both SEU-sensitive.

2. Improved TMR method

1. Dynamic Reconfigurable Technology

TMR itself cannot repair a faulty module. A single faulty module does not affect system function, but if it is not repaired before a second module fails, the redundancy scheme fails.

Therefore, when an error occurs, the faulty module must be repaired promptly, which can be done with the FPGA's partial dynamic reconfiguration capability. Dynamic reconfiguration changes the function of all or part of the FPGA's logic resources while the system is running, based on SRAM programming technology.

System reconfiguration can be divided into static system reconfiguration and dynamic system reconfiguration. Static reconfiguration reloads the logic function of the target system while it is not running: under the control of external logic, the FPGA changes its logic function by re-downloading different configuration data stored in memory. An FPGA with conventional SRAM programming can only implement static system reconfiguration.

Dynamic reconfiguration refers to digital logic systems whose logic changes over time. The time-varying logic is not obtained by invoking different regions and logic resources of the chip. Instead, the logic is dynamically reconfigured (or modified) quickly, either locally or across the whole chip, on FPGAs that have dedicated cache logic resources. In a dynamically reconfigurable FPGA, changes to the internal logic blocks and interconnect are realized directly by reading different SRAM bit data. The switching time is often on the order of nanoseconds, which makes dynamic reconfiguration of the system's logic functions practical.

Soft faults such as SEUs are the most serious threat to space electronic systems, and they can be cleared by reconfiguration, so periodically refreshing the configuration memory repairs such faults.

The TMR voter can also be designed to detect and locate errors. When a module fails, the voter's signal directly triggers reconfiguration, and only the faulty part of the circuit is dynamically reconfigured. This avoids the time and power cost of periodic refresh and prevents errors from accumulating.
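A voter with error detection and location can be sketched in software as follows. The sketch votes on three module outputs, reports which modules disagree with the voted value, and calls a repair hook for each of them. The names `vote_and_locate`, `tmr_step`, and the `reconfigure` callback are hypothetical; on a real device the callback would stand for whatever mechanism triggers partial reconfiguration of that module.

```python
from typing import Callable, List, Sequence, Tuple


def vote_and_locate(outputs: Sequence[int]) -> Tuple[int, List[int]]:
    """Return the bitwise majority value and the indices of disagreeing modules."""
    a, b, c = outputs
    voted = (a & b) | (b & c) | (a & c)
    disagreeing = [i for i, out in enumerate(outputs) if out != voted]
    return voted, disagreeing


def tmr_step(outputs: Sequence[int], reconfigure: Callable[[int], None]) -> int:
    """Vote once and request repair of every module that disagreed."""
    voted, disagreeing = vote_and_locate(outputs)
    for module_index in disagreeing:
        reconfigure(module_index)  # hypothetical hook: repair only this module
    return voted


# Usage: module 2 produces a wrong word, so only module 2 is scheduled for repair.
repair_log: List[int] = []
result = tmr_step((0x5A, 0x5A, 0x5B), repair_log.append)
print(hex(result), repair_log)
```

Triggering repair from the voter, rather than on a timer, means no reconfiguration is performed while all three modules agree.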

To guard against voter errors, the voter can be implemented in radiation-insensitive devices rather than in SRAM-based logic, which improves its robustness. The improved scheme no longer uses a majority voter on the outputs of the three redundant modules. Instead, the corresponding output of each redundant module passes through a tri-state buffer controlled by a minority voter and leaves the FPGA on its own output pin, and the three pins are finally “wire-ORed” into one signal on the printed circuit board (PCB). Each minority voter judges whether the signal of its redundant module is the minority value. If it is, the corresponding buffer outputs high impedance; if not, the signal is output normally.
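The minority-voter arrangement can be modelled for a single bit as in the sketch below. High impedance is represented by `None`, and the board-level net is modelled as the OR of whatever the pins drive, reading 0 when nothing drives it. That net model, and the names `is_minority`, `tristate_outputs`, and `wired_net`, are simplifying assumptions for illustration only.

```python
from typing import List, Optional


def is_minority(own: int, other1: int, other2: int) -> bool:
    """A module is in the minority when it differs from both other modules."""
    return own != other1 and own != other2


def tristate_outputs(a: int, b: int, c: int) -> List[Optional[int]]:
    """Each pin drives its module's bit, or None (high-Z) if that bit is the minority."""
    return [
        None if is_minority(a, b, c) else a,
        None if is_minority(b, a, c) else b,
        None if is_minority(c, a, b) else c,
    ]


def wired_net(pins: List[Optional[int]]) -> int:
    """Board net modelled as the OR of all driven pins (assumed 0 if undriven)."""
    value = 0
    for pin in pins:
        if pin is not None:
            value |= pin
    return value


# Usage: module b is upset, so its buffer goes high-Z and the net follows a and c.
pins = tristate_outputs(1, 0, 1)
print(pins, wired_net(pins))
```

Because a faulty module is disconnected rather than outvoted, the two remaining pins always drive the same value and there is no contention on the net.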

Readback builds on dynamic reconfiguration. The configuration data is read back and compared with the original configuration data, and the device is reconfigured when errors are found. Error-correcting codes can also protect the configuration data: the data of each configuration frame is protected by a 12-bit SEC-DED (single-error-correcting, double-error-detecting) Hamming code, and each basic unit in the FPGA has a different identification code. After the configuration data is read back through the ICAP (Internal Configuration Access Port), the error-correcting code gives the position of the erroneous bit.
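How a SEC-DED Hamming code locates a flipped bit can be shown with a toy example. The sketch below protects a short list of data bits with Hamming parity bits at power-of-two positions plus one overall parity bit at position 0. It is a generic illustration under those assumptions and does not reproduce the frame layout or the 12-bit code of any particular FPGA family; the names `secded_encode` and `secded_check` are hypothetical.

```python
from typing import List, Optional, Tuple


def secded_encode(data_bits: List[int]) -> List[int]:
    """Return a codeword: index 0 is overall parity, indices 1..n are Hamming-coded."""
    n_parity = 0
    while (1 << n_parity) < len(data_bits) + n_parity + 1:
        n_parity += 1
    n = len(data_bits) + n_parity
    code = [0] * (n + 1)
    data = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):  # not a power of two, so this is a data position
            code[pos] = next(data)
    for p in range(n_parity):
        mask = 1 << p
        parity = 0
        for pos in range(1, n + 1):
            if pos & mask:
                parity ^= code[pos]
        code[mask] = parity  # makes the parity over this group even
    code[0] = sum(code[1:]) % 2  # overall parity enables double-error detection
    return code


def secded_check(code: List[int]) -> Tuple[str, Optional[int]]:
    """Classify a read-back codeword as ok, single error (with position), or double error."""
    syndrome = 0
    for pos in range(1, len(code)):
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        return "ok", None
    if overall == 1:
        return "single", syndrome  # syndrome is the index of the flipped bit
    return "double", None


# Usage: flip one stored bit and let the syndrome point at it.
golden = secded_encode([1, 0, 1, 1, 0, 0, 1, 0])
readback = golden.copy()
readback[5] ^= 1
print(secded_check(readback))
```

A scrubber built on this idea would flip the reported bit back and rewrite the frame for a single error, and fall back to reloading the frame from the original configuration data for a double error.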

Dynamic reconfiguration can repair functional errors caused by SEUs in LUTs, the routing matrix, and CLBs without interrupting circuit operation, effectively enhancing the FPGA circuit's resistance to single-event effects.

2. Local Sensitive Circuit TMR Technology

With the advent of partial dynamic reconfiguration, TMR can be applied only to locally sensitive circuits. Redundancy is applied at a finer granularity, with careful placement and routing, to maximize reliability within the available resource budget. When limited resources rule out global TMR, TMR for locally sensitive circuits is the better choice: it improves system reliability while using fewer resources.

Since not all modules are made redundant, the implementation must focus TMR on the modules that contribute most to system reliability. The number and placement of voters must also be considered: extra wiring is needed before and after each TMR module, and because logic and routing resources are both SEU-sensitive, these transitions reduce system reliability.

Schematic diagram of local sensitive circuit TMR

To select the modules that need TMR and to place and route them sensibly, errors in the system are divided into persistent and non-persistent errors. Persistent errors are errors in which an SEU changes the internal state of the circuit, so they remain after reconfiguration; non-persistent errors are errors that FPGA reconfiguration eliminates.

Based on this analysis, the priorities for applying partial TMR are as follows:

The first level is the part of the circuit that produces persistent errors.

The second level is the circuitry that can cause errors in the part that produces persistent errors (the logic feeding it), chosen so as to reduce transitions between TMR and non-TMR logic.

The third level is the circuitry downstream of the part that produces persistent errors, again chosen so as to reduce transitions between TMR and non-TMR logic.

The fourth level is the circuitry that is separate from the part that produces persistent errors.
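The four priority levels can be read as a reachability classification on the circuit's connectivity graph. The sketch below assumes the set of components that produce persistent errors is already known, treats everything that can reach that set as level 2 and everything reachable from it as level 3, and puts the rest in level 4. The graph representation, the function names, and the example component names are all hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


def reachable(graph: Dict[str, List[str]], starts: Iterable[str]) -> Set[str]:
    """Breadth-first search: nodes reachable from starts, excluding starts themselves."""
    start_set = set(starts)
    seen = set(start_set)
    queue = deque(start_set)
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - start_set


def tmr_priorities(graph: Dict[str, List[str]], persistent: Set[str]) -> Dict[str, int]:
    """Map every component to a TMR priority level from 1 (highest) to 4."""
    reverse: Dict[str, List[str]] = {}
    for src, dsts in graph.items():
        reverse.setdefault(src, [])
        for dst in dsts:
            reverse.setdefault(dst, []).append(src)
    upstream = reachable(reverse, persistent)   # logic that feeds the persistent part
    downstream = reachable(graph, persistent)   # logic driven by the persistent part
    levels: Dict[str, int] = {}
    for node in reverse:
        if node in persistent:
            levels[node] = 1
        elif node in upstream:
            levels[node] = 2
        elif node in downstream:
            levels[node] = 3
        else:
            levels[node] = 4
    return levels


# Usage: edges point from driver to load; "fsm" holds state and feeds back on itself.
netlist = {
    "in_reg": ["fsm"],
    "fsm": ["fsm", "out_logic"],
    "out_logic": [],
    "debug_led": [],
}
print(tmr_priorities(netlist, {"fsm"}))
```

Redundancy would then be added level by level until the resource budget is used up, which keeps the TMR region contiguous and limits the number of TMR to non-TMR transitions.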

Circuits can be partitioned by static analysis. One problem is that standard global TMR triplicates all inputs, outputs, and clocks, whereas partial TMR may not be able to make I/O and clocks redundant. Like logic without TMR, clocks and I/O without TMR can produce undetectable errors.

According to the experimental results, because this method concentrates on the part of the circuit that can produce persistent errors, the probability of persistent errors falls quickly as more redundant resources are used, and in the end almost all of them are eliminated. Partial TMR can therefore balance resources against reliability, maximizing resource utilization with minimal impact on reliability.

As FPGAs develop rapidly, chip integration keeps rising and operating voltages keep falling, which lowers FPGA reliability under radiation; the impact of soft faults such as SEUs in particular is growing. Systems implemented on SRAM-based FPGAs must therefore include fault-tolerance measures.

Building on the reliability advantages of traditional TMR, fault-tolerance measures for space applications that combine TMR for locally sensitive circuits with FPGA partial dynamic reconfiguration can effectively prevent soft faults caused by space particles. This will be a major development direction for TMR technology.
