# **Reconfigurable Space Computing**

### Dr. Brock J. LaMeres

Associate Professor, Electrical & Computer Engineering, Montana State University






# Outline



### 1. Research Statement

- Enabling Reconfigurable Computing for Aerospace

### 2. Radiation Effects in Electronics

- Sources & Types of Radiation
- Effects (TID, SEE, Displacement Damage)
- FPGA Specific Effects

### 3. Existing Mitigation Techniques

- Physical (Shielding, RHBD, RHBP)
- Architectural (TMR, Scrubbing, Error Correction Codes)

### 4. MSU's Approach

- Redundant Tiles (TMR+Spares+Scrubbing)
- Prototyping
- Test Flights (Local Balloons, HASP, Future Sub-orbital)







### Support the Computing Needs of Space Exploration & Science

- Computation (2,000 MIPS)
- Power Efficiency (200 MIPS/Watt)
- Mass (\$100/lb by 2025)
- Reliability (99.99999% reliable, instant recovery during critical operation)
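
As a rough sanity check on what these targets imply, the arithmetic can be sketched as below. Treating the reliability figure as availability over one year is an assumption made only for illustration; the slide does not define it that way.

```python
# Figures taken from the requirement list above.
REQUIRED_MIPS = 2_000              # computation target
REQUIRED_MIPS_PER_WATT = 200       # power-efficiency target
REQUIRED_RELIABILITY = 0.9999999   # 99.99999 %

SECONDS_PER_YEAR = 365.25 * 24 * 3600

# Power budget implied by meeting both performance targets at once.
implied_power_watts = REQUIRED_MIPS / REQUIRED_MIPS_PER_WATT

# Assumption: interpret the reliability figure as availability over one year.
allowed_downtime_s_per_year = (1 - REQUIRED_RELIABILITY) * SECONDS_PER_YEAR

print(f"Implied power budget: {implied_power_watts:.0f} W")
print(f"Allowed downtime: {allowed_downtime_s_per_year:.2f} s/year")
```

Under that reading, the computer has roughly 10 W to work with and only a few seconds per year in which it may be unavailable.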



Space Launch System (SLS)









### **Provide a Radiation Tolerant Platform for Reconfigurable Computing**

- Reconfigurable Computing as a means to provide:
  - Increased Computation of Flight Systems
  - Reduced Power of Flight Systems
  - Reduced Mass of Flight Hardware
  - Mission Flexibility through Real-Time Hardware Updates
- Support FPGA-based Reconfigurable Computing through an underlying architecture with inherent radiation tolerance to Single Event Effects



The Future









### Let's start with what is <u>NOT</u> Reconfigurable Computing

- A CPU/GPU: while you have flexibility via programming, the hardware is still fixed
  - The instructions that can be executed are fixed in the sequence controller
  - The size of memory is pre-defined
  - The IO is pre-defined
- An ASIC: the hardware is fixed during fabrication.

### Are There <u>Advantages</u> to these Conventional Systems?

- Yes, they are well understood and easy to program (particularly the single core model)
- Yes, when the task maps well to the hardware, they have high performance (e.g., GPU)
- Yes, they can handle a large array of tasks (albeit sometimes in an inefficient manner)

### Are There <u>Disadvantages</u> to these Conventional Systems?

- Yes, unless the task maps directly to the hardware, they perform poorly.
- Yes, much of the hardware that allows them to handle a variety of tasks sits idle most of the time.





### A System That Alters Its Hardware as a Normal Operating Procedure

- This can be done in real-time or at compile time.
- This can be done on the full-chip, or just on certain portions.
- Changing the hardware allows it to be optimized for the application at hand.





# What Technology is used for RC?



### Field Programmable Gate Arrays (FPGA)

- Currently the most attractive option.
- SRAM-based FPGAs give the most flexibility
- Riding Moore's Law feature shrinkage









# What are the Advantages of RC?



### **Computational Performance**

- Optimizing the hardware for the task at hand = architectural advantages
- Eliminating unused circuitry (minimize place/route area, reduces wiring delay)

### **Reduced Power**

- Implement only the required circuitry
- Shut down or un-program unused circuitry when not in use

### **Reduced Mass**

- Reuse a common platform to conduct multiple sequential tasks in flight systems
- This effect is compounded when considering each flight system has backup hardware
- Mass is the dominant driver of cost for space applications
  - \$10,000/lb to get into orbit.
  - NASA's goal is \$100/lb by 2025
  - Shuttle cost ~\$300-\$500M per launch with 50,000 lb capacity
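
The per-pound figure follows directly from the Shuttle numbers above. A minimal sketch of the arithmetic, using only the values quoted on this slide:

```python
SHUTTLE_CAPACITY_LB = 50_000
SHUTTLE_COST_RANGE_USD = (300e6, 500e6)   # ~$300M to $500M per launch
GOAL_USD_PER_LB = 100                     # stated goal for 2025


def cost_per_lb(launch_cost_usd, capacity_lb):
    """Launch cost divided by payload capacity."""
    return launch_cost_usd / capacity_lb


low, high = (cost_per_lb(cost, SHUTTLE_CAPACITY_LB) for cost in SHUTTLE_COST_RANGE_USD)
print(f"Shuttle: ${low:,.0f} to ${high:,.0f} per lb")
print(f"Reduction needed: {low / GOAL_USD_PER_LB:.0f}x to {high / GOAL_USD_PER_LB:.0f}x")
```

This gives \$6,000 to \$10,000 per lb, consistent with the \$10,000/lb figure, and shows that the goal requires a reduction of roughly two orders of magnitude.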





A Sequence of Unique Tasks



### **On Earth Our Computers are Protected**

- Our magnetic field deflects the majority of the radiation
- Our atmosphere attenuates the radiation that gets through our magnetic field

### **Our Satellites Operate In Trapped Radiation in the Van Allen Belts**

- High flux of trapped electrons and protons

### In Deep Space, Nothing is Protected

- Radiation from our sun
- Radiation from other stars
- Particles & electromagnetic



You Are Here





### Where Does Space Radiation Come From?

- Nuclear fusion in stars creates light and heavy ions + EM
- Stars consist of an abundant amount of Hydrogen (<sup>1</sup>H = 1 Proton) at high temperatures held in place by gravity
  - 1. The strong nuclear force pulls two Hydrogen (<sup>1</sup>H) nuclei together, overcoming the Coulomb force, and fuses them into a new nucleus
    - The new nucleus contains 1 proton + 1 neutron
    - This new nucleus is called *Deuterium (D)* or *Heavy Hydrogen* (<sup>2</sup>H)
    - Energy is given off during this reaction in the form of a Positron and a Neutrino
  - 2. The Deuterium (<sup>2</sup>H) then fuses with Hydrogen (<sup>1</sup>H) again to form yet another new nucleus
    - This new nucleus contains 2 protons + 1 neutron
    - This nucleus is called *Helium-3* (<sup>3</sup>He)
    - Energy is given off during this reaction in the form of a Gamma Ray
  - 3. Two Helium-3 nuclei then fuse to form a Helium-4 nucleus
    - The new Helium nucleus (<sup>4</sup>He) contains 2 protons + 2 neutrons
    - Energy is given off in the form of Hydrogen (i.e., protons)
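
In standard notation, the proton-proton chain that these three steps describe is usually written as follows (a summary of the dominant branch only; other branches exist and are omitted here):

$$
\begin{aligned}
{}^{1}\mathrm{H} + {}^{1}\mathrm{H} &\rightarrow {}^{2}\mathrm{H} + e^{+} + \nu_{e} \\
{}^{2}\mathrm{H} + {}^{1}\mathrm{H} &\rightarrow {}^{3}\mathrm{He} + \gamma \\
{}^{3}\mathrm{He} + {}^{3}\mathrm{He} &\rightarrow {}^{4}\mathrm{He} + 2\,{}^{1}\mathrm{H}
\end{aligned}
$$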









### **Radiation Categories**

- 1. Ionizing Radiation
  - Sufficient energy to remove electrons from atomic orbit
  - Ex. High energy photons, charged particles
- 2. Non-Ionizing Radiation
  - Insufficient energy/charge to remove electrons from atomic orbit
  - Ex., microwaves, radio waves

### **Types of Ionizing Radiation**

- 1. Gamma & X-Rays (photons)
  - Sufficient energy in the high end of the UV spectrum
- 2. Charged Particles
  - Electrons, positrons, protons, alpha, beta, heavy ions
- 3. Neutrons
  - No electrical charge but ionize indirectly through collisions

### What Type are Electronics Sensitive To?

- Ionization which causes electrons to be displaced
- Particles which collide and displace silicon crystal









### **Classes of Ionizing Space Radiation**








- 1. Cosmic Rays
  - Originating from our sun (Solar Wind) and outside our solar system (Galactic)
  - Mainly Protons and heavier ions
  - Low flux
- 2. Solar Particle Events
  - Solar flares & Coronal Mass Ejections
  - Electrons, protons, alpha, and heavier ions
  - Event activity tracks the 11-year solar min/max cycle
- 3. Trapped Radiation
  - Earth's Magnetic Field traps charged particles
  - Inner Van Allen Belt holds mainly protons (10-100's of MeV)
  - Outer Van Allen Belt holds mainly electrons (up to ~7 MeV)
  - Heavy ions also get trapped











### Which radiation is of most concern to electronics?

# <u>Concern</u>

- Protons (<sup>1</sup>H)
  - Make up ~85% of galactic radiation
  - Larger mass than an electron (1800x), harder to deflect
- Beta Particles (electrons & positrons)
  - Make up ~1% of galactic radiation
  - More penetrating than alphas
- Heavy Ions
  - Make up <1% of galactic radiation
  - High energy (up to GeV) so shielding is inefficient
- Neutrons
  - Uncharged so difficult to stop



- Alpha Particles (He nuclei)
  - Make up ~14% of galactic radiation
  - ~5 MeV energy level & highly ionizing, but...
  - Low penetrating power (50 mm in air, 23 µm in silicon)
  - Can be stopped by a sheet of paper
- Gamma
  - Highly penetrating but an EM wave
  - Lightly ionizing





### Hole Trapping

- EHP (electron/hole pair) formed by ionization
- Electrons recombine quicker due to faster mobility
- Holes get "stuck" due to lower mobility
- Lowers Vt by effectively "thinning" the oxide

### What are the Effects?

- 1. Total Ionizing Dose (TID)
  - Cumulative long term damage due to ionization.
  - Primarily due to low energy protons and electrons, which have a higher, more constant flux, particularly when trapped
  - Problem #1: Oxide Breakdown
    - Threshold Shifts
    - Leakage Current
    - Timing Changes





- Vt eventually goes negative, turning on the MOS device

### Interface Trapping

- The Si/SiO<sub>2</sub> interface typically contains Si/H bonds
  - This is due to the annealing process in hydrogen
- When this bond is severed, H will bond with itself
- This leaves a dangling Si bond with net positive charge
- This initially lowers Vt and then ultimately raises it.





### What are the Effects?

- 1. Total Ionizing Dose (TID) Cont...
  - Problem #2: Leakage Current





### What are the Effects?

- 2. Single Event Effects (SEE)
  - Electron/hole pairs created by a single particle passing through the semiconductor
  - Primarily due to heavy ions and high energy protons
  - Excess charge carriers cause current pulses
  - Creates a variety of destructive and non-destructive damage
  - The ionization *itself* does not cause damage; the damage is secondary, due to parasitic circuits

"Critical Charge" = the amount of charge deposited to change the state of a gate







### What are the Effects?

2. Single Event Effects (SEE) - Non-Destructive (e.g., soft faults)














### What are the Effects?

2. Single Event Effects (SEE) – **Destructive** (e.g., hard faults)







### Shielding

- Shielding helps for protons and electrons <30MeV, but has diminishing returns after 0.25".
- This shielding is typically inherent in the satellite/spacecraft design.



### Shield Thickness vs. Dose Rate (LEO)





### Radiation Hardened by <u>Design</u> (RHBD)

- Uses commercial fabrication process
- Circuit layout techniques are implemented which help mitigate effects



- Reduces leakage between NMOS & PMOS devices due to hole trapping in Field Oxide (STI Region 2)
- Separation of device + body contacts
- Adds ~20% increase in area

- This oxide reduces the probability of hole trapping.
- Process nodes <0.5um typically are immune to Vgs shift in the gate.





### Radiation Hardened by <u>Process</u> (RHBP)

- An insulating layer is used beneath the channels
- This significantly reduces the ion trail length and in turn the electron/hole pairs created
- The bulk can also be doped to be more conductive so as to resist hole trapping







### **Radiation Tolerance Through Architecture**

- 1. Triple Modular Redundancy (TMR)
  - Triplicate each circuit
  - Use a majority voter to produce the output
  - Advantages
    - Able to address faults in real-time
    - Simple
  - Disadvantages
    - Takes >3x the area
    - Voter also needs to be triplicated to avoid a single-point-of-failure
    - Doesn't handle Multiple-Bit-Upsets
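
The voting step can be sketched in software as a bitwise 2-of-3 majority. This is only an illustration of the logic; in hardware the voter is a small combinational circuit, and the function names here are hypothetical.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority of three redundant module outputs."""
    return (a & b) | (b & c) | (a & c)


def faulty_module(a: int, b: int, c: int):
    """Index (0, 1 or 2) of the single module that disagrees with the vote, else None."""
    voted = majority_vote(a, b, c)
    disagree = [i for i, out in enumerate((a, b, c)) if out != voted]
    return disagree[0] if len(disagree) == 1 else None


golden = 0b1011_0010
upset = golden ^ 0b0000_1000          # one bit flipped by a single event upset

# One corrupted copy is outvoted by the two good ones.
print(bin(majority_vote(golden, upset, golden)), faulty_module(golden, upset, golden))

# The same bit corrupted in two copies defeats the voter (the multiple-upset case).
print(bin(majority_vote(upset, upset, golden)))
```

The second call shows why TMR alone does not cover multiple upsets: once two copies agree on the wrong value, the voter passes it through.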







### Radiation Tolerance Through <u>Architecture</u> Cont...

- 2. Scrubbing
  - Compare contents of a memory device to a "Golden Copy"
  - Golden Copy is contained in a radiation immune technology (fuse-based memory, MROM, etc...)
  - Advantages
    - Simple & Effective
  - Disadvantages
    - Sequential searching pattern can have latency between fault & repair
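
A minimal sketch of the idea, modelling the memory as a list of frames held as `bytes`. The function names and the frame model are assumptions for illustration, not a real device interface. The read-back variant compares before rewriting; the blind variant simply overwrites everything.

```python
def readback_scrub(live_frames, golden_frames):
    """Walk the frames in order, rewrite any that differ; return the repaired indices."""
    repaired = []
    for index, golden in enumerate(golden_frames):
        if live_frames[index] != golden:
            live_frames[index] = golden
            repaired.append(index)
    return repaired


def blind_scrub(live_frames, golden_frames):
    """Overwrite every frame with the golden copy without comparing first."""
    live_frames[:] = golden_frames


golden = [bytes([i] * 4) for i in range(8)]   # golden copy in rad-immune storage
live = list(golden)                           # working memory exposed to radiation
live[5] = bytes([5, 5, 7, 5])                 # simulated upset in frame 5

print(readback_scrub(live, golden))           # indices of the frames that were repaired
print(live == golden)
```

Because the walk is sequential, an upset in a late frame waits until the scrubber reaches it, which is the latency noted above.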







### **Effects Overview**

- Primary Concern is Heavy Ions & high energy protons
- All modern computer electronics experience TID and will eventually fail
- Heavy ions causing SEEs cannot be stopped, so an architectural approach is used to handle them.













### **FPGAs are Uniquely Susceptible**

- 1. Total Ionizing Dose
  - All gates and memory cells are susceptible to TID due to high energy protons
- 2. Single Event Effects
  - SETs/SEUs in the logic blocks
  - SETs in the routing
  - SEUs in the configuration memory for the logic blocks (SEFI)
  - SEUs in the configuration memory for the routing (SEFI)

Radiation Strikes in the Circuit Fabric (Logic + Routing)

[Figure: FPGA fabric shown as a grid of logic blocks, each paired with its configuration SRAM]

Radiation Strikes in the Configuration Memory (Logic + Routing)





### What is needed for FPGA-Based Reconfigurable Computing

- 1. SRAM-based FPGAs
  - To support fast reconfiguration
- 2. A TID hardened fabric
  - Thin Gate Oxides to avoid hole trapping and threshold shifting (inherent in all processes)
  - Radiation Hardened by Design to provide SEL immunity (rings, layout, etc...)

### **Does This Exist?**

- 1. Yes, Xilinx Virtex-QV Space Grade FPGA Family
  - TID Immunity > 1Mrad
  - RHBD for SEL immunity
  - CRC in configuration memory



### The Final Piece is SEE Fault Mitigation due to Heavy Ions

- SEU will happen due to heavy ions, nothing can stop this.
- A computer architecture that expects and responds to faults is needed.





### **A Many-Tile Architecture**

- The FPGA is divided up into *Tiles*
- A Tile is a quantum of resources that:
  - Fully contains a system (e.g., processor, accelerator)
  - Can be programmed via partial reconfiguration (PR)

### **Fault Tolerance**

- 1. TMR + Spares
- 2. Spatial Avoidance and Background Repair
- 3. Scrubbing



16 MicroBlaze Soft Processors on a Virtex-6





### 1. TMR + Spares

- 3 Tiles run in TMR with the rest reserved as spares.
- In the event of a fault, the damaged tile is replaced with a spare and foreground operation continues.

### 2. Spatial Avoidance & Repair

- The damaged Tile is "repaired" in the background via Partial Reconfiguration.
- The repaired tile is reintroduced into the system as an available spare.

### 3. Scrubbing

- A traditional scrubber runs in the background.
- Either blind or read-back.
- PR is technically a "blind scrub", but of a particular region of the FPGA.
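
The bookkeeping behind the swap-and-repair cycle can be sketched as a small tile pool. The class and method names are hypothetical, and the 16-tile, 3-active split is taken from the MicroBlaze example; the real system does this in hardware and firmware, not Python.

```python
from collections import deque


class TilePool:
    """Toy model of TMR + spares with background repair (hypothetical API)."""

    def __init__(self, num_tiles=16, active_count=3):
        self.active = list(range(active_count))               # tiles voting in TMR
        self.spares = deque(range(active_count, num_tiles))   # tiles held in reserve
        self.repairing = deque()                              # tiles awaiting partial reconfiguration

    def report_fault(self, tile):
        """Swap a faulted active tile for a spare so foreground operation continues."""
        if not self.spares:
            raise RuntimeError("no spare tiles available")
        slot = self.active.index(tile)
        self.active[slot] = self.spares.popleft()
        self.repairing.append(tile)

    def repair_one(self):
        """Background step: reconfigure the oldest damaged tile and return it as a spare."""
        if self.repairing:
            self.spares.append(self.repairing.popleft())


pool = TilePool()
pool.report_fault(1)        # tile 1 lost the vote
print(pool.active)          # tile 3 has taken its place
pool.repair_one()
print(list(pool.spares))    # tile 1 is back at the end of the spare queue
```

Keeping the repaired tile at the back of the queue is one possible policy; it spreads use across the fabric, but nothing in the slides prescribes it.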



Shuttle Flight Computer (TMR + Spare)





### Why do it this way?

### With Spares, it basically becomes a flow-problem:

- If the repair rate is faster than the incoming fault rate, you're safe.
- If the repair rate is slightly slower than the incoming fault rate, spares give you additional time.
- The additional time can accommodate varying flux rates.
- Abundant resources on an FPGA enable dynamic scaling of the number of spares. (e.g., build a bigger tub in real time)
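
The tub analogy can be turned into a back-of-the-envelope estimate. The sketch below is a deterministic approximation that ignores the randomness of fault arrivals, and the rates used are made up for illustration.

```python
import math


def time_until_spares_exhausted(spares, fault_rate, repair_rate):
    """Deterministic 'bathtub' estimate: spares drain at (fault_rate - repair_rate)."""
    net_drain = fault_rate - repair_rate
    if net_drain <= 0:
        return math.inf          # repairs keep up, so the tub never empties
    return spares / net_drain


# Hypothetical rates in events per second.
print(time_until_spares_exhausted(spares=13, fault_rate=1.5, repair_rate=1.0))
print(time_until_spares_exhausted(spares=13, fault_rate=0.5, repair_rate=1.0))
print(time_until_spares_exhausted(spares=26, fault_rate=1.5, repair_rate=1.0))
```

Doubling the number of spares doubles the time available to ride out a burst, which is the "bigger tub" effect.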









### **Practical Considerations**

- Foreground operation can continue while repair is conducted in the background. Because bringing a spare tile online (reinitializing it) is typically faster than scrubbing/PR, foreground "down time" is minimized.
- Using PR tiles, the system doesn't need to track the exact configuration memory addresses. Partial bit streams contain all the necessary information about a tile configuration.
- PR of a tile also takes care of both SEUs in the circuit fabric & configuration SRAM so the system doesn't care which one occurred.
- The "spares" are held in reset to reduce power. This is as opposed to running in N-MR with every tile voting.







### **Modeling Our Approach**

- We need to compare our approach to a traditional TMR+scrubbing system
- We use a Markov Model to predict Mean-Time-Before-Failure
  - 16 tile MicroBlaze system on Virtex-6 (3+13)
  - $\lambda$ is the fault rate
  - $\mu$ is the repair rate
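
To show the kind of calculation involved, here is a simplified birth-death Markov chain. It is a sketch under stated assumptions, not the model used for the results in this deck: the state is the number of damaged tiles, faults arrive at a constant rate $\lambda$, one tile is repaired at a time at rate $\mu$, and the system fails once more tiles are damaged than it can tolerate. The repair rate in the example is hypothetical.

```python
def mean_time_to_failure(fault_rate, repair_rate, tolerated_faults):
    """Expected time to go from 0 damaged tiles to tolerated_faults + 1.

    Uses the first-passage recursion for a birth-death chain:
    h_0 = 1/lambda and h_k = 1/lambda + (mu/lambda) * h_(k-1),
    where h_k is the mean time to first move from k to k+1 damaged tiles.
    """
    total = 0.0
    hop = 0.0
    for _ in range(tolerated_faults + 1):
        hop = 1.0 / fault_rate + (repair_rate / fault_rate) * hop
        total += hop
    return total


lam = 0.0003479   # ISS average fault rate, faults/device/second
mu = 0.2          # hypothetical repair rate, repairs/second

baseline = mean_time_to_failure(lam, mu, tolerated_faults=1)          # TMR outvotes one bad tile
with_spares = mean_time_to_failure(lam, mu, tolerated_faults=1 + 13)  # plus 13 spares

print(f"TMR only:        {baseline:.3g} s")
print(f"TMR + 13 spares: {with_spares:.3g} s")
```

Each extra tolerated fault multiplies the dominant term by roughly $\mu/\lambda$, which is why adding spares changes the result by many orders of magnitude when repairs are much faster than faults.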







### **Modeling Our Approach: Fault & Repair Rates**

### Fault Rate ( $\lambda$ )

- Derived from CREME96 tool for 4 different orbits
- Used LET fault data from V4

### ORBITAL FAULT RATES FROM CREME96, IN FAULTS/DEVICE/SECOND

|     | Average   | Worst Week | Peak 5 Minutes |
|-----|-----------|------------|----------------|
| ISS | 0.0003479 | 3.544      | 72.96          |
| HEO | 0.08788   | 120.2      | 2398           |
| E1P | 0.003464  | 29.93      | 612.3          |
| GEO | 0.0002494 | 149.8      | 3059           |
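
For intuition, each rate can be inverted to give the average time between faults. A short sketch using the Average column:

```python
# Average fault rates from the table above, in faults/device/second.
AVERAGE_FAULT_RATES = {
    "ISS": 0.0003479,
    "HEO": 0.08788,
    "E1P": 0.003464,
    "GEO": 0.0002494,
}


def mean_seconds_between_faults(rate_per_second):
    """Mean inter-arrival time for a constant fault rate."""
    return 1.0 / rate_per_second


for orbit, rate in AVERAGE_FAULT_RATES.items():
    seconds = mean_seconds_between_faults(rate)
    print(f"{orbit}: one fault every {seconds:,.0f} s on average")
```

On average the ISS orbit sees a fault a little under once an hour, while the HEO orbit sees one roughly every 11 seconds.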

### Repair Rate (µ)

- Measured empirically in lab on V6 system



| Clock Rate | Blind | Readback, undamaged | Readback, damaged |
|------------|-------|---------------------|-------------------|
| 25 MHz     | 2.97  | 5.31                | 6.35              |





### **Modeling Our Approach: Results**

### **Baseline System**

MTBF FOR BASELINE TMR+SCRUBBING SYSTEM (IN SECONDS)

|       |     | Average  | Worst Week | Peak 5 Min. |
|-------|-----|----------|------------|-------------|
| Blind | ISS | 1.08E+08 | 3.19E+00   | 1.07E-01    |
|       | HEO | 1.77E+03 | 6.43E-02   | 3.20E-03    |
|       | E1P | 1.09E+06 | 2.69E-01   | 1.25E-02    |
|       | GEO | 2.09E+08 | 5.14E-02   | 2.50E-03    |
| RB    | ISS | 6.00E+07 | 2.73E+00   | 1.06E-01    |
|       | HEO | 1.03E+03 | 6.39E-02   | 3.20E-03    |
|       | E1P | 6.07E+05 | 2.63E-01   | 1.25E-02    |
|       | GEO | 1.17E+08 | 5.12E-02   | 2.50E-03    |

### Proposed System

MTBF FOR TMR+SCRUBBING+SPARES SYSTEM (IN SECONDS)

|       |     | Average  | Worst Week | Peak 5 Min.    |
|-------|-----|----------|------------|----------------|
| Blind | ISS | 3.57E+43 | 7.83E+01   | 1.25E+00       |
|       | HEO | 3.75E+11 | 7.41E-01   | 3.59E-02       |
|       | E1P | 4.46E+29 | 3.30E+00   | 1.41E-01       |
|       | GEO | 3.74E+45 | 5.90E-01   | 2.81E-02       |
| RB    | ISS | 8.26E+41 | 5.49E+01   | 1.23E+00       |
|       | HEO | 2.10E+10 | 7.33E-01   | 3.59E-02       |
|       | E1P | 1.08E+28 | 3.16E+00   | 1.41E-01       |
|       | GEO | 8.63E+43 | 5.85E-01   | 2.81E-02       |

### Improvement

INCREASE IN MTBF AFTER ADDITION OF SPARES (%)

|       |     | Average   | Worst Week | Peak 5 Min. |
|-------|-----|-----------|------------|-------------|
| Blind | ISS | 3.31E+35% | 2356.07%   | 1067.45%    |
|       | HEO | 2.12E+08% | 1051.79%   | 1021.88%    |
|       | E1P | 4.10E+23% | 1127.98%   | 1031.20%    |
|       | GEO | 1.78E+37% | 1047.86%   | 1024.00%    |
| RB    | ISS | 1.38E+34% | 1912.98%   | 1058.51%    |
|       | HEO | 2.05E+07% | 1046.32%   | 1021.88%    |
|       | E1P | 1.78E+22% | 1103.77%   | 1028.80%    |
|       | GEO | 7.40E+35% | 1042.38%   | 1024.00%    |

Ok, it looks promising.
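
As a worked check, the Worst Week entry for ISS with blind scrubbing can be recomputed from the two MTBF tables. The sketch assumes the percentage is the relative increase over the baseline; the small difference from the tabulated 2356.07% is consistent with the MTBF values being shown to only three significant digits.

```python
def percent_increase(baseline, improved):
    """Relative increase of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# ISS, blind scrubbing, Worst Week: baseline vs. TMR+scrubbing+spares MTBF in seconds.
iss_worst_week = percent_increase(3.19e0, 7.83e1)
print(f"{iss_worst_week:.1f}%")
```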





### Let's Build It

- Xilinx Evaluation Platforms (Virtex 4/5/6) for Lab Testing





Custom Virtex-6 platform for Flight Testing











### Let's Fly It

- Local Balloon Flights (MSGC Borealis)
- HASP Program
- Suborbital Vehicle



- 4 flights in MT to 100k ft in 2011/12
- Thermal evaluation of form-factor
- 1<sup>st</sup> test flight in Sept-12
- 2<sup>nd</sup> test flight planned Sept-13
- Payload design training (June-12); flight planned 2013

# Conclusion

### What is Missing

- Faults in the routing
- MBUs
- Addressing Single-Point-of-Failure

### What's Next

- Collect flight data
- Address above mentioned issues















# **Questions?**



# References



### Content

- "Space Transportation Costs: Trends in Price Per Pound to Orbit 1990-2000", Futron Corporation Technical Report, September 6, 2002.
- Sammy Kayali, "Space Radiation Effects on Microelectronics", JPL, [Available Online]: <u>http://parts.jpl.nasa.gov/docs/Radcrs\_Final.pdf</u>.
- Holmes-Siedle & Adams, "Handbook of Radiation Effects", 2<sup>nd</sup> Edition, Oxford Press 2002.
- Thanh, Balk, "Elimination and Generation of Si-SiO2 Interface Traps by Low Temperature Hydrogen Annealing", Journal of Electrochemical Society on Solid-State Science and Technology, July 1998.
- Sturesson TEC-QEC, "Space Radiation and its Effects on EEE Components", EPFL Space Center, June 9, 2009. [Available Online]:
  - http://space.epfl.ch/webdav/site/space/shared/industry\_media/07%20SEE%20Effect%20F.Sturesson.pdf
- Lawrence T. Clark, "Radiation Effects in SRAM: Design for Mitigation", Arizona State University, [Available Online]: <u>http://www.cmoset.com/uploads/9B.1-08.pdf</u>
- K. Iniewski, "Radiation Effects in Semiconductors", CRC Press, 2011.

### Images

- If not noted, images provided by <u>www.nasa.gov</u> or MSU
- Displacement Image 1: Moises Pineda, http://moisespinedacaf.blogspot.com/2010\_07\_01\_archive.html
- Displacement Image 2/3: Vacancy and divacancy (V-V) in a bubble raft. Source: University of Wisconsin-Madison
- SRAM Images: Kang and Leblebici, "CMOS Digital Integrated Circuits" 3rd Edition. McGraw Hill, 2003
- SEB Images: Sturesson TEC-QEC, "Space Radiation and its Effects on EEE Components", EPFL Space Center, June 9, 2009.
- FPGA Images: <u>www.xilinx.com</u>, <u>www.altera.com</u>
- RHBD Images: Giovanni Anelli & Alessandro Marchioro, "The future of rad-tol electronics for HEP", CERN, Experimental Physics Division, Microelectronics Group, [Available Online]:

