One of the reasons I've gone a bit dark lately is that running CSCI 6974, RPI's experimental hardware reverse engineering class, has been eating up a lot of my time.
I wanted to make the final lab for the course a nice climax to the semester and do something that would show off the kinds of things that are possible if you have the right gear, so it had to be impressive and technically challenging. The obvious choice was a FIB circuit edit combined with invasive microprobing.
After slaving away for quite a while (this was started back in January or so) I've managed to get something ready to show off :) The work described here will be demonstrated in front of my students next week as part of the fourth lab for the class.
The first step was to pick a target. I was interested in the Xilinx XC2C32A for several reasons and was already using other parts of the chip as a teaching subject for the class. It's a pure-digital CMOS CPLD (no analog sense amps and a fairly regular structure) made on a relatively modern process (180 nm 4-metal UMC) but not so modern as to be insanely hard to work with. It was also quite cheap ($1.25 a pop for the slowest speed grade in the VQG44 package on DigiKey) so I could afford to kill plenty of them during testing.
The next step was to decap a few, label interesting pins, and draw up a die floorplan. Here's a view of the die at the implant layer after Dash etch; P-type doping shows up as brown. (John did all of the staining work and got great results. Thanks!)
|XC2C32A die floorplan after Dash etch|
The top half of the die is the actual programmable logic, laid out in a "butterfly" structure. The center spine is the ZIA (global routing, also referred to as the AIM in some datasheets), which takes signals from the 32 macrocell flipflops and 33 GPIO pins and routes them into the function blocks. To either side of the spine are the two FBs, which consist of an 80 x 56 AND array (simplifying a bit... the actual structure is more like 2 blocks x 20 rows x 2 interleaved cells x 56 columns), a 56 x 16 OR array, and 16 macrocells.
I wanted some interesting data to show my students so there were two obvious choices. First, I could try to defeat the code protection somehow and read bitstreams out of a locked device via JTAG. Second, I could try to read internal device state at run time. The second seemed a bit easier so I decided to run with it (although defeating the lock bits is still on my longer-term TODO.)
The obvious target for probing internal runtime state is the ZIA, since all GPIO inputs and flipflop states have to go through here. Unfortunately, it's almost completely undocumented! Here's the sum total of what DS090 has to say about it (pages 5-6):
"The Advanced Interconnect Matrix is a highly connected low power rapid switch. The AIM is directed by the software to deliver up to a set of 40 signals to each FB for the creation of logic. Results from all FB macrocells, as well as, all pin inputs circulate back through the AIM for additional connection available to all other FBs as dictated by the design software. The AIM minimizes both propagation delay and power as it makes attachments to the various FBs."
Thanks for the tidbit, Xilinx, but this really isn't gonna cut it. I need more info!
The basic ZIA structure was pretty obvious from inspection of the implant layer: 20 identical copies of the same logic. This suggested that each row was responsible for feeding two signals left and two right.
SEM imaging of the implant layer showed the basic structure to be largely, but not entirely, symmetric about the left-right axis. At the far outside a few cells of the PLA AND array can be seen. Moving toward the center is what appears to be a 3-stage buffer, presumably for driving the row's output into the PLA. The actual routing logic is at center.
The row appeared entirely symmetric top-to-bottom so I focused my future analysis on the upper half.
|Single row of the ZIA seen at the implant layer after Dash etch. Light gray is P-type doping, medium gray is N-type doping, dark gray is STI trenches.|
|Single row of the ZIA seen on metal 4|
Inspection of the configuration EEPROM for the ZIA showed it to be 16 bits wide by 48 rows high.
|ZIA configuration EEPROM (top few rows)|
Of the 16 bits in each row, 8 bits presumably controlled the left-hand output and 8 controlled the right. This didn't make a lot of sense at first: dense binary coding would require only 7 bits for 65 channels and one-hot coding would need 65 bits.
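To make the mismatch concrete, here is a quick sanity check of the numbers above. It's only a sketch of the arithmetic; the function names are mine and don't come from any Xilinx documentation.

```python
# Compare the select-bit cost of two codings for a 65-input mux against
# the 8 bits per output actually seen in the config EEPROM.
# All names here are illustrative only.

N_SIGNALS = 32 + 33              # macrocell flipflops + GPIO pins
OBSERVED_BITS_PER_OUTPUT = 16 // 2   # 16-bit row, split left/right


def binary_select_bits(n_inputs):
    """Bits needed to pick one of n_inputs with dense binary coding."""
    return (n_inputs - 1).bit_length()


def one_hot_select_bits(n_inputs):
    """Bits needed to pick one of n_inputs with one-hot coding."""
    return n_inputs


print(binary_select_bits(N_SIGNALS))    # 7
print(one_hot_select_bits(N_SIGNALS))   # 65
print(OBSERVED_BITS_PER_OUTPUT)         # 8
```

Eight bits is one too many for binary coding and far too few for one-hot, so a flat 65:1 mux doesn't fit either way.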
Reading documentation for related device families sometimes helps to shed some light on how a part was designed, so I took a look at some of the whitepapers for the older 350 nm CoolRunner XPLA3 series. They went into some detail on how full crossbar routing was wasteful of chip area and often not necessary to get sufficient routability. You don't need to be able to generate all 40! permutations of a given subset of signals as long as you can route every signal somehow. Instead, the XPLA3's designers connected only a handful of the inputs to each row and varied the input selection for each row so as to allow almost every possible subset to be selected somehow.
This suggested a 2-level hierarchy to the ZIA mux. Instead of being a 65:1 mux it was a 65:N hard-wired mux followed by an N:1 programmable mux feeding left and another N:1 feeding right. 6 seemed to be a reasonable guess for N, given the six groups of wires on metal 4.
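A rough behavioral model of that guess, for one output of one row, might look like the sketch below. The tap list is made up for illustration (it is not the real via pattern), and the function and variable names are my own.

```python
# Hypothetical model of a single ZIA output: a hard-wired 65:6 selection
# (fixed at mask time) followed by a programmable 6:1 mux.
# The tap indices below are invented for illustration only.

N_SIGNALS = 65   # 32 macrocell flipflops + 33 GPIO pins
N_TAPS = 6       # guessed from the six groups of wires on metal 4


def zia_output(signals, hardwired_taps, select):
    """Return the signal routed to one output.

    signals        -- list of 65 booleans (all routable signals)
    hardwired_taps -- the 6 signal indices this row is wired to
    select         -- which of the 6 taps the programmable mux passes
    """
    if len(signals) != N_SIGNALS:
        raise ValueError("expected 65 input signals")
    if len(hardwired_taps) != N_TAPS:
        raise ValueError("expected 6 hard-wired taps")
    candidates = [signals[i] for i in hardwired_taps]
    return candidates[select]


signals = [False] * N_SIGNALS
signals[10] = True
example_taps = [3, 10, 22, 35, 47, 64]   # illustrative, not the real pattern

print(zia_output(signals, example_taps, 1))   # True  (tap 1 is signal 10)
print(zia_output(signals, example_taps, 0))   # False (tap 0 is signal 3)
```

In this model each output only ever sees 6 of the 65 signals, so a given signal is routable through a row only if that row happens to be wired to it.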
|ZIA mux structure|
|ZIA M3-M4 vias|
I extracted the full via pattern by copying a tracing of M4 over the M3 image and using the power vias running down the left side as registration marks. (Pro tip: Using a high accelerating voltage, like 20 kV, in a SEM gives great results on aluminum processes with tungsten via plugs. You get backscatters from vias through the metal layer that you can use for aligning image stacks.) A few of the rows are shown above.
At this point I felt I understood most of the structure so the next step was full circuit extraction! I had John CMP a die down to each layer and send them to me for high-res imaging in the SEM.
The output buffers were fairly easy. As I expected they were just a 3-stage inverter cascade.
|Output buffer poly/diffusion/contact tracing|
|Output buffer M1 tracing|
|Output buffer gate-level schematic|
|Individual cell schematics|
The one surprising thing about the output buffer was that the NMOS on the third stage had a substantially wider channel than the PMOS. This probably has something to do with optimizing output rise/fall times.
Looking at the actual mux logic showed that it was mostly tiles of the same basic pattern (a 6T SRAM cell, a 2-input NOR gate, and a large multi-fingered NMOS pass transistor) except for the far left side.
|Gate-level layout of mux area|
|Left side of mux area, gate-level layout|
After tracing M1, it became obvious what was going on.
|Left side of mux area, M1|
The upper and lower halves control the outputs to function blocks 1 and 2 respectively. The two SRAM bits allow each output (labeled MUXOUT_FBx) to be pulled high, pulled low, or left floating. A global reset line of some sort, labeled OGATE, is used to gate all logic in the entire ZIA (and presumably the rest of the chip); when OGATE is high the SRAM bits are ignored and the output is forced high.
Here's what it looks like in schematic:
|Gate-level schematics of pullup/pulldown logic|
It's interesting to note that while almost all of the config bits in the circuit are active-low, PULLUP is active-high. This is presumably done to allow the all-ones state (a blank EEPROM array) to put the muxes in a well-defined state rather than floating.
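As a behavioral summary of the pull logic described above, here's a small sketch. PULLUP and OGATE are the names from my schematic; `pulldown_n` is a name I'm making up here for the second (active-low) bit, and the "contention" result for both pulls enabled at once is my assumption, not something I've verified on the die.

```python
# Behavioral sketch of the pull-high / pull-low module for one output.
# pullup     -- SRAM bit, active-high (name from the schematic)
# pulldown_n -- SRAM bit, assumed active-low (hypothetical name)
# ogate      -- global gating line; when high, SRAM bits are ignored


def pull_state(pullup, pulldown_n, ogate):
    """Return 'high', 'low', 'float', or 'contention' for MUXOUT_FBx."""
    if ogate:
        return "high"            # OGATE overrides the config bits
    pull_hi = bool(pullup)       # active-high
    pull_lo = not pulldown_n     # active-low
    if pull_hi and pull_lo:
        return "contention"      # assumed: both drivers on at once
    if pull_hi:
        return "high"
    if pull_lo:
        return "low"
    return "float"               # left for the pass-gate mux to drive


print(pull_state(1, 1, 0))   # high  (blank, all-ones config)
print(pull_state(0, 1, 0))   # float
print(pull_state(0, 0, 0))   # low
print(pull_state(0, 0, 1))   # high  (OGATE asserted)
```

The first case is the point of the mixed polarity: an erased, all-ones configuration gives a driven-high output instead of a floating one.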
Turning our attention to the rest of the mux array shows a 6:1 one-hot-coded mux made from NMOS pass transistors. This, combined with the 2 bits needed for the pull-high/pull-low module, adds up to the expected 8. The same basic pattern shown below is tiled three times.
|Basic mux tile, poly/implant|
|Basic mux tile, M1|
The resulting schematic:
|Schematic of muxes|
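Putting the pieces together, one 8-bit config field can be decoded roughly as follows. The bit ordering is purely an assumption for illustration (I haven't mapped EEPROM columns to SRAM cells here), and I'm assuming the six select bits are active-low like most of the other config bits.

```python
# Decode one 8-bit mux config field: 6 one-hot select bits + 2 pull bits.
# ASSUMED layout (illustrative only, not verified against the die):
#   bits 0-5 : mux selects, active-low, one-hot
#   bit  6   : PULLUP, active-high
#   bit  7   : pull-down enable, active-low

N_SELECT_BITS = 6
N_PULL_BITS = 2


def decode_mux_config(word):
    """Return (enabled_select_inputs, pullup_on, pulldown_on)."""
    if not 0 <= word <= 0xFF:
        raise ValueError("config field must fit in 8 bits")
    selects = [i for i in range(N_SELECT_BITS) if not (word >> i) & 1]
    pullup_on = bool((word >> 6) & 1)
    pulldown_on = not ((word >> 7) & 1)
    return selects, pullup_on, pulldown_on


print(N_SELECT_BITS + N_PULL_BITS)     # 8
print(decode_mux_config(0xFF))         # ([], True, False)
print(decode_mux_config(0b10111101))   # ([1], False, False)
```

Under these assumptions the blank state (0xFF) turns every pass transistor off and enables only the pull-up, while a normally routed output has exactly one select bit low and both pulls disabled.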
M2 was used for some short-distance routing as well as OGATE, power/ground busing, and the SRAM bit lines.
|M2 and M2-M3 vias|
M3 was used for OGATE, power busing, SRAM word lines, the mask-programmed muxes, and the tri-state bus within the final mux.
|M3 and M3-M4 vias|
And finally, M4. I never found out what the leftmost power line went to. It didn't appear to be VCCINT or ground but was obviously power distribution. There's no reason for VCCIO to be running down the middle of the array so maybe VCCAUX? Reversing the global config logic may provide the answer.
Now that I had good intel on the target, it was time to plan the strike!
Part 2, The Attack, is here.