Monday, March 31, 2014

Getting my feet wet with invasive attacks, part 1: Target recon

This is part 1 of a 2-part series. Part 2, The Attack, is here.

One of the reasons I've gone a bit dark lately is that running CSCI 6974, RPI's experimental hardware reverse engineering class, has been eating up a lot of my time.

I wanted to make the final lab for the course a nice climax to the semester and do something that would show off the kinds of things that are possible if you have the right gear, so it had to be impressive and technically challenging. The obvious choice was a FIB circuit edit combined with invasive microprobing.

After slaving away for quite a while (this was started back in January or so) I've managed to get something ready to show off :) The work described here will be demonstrated in front of my students next week as part of the fourth lab for the class.

The first step was to pick a target. I was interested in the Xilinx XC2C32A for several reasons and was already using other parts of the chip as a teaching subject for the class. It's a pure-digital CMOS CPLD (no analog sense amps and a fairly regular structure) made on a relatively modern process (180 nm 4-metal UMC) but not so modern as to be insanely hard to work with. It was also quite cheap ($1.25 a pop for the slowest speed grade in VQG44 package on DigiKey) so I could afford to kill plenty of them during testing

The next step was to decap a few, label interesting pins, and draw up a die floorplan. Here's a view of the die at the implant layer after Dash etch; P-type doping shows up as brown. (John did all of the staining work and got great results. Thanks!)

XC2C32A die floorplan after Dash etch
The bottom half of the die is support infrastructure with EEPROM banks for storing the configuration bitstream toward the center and JTAG/configuration stuff in a U-shape below and to either side of the memory array. (The EEPROM is mislabeled "flash" in this image because I originally assumed it was 1T NOR flash. Higher magnification imaging later showed this to be wrong; the bit cells are 2T EEPROM.)

The top half of the die is the actual programmable logic, laid out in a "butterfly" structure. The center spine is the ZIA (global routing, also referred to as the AIM in some datasheets), which takes signals from the 32 macrocell flipflops and 33 GPIO pins and routes them into the function blocks. To either side of the spine are the two FBs, which consist of an 80 x 56 AND array (simplifying a bit... the actual structure is more like 2 blocks x 20 rows x 2 interleaved cells x 56 columns), a 56 x 16 OR array, and 16 macrocells.

I wanted some interesting data to show my students so there were two obvious choices. First, I could try to defeat the code protection somehow and read bitstreams out of a locked device via JTAG. Second, I could try to read internal device state at run time. The second seemed a bit easier so I decided to run with it (although defeating the lock bits is still on my longer-term TODO.)

The obvious target for probing internal runtime state is the ZIA, since all GPIO inputs and flipflop states have to go through here. Unfortunately, it's almost completely undocumented! Here's the sum total of what DS090 has to say about it (pages 5-6):
The Advanced Interconnect Matrix is a highly connected low power rapid switch. The AIM is directed by the software to deliver up to a set of 40 signals to each FB for the creation of logic. Results from all FB macrocells, as well as, all pin inputs circulate back through the AIM for additional connection available to all other FBs as dictated by the design software. The AIM minimizes both propagation delay and power as it makes attachments to the various FBs.
Thanks for the tidbit, Xilinx, but this really isn't gonna cut it. I need more info!

The basic ZIA structure was pretty obvious from inspection of the implant layer: 20 identical copies of the same logic. This suggested that each row was responsible for feeding two signals left and two right.

SEM imaging of the implant layer showed the basic structure to be largely, but not entirely, symmetric about the left-right axis. At the far outside a few cells of the PLA AND array can be seen. Moving toward the center is what appears to be a 3-stage buffer, presumably for driving the row's output into the PLA. The actual routing logic is at center.

The row appeared entirely symmetric top-to-bottom so I focused my future analysis on the upper half.

Single row of the ZIA seen at the implant layer after Dash etch. Light gray is P-type doping, medium gray is N-type doping, dark gray is STI trenches.
Looking at the top metal layer revealed the expected 65 signals.

Single row of the ZIA seen on metal 4
The signals were grouped into six groups with 11, 11, 11, 11, 11, and 10 signals in them. This led me to suspect that there was some kind of six-fold structure to the underlying circuitry, a suspicion which was later proven correct.

Inspection of the configuration EEPROM for the ZIA showed it to be 16 bits wide by 48 rows high.

ZIA configuration EEPROM (top few rows)
Since the global configuration area in the middle of the chip was 8 rows high this suggested that each of the 40 remaining EEPROM rows configured the top or bottom half of a ZIA row.

Of the 16 bits in each row, 8 bits presumably controlled the left-hand output and 8 controlled the right. This didn't make a lot of sense at first: dense binary coding would require only 7 bits for 65 channels and one-hot coding would need 65 bits.

Reading documentation for related device families sometimes helps to shed some light on how a part was designed, so I took a look at some of the whitepapers for the older 350 nm CoolRunner XPLA3 series. They went into some detail on how full crossbar routing was wasteful of chip area and often not necessary to get sufficient routability. You don't need to be able to generate every 40! permutations of a given subset of signals as long as you can route every signal somehow. Instead, the XPLA3's designers connected only a handful of the inputs to each row and varied the input selection for each row so as to allow almost every possible subset to be selected somehow.

This suggested a 2-level hierarchy to the ZIA mux. Instead of being a 65:1 mux it was a 65:N hard-wired mux followed by a N:1 programmable mux feeding left and another N:1 feeding right. 6 seemed to be a reasonable guess for N, given the six groups of wires on metal 4.

ZIA mux structure
This hypothesis was quickly confirmed by looking at M3 and M3-M4 vias: Each row had six short wires on M3, one under each of the six groups of wires in the bus. Each of these short lines was connected by one via to one of the bus lines on M4. The via pattern varied from row to row as expected.

ZIA M3-M4 vias

I extracted the full via pattern by copying a tracing of M4 over the M3 image and using the power vias running down the left side as registration marks. (Pro tip: Using a high accelerating voltage, like 20 kV, in a SEM gives great results on aluminum processes with tungsten via plugs. You get backscatters from vias through the metal layer that you can use for aligning image stacks.) A few of the rows are shown above.

At this point I felt I understood most of the structure so the next step was full circuit extraction! I had John CMP a die down to each layer and send to me for high-res imaging in the SEM.

The output buffers were fairly easy. As I expected they were just a 3-stage inverter cascade.

Output buffer poly/diffusion/contact tracing

Output buffer M1 tracing
Output buffer gate-level schematic

Individual cell schematics
Nothing interesting was present on any of the upper layers above here, just power distribution.

The one surprising thing about the output buffer was that the NMOS on the third stage had a substantially wider channel than the PMOS. This is probably something to do with optimizing output rise/fall times.

Looking at the actual mux logic showed that it was mostly tiles of the same basic pattern (a 6T SRAM cell, a 2-input NOR gate, and a large multi-fingered NMOS pass transistor) except for the far left side.

Gate-level layout of mux area

Left side of mux area, gate-level layout
The same SRAM-feeding-NOR2 structure is seen, but this time the output is a small NMOS or PMOS pass transistor.

After tracing M1, it became obvious what was going on.

Left side of mux area, M1

The upper and lower halves control the outputs to function blocks 1 and 2 respectively. The two SRAM bits allow each output (labeled MUXOUT_FBx) to be pulled high, low, or float. A global reset line of some sort, labeled OGATE, is used to gate all logic in the entire ZIA (and presumably the rest of the chip); when OGATE is high the SRAM bits are ignored and the output is forced high.

Here's what it looks like in schematic:

Gate-level schematics of pullup/pulldown logic
Cell schematics
In the schematics I drew the NOR2v0x1 cell as its de Morgan dual (AND with inverted inputs) since this seemed to make more sense in the context of the circuit: the output is turned on when the active-low input is low and OGATE is turned off.

It's interesting to note that while almost all of the config bits in the circuit are active-low, PULLUP is active-high. This is presumably done to allow the all-ones state (a blank EEPROM array) to put the muxes in a well-defined state rather than floating.

Turning our attention to the rest of the mux array shows a 6:1 one-hot-coded mux made from NMOS pass transistors. This, combined with the 2 bits needed for the pull-high/pull-low module, adds up to the expected 8.  The same basic pattern shown below is tiled three times.
Basic mux tile, poly/implant
Basic mux tile, M1
(Sorry for the misalignment of the contact layer, this was a quick tracing and as long as I was able to make sense of the circuit I didn't bother polishing it up to look pretty!)

The resulting schematic:

Schematic of muxes

M2 was used for some short-distance routing as well as OGATE, power/ground busing, and the SRAM bit lines.

M2 and M2-M3 vias


M3 was used for OGATE, power busing, SRAM word lines, the mask-programmed muxes, and the tri-state bus within the final mux.



M3 and M3-M4 vias

And finally, M4. I never found out what the leftmost power line went to, it didn't appear to be VCCINT or ground but was obviously power distribution. There's no reason for VCCIO to be running down the middle of the array so maybe VCCAUX? Reversing the global config logic may provide the answer.

M4
A bit of trial and error poking bits in bitstreams was sufficient to determine the ordering of signals. From right to left we have FB1's GPIO pins, the input-only pin, FB2's GPIO pins, then FB1's flipflops and finally FB2's flipflops.

Now that I had good intel on the target, it was time to plan the strike!

Part 2, The Attack, is here.

No comments:

Post a Comment