Silicon Exposed

KiCAD - Sonnet interop flow

2020-12-31T20:43:00.003-08:00

I've been using KiCAD as my PCB design tool of choice for years, and as my designs grew in complexity I began using Sonnet for EM field simulation prior to manufacturing boards.

Unfortunately, interoperability between these tools was less than ideal.

KiCAD export formats

Gerber/Excellon: Stable and reliable
DXF: Generates one file per layer. Drill hits can be plotted as Xs, but there's no obvious way to plot hole boundaries. Seems to sometimes generate polygons which are not closed (missing the last vertex).

Sonnet import formats

DXF: Standard feature with all (paid) Sonnet editions. Expects one file, with a separate pen type for each conductor or drill layer. Minimal tolerance to malformed files.
GDSII: Extra-cost option, but reasonably priced (about 25% of the cost of L2 Basic)
Gerber/Excellon: Extra-cost option, absurdly priced (five digits). Only runs on Windows, not available in the Linux edition at all. Apparently this is because it's a third-party conversion tool they license as a binary, so they have no ability to port it or offer less extravagant pricing.

I spent a while tinkering and came up with a flow using KLayout that seems to work well, so I thought I'd do a quick writeup so other folks could benefit from it. This might be of interest to users of other PCB CAD tools as well. Although KLayout is nominally an IC mask layout editor, it can also read Gerber!

Step 1: Gerber generation

Nothing special here. Export your gerbers just like you would for manufacturing a board.

Step 2: KLayout import

Launch KLayout and select File | Import | Gerber PCB | New Project - Free Layer Mapping. Specify the location of the exported gerbers.

On the next page, if asked to automatically populate the project, say "no". Answering yes will result in all of your soldermask, silkscreen, etc files being imported as well, which you probably don't want.

Click the plus sign button in the top right of the dialog and select all of the copper and drill layers of interest for your simulation.

On the next page, you should see one GDS layer generated for each of the files you're importing. There's no need to touch any settings.

On the next page, map gerber files to GDS layer IDs in a logical order. I suggest top to bottom with blind/buried vias in sequence and through-board vias at the end. Each gerber file should be mapped to exactly one GDS layer.

Leave all settings on the next page as default.

On the next page, make sure the database unit is 1.0 micron. Coordinates are rounded to multiples of the database unit internally, so it needs to be small enough to avoid round-off errors. Microns are the most convenient unit to use with Sonnet's DXF importer.

Check the "merge polygons" box. This flattens all overlapping polygons, which is important because KiCAD likes to generate extra gerber flashes in vias and zone fills.

While not merging, or merging in Sonnet, will ultimately produce the same simulation result, Sonnet's geometry editor (xgeom) gets really slow if your geometry has too many polygons. Flattening before Sonnet sees the board avoids this.

Now the gerber conversion is done! Click "import" and you should see the PCB in the KLayout editor view.

Step 3: KLayout export

Select File | Save As. Choose file type DXF and pick a file name. Select "no compression", scaling factor 1.0, database unit 1 micron, and polygons as POLYLINE.

Step 4: Sonnet import

Create a new, empty Sonnet project and configure your desired stackup so that all layers and metal types are available for import. Select File | Import | DXF and choose the generated DXF file.

Choose "Import using present project as template".

On the next page, enter the layer mappings you selected in KLayout.

On the next page, select units as microns.

For the remainder of the Sonnet import wizard, enter settings as appropriate for your design. Converting vias to rectangular greatly reduces memory requirements for simulation and is normally OK as long as the vias are electrically small.

If you have large via fences, coplanar waveguides, or similar, the "simplify via arrays" option may be of use as well. This one needs to be used with caution because it sometimes merges more vias than desired. I generally prefer to import individual vias and merge manually to avoid unexpected results.

Experiments with RF cables

2020-07-22T16:39:00.000-07:00

Introduction

Over the years I've grown to have quite the collection of RF cables in my lab, some with better datasheets than others. But how good were they really? I decided to fire up the VNA and collect some data to see how a few of them really performed.

Experimental setup:

Data was collected on a Pico Technology PicoVNA 106 with current traceable calibration, at a controlled 21C ambient temperature with 45% RH.
SOLT user calibration was performed immediately before data collection.
All SMA connections were torqued to 5 lbf-in.
SMA female-female couplers were used between the VNA cables and the DUT cables. These were not de-embedded as I don't have a VNA cal kit with SMA male terminations.

All cables were SMA male at both ends. If available, I tested two identical cables of each type to see how well matched they were.

4001 S-parameter points were collected at even intervals from 10 MHz to 6 GHz. I used Sonnet's S-parameter viewer for analysis of the collected data,

Cables tested (all SMA male-male)

CD International

Crystek Microwave

Hand formable .086, 6 inch (x1)
Hand formable .086, 1 foot (x2)

Mini-Circuits

Flexible .086, 2 feet (x2)

Insertion loss

Results were mostly as expected - longer cables had higher loss, and higher quality cables had lower. I was a little disappointed to find that the Crystek semi-rigid cable didn't outperform the Mini-Circuits, though.

S21 of cables under test

After normalizing to cable length, results were:

Mini-Circuits .086: 0.71 dB/foot
Crystek Microwave .086: 0.76 dB/foot
CD International RG-188: 0.78 dB/foot
CD International RG-174: 1 dB/foot

Return Loss

Better quality cable assemblies definitely won here.

S11 of cables under test

It's a bit hard to see since the graph is so busy, but the CD International cables (red, blue, black) had consistently higher return loss than the Crystek (pink, cyan) and Mini-circuits (green).

The CD International RG188 cable got as high as -17 dB S11 at 4.77 GHz. The worst Crystek cable hit -25.1 dB at 2.63 GHz, and the worst Mini-Circuits cable hit -25.4 dB at 1.51 GHz.

Propagation Velocity

Nothing too surprising here.

Group delay of cables under test

Everything was pretty flat, and longer cables had longer delay. Calculated propagation velocities (assuming all cables were exactly the nominal length):

Crystek .086: 1.68 ns/foot (0.61 C)
Mini-Circuits .086: 1.53 ns/foot (0.66 C)
CD International RG-188: 1.49 ns/foot (0.68 C)
CD International RG-174: 1.60 ns/foot (0.64 C)

Phase Matching

For the cables I tested two of, I zoomed in to compare how tightly the propagation delays were matched.

Most of these measurements are a bit noisy because I'm pushing limits of phase resolution on my VNA.

CD International, 3 foot RG-174. Looks like one is about 50ps shorter?

CD International, 3-foot RG188. Can't see any skew at all.

Crystek Microwave 1-foot .086. Maybe 50ps of skew?

Mini-Circuits 2-foot flex .086. No observable skew.

Conclusions

There's a few takeaways from this little experiment.

First, at the speeds I currently work at, there appears to be no need to worry about buying expensive phase-matched cables. While different types of coax had significant skew between equal-length cable assemblies, skew between two units of the same SKU ranged from "at the edge of my ability to measure" to entirely undetectable.

Given that my oscilloscopes sample at 20 Gsps when not doing channel interleaving, the maximum skew between any of the identical cables tested would result in a single sample of phase error. This shouldn't be enough to cause problems with my measurements.

Second, all of my cables have non-negligible loss. Even the high-quality Mini-Circuits cables have 0.8 dB of loss at the 2 GHz bandwidth limit of my current flagship scope, and 1.1 dB at the 4 GHz bandwidth limit of the new one I have on the way. I'm definitely going to start thinking more seriously about de-embedding cables from my measurements moving forward.

High-speed probing experiments

2019-05-01T01:01:00.002-07:00

It's been two years but I'm back! I've had a lot going on and have been way too busy to post about it, but here's a sneak peek...

Long story short, I bought a house with my wife and have been renovating it into the lab of my dreams. When complete, it will have a dedicated wet chemistry bench, 400 square feet of ESD-floored lab space, 200 square feet of office/conference room space, over a mile of multimode fiber and copper Ethernet cabling, and lots of other goodies. There will be a post (or more likely a series) on the setup when it's done, but that's still a few months out.

Among the new toys I've acquired is a LeCroy WaveRunner 8104 oscilloscope - a beast of a scope with four 1 GHz channels at up to 20 Gsps.

New scope is on the right rack (left side scope is an older WaveSurfer 3034)

There's just one problem... probes.

The stock passive probes have a rated bandwidth of 500 MHz and, as with all passive R-C probes, have severe input impedance roll-off at higher frequencies. According to the datasheet, at ~150 MHz the input impedance drops below 100Ω and it only gets worse from there! As you might imagine, this isn't going to be healthy for signals on a 50Ω line.

LeCroy does sell active probes (such as the ZS1500) but they are not cheap - even secondhand, you're looking at a low four-figure price tag per channel. I might pick one up for comparison purposes eventually, but won't be evaluating them during this post.

In order to explore different probe designs, I built a characterization board on OSHPark's 4-layer process. It has an edge-launch SMA at either end of two 50Ω microstrip traces, a ground plane on layer 2, then the bottom layers are unused. (I could have done this as a 2-layer board, but the traces would have been massive given the increased distance to the ground plane.)

Probe characterization board. The little gold points and the clip are grounds.

The test board has two sets of microstrips to allow A/B comparison, or passing differential signals through the test setup. They had to be pretty far apart due to mechanical constraints of the SMA connectors, and I elected to route the signals loosely coupled to avoid unwanted parasitics during single-ended tests.

I didn't even bother testing the standard alligator-clip ground wires on the PP022 probes, because I knew these were way too inductive to be of any value whatsoever for high-speed signals. Plus, to make it a fair comparison I needed to give the stock probe a chance to put its best foot forward. All of this testing was done with the spring-tip ground.

The first test signal was a 1 GHz pure sine source built around a Crystek CCSO-914X-1000 oscillator with some additional filtering on the output to remove harmonics.


Test setup. The helping-hand is sitting on top of the coax leading to my scope so it doesn't flail around.

Closeup of the probe tip

Unsurprisingly, the signal was severely attenuated since it was well past the bandwidth limit of the probe. The peak-to-peak voltage measured via the SMA feed-through was 1930 mV while the PP022's output read 657 mV, or about -9.4 dB of insertion loss. The probe also loaded the signal down noticeably; when the probe was removed the SMA feed-through signal read 2192 mV.

SMA feedthrough (top) vs PP022 (bottom), same scale. I probably should have turned persistence on to smooth out the traces, but the amplitude difference is strikingly obvious.

All screenshots were gathered using the "glscopeclient" software I've been writing over the past few months. There will be a separate post about it once I've done a bit more testing and development.

The other probe being tested today was a custom 20:1 transmission line probe. I had high hopes for this design as I had used it with good results on slower circuits in the past, but this was my first time trying it on such fast signals.

Transmission line probe prototype

This probe has a nominal 1KΩ impedance at DC and should roll off fairly slowly, since it has a largely resistive input stage (953Ω into a 50Ω line). At frequencies in the GHz range the resistor begins to become slightly capacitive, although inductive parasitics are probably still negligible if the models I've looked at are accurate.

The test setup was pretty much the same as before, except I used the helping hands to hold the probe. (I have a 3D printed shell on order that fits the same Pico Tech probe positioners I used on the PP020, but it's not here yet.)

Experimental setup

At first glance, performance was quite encouraging. The loading was noticeably less (feed-through signal was 2003 mV, vs 2192 un-loaded and 1930 with the stock probe) and the measured signal amplitude was 946 mV, very close to the actual value. This suggests the experimental probe design has a -3 dB bandwidth right around 1 GHz, or double the bandwidth of the stock probe!

Feedthrough signal (top), prototype probe (bottom)

The obvious next step was to try a more aggressive test. I threw up a quick test bitstream for my AC701 board that generated a 1.25 Gbps SGMII/1000base-X idle sequence (8B/10B codes K28.5 D16.2 repeating forever).

The experimental setup was identical to before, except that both legs of the test board were used since the GTP transceiver has a differential output. Although I was only probing one leg, I wanted to run clock recovery off the differential signal to minimize jitter in my measurements.

The positive leg of the feed-through differential pair was plotted in the top left quadrant of the scope view, while the probe's output was plotted in the top right. An offscreen filter subtracted the two differential pair legs, then fed the differential signal to a CDR PLL. This same recovered clock was used as the clock source for eye patterns on both signals, displayed in the bottom quadrants below their corresponding inputs.

The PP020 performed surprisingly well.

SMA feedthrough (left), PP020 (right)

Although it did noticeably degrade the feed-through signal, and the eye was attenuated to oblivion (with a mere 110 mV opening), the individual bits were still (barely) visible in the waveform display. Not bad for gigabit data on a 500 MHz probe!

This failure was entirely expected - but what I did not expect was how poorly my own probe design performed on the same test signal. While it was less intrusive to the pass-through signal, the eye measured through the probe was practically nonexistent.

SMA feedthrough (left), my probe (right)

Something was clearly not right, but the eye was so garbled I had a hard time seeing exactly what. I recompiled the FPGA bitstream to send a much slower, and more repetitive, test pattern - 0xFF00 at the same 1.25 Gbps rate. This is equivalent to a 156.25 Mbps 1-0-1-0 pattern, or a 78.125 MHz squarewave.

With no probe, the signal looked quite nice. There was a small amount of ringing on the rising edge, but this didn't impair readability of the signal at all. I also tried another color ramp (matplotlib Viridis) for the eye patterns in hopes of seeing more detail.

Differential signal with no probe

With my probe, the results were less impressive. There was a noticeable step on the rising edge followed by what looked like a spike above the nominal voltage, followed by the same ringing present in the original signal.

My probe at 125.25 Mbps

By treating the rising edge as a TDR impulse and comparing amplitude to the final value, we can determine that the initial rising edge reaches an impedance of about 38Ω for ~600 ps, then 55Ω for a bit over 1000 ps, then stabilizes to its final value (the 50Ω termination at the scope). Given that the scope only has 1 GHz bandwidth I'm not entirely sure how accurate this data is, but it's all I had to go on.

The obvious hypothesis was that the 38Ω area was an impedance mismatch at either the probe needle or the MMCX connector. I discounted the needle as the mismatch initially, because an 0.51mm diameter needle shank 2mm from a ground lead has an impedance of 244Ω which didn't match the observed data whatsoever.

The next obvious target was the MMCX. I calculated 26Ω impedance for the 1.1mm wide center contact over the ground plane (I had somehow forgotten to put a ground cutout under it). After milling out a cavity, though, performance didn't seem to improve by any detectable amount.

Milled cavity under center pad

Without a 3D field solver, I have no way of knowing if this cavity was large enough. I might try and make it bigger at some point.

The other possibility is that the lack of a cutout under the needle is problematic. I initially thought this would be OK because the needle is about the same diameter (0.51mm vs 0.41mm) as the PCB trace, but later calculations for an 0.51mm wire above a ground plane using the OSHPark stackup suggest that the impedance of the needle mounting could be as low as 30-40Ω which is consistent with the observed impedance hump.

Finally, the 0.41mm microstrip appears to have been over-etched closer to 0.32mm. If the needle mounting is the impedance mismatch, this would be consistent with the observed 55Ω hump after it. I might try thickening the microstrip with solder to see if this helps bring impedance up closer to 50.

Quest for camp stove fuel

2017-04-27T23:53:00.003-07:00

For those of you who aren't keeping up with my occasional Twitter/Facebook posts on the subject, I volunteer with a local search and rescue unit. This means that a few times a month I have to grab my gear and run out into the woods on zero notice to find an injured hiker, locate an elderly person with Alzheimer's, or whatever the emergency du jour is.

Since I don't have time to grab fresh food on my way out the door when duty calls, I keep my pack and load-bearing vest stocked with shelf-stable foods like energy bars and surplus military rations. Many missions are short and intense, leaving me no time to eat anything but finger-food items (Clif bars and First Strike Ration sandwiches are my favorites) kept in a vest pocket.

My SAR vest. Weighs about 17 pounds / 7.7 kg once the Camelbak bladder is added.

On the other hand, during longer missions there may be opportunities to make hot food while waiting for a medevac helicopter, ground team with stretcher, etc - and of course there's plenty of time to cook a hot dinner during training weekends. Besides being a convenience, hot food and drink helps us (and the subject) avoid hypothermia so it can be a literal life-saver.

I've been using MRE chemical heaters for this, because they're small, lightweight (20 g / 0.7 oz each), and not too pricey (about $1 each from surplus dealers). Their major flaw is that they don't get all that hot, so during cold weather it's hard to get your food more than lukewarm.

I've used many kinds of camp stoves (propane and white gas primarily) over the course of my camping, but didn't own one small enough to use for SAR. My full 48-hour gear loadout (including water) weighs around 45 pounds / 20 kg, and I really didn't want to add much more to this. The MSR Whisperlite, for example, weighs in at 430 g / 15.2 oz for the stove, fuel pump, and wind shield. Add to this 150 g / 5.25 oz for the fuel bottle, a pot to cook in, and the fuel itself and you're looking at close to 1 kg / 2 pounds all told.

I have an aluminum camp frying pan that, including lid, weighs 121 g / 4.3 oz. It seemed hard to get much lighter for something large enough that you could squeeze an MRE entree into, so I kept it.

After a bit of browsing in the local Wal-Mart, I found a tiny sheet metal folding stove that weighed 112 g / 3.98 oz empty. It's designed to burn pellets of hexamine fuel.

The stove. Ignore the aluminum foil, it was there from a previous experiment.

In my testing it worked pretty well. One pellet brought 250 ml of water from 10C to boiling in six minutes, and held it at a boil for a minute before burning out. The fuel burned fairly cleanly and didn't leave that much soot on the pot either, which was nice.

What's not so nice, however, was the fuel. According to the MSDS, hexamine decomposes upon heating or contact with skin into formaldehyde, which is toxic and carcinogenic. Combustion products include such tasty substances as hydrogen cyanide and ammonia. This really didn't seem like something that I wanted to handle, or burn, in close proximity to food! Thus began my quest for a safer alternative.

My first thought was to use tea light candles, since I already had a case of a hundred for use as fire starters. In my testing, one tea light was able to heat a pot of water from 10C to 30C in a whopping 21 minutes before starting to reach an equilibrium where the pot lost heat as fast as it gained it. I continued the test out to 34 minutes, at which point it was a toasty 36C.

The stove was big enough to fit more than one tea light, so the obvious next step was to put six of them in a 3x2 grid. This heated significantly more, at the 36-minute mark my water measured a respectable 78C.

I figured I was on the right track, but needed to burn more wax per unit time. Some rough calculations suggested that a brick of paraffin wax the size of the stove and about as thick as a tea light contained 1.5 kWh of energy, and would output about 35 W of heat per wick. Assuming 25% energy transfer efficiency, which seemed reasonable based on the temperature data I had measured earlier, I needed to put out around 675 W to bring my pot to a boil in ten minutes. This came out to approximately 20 candle wicks.

I started out by folding a tray out of heavy duty aluminum foil, and reinforcing it on the outside with aluminum foil duct tape. I then bought a pack of tea light wicks on Amazon and attached them to the tray with double-sided tape.

Giant 20-wicked candle before adding wax

I made a water bath on my hot plate and melted a bunch of tea lights in a beaker. I wasn't in the mood to get spattered with hot wax so I wore long-sleeved clothes and a face shield. I was pretty sure that the water bath wouldn't get anywhere near the ignition point of the wax but did the work outside on a concrete patio and had a CO2 fire extinguisher on standby just in case.

Melting wax. Safety first, everyone!

The resulting behemoth of a candle actually looked pretty nice!

20-wick, 700W thermal output candle with tea lights for scale

After I was done and the wax had solidified I put the candle in my stove and lit it off. It took a while to get started (a light breeze kept blowing out one wick or another and I used quite a few matches to get them all lit), but after a while I had a solid flame going. At the six-minute mark my water had reached 37C.

A few minutes later, disaster struck! The pool of molten wax reached the flash point and ignited across the whole surface. At this point I had a massive flame - my pot went from 48 to 82C in two minutes! This translates to 2.6 kW assuming 100% energy transfer efficiency, so actual power output was probably upwards of 5 kW.

I removed the pot (using welding gloves since the flames were licking up the handle) and grabbed a photo of the fireball before thinking about how to extinguish the fire.

Pretty sure this isn't what a stove is supposed to look like

Since I was outside on a non-flammable surface the fire wasn't an immediate safety hazard, but I wanted to put it out non-destructively to preserve evidence for failure analysis. I opted to smother it with a giant candle snuffer that I rapidly folded out of heavy-duty aluminum foil.

The carnage after the fire was extinguished. Note the discolored wax!

It took me a while to clean up the mess - the giant candle had turned tan from incomplete combustion. It had also sprung a leak at some point, spilling a bit of wax out onto my patio.

On top of that, my pot was coal-black from all of the soot the super-rich flame was putting out. My wife wouldn't let it anywhere near the sink so I scrubbed it as best I could in the bathtub, then spent probably 20 minutes scrubbing all of the gray stains off the tub itself.

In order to avoid the time-consuming casting of wax, my next test used a slug of wax from a tea light that I drilled holes in, then inserted four wicks. I covered the top of the candle with aluminum foil tape to reflect heat back up at the pot, in a bid to increase efficiency and keep the melt puddle below the flash point.

Quad-wick tea light

This performed pretty well in my test. It got my pot up to 35C at the 12-minute mark, which was right about where I expected based on the x1 and x6 candle tests, and didn't flash over.

The obvious next step was to make five of them and see if this would work any better. It ignited more easily than the "brick" candle, and reached 83C at the 6-minute mark. Before T+ 7 minutes, however, the glue on the tape had failed from the heat, and the wax flashed. By the time I got the pot out of harm's way the water was boiling and it was covered in soot (again).

This time, it was a little bit breezier and my snuffer failed to exclude enough air to extinguish the flames. I ended up having to blast it with the CO2 extinguisher I had ready for just this situation. It wasn't hard to put out and I only used about two of the ten pounds of gas. (Ironically, I had planned to take the extinguisher in to get serviced the next morning because it was almost due for annual preventive maintenance. I ended up needing a recharge too...)

After cleaning off my pot and stove, and scraping some of the spilled wax off my driveway, it was back to the drawing board. I thought about other potential fuels I had lying around, and several obvious options came to mind.

Testing booze for flammability

I'm not a big drinker but houseguests have resulted in me having a few bottles of liquor around so I tested it out. Jack didn't burn at all, Captain Morgan white rum burned fitfully and left a sugary residue without putting out much heat. 100-proof vodka left a bit of starchy residue and was tricky to light.

A tea light cup full of 99% isopropyl alcohol brought my pot to 75C in five minutes before burning out, but was filthy and left soot everywhere. Hand sanitizer (about 60% ethanol) burned cleanly, but slower and cooler due to the water content - peak temperature of 54C and 12 minute burn time.

Ethanol seemed like a viable fuel if I could get it up to a higher concentration. I wanted to avoid liquid fuels due to difficulty of handling and the risk of spills, but a thick gel that didn't spill easily looked like a good option.

After a bit of research I discovered that calcium acetate (a salt of acetic acid) was very soluble in water, but not in alcohols. When a saturated solution of it in water is added to an alcohol it forms a stiff gel, commonly referred to as a "California snowball" because it burns and has a consistency like wet snow. I don't have any photos of my test handy, but here's a video from somebody else that shows it off nicely.

Two tea light cups full of the stuff brought my pot of water to a boil in 8 minutes, and held it there until burning out just before the 13-minute mark. I also tried boiling a FSR sandwich packet in a half-inch or so of water, and it was deliciously warm by the end. This seemed like a pretty good fuel!

Testing the calcium acetate fuel. I put a lid on the pot after taking this pic.

I filled two film-canister type containers with the calcium acetate + ethanol gel fuel and left it in my SAR pack. As luck would have it, I spent the next day looking for a missing hiker so it spent quite a while bouncing around driving on dirt roads and hiking.

When I got home I was disappointed to see clear liquid inside the bag that my stove and fuel were stored in. I opened the canisters only to find a thin whitish liquid instead of a stiff gel.

It seemed that the calcium acetate gel was not very stable, and over time the calcium acetate particles would precipitate out and the solution would revert to a liquid state. This clearly would not do.

Hand sanitizer seemed like a pretty good fuel other than being underpowered and perfumed, so I went to the grocery store and started looking at ingredient lists. They all seemed pretty similar - ethanol, water, aloe and other moisturizers, perfumes, maybe colorants, and a thickener. The thickener was typically either hydroxyethyl cellulose or a carbomer.

A few minutes on Amazon turned up a bag of Carbomer 940, a polyvinyl carboxy polymer cross-linked with esters of pentaerythritol. It's supposed to produce a viscosity of 45,000 to 70,000 CPS when added to water at 0.5% by weight. I also ordered a second bottle of Reagent Alcohol (90% ethanol / 5% methanol / 5% isopropanol with no bittering agents, ketones, or non-volatile ingredients) since my other one was pretty low after the calcium acetate failure.

Carbomer 940 is fairly acidic (pH 2.7 - 3.3 at 0.5% concentration) in its pure form and gel when neutral or alkaline, so it needs to be neutralized. The recommended base for alcohol-based gels was triethanolamine, so I picked up a bottle of that too.

Preparing to make carbomer-alcohol fuel gel

I made a 50% alcohol-water solution and added an 0.5% mass of carbomer. It didn't seem to fully dissolve, leaving a bunch of goopy chunks in the beaker.

Incompletely dissolved Carbomer 940 in 50/50 water/alcohol

I left it overnight to dissolve, blended it more, and then filtered off any big clumps with a coffee filter. I then added a few drops of triethanolamine, at which point the solution immediately turned cloudy. Upon blending, a rubbery white substance preciptated out of solution and stuck to my stick blender and the sidewalls of the beaker. This was not supposed to happen!


Rubbery goop on the blender head

Precipitate at the bottom of the beaker

I tried everything I could think of - diluting the triethanolamine and adding it slowly to reduce sudden pH changes, lowering the alcohol concentration, and even letting the carbomer sit in solution for a few days before adding the triethanolamine. Nothing worked.

I went back to square one and started reading more papers and watching process demonstration videos from the manufacturer. Eventually I noticed one source that suggested increasing the pH of the water to about 8 *before* adding the carbomer. This worked and gave a beautiful clear gel!

After a bit of tinkering I found a good process: Starting with 100 ml of water, titrate to pH 8 with triethanolamine. Add 1 g of carbomer powder and blend until fully gelled. Add 300 ml of reagent alcohol a bit at a time, mixing thoroughly after each addition. About halfway through adding the alcohol the gel started to get pretty runny so I mixed in a few more drops of triethanolamine and another 500 mg of carbomer powder before mixing in the rest of the alcohol. I had only a little more alcohol left in the bottle (maybe 50 ml) so I stirred that in without bothering to measure.

The resulting gel was quite stiff and held its shape for a little while after pouring, but could still be transferred between containers without muich difficulty.

Tea light can full of my final fuel

I left the beaker of fuel in my garage for several days and shook it around a bit, but saw no evidence of degradation. Since it's basically just turbo-strength hand sanitizer (~78% instead of the usual 30-60%) without all of the perfumes and moisturizers, it should be pretty stable. I had no trouble igniting it down to 10C ambient temperatures, but may find it necessary to mix in some acetone or other low-flash-point fuel to light it reliably in the winter.

The final batch of fuel filled two polypropylene specimen jars perfectly with just a little bit left over for a cooking test.

One of my two fuel jars

One tea light canister held 10.7 g / 0.38 oz of fuel, and I typically use two at a time, so 21.4 / 0.76 oz. One jar thus holds enough fuel for about five cook sessions, which is more than I'd ever need for a SAR mission or weekend camping trip. The final weight of my entire cooking system (stove, one fuel jar, tea light cans, and pot) comes out to 408 g / 14.41 oz, or a bit less than an empty Whisperlite stove (not counting the pot, fuel tank, or fuel)!

The only thing left was to try cooking on it. I squeezed a bacon-cheddar FSR sandwich into my pot, added a bit of water, and put it on top of the stove with two candle cups of fuel.

Nice clean blue flame, barely visible

By the six-minute mark the water was boiling away merrily and a cloud of steam was coming up around the edge of the lid. I took the pot off around 8 minutes and removed my snack.

Munching on my sandwich. You can't tell in this lighting, but the stove is still burning.

For those of you who haven't eaten First Strike Rations, the sandwiches in them are kind of like Hot Pockets or Toaster Strudels, except with a very thick and dense bread rather than a fluffy, flaky one. The fats in the bread are solid at room temperature and liquefy once it gets warm. This significantly softens the texture of the bread and makes it taste a lot better, so reaching this point is generally the primary goal when cooking one.

My sandwich was firmly over that line and tasted very good (for Army food baked two years ago). The bacon could have been a bit warmer, but the stove kept on burning until a bit after the ten-minute mark so I could easily have left it in the boiling water for another two minutes and made it even hotter.

Once I was done eating it was time to clean up. The stove had no visible dirt (beyond what was there from my previous experiments), and the tea light canisters were clean and fairly free of soot except in one or two spots around the edges. Almost no goopy residue was left behind.

Stove after the cook test

The pot was quite clean as well, with no black soot and only a very thin film of discoloration that was thin enough to leave colored interference fringes. Some of this was left over from previous testing, so if this test had been run on a virgin pot there'd be even less residue.

Bottom of the pot after the cook test

Overall, it was a long journey with many false steps, but I now have the ability to cook for myself over a weekend trip in less than a pound of weight, so I'm pretty happy.

EDIT: A few people have asked to see the raw data from my temperature-vs-time cook tests, so here it is.

Raw data (graph 1)

Raw data (graph 2)

STARSHIPRAIDER: Preparing for high-speed I/O characterization

2017-04-08T01:50:00.000-07:00

In my previous post, I characterized the STARSHIPRAIDER I/O circuit for high voltage fault transient performance, but was unable to adequately characterize the high speed data performance because my DSO (Rigol DS1102D) only has 100 MHz of bandwidth.

Although I did have some ideas on how to improve the performance of the current I/O circuit, it was already faster than I could measure so I had no way to know if my improvements were actually making it any better. Ideally I'd just buy an oscilloscope with several GHz of bandwidth, but I'm not made of money and those scopes tend to be in the "request a quote" price range.

The obvious solution was to build one. I already had a proven high-speed sampling architecture from my TDR project so all I had to do was repackage it as an oscilloscope and make it faster still.

The circuit was beautifully simple: an output from the FPGA drives a 50 ohm trace to a SMA connector, then a second SMA connector drives the positive input of an ADCMP572 through a 3 dB attenuator (to keep my signal within range). The negative input is driven by a cheap 12-bit I2C DAC. The comparator output is then converted from CML to LVDS and fed to the host FPGA board. Finally, a 3.3V CML output from the FPGA drives the latch enable input on the comparator.

The "ADC" algorithm is essentially the same as on my TDR. I like to think of it as an equivalent-time version of a flash ADC: rather than 256 comparators digitizing the signal once, I digitize the signal 256 times with one comparator (and of course 256 different reference voltages). The post-processing to turn the comparator outputs into 8-bit ADC codes is the same.

Unlike the TDR, however, I also do equivalent-time sampling in the time domain. The FPGA generates the sampling and PRBS clocks with different PLL outputs (at 250 MHz / 4 ns period), and sweeps the relative phase in 100 ps steps to produce an effective resolution of 10 Gsps / 100 ps timebase.

Without further ado here's a picture of the board. Total BOM cost including connectors and PCB was approximately $50.

Oscilloscope board (yes, it's PMOD form factor!)

After some initial firmware development I was able to get some preliminary eye renders off the board. They were, to say the least, not ideal.

250 Mbps: very bumpy rise

500 Mbps: significant eye closure even with increased drive strength

I spent quite a while tracking down other bugs before dealing with the signal integrity issues. For example, a low-frequency pulse train showed up with a very uneven duty cycle:

Duty cycle distortion

Someone suggested that I try a slow rise time pulse to show the distortion more clearly. Not having a proper arbitrary waveform generator, I made do with a squarewave and R-C lowpass filter.

Ever seen breadboarded passives interfacing to edge-launch SMA connectors before?

It appeared that I had jump discontinuities in my waveform every two blocks (color coding)

I don't have an EE degree, but I can tell this looks wrong!

Interestingly enough, two blocks (of 32 samples each) were concatenated into a single JTAG transfer. These two were read in one clock cycle and looked fine, but the junction to the next transfer seemed to be skipping samples.

As it turned out, I had forgotten to clear a flag which led to me reading the waveform data before it was done capturing. Since the circular buffer was rotating in between packets, some samples never got sent.

The next bug required zooming into the waveform a bit to see. The samples captured on the first few (the number seemed to vary across bitstream builds) of my 40 clock phases were showing up shifted by 4 ns (one capture clock).

Horizontally offset samples

I traced this issue to a synchronizer between clock domains having variable latency depending on the phase offset of the source and destination clocks. This is an inherent issue in clock domain crossing, so I think I'm just going to have to calibrate it out somehow. For the short term I'm manually measuring the number of offset phases each time I recompile the FPGA image, and then correcting the data in post-processing.

The final issue was a hardware bug. I was terminating the incoming signal with a 50Ω resistor to ground. Although this had good AC performance, at DC the current drawn from a high-level input was quite significant (66 mA at 3.3V). Since my I/O pins can't drive this much, the line was dragged down.

I decided to rework the input termination to replace the 50Ω terminator with split 100Ω resistors to 3.3V and ground. This should have about half the DC current draw, and is Thevenin equivalent to a 50Ω terminator to 1.65V. As a bonus, the mid-level termination will also allow me to AC-couple the incoming signal if that becomes necessary.

Mill out trace from ground via to on-die 50Ω termination resistor

Remove soldermask from ground via and signal trace

Add 100Ω 0402 low-side terminator

Add 100Ω 0402 high-side terminator, plus jumper trace to 3.3V bulk decoupling cap

Add 10 nF high speed decoupling cap to help compensate for inductance of long feeder trace

I cleaned off all of the flux residue and ran a second set of eye loopback tests at 250 and 500 Mbps. The results were dramatically improved:

Post-rework at 250 Mbps

Post-rework at 500 Mbps

While not perfect, the new eye openings are a lot cleaner. I hope to tweak my input stage further to reduce probing artifacts, but for the time being I think I have sufficient performance to compare multiple STARSHIPRAIDER test circuits and see how they stack up at relatively high speeds.

Next step: collect some baseline data for the current STARSHIPRAIDER characterization board, then use that to inform my v0.2 I/O circuit!

STARSHIPRAIDER: Input buffer rev 0.1 design and characterization

2017-02-05T01:27:00.001-08:00

Working as an embedded systems pentester is a lot of fun, but it comes with some annoying problems. There's so many tools that I can never seem to find the right one. Need to talk to a 3.3V UART? I almost invariably have an FTDI cable configured for 5 or 1.8V on my desk instead. Need to dump a 1.8V flash chip? Most of our flash dumpers won't run below 3.3. Need to sniff a high-speed bus? Most of the Saleae Logic analyzers floating around the lab are too slow to keep up with fast signals, and the nice oscilloscopes don't have a lot of channels. And everyone's favorite jack-of-all-trades tool, the Bus Pirate, is infamous for being slow.

As someone with no shortage of virtual razors, I decided that this yak needed to be shaved! The result was an ongoing project I call STARSHIPRAIDER. There will be more posts on the project in the coming months so stay tuned!

The first step was to decide on a series of requirements for the project:

32 bidirectional I/O ports split into four 8-pin banks.
This is enough to sniff any commonly encountered embedded bus other than DRAM. Multiple banks are needed to support multiple voltage levels in the same target.
Full support for 1.2 to 5V logic levels.This is supposed to be a "Swiss Army knife" embedded systems debug/testing tool. This voltage range encompasses pretty much any signalling voltage commonly encountered in embedded devices.
Tolerance to +/- 12V DC levels.Test equipment needs to handle some level of abuse. When you're reverse engineering a board it's easy to hook up ground to the wrong signal, probe a power rail, or even do both at once. The device doesn't have to function in this state (shutting down for protection is OK) but needs to not suffer permanent damage. It's also OK if the protection doesn't handle AC sources - the odds of accidentally connecting a piece of digital test equipment to a big RF power amplifier are low enough that I'm not worried.
500 Mbps input/output rate for each pin.This was a somewhat arbitrary choice, but preliminary math indicated it was feasible. I wanted something significantly faster than existing tools in the class.
Ethernet-based interface to host PC.I've become a huge fan of Ethernet and IPv6 as communications interface for my projects. It doesn't require any royalties or license fees, scales from 10 Mbps to >10 Gbps and supports bridging between different link speeds, supports multi-master topologies, and can be bridged over a WAN or VPN. USB and PCIe, the two main alternatives, can do few if any of these.
Large data buffer.Most USB logic analyzers have very high peak capture rates, but the back-haul interface to the host PC can't keep up with extended captures at high speed. Commodity DRAM is so cheap that there's no reason to not stick a whole SODIMM of DDR3 in the instrument to provide an extremely deep capture buffer.
Multiple virtual instruments connected to a crossbar.Any nontrivial embedded device contains multiple buses of interest to a reverse engineer. STARSHIPRAIDER needs to be able to connect to several at once (on arbitrary pins), bridge them out to separate TCP ports, and allow multiple testers to send test vectors to them independently.

The brain of the system will be fairly straightforward high-speed digital. It will be a 6-8 layer PCB with an Artix-7 FPGA in FGG484 package, a SODIMM socket for 4GB of DDR3 800, a KSZ9031 Gigabit Ethernet PHY, a TLK10232 10gbit Ethernet PHY, and a SFP+ cage, plus some sort of connector (most likely a Samtec Q-strip) for talking to the I/O subsystem on a separate board.

The challenging part of the design, from an architectural perspective, seemed to be the I/O buffer and input protection circuit, so I decided to prototype it first.

STARSHIPRAIDER v0.1 I/O buffer design

A block diagram of the initial buffer design is shown above. The output buffer will be discussed in a separate post once I've had a chance to test it; today we'll be focusing on the input stage (the top half of the diagram).

During normal operation, the protection relay is closed. The series resistor has insignificant resistance compared to the input impedance of the comparator (an ADCMP607), so it can be largely ignored. The comparator checks the input signal against a threshold (chosen appropriately for the I/O standard in use) and sends a differential signal to the host board for processing. But what if something goes wrong?

If the user accidentally connects the probe to a signal outside the acceptable voltage range, a Schottky diode connected to the +5V or ground rail will conduct and shunt the excess voltage safely into the power rails. The series resistor limits fault current to a safe level (below the diode's peak power rating). After a short time (about 150 µs with my current relay driver), the protection relay opens and breaks the circuit.

The relay is controlled by a Silego GreenPAK4 mixed-signal FPGA, running a small design written in Verilog and compiled with my open-source toolchain. The code for the GreenPAK design is on Github.

All well and good in theory... but does it work? I built a characterization board containing a single I/O buffer and loaded with test points and probe connectors. You can grab the KiCAD files for this on Github as well. Here's a quick pic after assembly:

STARSHIPRAIDER I/O characterization board

Initial test results were not encouraging. Positive overvoltage spikes were clamped to +8V and negative spikes were clamped to -1V - well outside the -0.5 to +6V absolute max range of my comparator.

Positive transient response

Negative transient response

After a bit of review of the schematics, I found two errors. The "5V" ESD diode I was using to protect the high side had a poorly controlled Zener voltage and could clamp as high as 8V or 9V. The Schottky on the low side was able to survive my fault current but the forward voltage increased massively beyond the nominal value.

I reworked the board to replace the series resistor with a larger one (39 ohms) to reduce the maximum fault current, replaced the low-side Schottky with one that could handle more current, and replaced the Zener with an identical Schottky clamping to the +5V rail.

Testing this version gave much better results. There was still a small amount of ringing (less than five nanoseconds) a few hundred mV past the limit, but the comparator's ESD diodes should be able to safely dissipate this brief pulse.

Positive transient response, after rework

Negative transient response, after rework

Now it was time to test the actual signal path. My first iteration of the test involved cobbling together a signal path from an FPGA board through the test platform and to the oscilloscope without any termination. The source of the signal was a BNC-to-minigrabber flying lead test clip! Needless to say, results were less than stellar.

PRBS31 eye at 80 Mbps through protection circuit with flying leads and no terminator

After ordering some proper RF test supplies (like an inline 50 ohm BNC terminator), I got much better signal quality. The eye was very sharp and clear at 100 Mbps. It was visibly rounded at 200 Mbps, but rendering a squarewave at that rate requires bandwith much higher than the 100 MHz of my oscilloscope so results were inconclusive.

PRBS31 eye at 100 Mbps through protection circuit with proper cabling

PRBS31 eye at 200 Mbps, limited by oscilloscope bandwidth

I then hooked the protection circuit up to the comparator to test the entire inbound signal chain. While the eye looked pretty good at 100 Mbps (plotting one leg of the differential since my scope was out of channels), at 200 Mbps horrible jitter appeared.

PRBS31 eye at 100 Mbps through full input buffer

PRBS31 eye at 200 Mbps through full input buffer

After quite a bit of scratching my head and fumbling with datasheets, I realized my oscilloscope was the problem by plotting the clock reference I was triggering on. The jitter was visible in this clock as well, suggesting that it was inherent in the oscilloscope's trigger circuit. This isn't too surprising considering I'm really pushing the limits of this scope - I need a better one to do this kind of testing properly.

PRBS31 eye at 200 Mbps plus 200 MHz sync clock

At this point I've done about all of the input stage testing I can do with this oscilloscope. I'm going to try and rig up a BER tester on the FPGA so I can do PRBS loopback through the protection stage and comparator at higher speeds, then repeat for the output buffer and the protection run in the opposite direction.

I still have more work to do on the protection circuit as well... while it's fine at 100 Mbps, the 2x 10pF Schottky diode parasitic capacitance is seriously degrading my rise times (I calculated an RC filter -3dB point of around 200 MHz, so higher harmonics are being chopped off). I have some ideas on how I can cut this down much less but that will require a board respin and another blog post!

Open Verilog flow for Silego GreenPak4 programmable logic devices

2016-05-08T00:55:00.003-07:00

I've written a couple of posts in the past few months but they were all for the blog at work so I figured I'm long overdue for one on Silicon Exposed.

So what's a GreenPak?

Silego Technology is a fabless semiconductor company located in the SF Bay area, which makes (among other things) a line of programmable logic devices known as GreenPak. Their 5th generation parts were just announced, but I started this project before that happened so I'm still targeting the 4th generation.

GreenPak devices are kind of like itty bitty PSoCs - they have a mixed signal fabric with an ADC, DACs, comparators, voltage references, plus a digital LUT/FF fabric and some typical digital MCU peripherals like counters and oscillators (but no CPU).

It's actually an interesting architecture - FPGAs (including some devices marketed as CPLDs) are a 2D array of LUTs connected via wires to adjacent cells, and true (product term) CPLDs are a star topology of AND-OR arrays connected by a crossbar. GreenPak, on the other hand, is a star topology of LUTs, flipflops, and analog/digital hard IP connected to a crossbar.

Without further ado, here's a block diagram showing all the cool stuff you get in the SLG46620V:

SLG46620V block diagram (from device datasheet)

They're also tiny (the SLG46620V is a 20-pin 0.4mm pitch STQFN measuring 2x3 mm, and the lower gate count SLG46140V is a mere 1.6x2 mm) and probably the cheapest programmable logic device on the market - $0.50 in low volume and less than $0.40 in larger quantities.

The Vdd range of GreenPak4 is huge, more like what you'd expect from an MCU than an FPGA! It can run on anything from 1.8 to 5V, although performance is only specified at 1.8, 3.3, and 5V nominal voltages. There's also a dual-rail version that trades one of the GPIO pins for a second power supply pin, allowing you to interface to logic at two different voltage levels.

To support low-cost/space-constrained applications, they even have the configuration memory on die. It's one-time programmable and needs external Vpp to program (presumably Silego didn't want to waste die area on charge pumps that would only be used once) but has a SRAM programming mode for prototyping.

The best part is that the development software (GreenPak Designer) is free of charge and provided for all major operating systems including Linux! Unfortunately, the only supported design entry method is schematic entry and there's no way to write your design in a HDL.

While schematics may be fine for quick tinkering on really simple designs, they quickly get unwieldy. The nightmare of a circuit shown below is just a bunch of counters hooked up to LEDs that blink at various rates.

Schematic from hell!

As if this wasn't enough of a problem, the largest GreenPak4 device (the SLG46620V) is split into two halves with limited routing between them, and the GUI doesn't help the user manage this complexity at all - you have to draw your schematic in two halves and add "cross connections" between them.

The icing on the cake is that schematics are a pain to diff and collaborate on. Although GreenPak schematics are XML based, which is a touch better than binary, who wants to read a giant XML diff and try to figure out what's going on in the circuit?

This isn't going to be a post on the quirks of Silego's software, though - that would be boring. As it turns out, there's one more exciting feature of these chips that I didn't mention earlier: the configuration bitstream is 100% documented in the device datasheet! This is unheard of in the programmable logic world. As Nick of Arachnid Labs says, the chip is "just dying for someone to write a VHDL or Verilog compiler for it". As you can probably guess by from the title of this post, I've been busy doing exactly that.

Great! How does it work?

Rather than wasting time writing a synthesizer, I decided to write a GreenPak technology library for Clifford Wolf's excellent open source synthesis tool, Yosys, and then make a place-and-route tool to turn that into a final netlist. The post-PAR netlist can then be loaded into GreenPak Designer in order to program the device.

The first step of the process is to run the "synth_greenpak4" Yosys flow on the Verilog source. This runs a generic RTL synthesis pass, then some coarse-grained extraction passes to infer shift register and counter cells from behavioral logic, and finally maps the remaining logic to LUT/FF cells and outputs a JSON-formatted netlist.

Once the design has been synthesized, my tool (named, surprisingly, gp4par) is then launched on the netlist. It begins by parsing the JSON and constructing a directed graph of cell objects in memory. A second graph, containing all of the primitives in the device and the legal connections between them, is then created based on the device specified on the command line. (As of now only the SLG46620V is supported; the SLG46621V can be added fairly easily but the SLG46140V has a slightly different microarchitecture which will require a bit more work to support.)

After the graphs are generated, each node in the netlist graph is assigned a numeric label identifying the type of cell and each node in the device graph is assigned a list of legal labels: for example, an I/O buffer site is legal for an input buffer, output buffer, or bidirectional buffer.

Example labeling for a subset of the netlist and device graphs

The labeled nodes now need to be placed. The initial placement uses a simple greedy algorithm to create a valid (although not necessarily optimal or even routable) placement:

Loop over the cells in the netlist. If any cell has a LOC constraint, which locks the cell to a specific physical site, attempt to assign the node to the specified site. If the specified node is the wrong type, doesn't exist, or is already used by another constrained node, the constraint is invalid so fail with an error.
Loop over all of the unconstrained cells in the netlist and assign them to the first unused site with the right label. If none are available, the design is too big for the device so fail with an error.

Once the design is placed, the placement optimizer then loops over the design and attempts to improve it. A simulated annealing algorithm is used, where changes to the design are accepted unconditionally if they make the placement better, and with a random, gradually decreasing probability if they make it worse. The optimizer terminates when the design receives a perfect score (indicating an optimal placement) or if it stops making progress for several iterations. Each iteration does the following:

Compute a score for the current design based on the number of unroutable nets, the amount of routing congestion (number of nets crossing between halves of the device), and static timing analysis (not yet implemented, always zero).
Make a list of nodes that contributed to this score in some way (having some attached nets unroutable, crossing to the other half of the device, or failing timing).
Remove nodes from the list that are LOC'd to a specific location since we're not allowed to move them.
Remove nodes from the list that have only one legal placement in the device (for example, oscillator hard IP) since there's nowhere else for them to go.
Pick a node from the remainder of the list at random. Call this our pivot.
Find a list of candidate placements for the pivot:

Consider all routable placements in the other half of the device.
If none were found, consider all routable placements anywhere in the device.
If none were found, consider all placements anywhere in the device even if they're not routable.

Pick one of the candidates at random and move the pivot to that location. If another cell in the netlist is already there, put it in the vacant site left by the pivot.
Re-compute the score for the design. If it's better, accept this change and start the next iteration.
If the score is worse, accept it with a random probability which decreases as the iteration number goes up. If the change is not accepted, restore the previous placement.

After optimization, the design is checked for routability. If any edges in the netlist graph don't correspond to edges in the device graph, the user probably asked for something impossible (for example, trying to hook a flipflop's output to a comparator's reference voltage input) so fail with an error.

The design is then routed. This is quite simple due to the crossbar structure of the device. For each edge in the netlist:

If dedicated (non-fabric) routing is used for this path, configure the destination's input mux appropriately and stop.
If the source and destination are in the same half of the device, configure the destination's input mux appropriately and stop.
A cross-connection must be used. Check if we already used one to bring the source signal to the other half of the device. If found, configure the destination to route from that cross-connection and stop.
Check if we have any cross-connections left going in this direction. If they're all used, the design is unroutable due to congestion so fail with an error.
Pick the next unused cross-connection and configure it to route from the source. Configure the destination to route from the cross-connection and stop.

Once routing is finished, run a series of post-PAR design rule checks. These currently include the following:

If any node has no loads, generate a warning
If an I/O buffer is connected to analog hard IP, fail with an error if it's not configured in analog mode.
Some signals (such as comparator inputs and oscillator power-down controls) are generated by a shared mux and fed to many loads. If different loads require conflicting settings for the shared mux, fail with an error.

If DRC passes with no errors, configure all of the individual cells in the netlist based on the HDL parameters. Fail with an error if an invalid configuration was requested.

Finally, generate the bitstream from all of the per-cell configuration and write it to a file.

Great, let's get started!

If you don't already have one, you'll need to buy a GreenPak4 development kit. The kit includes samples of the SLG46620V (among other devices) and a programmer/emulation board. While you're waiting for it to arrive, install GreenPak Designer.

Download and install Yosys. Although Clifford is pretty good at merging my pull requests, only my fork on Github is guaranteed to have the most up-to-date support for GreenPak devices so don't be surprised if you can't use a bleeding-edge feature with mainline Yosys.

Download and install gp4par. You can get it from the Github repository.

Write your HDL, compile with Yosys, P&R with gp4par, and import the bitstream into GreenPak Designer to program the target device. The most current gp4par manual is included in LaTeX source form in the source tree and is automatically built as part of the compile process. If you're just browsing, there's a relatively recent PDF version on my web server.

If you'd like to see the Verilog that produced the nightmare of a schematic I showed above, here it is.

Be advised that this project is still very much a work in progress and there are still a number of SLG46620V features I don't support (see the manual for exact details).

I love it / it segfaulted / there's a problem in the manual!

Hop in our IRC channel (##openfpga on Freenode) and let me know. Feedback is great, pull requests are even better,

You're competing with Silego's IDE. Have they found out and sued you yet?

Nope. They're fully aware of what I'm doing and are rolling out the red carpet for me. They love the idea of a HDL flow as an alternative to schematic entry and are pretty amazed at how fast it's coming together.

After I reported a few bugs in their datasheets they decided to skip the middleman and give me direct access to the engineer who writes their documentation so that I can get faster responses. The last time I found a problem (two different parts of the datasheet contradicted each other) an updated datasheet was in my inbox and on their website by the next day. I only wish Xilinx gave me that kind of treatment!

They've even offered me free hardware to help me add support for their latest product family, although I plan to get GreenPak4 support to a more stable state before taking them up on the offer.

So what's next?

Better testing, for starters. I have to verify functionality by hand with a DMM and oscilloscope, which is time consuming.

My contact at Silego says they're going to be giving me documentation on the SRAM emulation interface soon, so I'm going to make a hardware-in-loop test platform that connects to my desktop and the Silego ZIF socket, and lets me load new bitstreams via a scriptable interface. It'll have FPGA-based digital I/O as well as an ADC and DAC on every device pin, plus an adjustable voltage regulator for power, so I can feed in arbitrary mixed-signal test waveforms and write PC-based unit tests to verify correct behavior.

Other than that, I want to finish support for the SLG46620V in the next month or two. The SLG46621V will be an easy addition since only one pin and the relevant configuration bits have changed from the 46620 (I suspect they're the same die, just bonded out differently).

Once that's done I'll have to do some more extensive work to add the SLG46140V since the architecture is a bit different (a lot of the combinatorial logic is merged into multi-function blocks). Luckily, the 46140 has a lot in common architecturally with the GreenPak5 family, so once that's done GreenPak5 will probably be a lot easier to add support for.

My thanks go out to Clifford Wolf, whitequark, the IRC users in ##openfpga, and everyone at Silego I've worked with to help make this possible. I hope that one day this project will become mature enough that Silego will ship it as an officially supported extension to GreenPak Designer, making history by becoming the first modern programmable logic vendor to ship a fully open source synthesis and P&R suite.

New GPG key

2015-10-27T22:37:00.003-07:00

Hi everyone,

I've been busy lately and haven't had a chance to post much. There will be a pretty good sized series coming up in a month or two (hopefully) on my next-gen FPGA cluster and JTAG stuff but I'm holding off until I have something better to write about.

In the meantime, I've decided that my circa 2009 GPG key is long overdue for replacement so I've issued a new one and am posting the fingerprints in multiple public locations (this being one).

The new key fingerprint is:

859B A7BA DE9C 0BD5 EC01  FF36 3461 7AB9 B31C 7D7C

Verification message signed with my old key:
http://thanatos.virtual.antikernel.net/unlisted/new-key-notes.txt.asc

Graduating, TDR prototype, lab move, and a conference talk

2015-05-23T21:42:00.001-07:00

So, it's been a busy couple of months and I haven't had time to post anything. Here's a few quick updates:

I successfully defended my Ph.D thesis few weeks ago and will be graduating next weekend. You can download the thesis and browse the code if you're so inclined. I plan to continue developing the project in my spare time and using it as the basis for future embedded gadgets so expect more posts on it over the coming months.

The TDR board came in and I assembled the first prototype. It had a few bugs (which I'll detail in a future post) but after reworking them all of the major subsystems seem functional and I'm working on the firmware. Expect another post over the summer once I've made more progress.

TDR prototype during bring-up.
This was my improvised jig for holding probes on small test points.

Now that I'm done with school I'm moving across the country in a week to start work at my new job. My lab is currently living in 125 cardboard boxes weighing just shy of 1900 pounds. (This doesn't count my desktop computer and monitors, test equipment, microscope bench, or the servers and rack as they're all still in use and being packed up over the next few days.) The lab will be totally down for 1-2 months.

The current state of my lab

Finally, I will be speaking on CPLD reverse engineering at REcon this June. The topic of the talk is the XC2C32A, which my regular readers may remember from a previous post. I'll be describing the reverse engineering process, how far I got, and showcasing bitstreams generated by my toolchain running on actual silicon. If any you are going, by all means introduce yourself :)

How to lose my business permanently

2015-02-04T22:39:00.001-08:00

This post is a bit different from my usual ones in that it discusses the business side of the semiconductor industry, not the technical side. The issue has been getting more and more problematic lately so I figured I'd write up a few quick thoughts on it.

Let's suppose you're a large semiconductor company who is currently making a large amount of money selling chips to a couple of major customers. You've decided that your business is too big and you have no desire to get new customers, now or in the future. In fact, you don't even want these companies to use your products in new designs. What are some ways you can get rid of these pesky engineers trying to throw money at you?

Make your parts hard to find. Ask major distributors like Digi-Key and AVnet to discontinue stocking them.
If someone does manage to find an authorized sales partner, pester them with questions even if they're just looking for a budgetary price quote for a feasibility study. Ask for a project name, description, business plan, names of the team members, color of the soldermask, logo, and anything else you can think of. If it looks like an initial proof of concept that the customer isn't yet confident will become a high-volume product, or a one-off test fixture/lab tool, badger them by asking about annual sales volume and volume ramp-up dates until they lose interest and buy from a competitor.
Just in case anyone actually succeeds in buying your part, make it useless to them. Keep the datasheet locked up in a steel vault in your corporate headquarters. Promise would-be customers that you'll let them see it if they sign away their firstborn son and sacrifice a golden lamb on an altar made of FPGAs, but hide the actual NDA contract behind so many redirect pages and broken links that nobody can actually sign it, much less see the actual datasheet. Bonus points if your chip is something commodity like a gigabit Ethernet PHY that has nothing even remotely sensitive in the datasheet.

If you follow these rules properly, congratulations! I'll do my part to further your goals by making sure you will never get design wins in any projects I'm involved in, especially high-volume ones for large companies. Your shareholders will be overjoyed.

TDR updates

2015-01-20T23:05:00.001-08:00

A few months ago, I wrote about a project I had been thinking of for a while but not had time to work on: a time-domain reflectometer (TDR) for testing twisted pair Ethernet cables.

TDR background

The basic theory of operation is simple: Send a pulse down a transmission line and measure the reflected voltage over time to get a plot of impedance discontinuities over time. Unfortunately, doing this with sufficient temporal resolution (sub-nanosecond) requires extremely high analog sampling rates, and GHz A/D converters are (to say the least) not cheap: the least expensive 1 GSA/s ADC on Digi-Key is the HMCAD1520, which sells for $120 each at the time of this writing. Higher sampling rates cost even more, the 1.5 GSa/s ADC081500CIYB is listed at $347.

One possible architecture would consist of a pre-amplifier for each channel, a 4:1 RF mux, and a single high-speed ADC sampled by an FPGA. This would work, but seemed quite expensive and I wanted to explore lower-cost options.

ADC architecture

After thinking about the problem for a while, I realized that the single most expensive component in a classical TDR was probably the ADC - but there was no easy way to make it cheaper. What if I could eliminate the ADC entirely?

I drew inspiration from the successive-approximation-register (SAR) ADC architecture, which essentially converts a DAC into an ADC by binary searching. The basic operating algorithm is as follows:

For each point T in time

vstart = 0
vend = Vref
Set DAC to (Vstart + Vend) / 2
Compare Vin against Vdac
If Vin > Vdac

Set current bit of sample to 1
Set Vstart to Vdac, update ADC, repeat

else

Set current bit of sample to 0
Set Vend to Vdac, update ADC, repeat

The problem with SAR for high speeds is that N-bit analog resolution at M samples per second requires a DAC that can run at O(M log N) samples per second - hardly an improvement!

In order to work around this problem, I began to think about ways to represent the data generated by a SAR ADC. I ended up modeling a simplified SAR ADC which performed a linear, rather than binary, search. We can represent the intermediate data as a matrix of 2^N rows by M columns, one for each of M data points.

The sampling algorithm for this simplified ADC works as follows:

For each point T in time

Set DAC to 0
Compare Vin against Vdac

If Vin > Vdac

Set column[Vdac] to 1
Increment Vdac
Repeat comparison

Otherwise stop and capture the next sample

Once we have this matrix, we can simply sum the number of 1s in each column to calculate the corresponding sample value.

While this approach will clearly work, it is exponentially slower than the conventional SAR ADC since it requires 2^N samples instead, instead of N, for N-bit precision. So why is it useful?

Now consider what happens if we acquire the data from a transposed version of the same matrix:

For each Vdac from 0 to Vref

For each point T in time

Compare Vin against Vdac

If Vin > Vdac

Set row Vin of column T to 1

Increment Vref
Go back in time and loop over the signal again

This version clearly captures the same data, since matrix[T][V] is set to true iff sample T is less than V. We simply switch the inner and outer loops.

It also has a very interesting property for cost optimization: Since it only updates the DAC after sampling the entire signal, we can now use a much slower (and cheaper) DAC than with a conventional SAR. In addition, the comparator can now update at the sampling frequency instead of 2^N times the sampling frequency.

There's just one problem: It requires time travel! Why are we wasting our time analyzing a circuit that can't actually be built?

Well, as it turns out we can solve this problem too - with "parallel universes". Since the impedance of the cable is (hopefully) fairly constant over time, if we send multiple pulses they should return identical reflection waveforms. We can thus send out a pulse, test one candidate Vdac value against this waveform, then increment Vdac and repeat.

The end result is that with a cheap SPI DAC, a high-speed comparator, and an FPGA with a high-speed digital input we can digitize a repetitive signal to arbitrary analog precision, with sampling rate limited only by comparator bandwidth and FPGA input buffer performance!

Pulse generation

The first step in any TDR, of course, is to generate the pulse.

I spent a while looking over FPGAs and ended up deciding on the Xilinx Kintex-7 series, specifically the XC7K70T. The -1 speed grade can do 1.25 Gbps LVDS in the high-performance I/O banks (matching the -2 and -3 speed Artix-7 devices) and the higher speed grades can go up to 1.6 Gbps.

The pulse is generated by using the OSERDES of the FPGA to produce a single-cycle LVDS 1 followed by a long series of 0s. The resulting LVDS pulse is fed into a Micrel SY58603U LVDS-to-CML buffer. This slightly increases the amplitude of the output pulse and sharpens the rise time up to 80ps.

The resulting pulse is then sent through the RJ45 connector onto the cable being tested.

Output buffer

Input preamplifier

The reflected signal coming off the differential pair is AC coupled with a pair of capacitors to prevent bus fights between the (unequal) common-mode voltages of the output buffer and the preamplifier. It is then fed into a LMH6881 programmable-gain preamplifier.

This is by far the most pricey analog component I've used in a design: nearly $10 for a single amplifier. But it's a very nice amplifier (made on a SiGe BiCMOS process)- very high linearity, 2.4 GHz bandwidth, and gain from 6 to 26 dB programmable over SPI in 0.25 dB steps.

Input preamplifier

The optional external terminator (R25) is intended to damp out any reflections coming off of the preamplifier if they present a problem; during the initial assembly I plan to leave it unpopulated. Since this is my first high-speed mixed signal design I'm trying to make it easy to tweak if I screwed up somehow :)

The output of the preamplifier is a differential signal with 2.5V common mode offset.

Differential to single-ended conversion

The next step is to convert the differential voltage into a single-ended voltage that we can feed into the comparator. I use an AD8045 unity gain voltage feedback amplifier for this, configured to compute CH1_VDIFF = (CH1_BUF_P - CH1_BUF_N) + 2.5V.

Digitization

The single-ended voltage is compared against the DAC output (AFE_VREF) using one half of an LMH7322 dual comparator.

The output supply of the comparator is driven by a 2.5V supply to produce LVDS-compatible differential output voltage levels.

PCB layout

The board was laid out in KiCAD using the new push-and-shove router. All of the differential pairs were manually length-matched to 0.1mm or better.

The upper left corner of the board contains four copies of the AFE. The AD8045s are on the underside of the board because the pinout made routing easier this way. Hopefully the impedance discontinuities from the vias won't matter at these signal speeds...

AFE layout, front side

AFE layout, back side

The rest of the board isn't nearly as complex: the lower left has a second RJ45 connector and a RGMII PHY for interfacing to the host PC, the power supply is at the upper right, and the FPGA is bottom center.

The power supply is divided into two regions, digital and analog. The digital supply is on the far right side of the board, safely isolated from the AFE. It uses an LTC3374 to generate an 1.0V 4A rail for the FPGA core, a 1.2V 2A rail for the FPGA transceivers and Ethernet PHY, a 1.8V 1A rail for digital I/O, and a 2.5V rail for the CML buffers and Ethernet analog logic.

The analog supply was fairly close to the AFE so I put a guard ring around it just to be safe. It consists of a LTC3122 boost converter to push the 5V nominal input voltage up to 6V, followed by a 5V LDO to give a nice clean analog rail. I ran the output of the LDO through a pi filter just to be extra safe.

The TDR subsystem didn't use any of the four 6.6 Gbps GTX serial transceivers on the FPGA because they are designed to recover their clock from the incoming signal and don't seem to support use of an external reference clock. It seemed a shame to waste them, though, so I broke them (as well as 20 0.95 Gbps LVDS channels) out to a Samtec Q-strip header for use as high-speed GPIO.

Without further ado, here's the full layout. I could have made the board quite a bit smaller in the vertical axis but I needed to keep a constant 100mm high so it would fit in the card guides on my Eurocard rack.

The board is at fabs for quotes now and I'll make another post once the boards come back.

Layer 1


Layer 2 (ground)

Layer 3 (power)

Layer 4, flipped to make text readable

GERALD microscope control system

2015-01-15T07:24:00.000-08:00

While my optical microscopes are capable of sufficient resolution for imaging larger-process ICs, taking massive die images (this one, for a comparatively small 3.2mm^2 die, is about 0.6 gigapixels) has been beyond my capability because I have better things to do with my time than sit in the lab for a week turning the stage knob a little, clicking a button to take a picture, turning the knob a little...

John has a computer-controlled microscope that he uses for large imaging jobs and it seemed like it'd be a good idea to make one of my own. The first step was to write some control software. John's user interface is very much a "programmer's GUI" and I figured something with a bit more eye candy wouldn't be too hard to do.

The result was a system called GERALD. (I couldn't think of a name for it, asked my girlfriend for suggestions, and this was the first she came up with...)

The prototype is using my old AmScope microscope because I didn't want to take my main Olympus out of service for an extended period of time during development. Yes, I'm aware that the "support" structure under the stage isn't very rigid... this is a software development testbed and I won't be using this microscope in the final deployment so there's no sense wasting time machining nice aluminum brackets. I just have to be careful not to move around too much when using it ;)

Prototype GERALD system on my desk. The breadboarded MCU is generating step/direction pulses from USB-serial commands and sending them to the stepper controller under the textbook.

The camera in use is an AmScope MD1900. It uses a proprietary USB protocol and only has Windows driver support, and I try to run a 100% Linux shop. This was a problem... In keeping with John's unofficial motto "Open source by any means necessary", I corrected the problem ;)

A bit of Wireshark and libusb coding later, I had a basic working driver. The protocol is closely related (but not identical) to the MU800 that John reversed a few months ago, which helped me get my feet in the door. There's a bit of trivial obfuscation (XORing control transactions with 0x55aa or 0x5aa5) which I fail to understand... The average user isn't going to notice anything, and anyone with the skills to reverse engineer USB transactions or a kernel driver will see right through it, so why bother?

I plan to try making a kernel V4L2 driver in the future but for now it works so it's not a huge priority.

The current GERALD system is very much a WIP, most basic features are there but automated image capture and some other things aren't implemented yet.

GERALD UI overview

The basic UI layout is modeled in large part on the FEI Versa FIB's control system, which is currently my favorite control system for either optical or electron microscopes.

The upper left panel is an overview of the current camera feed, scaled down to fit. The upper right view, meanwhile, shows the center of the feed at native (1:1 pixel) resolution. Both views have a footer that includes a scale bar, date/time, the objective in use, and magnification. (The magnification is the actual value based on calibration of the camera and my 24" 1080p display.)

The lower right view is a webcam pointed at the sample stage. I haven't had the time to make a bracket to hold it in the right spot so it's just sitting on my desk for now. This is the equivalent of a SEM chamber cam and I hope to eventually use it to avoid crashing the sample into the objective in full-remote-control operation.

The lower left view is a navigation display showing the current sample. Unlike the Versa's nav-cam (a single static image taken of the stage when the sample is loaded) my navigation view is a composite of actual microscope images. Every time a new video frame comes in when the stage isn't moving (to avoid motion blur) it is plotted in the navigation view at the current physical position of the stage.

As of now the navigation view is non-interactive; all it does is show the current field of view on the sample. My plan is to support clicking to move the stage to a specific point, as well as drawing to define a rectangular area for step-and-repeat imaging.

In typical use I envision the user moving to the upper left and bottom right corner of the sample manually with the joystick, then selecting an objective and drawing a box around the sample to initiate a high-resolution imaging run.

In order for all of this to work properly, the system must be calibrated. The first step is to calibrate the camera so that it knows how large each pixel is. I've done this manually in the past by taking pictures of a calibration slide and counting pixels, but it's high time I automated the process.

GERALD pixel size calibration

The algorithm is quite simple and is designed to work with the pattern on my calibration slide (10um pitch short lines and 50um pitch long lines) only. As of now the are limited sanity checks so if there's no calibration slide in view the results can be somewhat strange :)

Convert the image to grayscale using NTSC color weights
Compute a gray-level histogram
Median-filter the histogram to smooth out spikes and find peaks
The distribution should be approximately bimodal (white lines on dark background). Take the mean of the peaks and threshold the image to binary.
Do a median filter on the binary image to smooth out noise
Scan across the image horizontally, one scan line at a time. Note the start and end locations of each white area. If the width is too small (less than two pixels) or too large (more than 10% of the screen) discard it. Otherwise save the width and centroid as a slice down a potential line.
For each slice, check if the next row down has a line slice within a couple of pixels. If so, add it to the growing polyline and remove it from the list of candidates.
Fit a line segment to the points in each polyline using least-squares, set its width to the mean of all slices used
Extend each line segment until it hits the edge of the image. If the projected line gets within a very short distance of BOTH endpoints of a second segment, the two are collinear so merge them. (This will smooth out gaps in the detected lines from dust specks etc.) The resulting lines are plotted in green in the calibration view.
Project the lines onto the X axis and sort them from left to right.
For each line from left to right, measure the perpendicular distance to the next one over (displayed in red in the calibration view).
Compute the median of these line lengths. This is considered to be the pitch of the lines, in pixels.
Find the median width of the lines.
If the pitch is less than three times the line width, we're looking at fine-pitch lines (10 μm pitch). Otherwise the fine pitch lines were too small to resolve and we're looking at coarse pitch (50 μm).
Compute the physical size of one camera pixel from the known physical pitch and pixel pitch.

Now that that the camera has been calibrated, we know how big a camera pixel is - but we don't have the scaling and rotation factors needed to transform between camera and stage coordinates. This is the second half of the calibration.

Rather than trying to compute a full transformation matrix, the current code simply represents a motor step for each motor as a 2-vector describing the distance (in nanometers, referenced to the camera axes) the stage moves during one motor step. We can then easily compute the distance moved by any number of motor steps as a linear combination of these two 2-vectors.

This algorithm is built on top of the same CV code used for the camera calibration.

Find the rightmost line on the calibration slide.
Locate the midpoint of it. (This is marked with a blue dot in the debug overlay.)
Check if this keypoint is in the left 1/4 of the camera's FOV.
If not, move left 50 steps, take a picture, and repeat until it is.
Record the position of the keypoint.
Move right, 50 steps at a time, until the keypoint is in the right 1/4 of the FOV.
Record the 2D distance the keypoint moved and divide by the number of steps taken to get the X axis step vector.
Repeat for the Y axis, except using 1/3 as the threshold instead of 2/3 since the Y axis FOV is smaller than X.

The system isn't quite finished but it's coming along nicely. Hopefully I'll have time in the next couple of months to finish the software, make a PCB for the control circuit, and machine brackets to hold all of the parts onto my Olympus scope.

FPGA cluster updates

2015-01-15T06:04:00.001-08:00

Although the "raised floor" design for my FPGA cluster looked cool, it really didn't scale. My entire desk was full, there was very limited room for new hardware, and the boards kept getting dusty. To make matters worse, long wires were needed to connect everything and it was difficult to manage them all.

Original FPGA cluster

I ended up moving forward with the plan I came up with a few months ago and my FPGA cluster is now living on the 19" rack in my living room.

The first step was to laser-cut acrylic frames for each board (or several boards, if they were small enough) that would slide into the card guides.

In the photo below you can see nodes lx9mini0 and lx9mini2 (lx9mini1 was being used for something else at the time so I put it on another card later on) on a clear 1/16" acrylic sheet cut to standard Eurocard dimensions. The clear faceplate was later replaced with an opaque black one because I think it looks better that way.

Spartan-6 FPGA boards on a 3U blade card

My existing USB hubs weren't well suited to the rackmount form factor so I built a new one. This is a ten-port USB 2.0 hub with two front-panel ports (for keyboard and mouse) and eight back-side ports for internal connections. It consists of three 4-port Cypress hub chips in a tree, plus the associated PMICs. For extra fun I threw on an XC2C128 CPLD with a SPI headers so that I could potentially toggle power to individual ports remotely over SPI, but as of now this functionality isn't being used.

As with my previous hub designs all ports are overcurrent protected. I also added external ESD clamp diodes (RCLAMP0514M) to each data line after an killing a previous hub by zapping it while plugging my phone in to charge.

The port indicator LEDs are off in the idle state, green when a device has enumerated successfully, blinking green when a device is detected but fails to enumerate, and red for an overcurrent fault.

3U x 4HP USB hub blade

I went with my original plan of racking the Atlys boards in one of my empty 1U cases and the AC701 in another. The boards are screwed into standoffs which are attached to the case by cyanoacrylate adhesive.

Nodes atlys0 and atlys1 being installed in a 1U server case. JTAG and Ethernet cables are not yet installed.

I then installed my PDU board, the BeagleBone Black, the USB hub, and all of the smaller FPGA boards I was using for my research in the rack. Since I was moving the second 24-port switch from my desk to the rack I hooked the two together with a short run of multimode fiber. Fiber was overkill for the application but I had SFP ports sitting around unused and I was running out of copper interfaces...

The standard I've decided on for all new designs in the near future is as follows:

Boards are to be 100mm tall (3U Eurocard height)
Component keepouts along the top and bottom edges as specified by the Eurocard standard for card guides
Faceplate mounting holes on the front panel as specified by the Eurocard standard
5V DC center-high power via standard 5.5mm barrel jack on the back edge
Network jacks and indicator LEDs on the front edge
JTAG, USB, and other I/O on the back edge

My current rack

The current FPGA cluster

Back side of the FPGA cluster midway through rack install. The USB cable coming out the bottom was temporary for testing until I could find a right-angle one.

All of the cards have nice shiny black faceplates except my 4-port switch prototype from a few years ago - the laser cutter on campus was unavailable the day I wanted to get the faceplate made and I never got around to doing it.

I still have quite a bit more gear to rack - several Spartan-6 boards I'm not actively using in my research, a Raspberry Pi, a Parallella, and a large number of PIC MCU development boards of various types. This will fill the current card rack and then some, so I left 3U of empty space below this one for expansion. The Parallella runs hot so I ordered a 1U fan tray which will be installed later today.

The 1U blank above the card rack is there for airflow reasons (the fan will blow upwards and exhaust air needs some space to blow out) but I can still fit stuff that doesn't block airflow. Early next week I'll be replacing it with a 16-port LC-LC patch panel so that I can terminate fiber runs to various devices on the rack (such as the AC701 board, which currently has a 2m run of multimode going around the side of the rack because there's no good way to get fiber from back to front).

New microscope bench

2015-01-15T05:51:00.002-08:00

For a long time, my microscopes have lived on a folding plastic table in the corner of my lab. It was wobbling and causing blurry images, but I never got a chance to do something about it... until now.

It's been replaced it with a custom-built wooden workbench. I made it big and heavy on purpose to reduce vibrations: The tabletop is 3/4" flooring-grade plywood and the legs are 4x4" posts resting on top of rubber pads. (I actually made two of them, because one of my roommates liked the design and wanted one for himself. )

I don't have any photos of the early build process but here's one of the tabletop and legs before the shelving was installed. It was annoying having to stain and assemble it in the middle of my living room... I can't wait to have a garage or basement to work in!

EDIT: A friend who helped me with the build just sent me this one.

Test-fitting some of the lumber

New workbench, partially assembled

After installing shelves

In its final resting place

The stain turned out a little darker than I anticipated but it actually matched the paneling in my apartment surprisingly well so no complaints there. The 16 square feet of shelf space let me tidy up the lab and remove a lot of miscellaneous junk that had been sitting around with nowhere to go.

With equipment installed

It was a lot of fun to make - I had almost forgotten how much I enjoyed woodworking. Maybe my next piece of furniture will be something more pretty and less utilitarian in nature? (Making it out of hardwood instead of construction lumber, and having better tools, would help too...)

Hello 2015 - New job, new projects, and more!

2015-01-15T05:41:00.001-08:00

A few months ago, I wrote about some of my pending projects. Several of them are actively running, some are done, and a few new ones cropped up :) I'll be making a bunch of short posts in the next day or so on some of them.

It's been a busy fall so I haven't had time to post anything lately. I'm working hard on my thesis and plan to defend this spring. It's 84 pages and counting...

After I graduate I'll be packing up my lab and moving from Troy, NY to somewhere near Seattle. I'll be working as a senior security consultant in the hardware lab at IOActive, doing board, firmware, and silicon level reversing and security auditing, as well as original research... so you can expect some posts by me on their lab blog in the coming months. My personal lab will likely have a bit of downtime due to the move, although I'd like to get back up and running by the end of the summer at the latest.

I've also submitted a talk based on my CPLD reverse engineering work to REcon. I hinted earlier about the project but there's a lot more to it which hasn't been released publicly... details to come if the talk is accepted ;)

Why Apple's iPhone encryption won't stop NSA (or any other intelligence agency)

2014-10-04T16:59:00.001-07:00

Recent news headlines have made a big deal of Apple encrypting more of the storage on their handsets, and claiming to not have a key. Depending on who you ask this is either a huge win for privacy, or a massive blow to intelligence collection and law enforcement capabilities. I'm going to try avoiding expressing any opinions of government policy here and focus on the technical details of what is and is not possible - and why disk encryption isn't as much of a major game-changer as people seem to think.

Matthew Green at Johns Hopkins wrote a very nice article on the subject recently, but there are a few points I feel it's worth going into more detail on.

The general case here is that of two people, Alice and Bob, communicating with iPhones while a third party, Eve, attempts to discover something about their communications.

First off, the changes in iOS 8 are encrypting data on disk. Voice calls, SMS, and Internet packets still cross the carrier's network in cleartext. These companies are legally required (by CALEA in the United States, and similar laws in other countries) to provide a means for law enforcement or intelligence to access this data.

In addition, if Eve can get within radio range of Alice or Bob, she can record the conversation off the air. Although the radio links are normally encrypted, many of these cryptosystems are weak and can be defeated in a reasonable amount of time by cryptanalysis. Numerous methods are available for executing man-in-the-middle attacks between handsets and cell towers, which can further enhance Eve's interception capabilities.

Second, if Eve is able to communicate with Alice or Bob's phone directly (via Wi-Fi, SMS, MITM of the radio link, MITM further upstream on the Internet, physical access to the USB port, or using spearphishing techniques to convince them to view a suitably crafted e-mail or website) she may be able to use an 0day exploit to gain code execution on the handset and bypass any/all encryption by reading the cleartext out of RAM while the handset is unlocked. Although this does require that Eve have a staff of skilled hackers to find an 0day, or deep pockets to buy one, when dealing with a nation/state level adversary this is hardly unrealistic.

Although this does not provide Eve with the ability to exfiltrate the device encryption key (UID) directly, this is unnecessary if cleartext can be read directly. This is a case of the general trend we've been seeing for a while - encryption is no longer the weakest link, so attackers figure out ways to get around it rather than smash through.

Third, in many cases the contents of SMS/voice are not even required. If the police wish to geolocate the phone of a kidnapping victim (or a suspect) then triangulation via cell towers and the phone's GPS, using the existing e911 infrastructure, may be sufficient. If intelligence is attempting to perform contact tracing from a known target to other entities who might be of interest, then the "who called who when" metadata is of much more value than the contents of the calls.

There is only one situation where disk encryption is potentially useful: if Alice or Bob's phone falls into Eve's hands while locked and she wishes to extract information from it. In this narrow case, disk encryption does make it substantially more difficult, or even impossible, for Eve to recover the cleartext of the encrypted data.

Unfortunately for Alice and Bob, a well-equipped attacker has several options here (which may vary depending on exactly how Apple's implementation works; many of the details are not public).

If the Secure Enclave code is able to read the UID key, then it may be possible to exfiltrate the key using software-based methods. This could potentially be done by finding a vulnerability in the Secure Enclave (as was previously done with the TrustZone kernel on Qualcomm Android devices to unlock the bootloader). In addition, if Eve works for an intelligence agency, she could potentially send an NSL to Apple demanding that they write firmware, or sign an agency-provided image, to dump the UID off a handset.

In the extreme case, it might even be possible for Eve to compromise Apple's network and exfiltrate the certificate used for signing Secure Enclave images. (There is precedent for this sort of attack - the authors of Stuxnet appear to have stolen a driver-signing certificate from Realtek.)

If Apple did their job properly, however, the UID is completely inaccessible to software and is locked up in some kind of on-die hardware security module (HSM). This means that even if Eve is able to execute arbitrary code on the device while it is locked, she must bruteforce the passcode on the device itself - a very slow and time-consuming process.

In this case, an attacker may still be able to execute an invasive physical attack. By depackaging the SoC, etching or polishing down to the polysilicon layer, and looking at the surface of the die with an electron microscope the fuse bits can be located and read directly off the surface of the silicon.

E-fuse bits on polysilicon layer of an IC (National Semiconductor DMPAL16R). Left side and bottom right fuses are blown, upper right is conducting. (Note that this is a ~800nm process, easily readable with an optical microscope. The Apple A7 is made on a 28nm process and would require an electron microscope to read.) Photo by John McMaster, CC-BY

Since the key is physically burned into the IC, once power is removed from the phone there's no practical way for any kind of self-destruct to erase it. Although this would require a reasonably well-equipped attacker, I'm pretty confident based on my previous experience that I could do it myself, with equipment available to me at school, if I had a couple of phones to destructively analyze and a few tens of thousands of dollars to spend on lab time. This is pocket change for an intelligence agency.

Once the UID is extracted, and the encrypted disk contents dumped from the flash chips, an offline bruteforce using GPUs, FPGAs, or ASICs could be used to recover the key in a fairly short time. Some very rough numbers I ran recently suggest that an 6-character upper/lowercase alphanumeric SHA-1 password could be bruteforced in around 25 milliseconds (1.2 trillion guesses per second) by a 2-rack, 2500-chip FPGA cluster costing less than $250,000. Luckily, the iPhone uses an iterated key-derivation function which is substantially slower.

The key derivation function used on the iPhone takes approximately 50 milliseconds on the iPhone's CPU, which comes out to about 70 million clock cycles. Performance studies of AES on a Cortex-A8 show about 25 cycles per byte for encryption plus 236 cycles for the key schedule. The key schedule setup only has to be done once so if the key is 32 bytes then we have 800 cycles per iteration, or about 87,500 iterations.

It's hard to give exact performance numbers for AES bruteforcing on an FPGA without building a cracker, but if pipelined to one guess per clock cycle at 400 MHz (reasonable for a modern 28nm FPGA) an attacker could easily get around 4500 guesses per second per hash pipeline. Assuming at least two pipelines per FPGA, the proposed FPGA cluster would give 22.5 million guesses per second - sufficient to break a 6-character case-sensitive alphanumeric password in around half an hour. If we limit ourselves to lowercase letters and numbers only, it would only take 45 seconds instead of the five and a half years Apple claims bruteforcing on the phone would take. Even 8-character alphanumeric case-sensitive passwords could be within reach (about eight weeks on average, or faster if the password contains predictable patterns like dictionary words).

Threat modeling for FPGA software backdoors

2014-09-17T07:42:00.000-07:00

I've been interested in the security of compilers and related toolchains ever since I first read about Ken Thompson's compiler backdoor many years ago. In a nutshell, this famous backdoor does two things:

Whenever the backdoored C compiler compiles the "login" command, it adds a second code path that accepts a hard-coded default password in addition to the user's actual password
Whenever the backdoored C compiler compiles the unmodified source code of itself, it adds the backdoor to the resulting binary.

The end result is a compiler that looks fine at the source level, silently backdoors a critical system file at compilation time, and reproduces itself.

Recently, there has also been a lot of concern over backdoors in integrated circuits (either added at the source code level by a malicious employee, or at the netlist/layout level by a third-party fab). DARPA even has a program dedicated to figuring out ways of eliminating or detecting such backdoors. A 2010 paper stemming from the CSAW Embedded Systems Challenge presents a detailed taxonomy of such hardware Trojans.

As far as I can tell, the majority of research into hardware Trojans has been focused on detecting them, assuming the adversary has managed to backdoor the design in some way that provides him with a tactical or strategic advantage. I have had difficulty finding detailed threat modeling research quantifying the capability of the adversary under particular scenarios.

When we turn our attention to FPGAs, things quickly become even more interesting. There are several major differences between FPGAs and ASICs from a security perspective which may grant the adversary greater or lesser capability than with an ASIC.

Attacks at the IC fab

The function of an ASIC is largely defined at fab time (except for RAM-based firmware) while FPGAs are extremely flexible. When trying to backdoor FPGA silicon the adversary has no idea what product(s) the chip will eventually make it into. They don't even know which pins on the die will be used as inputs and which as outputs.

I suspect that this places substantial bounds on the capability of an attacker "Malfab" targeting FPGA silicon at the fab (or pre-fab RTL/layout) level since the actual RTL being targeted does not even exist yet. To start, we consider a generic FPGA without any hard IP blocks:

Malfab does not know which flipflops/SRAM/LUTs will eventually contain sensitive data, nor what format this data may take.
Malfab does not know which I/O pins may be connected to an external communications interface useful for command-and-control.

As a result, his only option is to create an extremely generic backdoor. At this level, the only thing that makes sense is to connect all I/O pins (perhaps through scan logic) to a central malware logic block which provides the ability to read (and possibly modify) all state in the device. This most likely would require two major subsystems:

A detector, which searches I/Os for a magic sync sequence
A connection from that detector to the FPGA's internal configuration access port (ICAP), used for partial reconfiguration and readback.

The design of this protocol would be very challenging since the adversary does not know anything about the external interfaces the pin may be connected to. The FPGA could be in a PLC or similar device whose only external contact is RS-232 serial on a single input pin. Perhaps it is in a network router/switch using RGMII (4-bit parallel with double data rate signalling).

I am not aware of any published work on the feasibility of such a backdoor however I am skeptical that a sufficiently generic Trojan could be made simple enough to evade even casual reverse engineering of the I/O circuitry, and fast enough to not seriously cripple performance of the device.

Unfortunately for our defender Alice, modern FPGAs often contain hard IP blocks such as SERDES and RAM controllers. These present a far more attractive target to Malfab as their functionality is largely known in advance.

It is not hard to imagine, for example, a malicious patch to the RAM controller block which searches each byte group for a magic sync sequence written to consecutive addresses, then executes commands from the next few bytes. As long as Malfab is able to cause the target's system to write data of his choice to consecutive RAM addresses (perhaps by sending it as the payload of an Ethernet frame, which is then buffered in RAM) he can execute arbitrary commands on the backdoor engine. If one of these commands is "write data from SLICE_X37Y42.A5FF to RAM address 0xdeadbeef", and Malfab can predict the location of a transmit buffer of some sort, he now has the ability to exfiltrate arbitrary state from Alice's hardware.

I thus conjecture that the only feasible way to backdoor a modern FPGA at fab time is through hard IP. If we ensure that the JTAG interface (the one hard IP block whose use cannot be avoided) is not connected to attacker-controlled interfaces, use off-die SERDES, and use softcore RAM controllers on non-standard pins, it is unlikely that Malfab will be able to meaningfully affect the security of the resulting circuit.

Attacks on the toolchain

We now turn our attention to a second adversary, Maldev - the malicious development tool. Maldev works for the FPGA vendor, has compromised the source repository for their toolchain, has MITMed the download of the toolchain installer, or has penetrated Alice's network and patched the software on her computer.

Since FPGAs are inherently closed systems (more so than ASICs, in which multiple competing toolchains exist), Alice has no choice but to use the FPGA vendor's binary-blob toolchain. Although it is possible in theory for a dedicated team with sufficient time and budget to reverse engineer the FPGA and/or toolchain and create a trusted open-source development suite, I discount the possibility for the scope of this section since a fully trusted toolchain is presumably free of interesting backdoors ;)

Maldev has many capabilities lacking to Malfab since he can execute arbitrary code on Alice's computer. Assuming that Alice is (justifiably) paranoid about the provenance of her FPGA software and runs it on a dedicated machine in a DMZ (so that it cannot infect the remainder of her network), this is equivalent to having full access to her RTL and netlist at all stages of design.

If Alice gives her development workstation Internet access, Maldev now has the ability to upload her RTL source and/or netlist, modify it at will on his computer, and then push patches back. This is trivially equivalent to a full defeat of the entire system.

Things become more interesting when we cut off command-and-control access. This is a realistic scenario if Alice is a military/defense user doing development on a classified network with no Internet connection.

The simplest attack is for Maldev to store a list of source file hashes and patches in the compromised binary. While this is very limited (custom-developed code cannot be attacked at all), many design teams are likely to use a small set of stock communications IP such as the Xilinx Tri-Mode Ethernet MAC, so patching these may be sufficient to provide him with an attack vector on the target system. Looking for AXI interconnect IP provides Maldev with a topology map of the target SoC.

Another option is graph-based analytics on the netlist at various stages of synthesis. For example, by looking for a 32-bit register initialized to 0x67452301, which is in a strongly connected component with three other registers initialized to 0xefcdab89, 0x98badcfe, and 0x10325476, Maldev can say with a high probability that he has found an implementation of MD5 and located the state registers. By looking for a 128-bit comparator between these values and another 128-bit value, a hash match check has been found (and a backdoor may be inserted). Similar techniques may be used to look for other cryptography.

Conclusions

If FPGA development is done using silicon purchased before the start of the project, on an air-gapped machine, and without using any pre-made IP, then some bounds can clearly be placed on the adversary's capability.

I have not seen any formal threat modeling studies on this subject, although I haven't spent a ton of time looking for them due to research obligations. If anyone is aware of published work in this field I'm extremely interested.

Updates and pending projects

2014-08-26T16:51:00.000-07:00

It's been a while since I've written anything here so here's a bit of a brain-dump on upcoming stuff that will find its way here eventually.

Thesis stuff

This has been eating the bulk of my time lately. I just submitted a paper to ACM Computing Surveys and am working on a conference paper for EDSC that's due in two weeks or so. With any luck the thesis itself will be finished by May and I can graduate.

Lab improvements

I'm in the process of fixing up my lab to solve a bunch of the annoying things that have been bugging me. Most/all of these will be expanded into a full post once it's closer to completion.

Racking the FPGA cluster
The "raised floor" FPGA cluster was a nice idea but the 2D structure doesn't scale. I've filled almost all of it and I really need the desk space for other things.

I ordered a 3U Eurocard subrack from Digikey and once it arrives will be making laser-cut plastic shims to load all of my small boards into it. The first card made for the subrack is already inbound: a 3U x 4HP 10-port USB hub to replace several of the 4-port hubs I'm using now. It will be hosted by my Beaglebone Black, which will function as a front-end node bridging the USB-UART and USB-JTAG ports out to Ethernet.

The AC701 board is huge (well over 3U on the shortest dimension) so I may end up moving it into one of the two empty 1U Sun "pizza box" server cases I have lying around. If this happens the Atlys boards may accompany it since they won't fit comfortably in 3U either.
Ethernet - JTAG card
FTDI-based JTAG is simple and easy but the chips are pricey and to run in a networked environment you need a host PC/server. I'm in the early stages of designing an XC6SLX45 based board with a gigabit Ethernet port, IPv6 TCP offload engine, and 16 buffered, level-shifted JTAG ports. It will speak the libjtaghal jtagd protocol directly, without needing a CPU or operating system, for ultra-low latency and near zero jitter.
Logo
I've gone long enough without having a nice logo to put on my boards, enclosures, etc. At some point I should come up with one...

Test equipment

I've gradually grown fed up with current test equipment. Why would I want to fiddle with knobs and squint at a tiny 320x240 LCD when I could view the signal on my 7040x1080 quad-screen setup or, better yet, the triple 4K displays I'm going to buy when prices come down a bit? Why waste bench space on dials and buttons when I could just minimize or close the control application when it's not in use? As someone who spends most of his time sitting in front of a computer I'd much prefer a "glass cockpit" lab with few physical buttons.

I'm now planning to make a suite of test equipment based on the Unix philosophy: do one thing and do it well. Each board will be a 3U Eurocard with a power input on the back and Ethernet + probe/signal connections on the front. They will implement the low-level signal capture/generation, buffering, and trigger logic but then leave all of the analysis and configuration UI to a PC-based application, connected over 1- or 10-gigabit Ethernet depending on the tool. Projects are listed in the approximate order that I plan to build them.

4-channel TDR for testing cat5e cable installs
This design will be based on the same general concept as a SAR ADC, with the sampling matrix transposed. Instead of gradually refining one sample before proceeding to the next, the entire waveform will be sampled once, then gradually refined over time.

Each channel of the TDR will consist of a high-speed 100-ohm differential output from a Spartan-6 FPGA to generate a pulse with very fast rise time, AC coupled into one pair of a standard RJ45 jack which will plug into the cable under test.

On the input stage, the differential signals will be subtracted by an opamp, then the single-ended differential voltage compared against a reference voltage produced by a DAC using a LMH7324SQ or similar ultra-fast comparator. The comparator will have LVDS outputs driving a differential input on the Spartan-6, which can sample DDR LVDS at up to 1 GHz. This will produce a single horizontal slice across a plot of impedance mismatch/reflection intensity vs time/distance.

By sending multiple pulses in sequence with successively increasing reference voltages from the DAC, it should be possible to reconstruct an entire TDR trace to at least 8 bits of precision for a fraction of the cost of even a single 1 GSa/s ADC.

Given the 5ns/m nominal propagation delay of cat5 cable (10us/m after round trip delay), the theoretical spatial resolution limit is 10cm although I expect noise and sampling issues to reduce usable positioning accuracy down to 20-50, and the TDR will need to be calibrated with a known length of cable from the same lot if exact propagation delays are needed to compute the precise location of a fault.
10-channel DC power supply

Offshoot of the PDU. Ten-channel buck converter stepping 24 VDC down to an adjustable output voltage, operating frequency around 1.5 MHz. Digital feedback loop with support for soft-start, state machine based current limiting and overcurrent shutdown, etc.

More details TBD once I have time to flesh out the concept a bit.
Gigabit Ethernet protocol analyzer
Spartan-6 connected to three 1gbaseT PHYs. Packets coming in port A are sent out port B unchanged, and vice versa. All traffic going either way is buffered in some kind of RAM, then encapsulated inside a TCP stream and sent out port C to an analysis computer which can record stats, write a pcap, etc.

The capture will be raw layer-1 and include the preamble, FCS, metadata describing link state changes and autonegotiation status, and cycle-accurate timestamps. Error injection may be implemented eventually if needed.
128-channel logic analyzer
This will be based on RED TIN, my existing FPGA-based ILA, but with more features and an external 4GB DDR3 SODIMM for buffering packet data. A 64-bit data bus at 1066 MT/s should be more than capable of pushing 32 channels at 1 GHz, 64 at 500 MHz, or 128 at 250 MHz. The input standards planned to be supported are LVCMOS from 1.5 to 3.3V, LVDS, SSTL, and possibly 5V LVTTL if the input buffer has sufficient range. I haven't looked into CML yet but may add this as well.

The FPGA board will connect to the host PC via a 10gbit Ethernet link using SFP+ direct attach cabling. Dumping 4GB (32 Gb) of data over 10gbe should take somewhere around 4 seconds after protocol overhead, or less if the capture depth is set to less than the maximum.

The FPGA board will connect via matched-impedance 100-ohm parallel cables (perhaps something like DigiKey 670-2626-ND)) to eight active probe cards. Each probe card will have a MICTOR or similar connector to the DUT providing numerous grounds, optional SSTL Vref, 16 digital inputs, and two clock/strobe inputs with optional complement inputs for differential operation. An internal DAC will allow generation of a threshold voltage for single-ended LVCMOS inputs.

The probe card input stage will consist of the following for each channel:

Unity-gain buffer to reduce capacitive load on the DUT
Low-speed precision analog mux to select external Vref (for SSTL) or internal Vref (for LVCMOS). This threshold voltage may be shared across several/all channels in the probe card, TBD.
High-speed LVDS-output comparator to compare single-ended inputs against the muxed Vref.
2:1 LVDS mux for selecting single-ended or differential inputs. Input A is the LVDS output from the comparator, input B is the buffered differential input from this and the adjacent channel. To reduce bit-to-bit skew all channels will have this mux even though it's redundant for odd-numbered channels.

LA input stage for two single-ended or one differential channel

4-channel DSO
This will use the same FPGA + DDR3 + 10gbe back end as the LA, but with the digital input stage replaced by an AFE and two of TI's 1.5 GSa/s dual ADCs with interleaving support.

This will give me either two channels at 3 GSa/s with a target bandwidth of 500 MHz, or four channels at 1.5 GSa/s with a target bandwidth of 250 MHz. The resulting raw data rate will be 3 GSa/s * 8 bits * 2 channels or 48 Gbps, and should comfortably fit within the capacity of a 64-bit DDR3 1066 interface.

I have no more details at this point as my mixed-signal-fu is not yet to the point that I can design a suitable AFE. This will be the last project on the list to be done due to both the cost of components and the difficulty.

Getting my feet wet with invasive attacks, part 2: The attack

2014-03-31T14:45:00.001-07:00

This is part 2 of a 2-part series. Part 1, Target Recon, is here.

Once I knew what all of the wires in the ZIA did, the next step was to plan an attack to read signals out.

I decapped an XC2C32A with concentrated sulfuric acid and soldered it to my dev board to verify that it was alive and kicking.

Simple CR-II dev board with integrated FTDI USB-JTAG

After testing I desoldered the sample and brought it up to campus to introduce it to some 30 keV Ga+ ions.

I figured that all of the exposed packaging would charge, so I'd need to coat the sample with something. I normally used sputtered Pt but this is almost impossible to remove after deposition so I decided to try evaporated carbon, which can be removed nicely with oxygen plasma among other things.

I suited up for the cleanroom and met David Frey, their resident SEM/FIB expert, in front of the Zeiss 1540 FIB system. He's a former Zeiss engineer who's very protective of his "baby" and since I had never used a FIB before there was no way he was going to let me touch his, so he did all of the work while I watched. (I don't really blame him... FIB chambers are pretty cramped and it's easy to cause expensive damage by smashing into something or other. Several SEMs I've used have had one detector or another go offline for repair after a more careless user broke something.)

The first step was to mill a hole through the 900 nm or so of silicon nitride overglass using the ion beam.

Newly added via, not yet filled

Once the via was drilled and it appeared we had made contact with the signal trace, it was time to backfill with platinum. The video below is sped up 10x to avoid boring my readers ;)

Metal deposition in a FIB is basically CVD: a precursor gas is injected into the chamber near the sample and it decomposes under the influence of beam-generated secondary electrons.

Once the via was filled we put down a large (20 μm square) square pad we could hit with an electrical probe needle.

Probe pad

Once everything was done and the chamber was vented I removed the carbon coating with oxygen plasma (the cleanroom's standard photoresist removal process), packaged up my sample, went home, and soldered it back to the board for testing. After powering it up... nothing! The device was as dead as a doornail, I couldn't even get a JTAG IDCODE from it.

I repeated the experiment a week or two later, this time soldering bare stub wires to the pins so I could test by plugging the chip into a breadboard directly. This failed as well, but watching my benchtop power supply gave me a critical piece of information: while VCCINT was consuming the expected power (essentially zero), VCCIO was leaking by upwards of 20 mA.

This ruled out beam-induced damage as I had not been hitting any of the I/O circuitry with the ion beam. Assuming that the carbon evaporation process was safe (it's used all the time on fragile samples, so this seemed a reasonably safe assumption for the time being), this left only the plasma clean as the potential failure point.

I realized what was going on almost instantly: the antenna effect. The bond wire and leadframe connected to each pad in the device was acting as an antenna and coupling some of the 13.56 MHz RF energy from the plasma into the input buffers, blowing out the ESD diodes and input transistors, and leaving me with a dead chip.

This left me with two possible ways to proceed: removing the coating by chemical means (a strong oxidizer could work), or not coating at all. I decided to try the latter since there were less steps to go wrong.

Somewhat surprisingly, the cleanroom staff had very limited experience working with circuit edits - almost all of their FIB work was process metrology and failure analysis rather than rework, so they usually coated the samples.

I decided to get trained on RPI's other FIB, the brand-new FEI Versa 3D. It's operated by the materials science staff, who are a bit less of the "helicopter parent" type and were actually willing to give me hands-on training.

FEI Versa 3D SEM/FIB

The Versa can do almost everything the older 1540 can do, in some cases better. Its one limitation is that it only has a single-channel gas injection system (platinum) while the 1540 is plumbed for platinum, tungsten, SiO2, and two gas-assisted etches.

After a training session I was ready to go in for an actual circuit edit.

FIB control panel

The Versa is the most modern piece of equipment I've used to date: it doesn't even have the classical joystick for moving the stage around. Almost everything is controlled by the mouse, although a USB-based knob panel for adjusting magnification, focus, and stigmators is still provided for those who prefer to turn something with their fingers.

Its other nice feature is the quad-image view which lets you simultaneously view an ion beam image, an e-beam image, the IR camera inside the chamber (very helpful for not crashing your sample into a $10,000 objective lens!), and a navigation camera which displays a top-down optical view of your sample.

The nav-cam has saved me a ton of time. On RPI's older JSM-6335 FESEM, the minimum magnification is fairly high so I find myself spending several minutes moving my sample around the chamber half-blind trying to get it under the beam. With the Versa's nav-cam I'm able to set up things right the first time.

I brought up both of the beams on the aluminum sample mounting stub, then blanked them to try a new idea: Move around the sample blind, using the nav-cam only, then take single images in freeze-frame mode with one beam or the other. By reducing the total energy delivered to the sample I hoped to minimize charging.

This strategy was a complete success, I had some (not too severe) charging from the e-beam but almost no visible charging in the I-beam.

The first sample I ran on the Versa was electrically functional afterwards, but the probe pad I deposited was too thin to make reliable contact with. (It was also an XC2C64A since I had run out of 32s). Although not a complete success, it did show that I had a working process for circuit edits.

After another batch of XC2C32As arrived, I went up to campus for another run. The signal of interest was FB2_5_FF: the flipflop for function block 2 macrocell 5. I chose this particular signal because it was the leftmost line in the second group from the left and thus easy to recognize without having to count lines in a bus.

The drilling went flawlessly, although it was a little tricky to tell whether I had gone all the way to the target wire or not in the SE view. Maybe I should start using the backscatter detector for this?

Via after drilling before backfill

I filled in the via and made sure to put down a big pile of Pt on the probe pad so as to not repeat my last mistake.

The final probe pad, SEM image

Seen optically, the new pad was a shiny white with surface topography and a few package fragments visible through it.

Probe pad at low mag, optical image

At higher magnification a few slightly damaged CMP filler dots can be seen above the pad. I like to use filler metal for focusing and stigmating the ion beam at milling currents before I move to the region of interest because it's made of the same material as my target, it's something I can safely destroy, and it's everywhere - it's hard to travel a significant distance on a modern IC without bumping into at least a few pieces of filler metal.

Probe pad at higher magnification, optical image. Note damaged CMP filler above pad.

I soldered the CPLD back onto the board and was relieved to find out that it still worked! The next step was to write some dummy code to test it out:

`timescale 1ns / 1ps
module test(clk_2048khz, led);

 //Clock input
 (* LOC = "P1" *) (* IOSTANDARD = "LVCMOS33" *)
 input wire clk_2048khz;
 
 //LED out
 (* LOC = "P38" *) (* IOSTANDARD = "LVCMOS33" *)
 output reg led = 0;
 
 //Don't care where this is placed
 reg[17:0] count = 0;
 always @(posedge clk_2048khz)
  count <= count + 1;
  
 //Probe-able signal on FB2_5 FF at 2x the LED blink rate
 (* LOC = "FB2_5" *) reg toggle_pending = 0;
 always @(posedge clk_2048khz) begin
  if(count == 0)
   toggle_pending <= !toggle_pending;
 end
 
 //Blink the LED
 always @(posedge clk_2048khz) begin
  if(toggle_pending && (count == 0))
   led <= !led;
 end

endmodule

This is a 20-bit counter that blinks a LED at ~2 Hz from a 2048 KHz clock on the board. The second-to-last stage of the counter (so ~4 Hz) is constrained to FB2_5, the signal we're probing.

After making sure things still worked I attached the board's plastic standoffs to a 4" scrap silicon wafer with Gorilla Glue to give me a nice solid surface I could put on the prober's vacuum chuck.

Test board on 4" wafer

Earlier today I went back to the cleanroom. After dealing with a few annoyances (for example, the prober with a wide range of Z axis travel, necessary for this test, was plugged into the electrical test station with curve tracing capability but no oscilloscope card) I landed a probe on the bond pad for VCCIO and one on ground to sanity check things. 3.3V... looks good.

Moving carefully, I lifted the probe up from the 3.3V bond pad and landed it on my newly added probe pad.

Landing a probe on my pad. Note speck of dirt and bent tip left by previous user. Maybe he poked himself mounting the probe?

It took a little bit of tinkering with the test unit to figure out where all of the trigger settings were, but I finally saw a ~1.8V, 4 Hz squarewave. Success!

Waveform sniffed from my probe pad

There's still a bit of tweaking needed before I can demo it to my students (among other things, the oscilloscope subsystem on the tester insists on trying to use the 100V input range, so I only have a few bits of ADC precision left to read my 1.8V waveform) but overall the attack was a success.

Getting my feet wet with invasive attacks, part 1: Target recon

2014-03-31T13:35:00.001-07:00

This is part 1 of a 2-part series. Part 2, The Attack, is here.

One of the reasons I've gone a bit dark lately is that running CSCI 6974, RPI's experimental hardware reverse engineering class, has been eating up a lot of my time.

I wanted to make the final lab for the course a nice climax to the semester and do something that would show off the kinds of things that are possible if you have the right gear, so it had to be impressive and technically challenging. The obvious choice was a FIB circuit edit combined with invasive microprobing.

After slaving away for quite a while (this was started back in January or so) I've managed to get something ready to show off :) The work described here will be demonstrated in front of my students next week as part of the fourth lab for the class.

The first step was to pick a target. I was interested in the Xilinx XC2C32A for several reasons and was already using other parts of the chip as a teaching subject for the class. It's a pure-digital CMOS CPLD (no analog sense amps and a fairly regular structure) made on a relatively modern process (180 nm 4-metal UMC) but not so modern as to be insanely hard to work with. It was also quite cheap ($1.25 a pop for the slowest speed grade in VQG44 package on DigiKey) so I could afford to kill plenty of them during testing

The next step was to decap a few, label interesting pins, and draw up a die floorplan. Here's a view of the die at the implant layer after Dash etch; P-type doping shows up as brown. (John did all of the staining work and got great results. Thanks!)

XC2C32A die floorplan after Dash etch

The bottom half of the die is support infrastructure with EEPROM banks for storing the configuration bitstream toward the center and JTAG/configuration stuff in a U-shape below and to either side of the memory array. (The EEPROM is mislabeled "flash" in this image because I originally assumed it was 1T NOR flash. Higher magnification imaging later showed this to be wrong; the bit cells are 2T EEPROM.)

The top half of the die is the actual programmable logic, laid out in a "butterfly" structure. The center spine is the ZIA (global routing, also referred to as the AIM in some datasheets), which takes signals from the 32 macrocell flipflops and 33 GPIO pins and routes them into the function blocks. To either side of the spine are the two FBs, which consist of an 80 x 56 AND array (simplifying a bit... the actual structure is more like 2 blocks x 20 rows x 2 interleaved cells x 56 columns), a 56 x 16 OR array, and 16 macrocells.

I wanted some interesting data to show my students so there were two obvious choices. First, I could try to defeat the code protection somehow and read bitstreams out of a locked device via JTAG. Second, I could try to read internal device state at run time. The second seemed a bit easier so I decided to run with it (although defeating the lock bits is still on my longer-term TODO.)

The obvious target for probing internal runtime state is the ZIA, since all GPIO inputs and flipflop states have to go through here. Unfortunately, it's almost completely undocumented! Here's the sum total of what DS090 has to say about it (pages 5-6):

The Advanced Interconnect Matrix is a highly connected low power rapid switch. The AIM is directed by the software to deliver up to a set of 40 signals to each FB for the creation of logic. Results from all FB macrocells, as well as, all pin inputs circulate back through the AIM for additional connection available to all other FBs as dictated by the design software. The AIM minimizes both propagation delay and power as it makes attachments to the various FBs.

Thanks for the tidbit, Xilinx, but this really isn't gonna cut it. I need more info!

The basic ZIA structure was pretty obvious from inspection of the implant layer: 20 identical copies of the same logic. This suggested that each row was responsible for feeding two signals left and two right.

SEM imaging of the implant layer showed the basic structure to be largely, but not entirely, symmetric about the left-right axis. At the far outside a few cells of the PLA AND array can be seen. Moving toward the center is what appears to be a 3-stage buffer, presumably for driving the row's output into the PLA. The actual routing logic is at center.

The row appeared entirely symmetric top-to-bottom so I focused my future analysis on the upper half.

Single row of the ZIA seen at the implant layer after Dash etch. Light gray is P-type doping, medium gray is N-type doping, dark gray is STI trenches.

Looking at the top metal layer revealed the expected 65 signals.

Single row of the ZIA seen on metal 4

The signals were grouped into six groups with 11, 11, 11, 11, 11, and 10 signals in them. This led me to suspect that there was some kind of six-fold structure to the underlying circuitry, a suspicion which was later proven correct.

Inspection of the configuration EEPROM for the ZIA showed it to be 16 bits wide by 48 rows high.

ZIA configuration EEPROM (top few rows)

Since the global configuration area in the middle of the chip was 8 rows high this suggested that each of the 40 remaining EEPROM rows configured the top or bottom half of a ZIA row.

Of the 16 bits in each row, 8 bits presumably controlled the left-hand output and 8 controlled the right. This didn't make a lot of sense at first: dense binary coding would require only 7 bits for 65 channels and one-hot coding would need 65 bits.

Reading documentation for related device families sometimes helps to shed some light on how a part was designed, so I took a look at some of the whitepapers for the older 350 nm CoolRunner XPLA3 series. They went into some detail on how full crossbar routing was wasteful of chip area and often not necessary to get sufficient routability. You don't need to be able to generate every 40! permutations of a given subset of signals as long as you can route every signal somehow. Instead, the XPLA3's designers connected only a handful of the inputs to each row and varied the input selection for each row so as to allow almost every possible subset to be selected somehow.

This suggested a 2-level hierarchy to the ZIA mux. Instead of being a 65:1 mux it was a 65:N hard-wired mux followed by a N:1 programmable mux feeding left and another N:1 feeding right. 6 seemed to be a reasonable guess for N, given the six groups of wires on metal 4.

ZIA mux structure

This hypothesis was quickly confirmed by looking at M3 and M3-M4 vias: Each row had six short wires on M3, one under each of the six groups of wires in the bus. Each of these short lines was connected by one via to one of the bus lines on M4. The via pattern varied from row to row as expected.

ZIA M3-M4 vias

I extracted the full via pattern by copying a tracing of M4 over the M3 image and using the power vias running down the left side as registration marks. (Pro tip: Using a high accelerating voltage, like 20 kV, in a SEM gives great results on aluminum processes with tungsten via plugs. You get backscatters from vias through the metal layer that you can use for aligning image stacks.) A few of the rows are shown above.

At this point I felt I understood most of the structure so the next step was full circuit extraction! I had John CMP a die down to each layer and send to me for high-res imaging in the SEM.

The output buffers were fairly easy. As I expected they were just a 3-stage inverter cascade.

Output buffer poly/diffusion/contact tracing

Output buffer M1 tracing

Output buffer gate-level schematic

Individual cell schematics

Nothing interesting was present on any of the upper layers above here, just power distribution.

The one surprising thing about the output buffer was that the NMOS on the third stage had a substantially wider channel than the PMOS. This is probably something to do with optimizing output rise/fall times.

Looking at the actual mux logic showed that it was mostly tiles of the same basic pattern (a 6T SRAM cell, a 2-input NOR gate, and a large multi-fingered NMOS pass transistor) except for the far left side.

Gate-level layout of mux area

Left side of mux area, gate-level layout

The same SRAM-feeding-NOR2 structure is seen, but this time the output is a small NMOS or PMOS pass transistor.

After tracing M1, it became obvious what was going on.

Left side of mux area, M1

The upper and lower halves control the outputs to function blocks 1 and 2 respectively. The two SRAM bits allow each output (labeled MUXOUT_FBx) to be pulled high, low, or float. A global reset line of some sort, labeled OGATE, is used to gate all logic in the entire ZIA (and presumably the rest of the chip); when OGATE is high the SRAM bits are ignored and the output is forced high.

Here's what it looks like in schematic:

Gate-level schematics of pullup/pulldown logic

Cell schematics

In the schematics I drew the NOR2v0x1 cell as its de Morgan dual (AND with inverted inputs) since this seemed to make more sense in the context of the circuit: the output is turned on when the active-low input is low and OGATE is turned off.

It's interesting to note that while almost all of the config bits in the circuit are active-low, PULLUP is active-high. This is presumably done to allow the all-ones state (a blank EEPROM array) to put the muxes in a well-defined state rather than floating.

Turning our attention to the rest of the mux array shows a 6:1 one-hot-coded mux made from NMOS pass transistors. This, combined with the 2 bits needed for the pull-high/pull-low module, adds up to the expected 8. The same basic pattern shown below is tiled three times.

Basic mux tile, poly/implant

Basic mux tile, M1

(Sorry for the misalignment of the contact layer, this was a quick tracing and as long as I was able to make sense of the circuit I didn't bother polishing it up to look pretty!)

The resulting schematic:

Schematic of muxes

M2 was used for some short-distance routing as well as OGATE, power/ground busing, and the SRAM bit lines.

M2 and M2-M3 vias

M3 was used for OGATE, power busing, SRAM word lines, the mask-programmed muxes, and the tri-state bus within the final mux.

M3 and M3-M4 vias

And finally, M4. I never found out what the leftmost power line went to, it didn't appear to be VCCINT or ground but was obviously power distribution. There's no reason for VCCIO to be running down the middle of the array so maybe VCCAUX? Reversing the global config logic may provide the answer.

A bit of trial and error poking bits in bitstreams was sufficient to determine the ordering of signals. From right to left we have FB1's GPIO pins, the input-only pin, FB2's GPIO pins, then FB1's flipflops and finally FB2's flipflops.

Now that I had good intel on the target, it was time to plan the strike!

Part 2, The Attack, is here.

Microchip PIC32MZ process vs PIC32MX

2014-03-24T15:06:00.002-07:00

Those of you keeping an eye on the MIPS microcontroller world have probably heard of Microchip's PIC32 series parts: MIPS32 CPU cores licensed from MIPS Technologies (bought by Imagination Technologies recently) paired with peripherals designed in-house by Microchip.
Although they're sold under the PIC brand name they have very little in common with the 8/16 bit PIC MCUs. They're fully pipelined processors with quite a bit of horsepower.

The PIC32MX family was the first to be introduced, back in 2009 or so. They're a MIPS M4K core at up to 80 MHz and max out at 128 KB of SRAM and 512 KB of NOR flash plus a fairly standard set of peripherals.

PIC32MX microcontroller

Somewhat disappointingly, the PIC32MX MMU is fixed mapping and there is no external bus interface. Although there is support for user/kernel privilege separation, all userspace code shares one address space. Another minor annoyance is that all PIC32MX parts run from a fixed 1.8V on-die LDO which normally cannot (the 300 series is an exception) be disabled or bypassed to run from an external supply.

The PIC32MZ series is just coming out now. They're so new, in fact that they show as "future product" on Microchip's website and you can only buy them on dev boards, although I'm told by around Q3-Q4 of this year they'll be reaching distributors. They fix a lot of the complaints I have with PIC32MX and add a hefty dose of speed: 200 MHz max CPU clock and an on-die L1 cache.

PIC32MZ microcontroller

On-chip memory in the PIC32MZ is increased to up to 512 KB of SRAM and a whopping 2 MB of flash in the largest part. The new CPU core has a fully programmable MMU and support for an external bus interface capable of addressing up to 16MB of off-chip address space.

I'm a hacker at heart, not just a developer, so I knew the minute I got one of these things I'd have to tear it down and see what made it tick. I looked around for a bit, found a $25 processor module on Digikey, and picked it up.

The board was pretty spartan, which was fine by me as I only wanted the chip.

PIC32MZ processor module

Less than an hour after the package had arrived, I had the chip desoldered and simmering away in a beaker of sulfuric acid. I had done a PIC32MX340F512H a few days previously to provide comparison shots.

Without further ado, here's the top metal shots:

PIC32MX340F512H

PIC32MZ2048ECH

These photos aren't to scale, the MZ is huge (about 31.9 mm²). By comparison the MX is around 20.

From an initial impression, we can see that although both run at the same core voltage (1.8V) the MZ is definitely a new, significantly smaller fab process. While the top layer of the MX is fine-pitch signal routing, the top layer of the MZ is (except in a few blocks which appear to contain analog circuitry) completely filled with power distribution routing.

Top layer closeups of MZ (left), MX (right), same scale

Thick power distribution wiring on the top layer is a hallmark of deep-submicron processes, 130 nm and below. Most 180 nm or larger devices have at least some signal routing on the top layer.

Looking at the mask revision markings gives a good hint as to the layer count and stack-up.

Mask rev markings on MZ (left), MX (right), same scale

The MZ appears to be one thick aluminum layer and five thin copper layers for a total of six, while the MX is four layers and probably all aluminum.

Enough with the top layer... time to get down! Both samples were etched with HF until all metal and poly was removed.

The first area of interest was the flash.

NOR flash on MZ (left), MX (right), different scales

Both arrays appear to be the same standard NOR structure, although the MZ's array is quite a bit denser: the bit cell pitch is 643 x 270 nm (0.173 μm²/bit) while the MX's is 1015 x 676 nm (0.686 μm²/bit). The 3.96x density increase suggests a roughly 2x process shrink.

The white cylinders littering the MX die are via plugs, most likely tungsten, left over after the HF etch. The MZ appears to use a copper damascene process without via plugs, although since no cross section was performed details of layer thicknesses etc are unavailable.

The next target was the SRAM.

6T SRAM on MZ (left), MX (right), different scales

Here we start to see significant differences. The MX uses a fairly textbook 6T "doughnut + H" SRAM structure while the MZ uses a more modern lithography-optimized pattern made of all straight lines with no angles, which is easier to etch. This kind of bit cell is common in leading-edge processes but this is the first time I've seen it in a commodity MCU.

Cell pitch for the MZ is 1345 x 747 nm (1.00 μm²/bit) while the MX is 1895 x 2550 nm (4.83 μm²/bit). This is a 4.83x increase in density.

The last area of interest was the standard cell array for the CPU.

Closeup of standard cells on MZ (left), MX (right), different scales

Channel length was measured at 125-130 nm for the MZ and 250-260 nm for the MX.

Both devices also had a significant number of dummy cells in the gate array, suggesting that the designs were routing-constrained.

Dummy cells in MZ

Dummy cells in MX

In conclusion, the PIC32MZ is a significantly more powerful 130 nm upgrade to the slower 250 nm PIC32MX family. If Microchip fixes most of the silicon bugs before they launch I'll definitely pick up a few and build some stuff with them.

I wasn't able to positively identify the fab either device was made on however the fill patterns and power distribution structure on the MZ are very similar of the TI AM1707 which is fabricated by TSMC so they're my first guess.

For more info and die pics check out the SiliconPr0n pages for the two chips:

Process overview: UMC 180nm eNVM

2014-02-04T00:23:00.000-08:00

I've been reverse engineering a programmable logic device (Xilinx XC2C32A) made on UMC's 180nm eNVM process for the last few months and have been a little light on blog posts. I'm a big fan of the process writeups Chipworks does so I figured I'd try my hand at one ;)

The target devices were packaged in a 32-pin QFN. The first part of the analysis was to sanding the entire package down to the middle of the device and polishing with diamond paste to get a quick overview of the die and packaging stack. (Shops with big budgets normally use X-ray systems for this.) There were a few scratches in the section from sanding, but since the closeups were going to be done on another die it wasn't necessary to polish them out.

Packaged device cross section

Total die thickness including BEOL was just over 300 μm. From the optical image, four layers of metal can be seen. The whitish color hinted that it was aluminum rather than copper, but there's no sense jumping to conclusions since the next die was about to hit the SEM.

A second specimen was depackaged using sulfuric acid, sputtered in platinum to reduce charging, and sectioned using a gallium ion FIB at a slight angle to the east-west routing.

FIB cross section of metal stack

From this image, it is easy to get some initial impressions of the process:

The overglass consists of two layers of slightly different compositions, possibly an oxide-nitride stack.
The process is planarized up to metal 4, but not including overglass.
Metal has an adhesion/barrier layer at the top and bottom and not the sides, and is wider at the bottom than the top. This rules out damascene patterning and suggests that the metal layers are dry-etched aluminum.
Silicide layers are visible above the polysilicon gates and at the source/drain implants.
Vias have a much higher atomic number than the metal layers, probably tungsten.
Stacked vias are allowed and used frequently.
Well isolation is STI.

EDS spectra confirmed all these initial impressions to be correct.

M1 to M3 have pretty much identical stackups except for slight thickness differences: 100nm of Ti-based adhesion/barrier layer, 400-550 nm of aluminum conductor, then another barrier layer of similar composition. M4 is slightly thicker (850 nm aluminum) and the same barrier thickness.

The first overglass layer is the same material (silicon dioxide) as ILD; thickness ranges from slightly below the top of M4 to to 630 nm above the top. The second overglass layer has a slightly higher backscatter yield (EDIT: confirmed by EDS to be silicon nitride) and is about 945 nm thick.

M1-3 pitch is just over 600 nm, while the smallest observed M4 pitch is 1 μm.

EDS spectrum of wire on M4

A closer view of M3 shows the barrier metals in more detail. The barrier is a bit over 100 nm thick normally but thins to about 45 nm underneath upward-facing vias, suggesting that the ILD etch for drilling via holes also damages the barrier material. A small amount (around 30 nm) of sagging is present over the top of downward-facing vias.

Via sidewalls are coated with barrier metal as well, however it is significantly thinner (20 nm vs 100) than the metal layer barrier. The vias themselves are polycrystalline tungsten. Grain structure is clearly visible in the secondary electron image below.

(Note: The structure at left of the image is the edge of the FIB trench and stray material deposited by the ion beam and is not part of the actual device. The lower via is at a slight angle to the section so it was not entirely sliced open.)

M3 with upward/downward vias in cross section.

EDS spectrum of M1-M2 via area

The metal aspect ratio ranges from 3:1 on M1 to 1.5:1 on M4.

Now for the most interesting area - the transistors themselves!

The cross section was taken down the center of the CPLD's PLA OR array between two rows of 6T SRAM cells. Two PMOS transistors from each of two SRAM cells are visible in the closeup below.

Contacted gate pitch is 920 nm, for total cell width (including the 1180 nm of STI trench) of 2.9 μm. Plan view imaging shows total cell dimensions to be 2.9 x 3.3 μm or 9.5 μm². This is a bit large for the 180 nm node and probably reflects the need to leave space on M1 and M2 for routing SRAM cell output to the programmable logic array.

SRAM cell structure and PLA AND array after metal/poly removal and Dash etch. (P-type implants are raised due to oxide growth from stain.)

Some variability in etch depth and sidewall slope is clearly visible on M1.

The polysilicon layer was hard to see in this view but is probably around 50 nm thick, topped by about 135 nm of cobalt silicide. (Gate oxide thickness isn't visible under SEM at the 180 nm node and I haven't yet had time to prepare a TEM sample.)

Source/drain contacts are made with a 70 nm thick cobalt silicide layer. All vias in the entire process appear to be about the same size (300 nm diameter) however the silicide contact pads are larger (465 nm).

Gate length is almost exactly 180 nm - measurement of the SEM image shows 175 nm +/- 12 nm.

Active area contacts and PMOS transistors

EDS spectrum of active-M1 contact

Closeup of PLA AND array after Dash etch showing PMOS and NMOS channels

Overall, the process seems fairly typical except for its use of aluminum for interconnect. It was a fun analysis and if I have time I may try to do a TEM cross section of both PMOS and NMOS transistors in the future. My main interest in the chip is netlist extraction, though, so this isn't a high priority.

I may also do a second post on the Flash portion of the chip.

EDIT: Decided to post a plan view SEM image of the flash array active area. This is after Dash etch; P-type areas have oxide grown over them. Poly has been stripped. The left-hand flash area is ten bits wide and stores configuration for function block 2's macrocells plus a "valid" bit. The right-hand area stores configuration for FB2's PLA (including both the AND and OR arrays, but not global routing).

Plan view SEM of flash

Finally, I would like to give special thanks to David Frey at the RPI cleanroom for assistance with the FIB cross section.

Hardware reverse engineering class

2014-01-23T15:16:00.002-08:00

So, it's been a while since I've posted anything and I figured I'd throw up a quick update.

I've been super busy over the winter break working on my thesis, as well as something new: My advisor and I are running a brand-new, experimental course, CSCI 4974/6974 Hardware Reverse Engineering, at Rensselaer Polytechnic Institute (RPI) this spring!

I gave the first lecture for the class last Tuesday and it was very well received. We have a full class - 12 undergraduates and 4 graduates as of this writing. As the TA for the class I'm responsible for (among other things) running labs and preparing samples. I've been running all over campus getting trained on various pieces of equipment, booking lab time for the class, and generally making sure this is going to be an awesome semester for all involved.

Lecture notes are available online on the course website for anyone who wishes to follow along.

Finally, a few peeks at my microprobing setup. I think I need new micropositioners, the backlash on these is pretty terrible. Whenever I adjust the left-right axis, the probe needle rotates by a degree or two.

Bug hunting

2013-11-29T14:18:00.000-08:00

This is the story of the hunt for a bug that I've been chasing, on and off, for the last month.

After my last post on the PDU, I began doing more exhaustive testing. I left Munin polling all of the stats every 5 minutes and kept the GUI open for maybe half an hour, turning channels on and off, and everything seemed fine...

Then I went away to do something else, came back to my desk, and saw that the GUI had frozen. The board wasn't responding to ping (or any network activity at all), and I had no idea why.

Other than confirming that the interconnect fabric was still working (by resolving the addresses of a few cores by the name server) there wasn't a ton I could do without adding some diagnostic code.

I then resynthesized the FPGA netlist with the gdb bridge enabled on the CPU. (The FPGA is packed pretty full; I don't leave the bridge in all the time because it substantially increases place-and-route runtime and sometimes requires decreasing the maximum clock rate). After waiting around half an hour for the build to complete I reloaded the FPGA with the new code, fired up the GUI, and went off to tidy up the lab.

After checking in a couple of times and not seeing a hang, I finally got it to crash a couple hours later. A quick inspection in gdb suggested that the CPU was executing instruction normally, had not segfaulted, and there was no sign of trouble. In each case the program counter was somewhere in RecvRPCMessage(), as would be expected when the message loop was otherwise idle. So what was the problem?

The next step was to remove the gdb bridge and insert a logic analyzer core. (As mentioned above the FPGA is filled to capacity and it's not possible to use both at the same time without removing application logic.)

After another multi-hour build-and-wait-for-hang cycle, I managed to figure out that the CPU was popping the inbound-message FIFO correctly and seemed to be still executing instructions. None of the error flags were set.

I thought for a while and decided to check the free-memory-page counter in the allocator. A few hours later, I saw that the free-page count was zero... a telltale sign of a memory leak.

I wasted untold hours and many rebuild cycles trying to find the source of the leak before sniffing the RPC link between the CPU and the network. As soon as I saw packets arriving and not being sent, I knew that the leak wasn't the problem. It was just another symptom. The CPU was getting stuck somewhere and never processing new Ethernet frames; as soon as enough frames had arrived to fill all memory then all processing halted.

Unfortunately, at this point I still had no idea what was causing the bug. I could reliably trigger the logic analyzer after the bug had happened and the CPU was busy-waiting (by triggering when free_page_count hit 1 or 0) but had no way to tell what led up to it.

RPC packet captures taken after the fault condition showed that the new-frame messages from the Ethernet MAC were arriving to the CPU just fine. The CPU could be clearly seen popping them from the hardware FIFO and storing them in memory immediately.

Eventually, I figured out just what looked funny about the RPC captures: the CPU was receiving RPC messages, issuing memory reads and writes, but never sending any RPC messages whatsoever. This started to give me a hint as to what was happening.

I took a closer look at the execution traces and found that the CPU was sitting in a RecvRPCMessage() call until a message showed up, then PushInterrupt()ing the message and returning to the start of the loop.

/**
 @brief Performs a function call through the RPC network.
 
 @param addr  Address of target node
 @param callnum The RPC function to call
 @param d0  First argument (only low 21 bits valid)
 @param d1  Second argument
 @param d2  Third argument
 @param retval Return value of the function
 
 @return zero on success, -1 on failure
 */
int __attribute__ ((section (".romlibs"))) RPCFunctionCall(
 unsigned int addr, 
 unsigned int callnum,
 unsigned int d0,
 unsigned int d1,
 unsigned int d2,
 RPCMessage_t* retval)
{
 //Send the query
 RPCMessage_t msg;
 msg.from = 0;
 msg.to = addr;
 msg.type = RPC_TYPE_CALL;
 msg.callnum = callnum;
 msg.data[0] = d0;
 msg.data[1] = d1;
 msg.data[2] = d2;
 SendRPCMessage(&msg);
 
 //Wait for a response
 while(1)
 {
  //Get the message
  RecvRPCMessage(retval);
  
  //Ignore anything not from the host of interest; save for future processing
  if(retval->from != addr)
  {
   //TODO: Support saving function calls / returns
   //TODO: Support out-of-order function call/return structures
   if(retval->type == RPC_TYPE_INTERRUPT)
    PushInterrupt(retval);
   continue;
  }
   
  //See what it is
  switch(retval->type)
  {
   //Send it again
   case RPC_TYPE_RETURN_RETRY:
    SendRPCMessage(&msg);
    break;
    
   //Fail
   case RPC_TYPE_RETURN_FAIL:
    return -1;
    
   //Success, we're done
   case RPC_TYPE_RETURN_SUCCESS:
    return 0;
    
   //We're not ready for interrupts, save them
   case RPC_TYPE_INTERRUPT:
    PushInterrupt(retval);
    break;
    
   default:
    break;
  }

 }
}

I spent most of a day repeatedly running the board until it hung to collect a sampling of different failures. A pattern started to emerge: addr was always 0x8003, the peripheral controller. This module contains a couple of peripherals that weren't big enough to justify the overhead of a full RPC router port on their own:

One ten-signal bidirectional GPIO port (debug/status LEDs plus a few reserved for future expansion)
One 32-bit timer with interrupt on overflow (used for polling environmental sensors for fault conditions, as well as socket timeouts)
One I2C master port (for talking to the DACs)
Three SPI master ports (for talking to the ADCs)

The two most common values for callnum in the hang state were PERIPH_SPI_SEND_BYTE and PERIPH_SPI_RECV_BYTE, but I saw a PERIPH_SPI_DEASSERT_CS call once. The GPIO and I2C peripherals aren't used during normal activity and are only touched when someone changes a breaker's trip point or the network link flaps, so I wasn't sure if the hang was SPI-specific or related to the peripheral controller in general.

After not seeing anything obviously amiss in the peripheral controller Verilog, I added one last bit of instrumentation: logging the last message successfully processed by the peripheral controller.

The next time the board froze, the CPU was in the middle of the first PERIPH_SPI_RECV_BYTE call in the function below (reading one channel of a MCP3204 quad ADC) but the peripheral controller was idle and had most recently processed the PERIPH_SPI_SEND_BYTE call on the line before.

unsigned int ADCRead(unsigned char spi_channel, unsigned char adc_channel)
{
 //Get the actual sensor reading
 RPCMessage_t rmsg;
 unsigned char opcode = 0x30;
 opcode |= (adc_channel << 1);
 opcode <<= 1;
 RPCFunctionCall(g_periphAddr, PERIPH_SPI_ASSERT_CS, 0,  spi_channel, 0, &rmsg);
 RPCFunctionCall(g_periphAddr, PERIPH_SPI_SEND_BYTE, opcode, spi_channel, 0, &rmsg); //Three dummy bits first
                      //then request read of CH0
                      //(single ended)
 RPCFunctionCall(g_periphAddr, PERIPH_SPI_RECV_BYTE, 0,   spi_channel, 0, &rmsg); //Read first 8 data bits
 unsigned int d0 = rmsg.data[0];
 RPCFunctionCall(g_periphAddr, PERIPH_SPI_RECV_BYTE, 0,  spi_channel, 0, &rmsg); //Read next 4 data bits
                      //followed by 4 garbage bits
 unsigned int d1 = rmsg.data[0];
 RPCFunctionCall(g_periphAddr, PERIPH_SPI_DEASSERT_CS, 0,  spi_channel, 0, &rmsg);
 
 return ((d0 << 4) & 0xFF0) | ( (d1 >> 4) & 0xF);
}

Operating under the assumption that my well-tested interconnect IP didn't have a bug that could make it drop packets randomly, the only remaining explanation was that the peripheral controller was occasionally ignoring an incoming RPC.

I took another look at the code and found the bug near the end of the main state machine:

//Wait for RPC transmits to finish
STATE_RPC_TXHOLD: begin
 if(rpc_fab_tx_done) begin
  rpc_fab_rx_done <= 1;
  state <= STATE_IDLE;
 end
end //end STATE_RPC_TXHOLD

I was setting the "done" flag to pop the receive buffer every time I finished sending a message... without checking that I was sending it in response to another message. The only time this was ever untrue is when sending a timer overflow interrupt.

As a result, if a new message arrived at the peripheral controller between the start and end of sending the timer overflow message, it would be dropped. The window for doing this is only four clock cycles every 50ms, which explains the extreme rarity of the hang.

EDIT: Just out of curiosity I ran a few numbers to calculate the probability of a hang:

At the 30 MHz CPU speed I was using for testing, the odds of any single RPC transaction hanging are 1 in 375,000.
Reading each of the 12 ADC channels requires 5 SPI transactions, or 60 in total. The odds of at least one of these triggering a hang is 1 in 6250.
The client GUI polls at 4 Hz.
The chance of a hang occurring within the first 15 minutes of runtime is 43%.
The chance of a hang occurring within the first half hour is 68%.
There is about a 10% chance that the board will run for over an hour without hanging, and yet the bug is still there.

Managed DC PDU

2013-11-01T01:46:00.000-07:00

As I mentioned in my last post, powering all of the prototyping boards on my desk presents some unique challenges. With only one exception (the Xilinx AC701 board), each of the 22 boards requires 5VDC at somewhere between 0.1 and 2 amps. Some are strictly USB powered, some have a 5.5/2.1mm barrel jack, and some can be powered by either USB or a barrel jack.

Powered USB hubs would reduce the number of power sources required, so I did just that. Lots of cables would get in the way so I designed a custom "backplane" USB hub with male mini-B ports which could plug directly into small prototyping boards. (As a side note, the connectors for this board were nearly impossible to find. There are very few uses for a male mini-B connector that mounts to a PCB rather than being attached to a cable so nobody makes them!)

USB backplane hub

These reduced the problem, but did not come close to eliminating it. I still had to power three backplane hubs, six standalone FPGA boards, and four standalone MCU/SoC dev boards. All needed 5V except for the AC701 (which runs on 12V) but I wanted additional 12V capability for the future if I expanded into higher-power design.

The obvious first idea was an ATX supply. My calculations of peak power for the apparatus (including room for growth) were fairly high, though, and most ATX supplies put the bulk of their output on the 12V rail and have fairly limited (well under 100W) 5V capacity.

The next thing I considered was an off-the-shelf 5V supply. This looked like a nice idea, but (as with an ATX supply) the high output current capability would represent a fire hazard if something shorted. I would obviously need overcurrent protection.

Thinking a bit more, I realized that fusing was probably not the best option. Fuses need to be replaced once blown and in a lab environment overcurrent events happen fairly often. Classical current limiting techniques would be problematic as well since many of my boards have switching power supplies. Since a switcher is a nonlinear load, reducing the input voltage doesn't actually reduce the current. Instead, load current actually increases to maintain the output voltage, which can lead to runaway failure conditions. The safer way to handle overcurrent on a switcher is to shut it down entirely.

I also wanted the ability to power cycle boards on command to reset a stuck board or test power-up behavior. While jiggling cables may work in a hands-on lab environment, it isn't a viable option in the remote-controlled "embedded cloud" platform I'm trying to build.

This would obviously require some intelligence on the part of the power management system. The natural solution was a managed power distribution unit (PDU) of the sort commonly used in datacenters for feeding power to racks of servers. Managed PDUs often include current metering as well, which could be very useful to me when trying to minimize power consumption in a design.

There's just one problem: As far as I can tell, nobody makes managed PDUs for 5V loads. The only ones I saw were for 12/24/48V supplies and massively overpriced: this 8-channel 12V unit costs a whopping $1,757.

What to do? Build one myself, of course!

The first step was to come up with the requirements:

Remote control via SNMP
Ten DC outputs fed by external supply
4A max load for any single channel, 20A max for entire board
Independent overcurrent shutdown for each channel with adjustable threshold
Inrush timers for overcurrent shutdown to prevent false positives during powerup
Remote switching
Current metering
Thermal shutdown
Under/overvoltage shutdown
Input reverse voltage protection
Able to operate at 5V or 12V (jumper selected)

Now that I had a good idea of what I was building, it was time to start the actual design. I decided to use an FPGA instead of a MCU since the parallel nature made it easy to meet the hard-realtime demands of the overcurrent protection system. I also wanted an opportunity to field-test my softcore gigabit-Ethernet MAC, one of my CPU designs, and several other components of my thesis architecture under real-world load conditions.

PDU block diagram

The output stage is key to the entire circuit so it was very important that it be designed correctly. I put quite a bit of effort into component selection here... perhaps a bit too much, as I missed a few bugs elsewhere on the board! More on that later.

Output stage

Working from the output terminal (right side, VOUT_1) we first encounter a a 5 mΩ 4-terminal shunt resistor which feeds the overcurrent shutdown circuit and current metering. This is followed by a an LC filter to smooth the output power and reduce coupling of noise between downstream devices.

The fuse is provided purely as a second line of defense in the event that the soft overcurrent protection fails. As a firmware/HDL developer I know all too well what bugs are capable of, so I like to include passive safeguards whenever reasonably possible. Assuming that my code works correctly, this fuse should never blow even if the output of the PDU was connected to a dead short. (This of course requires that my protection mechanism trip faster than the fuse. Given the 1ms response time of typical fuses to small overcurrents, this isn't a very difficult task.)

Power switching is done by a high-side P-channel MOSFET connected to VOUT (the main high-current power rail). The logic-level input from the control subsystem is shifted up to VOUT level by an N-channel MOSFET. A pullup and pulldown resistor ensure that the output is kept safely in the "off" state when the system is booting.

Current monitoring

The monitoring stage is even simpler: the shunt voltage is amplified by a TI INA199A2 instrumentation amplifier, then fed to an ADC (not shown in this schematic) for metering. A comparator checks the amplified voltage against a reference voltage set by a DAC (also not shown) and if the threshold is exceeded the overcurrent alarm output is asserted.

A module in the FPGA controls the output enables based on the overcurrent flags and internal state. When an output is first turned on the overcurrent flag is ignored for a programmable delay (usually a few ms) in order to avoid false triggering from inrush spikes. After this period, if the overcurrent flag is ever asserted the channel is turned off and placed in the "error-disable" state. In order to clear an error condition the channel must be manually cycled, much like a conventional circuit breaker.

Here's a view of the finished first-run prototype. As you can see the first layout revision had a few bugs ;) The dead-bugged oscillator turned out to not be necessary but it would have been more work to remove it so I'm keeping it until I do a respin with all of these fixes incorporated.

PDU board on my desk

The SNMP interface and IP protocol stack runs on a custom softcore CPU of my own design. The CPU is named GRAFTON, in keeping with my tradition of naming my processors after nearby towns. It is fairly similar to MIPS-1 at the ISA level and can be targeted by mips-linux-gnu gcc with carefully chosen flags, but does not implement unaligned load/store, interrupts, or the normal coprocessors. Coprocessor 0 exists but is used to interface with the RPC network.

GRAFTON's programming model is largely event-driven, in a model that will be somewhat familiar to anyone who has done raw Windows API programming. The CPU sleeps until an RPC interrupt packet shows up, then it is processed and it goes back to sleep. Unlike classical interrupt handling, user code running on GRAFTON cannot be pre-empted by an interrupt; it just sits in the queue until retrieved.

int main()
{
	//Do one-time setup
	Initialize();

	//Main message loop
	RPCMessage_t rmsg;
	while(1)
	{
		GetRPCInterrupt(&rmsg);
		ProcessInterrupt(&rmsg);
	}
	
	return 0;
}

RPCFunctionCall(), a simple C wrapper around the low-level SendRPCMessage and RecvRPCMessage() functions, abstracts the RPC network with a blocking C function call semantics. Any messages other than return values of the pending call are queued for future processing.

In the example below, I'm initializing the SPI modules for the A/D converters with a clock divisor computed on the fly from the system clock rate.

void ADCInitialize()
{
	//SPI clock = 250 KHz
	RPCMessage_t rmsg;
	RPCFunctionCall(g_sysinfoAddr, SYSINFO_GET_CYCFREQ, 0, 250 * 1000, 0, &rmsg);
	int spiclk = rmsg.data[1];
	for(unsigned int i=0; i<3; i++)
		RPCFunctionCall(g_periphAddr, PERIPH_SPI_SET_CLKDIV, spiclk, i, 0, &rmsg);
}

The firmware is about 4300 lines of C in total, including comments but not the 1165 lines of C and assembly in my C runtime library shared by all GRAFTON designs. It implements IPv4, UDP, DHCP, ARP, ICMP echo, and SNMPv2c. SNMPv3 security and IPv6 are planned but are on hold until I move firmware out of block RAM and into flash so I have some space to work in. Other than that, it's essentially feature-complete and I've been using the PDU in my lab for a while while working on my flash controller and some support stuff.

The PC-side UI, intended to control several PDUs, is written in C++ using gtkmm and communicates with the board over SNMP. One tab (not shown) contains summary information with one graph trace per PDU.

PDU control panel

With a few minutes of PHP scripting I was also able to get my Munin installation to connect to the PDU and collect long-term logs even when I don't have the panel up.

Munin logs of PDU

The board runs quite cool, the spikes of heat caused by my furnace kicking in are quite visible and dwarf thermal variations caused by changes in load.

It needs a little bit more work to be fully production-ready but is already saving me time around the lab.

My desk with the PDU installed

Here's a look at my desk after deploying the PDU. The power cable mess is almost completely gone :) I do need to tidy up the Ethernet cables at some point, though...