Silicon Exposed: June 2012

Earlier this month I was working on a Spartan-6 based design using the on-chip DDR SDRAM controller (MCB) and ran into some problems.

For those of you not familiar with the MCB, it's a pipelined DDR/DDR2/DDR3 controller that exposes several distinct ports (up to 2 full-duplex and 4 half-duplex, or 4 full-duplex) to your RTL. Each port can be independently clocked, so the order of operations between ports may be unpredictable (a round-robin arbitration scheme is used under the hood after synchronization), but operations on a single port execute in FIFO order. My design only uses port 0 at the moment; ports 1/2/3 will probably be used for DMA at some point in the future.

Each port consists of three FIFOs, which are capable of independent clocking but in practice typically share the same clock:

Command FIFO, which specifies the address, opcode (read, write, read with auto precharge, write with auto precharge), and burst length (number of words to transfer)
Write data FIFO, which stores data to be written plus a byte write-enable mask
Read data FIFO, which stores data read from RAM

To issue a write, push one or more data words and masks onto the write data FIFO and then push a write command onto the command FIFO. To read, push a read command and then read from the read data FIFO as it becomes available.

The read and write FIFOs are each 64 words deep, enough to accommodate most needs. Unfortunately, the command FIFO is much shallower - only 4 commands.
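The port protocol described above can be sketched as a toy behavioral model. This is not Xilinx's implementation and is not cycle accurate: the class and method names are made up for illustration, addresses are treated as word indices, and the byte mask is stored but ignored. The opcode values are the ones quoted for p0_cmd_instr later in this post.

```python
from collections import deque

CMD_FIFO_DEPTH = 4            # command FIFO depth stated in the text
WRITE, READ = 0b000, 0b001    # opcode values quoted later in the post


class PortModel:
    """Toy model of one MCB user port (hypothetical names, not cycle accurate)."""

    def __init__(self):
        self.cmd_fifo = deque()
        self.wr_fifo = deque()
        self.rd_fifo = deque()
        self.ram = {}

    def push_write_data(self, word, mask=0x0):
        # Data and mask are queued first; nothing happens until a command arrives.
        self.wr_fifo.append((word, mask))

    def push_command(self, opcode, addr, burst_len):
        """Queue a command. Returns False when the FIFO is full and the caller must stall."""
        if len(self.cmd_fifo) >= CMD_FIFO_DEPTH:
            return False
        self.cmd_fifo.append((opcode, addr, burst_len))
        return True

    def step(self):
        """Execute the oldest queued command, if any (FIFO order within a port)."""
        if not self.cmd_fifo:
            return
        opcode, addr, burst_len = self.cmd_fifo.popleft()
        for i in range(burst_len):
            if opcode == WRITE:
                word, _mask = self.wr_fifo.popleft()   # mask ignored in this sketch
                self.ram[addr + i] = word
            else:
                self.rd_fifo.append(self.ram.get(addr + i))
```

A write is `push_write_data(...)` followed by `push_command(WRITE, addr, 1)`; a read is `push_command(READ, addr, 1)` followed by popping `rd_fifo` once `step()` has run. A fifth `push_command` without an intervening `step()` returns False, which is the stall a deeply pipelined design hits quickly.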

In deeply pipelined designs that do a lot of random-access operations, such as my softcore CPU, this isn't nearly enough! The problem is only exacerbated by the poor performance of the quick-and-dirty direct-mapped write-through L1 cache I'm using (I plan to replace it with a 2-way set-associative write-back cache at some point).

The obvious fix was to put a bigger FIFO in front of the MCB. This didn't work for some reason and I found myself reading garbage from memory. It wasn't clear where the bug was from reading the code, so I tried to simulate my design in ISim and take a closer look.

Being a hobbyist operating on a grad student's stipend doesn't leave me a lot of cash to spend on software so I try to stick with open source, or at least freeware, tools to the extent possible. The >$3K-per-seat ISE Design Suite Logic Edition is obviously not affordable so I use the freeware ISE WebPack for all of my FPGA designs.

WebPack includes the "Lite Edition" of ISim instead of the full version. According to the ISim FAQ page:

There is only one limitation. When the user design + testbench exceeds 50,000 lines of HDL code, the simulator will start to derate the performance of the simulator for that invocation... The line count includes both testbench and source code lines. The Xilinx libraries are not included in the line count.

I figured I was fine... my line-count script showed 10,129 lines of Verilog in my design directory (not all of which was being used in the simulation) plus another ~15k in the generated MCB code - well below 50k. As soon as I fired up the simulator, though, it slowed to a crawl and displayed a nag screen in the console:
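For reference, a line count like the one above can be reproduced with a few lines of Python. This is a minimal sketch rather than my actual script, and it assumes every physical line counts (blank lines and comments included); how ISim itself counts lines is not documented in the FAQ text quoted above. The function name and default suffix list are illustrative.

```python
from pathlib import Path


def count_hdl_lines(root, suffixes=(".v", ".vhd")):
    """Count physical lines in all HDL files under root (recursively)."""
    total = 0
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix.lower() in suffixes:
            # errors="replace" keeps odd encodings in generated code from aborting the count
            with path.open(errors="replace") as f:
                total += sum(1 for _ in f)
    return total
```

Running it separately on the design directory and on the generated MCB directory gives the two numbers to add up and compare against the 50,000 line limit.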

This is a limited version of the ISE Simulator. The current design has exceeded the design size limit for this version and the performance of the simulation will be derated. Please contact your nearest sales office at www.xilinx.com/company/contact.htm or visit the Xilinx on-line store at www.xilinx.com/onlinestore/design_resources.htm if interested in purchasing the full, unlimited version of this simulator.

A binary search of my RTL (commenting out half of the file at a time) revealed that the MCB primitive was causing the line limit to be hit. Despite the FAQ claiming that the Xilinx libraries are not included in the count, the MCB's behavioral model clearly is included.

Since the derated simulation was unusably slow (the millisecond or two required for calibration to complete would have taken several days of wall-clock time, before even starting to simulate my test suite) this was clearly not a viable option.

I started to look around for alternatives. The DDR chip was being clocked at 80 MHz DDR'd to 160 MT/s, too fast for my 100 MHz oscilloscope/LA to be of any help. Since the RAM was dynamic I couldn't slow it down. I only had 20 GPIO channels on the board, not nearly enough to bring out all of the signals of interest for probing (not that this was really a problem as the LA only had 16 channels).

The best option at this point looked to be Xilinx's ChipScope internal logic analyzer, which supports programmable triggers and can store a large number (their documentation isn't clear on exactly how many) of channels of data to on-chip memory, then stream to a PC over JTAG or Ethernet (or possibly other IO standards too). Unfortunately ChipScope is not included in the WebPack license. My options were to spend $695 on a node-locked license, $850 on a floating license, or $3000+ on a full Logic Edition ISE license.

Not having that kind of cash, I realized I was out of easy options - I was going to have to write my own tool. I sat down and started coding on an internal logic analyzer of my own, dubbed RED TIN (two words picked randomly from a dictionary - I'm terrible at coming up with names!).

A week and 2,063 lines of code later I had an alpha version ready for testing at http://code.google.com/p/red-tin-logic-analyzer/. It consists of four parts:

RedTinLogicAnalyzer, a Verilog core which handles the core capture/buffering code (currently hard-coded to 128 channels and 512 samples, will be parameterizable later on).
RedTinUARTWrapper, a wrapper around RedTinLogicAnalyzer that implements the board-to-PC protocol (the intention is to eventually support JTAG and other interfaces as well).
The redtin application, a C++/gtkmm GUI application which functions as the control panel of the logic analyzer
A third-party waveform viewer that can read standard vcd-format files. gtkwave is hard-coded in the alpha build but I'll provide an interface for using other viewers later on.
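To illustrate the last step in that chain, here is a rough sketch of turning captured samples into VCD text that a viewer such as gtkwave can open. It is not the code in the redtin application; the function name, the "capture" scope name, and the one-bit-per-channel layout are assumptions made for the example. Single-character identifiers limit this sketch to 94 channels, so a 128-channel capture would need multi-character identifiers.

```python
def samples_to_vcd(samples, names, timescale="1ns"):
    """Render samples as VCD text.

    samples: list of ints; bit i of each int is the value of channel i.
    names:   list of channel names, one per bit.
    """
    if len(names) > 94:
        raise ValueError("this sketch only supports up to 94 channels")
    # VCD identifier codes are printable ASCII characters, starting at '!'
    ids = [chr(33 + i) for i in range(len(names))]

    lines = [f"$timescale {timescale} $end", "$scope module capture $end"]
    for ident, name in zip(ids, names):
        lines.append(f"$var wire 1 {ident} {name} $end")
    lines += ["$upscope $end", "$enddefinitions $end"]

    prev = None
    for t, sample in enumerate(samples):
        # Emit every channel at the first sample, then only the ones that changed
        changes = [
            f"{(sample >> i) & 1}{ident}"
            for i, ident in enumerate(ids)
            if prev is None or ((sample ^ prev) >> i) & 1
        ]
        if changes:
            lines.append(f"#{t}")
            lines.extend(changes)
        prev = sample
    return "\n".join(lines) + "\n"
```

Writing the returned string to a file and opening it in the viewer shows one trace per name, with timestamps equal to the sample index.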

The capture code is portable Verilog and should work on any FPGA, though I've only tested on Spartan-6. The PC-side application is cross-platform except for the UART code, which is Linux-specific (I haven't yet had time to write a portable wrapper around that).

Using the core is pretty simple - just insert it into the design, feed up to 128 inputs into the "din" input, supply a clock, and connect the UART pins up to top-level pins on the FPGA. Note the 50 bits of zeros at the end to pad the total width up to 128 bits.

Without further ado, here's a screenshot of it in action!

It's a little hard to see in the small view (I have a quad-monitor setup on my desk, this is the left half) but the left side shows the control panel application (with iMPACT in the background) and the right side is data captured from one of my boards.

After a bit of poking around I found my bug!

While I was pushing write commands onto the extended FIFO, I had somehow forgotten to do so with reads. This resulted in a race condition where reads could either be stepped on by writes or, even worse, occur before the write had committed!

The above screenshot shows a write (p0_cmd_instr = 3'b000) executing after a read (p0_cmd_instr = 3'b001) even though the write was issued first (compare rising edge of wr to rising edge of read_active). As a result the read data was complete garbage (32'hDFBDBFEF instead of 32'hFEEDFACE) because nothing had been written to that address yet.
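The ordering bug can be reproduced in a few lines of Python. This is a deliberately simplified model, not my RTL: it assumes a command that bypasses the extended FIFO executes immediately, while queued commands execute only when the queue is drained. All names are illustrative.

```python
from collections import deque


def run(reads_use_fifo):
    """Issue a write then a read to the same address; return what the read saw."""
    ram = {}
    ext_fifo = deque()    # the larger FIFO placed in front of the controller
    results = []

    def execute(op, addr, data=None):
        if op == "write":
            ram[addr] = data
        else:
            # An address that was never written stands in for uninitialized DRAM
            results.append(ram.get(addr, "garbage"))

    def issue(op, addr, data=None):
        if op == "write" or reads_use_fifo:
            ext_fifo.append((op, addr, data))   # queued, executes later in order
        else:
            execute(op, addr, data)             # bypasses the queue: the bug

    issue("write", 0x40, 0xFEEDFACE)
    issue("read", 0x40)
    while ext_fifo:
        execute(*ext_fifo.popleft())
    return results
```

`run(False)` models the buggy design, where the read overtakes the queued write and sees garbage; `run(True)` routes both through the same FIFO and the read returns 0xFEEDFACE.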

Once I managed to assemble a test vehicle full of 0201 passives, I started to wonder if I could manage to go any smaller with hand assembly. The final milestone was 01005, the smallest size passive component in mass production.

Not being one to duck out of a challenge, I bought a tape of 20kΩ resistors and began designing a test vehicle. The board consists of four identical arrays of footprints; each array contains eight paralleled resistors with a test point on each side. If you do the math this works out to 2.5kΩ across the entire array if all joints are good. Significantly higher or lower would mean one of them was an open or short.
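The math is easy to check with a short script. This is just a worked calculation under the ideal assumption that a good joint contributes exactly 20kΩ and an open joint removes its resistor from the array; the function name is made up for the example.

```python
def parallel_resistance(values):
    """Equivalent resistance of resistors in parallel (open joints are simply left out)."""
    return 1.0 / sum(1.0 / r for r in values)


R = 20_000.0                                  # nominal value of each resistor, in ohms
all_good = parallel_resistance([R] * 8)       # about 2500 ohms
one_open = parallel_resistance([R] * 7)       # about 2857 ohms
```

A single open joint raises the reading by roughly 14%, which is easy to see on an ohmmeter, while a shorted resistor would pull the whole array down to nearly zero.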

The first problem (which I had anticipated ahead of time) was that my existing tweezers were simply too large to pick up a component of this size.

01005 passive on 0.5mm TQFP footprint, seen next to my normal tweezers

I decided to buy a set of Dumont-brand Swiss watchmaker's tweezers with 50 micron points, which are significantly smaller than my old ones (but much more fragile - I've already slightly bent one of the tips despite being quite careful!)

The same component next to my new tweezers

Paste application used the same tried-and-true method as my earlier 0201 test - squirting a tiny dab of paste onto a microscope slide and using a scalpel blade as a "putty knife". In the interests of time I only hooked up a single column.

To give a better idea of how small the components (and the entire board) are, I put an 8-pin SOIC on it.

Solder paste applied to the second column with SOIC-8 for scale. The exposed copper rings around the probe pads are the result of an error in the solder mask pattern.

Paste volume was highly varied, suggesting I need to work on either better control of the scalpel technique or replacing it entirely.

Beginning component placement. Note heavy variability in paste volume.

After finishing component placement I took a closer look under the Olympus scope to see how things turned out.

Darkfield image of placed components before reflow. Focal plane is just above the top of the solder mask.

I ran the board through my standard reflow profile in the toaster oven and let it cool. From a quick inspection it looked pretty good - no tombstoning or obvious shorts/opens.

This was confirmed with an ohmmeter check - 2.50kΩ exactly. Looks like a complete success :)

Board post-reflow. Note probe scrub marks on test pads.

Darkfield closeup of a single component. Slight misalignment is visible on the one at the top edge, but electrical test passes.

Assembled board seen next to a 14-pin PDIP for scale.

It was a rather labor intensive process but the end result was a complete success. While I certainly am not about to go and use all 01005 components on my future boards, if it becomes necessary to stick something in a tight place it's good to know my process can handle it.

Friday, June 29, 2012

ISim bugs and introducing the RED TIN internal logic analyzer

Monday, June 18, 2012

The final frontier - 01005 passives