For those of you not familiar with the MCB, it's a pipelined DDR/DDR2/DDR3 controller that exposes several distinct ports (up to two full-duplex plus four half-duplex, or four full-duplex) to your RTL. Each port can be independently clocked, so the order of operations between ports may be unpredictable (a round-robin arbitration scheme is used under the hood after synchronization), but operations on a single port execute in FIFO order. My design only uses port 0 at the moment; ports 1-3 will probably be used for DMA at some point in the future.
Each port consists of three FIFOs, which are capable of independent clocking but in practice typically share the same clock:
- Command FIFO, which specifies the address, opcode (read, write, read with auto precharge, write with auto precharge), and burst length (number of words to transfer)
- Write data FIFO, which stores data to be written plus a byte write-enable mask
- Read data FIFO, which stores data read from RAM
The read and write FIFOs are each 128 words deep, enough to accommodate most needs. Unfortunately, the command FIFO is much shallower - 4 commands.
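For reference, here's roughly what a one-word write through port 0 looks like at the user interface. The p0_* names come from the MIG-generated wrapper (32-bit port configuration); write_pending, write_data, and write_addr stand in for your own logic:

```verilog
// Push the data word into the write data FIFO and the command into the
// command FIFO; the MCB pops the data as it executes the command.
always @(posedge clk) begin
    p0_wr_en  <= 0;
    p0_cmd_en <= 0;

    // write_pending is assumed to clear once the request is accepted
    if(write_pending && !p0_wr_full && !p0_cmd_full) begin
        p0_wr_en         <= 1;
        p0_wr_mask       <= 4'b0000;    // byte write-enable mask: a 1 skips that byte lane
        p0_wr_data       <= write_data;

        p0_cmd_en        <= 1;
        p0_cmd_instr     <= 3'b000;     // 000 = write, 001 = read
        p0_cmd_bl        <= 6'd0;       // burst length field is N-1 words
        p0_cmd_byte_addr <= write_addr;
    end
end
```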
Four commands isn't nearly enough for a deeply pipelined design that does a lot of random-access operations, such as my softcore CPU! The problem is only exacerbated by the poor performance of the quick-and-dirty direct-mapped write-through L1 cache I'm using (I plan to replace it with a 2-way set-associative write-back cache at some point).
The obvious fix was to put a bigger FIFO in front of the MCB's command port, along the lines of the sketch below. For some reason this didn't work, and I found myself reading garbage from memory. It wasn't clear from reading the code where the bug was, so I tried to simulate my design in ISim and take a closer look.
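The idea was something like this (a minimal sketch, not the actual code - CmdFifo stands in for any generic first-word-fall-through synchronous FIFO):

```verilog
// Buffer {opcode, burst length, address} in a deep soft FIFO, then
// drain it into the MCB's 4-deep command FIFO whenever there's room.
wire        ext_empty;
wire [38:0] ext_dout;       // 3 + 6 + 30 bits

CmdFifo #(.WIDTH(39), .DEPTH(512)) cmd_fifo_ext (
    .clk(clk),
    .wr_en(cmd_en_in),
    .din({cmd_instr_in, cmd_bl_in, cmd_addr_in}),
    .rd_en(!ext_empty && !p0_cmd_full),
    .dout(ext_dout),        // first-word-fall-through: valid when !empty
    .empty(ext_empty),
    .full()                 // upstream backpressure omitted for brevity
);

// Pop one command per cycle into the MCB as long as it can accept them
assign p0_cmd_en = !ext_empty && !p0_cmd_full;
assign {p0_cmd_instr, p0_cmd_bl, p0_cmd_byte_addr} = ext_dout;
```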
Being a hobbyist operating on a grad student's stipend doesn't leave me a lot of cash to spend on software, so I try to stick with open source, or at least freeware, tools to the extent possible. The >$3K-per-seat ISE Design Suite Logic Edition is obviously not affordable, so I use the freeware ISE WebPack for all of my FPGA designs.
WebPack includes the "Lite Edition" of ISim instead of the full version. According to the ISim FAQ page:
> There is only one limitation. When the user design + testbench exceeds 50,000 lines of HDL code, the simulator will start to derate the performance of the simulator for that invocation... The line count includes both testbench and source code lines. The Xilinx libraries are not included in the line count.
I figured I was fine... my line-count script showed 10,129 lines of Verilog in my design directory (not all of which was being used in the simulation) plus another ~15k in the generated MCB code - well below 50k. As soon as I fired up the simulator, though, it slowed to a crawl and displayed a nag screen in the console:
> This is a limited version of the ISE Simulator. The current design has exceeded the design size limit for this version and the performance of the simulation will be derated. Please contact your nearest sales office at www.xilinx.com/company/contact.htm or visit the Xilinx on-line store at www.xilinx.com/onlinestore/design_resources.htm if interested in purchasing the full, unlimited version of this simulator.

A binary search of my RTL (commenting out half of the file at a time) revealed that the MCB primitive was what pushed me over the line count. Despite the FAQ's claim that the Xilinx libraries are not included in the count, the MCB's behavioral model clearly is.
Since the derated simulation was unusably slow (the millisecond or two required for calibration to complete would have taken several days of wall-clock time, before even starting to simulate my test suite), this was clearly not a viable option.
I started looking around for alternatives. The DDR chip was clocked at 80 MHz, DDR'd to 160 MT/s - too fast for my 100 MHz oscilloscope/LA to be of any help. Since the RAM was dynamic, I couldn't just slow the clock down. And I only had 20 GPIO channels on the board, not nearly enough to bring out all of the signals of interest for probing (not that it mattered, since the LA only had 16 channels anyway).
The best option at this point looked to be Xilinx's ChipScope internal logic analyzer, which supports programmable triggers and can store a large number of channels (their documentation isn't clear on exactly how many) to on-chip memory, then stream the captured data to a PC over JTAG or Ethernet (or possibly other interfaces too). Unfortunately, ChipScope is not included in WebPack: my options were to spend $695 on a node-locked license, $850 on a floating license, or $3000+ on a full Logic Edition ISE license.
Not having that kind of cash, I realized I was out of easy options - I was going to have to write my own tool. I sat down and started coding an internal logic analyzer of my own, dubbed RED TIN (two words picked randomly from a dictionary - I'm terrible at coming up with names!).
A week and 2,063 lines of code later I had an alpha version ready for testing at http://code.google.com/p/red-tin-logic-analyzer/. It consists of four parts:
- RedTinLogicAnalyzer, a Verilog core which implements the capture/buffering logic (currently hard-coded to 128 channels and 512 samples; this will be parameterizable later on).
- RedTinUARTWrapper, a wrapper around RedTinLogicAnalyzer that implements the board-to-PC protocol (the intention is to eventually support JTAG and other interfaces as well).
- The redtin application, a C++/gtkmm GUI application which serves as the logic analyzer's control panel.
- A third-party waveform viewer that can read standard VCD-format files. gtkwave is hard-coded in the alpha build, but I'll provide an interface for using other viewers later on.
Using the core is pretty simple - just insert it into the design, feed up to 128 inputs into the "din" input, supply a clock, and connect the UART pins up to top-level pins on the FPGA. A sketch of the instantiation is below; note the 50 bits of zeros at the end of the din concatenation, padding the total width up to 128 bits.
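Something like this (the wrapper's port names are approximate, and the signals fed into din are just an example - any 78 bits of interest will do):

```verilog
// Illustrative RedTinUARTWrapper hookup. Concatenate the signals to
// capture into din and pad the tail with zeros to fill the full width.
RedTinUARTWrapper analyzer (
    .clk(clk),              // capture clock
    .uart_tx(uart_tx),      // route to top-level FPGA pins
    .uart_rx(uart_rx),
    .din({
        wr,                 // 78 bits of signals under test
        read_active,
        p0_cmd_en,
        p0_cmd_full,
        p0_cmd_instr,       // [2:0]
        p0_cmd_bl,          // [5:0]
        p0_cmd_byte_addr,   // [29:0]
        p0_wr_en,
        p0_rd_en,
        p0_rd_empty,
        p0_rd_data,         // [31:0]
        50'h0               // pad total width up to 128 bits
    })
);
```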
Without further ado, here's a screenshot of it in action!
It's a little hard to see at this size (I have a quad-monitor setup on my desk; this is the left half), but the left side shows the control panel application (with iMPACT in the background) and the right side shows data captured from one of my boards.
After a bit of poking around I found my bug!
While I was pushing write commands onto the extended FIFO, I had somehow forgotten to do the same with reads. This resulted in a race condition where reads could either be stepped on by writes or, even worse, execute before the write had committed!
The above screenshot shows a write (p0_cmd_instr = 3'b000) executing after a read (p0_cmd_instr = 3'b001), even though the write was issued first (compare the rising edge of wr to the rising edge of read_active). As a result, the read data was complete garbage (32'hDFBDBFEF instead of 32'hFEEDFACE) because nothing had been written to that address yet.
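The fix, then, is to route read commands through the same extended FIFO as writes, so the MCB sees both in the order they were issued. A sketch of what the corrected enqueue logic might look like (signal names hypothetical, not the actual code):

```verilog
// Both reads and writes go through the one extended command FIFO;
// previously only writes were enqueued here, letting reads race ahead
// of writes that hadn't committed yet.
assign cmd_fifo_wr_en = write_request || read_request;
assign cmd_fifo_din   = write_request ?
    {3'b000, write_bl, write_addr} :    // 000 = write
    {3'b001, read_bl,  read_addr};      // 001 = read
```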