Improving PC/104 Bandwidth using FPGA Microcontroller

The TS-ADC24 can provide up to 8 MB/s of ADC data, but the ISA (PC/104) bus on most systems is limited to 2 MB/s of bandwidth or less, so one might conclude that the TS-ADC24 is over-designed. However, the TS-ADC24 itself does not require the long ISA strobe times that typical PC/104 systems use, and a well-designed PC/104 system such as the TS-8100-4740, which features a Spartan 6 FPGA, can actually exceed 2 MB/s for sustained bursts. This translates into sampling 4 ADC channels at 250 kHz or even 500 kHz. This is possible due to standard functionality in the FPGA, including customizable bus timing, user DMA, and an embedded processor. With extra engineering, 1000 kHz would be possible, but this article explores what can be accomplished by a typical C programmer who does not want to venture into the realm of FPGA development.

Please note that this paper covers programming techniques that may be considered advanced and that may require a detailed understanding of the architecture of the products involved. Technologic Systems' free tech support may not cover modification, debugging, porting, adaptation, or enhancement of the sample software provided with this article.

[Figure 1: The chain of buses between the TS-ADC24 and the ARM9 CPU]

The system examined in this note is a stack of three Technologic Systems boards: the TS-8100, TS-4740, and TS-ADC24. Once configured, the TS-ADC24 automatically takes 4 ADC samples at a time and buffers them in a 512-deep FIFO, which must be serviced by software. For example, a sampling period of 4 µs (250 ksps) would require a mean bandwidth of 2 MB/s and a FIFO service latency well below 512 µs. This paper examines attempts to move data from the TS-ADC24 to main memory at speeds exceeding traditional PC/104, without suffering a FIFO overflow on the TS-ADC24, using the stock Linux OS and FPGA configuration that are provided with the system. The diagram above shows the chain of buses between the ADC and the ARM9 CPU.
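The bandwidth and latency figures in the example above follow from simple arithmetic, sketched below. The function and macro names are illustrative only, and the 2-byte entry size is an assumption inferred from the numbers in the text (250 ksps × 4 samples giving 2 MB/s).

```c
#define SAMPLES_PER_PERIOD 4    /* the TS-ADC24 takes 4 ADC samples at a time */
#define BYTES_PER_ENTRY    2    /* assumption: 16-bit FIFO entries */
#define FIFO_DEPTH         512  /* FIFO entries */

/* Mean bus bandwidth, in bytes per second, needed at a sampling rate in Hz. */
double adc_bandwidth_bytes_per_s(double sample_rate_hz)
{
    return sample_rate_hz * SAMPLES_PER_PERIOD * BYTES_PER_ENTRY;
}

/* Time, in microseconds, for an empty FIFO to fill if nobody reads it. */
double fifo_fill_time_us(double sample_rate_hz)
{
    return 1e6 * FIFO_DEPTH / (sample_rate_hz * SAMPLES_PER_PERIOD);
}
```

At 250 kHz this gives 2 MB/s and a 512 µs fill time. At 1000 kHz it gives 8 MB/s and 128 µs.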

The simplest software algorithm for acquiring TS-ADC24 data is one that reads it explicitly. With this technique, the faster you try to read, the more time the CPU will spend stalled waiting for bus cycles to come back. Consider this simple line of C code:
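A minimal sketch of such a line is shown below. The `ADC` macro definition, the `adc_base` pointer, and the stand-in register variable are hypothetical and are not taken from the sample sources. On real hardware the pointer would refer to the mapped PC/104 address of the TS-ADC24 data register rather than to an ordinary variable.

```c
#include <stdint.h>

/* Stand-in for the memory-mapped data register (illustration only). */
static volatile uint16_t fake_adc_register;
static volatile uint16_t *adc_base = &fake_adc_register;

/* Hypothetical macro: dereference a volatile pointer to the register. */
#define ADC (*adc_base)

/* Read one 16-bit value from the data register. */
uint16_t read_one_sample(void)
{
    uint16_t sample;

    sample = ADC;   /* a single load; the CPU waits here for the bus cycle */
    return sample;
}
```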

The ADC macro is a pointer dereference into hardware address space and it simply compiles into a load opcode. While that load instruction is executing, the CPU is stalled. The load instruction becomes a read cycle on the SMC bus, the MUXBUS, and the PC/104 bus. Each of these buses has overhead that adds to the total time that it takes for this one instruction to execute, no matter how fast the CPU is.

Despite all the layers of bus overhead, the ISA strobe length is a major factor that determines how fast the data register on the peripheral can be read. On this system, like many Technologic Systems SBCs, the strobe length is configurable. This MUXBUS strobe length becomes the ISA strobe length. The shortest strobe length that works for this combination of boards is 80 ns. Configurable bus timing is a substantial advantage of the FPGA-based external bus.

The first example source file demonstrating this basic method is adc1.c.

Source Code: adc1.c

On the system in question, with an optimized MUXBUS timing configuration, the data loop runs at 1.7 MHz while it is executing, for a maximum short-term throughput of 3.4 MB/s. In practice, however, a software process reading at that speed would consume 100% of CPU cycles and leave none for I/O and other processes. Some cushion needs to be built in, so anything over 2 MB/s is impractical with explicit reads.
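As a rough illustration of why a cushion is needed, the sketch below estimates what fraction of CPU time is spent stalled in reads at a given sampling rate. It assumes every read costs the same as in the tight 1.7 MHz loop, so it is a lower bound that ignores loop overhead, interrupts, and other work. The names are illustrative.

```c
#define MAX_READS_PER_S   1.7e6  /* measured loop rate quoted in the text */
#define BYTES_PER_READ    2      /* implied by 1.7 MHz giving 3.4 MB/s */
#define READS_PER_SAMPLE  4      /* 4 ADC values per sampling period */

/* Peak short-term throughput of the explicit read loop, in bytes/second. */
double peak_throughput_bytes_per_s(void)
{
    return MAX_READS_PER_S * BYTES_PER_READ;
}

/* Fraction of CPU time spent stalled on reads (1.0 means fully saturated). */
double stall_fraction(double sample_rate_hz)
{
    return sample_rate_hz * READS_PER_SAMPLE / MAX_READS_PER_S;
}
```

By this estimate, 250 kHz sampling already keeps the CPU stalled for roughly 59% of the time, and the loop saturates completely at 425 kHz.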

With adc1.c, a sampling rate of 200 kHz is possible in practice, for a net bandwidth of 1.6 MB/s. However, it uses 100% of CPU resources and its performance is fragile: it will suffer a FIFO overflow if any other process demands the CPU.

To make this program more reliable and useful, interrupt handling is an obvious improvement. The TS-ADC24 can be configured to trigger an interrupt at any given FIFO count trigger level. Triggering at half full is a simple and practical way of using a FIFO. This improvement is done in adc2.c using user space IRQ handling.
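The half-full approach can be sketched with a small software model of the FIFO. This is not the code from adc2.c, and it does not show the user space IRQ mechanism. The `sim_fifo` structure and its functions are hypothetical stand-ins for the hardware, used only to show the servicing logic.

```c
#include <stddef.h>
#include <stdint.h>

#define FIFO_DEPTH     512
#define IRQ_THRESHOLD  (FIFO_DEPTH / 2)   /* trigger at half full */

/* Software model of the peripheral FIFO (illustration only). */
struct sim_fifo {
    uint16_t data[FIFO_DEPTH];
    size_t head;    /* next slot the "hardware" writes */
    size_t tail;    /* next slot software reads */
    size_t count;   /* entries currently stored */
};

/* Model of the hardware storing one value; returns 0 on overflow. */
int sim_fifo_push(struct sim_fifo *f, uint16_t v)
{
    if (f->count == FIFO_DEPTH)
        return 0;
    f->data[f->head] = v;
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count++;
    return 1;
}

/* Model of the interrupt line: asserted at or above the trigger level. */
int sim_fifo_irq_pending(const struct sim_fifo *f)
{
    return f->count >= IRQ_THRESHOLD;
}

/* Handler body: read exactly IRQ_THRESHOLD entries, which the trigger
 * level guarantees are present. Returns the number of entries read. */
size_t service_half_full(struct sim_fifo *f, uint16_t *dst)
{
    size_t n;

    if (!sim_fifo_irq_pending(f))
        return 0;
    for (n = 0; n < IRQ_THRESHOLD; n++) {
        dst[n] = f->data[f->tail];
        f->tail = (f->tail + 1) % FIFO_DEPTH;
        f->count--;
    }
    return n;
}
```

Reading a fixed block on each interrupt means the handler never has to check how much data is available, and the other half of the FIFO absorbs samples that arrive while the handler is being scheduled.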

Source Code: adc2.c

With adc2.c, a sampling rate of 250 kHz is possible, for a net bandwidth of 2.0 MB/s. At 100 kHz sampling, it uses 32% of CPU resources.

At 200 kHz or faster sampling, adc2.c strains the system and leaves insufficient CPU resources for other processes. The next step is to reduce CPU usage by using user DMA bursts instead of explicit data reads. The DMA controller in the TS-4740 FPGA can read the peripheral at up to 6.25 MHz. This is not difficult to do using the mvdmastream.h API. Below is a diagram of the wishbone logic blocks in the standard TS-4740 FPGA implementation.

[Figure 2: Wishbone logic blocks in the standard TS-4740 FPGA implementation]

User DMA is demonstrated in adc3.c. Note that this does not involve the DMA controller in the Marvell CPU. DMA as discussed here refers to a mechanism designed by Technologic Systems that causes burst transactions on the SMC bus to be used.

Source Code: mvdmastream.c
Source Code: mvdmastream.h
Source Code: adc3.c

With adc3.c, a sampling rate of 350 kHz is possible, for a net bandwidth of 2.8 MB/s. At 100 kHz sampling, it uses 17% of CPU resources.

Once again, the obstacle that prevents a further gain in performance is not pure bandwidth but latency. Tests show that the latency between an interrupt and the first read of a DMA burst can exceed 100 µs. At maximum sampling speed, the TS-ADC24 FIFO goes from empty to full in 128 µs, so this application is running up against the real-time constraints of Linux. Assistance from the FPGA is needed.
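The latency budget can be estimated with the sketch below. It assumes the interrupt fires at half full, as in adc2.c (the trigger level used by adc3.c is not stated here), and that the DMA burst reads at the 6.25 MHz peak rate. The names are illustrative, and the model ignores everything except FIFO fill time.

```c
#define FIFO_DEPTH        512
#define READS_PER_SAMPLE  4
#define DMA_READ_RATE_HZ  6.25e6   /* peak peripheral read rate from the text */

/* Microseconds until overflow, counted from a half-full interrupt. */
double headroom_after_irq_us(double sample_rate_hz)
{
    double free_entries = FIFO_DEPTH / 2.0;

    return 1e6 * free_entries / (sample_rate_hz * READS_PER_SAMPLE);
}

/* Microseconds for a DMA burst to read n entries at the peak rate. */
double dma_burst_time_us(double n_entries)
{
    return 1e6 * n_entries / DMA_READ_RATE_HZ;
}

/* Nonzero if the interrupt-to-first-read latency leaves the FIFO intact,
 * assuming the burst then drains faster than the ADC fills. */
int latency_ok(double sample_rate_hz, double latency_us)
{
    return latency_us < headroom_after_irq_us(sample_rate_hz);
}
```

Under these assumptions, the headroom at 1000 kHz is only 64 µs, so a 100 µs latency overflows the FIFO, while at 350 kHz the headroom is about 183 µs. A burst of 256 entries takes about 41 µs, which shows that the transfer itself is not the bottleneck.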

An FPGA programmer might solve this problem nicely by installing a DMA core that reads from the TS-ADC24 whenever it asserts its interrupt and places that data in a larger internal FIFO, which in turn can be read by a user DMA core at around 50 MB/s. This paper assumes the developer is not an FPGA programmer and must work with the stock FPGA functionality. With the Lattice Mico32 (LM32) embedded microcontroller, the FPGA still provides a degree of flexibility to a C programmer that can prove useful here.

Programming the LM32 is fairly simple. For readers who want to skip that part but still observe a high bandwidth demonstration, the LM32 binary, adc24buffer.exe, is provided. For anyone interested, the source is also provided. In order to build it, Cygwin (including the make utility) and Lattice Micosystem tools are required. Simply modify the Makefile to reflect the install location of your Lattice tools, and run make to build.

Executable: adc24buffer.exe (.zip)
Source Files: LM32 (.zip)

This LM32 program simply monitors the TS-ADC24 and reads ADC data from it when it is available. The data is stored in a ring buffer in LM32 RAM, where it can be retrieved by adc4.c. By writing software for two separate CPUs, we are able to leverage enough of our resources to further increase effective bandwidth.
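The ring buffer idea can be sketched as a single-producer, single-consumer queue. This is not the adc24buffer source: the structure, names, and capacity are hypothetical, and the sketch runs on one CPU. A real two-CPU version would also have to respect the memory layout and ordering rules of the LM32 and the DMA path.

```c
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 1024   /* hypothetical capacity; must be a power of two */

struct ring {
    uint16_t buf[RING_SIZE];
    volatile uint32_t head;   /* advanced only by the producer */
    volatile uint32_t tail;   /* advanced only by the consumer */
};

/* Entries currently stored; unsigned wraparound keeps this correct. */
uint32_t ring_count(const struct ring *r)
{
    return r->head - r->tail;
}

/* Producer side: store one value, or return 0 if the ring is full. */
int ring_put(struct ring *r, uint16_t v)
{
    if (ring_count(r) == RING_SIZE)
        return 0;
    r->buf[r->head & (RING_SIZE - 1)] = v;
    r->head = r->head + 1;   /* publish only after the data is written */
    return 1;
}

/* Consumer side: copy out up to max entries; returns how many were copied. */
size_t ring_get(struct ring *r, uint16_t *dst, size_t max)
{
    size_t n = 0;

    while (n < max && ring_count(r) > 0) {
        dst[n++] = r->buf[r->tail & (RING_SIZE - 1)];
        r->tail = r->tail + 1;
    }
    return n;
}
```

Because each index is written by only one side, the producer and consumer do not need a lock. The larger buffer is what relaxes the latency requirement on the Linux side.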

Source Code: adc4.c

With adc4.c, a sampling rate of 850 kHz is possible, for a net bandwidth of 6.8 MB/s. At 100 kHz sampling, it uses 10% of CPU resources.

One potentially confusing aspect of adc4.c is its interrupt mechanism. A way was needed for the LM32 to send an interrupt to Linux, so that its ring buffer is serviced before it overflows. There is no built-in way to do this, so the LM32 program does it by enabling and disabling the TS-ADC24 interrupt. This points to a possible need for a “user interrupt” register bit in the TS-4740 SYSCON, which may be added in the future.

Using DMA, the LM32, and interrupts, all latency problems are solved and pure bus bandwidth is back to being the limiting factor. This system is effective at up to 800 kHz sampling (6.4 MB/s), but at 1000 kHz it runs out of bus bandwidth. Examining the diagram above, note that when the LM32 is pulling data from the MUXBUS and storing it in memory while the user DMA channel is also accessing that memory, both are arbitrating for the same internal wishbone bus. The LM32 is running on a 100 MHz clock, but in this implementation its performance is around 30 MIPS. Like the ARM9, it stalls during a PC/104 bus cycle, so pulling 8 MB/s from the peripheral is asking a lot. Throw in the added stalls from the time that DMA is reading data, and the LM32 cannot quite keep up.
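A back-of-the-envelope calculation shows how tight the LM32 budget is. The sketch below takes the approximate 30 MIPS figure at face value and divides it by the number of 16-bit reads per second. It ignores the additional stalls caused by DMA arbitration, so the real budget is smaller. The names are illustrative.

```c
#define LM32_MIPS         30.0   /* approximate figure from the text */
#define READS_PER_SAMPLE  4      /* 4 ADC values per sampling period */

/* Average number of LM32 instructions available per 16-bit read. */
double lm32_instr_per_read(double sample_rate_hz)
{
    double reads_per_s = sample_rate_hz * READS_PER_SAMPLE;

    return LM32_MIPS * 1e6 / reads_per_s;
}
```

At 1000 kHz sampling this leaves only 7.5 instructions per read, including the load from the peripheral, the store into the ring buffer, and the pointer bookkeeping. At 800 kHz the figure is a little over 9.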

This paper showed how a determined C programmer can squeeze up to 6.4 MB/s of sustained bandwidth from the PC/104 bus on a TS-8100-4740 in a real-world application. The results are summarized below. Many embedded system designers would assume that this is impossible, and only evaluate more expensive systems based on PCI or PCIe for an application with requirements in this range. On a MB/s/dollar metric, the TS-4740 is an excellent value.

| Program | Description | Max ksps | PC/104 KB/s | CPU usage at 100 ksps |
|---|---|---|---|---|
| adc1.c | Basic implementation | 200 | 1600 | 99% |
| adc2.c | Added interrupt handling | 250 | 2000 | 32% |
| adc3.c | Added burst transfers | 350 | 2800 | 17% |
| adc4.c | Added LM32 assist | 850 | 6800 | 10% |

Author: Derek Hildreth

eBusiness Manager for Technologic Systems