"BxBFFT" Fast Fourier Transform

The "BxBFFT": an outstanding high-speed streaming FFT

The BxBFFT is an amazing high-speed streaming Fast Fourier Transform. It is specifically targeted at jobs that require the highest FFT speeds.

The BxBFFT is designed for digital signal processing applications where the sample rate is many times the FPGA clock rate, such as radar systems, lidar systems, spectral switches, cable TV, MKID photonic sensors, high-bandwidth beamformers, radio telescopes, test and measurement systems, analog system simulators, high-speed radios, cellular backhaul, medical imaging systems, and communication satellites. When the processing rates get high, the BxBFFT is second to none.

The following points summarize the BxBFFT's main advantages:

1. Power savings: Half the power of other FFTs (in Xilinx).
2. Resource savings: Half the FPGA LUTs (in Xilinx).
3. High Throughput: Highly parallel processing at the highest clock rates. Also the lowest latency.
4. Large FFT Sizes: Supports 128k and greater FFT sizes with efficiency.
5. Feature Support: Non-power-of-2 FFTs, real-to-complex FFTs, many included options to meet design goals.
6. Productivity: Easier to get reliable results; faster synthesis and simulation.
7. Vendor Independence: Efficient support of multiple FPGA vendors, with a path to ASICs.

The BxBFFT is cross-platform. For features common to all platforms, see further below on this page. For features specific to the FPGA families that are supported off-the-shelf, see these additional pages:

Xilinx Ultrascale/Ultrascale+ FPGAs
Xilinx Versal FPGAs
Altera Agilex7 FPGAs
Altera Stratix10 FPGAs
Altera Arria10 FPGAs

Delivery of BxBFFTs for these FPGA families is nearly immediate. The BxBFFT also supports implementation in other FPGA families and in ASICs, but these are not off-the-shelf and thus have longer delivery times and may have higher costs.

High-speed applications often run into difficulties meeting implementation budgets in resources, power consumption, or FPGA clock rate. These problems are often not revealed until late in the design cycle when they are difficult to fix. The BxBFFT is a top performer in all of these areas, preventing project issues and delays with a better front-end design. Performance curves specific to each FPGA family are on the web pages linked to above, since performance varies with the underlying FPGA architecture.

BxBFFT Features

The BxBFFT has a large feature set, including many features not found in other FFTs. It has easy-to-use controls of FFT numerical performance, controls for resource utilization tradeoffs, controls to obtain the highest timing margins, and options supporting many special requirements.

Occasionally a particular FFT feature is critical to an application. The BxBFFT supports the widest variety of features, out of the box. Below is a comparison. Note that some vendors, such as Altera and Xilinx, have multiple FFT offerings. Only their highest-speed FFT compares with the BxBFFT. So the FFT features and performance shown here and elsewhere are for the vendor's highest-speed FFT, which often has fewer supported features than slower FFTs from that same vendor.

Below is a table of the supported features of the BxBFFT and other FFTs.

Feature Table

Non-Power-of-2 FFTs

BxBFFTs support FFT sizes that are multiples of powers of 2, 3, 5, and 7, not just powers of 2. Non-power-of-2 BxBFFTs use extensive optimizations not available for power-of-2 cases. Although power-of-2 BxBFFTs are usually the most efficient, non-power-of-2 BxBFFTs are not far behind. In rare cases, non-power-of-2 BxBFFT performance is even superior to the performance of the closest power-of-2 BxBFFT.
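As a rough illustration of which sizes this rule covers, the sketch below checks whether a size factors entirely into 2, 3, 5, and 7. The function name `is_candidate_size` is hypothetical, and the check reflects only the factorization rule stated above; the actual BxBFFT catalog may impose further limits (for example on minimum or maximum size) that are not described here.

```python
def is_candidate_size(n: int) -> bool:
    """Return True if n is a product of powers of 2, 3, 5, and 7 only.

    Hypothetical helper: it models the factorization rule, not the
    vendor's actual list of available sizes.
    """
    if n < 1:
        return False
    for p in (2, 3, 5, 7):
        # Strip out every factor of p.
        while n % p == 0:
            n //= p
    # Anything left over is a prime factor larger than 7.
    return n == 1


for size in (1024, 1200, 1536, 2187, 2200):
    print(size, is_candidate_size(size))
```

Here 1200 (2^4 x 3 x 5^2) and 2187 (3^7) pass the check, while 2200 fails because it contains a factor of 11.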

One of the most important advantages of non-power-of-2 BxBFFTs is that they allow non-power-of-2 Points Per Clock, which is the number of complex data points that the FFT processes in parallel each clock. (This is abbreviated PPC, and is sometimes called Super Sample Rate or SSR.) Having more options for parallelism gives more options to make a design close. For example, to get a desired FFT throughput, PPC=4 might have too high an FPGA clock rate, but PPC=8 might require too much power or too many resources. In these cases, PPC=5 with a non-power-of-2 BxBFFT may lead to design closure. This factor becomes more significant as ADC and DAC rates increase. For example, a design that can close with PPC=36 will use significantly less logic than the next power-of-2 step up of PPC=64.
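The tradeoff can be made concrete with the relation sample rate = PPC x FPGA clock rate, which follows from the definition of PPC above. The sketch below assumes a hypothetical 2.5 GS/s complex sample rate, chosen only for illustration, and lists the FPGA clock rate that each PPC would require. The function and constant names are likewise illustrative.

```python
def required_clock_mhz(sample_rate_msps: float, ppc: int) -> float:
    """FPGA clock rate (MHz) needed to process sample_rate_msps at ppc points per clock."""
    return sample_rate_msps / ppc


SAMPLE_RATE_MSPS = 2500.0  # hypothetical complex sample rate, in Msamples/s

for ppc in (4, 5, 8):
    clock = required_clock_mhz(SAMPLE_RATE_MSPS, ppc)
    print(f"PPC={ppc}: {clock:.1f} MHz")
```

With these assumed numbers, PPC=4 needs 625 MHz, PPC=8 needs 312.5 MHz, and PPC=5 lands between them at 500 MHz, which is the kind of intermediate operating point that a power-of-2-only FFT cannot offer.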

In addition, non-power-of-2 BxBFFTs have system advantages. They can more easily match a design to frequencies of existing equipment or match it to frequency standards. They allow single-clock operation of some designs, where power-of-2 FFTs would require multiple synchronous clock sources. These factors can make designs close that otherwise would not, or they can reduce FPGA logic and external part count.

The graphs below show available BxBFFTs and Xilinx SSR FFTs below size 10,000 and from PPC=2 to PPC=10. The first thing to note is the richness of BxBFFT offerings compared to a power-of-2-limited FFT such as the Xilinx SSR FFT or Altera Parallel FFT. The same power-of-2 limitation is true of most other FFTs. The BxBFFT has thousands of options to make designs close, where power-of-2-limited FFTs have only 21 options.

These graphs also show that although power-of-2 BxBFFTs generally have the lowest power consumption, non-power-of-2 BxBFFTs are also quite good. For example, non-power-of-2 BxBFFTs are often better in power consumption than the closest Xilinx SSR FFTs.

BxBFFT Power in Xilinx

Real FFTs

The BxBFFT supports FFTs with real inputs and complex outputs. These obtain spectra with the highest accuracy, because no real-to-complex conversion is needed between the ADC data and the FFT; such conversions create artifacts and impose filter rolloff. As usual, the BxBFFT also ships with the Real FFT inverse.
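For background, the spectrum of purely real input is conjugate-symmetric, so only N/2+1 of the N output bins carry independent information. The sketch below demonstrates this with a plain textbook DFT on arbitrary test data. It illustrates the general DFT property only; it does not describe the BxBFFT's internal algorithm or its output format, and all names are illustrative.

```python
import cmath


def dft(x):
    """Direct O(N^2) discrete Fourier transform, for illustration only."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


real_input = [1.0, 2.0, 0.5, -1.0, 0.0, 3.0, -2.0, 1.5]  # arbitrary real samples
spectrum = dft(real_input)
n = len(real_input)

# For real input, bin (n - k) is the complex conjugate of bin k.
symmetric = all(
    abs(spectrum[(n - k) % n] - spectrum[k].conjugate()) < 1e-9
    for k in range(n)
)
unique_bins = n // 2 + 1  # bins 0 .. n/2 determine all the others

print(symmetric, unique_bins)
```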

Background Reset (for Space Applications)

The BxBFFT supports a feature where it can be fully reset while operating, without interrupting processing. This feature supports high-reliability operation in space environments, which have natural radiation. Radiation causes Single-Event Upsets (SEUs), which can cause transient errors (such as resetting counters) or persistent errors (such as altering the logic programmed into the FPGA). Frequent periodic background resets of the BxBFFT fix the transient errors caused by SEUs without affecting normal operation. It is not necessary to detect that an SEU occurred.

Competing FFTs often cannot fix SEU errors in the background. As a consequence, competing FFTs often can't fix SEU errors periodically at all. This is because the continued interruptions would adversely affect required system availability. However, the system must fix SEU errors, because leaving an SEU in place corrupts processing and also affects availability. One solution is to detect SEUs, so that the FFT is reset only when it needs to be reset. This leads to complicated detection schemes that aren't fully reliable. Another solution is to use algorithms that allow FFT idle time in which SEUs can be repaired. However, idle time is not natural for many applications. The BxBFFT avoids these issues and these complications with its background reset.

In the case where an SEU makes a persistent alteration to FPGA logic, the standard approach is to have a "scrubbing" operation that reads back the FPGA configuration, checks for changes to the logic, and repairs them. This makes the persistent SEU transient. The BxBFFT's background reset works well with this, automatically restoring operation as soon as the logic is repaired.

For the highest reliability, Triple Modular Redundancy (TMR) triplicates logic into three legs and then votes on the answer. This means that even when one set of logic is affected by an SEU, proper operation is not affected because the other two legs outvote the incorrect answer. The full SEU-protection scheme has TMR, then scrubbing, then a background reset of the BxBFFT to automatically finish the SEU repair. Each of these operations is independent and decoupled, for easy implementation. The background reset doesn't just restore BxBFFT operation; it also restores proper BxBFFT sync to match the other two operating BxBFFTs, so that system operation is fully and automatically restored.
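The voting step can be modeled in software as a bitwise majority function: each output bit takes the value held by at least two of the three legs. The sketch below is a minimal model of that idea. In a real design the voter is FPGA logic, and the function name and test words here are hypothetical.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three words (one word per TMR leg)."""
    return (a & b) | (a & c) | (b & c)


good = 0b1011_0010            # value produced by the two healthy legs
upset = good ^ 0b0000_1000    # third leg with one bit flipped, modeling an SEU

voted = majority_vote(good, upset, good)
print(bin(voted), voted == good)
```

The flipped bit in one leg is outvoted by the other two, so the voted result equals the healthy value.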

The BxBFFT's Ease of Use and Productivity Enhancements

The BxBFFT was designed to get you running quickly. It has features to make configuration, synthesis, and simulation faster and easier, saving NRE.

Configuration

The BxBFFT has easy controls for managing amplitude gain that make obtaining high numerical performance easy for most customers. For customers with specific requirements, the BxBFFT also allows precise stage-by-stage shift control. To handle the most demanding applications, amplitude can be managed with dynamic run-time monitoring and dynamic shifting controls. These allow the gain control through the BxBFFT to be adapted to changing input signal environments.

The BxBFFT also has controls to select whether memory gets implemented as URAM, BRAM, or distributed RAM. These controls can be asserted globally, or they can be specifically targeted to individual BxBFFT stages. This helps fit a design in the FPGA, and also helps prevent overly tight resources of one memory type that might lead to longer routes and make it more difficult to meet timing.

Other memory-related controls can eliminate ROM twiddle tables at specific stages in favor of on-the-fly sine/cosine generation. This can save significant amounts of memory for large FFT sizes.
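To illustrate the tradeoff, twiddle factors are the complex exponentials W_N^k = exp(-2*pi*i*k/N). A ROM table stores them precomputed, while on-the-fly generation computes each one when it is needed. The sketch below models both approaches in software and shows how the table grows with FFT size. It is a conceptual model only: the function names are hypothetical, and real hardware tables typically exploit symmetry and fixed-point formats that are not modeled here.

```python
import cmath


def twiddle(n: int, k: int) -> complex:
    """Compute twiddle factor W_n^k = exp(-2*pi*i*k/n) on demand."""
    return cmath.exp(-2j * cmath.pi * k / n)


def build_twiddle_rom(n: int) -> list:
    """Precompute all n twiddle factors, modeling a ROM lookup table."""
    return [twiddle(n, k) for k in range(n)]


N = 4096  # illustrative FFT size
rom = build_twiddle_rom(N)

k = 1234
same_value = rom[k] == twiddle(N, k)  # lookup and on-demand value agree
print(len(rom), same_value)
```

The table in this model needs one entry per twiddle, so its size scales with N, whereas the on-demand function needs no storage at the cost of computation.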

Pipelining can also be configured. The default pipelining works well for most situations, as shown by the high achieved Fmax of the BxBFFT. However, in situations where there is unusually high external resource contention, more pipelining may improve timing. For such a case, BxBFFT pipelining can be increased globally or at specific stages.

Another thing that can be configured is input and output order. Typically "Fully Natural" order is preferred at input and output by most customers. Occasionally "Scrambled" order on BxBFFT output is of benefit, since it can save a significant amount of memory. A "Partially Natural" order is also available. This order is less commonly supported by FFTs, but is highly useful in certain situations. For example, it saved one customer significant processing in a zero-pad operation.

It is also possible to configure whether sample number zero is at the start of the data or is in the center of the data. Having it at the start of the data is the FFT standard. Having it in the middle, with negative data indexes to the left and positive data indexes to the right, is often more useful for data that is centered about a specific focal point. This can be configured separately for input data and for output data.
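The two conventions are related by a circular rotation of the data, the same operation that many software packages call an "FFT shift". The sketch below shows the rotation on a list whose entries are their own sample indexes. It assumes that the centered convention places index zero at position N/2; the exact position used by the BxBFFT is not stated here, and the function names are hypothetical.

```python
def to_centered(data: list) -> list:
    """Rotate zero-at-start order so that sample zero sits in the center."""
    shift = len(data) // 2
    return data[-shift:] + data[:-shift] if shift else list(data)


def to_start(data: list) -> list:
    """Inverse rotation: move sample zero from the center back to the start."""
    shift = len(data) // 2
    return data[shift:] + data[:shift]


# Entries are sample indexes; in zero-at-start order the negative
# indexes occupy the second half of the block.
start_order = [0, 1, 2, 3, -4, -3, -2, -1]
centered = to_centered(start_order)
print(centered)
print(to_start(centered) == start_order)
```

After the rotation, negative indexes run to the left of zero and positive indexes to the right, as described above.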

Whether the BxBFFT is a forward FFT or inverse FFT is another compile-time selection.

BxBFFT data width can be selected between 18 bits and 27 bits. This is a tradeoff between resources, FFT numerical accuracy, and ease-of-use. It is generally of benefit to start a design at 27 bits, which brings up a design easily with good numerical performance and no risk of overflow. The design can then be optimized to lower numbers of bits to reduce resources and power, while observing the effect on numerical accuracy.
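One way to get a feel for the accuracy side of this tradeoff is to look at the quantization step of a signed fixed-point number at each width. The sketch below rounds an arbitrary test value to 18 and to 27 bits and prints the step size and the resulting error. It assumes a simple signed fractional format in the range [-1, 1); the BxBFFT's actual internal number format and rounding are not described here, and the function name is hypothetical.

```python
def quantize(value: float, bits: int) -> float:
    """Round value in [-1, 1) to a signed fixed-point number of the given width."""
    scale = 2 ** (bits - 1)  # integer codes span -scale .. scale - 1
    code = max(-scale, min(scale - 1, round(value * scale)))
    return code / scale


sample = 0.3  # arbitrary test value
for bits in (18, 27):
    step = 2.0 ** -(bits - 1)                       # size of one quantization step
    error = abs(quantize(sample, bits) - sample)    # rounding error for this value
    print(f"{bits} bits: step={step:.3e}, error={error:.3e}")
```

Each extra bit halves the step size, so in this model 27-bit data has a step 512 times finer than 18-bit data.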

Synthesis

BxBFFTs have more timing margin than competitors. This additional timing margin is what allows BxBFFTs to achieve high Fmax and thus high throughput. Timing margin also means that place and route steps don't need to work as hard to meet desired timing constraints. As a result, FPGA implementation time is shorter. How much shorter depends on the FPGA vendor, so additional data is on the vendor-specific BxBFFT web pages.

Simulation

Simulation of the BxBFFT is faster than simulation of competing FFTs, which can save significant engineering time in product design and testing. Even more important is the time it might save in long verification runs. The fast simulation speed is due to the simple and direct nature of the BxBFFT's SystemVerilog code.

The BxBFFT is tested with several simulators, including Xilinx XSim, Altera Questa, Icarus Verilog, and Verilator. Verilator support is especially important, since it can provide immense speed increases of long simulations.

Below is a graph showing simulation time of various FFTs relative to the BxBFFT. In this case, simulation was with Icarus Verilog for SystemVerilog FFTs, and Xilinx XSim for VHDL FFTs. In most cases the FFTs simulate significantly slower than the BxBFFT, and in some cases immensely slower.

FFT Simulation Times

Comprehensive BxBFFT Delivery Package

The BxBFFT ships as a very comprehensive package, intended to foresee all customer needs.

A customer ordering a BxBFFT chooses an FFT size, chooses the parallelism in Points Per Clock (PPC), and chooses whether the BxBFFT is fully complex or real-to-complex. Sometimes the customer adds constraints, such as minimizing LUT usage or minimizing memory usage. Bit by Bit Signal Processing finds the combination of radix stages and optimizations that gives the lowest power and resources for those parameters, and generates and delivers the BxBFFT. One of the reasons for the BxBFFT's high performance is that these parameters are set at delivery time, so optimizations can be performed specific to a BxBFFT's size and PPC. Other FFTs that use the same code or the same design for all FFT sizes miss out on these size-specific optimizations.

Most other settings are user-alterable, as parameters at the BxBFFT's top level. These include forward/inverse, input/output data order, whether input/output zero position is at left or in the center, the data bit width, settings to manage and control signal gain, pipelining control, memory implementation control, selection of AXIS-standard I/O interface or simpler BxB I/O interface, and a stage-by-stage selection of using normal ROM twiddles or on-the-fly generated twiddles. All settings begin at reasonable defaults, to get designs working quickly.

The code for a BxBFFT is a single SystemVerilog file with several associated data files for twiddle ROM tables. The small number of files keeps the delivery neat and file management easy. Internal names are mangled to prevent name conflicts with other BxBFFTs, with different BxBFFT versions, or with other customer IP. Since the code is standard SystemVerilog, it is readily usable in customer development flows and is friendly to third-party tools. The BxBFFT's code is tested with multiple simulation and synthesis tools, to help ensure its wide portability.

The delivery also includes C++ and MATLAB BxBFFT models, which are faster to simulate.

The delivery includes many tests of the SystemVerilog, C++, and MATLAB simulation models. These tests verify that all models work and that they give identical results. The tests also serve as examples of how to connect to the model, configure it, and get data in and out.

Tests are also included to show that Vivado synthesizes the core correctly. Simulations of the Vivado-produced post-route netlist verify that Vivado has correctly synthesized the BxBFFT's code. The synthesis runs also give other information, such as the quantity of FPGA resources used by the BxBFFT and the achieved Fmax.

For Xilinx FPGAs, a Xilinx IP Integrator model is also included. For those using Xilinx block designs, this is the fastest way to instantiate and configure a BxBFFT.

Finally, there is extensive documentation regarding how to set up and configure the BxBFFT.

ASICs

The BxBFFT was optimized first to be an excellent FFT, and then Xilinx or Altera optimizations were added on top of this. Thus many of the BxBFFT advantages carry over not just to other FPGA product lines but also to ASICs. Porting the BxBFFT to ASICs requires re-optimization of low-level functional elements such as memories, real multiplies, and complex multiplies to the libraries that come with the ASIC process. The BxBFFT's capability to be targeted to ASICs in this way was shown when the same techniques were used to re-optimize the Xilinx implementation to support Altera FPGAs.

The BxBFFT design is fully pipelined, and this pipelining allows the highest timing margins to be achieved, and the highest ASIC clock rates to be achieved. Alternately, high timing margins allow ASIC voltage to be reduced while still meeting timing, for the lowest ASIC power consumption.

Bit by Bit Signal Processing is interested in ports to other FPGA lines or to ASICs, if a sufficient business case exists. If this would significantly benefit your business, contact us.

Pricing

BxBFFT pricing is intended to make FFTs available for all professional uses at reasonable cost. If you think prices are unreasonable for your project, send an email with a justification for a different pricing scheme, and we'll discuss it.

Academic / Educational

The BxBFFT is available for small academic projects for US$1000 per BxBFFT. License terms will require that the BxBFFT is cited in papers to which the BxBFFT contributed, and that Bit by Bit Signal Processing should receive copies of any performance measurements made that are related to the BxBFFT. Distribution rights are not included with academic pricing. Bit by Bit Signal Processing will have rights to use information from academic projects to make BxBFFT advantages known for marketing purposes. Support for academic projects is at a lower priority than commercial jobs.

Commercial

Commercial companies can get access to the entire range of BxBFFTs, with binary distribution rights and support, for US$15000 per year. Rights are purchased 3 years ahead, so the first-year cost is US$45000, and then it is US$15000 each year thereafter. Support ends after payments cease, and distribution rights end 3 years after payments cease. (These prices may be increased periodically to match inflation.) A wide range of power-of-2 BxBFFTs is immediately available after purchase. Non-power-of-2 BxBFFTs are generated at customer request with modest lead time, since there are too many to have them all pre-generated.

Alternately, BxBFFTs for a specific FFT size and speed can be purchased individually for commercial development or distribution. BxBFFTs in a selected FPGA family without distribution rights are US$2000 each, with 1 year support. The price drops to US$1500 each for 3 to 8 FFT sizes, and US$1200 each for 9 or more BxBFFT sizes.

With never-ending distribution rights for unlimited products in the selected FPGA family, BxBFFTs are US$15000 each for 1 or 2 sizes, US$10000 for 3 to 8 sizes, and US$8000 each for 9 or more sizes.

These prices reflect a discount for allowing Bit by Bit Signal Processing to publicize the relationship for marketing purposes. Purchases that must be kept secret will have slightly higher cost.

Other arrangements are possible to match your business needs. If you would like to propose an alternate arrangement, please do so.

Military

Purchases that could see applications with non-U.S. militaries will need to be reviewed for compliance with U.S. export law. Otherwise, this is the same as commercial applications.

Conclusions

The results on these pages illustrate how the BxBFFT is superior in most ways to other FFTs in Xilinx Ultrascale/Ultrascale+ FPGAs. It uses less power, uses fewer resources, and attains higher speeds. It is unmatched at almost all FFT sizes and speeds. It is unmatched in supported features. It is also cross-platform, supporting both Xilinx and Altera FPGAs, with a path into ASICs.

Many of the results on this web site are also available in this PDF presentation.

Links

Bit by Bit Signal Processing Main Page
BxBFFT Product Main Page with these pages for specific FPGAs:
Xilinx Ultrascale FPGAs
Xilinx Versal FPGAs
Altera Agilex7 FPGAs
Altera Stratix10 FPGAs
Altera Arria10 FPGAs
BxBFFT Product Comparison PDF
BxBApp Demonstration
Tutorials
Email Contact: ross@bitbybitsp.com
Phone Contact: +1-623-487-8011 (this has automated call screening)