Verilog 2001 Coding Style to Infer LatticeECP2 sysDSP Blocks
with Synplify & Synplify Pro
Troy Scott, Lattice Semiconductor
Expressing a DSP design as RTL code is an elegant way to model it. The
Lattice Semiconductor LatticeECP2 (EConomy Plus 2nd generation) family
integrates features and capabilities previously available only on higher-cost,
high-performance FPGAs. A key example is the sysDSP block, a high-performance
circuit that supports multiply, add, subtract, and accumulate operations in
widths up to 36x36. The ECP2 consists of an FPGA fabric coupled with between
three and twenty-two sysDSP blocks (see http://www.latticesemi.com/products/fpga/ecp2/index.cfm).
Developing DSP algorithms in HDL makes designs easier to maintain over
time, provides clear documentation, and makes your design less reliant
on proprietary building blocks. This article provides several arithmetic samples
in Verilog 2001 HDL that take advantage of LatticeECP2 sysDSP blocks.
sysDSP Block Architecture
The sysDSP block in the LatticeECP2 family supports four functional
elements in three (x9, x18 and x36) data path widths. Operands can be either
signed or unsigned but not mixed within a functional element. Similarly, the
operand widths cannot be mixed within a block. sysDSP blocks may be concatenated
to form larger operations.
Figure 1. Lattice Semiconductor sysDSP Block
The resources in each sysDSP block can be configured to support
one of the following four elements: MULT (Multiply), MAC (Multiply, Accumulate),
MULTADD (Multiply, Addition/Subtraction), or MULTADDSUM (Multiply, Addition/Subtraction,
Accumulate). A global clock, clock enable, and reset signals from FPGA routing
are available to every DSP block and are applied to each input register, pipeline
register, and output register.
The DSP block also supports signed and unsigned multipliers of widths
other than x9, x18, and x36. For unsigned operands, the unused upper data bits
should be filled with zeros to create a valid x9, x18, or x36 operand. For signed
two’s complement operands, the most significant bit should be sign-extended
until the x9, x18, or x36 width is reached. The number of elements
available per block depends on the data path width; for example, when an x9 data
path is used, two MULTADDSUM elements are available per block.
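The operand padding described above can be sketched in RTL. This is a minimal illustration, not a Lattice-supplied module: the module name pad_to_x18, the port names, and the 12-bit operand width are all assumptions chosen for the example, and zero fill is assumed for the unsigned case because it is the fill that preserves the operand value.

```verilog
// Hypothetical example: widen 12-bit operands to the x18 data path width.
// Module name, port names, and the 12-bit width are illustrative only.
module pad_to_x18 (a_s, b_s, a_u, b_u, p_s, p_u);
  input  signed [11:0] a_s, b_s;   // signed two's complement operands
  input         [11:0] a_u, b_u;   // unsigned operands
  output signed [35:0] p_s;        // signed 18x18 product
  output        [35:0] p_u;        // unsigned 18x18 product

  // Signed: replicate the sign bit (MSB) into the six unused upper bits.
  wire signed [17:0] a_s18 = {{6{a_s[11]}}, a_s};
  wire signed [17:0] b_s18 = {{6{b_s[11]}}, b_s};

  // Unsigned: fill the six unused upper bits with zeros (assumed fill value).
  wire [17:0] a_u18 = {6'b0, a_u};
  wire [17:0] b_u18 = {6'b0, b_u};

  // Each multiplier uses operands of one signedness only.
  assign p_s = a_s18 * b_s18;
  assign p_u = a_u18 * b_u18;
endmodule
```

Both forms leave the numeric value of the operand unchanged, so the x18 product equals the product of the original 12-bit operands.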
Verilog 2001 Signed Arithmetic Support
Language extensions in Verilog 2001 ease modeling of arithmetic
expressions. The Synplify and Synplify Pro tools support signed arithmetic
based on signed data types.
Verilog 1995 provides only one signed data type, the integer variable.
The net and reg data types are considered unsigned. Integers with a base format
notation are also considered unsigned. In Verilog 1995, signed arithmetic can
therefore be done only with 32-bit integers. In Verilog 2001, reg and net data
types can be declared using the reserved signed keyword.
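A small simulation sketch may help show the difference. The module name signed_demo and all signal names are hypothetical; the $signed system function used for the cast is part of Verilog 2001.

```verilog
// Hypothetical simulation-only demo of Verilog 2001 signed declarations.
module signed_demo;
  reg         [7:0] a_u, b_u;    // unsigned regs (the only option in Verilog 1995)
  reg  signed [7:0] a_s, b_s;    // Verilog 2001: regs declared with the signed keyword

  wire        [15:0] p_u    = a_u * b_u;                    // unsigned multiply
  wire signed [15:0] p_s    = a_s * b_s;                    // signed multiply
  wire signed [15:0] p_cast = $signed(a_u) * $signed(b_u);  // cast unsigned to signed

  initial begin
    a_u = 8'hFE; b_u = 8'h02;    // bit pattern FE read as unsigned is 254
    a_s = -2;    b_s = 2;        // the same bit pattern FE read as signed is -2
    #1;
    $display("unsigned product: %0d", p_u);     // 254 * 2 = 508
    $display("signed product:   %0d", p_s);     // -2 * 2 = -4
    $display("cast product:     %0d", p_cast);  // -2 * 2 = -4
  end
endmodule
```

The same 8-bit pattern yields different products depending only on how the operands are declared, which is why the examples that follow declare every port and register in the arithmetic path as signed.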
RTL Examples
The sysDSP element block diagrams of the LatticeECP2 Family Data
Sheet provide a good guide for writing RTL that will target each block. The
diagrams illustrate the pipeline stages, multipliers, add/sub, or sum operators
available. While not all sysDSP configurations can be inferred via synthesis,
many useful models are possible.
1. Registered Multiply/Accumulate. This function is commonly
used in FIR filters and other DSP functions.
module mult_acc (dataout, data_a0, data_a1, clk, rst);
  parameter m = 18;
  parameter n = 18;
  output signed [(m+n+16)-1:0] dataout;
  input  signed [m-1:0] data_a0;
  input  signed [n-1:0] data_a1;
  input  clk;
  input  rst;
  reg  signed [(m+n+16)-1:0] dataout;
  wire signed [(m+n+16)-1:0] acc_out;
  wire signed [(m+n)-1:0] multa;
  // Multiply logic
  assign multa = data_a0 * data_a1;
  // Accumulate logic
  assign acc_out = multa + dataout;
  // Registered output
  always @(posedge rst or posedge clk)
  begin : SEQ_MULT_ACC
    if (rst)
      dataout <= 0;
    else
      dataout <= acc_out;
  end
endmodule
The resulting logic from this model can be mapped directly into
a single sysDSP block, MULT18X18MACB. The mult_acc model produces a circuit
that will run at just over 300 MHz with the -5 speed grade. Additional
sysDSP block options such as input registers and pipeline registers can be added
to this model to create a higher-performance circuit.
2. Fully Pipelined Multiply/Accumulate. You can
improve the performance of the Registered Multiply/Accumulate model by taking
advantage of sysDSP block pipeline registers. The following RTL code uses one
level of registers at the data_a0 and data_a1 inputs, a pipeline register (multa)
at the multiplier output, and the output register, dataout:
module mult_acc_pipe (dataout, data_a0, data_a1, clk, rst);
  parameter m = 18;
  parameter n = 18;
  output signed [(m+n+16)-1:0] dataout;
  input  signed [m-1:0] data_a0;
  input  signed [n-1:0] data_a1;
  input  clk;
  input  rst;
  reg  signed [(m+n+16)-1:0] dataout;
  wire [(m+n+16)-1:0] acc_out;
  reg  signed [(m+n)-1:0] multa;
  reg  signed [m-1:0] data_a0r;
  reg  signed [n-1:0] data_a1r;
  // Accumulate logic
  assign acc_out = multa + dataout;
  always @(posedge clk or posedge rst)
  begin : SEQ_MULT_ACC_PIPE
    if (rst) begin
      data_a0r <= 0;
      data_a1r <= 0;
      multa    <= 0;
      dataout  <= 0;
    end
    else begin
      data_a0r <= data_a0;
      data_a1r <= data_a1;
      // Multiply logic
      multa    <= data_a0r * data_a1r;
      dataout  <= acc_out;
    end
  end
endmodule
By taking advantage of internal registers, performance improves
to more than 390 MHz with the ECP2’s -5 speed grade, independent
of the implementation (place and route) tools. This example synthesizes
into a single sysDSP block.
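The effect of the extra register stages on latency can be checked in simulation. The testbench below is a sketch rather than part of the original article: the name tb_mult_acc_pipe, the clock period, and the stimulus values are assumptions, and it must be compiled together with the mult_acc_pipe listing above using its default 18-bit parameters.

```verilog
`timescale 1ns/1ps
// Hypothetical testbench; compile together with the mult_acc_pipe module.
module tb_mult_acc_pipe;
  reg  signed [17:0] a, b;
  reg                clk, rst;
  wire signed [51:0] acc;        // (m+n+16) bits with m = n = 18

  mult_acc_pipe dut (.dataout(acc), .data_a0(a), .data_a1(b),
                     .clk(clk), .rst(rst));

  always #5 clk = ~clk;          // 10 ns period, rising edges at 5, 15, 25, ...

  initial begin
    clk = 0; rst = 1; a = 0; b = 0;
    #12 rst = 0;                 // release reset after the first rising edge

    // Present 3 * -4 to the inputs for exactly four rising clock edges.
    @(negedge clk); a = 3; b = -4;
    repeat (4) @(negedge clk);
    a = 0; b = 0;

    // Let the input, pipeline, and output registers drain.
    repeat (4) @(negedge clk);
    $display("accumulated = %0d", acc);   // four products of -12 sum to -48
    $finish;
  end
endmodule
```

Each operand pair passes through the input register, the multa pipeline register, and the dataout register, so a product reaches the accumulator three clock edges after it is applied. The added latency is the cost of the higher clock rate.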
3. Fully Pipelined Multiply/Add. This example
illustrates the coding style to infer a sysDSP block with Multiply/Add functionality.
The RTL can be modified for a high-performance Multiply/Sub:
// The module header and parameter declarations below are assumed (the name
// mult_addsub_pipe is illustrative); m = n = 18 matches the MULT18X18ADDSUBB
// target named in the text.
module mult_addsub_pipe (dataout, data_a0, data_a1, data_b0, data_b1, clk, rst);
  parameter m = 18;
  parameter n = 18;
  output signed [(m+n)-1:0] dataout;
  input  signed [m-1:0] data_a0;
  input  signed [n-1:0] data_a1;
  input  signed [m-1:0] data_b0;
  input  signed [n-1:0] data_b1;
  input  clk;
  input  rst;
  reg signed [m-1:0] data_a0r;
  reg signed [n-1:0] data_a1r;
  reg signed [m-1:0] data_b0r;
  reg signed [n-1:0] data_b1r;
  reg signed [(m+n)-1:0] mult_outa;
  reg signed [(m+n)-1:0] mult_outb;
  reg signed [(m+n):0] addsub_out;
  always @(posedge clk or posedge rst)
  begin : SEQ_MULT_ADDSUB_PIPE
    if (rst) begin
      data_a0r   <= 0;
      data_a1r   <= 0;
      data_b0r   <= 0;
      data_b1r   <= 0;
      mult_outa  <= 0;
      mult_outb  <= 0;
      addsub_out <= 0;
    end
    else begin
      data_a0r   <= data_a0;
      data_a1r   <= data_a1;
      data_b0r   <= data_b0;
      data_b1r   <= data_b1;
      mult_outa  <= data_a0r * data_a1r;
      mult_outb  <= data_b0r * data_b1r;
      // For Multiply/Sub, use this line instead of the addition below:
      // addsub_out <= mult_outa - mult_outb;
      addsub_out <= mult_outa + mult_outb;
    end
  end
  // dataout carries the low (m+n) bits of addsub_out
  assign dataout = addsub_out;
endmodule
The Multiply/Add will run at over 390 MHz with the ECP2’s -5
speed grade. This example synthesizes into a single MULT18X18ADDSUBB
sysDSP block.
4. Basic FIR Filter. Because sysDSP blocks have
closely integrated multipliers and adders, filters can be implemented with minimal
routing resources and delays. Given that a sysDSP block provides 4 multipliers
and 3 adders, a 4-tap filter can be implemented in a single block.
The samples so far use parallel operands; in FIR filter design, however,
it is common to use an input term that is shifted in from a memory holding
coefficient values. Each input register of the sysDSP block provides a shiftout
output that connects to the shiftin input of the adjacent input register of
the same sysDSP block. The registers on the boundaries of a sysDSP block also
connect to the registers of adjacent DSP blocks through the use of shiftin/shiftout
connections. These connections create register chains spanning multiple DSP
blocks, which makes it easy to increase the length of FIR filters. The Synplify
and Synplify Pro tools will infer the shift register input ports of the sysDSP
block when the model uses a cascade style as shown:
module systolic_fir (y, h7, h6, h5, h4, h3, h2, h1, h0, x, clk, rst);
  parameter tap = 8;
  // FIR data path width
  parameter m = 9;
  parameter n = 9;
  defparam inst0.m = 9;
  defparam inst0.n = 9;
  defparam inst1.m = 9;
  defparam inst1.n = 9;
  output signed [m+n+2:0] y;
  input  signed [m-1:0] h7, h6, h5, h4, h3, h2, h1, h0;
  input  signed [n-1:0] x;
  input  clk;
  input  rst;
  reg  signed [m+n+2:0] y;
  wire [m+n+1:0] mass0;
  wire [m+n+1:0] mass1;
  wire [m-1:0] h [0:tap-1];
  wire [n-1:0] PCOUT_INT [0:1];

  // Declarations of the four-multiplier stage module (instances inst0 and inst1)
  output signed [(m+n)+1:0] dataout;
  output signed [m-1:0] shreg_aout;
  input  signed [m-1:0] shreg_ain;
  input  signed [n-1:0] data_b0;
  input  signed [n-1:0] data_b1;
  input  signed [n-1:0] data_b2;
  input  signed [n-1:0] data_b3;
  input  clk;
  input  rst;
  reg  signed [(m+n)+1:0] dataout;
  wire [(m+n):0] addouta;
  wire [(m+n):0] addoutb;
  reg signed [(m+n)-1:0] multa;
  reg signed [(m+n)-1:0] multb;
  reg signed [(m+n)-1:0] multc;
  reg signed [(m+n)-1:0] multd;
  reg signed [m-1:0] shreg_ain_r0;
  reg signed [m-1:0] shreg_ain_r1;
  reg signed [m-1:0] shreg_ain_r2;
  reg signed [m-1:0] shreg_ain_r3;
  reg signed [n-1:0] data_b0r;
  reg signed [n-1:0] data_b1r;
  reg signed [n-1:0] data_b2r;
  reg signed [n-1:0] data_b3r;
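To illustrate the cascade style in isolation, the following is a minimal sketch of one four-tap stage. It is not the article's original listing: the module name fir4_cascade_sketch and the register names sh0 to sh3 are hypothetical, the port names merely mirror the declarations above, coefficient registers are omitted for brevity, and whether a given tool version maps the register chain onto the shiftin/shiftout ports has not been verified here.

```verilog
// Hypothetical four-tap stage written in a cascade (shift chain) style.
module fir4_cascade_sketch (dataout, shreg_aout, shreg_ain,
                            data_b0, data_b1, data_b2, data_b3, clk, rst);
  parameter m = 9;   // width of the shifted operand
  parameter n = 9;   // width of the parallel operands

  output signed [(m+n)+1:0] dataout;
  output signed [m-1:0]     shreg_aout;   // feeds shreg_ain of the next stage
  input  signed [m-1:0]     shreg_ain;
  input  signed [n-1:0]     data_b0, data_b1, data_b2, data_b3;
  input                     clk, rst;

  reg signed [(m+n)+1:0] dataout;
  reg signed [m-1:0]     sh0, sh1, sh2, sh3;          // input register chain
  reg signed [(m+n)-1:0] multa, multb, multc, multd;  // pipeline registers

  // Two first-level adders; the final sum is registered in dataout.
  wire signed [(m+n):0] addouta = multa + multb;
  wire signed [(m+n):0] addoutb = multc + multd;

  assign shreg_aout = sh3;

  always @(posedge clk or posedge rst)
  begin : SEQ_FIR4
    if (rst) begin
      sh0 <= 0; sh1 <= 0; sh2 <= 0; sh3 <= 0;
      multa <= 0; multb <= 0; multc <= 0; multd <= 0;
      dataout <= 0;
    end
    else begin
      // Each input register takes its value from its neighbor.
      sh0 <= shreg_ain;
      sh1 <= sh0;
      sh2 <= sh1;
      sh3 <= sh2;
      // One multiplier per tap.
      multa <= sh0 * data_b0;
      multb <= sh1 * data_b1;
      multc <= sh2 * data_b2;
      multd <= sh3 * data_b3;
      dataout <= addouta + addoutb;
    end
  end
endmodule
```

Connecting shreg_aout of one stage to shreg_ain of the next extends the register chain, which is how two such stages could form an eight-tap filter.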
The systolic FIR filter design resource usage reports from the
Synplify and Synplify Pro tools and the ispLEVER design mapper (MAP) are shown
in the figures below:
Design Summary
--------------
   Number of registers:        75
      PFU registers:           54
      PIO registers:           21
   Number of SLICEs:           40 out of 23976 (0%)
      SLICEs(logic/ROM):       40 out of 18144 (0%)
      SLICEs(logic/ROM/RAM):    0 out of 5832 (0%)
         As RAM:                0 out of 4374 (0%)
         As Logic/ROM:          0
   Number of logic LUT4s:       2
   Number of distributed RAM:   0 (0 LUT4s)
   Number of ripple logic:     11 (22 LUT4s)
   Number of shift registers:   0
   Total number of LUT4s:      24
   Number of external PIOs:   104 out of 500 (21%)
   Number of PIO IDDR/ODDR:     0
   Number of PIO FIXEDDELAY:    0
   Number of 3-state buffers:   0
   Number of PLLs:              0 out of 4 (0%)
   Number of DLLs:              0 out of 2 (0%)
   Number of block RAMs:        0 out of 21 (0%)
   Number of CLKDIVs:           0 out of 2 (0%)
   Number of GSRs:              1 out of 1 (100%)
   JTAG used :                  Yes
   Readback used :              No
   Oscillator used :            No
   Startup used :               No
Number Of Mapped DSP Components:
--------------------------------
   MULT36X36B            0
   MULT18X18B            0
   MULT18X18MACB         0
   MULT18X18ADDSUBB      0
   MULT18X18ADDSUBSUMB   0
   MULT9X9B              0
   MULT9X9ADDSUBB        0
   MULT9X9ADDSUBSUMB     2
--------------------------------
Figure 3. ispLEVER MAP Report
The RTL examples shown can be used in a wide range of applications
and are efficiently synthesized by the Synplify and Synplify Pro tools. Most
logic is mapped into sysDSP blocks to minimize utilization of generic programmable
function units (PFUs). The performance for each sysDSP block is independent
of the place and route tools. To make it easier for synthesis tools to recognize
the sysDSP structure, it is important to write the code in a manner that reflects
the target hardware implementation.
Conclusion
This article demonstrates Verilog 2001 RTL models for DSP functions
that will cause the Synplify and Synplify Pro tools to infer sysDSP blocks of
the LatticeECP2 device. Designs targeting the sysDSP block can offer significant
improvement over traditional LUT-based implementations. This article contains
excerpts from Lattice Semiconductor data sheets and the application note “LatticeECP2
sysDSP Usage Guide”, available from the links below.
Synplicity, Synplicity FPGA Synthesis, Synplify for Lattice, Reference Guide
Synplicity, Verilog 2001 Feature Update
Troy Scott has been helping design, document, QA and promote EDA
products for about 14 years. He is a Product Marketing Engineer at Lattice Semiconductor
Corporation. He welcomes feedback and can be reached at troy.scott@latticesemi.com