Verilog 2001 Coding Style to Infer LatticeECP2 sysDSP Blocks
with Synplify & Synplify Pro

Troy Scott, Lattice Semiconductor

Expressing your DSP design as RTL code is an elegant way to model a design. The Lattice Semiconductor LatticeECP2 (EConomy Plus 2nd generation) family integrates features and capabilities previously available only on higher-cost, high-performance FPGAs. A key example is the sysDSP block, a high-performance circuit that supports multiply, add, subtract, and accumulate operations in widths up to 36x36. The ECP2 consists of an FPGA fabric coupled with between three and twenty-two sysDSP blocks (http://www.latticesemi.com/products/fpga/ecp2/index.cfm).

Developing DSP algorithms in HDL makes designs easier to maintain over time, provides clear documentation, and makes your design less reliant on proprietary building blocks. This article provides several arithmetic samples in Verilog 2001 HDL that take advantage of LatticeECP2 sysDSP blocks.

sysDSP Block Architecture

The sysDSP block in the LatticeECP2 family supports four functional elements in three data path widths (x9, x18, and x36). Operands can be either signed or unsigned but not mixed within a functional element. Similarly, the operand widths cannot be mixed within a block. sysDSP blocks may be concatenated to form larger operations.

Figure 1. Lattice Semiconductor sysDSP Block

The resources in each sysDSP block can be configured to support one of the following four elements: MULT (Multiply), MAC (Multiply, Accumulate), MULTADD (Multiply, Addition/Subtraction), or MULTADDSUM (Multiply, Addition/Subtraction, Accumulate). Global clock, clock enable, and reset signals from FPGA routing are available to every DSP block and are applied to each input register, pipeline register, and output register.

The DSP block also supports signed and unsigned multipliers of widths other than x9, x18, and x36. For unsigned operands, unused upper data bits should be filled with zeros to create a valid x9, x18, or x36 operand. For signed two’s complement operands, sign extension of the most significant bit should be performed until x9, x18, or x36 width is reached. The number of elements available per block depends on the data path width; for example, when an x9 data path is used, two MULTADDSUM elements are available per block.
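
The padding rules above can be sketched in a few lines of RTL. This is a minimal illustration only, not code from the data sheet: the module name extend_to_x18 and the parameter w are hypothetical, and the sketch assumes a source operand narrower than 18 bits.

```verilog
// Hypothetical helper (not a Lattice primitive): pads a narrow operand
// out to the x18 data path width. Assumes w < 18.
module extend_to_x18 (s_out, u_out, s_in, u_in);
   parameter w = 12;                  // assumed source operand width

   output signed [17:0]  s_out;       // sign-extended result
   output        [17:0]  u_out;       // zero-filled result
   input  signed [w-1:0] s_in;        // two's complement operand
   input         [w-1:0] u_in;        // unsigned operand

   // Signed: replicate the most significant (sign) bit into the upper bits.
   assign s_out = {{(18-w){s_in[w-1]}}, s_in};

   // Unsigned: fill the unused upper bits with zeros.
   assign u_out = {{(18-w){1'b0}}, u_in};
endmodule
```

With Verilog 2001 signed data types, a plain assignment from a narrower signed operand to a wider signed one performs the same sign extension implicitly; the explicit replication is shown here only to make the rule visible.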

Verilog 2001 Signed Arithmetic Support

Language extensions in Verilog 2001 ease modeling of arithmetic expressions. The Synplify and Synplify Pro tools support signed arithmetic based on signed data types.

Verilog 1995 provides only one signed data type, the integer variable. The net and reg data types are considered unsigned, as are integers written with a base format notation. In Verilog 1995, signed arithmetic can be done with 32-bit integers. In Verilog 2001, reg and net data types can be declared using the reserved signed keyword.
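
As a small simulation-only sketch of the difference, the following module multiplies the same bit patterns once as unsigned regs and once as Verilog 2001 signed regs. The module and signal names are illustrative and do not come from the examples in this article.

```verilog
// Simulation-only sketch; names are illustrative.
module signed_demo;
   reg        [7:0]  a_u, b_u;            // unsigned operands
   reg signed [7:0]  a_s, b_s;            // Verilog 2001 signed operands
   reg signed [15:0] p_from_u, p_from_s;  // 16-bit results

   initial begin
      a_u = 8'hFE;   b_u = 8'h03;         // 254 and 3
      a_s = -8'sd2;  b_s = 8'sd3;         // same bit patterns, read as -2 and 3
      p_from_u = a_u * b_u;               // unsigned multiply: 762
      p_from_s = a_s * b_s;               // signed multiply: -6
      $display("unsigned: %0d  signed: %0d", p_from_u, p_from_s);
   end
endmodule
```

Both multiplies start from the bit pattern 8'hFE; only the declared signedness of the operands decides whether it is extended as 254 or as -2 before the multiply.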

RTL Examples

The sysDSP element block diagrams of the LatticeECP2 Family Data Sheet provide a good guide for writing RTL that will target each block. The diagrams illustrate the pipeline stages, multipliers, add/sub, or sum operators available. While not all sysDSP configurations can be inferred via synthesis, many useful models are possible.

1. Registered Multiply/Accumulate. This function is commonly used in FIR filters and other DSP functions.

module mult_acc (dataout, data_a0, data_a1, clk, rst);
    parameter m = 18;
    parameter n = 18;

    output signed [(m+n+16)-1:0] dataout;
    input signed [m-1:0] data_a0;
    input signed [n-1:0] data_a1;
    input clk;
    input rst;

    reg signed [(m+n+16)-1:0] dataout;

    // Multiply logic
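    // Declared signed so that a negative product is sign-extended,
    // not zero-extended, when it is widened for the accumulate.
    wire signed [m+n-1:0] multa = data_a0 * data_a1;
    wire signed [(m+n+16)-1:0] acc_out;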

    // Accumulate logic
    assign acc_out = multa + dataout;

    // Registered output
    always @(posedge rst or posedge clk)
    begin : SEQ_MULT_ACC
      if (rst)
        dataout <= 0;
      else
        dataout <= acc_out;
    end

endmodule

The resulting logic from this model can be mapped directly into a single sysDSP block, MULT18X18MACB. The mult_acc model produces a circuit that will run at just over 300 MHz with the -5 speed grade. Additional sysDSP block options such as input registers and pipeline registers can be added to this model to create a higher-performance circuit.
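
Before synthesis, a short simulation can confirm that the model accumulates as intended. The testbench below is a minimal smoke-test sketch, not part of the original article: the module name mult_acc_tb, the 10 ns clock, and the stimulus values are assumptions chosen for illustration, and it relies on the default m = n = 18 parameters of mult_acc.

```verilog
`timescale 1ns/1ps
// Hypothetical smoke test for the mult_acc model above.
module mult_acc_tb;
   reg  signed [17:0] a, b;
   reg                clk, rst;
   wire signed [51:0] acc;            // m+n+16 = 52 bits with the defaults

   mult_acc dut (.dataout(acc), .data_a0(a), .data_a1(b),
                 .clk(clk), .rst(rst));

   always #5 clk = ~clk;              // assumed 10 ns clock period

   initial begin
      clk = 0; rst = 1; a = 0; b = 0;
      #12 rst = 0;                    // release the asynchronous reset
      a = 18'sd3; b = 18'sd4;         // constant product of 12
      @(posedge clk); #1;             // first accumulate:  12
      @(posedge clk); #1;             // second accumulate: 24
      $display("acc = %0d", acc);     // prints "acc = 24"
      $finish;
   end
endmodule
```

Holding the operands constant makes the accumulator grow by the same product on every clock edge, which is an easy pattern to check by eye in a waveform viewer.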

2. Fully Pipelined Multiply/Accumulate. You can improve the performance of the Registered Multiply/Accumulate model by taking advantage of sysDSP block pipeline registers. The following RTL code uses one level of registers at the data_a0 and data_a1 inputs and a pipeline register, multa, at the multiplier output, in addition to the output register, dataout:

module mult_acc_pipe (dataout, data_a0, data_a1, clk, rst);
   parameter m = 18;
   parameter n = 18;

   output signed [(m+n+16)-1:0] dataout;
   input signed [m-1:0] data_a0;
   input signed [n-1:0] data_a1;
   input clk;
   input rst;

   reg signed [(m+n+16)-1:0] dataout;
   wire [(m+n+16)-1:0] acc_out;
   reg signed [(m+n)-1:0] multa;
   reg signed [m-1:0] data_a0r;
   reg signed [n-1:0] data_a1r;

   // Accumulate logic
   assign acc_out = multa + dataout;

   always @(posedge clk or posedge rst)
   begin : SEQ_MULT_ACC_PIPE
      if (rst) begin
         data_a0r <= 0;
         data_a1r <= 0;
         multa <= 0;
         dataout <= 0;
      end
      else begin
         data_a0r <= data_a0;
         data_a1r <= data_a1;
         // Multiply logic
         multa <= data_a0r * data_a1r;
         dataout <= acc_out;
      end
   end
endmodule

By taking advantage of internal registers, performance is improved to more than 390 MHz with the ECP2’s -5 speed grade, independent of the implementation (place and route) tools. This example is synthesized into a single sysDSP block.

3. Fully Pipelined Multiply/Add. This example illustrates the coding style to infer a sysDSP block with Multiply/Add functionality. The RTL can be modified for a high-performance Multiply/Sub:

module mult_addsub (dataout, data_a0, data_a1, data_b0, data_b1, clk, rst)
/* synthesis syn_preserve = 1 */;
   parameter m = 18;
   parameter n = 18;

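   // One bit wider than a single product, matching addsub_out,
   // so the sum of two products is not truncated.
   output signed [(m+n):0] dataout;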
   input signed [m-1:0] data_a0;
   input signed [n-1:0] data_a1;
   input signed [m-1:0] data_b0;
   input signed [n-1:0] data_b1;
   input clk;
   input rst;

   reg signed [m-1:0] data_a0r;
   reg signed [n-1:0] data_a1r;
   reg signed [m-1:0] data_b0r;
   reg signed [n-1:0] data_b1r;
   reg signed [(m+n)-1:0] mult_outa;
   reg signed [(m+n)-1:0] mult_outb;
   reg signed [(m+n):0] addsub_out;

   always @(posedge clk or posedge rst)
   begin : SEQ_MULT_ADDSUB_PIPE
      if (rst) begin
         data_a0r <= 0;
         data_a1r <= 0;
         data_b0r <= 0;
         data_b1r <= 0;
         mult_outa <= 0;
         mult_outb <= 0;
         addsub_out <= 0;
      end
      else begin
         data_a0r <= data_a0;
         data_a1r <= data_a1;
         data_b0r <= data_b0;
         data_b1r <= data_b1;
         mult_outa <= data_a0r * data_a1r;
         mult_outb <= data_b0r * data_b1r;
         // addsub_out <= mult_outa - mult_outb;
         addsub_out <= mult_outa + mult_outb;
      end
   end
   assign dataout = addsub_out;

endmodule

The Multiply/Add will run at over 390 MHz with the ECP2’s -5 speed grade. This example is synthesized into a single MULT18X18ADDSUBB sysDSP block.

4. Basic FIR Filter. Since sysDSP blocks have closely integrated multipliers and adders, filters can be implemented with minimal routing resources and delays. Given that a sysDSP block provides 4 multipliers and 3 adders, a 4-tap filter can be implemented in a single block.

The samples so far use parallel operands; in FIR filter design, however, it is common to use an input term that is shifted in from a memory holding coefficient values. Each input register of the sysDSP block provides a shiftout output that connects to the shiftin input of the adjacent input register of the same sysDSP block. The registers on the boundaries of a sysDSP block also connect to the registers of adjacent DSP blocks through shiftin/shiftout connections. These connections create register chains spanning multiple DSP blocks, which makes it easy to increase the length of FIR filters. The Synplify and Synplify Pro tools will infer the shift register input ports of the sysDSP block when the model uses a cascade style as shown:

module systolic_fir (y, h7, h6, h5, h4, h3, h2, h1, h0, x, clk, rst);
   parameter tap = 8;
   // FIR data path width
   parameter m = 9;
   parameter n = 9;
   defparam inst0.m = 9;
   defparam inst0.n = 9;
   defparam inst1.m = 9;
   defparam inst1.n = 9;

   output signed [m+n+2:0] y;
   input signed [m-1:0] h7, h6, h5, h4, h3, h2, h1, h0;
   input signed [n-1:0] x;
   input clk;
   input rst;

   reg signed [m+n+2:0] y;
   // Declared signed so the partial sums are sign-extended when added into y.
   wire signed [m+n+1:0] mass0;
   wire signed [m+n+1:0] mass1;

   wire [m-1:0] h [0:tap-1];
   wire [n-1:0] PCOUT_INT [0:1];

   assign h[3] = h3, h[2] = h2, h[1] = h1, h[0] = h0;
   assign h[7] = h7, h[6] = h6, h[5] = h5, h[4] = h4;

   mult_add_sum_sh inst0 (.dataout (mass0),
                          .shreg_aout (PCOUT_INT[0]),
                          .shreg_ain (x),
                          .data_b0 (h[0]),
                          .data_b1 (h[1]),
                          .data_b2 (h[2]),
                          .data_b3 (h[3]),
                          .clk (clk),
                          .rst (rst));

   mult_add_sum_sh inst1 (.dataout (mass1),
                          .shreg_aout (PCOUT_INT[1]),
                          .shreg_ain (PCOUT_INT[0]),
                          .data_b0 (h[4]),
                          .data_b1 (h[5]),
                          .data_b2 (h[6]),
                          .data_b3 (h[7]),
                          .clk (clk),
                          .rst (rst));
   always @(posedge clk or posedge rst)
   begin : SUM
      if (rst) y <= 0;
      else y <= mass0 + mass1;
   end
endmodule

module mult_add_sum_sh (dataout, shreg_aout, shreg_ain, data_b0, data_b1, data_b2, data_b3, clk, rst)
/* synthesis syn_preserve = 1 */;
   parameter m = 18;
   parameter n = 18;

   output signed [(m+n)+1:0] dataout;
   output signed [m-1:0] shreg_aout;
   input signed [m-1:0] shreg_ain;
   input signed [n-1:0] data_b0;
   input signed [n-1:0] data_b1;
   input signed [n-1:0] data_b2;
   input signed [n-1:0] data_b3;
   input clk;
   input rst;

   reg signed [(m+n)+1:0] dataout;
   // Declared signed so negative sums are sign-extended into dataout.
   wire signed [(m+n):0] addouta;
   wire signed [(m+n):0] addoutb;
   reg signed [(m+n)-1:0] multa;
   reg signed [(m+n)-1:0] multb;
   reg signed [(m+n)-1:0] multc;
   reg signed [(m+n)-1:0] multd;
   reg signed [m-1:0] shreg_ain_r0;
   reg signed [m-1:0] shreg_ain_r1;
   reg signed [m-1:0] shreg_ain_r2;
   reg signed [m-1:0] shreg_ain_r3;
   reg signed [n-1:0] data_b0r;
   reg signed [n-1:0] data_b1r;
   reg signed [n-1:0] data_b2r;
   reg signed [n-1:0] data_b3r;

   always @(posedge clk or posedge rst)
   begin : SEQ_MULT_ADD_SUM_PIPE
      if (rst) begin
         shreg_ain_r0 <= 0;
         shreg_ain_r1 <= 0;
         shreg_ain_r2 <= 0;
         shreg_ain_r3 <= 0;
         data_b0r <= 0;
         data_b1r <= 0;
         data_b2r <= 0;
         data_b3r <= 0;
         multa <= 0;
         multb <= 0;
         multc <= 0;
         multd <= 0;
         dataout <= 0;
      end
      else begin
         shreg_ain_r0 <= shreg_ain;
         shreg_ain_r1 <= shreg_ain_r0;
         shreg_ain_r2 <= shreg_ain_r1;
         shreg_ain_r3 <= shreg_ain_r2;
         data_b0r <= data_b0;
         data_b1r <= data_b1;
         data_b2r <= data_b2;
         data_b3r <= data_b3;
         multa <= shreg_ain_r0 * data_b0r;
         multb <= shreg_ain_r1 * data_b1r;
         multc <= shreg_ain_r2 * data_b2r;
         multd <= shreg_ain_r3 * data_b3r;
         dataout <= addouta + addoutb;
      end
   end

   assign addouta = multa + multb;
   assign addoutb = multc + multd;
   assign shreg_aout = shreg_ain_r3;

endmodule
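
One way to sanity-check the cascade in simulation is to feed the filter a single-sample impulse and watch the coefficients emerge on y. The testbench below is a sketch under stated assumptions rather than part of the original design: the module name systolic_fir_tb, the 10 ns clock, and the coefficient values 1 through 8 are chosen only for illustration.

```verilog
`timescale 1ns/1ps
// Hypothetical impulse-response check for the systolic_fir model above.
module systolic_fir_tb;
   reg  signed [8:0]  x;              // n = 9 in systolic_fir
   reg                clk, rst;
   wire signed [20:0] y;              // m+n+3 = 21 bits
   integer            i;

   // Assumed coefficients: h0 = 1, h1 = 2, ... h7 = 8.
   systolic_fir dut (.y(y),
                     .h7(9'sd8), .h6(9'sd7), .h5(9'sd6), .h4(9'sd5),
                     .h3(9'sd4), .h2(9'sd3), .h1(9'sd2), .h0(9'sd1),
                     .x(x), .clk(clk), .rst(rst));

   always #5 clk = ~clk;              // assumed 10 ns clock period

   initial begin
      clk = 0; rst = 1; x = 0;
      #12 rst = 0;                    // release the asynchronous reset
      x = 9'sd1;                      // impulse, sampled on the next edge
      @(posedge clk); #1 x = 0;
      for (i = 0; i < 12; i = i + 1) begin
         @(posedge clk); #1;
         $display("edge %0d after impulse: y = %0d", i + 1, y);
      end
      $finish;
   end
endmodule
```

Tracing the register stages in the RTL (input shift register, multiplier register, block output register, and the final sum register), the coefficients h0 through h7 should appear on y one per clock, beginning three clock edges after the impulse is sampled. Treat that latency as something to confirm in your own simulator rather than as a guaranteed figure.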

The systolic FIR filter design resource usage reports from the Synplify and Synplify Pro tools and the ispLEVER design mapper (MAP) are shown in the figures below:

Resource Usage Report
Part: lfe2_12e-5
Register bits: 75 of 12000 (1%)
I/O cells: 104
DSP primitives: 2

Details:
CCU2B:             11
FD1S3AX:           54
GSR:               1
IB:                83
INV:               1
MULT9X9ADDSUBSUMB: 2
OB:                21
OFS1P3DX:          21
VHI:               1
VLO:               1

Figure 2. Synplify Resource Usage Report

Design Summary
--------------
   Number of registers: 75
      PFU registers: 54
      PIO registers: 21
   Number of SLICEs: 40 out of 23976 (0%)
      SLICEs(logic/ROM): 40 out of 18144 (0%)
      SLICEs(logic/ROM/RAM): 0 out of 5832 (0%)
         As RAM: 0 out of 4374 (0%)
         As Logic/ROM: 0
   Number of logic LUT4s: 2
   Number of distributed RAM: 0 (0 LUT4s)
   Number of ripple logic: 11 (22 LUT4s)
   Number of shift registers: 0
   Total number of LUT4s: 24
   Number of external PIOs: 104 out of 500 (21%)
   Number of PIO IDDR/ODDR: 0
   Number of PIO FIXEDDELAY: 0
   Number of 3-state buffers: 0
   Number of PLLs: 0 out of 4 (0%)
   Number of DLLs: 0 out of 2 (0%)
   Number of block RAMs: 0 out of 21 (0%)
   Number of CLKDIVs: 0 out of 2 (0%)
   Number of GSRs: 1 out of 1 (100%)
   JTAG used : Yes
   Readback used : No
   Oscillator used : No
   Startup used : No
   Number Of Mapped DSP Components:
   --------------------------------
   MULT36X36B          0
   MULT18X18B          0
   MULT18X18MACB       0
   MULT18X18ADDSUBB    0
   MULT18X18ADDSUBSUMB 0
   MULT9X9B            0
   MULT9X9ADDSUBB      0
   MULT9X9ADDSUBSUMB   2
   --------------------------------

Figure 3. ispLEVER MAP Report

The RTL examples shown can be used in a wide range of applications and are efficiently synthesized by the Synplify and Synplify Pro tools. Most logic is mapped into sysDSP blocks to minimize utilization of generic programmable function units (PFUs). The performance for each sysDSP block is independent of the place and route tools. To make it easier for synthesis tools to recognize the sysDSP structure, it is important to write the code in a manner that reflects the target hardware implementation.

Conclusion

This article demonstrates Verilog 2001 RTL models for DSP functions that will cause the Synplify and Synplify Pro tools to infer sysDSP blocks of the LatticeECP2 device. Designs targeting the sysDSP block can offer significant improvement over traditional LUT-based implementations. This article contains excerpts from Lattice Semiconductor data sheets and the application note “LatticeECP2 sysDSP Usage Guide”, available from Lattice Semiconductor.

Troy Scott has been helping design, document, QA and promote EDA products for about 14 years. He is a Product Marketing Engineer at Lattice Semiconductor Corporation. He welcomes feedback and can be reached at troy.scott@latticesemi.com

From The Syndicated Q3, 2006, published quarterly by Synplicity, Inc., www.synplicity.com.
Copyright © 2006 Synplicity, Inc. All rights reserved.