## LUTMUL: Exceed Conventional FPGA Roofline Limit by <u>LUT</u>-based Efficient <u>MUL</u>tiplication for Neural Network Inference

Yanyue Xie Northeastern University xie.yany@northeastern.edu

Zhengang Li Adobe li.zhen@northeastern.edu

Dana Diaconu Northeastern University diaconu.d@northeastern.edu

Suranga Handagala Northeastern University s.handagala@northeastern.edu

Miriam Leeser Northeastern University mel@coe.neu.edu

Xue Lin Northeastern University xue.lin@northeastern.edu

## ABSTRACT

For FPGA-based neural network accelerators, digital signal processing (DSP) blocks have traditionally been the cornerstone for handling multiplications. This paper introduces LUTMUL, which harnesses the potential of look-up tables (LUTs) for performing multiplications. LUTs typically outnumber DSPs by a factor of 100, offering a significant computational advantage. By exploiting this advantage, our method demonstrates a potential boost in the performance of FPGA-based neural network accelerators with a reconfigurable dataflow architecture. Our approach challenges the conventional peak performance of DSP-based accelerators and sets a new benchmark for efficient neural network inference on FPGAs. Experimental results demonstrate that our design achieves the best inference speed among the compared FPGA-based MobileNet accelerators, with a throughput of 1627 images per second and a top-1 accuracy of 70.95% on the ImageNet dataset.

## **CCS CONCEPTS**

• Hardware  $\rightarrow$  Reconfigurable logic and FPGAs; • Computing methodologies  $\rightarrow$  Machine learning.

## **KEYWORDS**

FPGAs, Quantization, Look-up tables, Roofline model.

#### ACM Reference Format:

Yanyue Xie, Zhengang Li, Dana Diaconu, Suranga Handagala, Miriam Leeser, and Xue Lin. 2025. LUTMUL: Exceed Conventional FPGA Roofline Limit by <u>LUT</u>-based Efficient <u>MUL</u>tiplication for Neural Network Inference. In 30th Asia and South Pacific Design Automation Conference (ASPDAC '25), January 20–23, 2025, Tokyo, Japan. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3658617.3697687

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).

© 2025 Copyright held by the owner/author(s). ACM ISBN 979-8-4007-0635-6/25/01

## **1** INTRODUCTION

Field-Programmable Gate Arrays (FPGAs) have been widely used as deep learning accelerators, facilitating advancements in computer vision [7, 17, 35, 39] and natural language processing [4, 13, 16, 38] tasks. However, FPGAs lag behind Graphics Processing Units (GPUs) in terms of performance and ease of programming.

FPGA accelerators can follow GPU-like [32, 34] architecture, which maps computation to compute cores with repetitive use. While beneficial, this approach encounters memory bandwidth issues similar to GPUs. Compared with GPUs, FPGAs usually have lower memory bandwidth, and the lower clock frequency of FPGAs means a lower upper bound of performance. While FPGA-based accelerators with specific instruction set architectures [37] offer flexibility across different models, they often compromise on performance due to non-optimized compute kernels for specific neural network layers.

FPGA reconfigurable logic mainly consists of look-up tables (LUT), block RAMs (BRAMs), and digital signal processing (DSP) blocks. Together with routing resources, FPGA can be reconfigured for customized designs. Despite the flexibility, FPGAs face constraints in clock frequency, floating-point performance, and memory bandwidth. This performance gap between FPGAs and GPUs is becoming even larger when considering the tensor core performance of GPUs. To address this, we need an algorithm-hardware co-design method to boost FPGAs with greater inference capability.

To bridge the performance gap between FPGAs and GPUs, particularly in deep learning applications, we introduce LUTMUL, which leverages the look-up tables on FPGAs for deep learning tasks, focusing on accelerating convolutional neural networks (CNNs). We recognize that the traditional FPGA designs, heavily dependent on DSP blocks, may not fully exploit the parallelism and flexibility that LUTs offer, as the availability of LUTs typically outnumbers DSPs by a factor of 100. Our method emphasizes a novel utilization of LUTs to enhance computational efficiency and throughput in deep learning applications. Specifically, we embed the convolutional neural network weights into LUTs, where the LUT input is the activations and the LUT output is the multiplication result. Different from LUT-based general multipliers, our method is efficient in resources (requiring just 2 LUTs for a single 4-bit multiplication) and helps fully exploit the parallelism.

We propose a reconfigurable dataflow architecture for our LUT-based efficient multiplication kernel. Our dataflow architecture minimizes the memory access time by processing the data on-chip through each layer without external memory. The reconfigurability of the FPGA allows us to tailor the architecture specifically for each distinct layer of CNNs. With LUT resources, the generated accelerator can potentially exceed the peak performance of conventional DSP-based FPGA accelerators. Our dataflow architecture aims to enhance the overall efficiency of our deep learning accelerators, optimizing FPGAs for deep learning tasks.

Our contributions can be summarized as follows:

- We present LUTMUL, an algorithm-hardware co-design method that embeds quantized neural network weights into look-up tables for efficient multiplications and uses dedicated look-up tables for full parallelism.
- We design a reconfigurable dataflow architecture that exploits scalability and LUT potential to save computational resources.
- Using LUTMUL, FPGA designs can potentially exceed the peak performance of conventional DSP-based FPGA accelerators when using the same amount of resources.

## 2 BACKGROUND

## 2.1 Roofline Model Analysis

GPUs leverage Single Instruction Multiple Data (SIMD) architecture, allowing them to simultaneously perform the same operation across multiple data points. This parallelism makes GPUs exceptionally efficient for tasks that can be divided into smaller, similar operations, such as matrix multiplication in deep learning.

FPGAs, by contrast, achieve parallel processing through their reconfigurability, allowing hardware to be tailored to specific computational tasks. This flexibility allows FPGAs to efficiently handle complex and diverse data processing tasks, offering advantages over the fixed architecture of GPUs. While FPGAs lack the raw SIMD power of GPUs for certain applications, they excel in scenarios requiring custom hardware configurations or low latency, such as specific signal processing tasks or custom machine learning algorithms. However, this adaptability often comes with a trade-off in processing speed and ease of programming, with FPGAs typically lagging behind the computational throughput of GPUs.

The roofline model [31] is a useful tool for analyzing the performance of both GPUs and FPGAs. An algorithm running on GPUs or FPGAs can be either compute bound or memory bound. According to the roofline model [39], the peak performance of FPGAs is:

$$Peak \ performance = p \times PEs \times 2 \times f, \tag{1}$$

where *PEs* is the number of processing elements (PEs) used in the accelerator, such as the DSP blocks, *f* is the clock frequency, and the ×2 term accounts for multiply-accumulate (MAC) operations, each of which counts as two operations. The packing factor for DSP blocks, *p*, varies based on the bit-width of the operation, with p = 1 for 16-bit, p = 2 for 8-bit, and p = 4 for 4-bit multiply-accumulate operations.

Furthermore, the performance of an FPGA-based accelerator is also limited by the memory, which is related to the memory bandwidth (BW) and computation-to-communication (CTC) ratio:

$$Peak \ memory \ bandwidth = BW \times CTC \ ratio. \tag{2}$$
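As a rough illustration of Equations (1) and (2), the sketch below evaluates both roofs and takes their minimum as the attainable performance. The DSP count, clock frequency, and HBM2 bandwidth are the U280 figures listed in Table 1; the function names and the example CTC ratio of 10 operations per byte are illustrative assumptions, not values from the paper.

```python
def peak_compute_gops(p, num_pes, freq_mhz):
    """Equation (1): p * PEs * 2 * f, returned in GOPS."""
    return p * num_pes * 2 * freq_mhz / 1e3


def memory_roof_gops(bw_gb_per_s, ctc_ops_per_byte):
    """Equation (2): BW * CTC ratio, returned in GOPS."""
    return bw_gb_per_s * ctc_ops_per_byte


def attainable_gops(p, num_pes, freq_mhz, bw_gb_per_s, ctc_ops_per_byte):
    """Roofline: performance is capped by the lower of the two roofs."""
    return min(peak_compute_gops(p, num_pes, freq_mhz),
               memory_roof_gops(bw_gb_per_s, ctc_ops_per_byte))


# U280 figures from Table 1: 9024 DSP48E2 blocks, 300 MHz, 460 GB/s HBM2.
DSPS, FREQ_MHZ, HBM_GBPS = 9024, 300, 460

for bits, p in ((16, 1), (8, 2), (4, 4)):
    print(f"{bits}-bit DSP roof: {peak_compute_gops(p, DSPS, FREQ_MHZ):.1f} GOPS")

# Hypothetical CTC ratio of 10 ops/byte: the memory roof is the lower one.
print(attainable_gops(4, DSPS, FREQ_MHZ, HBM_GBPS, ctc_ops_per_byte=10))
```

Equation (1) counts DSP blocks only, so these roofs are not expected to coincide with the datasheet peak quoted in Table 1.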

Table 1 summarizes the major differences between GPUs and FPGAs, such as clock frequency, number of compute cores, and memory bandwidth. The significant difference in clock frequency contributes to a notable performance gap between GPUs and FPGAs. Even with optimization such as pruning and quantization [10], FPGA inference speed generally remains inferior to that of GPUs.

Table 1: Comparison between GPUs and FPGAs. Both the V100 and the U280 are the PCIe versions. Performance is the theoretical peak taken from the corresponding product datasheets [19, 33].

| Devices       | V100 GPU                                          | Alveo U280 FPGA                  |
|---------------|---------------------------------------------------|----------------------------------|
| Technology    | 12nm                                              | 16nm                             |
| Clock         | 1530MHz                                           | 300MHz                           |
| Compute cores | 5120 CUDA cores<br>640 Tensor cores               | 9024 DSP48E2                     |
| Performance   | 14 TFLOPs (FP32 CUDA)<br>112 TFLOPs (FP16 Tensor) | 24.5 TOPs (INT8)                 |
| Memory        | 32GB HBM2                                         | 32GB DDR4<br>8GB HBM2            |
| Bandwidth     | 900GB/s                                           | 38GB/s (DDR4)<br>460GB/s (HBM2)  |
| Power         | 250W                                              | 225W (Max)<br>100W (Typical)     |
| Price         | \$11,458                                          | \$7,717                          |

Figure 1 shows the roofline model for the U280. We take  $\frac{1}{64}$  of the resources and HBM bandwidth of the U280 for analysis. Conventional DSP-based accelerators become compute bound once the arithmetic intensity exceeds a threshold. LUTMUL exploits the potential of LUTs and achieves higher parallelism through its LUT-efficient mapping.



Figure 1: Roofline model analysis for LUTMUL and other DSP-based architectures. We take  $\frac{1}{64}$  resource and memory bandwidth of U280 for analysis.

## 2.2 Dataflow Architecture

Dataflow architecture contrasts with the traditional control flow architecture. A dataflow node or processing element can start executing as soon as all of its inputs are ready. Dataflow architecture employs simple operations, such as broadcast (one-to-many), map (element-wise, *e.g.* activation function), zip (multi-operand, *e.g.* convolution and matrix multiplication), and reduce (many-to-one, *e.g.* pooling) [22]. A key advantage of reconfigurable dataflow architecture is that data flows through the computation graph efficiently, which significantly enhances parallelism and minimizes memory access time.
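The four operator classes can be illustrated with ordinary Python iterators, where each stage consumes elements as they become available. This is only a software analogy under assumed names (`broadcast`, `relu`, `dot`); it does not describe how the hardware nodes are implemented.

```python
from itertools import tee


def broadcast(stream, n):
    """One-to-many: duplicate a stream for n consumers."""
    return tee(stream, n)


def relu(stream):
    """Map: element-wise activation."""
    return (max(0, x) for x in stream)


def dot(stream_a, stream_b):
    """Zip: pair two operand streams and accumulate their products."""
    return sum(a * b for a, b in zip(stream_a, stream_b))


activations = iter([3, -1, 4, -2])
weights = [1, 2, 0, 5]

left, right = broadcast(activations, 2)   # broadcast (one-to-many)
rectified = relu(left)                    # map (element-wise)
mac = dot(rectified, weights)             # zip (multi-operand)
peak = max(right)                         # reduce (many-to-one, like max pooling)

print(mac, peak)
```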

## 2.3 FPGA-based Neural Network Accelerator Architecture

Since FPGAs have relatively limited on-chip resources, most of the FPGA accelerators [3, 24, 32, 34, 37] map computation onto hardware and reuse the PE array. Notably, [18] explores the intra-layer mixed-scheme quantization and maps vision transformer layers onto a General Matrix Multiply (GEMM) kernel, where each layer maintains a fixed ratio of this quantization scheme. Systolic array architecture [30] presents another method for efficiently mapping convolution and matrix multiplication onto FPGAs with high throughput.

FINN [2, 26] uses a dataflow architecture and integrates all layers of the network into a single FPGA. The resources for each layer can be adjusted according to computation requirements so that all layers are balanced and pipelined for better throughput. FINN is particularly well-suited for exploiting inter-layer quantization for neural networks because each layer has dedicated computation and memory resources.

## 3 ALGORITHM-HARDWARE CO-DESIGN FOR LUTMUL

## 3.1 Motivation

The roofline model reveals a theoretical peak performance for DSP-based accelerators that applies across architectures such as GEMM, systolic array, or dataflow architecture. By performing multiplication in LUTs and fully parallelizing the computation, we can push FPGA performance beyond this limit. Given that LUTs usually far outnumber DSPs, using LUTs can potentially exceed the performance upper bound of current DSP-based FPGA accelerators.



Figure 2: Accuracy loss and LUT resources for 1-bit to 8-bit quantization.

Figure 2 shows the quantized neural network accuracy [14, 36] and the number of LUTs needed per multiplication by our method. We trade off accuracy against LUT usage and choose 4-bit as our quantization bit-width. Binary and ternary neural networks incur large accuracy drops yet still consume half of the LUTs that 4-bit quantization uses, because the number of output bits per LUT is limited. Compared with higher bit-width quantization, 4-bit uses significantly fewer LUTs and has negligible accuracy loss.

## 3.2 LUTMUL Design Flow



Figure 3: LUTMUL Design flow.

Figure 3 depicts the LUTMUL design flow. Initially, we train the neural network in our quantization-aware training framework. The quantization bit-widths for weights and activations are adjustable hyperparameters. The final quantized neural network is exported in Open Neural Network Exchange (ONNX) format, facilitating subsequent hardware generation.

The ONNX intermediate representation is interpreted as a computation graph and undergoes a streamlining transformation [27]. The scaling factors of each channel and batch normalization layer are reordered and absorbed into the activation function, transforming into a multi-threshold unit. Each computation node is converted to a High-Level Synthesis (HLS) layer using our HLS templates. These HLS layers are folded according to performance and resource requirements and interconnected sequentially. The final hardware bitstream, generated by Vivado, is deployed on FPGA boards via the PYNQ framework.
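To make the multi-threshold idea concrete, the sketch below shows how a folded scale and bias followed by a uniform activation quantizer can be replaced by integer threshold comparisons on the accumulator. It assumes a positive scale, an unsigned output, and round-half-up quantization; the function names and the numeric values are illustrative and are not taken from the paper or from the FINN code base.

```python
import math


def reference_activation(acc, scale, bias, bits):
    """Scale and shift the integer accumulator, round half up, clamp to unsigned range."""
    q = math.floor(acc * scale + bias + 0.5)
    return min(max(q, 0), 2 ** bits - 1)


def make_thresholds(scale, bias, bits):
    """Smallest integer accumulator value that reaches each output level k."""
    return [math.ceil((k - 0.5 - bias) / scale) for k in range(1, 2 ** bits)]


def multi_threshold(acc, thresholds):
    """Output level = number of thresholds the accumulator meets or exceeds."""
    return sum(acc >= t for t in thresholds)


SCALE, BIAS, BITS = 0.125, 0.25, 4      # hypothetical folded scale, bias, bit-width
thresholds = make_thresholds(SCALE, BIAS, BITS)

# The comparison-only unit reproduces the arithmetic reference exactly.
for acc in range(-50, 200):
    assert multi_threshold(acc, thresholds) == reference_activation(acc, SCALE, BIAS, BITS)
print(thresholds)
```

Because the thresholds are computed offline, the hardware unit in this sketch needs only comparisons, with no multiplier for the scale.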

## 3.3 Reconfigurable Dataflow Architecture

Figure 4 illustrates the hardware architecture of a MobileNetV2 [23] implementation. This design, focusing on inverted residual blocks, employs a First In, First Out (FIFO) buffer between layers to store activations. The architecture uses a reconfigurable dataflow architecture.

Our design spans all Super Logic Regions (SLRs) to maximize hardware resources. To avoid severe timing violations, signals cross to another SLR only when the resources of the current SLR are insufficient for the next layer. Dataflow architecture is inherently suited to designs spanning multiple SLRs and can be scaled up further by connecting additional FPGAs via a network to deploy larger networks [6].

## 3.4 Convolution Generator

Figure 4: Hardware architecture of accelerator generated by LUTMUL. Our design is fully on-chip and does not use DRAM or HBM memory.

For convolutional layers, the convolution operations can be lowered to matrix-matrix multiplications. These can be mapped in a streaming manner and fed to the multiplication kernel. The multiplication kernel is fully parallelized to perform a matrix-vector multiplication, where the weights are stationary and the activations are streaming inputs. Therefore, we need a convolution generator to perform the im2col operation: reading data from the FIFO, sliding across the input image to form an image matrix, and streaming the output to the multiplication kernel.

The convolution generator accommodates various configurations, including pointwise, depthwise, and standard convolution with different kernel sizes and strides, since each kind of convolutional layer expects different input data sequences, necessitating specific generator settings.
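A minimal software sketch of such a generator is shown below. It assumes an `image[y][x][channel]` layout, no padding, and a square kernel, and the name `conv_generator` is hypothetical. Each yielded list is one column of the im2col matrix, of length kernel × kernel × channels, and is what the multiplication kernel would consume in one pipeline iteration. Depthwise convolution, which streams each channel separately, is not covered here.

```python
def conv_generator(image, kernel, stride):
    """Yield one flattened kernel x kernel x C patch per output pixel (im2col)."""
    height, width = len(image), len(image[0])
    for row in range(0, height - kernel + 1, stride):
        for col in range(0, width - kernel + 1, stride):
            yield [value
                   for dy in range(kernel)
                   for dx in range(kernel)
                   for value in image[row + dy][col + dx]]


# 3x3 single-channel image with pixel values 0..8.
image = [[[3 * y + x] for x in range(3)] for y in range(3)]

patches = list(conv_generator(image, kernel=2, stride=1))
print(patches)          # four patches, each of length 2 * 2 * 1

# A pointwise (1x1) layer with stride 2 simply streams selected pixels.
print(list(conv_generator(image, kernel=1, stride=2)))
```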

## 3.5 Look-Up Table based Efficient Multiplication

Figure 5 demonstrates the look-up table based multiplication kernels and how to determine the look-up table initialization (INIT) values. After the weights are embedded, the look-up tables become efficient constant multipliers [11]. Our look-up table based multiplier is resource efficient, using only 2 LUTs on average for a single 4-bit multiplication, compared with a general multiplier that consumes 13-28 LUTs for an equivalent operation. The choice of 4-bit quantization is pivotal as it maintains model accuracy and optimizes look-up table usage, as shown in Figure 2. The number of LUT6 primitives (6:1 LUT, 6-bit input, 1-bit output) needed for an n-bit multiplication (n:2n LUT, n-bit input, 2n-bit output) using our method is:

$$\#LUTs = \frac{2n \times 2^n}{1 \times 2^6}. \tag{3}$$
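The sketch below evaluates Equation (3) for 1-bit to 8-bit operands and then builds the INIT words for the packing described in the Figure 5 caption: unsigned 4-bit activations on the four lowest address bits, a weight-select bit above them, the top address bit tied high, and two product bits per LUT6_2. The function names are hypothetical and the bit ordering is an assumption inferred from that caption. For the example weights 1 and -3, the words for the lowest and highest output bit pairs come out as 64'hcccc_cccc_aaaa_aaaa and 64'hfffe_0000_fffe_0000, which agree with the values quoted there.

```python
def luts_per_multiplication(n):
    """Equation (3): a table with 2^n entries of 2n bits, stored in 64-bit LUT6s."""
    return (2 * n * 2 ** n) / (1 * 2 ** 6)


def lut6_2_inits(weight_ws0, weight_ws1, act_bits=4, out_bits=8):
    """INIT words for the LUT6_2 primitives that hold two constant multipliers.

    Address bits [3:0] are the unsigned activation and bit 4 is the weight
    select. With bit 5 tied high, O6 reads INIT[63:32] and O5 reads INIT[31:0].
    """
    def bit_plane(weight, bit):
        # 16-bit word whose entry a is bit `bit` of the two's complement product.
        word = 0
        for a in range(2 ** act_bits):
            product = (weight * a) & (2 ** out_bits - 1)
            word |= ((product >> bit) & 1) << a
        return word

    inits = []
    for low in range(0, out_bits, 2):     # one LUT6_2 per pair of product bits
        o5 = (bit_plane(weight_ws1, low) << 16) | bit_plane(weight_ws0, low)
        o6 = (bit_plane(weight_ws1, low + 1) << 16) | bit_plane(weight_ws0, low + 1)
        inits.append((o6 << 32) | o5)
    return inits                          # ordered from lowest to highest bit pair


for n in range(1, 9):
    print(n, luts_per_multiplication(n))

inits = lut6_2_inits(1, -3)               # example weights from Figure 5
print([f"64'h{word:016x}" for word in inits])
```

Two weights occupy four LUT6_2 primitives in this sketch, which is the two-LUTs-per-multiplication figure that Equation (3) gives for n = 4.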


Algorithm 1 shows the pseudo High-Level Synthesis (HLS) code for look-up table based multiplication. The table contents are derived from pre-computed weights. The weights of convolutional layers are fully parallelized: the *COUT* dimension in Algorithm 1 refers to the output channels, and the *CIN* dimension refers to the input channels times the kernel size squared. These dimensions (four in total) are fully unrolled in the spatial domain. The remaining input feature map height and width dimensions are pipelined in the temporal domain. Input activations are streamed from the convolution generator and passed through the look-up tables, whose outputs are the multiplication results. These results are accumulated and passed through the threshold unit to generate activations for the next layer.

**Algorithm 1** Look-up table based multiplication kernel

**Input:** Streaming parallel input from the Convolution Generator and pre-computed look-up table contents

**Output:** Streaming output for the next layer

```
for i ← 1 to ROWS × COLS do
    #pragma HLS PIPELINE II=1
    input ← src.read()
    // Look up every product: one constant-multiplier table per weight.
    for co ← 1 to COUT do
        #pragma HLS UNROLL
        for ci ← 1 to CIN do
            #pragma HLS UNROLL
            mul[co][ci] ← lut[co][ci][input[ci]]
        end for
    end for
    // Accumulate the products of each output channel.
    for co ← 1 to COUT do
        #pragma HLS UNROLL
        output[co] ← 0
        for ci ← 1 to CIN do
            #pragma HLS UNROLL
            output[co] += mul[co][ci]
        end for
    end for
    dst.write(output)
end for
```
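The behaviour of Algorithm 1 can be emulated in software to check that table look-ups reproduce ordinary multiply-accumulate results. The sketch below is a functional model only: the layer shape, the random weights, and the helper names `build_tables` and `lut_mac` are assumptions made for illustration, and the Python loops stand in for logic that is fully unrolled in hardware.

```python
import random


def build_tables(weights, act_bits=4):
    """tables[co][ci][a] = weights[co][ci] * a for every unsigned activation a."""
    return [[[w * a for a in range(2 ** act_bits)] for w in row] for row in weights]


def lut_mac(tables, activations):
    """One pipeline iteration of Algorithm 1: look up the products, then accumulate."""
    outputs = []
    for row in tables:                                    # unrolled over COUT
        products = [table[a] for table, a in zip(row, activations)]  # unrolled over CIN
        outputs.append(sum(products))
    return outputs


random.seed(0)
COUT, CIN = 4, 8                                          # hypothetical layer shape
weights = [[random.randint(-8, 7) for _ in range(CIN)] for _ in range(COUT)]
tables = build_tables(weights)

for _ in range(100):                                      # streamed input vectors
    acts = [random.randint(0, 15) for _ in range(CIN)]
    expected = [sum(w * a for w, a in zip(row, acts)) for row in weights]
    assert lut_mac(tables, acts) == expected
print("LUT-based MAC matches direct multiplication")
```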

## 3.6 Quantization-Aware Training

Quantization [10] and DSP packing [24] have become standard approaches for mapping neural networks onto FPGA-based accelerators, as FPGAs' LUTs and DSP blocks are not optimized for floating-point arithmetic but are well suited to integer or fixed-point operations. Quantization, paired with DSP packing, reduces resource demands for the multiplications and improves throughput.

The quantization operation is defined as:

$$y = quantize(x) = clamp(round(\frac{x}{s} + z), y_{min}, y_{max}), \tag{4}$$

where x is the floating-point value to quantize, s is the scaling factor of the output quantized tensor, and z is the zero-point or quantization bias coefficient. The function, *round*, can be round-to-even or round-to-zero, and *clamp* performs clipping inclusive of the boundaries  $y_{min}$  and  $y_{max}$ .

For the reverse process, to compute the floating-point representation of a quantized value, we define the dequantize operation as:

$$dequantize(y) = s(y - z), \tag{5}$$



where y is a quantized tensor, z is its zero-point, and s is the scaling factor.

Figure 5: Illustration of LUTMUL for efficient multiplication via look-up tables. The left-hand side figure demonstrates how to use the LUT6\_2 primitive for embedding multiplication results of weights and input activations. The right-hand side table demonstrates the multiplication results of two example weights and how to generate the corresponding look-up table contents. The weights (int4) and multiplication outputs (int8) use two's complement representation, while activations are all unsigned numbers (uint4). The Most Significant Bit (MSB) of the LUT6\_2 input is configured as '1' to enable two output ports. The bit below the MSB is a Weight Select (WS) signal to select between two weights. The lowest 4-bit inputs serve as activation inputs. Our method embeds two int4 weights inside four LUT6, a resource-efficient approach contrasting with the LUT6-instantiated general multipliers, which consume 6-14× more LUT6 resources. The two example weights are 1 and -3. The embedded LUT contents for these four LUTs are 64'hfffe\_0000\_fffe\_0000, 64'h07fe\_0000\_f83e\_0000, 64'h39c6\_ff00\_5a5a\_f0f0, and 64'hcccc\_cccc\_aaaa\_aaaa, respectively.
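Equations (4) and (5) can be sketched in a few lines. The scale, zero-point, and signed 4-bit range below are hypothetical values chosen for illustration, and Python's built-in `round` implements the round-to-even option.

```python
def quantize(x, scale, zero_point, y_min, y_max):
    """Equation (4): scale, shift, round to nearest even, clamp."""
    return min(max(round(x / scale + zero_point), y_min), y_max)


def dequantize(y, scale, zero_point):
    """Equation (5): map a quantized integer back to a real value."""
    return scale * (y - zero_point)


SCALE, ZERO_POINT = 0.25, 0          # hypothetical per-tensor parameters
INT4_MIN, INT4_MAX = -8, 7           # signed 4-bit range

for x in (0.3, -0.6, 1.74, 5.0):
    y = quantize(x, SCALE, ZERO_POINT, INT4_MIN, INT4_MAX)
    print(x, y, dequantize(y, SCALE, ZERO_POINT))
```

The last input, 5.0, lies outside the representable range and is clipped to the largest level, which shows why the round trip through Equation (5) is lossy.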

Quantization introduces errors to the trained model parameters and results in performance degradation. Quantization-Aware Training (QAT) is a popular approach that retrains the model with quantized parameters on the pretraining dataset to converge to the pretrained model performance. The usual forward and backward passes are performed on the quantized model in floating point, and the model parameters are quantized after each gradient update. In particular, it is important to do this projection after the weight update is performed in floating point precision. Performing the backward pass with floating point is vital, as accumulating the gradients in quantized precision can result in zero gradients or gradients with high error, especially in low-precision scenarios [9].
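The update rule described above can be illustrated on a one-parameter toy problem. The sketch keeps a floating-point weight, uses its quantized projection in the forward pass, and applies the gradient to the floating-point copy, which is commonly called a straight-through estimator. That name, the toy regression task, and all hyperparameters are assumptions for illustration and are not the training setup used in the paper.

```python
def fake_quantize(w, scale=0.25, q_min=-8, q_max=7):
    """Project a floating-point weight onto the signed 4-bit grid."""
    return scale * min(max(round(w / scale), q_min), q_max)


def train(target_w, inputs, steps=200, lr=0.01):
    """Fit y = w * x with a quantized forward pass and a floating-point update."""
    w_float = 0.0                                   # latent full-precision weight
    for _ in range(steps):
        w_q = fake_quantize(w_float)                # forward pass uses the quantized weight
        # Gradient of the mean squared error, evaluated at the quantized weight.
        grad = sum(2 * (w_q * x - target_w * x) * x for x in inputs) / len(inputs)
        w_float -= lr * grad                        # update accumulates in floating point
    return w_float, fake_quantize(w_float)


w_float, w_q = train(target_w=0.74, inputs=[1.0, 2.0, 3.0])
print(w_float, w_q)
```

Because many small updates accumulate in `w_float` before the projection changes level, gradients far smaller than one quantization step still influence the result, which is the effect the paragraph above attributes to floating-point accumulation.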

## 4 EVALUATION

## 4.1 Experimental Setup

To evaluate the performance of LUTMUL, we implement MobileNetV2 on FPGAs and compare it with existing FPGA-based MobileNet accelerators. MobileNetV2 [23] has 3.4M parameters and achieves 71.88% top-1 accuracy on the ImageNet dataset [5]. We utilize the FINN framework [2] as our foundational platform. For quantization, we adopt PyTorch 1.13.0 and Brevitas 0.9.1 [20]. Specifically, we choose 4-bit quantization for weights and activations, except for the first and last layers, which are set to 8-bit. To preserve the accuracy of the MobileNetV2 model, we apply the channel-wise quantization scheme to our model. Our quantized MobileNetV2 network is trained for 420 epochs, culminating in a 70.95% top-1 accuracy evaluated on the ImageNet dataset [5].

For the hardware evaluation, the utilized development platform is the AMD Xilinx Alveo U280 data center accelerator card on the Open Cloud Testbed (OCT) [40]. We implement the first 15 layers of MobileNetV2 in a fully parallel manner and fold the remaining layers for resource optimization. To maximize the computation efficiency without timing violation, the working frequency is set to 333 MHz for all the designs implemented through Vitis HLS/Vivado 2022.1.

## 4.2 Experimental Results

Table 2 shows the hardware performance and comparisons with other FPGA-based MobileNet accelerators. Most of these accelerators are tailored for edge FPGAs, such as the ZU9EG, except for FINN, which has a data center accelerator implementation for MobileNetV1. The FINN result is generated and tested on the same device as our implementation, while other data points are taken from their original publications.

In terms of accuracy, our model achieves the best top-1 accuracy on ImageNet among all implementations, 70.95%. Quantization-aware training effectively mitigates quantization errors, largely preserving the model's original accuracy even with 4-bit quantized weights and activations.

| Implementation             | FINN<br>[2] | FPL'19<br>[32] | Light-OPU<br>[37] | FPL'21<br>[34] | Mix & Match<br>[3] | FILM-QNN<br>[24] | LUTMUL<br>(Ours) |
|----------------------------|-------------|----------------|-------------------|----------------|--------------------|------------------|------------------|
| Network                    | MobileNetV1 | MobileNetV2    | MobileNetV3       | MobileNetV2    | MobileNetV2        | MobileNetV2      | MobileNetV2      |
| Bit-Width                  | W4A4        | W8A8           | W8A8              | W8A8           | W4A4               | W8A5&W4A5        | W4A4             |
| Top-1 Accuracy             | 70.4%       | 68.1%          | 66.7%             | 70.8%          | 65.6%              | 65.7%            | 70.95%           |
| Platform                   | Alveo U280  | ZU9EG          | XC7K325T          | XC7V690T       | XC7Z045            | ZU9EG            | Alveo U280       |
| Frequency (MHz)            | 333         | 333            | 200               | 150            | 100                | 150              | 333              |
| LUT                        | 501363      | 161944         | 173522            | 308449         | 145049             | 180100           | 529242           |
| FF                         | 476316      | 301416         | 241175            | 278926         | 111575             | -                | 503192           |
| BRAM36                     | 898         | 771            | 193.5             | 941.5          | 225.5              | 440.5            | 1119             |
| DSP                        | 106         | 2070           | 704               | 2160           | 900                | 2092             | 106              |
| Power (W)                  | 41.69       | -              | 8.5*              | 11.35          | -                  | 12.9             | 42.12            |
| Frame Rate (FPS)           | 925         | 809.8          | 332.6             | 302.3          | 549.3              | 537.9            | 1627             |
| Throughput (GOPS)          | 556.4       | 487.1          | 84.48*            | 181.8          | 326.9              | 320.1            | 978.6            |
| Energy Efficiency (GOPS/W) | 13.35       | -              | 9.9               | 16.02          | -                  | 24.8             | 23.23            |

Table 2: Comparisons of MobileNet implementations between previous FPGA-based accelerators.

Note: '-' means that the result is not given in the original publications, and '\*' means that the result is inferred from the original publications.

As for inference performance, our implementation achieves a throughput of 1627 images per second, the highest frame rate in Table 2. It consumes the most FPGA resources among the compared designs but still fits on a single Alveo U280. Our implementation also yields an energy efficiency of 23.23 GOPS/W, marginally lower than that of FILM-QNN [24], which is implemented on a more power-efficient edge FPGA board.
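The derived figures in Table 2 can be cross-checked from the reported frame rate, throughput, and power. The snippet below copies four rows from the table and recomputes energy efficiency as throughput divided by power. The GOP-per-frame column and the frame-rate ratio against FINN are quantities computed here for illustration and are not reported in the paper.

```python
# (frame rate in FPS, throughput in GOPS, power in W) copied from Table 2.
table2 = {
    "FINN": (925, 556.4, 41.69),
    "FPL'21": (302.3, 181.8, 11.35),
    "FILM-QNN": (537.9, 320.1, 12.9),
    "LUTMUL": (1627, 978.6, 42.12),
}

for name, (fps, gops, watts) in table2.items():
    efficiency = gops / watts              # GOPS/W column of Table 2
    ops_per_frame = gops / fps             # GOP per image implied by the row
    print(f"{name}: {efficiency:.2f} GOPS/W, {ops_per_frame:.3f} GOP/frame")

speedup = table2["LUTMUL"][0] / table2["FINN"][0]
print(f"LUTMUL vs FINN frame rate: {speedup:.2f}x")
```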

## 4.3 Discussion



Figure 6: LUT Resource breakdown of the second convolution layer in MobileNetV2 using LUTMUL.

Figure 6 shows the LUT resource breakdown of the second convolution layer in MobileNetV2 using LUTMUL, which is a  $1 \times 1$  convolution with 32 input channels and 32 output channels. For these 1024 4-bit weights, the multiplication operations use 1829 LUTs after HLS synthesis, in line with the theoretical estimate of about two LUTs per 4-bit multiplication. However, HLS instantiates an adder for each addition operation to achieve an initiation interval (II) of 1, resulting in high LUT usage for adder logic. After Vivado implementation, the total LUT usage decreases to 5922: Vivado optimizes the logic and instantiates 3277 LUTs as ROM and 2645 LUTs as adder and other logic. Even though adder logic accounts for a large part of the resources, the parallel MAC performance of LUTMUL still outperforms the DSP packing method using the same number of resources.

## 4.4 Comparisons with Related Works

Our method is not limited to integer multiplication but can also be extended to customized data formats, such as FP4 and MXFP4 [21], whereas DSP packing is designed for integer formats. LUTNet [28, 29] also utilizes LUTs for inference and explores the flexibility of LUTs. However, the LUTNet design suffers from low accuracy when the network becomes larger. PolyLUT [1] trains multivariate polynomials instead of linear functions and embeds piecewise polynomial functions into LUTs. CompressedLUT [15] proposes a lossless LUT compression method and is efficient for non-linear functions and large LUT logic blocks, such as [8, 12, 25]. Our method maps MAC operations to the single-LUT level, and Vivado can handle the remaining logic optimization efficiently.

## 5 CONCLUSION

We propose LUTMUL, an efficient method that leverages look-up tables for multiplication in convolutional neural networks. Compared with the general multiplier, our method is efficient in resources, which only needs two look-up tables on average for a single 4-bit multiplication. Compared with other DSP-based FPGA accelerators, LUTMUL's reconfigurable dataflow architecture enables full parallelism, reduces memory access time, and increases the theoretical upper bound of performance. Experimental results demonstrate that our design maintains a top-1 accuracy of 70.95% on the ImageNet dataset and achieves a throughput of 1627 images per second on a single Alveo U280 FPGA, outperforming other FPGA-based MobileNet accelerators.

## ACKNOWLEDGMENTS

This research was supported in part by the National Science Foundation under Grants CCF-1901378, CNS-1925658, and CNS-2319962.

## REFERENCES

- [1] Marta Andronic and George A Constantinides. 2023. PolyLUT: learning piecewise polynomials for ultra-low latency FPGA LUT-based inference. In 2023 International Conference on Field Programmable Technology (ICFPT). 60–68.
- [2] Michaela Blott, Thomas B Preußer, Nicholas J Fraser, Giulio Gambardella, Kenneth O'Brien, Yaman Umuroglu, Miriam Leeser, and Kees Vissers. 2018. FINN-R: An end-to-end deep-learning framework for fast exploration of quantized neural networks. ACM Transactions on Reconfigurable Technology and Systems (TRETS) 11, 3 (2018), 1–23.
- [3] Sung-En Chang, Yanyu Li, Mengshu Sun, Runbin Shi, Hayden K-H So, Xuehai Qian, Yanzhi Wang, and Xue Lin. 2021. Mix and match: A novel fpga-centric deep neural network quantization framework. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 208–220.
- [4] Hongzheng Chen, Jiahao Zhang, Yixiao Du, Shaojie Xiang, Zichao Yue, Niansong Zhang, Yaohui Cai, and Zhiru Zhang. 2024. Understanding the potential of FPGA-based spatial acceleration for large language model inference. ACM Transactions on Reconfigurable Technology and Systems (TRETS) (2024).
- [5] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 248-255.
- [6] Dana Diaconu, Yanyue Xie, Mehmet Gungor, Suranga Handagala, Xue Lin, and Miriam Leeser. 2023. Machine Learning Across Network-Connected FPGAs. In 2023 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 1–7.
- [7] Peiyan Dong, Mengshu Sun, Alec Lu, Yanyue Xie, Kenneth Liu, Zhenglun Kong, Xin Meng, Zhengang Li, Xue Lin, Zhenman Fang, et al. 2023. Heatvit: Hardware-efficient adaptive token pruning for vision transformers. In 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 442–455.
- [8] Daniel Gerlinghoff, Benjamin Chen Ming Choong, Rick Siow Mong Goh, Weng-Fai Wong, and Tao Luo. 2024. Table-Lookup MAC: Scalable Processing of Quantised Neural Networks in FPGA Soft Logic. In Proceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA).
- [9] Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W Mahoney, and Kurt Keutzer. 2022. A survey of quantization methods for efficient neural network inference. In *Low-Power Computer Vision*. Chapman and Hall/CRC, 291–326.
- [10] Song Han, Huizi Mao, and William J Dally. 2016. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. International Conference on Learning Representations (ICLR) (2016).
- [11] Martin Hardieck, Martin Kumm, Konrad Möller, and Peter Zipf. 2019. Reconfigurable convolutional kernels for neural networks on FPGAs. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA). 43–52.
- [12] Jingkai Hong, Arash Fayyazi, Amirhossein Esmaili, Mahdi Nazemi, and Massoud Pedram. 2023. Algorithms and Hardware for Efficient Processing of Logic-based Neural Networks. In Proceedings of the 60th ACM/IEEE Design Automation Conference (DAC). IEEE, 1–6.
- [13] Seongmin Hong, Seungjae Moon, Junsoo Kim, Sungjae Lee, Minsub Kim, Dongsoo Lee, and Joo-Young Kim. 2022. DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation. In 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 616–630.
- [14] Qing Jin, Linjie Yang, and Zhenyu Liao. 2020. Adabits: Neural network quantization with adaptive bit-widths. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2146–2156.
- [15] Alireza Khataei and Kia Bazargan. 2024. CompressedLUT: An Open Source Tool for Lossless Compression of Lookup Tables for Function Evaluation and Beyond. In Proceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA).
- [16] Bingbing Li, Santosh Pandey, Haowen Fang, Yanjun Lyv, Ji Li, Jieyang Chen, Mimi Xie, Lipeng Wan, Hang Liu, and Caiwen Ding. 2020. Ftrans: energy-efficient acceleration of transformers using fpga. In Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED). 175–180.
- [17] Zhengang Li, Alec Lu, Yanyue Xie, Zhenglun Kong, Mengshu Sun, Hao Tang, Zhong Jia Xue, Peiyan Dong, Caiwen Ding, Yanzhi Wang, et al. 2024. Quasar-ViT: Hardware-Oriented Quantization-Aware Architecture Search for Vision Transformers. In Proceedings of the 38th ACM International Conference on Supercomputing (ICS). 324–337.
- [18] Zhengang Li, Mengshu Sun, Alec Lu, Haoyu Ma, Geng Yuan, Yanyue Xie, Hao Tang, Yanyu Li, Miriam Leeser, Zhangyang Wang, Xue Lin, and Zhenman Fang. 2022. Auto-vit-acc: An fpga-aware automatic acceleration framework for vision transformer with mixed-scheme quantization. In 2022 32nd International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 109–116.
- [19] Nvidia. 2017. Nvidia Tesla V100 GPU Architecture Whitepaper. https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf. Last accessed Oct. 30, 2024.
- [20] Alessandro Pappalardo. 2023. Xilinx/brevitas. https://doi.org/10.5281/zenodo.3333552
- [21] Bita Darvish Rouhani, Ritchie Zhao, Ankit More, Mathew Hall, Alireza Khodamoradi, Summer Deng, Dhruv Choudhary, Marius Cornea, Eric Dellinger, Kristof Denolf, et al. 2023. Microscaling data formats for deep learning. arXiv preprint arXiv:2310.10537 (2023).
- [22] Sambanova Whitepaper. 2021. Accelerated Computing with a Reconfigurable Dataflow Architecture. https://sambanova.ai/wp-content/uploads/2021/04/SambaNova\_Accelerated-Computing-with-a-Reconfigurable-Dataflow-Architecture\_Whitepaper\_English.pdf. Last accessed Oct. 30, 2024.
- [23] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 4510–4520.
- [24] Mengshu Sun, Zhengang Li, Alec Lu, Yanyu Li, Sung-En Chang, Xiaolong Ma, Xue Lin, and Zhenman Fang. 2022. Film-qnn: Efficient fpga acceleration of deep neural networks with intra-layer, mixed-precision quantization. In Proceedings of the 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA). 134–145.
- [25] Yaman Umuroglu, Yash Akhauri, Nicholas James Fraser, and Michaela Blott. 2020. LogicNets: Co-designed neural networks and circuits for extreme-throughput applications. In 2020 30th International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 291–297.
- [26] Yaman Umuroglu, Nicholas J Fraser, Giulio Gambardella, Michaela Blott, Philip Leong, Magnus Jahre, and Kees Vissers. 2017. Finn: A framework for fast, scalable binarized neural network inference. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA). 65–74.
- [27] Yaman Umuroglu and Magnus Jahre. 2017. Streamlined deployment for quantized neural networks. arXiv preprint arXiv:1709.04060 (2017).
- [28] Erwei Wang, James J Davis, Peter YK Cheung, and George A Constantinides. 2019. LUTNet: Rethinking inference in FPGA soft logic. In 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 26–34.
- [29] Erwei Wang, James J Davis, Georgios-Ilias Stavrou, Peter YK Cheung, George A Constantinides, and Mohamed Abdelfattah. 2022. Logic shrinkage: Learned FPGA netlist sparsity for efficient neural network inference. In Proceedings of the 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA). 101–111.
- [30] Xuechao Wei, Cody Hao Yu, Peng Zhang, Youxiang Chen, Yuxin Wang, Han Hu, Yun Liang, and Jason Cong. 2017. Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs. In Proceedings of the 54th Annual Design Automation Conference (DAC). 1–6.
- [31] Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: an insightful visual performance model for multicore architectures. *Commun. ACM* 52, 4 (2009), 65–76.
- [32] Di Wu, Yu Zhang, Xijie Jia, Lu Tian, Tianping Li, Lingzhi Sui, Dongliang Xie, and Yi Shan. 2019. A high-performance CNN processor based on FPGA for MobileNets. In 2019 29th International Conference on Field Programmable Logic and Applications (FPL). 136–143.
- [33] AMD Xilinx. 2021. Alveo Product Selection Guide. https://docs.xilinx.com/v/u/en-US/alveo-product-selection-guide. Last accessed Oct. 30, 2024.
- [34] Shun Yan, Zhengyan Liu, Yun Wang, Chenglong Zeng, Qiang Liu, Bowen Cheng, and Ray CC Cheung. 2021. An fpga-based mobilenet accelerator considering network structure characteristics. In 2021 31st International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 17–23.
- [35] Geng Yang, Yanyue Xie, Zhong Jia Xue, Sung-En Chang, Yanyu Li, Peiyan Dong, Jie Lei, Weiying Xie, Yanzhi Wang, Xue Lin, and Zhenman Fang. 2024. SDA: Low-Bit Stable Diffusion Acceleration on Edge FPGAs. In 2024 34th International Conference on Field-Programmable Logic and Applications (FPL). 264–273.
- [36] Linjie Yang and Qing Jin. 2021. Fracbits: Mixed precision quantization via fractional bit-widths. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Vol. 35. 10612–10620.
- [37] Yunxuan Yu, Tiandong Zhao, Kun Wang, and Lei He. 2020. Light-OPU: An FPGA-based overlay processor for lightweight convolutional neural networks. In Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA). 122–132.
- [38] Shulin Zeng, Jun Liu, Guohao Dai, Xinhao Yang, Tianyu Fu, Hongyi Wang, Wenheng Ma, Hanbo Sun, Shiyao Li, Zixiao Huang, Yadong Dai, Jintao Li, Zehao Wang, Ruoyu Zhang, Kairui Wen, Xuefei Ning, and Yu Wang. 2024. FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGA. In Proceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA).
- [39] Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. 2015. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA). 161–170.
- [40] Michael Zink, David Irwin, Emmanuel Cecchet, Hakan Saplakoglu, Orran Krieger, Martin Herbordt, Michael Daitzman, Peter Desnoyers, Miriam Leeser, and Suranga Handagala. 2021. The Open Cloud Testbed (OCT): A platform for research into new cloud technologies. In 2021 IEEE 10th International Conference on Cloud Networking (CloudNet). IEEE, 140–147.