Publications

You can also find my articles on my Google Scholar profile.

DAC 2025: LLM-Barber: Block-Aware Rebuilder for Sparsity Mask in One-Shot for Large Language Models

Published in Proceedings of the 62nd ACM/IEEE Design Automation Conference, 2025

Large language models (LLMs) have grown significantly in scale, leading to a critical need for efficient model pruning techniques. Existing post-training pruning techniques primarily focus on measuring weight importance on converged dense models to determine salient weights to retain. However, they often overlook the changes in weight importance during the pruning process, which can lead to performance degradation in the pruned models. To address this issue, we present LLM-Barber (Block-Aware Rebuilder for Sparsity Mask in One-Shot), a novel one-shot pruning framework that rebuilds the sparsity mask of pruned models without any retraining or weight reconstruction. LLM-Barber incorporates block-aware error optimization across Self-Attention and MLP blocks, ensuring global performance optimization. Inspired by the recent discovery of prominent outliers in LLMs, LLM-Barber introduces an innovative pruning metric that identifies weight importance using weights multiplied by gradients. Our experiments show that LLM-Barber can efficiently prune models like LLaMA and OPT families with 7B to 13B parameters on a single A100 GPU in just 30 minutes, achieving state-of-the-art results in both perplexity and zero-shot performance across various language benchmarks. Code is available at https://github.com/YupengSu/LLM-Barber.

Download Paper
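
As a rough illustration of the weight-times-gradient importance metric mentioned in the LLM-Barber abstract above, the sketch below scores each weight by |weight × gradient| and prunes the lowest-scoring fraction. It is not the LLM-Barber implementation (the block-aware mask rebuilding is not shown); the function names `importance_scores` and `build_mask` and the toy values are assumptions made for illustration only.

```python
def importance_scores(weights, grads):
    """Score each weight by |weight * gradient| (illustrative metric only)."""
    return [abs(w * g) for w, g in zip(weights, grads)]


def build_mask(weights, grads, sparsity):
    """Return a 0/1 mask that prunes the `sparsity` fraction with the lowest scores."""
    scores = importance_scores(weights, grads)
    n_prune = int(len(scores) * sparsity)
    # Indices ordered from least to most important.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(scores))]


# Hypothetical toy values, not taken from any real model.
weights = [0.5, -2.0, 0.1, 1.5]
grads = [0.2, 0.01, 3.0, -0.4]
mask = build_mask(weights, grads, sparsity=0.5)
```

In this toy case the largest-magnitude weight (-2.0) is pruned because its gradient is tiny, which is where such a metric departs from magnitude-only pruning.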

DAC 2025: A Highly Energy-Efficient Binary BERT Model on Group Vector Systolic CIM Accelerator

Published in Proceedings of the 62nd ACM/IEEE Design Automation Conference, 2025

Transformer-based large language models (LLMs) impose significant bandwidth and compute challenges when deployed on edge devices. SRAM-based compute-in-memory (CIM) accelerators offer a promising solution to reduce data movement but are still limited by model size. This work develops a ternary weight splitting (TWS) binarization to obtain Brain-Floating-Point-16xINT1 (BF16×1-b) based transformers that exhibit competitive accuracy while significantly reducing model size compared to full precision counterparts. Then, a fully digital SRAM-based CIM accelerator is designed incorporating a bit-parallel SRAM macro within a highly efficient group vector systolic architecture, which can store one column of BERT-Tiny model with stationary systolic data reuse. The design in a 28 nm technology only requires 2 KB SRAM with an area of 2 mm². It achieves a throughput of 6.55 TOPS and consumes a total power of 419.74 mW, resulting in a state-of-the-art area efficiency of 3.3 TOPS/mm² and normalized energy efficiency of 20.98 TOPS/W for BERT-Tiny model, demonstrating a 10.25× improvement in area efficiency and a 2.23× improvement in energy efficiency compared to other state-of-the-art counterparts. Additionally, our proposed configuration compresses the model size by 32% with only a 0.5% accuracy loss on SST-2.

Download Paper
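
The efficiency figures quoted in the abstract above can be partly checked with simple arithmetic. The sketch below recomputes the area efficiency from the stated throughput and area, and computes the raw (un-normalized) throughput-per-watt ratio. The abstract's 20.98 TOPS/W figure is described as normalized and the normalization is not given here, so the raw ratio is not expected to match it. The variable names are illustrative.

```python
# Figures as quoted in the abstract.
throughput_tops = 6.55   # TOPS
area_mm2 = 2.0           # mm^2
power_w = 419.74e-3      # 419.74 mW expressed in watts

# Area efficiency: throughput per unit area (rounds to the quoted 3.3 TOPS/mm^2).
area_eff = throughput_tops / area_mm2

# Raw energy efficiency: throughput per watt, before any normalization.
raw_energy_eff = throughput_tops / power_w
```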

DAC 2024: APTQ: Attention-aware Post-Training Mixed-Precision Quantization for Large Language Models

Published in 61st IEEE/ACM Design Automation Conference (DAC), San Francisco, CA, 2024

Large Language Models (LLMs) have greatly advanced the natural language processing paradigm. However, their high computational load and huge model sizes pose a grand challenge for deployment on edge devices. To this end, we propose APTQ (Attention-aware Post-Training Mixed-Precision Quantization) for LLMs, which considers not only the second-order information of each layer’s weights, but also, for the first time, the nonlinear effect of attention outputs on the entire model. We leverage the Hessian trace as a sensitivity metric for mixed-precision quantization, ensuring an informed precision reduction that retains model performance. Experiments show that APTQ surpasses previous quantization methods, achieving a perplexity of 5.22 on the C4 dataset at an average bit width of 4, nearly equivalent to full precision. In addition, APTQ attains state-of-the-art zero-shot accuracy of 68.24% and 70.48% at an average bit width of 3.8 on LLaMa-7B and LLaMa-13B, respectively, demonstrating its effectiveness in producing high-quality quantized LLMs.

Recommended citation: Ziyi Guan, Hantao Huang, Yupeng Su, Hong Huang, Ngai Wong and Hao Yu, “APTQ: Attention-aware Post-Training Mixed-Precision Quantization for Large Language Models”, In Proceedings of DAC 2024: 61st IEEE/ACM Design Automation Conference, San Francisco, CA, June 23-27, 2024
Download Paper | Download Slides
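
To make the idea of a sensitivity-driven mixed-precision assignment in the APTQ abstract above more concrete, here is a minimal sketch that ranks layers by a per-layer sensitivity value (standing in for a Hessian trace) and gives the most sensitive layers the higher bit width. This is not the APTQ algorithm itself: the function names, the two-level bit choice, the `high_fraction` parameter, and all numbers are assumptions for illustration.

```python
def allocate_bits(traces, high_bits, low_bits, high_fraction):
    """Assign `high_bits` to the most sensitive layers (largest trace), `low_bits` to the rest."""
    n_high = round(len(traces) * high_fraction)
    # Layer indices ordered from most to least sensitive.
    ranked = sorted(range(len(traces)), key=lambda i: traces[i], reverse=True)
    high = set(ranked[:n_high])
    return [high_bits if i in high else low_bits for i in range(len(traces))]


def average_bits(bits, sizes):
    """Parameter-weighted average bit width across layers."""
    return sum(b * s for b, s in zip(bits, sizes)) / sum(sizes)


# Hypothetical per-layer sensitivities and parameter counts.
traces = [10.0, 1.0, 5.0, 0.1]
sizes = [100, 100, 100, 100]
bits = allocate_bits(traces, high_bits=4, low_bits=2, high_fraction=0.5)
avg = average_bits(bits, sizes)
```

Raising `high_fraction` moves the average bit width toward `high_bits`, which is the kind of trade-off an average-bit-width target expresses.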

DATE 2024: An Isotropic Shift-Pointwise Network for Crossbar-Efficient Neural Network Design

Published in 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2024

Resistive random-access memory (RRAM), with its programmable and nonvolatile conductance, permits compute-in-memory (CIM) at a much higher energy efficiency than the traditional von Neumann architecture, making it a promising candidate for edge AI. Nonetheless, the fixed-size crossbar tiles on RRAM are inherently unfit for conventional pyramid-shape convolutional neural networks (CNNs) that incur low crossbar utilization. To this end, we recognize the mixed-signal (digital-analog) nature in RRAM circuits and customize an isotropic shift-pointwise network that exploits digital shift operations for efficient spatial mixing and analog pointwise operations for channel mixing. To rapidly ablate various shift-pointwise topologies, a new reconfigurable energy-efficient shift module is designed and packaged into a seamless mixed-domain simulator. The optimized design achieves a near-100% crossbar utilization, providing a state-of-the-art INT8 accuracy of 94.88% (76.55%) on the CIFAR-10 (CIFAR-100) dataset with 1.6M parameters, which sets a new standard for RRAM-based AI accelerators.

Recommended citation: Ziyi Guan, Boyu Li, Yuan Ren, Muqun Niu, Hantao Huang, Graziano Chesi, Hao Yu and Ngai Wong, “An Isotropic Shift-Pointwise Network for Crossbar-Efficient Neural Network Design”, Design, Automation & Test in Europe Conference & Exhibition (DATE), March 25, Valencia, 2024
Download Paper | Download Slides

DATE 2024: FMTT: Fused Multi-head Transformer with Tensor-compression for 3D Point Clouds Detection on Edge Devices

Published in 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2024

The real-time detection of 3D objects represents a grand challenge on edge devices. Existing 3D point cloud models are over-parameterized with a heavy computation load. This paper proposes a highly compact model for 3D point cloud detection using tensor compression. In contrast to conventional methods, we propose a fused multi-head transformer with tensor compression (FMTT) to achieve both compact size and high accuracy. The FMTT leverages different ranks to extract both high- and low-level features and then fuses them to improve accuracy. Experiments on the KITTI dataset show that the proposed FMTT is 6.04× smaller than the uncompressed model (55.09 MB reduced to 9.12 MB), so that the compressed model can be implemented on edge devices. It also improves accuracy by 2.62% in easy mode and 0.28% in hard mode.

Recommended citation: Zikun Wei, Tingting Wang, Chenchen Ding, Bohan Wang, Ziyi Guan, Hantao Huang, and Hao Yu, “FMTT: Fused Multi-head Transformer with Tensor-compression for 3D Point Clouds Detection on Edge Devices”, Design, Automation & Test in Europe Conference & Exhibition (DATE), March 25, Valencia, 2024.
Download Paper | Download Slides
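
The compression ratio reported in the FMTT abstract above follows directly from the two model sizes it quotes. A quick check, with illustrative variable names:

```python
# Model sizes as quoted in the abstract, in megabytes.
uncompressed_mb = 55.09
compressed_mb = 9.12

# Compression ratio: how many times smaller the compressed model is.
compression_ratio = uncompressed_mb / compressed_mb
```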

ICSICT 2022: A Video-based Fall Detection Network by Spatio-temporal Joint-point Model on Edge Devices

Published in 2022 IEEE 16th International Conference on Solid-State & Integrated Circuit Technology (ICSICT), 2022

Traditional neural networks deployed on CPU/GPU architectures have achieved impressive results on various AI tasks. However, the growing model sizes and intensive computation have presented stringent challenges for deployment on edge devices with restrictive compute and storage resources. This paper proposes a one-shot training-evaluation framework to solve the neural architecture search (NAS) problem for in-memory computing, targeting the emerging resistive random-access memory (RRAM) analog AI platform. We test inference accuracy and hardware performance of subnets sampled in different dimensions of a pretrained supernet. Experiments show that the proposed one-shot hardware-aware NAS (HW-NAS) framework can effectively explore the Pareto front considering both accuracy and hardware performance, and generate more optimal models via morphing a standard backbone model.

Recommended citation: Ziyi Guan, Shuwei Li, Yuan Cheng, Changhai Man, Wei Mao, Ngai Wong, and Hao Yu, “A Video-based Fall Detection Network by Spatio-temporal Joint-point Model on Edge Devices”, Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2021, pp. 422–427
Download Paper

DATE 2021: A Video-based Fall Detection Network by Spatio-temporal Joint-point Model on Edge Devices

Published in 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2021

Tripping or falling is among the top threats in elderly healthcare, and the development of automatic fall detection systems is of considerable importance. With the fast development of the Internet of Things (IoT), camera vision-based solutions have drawn much attention in recent years. However, traditional fall video analysis on the cloud has significant communication overhead. This work introduces a fast and lightweight video fall detection network based on a spatio-temporal joint-point model to overcome these hurdles. Instead of detecting falling motion with traditional Convolutional Neural Networks (CNNs), we propose a Long Short-Term Memory (LSTM) model based on time-series joint-point features, extracted by a pose extractor and then filtered by a geometric joint-point filter. Experiments conducted to verify the proposed framework show a high sensitivity of 98.46% on the Multiple Cameras Fall Dataset and 100% on the UR Fall Dataset. Furthermore, our model can perform pose estimation simultaneously, attaining 73.3 mAP on the COCO keypoint challenge dataset, which outperforms the OpenPose work by 8%.