-
Mar 11, 2025
[Review] Multi-head Latent Attention
This post reviews Multi-head Attention (MHA), Grouped-Query Attention (GQA), and Multi-head Latent Attention (MLA).
-
Jul 11, 2024
[ICML24] LayerMerge
We propose LayerMerge, a novel depth compression method that jointly selects which activation layers and convolution layers to remove, achieving a desired inference speed-up while minimizing performance loss.
-
Sep 11, 2023
[ICML23] Efficient CNN Depth Compression
We formulate a subset selection problem that replaces inefficient activation layers with identity functions and optimally merges consecutive convolution operations into shallower equivalent convolutions, reducing inference latency.