  • [ICML 2025] GuidedQuant
    We propose GuidedQuant, a novel quantization approach that integrates gradient information from the end loss into the layer-wise quantization objective. We also introduce LNQ, a non-uniform scalar quantization algorithm that is guaranteed to monotonically decrease the quantization objective value. (A toy sketch of a gradient-weighted layer-wise objective follows this list.)
  • [Review] Multi-head Latent Attention
    This post reviews Multi-head Self-attention (MHA), Grouped-Query Attention (GQA), and Multi-head Latent Attention (MLA). (A minimal GQA sketch follows this list.)
  • [ICML 2024] LayerMerge
    We propose LayerMerge, a novel depth compression method that jointly selects which activation layers and convolution layers to remove in order to achieve a desired inference speed-up while minimizing performance loss. (A toy version of this selection problem is sketched after this list.)
  • [ICML 2023] Efficient CNN Depth Compression
    We formulate a subset selection problem that replaces inefficient activation layers with identity functions and optimally merges the resulting consecutive convolution operations into shallower, equivalent convolutions, reducing inference latency. (The merging step is sketched after this list.)
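
Below is a toy sketch, in PyTorch, of what a gradient-weighted layer-wise quantization objective can look like. It is not the GuidedQuant objective itself; the function name, tensor shapes, and the simple per-output squared-gradient weighting are illustrative assumptions.

```python
# Toy sketch (not the official GuidedQuant objective): a layer-wise output
# reconstruction error in which each error term is re-weighted by the squared
# gradient of the end loss w.r.t. that output, so errors on outputs that
# matter more for the end loss are penalized more.
import torch

def gradient_weighted_layer_objective(W, W_q, X, G):
    """
    W   : (out, in)  original layer weight
    W_q : (out, in)  candidate quantized weight
    X   : (n, in)    calibration inputs to the layer
    G   : (n, out)   gradients of the end loss w.r.t. the layer outputs
    """
    E = X @ (W - W_q).T                 # (n, out) output error caused by quantization
    return ((G ** 2) * (E ** 2)).sum()  # saliency-weighted reconstruction error
```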
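For the attention review, here is a minimal Grouped-Query Attention sketch in PyTorch: n_q query heads share n_kv < n_q key/value heads (MHA is the n_kv == n_q case, MQA the n_kv == 1 case). The tensor shapes and helper name are my own; masking and the output projection are omitted.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    """
    q: (batch, n_q,  seq, d)   query heads
    k: (batch, n_kv, seq, d)   shared key heads   (n_kv divides n_q)
    v: (batch, n_kv, seq, d)   shared value heads
    """
    n_q, n_kv = q.shape[1], k.shape[1]
    group = n_q // n_kv
    # Every group of `group` query heads attends to the same K/V head.
    k = k.repeat_interleave(group, dim=1)              # (batch, n_q, seq, d)
    v = v.repeat_interleave(group, dim=1)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v               # (batch, n_q, seq, d)
```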
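The selection step in LayerMerge can be pictured as a constrained subset-selection problem. The brute-force toy below (hypothetical per-layer numbers, not the paper's actual formulation or algorithm) just makes the trade-off concrete: remove layers whose combined latency saving reaches a target while keeping the total importance of the removed layers as small as possible.

```python
from itertools import product

def select_layers_to_drop(latency_saved, importance, target_saving):
    """Brute force over subsets (fine for a toy example): return the layer
    indices to drop whose total latency saving meets the target with the
    smallest total importance."""
    n = len(latency_saved)
    best, best_cost = None, float("inf")
    for mask in product([0, 1], repeat=n):
        saving = sum(s for s, m in zip(latency_saved, mask) if m)
        cost = sum(c for c, m in zip(importance, mask) if m)
        if saving >= target_saving and cost < best_cost:
            best, best_cost = [i for i, m in enumerate(mask) if m], cost
    return best

# Hypothetical per-layer latency savings and importance scores.
print(select_layers_to_drop([3, 2, 5, 1], [0.9, 0.2, 0.6, 0.1], target_saving=6))  # [2, 3]
```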
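Both depth-compression papers rely on the fact that once the nonlinearity between two convolutions is removed, the two layers compose into a single equivalent convolution. The sketch below shows the simplest case, folding a 1x1 convolution into the following kxk convolution; it assumes stride 1 and no biases, and is an illustration rather than the papers' general merging procedure.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def merge_pointwise_into_next(conv1: nn.Conv2d, conv2: nn.Conv2d) -> nn.Conv2d:
    """Fold a 1x1 conv followed (with no nonlinearity in between) by a kxk conv
    into a single kxk conv with the same input/output behaviour.
    Assumes stride 1 and no bias on either layer, for simplicity."""
    assert conv1.kernel_size == (1, 1) and conv1.bias is None and conv2.bias is None
    w1 = conv1.weight.squeeze(-1).squeeze(-1)         # (c_mid, c_in)
    w2 = conv2.weight                                 # (c_out, c_mid, k, k)
    merged_w = torch.einsum('omhw,mi->oihw', w2, w1)  # (c_out, c_in, k, k)
    merged = nn.Conv2d(conv1.in_channels, conv2.out_channels,
                       conv2.kernel_size, padding=conv2.padding, bias=False)
    merged.weight.copy_(merged_w)
    return merged

# Quick numerical check of the equivalence.
x = torch.randn(2, 8, 16, 16)
c1 = nn.Conv2d(8, 4, 1, bias=False)
c2 = nn.Conv2d(4, 6, 3, padding=1, bias=False)
m = merge_pointwise_into_next(c1, c2)
print(torch.allclose(m(x), c2(c1(x)), atol=1e-5))     # True
```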