Efficient Compression of Large Language Model Layers for Faster Inference

by Benjamin Mora

Departments: Computer Science
Description: This project will explore techniques for compressing the layers of large language models (LLMs) while preserving performance. By leveraging methods such as low-rank decomposition, quantization, and pruning, the goal is to reduce model size and computational demands without significantly affecting accuracy. The project involves benchmarking different compression strategies in terms of inference speed, memory usage, and output quality. Potential applications include deploying LLMs on resource-constrained devices and optimizing cloud-based inference for cost efficiency. This project requires significant computational power, so a capable computer with an efficient graphics card is recommended; our lab can provide access to specialized GPU hardware if needed.
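
Below is a minimal sketch of the low-rank decomposition idea mentioned in the description, assuming PyTorch; the function name low_rank_factorize, the 4096 x 4096 layer size, and the rank of 256 are illustrative choices, not part of the project brief. A dense linear layer is replaced by two thinner ones obtained from a truncated SVD of its weight matrix:

    import torch
    import torch.nn as nn

    def low_rank_factorize(layer: nn.Linear, rank: int) -> nn.Sequential:
        # Approximate the (out x in) weight W by B @ A, where B is
        # (out x rank) and A is (rank x in), using a truncated SVD.
        W = layer.weight.data
        U, S, Vh = torch.linalg.svd(W, full_matrices=False)
        sqrt_S = S[:rank].sqrt()
        A = sqrt_S.unsqueeze(1) * Vh[:rank, :]   # down-projection factor (rank x in)
        B = U[:, :rank] * sqrt_S.unsqueeze(0)    # up-projection factor (out x rank)
        down = nn.Linear(layer.in_features, rank, bias=False)
        up = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
        down.weight.data.copy_(A)
        up.weight.data.copy_(B)
        if layer.bias is not None:
            up.bias.data.copy_(layer.bias.data)
        return nn.Sequential(down, up)

    # Illustrative usage: compress one projection layer and inspect the error.
    layer = nn.Linear(4096, 4096)
    compressed = low_rank_factorize(layer, rank=256)
    x = torch.randn(1, 4096)
    print((layer(x) - compressed(x)).abs().max())

At rank 256 the two factors hold 256 x (4096 + 4096), roughly 2.1M parameters, versus about 16.8M in the original weight, an approximately 8x reduction; the benchmarking part of the project then asks how much inference speed is gained and how much output quality is lost at a given rank.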
Preparation: Please read the original Transformer paper, "Attention Is All You Need" (https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf), and the overview at https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture), and have a look at this very nice visualisation: https://bbycroft.net/llm
Project Categories: Artificial Intelligence (AI), Data Science
Project Keywords: Machine Learning, Neural Networks, Optimisation


Level of Studies

Level 6 (Undergraduate Year 3): yes
Level 7 (Masters): yes
Level 8 (PhD): yes