LLM Production

Following the classification used in Hugging Face's Optimize LLM in Production [1], this note organizes the various components needed for production.

Lower Precision
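
As a rough illustration of why lower precision matters: weight memory scales linearly with bytes per parameter. A back-of-the-envelope sketch (the 7B parameter count is an assumption for illustration, not from the source; real deployments also need memory for activations and the KV cache):

```python
def weight_memory_gb(n_params: int, bytes_per_param: int) -> float:
    """Memory needed to hold the model weights alone, in GB."""
    return n_params * bytes_per_param / 1024**3

N = 7_000_000_000  # hypothetical 7B-parameter model

fp32 = weight_memory_gb(N, 4)  # float32: 4 bytes per parameter
fp16 = weight_memory_gb(N, 2)  # float16/bfloat16: 2 bytes
int8 = weight_memory_gb(N, 1)  # 8-bit quantization: 1 byte

print(f"fp32: {fp32:.1f} GB, fp16: {fp16:.1f} GB, int8: {int8:.1f} GB")
```

Halving the precision halves the weight footprint, which is often the difference between fitting on one GPU or needing several.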

Flash Attention

  • flash-attention [2]
  • Flash-Decoding for long-context inference [3]
  • PagedAttention: vLLM
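
A core trick shared by the approaches above is computing softmax incrementally, so the full attention score matrix never has to be materialized at once. A minimal pure-Python sketch of that online-softmax recurrence (function names are mine, not from the libraries listed):

```python
import math

def online_softmax_stats(scores):
    """One pass over the scores, keeping a running max m and a running
    sum s of exponentials rescaled to that max. This is the recurrence
    FlashAttention applies tile by tile."""
    m = float("-inf")
    s = 0.0
    for x in scores:
        m_new = max(m, x)
        # Rescale the old sum to the new max, then add the new term.
        s = s * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return m, s

def softmax(scores):
    m, s = online_softmax_stats(scores)
    return [math.exp(x - m) / s for x in scores]

print(softmax([1.0, 2.0, 3.0]))
```

Because every exponential is taken relative to the running max, the computation stays numerically stable even for large logits, while only constant extra state (m, s) is carried between tiles.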

Architectural Innovations

  • Improving positional embeddings of LLMs
  • The key-value cache
    • Multi-Query Attention (MQA)
    • Grouped-Query Attention (GQA)

Last Modified: 2023/11/23 16:11:13

This site is a collection of papers I have written.
© 2000 - Sang-Kil Park Except where otherwise noted, content on this site is licensed under a CC BY 4.0.
This site design was adapted from Distill.