Fangyi MOU

Welcome to my personal webpage ↖(^ω^)↗


Prototype - Python
2024
Scalable and Efficient Load Balancing and Task Scheduling for Large Language Model Inference Deployment on Kubernetes

In recent years, the scale of large language models (LLMs) has grown exponentially, from millions of parameters to billions. This growth has been accompanied by a significant shift towards multi-modal learning, which enhances the ability of LLMs to understand and generate various forms of data, including text, images, audio, and video. While these advances improve the quality of LLM outputs, they also introduce considerable delays during the inference stage. Although techniques such as model pruning, quantization, and knowledge distillation are being explored to reduce the computational burden without significantly compromising model performance, challenges remain when deploying LLMs on edge computing devices, chiefly because of limited resources and dynamically changing user request loads. Current edge computing solutions for LLM inference struggle to handle user requests efficiently, meet latency requirements, and scale resources up or down quickly in response to fluctuating edge loads.
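
Below is a minimal Python sketch of one possible policy such a deployment could use: least-loaded routing across model-serving replicas, with simple threshold-based scale-up/scale-down hints. All names here (Replica, LeastLoadedRouter, the thresholds) are hypothetical illustrations under assumed conditions, not the prototype's actual implementation.

from dataclasses import dataclass

@dataclass
class Replica:
    """One hypothetical model-serving pod; tracks its in-flight requests."""
    name: str
    in_flight: int = 0

class LeastLoadedRouter:
    """Send each request to the replica with the fewest in-flight requests
    and suggest scaling when the average load crosses simple thresholds."""

    def __init__(self, replicas, scale_up_at=8.0, scale_down_at=2.0):
        self.replicas = replicas
        self.scale_up_at = scale_up_at      # average in-flight requests per replica
        self.scale_down_at = scale_down_at

    def route(self):
        # Pick the least-loaded replica and account for the new request.
        target = min(self.replicas, key=lambda r: r.in_flight)
        target.in_flight += 1
        return target

    def complete(self, replica):
        # Call when a request finishes so the load estimate stays accurate.
        replica.in_flight = max(0, replica.in_flight - 1)

    def scaling_decision(self):
        avg = sum(r.in_flight for r in self.replicas) / len(self.replicas)
        if avg > self.scale_up_at:
            return "scale_up"
        if avg < self.scale_down_at and len(self.replicas) > 1:
            return "scale_down"
        return "hold"

if __name__ == "__main__":
    router = LeastLoadedRouter([Replica(f"llm-pod-{i}") for i in range(3)])
    for _ in range(30):                     # simulate a burst of 30 concurrent requests
        router.route()
    print(router.scaling_decision())        # "scale_up": average load is 10 per replica

In an actual Kubernetes deployment, the scaling hint would typically be acted on by patching the Deployment's replica count or by feeding a custom metric to an autoscaler, rather than by the router itself.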