Optimizing Model Serving on Kubernetes With Model Streaming – Ekin Karabulut & Ronen Dar, Run:ai
Deploying large language models in Kubernetes environments faces a critical challenge: the cold start problem. When auto-scaling workloads with tools like Knative, the latency of loading large model weights into GPU memory slows response times, degrades performance, and increases costs. Traditional approaches load weights sequentially into CPU memory and then transfer them to the GPU, which is slow and inefficient.

This talk introduces Run:ai Model Streamer, an open-source tool that mitigates cold starts by streaming model weights to GPU memory while reading them from storage in parallel. It integrates seamlessly into inference engine containers and Kubernetes workflows, enabling parallelized weight streaming without modifying weight formats, making it an easy-to-adopt solution for Kubernetes-based AI deployments. We’ll share benchmarking results comparing storage backends like GP3 SSDs, IO2 SSDs, and S3, highlighting performance improvements, cost savings, and best practices from these experiments.
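To make the core idea concrete, here is a minimal sketch of that parallel-streaming pattern in PyTorch. It is not the Run:ai Model Streamer implementation; the shard layout, the use of torch.load, and the reader count are assumptions made for illustration only.

```python
# A minimal sketch of the general technique the talk describes: overlap storage
# reads with host-to-device copies instead of loading weights sequentially.
# This is NOT the Run:ai Model Streamer API -- the shard layout, worker count,
# and use of torch.load are assumptions made for illustration only.
import concurrent.futures as cf
from pathlib import Path

import torch

def load_shard_cpu(path: Path) -> dict[str, torch.Tensor]:
    """Reader task: pull one weight shard from storage into pinned CPU memory.

    Assumes each shard was saved with torch.save as a dict of tensors.
    Pinned (page-locked) memory is what enables asynchronous GPU copies.
    """
    tensors = torch.load(path, map_location="cpu")
    return {name: t.pin_memory() for name, t in tensors.items()}

def stream_weights(shard_paths: list[Path], device: str = "cuda",
                   readers: int = 4) -> dict[str, torch.Tensor]:
    """Copy finished shards to the GPU while other shards are still being read."""
    state_dict: dict[str, torch.Tensor] = {}
    copy_stream = torch.cuda.Stream()  # dedicated stream for H2D transfers
    with cf.ThreadPoolExecutor(max_workers=readers) as pool:
        futures = [pool.submit(load_shard_cpu, p) for p in shard_paths]
        for fut in cf.as_completed(futures):
            # Copy each finished shard to the GPU; the other reads keep running.
            with torch.cuda.stream(copy_stream):
                for name, t in fut.result().items():
                    state_dict[name] = t.to(device, non_blocking=True)
    torch.cuda.synchronize()  # wait for all async copies before using weights
    return state_dict
```

In a sketch like this, the reader count is the knob you would tune per storage backend; backends such as GP3, IO2, and S3 differ mainly in how much read parallelism they reward, which is what the benchmarking results in the talk explore.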