Did you know that an estimated 90% of ML models never make it into production? Even among the few that do, many face critical challenges such as slow inference and limited device memory. ML models are typically trained on high-performance GPUs but then deployed to comparatively low-resource hardware to reduce cloud computing costs. This is where model optimization comes in: shrinking models and speeding them up without compromising performance. By optimizing these models, we can reduce their size, improve inference speed, and ultimately make them more scalable for wider adoption. In this AI webinar, we'll dive into practical strategies for optimizing ML models, including quantization, pruning, distillation, and KV cache compression.
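To make one of these techniques concrete, here is a minimal, illustrative sketch of symmetric int8 quantization in plain Python. The function names and example weights are ours, not from any specific framework; real frameworks apply the same idea per tensor or per layer.

```python
# Illustrative sketch of symmetric int8 quantization: map float32
# weights to 8-bit integers plus a single scale factor, cutting
# storage roughly 4x at the cost of small rounding errors.

def quantize_int8(weights):
    """Quantize a list of floats to int8 values plus a scale factor."""
    scale = max(abs(w) for w in weights) / 127  # largest weight maps to +/-127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in q]

weights = [0.82, -1.27, 0.03, 0.54]
q, scale = quantize_int8(weights)   # q holds small integers, 1 byte each
approx = dequantize(q, scale)       # close to, but not exactly, the originals
```

Each quantized value fits in one byte instead of four, and the per-weight error is bounded by the scale factor; this trade-off between footprint and precision is exactly what the webinar's quantization segment explores in depth.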
Once the model is optimized and deployed, we can serve it through APIs such as REST and gRPC, making it accessible and functional for applications and end users. In the webinar, we'll explore the backend infrastructure and components required for model serving, along with techniques to optimize model building and serving. You'll also discover how to keep your ML models highly scalable and capable using parallelism and scaling strategies.
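As a taste of the serving side, the sketch below exposes a prediction endpoint over REST using only the Python standard library. The `predict` function is a placeholder for a real model, and the route and payload shape are our assumptions; a production deployment would typically use a dedicated serving framework and add batching, auth, and monitoring.

```python
# Hedged, stdlib-only sketch of a REST prediction endpoint.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features):
    # Stand-in for the real model: a fixed linear scorer (illustrative only).
    weights = [0.4, -0.2, 0.1]
    return sum(w * x for w, x in zip(weights, features))

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404, "unknown route")
            return
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length))
        payload = json.dumps({"score": predict(body["features"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):
        pass  # keep the demo quiet

# To start serving (blocks the current thread):
# HTTPServer(("0.0.0.0", 8000), PredictHandler).serve_forever()
```

A client would then POST `{"features": [...]}` to `/predict` and receive a JSON score back; gRPC serving follows the same request/response pattern with a binary protocol and a schema defined in protobuf.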
Starting with a pre-trained model from Hugging Face, we will walk you through optimizing it for efficient inference and deploying it to a custom AI lab environment. Our AI experts will discuss the challenges of optimizing models for inference, compare different optimization techniques, and show how to improve the performance of ML models in production. We will also cover making the deployed model accessible and scalable while balancing speed, accuracy, and resource efficiency, all in a live demo.
A manual tester turned developer advocate, Atul talks about Cloud Native, Kubernetes, AI, and MLOps to help other developers and organizations adopt cloud native technologies. He is also a CNCF Ambassador and the organizer of CNCF Hyderabad.
Aman specializes in AI Cloud solutions and cloud native design, bringing extensive expertise in containerization, microservices, and serverless computing. He currently focuses on exploring AI Cloud technologies and developing AI applications using cloud native architectures.
Sanket Sudake specializes in AI Cloud initiatives and building cloud-native platforms. He is a Fission Serverless platform maintainer with deep expertise in distributed systems, containers, and cloud environments.
Leverage our AI stack charts to empower your team with faster, more efficient AI service deployment on Kubernetes.