vLLM Workshop at MLOps Community Agent World: Hands-On Learning with Cast.ai
Introduction: My Journey with LLMs
As the GenAI wildfire spread across private organizations during my time at Education First in late 2022, I found myself captivated by developments from frontier labs. The impressive progress of the GPT model family drew me to OpenAI early on, and I prototyped various LLM-based applications against its frontier models through paid API access.
At the time, skeptics dismissed “wrapper companies” as unsustainable. However, as model intelligence continued scaling while API costs decreased, a clear truth emerged: if you build a product that solves genuine customer pain points, you have both a viable business and a defensible moat.
Early Experiments: From APIs to Local Hosting
The Text-to-SQL Era
In the early days of GPT-3.5-powered ChatGPT, we prototyped text-to-SQL solutions (perhaps worth a standalone post). It was a painful process of prompt engineering and hoping for the best once the LLM was running in production. The industry quickly shifted from “vibes” to rigorous evaluations, though we were admittedly still in the vibes camp.
When GPT’s multimodal capabilities emerged, I experimented with two projects:
- TikTok video analysis - Understanding viral videos to extract underlying scripts and SOPs across different geographic locations
- Shoe defect detection - Identifying manufacturing faults from product photos
Random projects, I know, but they were invaluable learning experiences with the APIs.
Local LLM Hosting Adventures
When llama.cpp and Ollama gained traction, I experimented with hosting 7B models on my MacBook. While it was thrilling to witness the Metal GPU humming away and feel the momentum of the open-source community, the models lacked the intelligence for practical applications.
The one exception: we gained access to a beefy Mac machine running Llama 70B. I tried using it for coding questions but quickly pivoted back to ChatGPT or Claude, admitting defeat.
Why This Workshop Matters
The Open Source LLM Renaissance
Where does the open-source community stand now? Models like DeepSeek apparently surpass frontier closed models on certain benchmarks—perhaps the time has finally arrived. Simultaneously, my company is exploring the possibility of hosting our own models for cost optimization and data privacy.
This workshop, hosted by Cast.ai at the MLOps Community Agent World event, presented the perfect opportunity to acquire hands-on skills for hosting open-source LLMs via vLLM on Kubernetes.
About Cast.ai
Cast.ai is a Kubernetes automation platform that optimizes cloud costs and improves performance through intelligent resource management. Their platform is particularly well-suited for AI/ML workloads, making them an ideal partner for this vLLM workshop. Cast.ai helps organizations automatically optimize their Kubernetes clusters, which is crucial when running expensive GPU workloads for LLM inference.
Cast.ai’s Kubernetes optimization platform - perfect for managing GPU-intensive LLM workloads
Key Takeaways from the vLLM Workshop
vLLM: The high-performance inference engine for large language models
1. vLLM: The Framework of Choice for Open Source LLM Hosting
vLLM has emerged as the de facto framework for hosting open-source LLMs, and it can be tuned toward high throughput or low latency depending on the use case. Its sophisticated memory management and request batching make it well suited to production deployments.
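To ground this, here is a minimal sketch of vLLM’s offline Python API; the model name and prompt are placeholders of my own choosing, not anything prescribed in the workshop:

```python
from vllm import LLM, SamplingParams

# Placeholder model; any Hugging Face model that vLLM supports works here.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

sampling = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain PagedAttention in one sentence."], sampling)

for out in outputs:
    print(out.outputs[0].text)
```

The same engine can also be exposed as an OpenAI-compatible HTTP server, which is how most production deployments (including Kubernetes-based ones) consume it.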
2. Hardware Planning is Critical
Successful LLM deployment requires extensive upfront planning to determine the right hardware configuration and model size for your use case. Consider these scenarios (a back-of-envelope sizing sketch follows after them):
Enterprise Chatbot Use Case:
- Expected load: 1000s of concurrent users
- Priority: Low latency for responsive user experience
- Hardware needs: Multiple high-memory GPUs with fast interconnects
Document Processing Pipeline:
- Expected load: High token volume with batch processing
- Priority: Maximum throughput over latency
- Hardware needs: Optimized for batch sizes and memory bandwidth
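To make the hardware question concrete, here is a rough memory-sizing sketch: weights plus KV cache for a hypothetical 70B-parameter model served in FP16. The dimensions are Llama-3-70B-like, and the context length, batch size, and helper functions are my own illustrative assumptions rather than workshop figures:

```python
def model_weights_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """VRAM needed just to hold the model weights."""
    return params_billion * 1e9 * bytes_per_param / 1e9

def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                context_len: int, batch: int, bytes_per_value: int = 2) -> float:
    """Rough KV-cache footprint for `batch` concurrent sequences at full context."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value  # keys + values
    return per_token * context_len * batch / 1e9

weights = model_weights_gb(70)                               # ~140 GB of FP16 weights
cache = kv_cache_gb(80, 8, 128, context_len=8192, batch=64)  # Llama-3-70B-like dims (GQA, 8 KV heads)
print(f"weights ~{weights:.0f} GB + KV cache ~{cache:.0f} GB = ~{weights + cache:.0f} GB total")
```

Roughly 140 GB of weights plus another ~170 GB of KV cache at that concurrency already implies several high-memory GPUs with fast interconnects, which is exactly the chatbot scenario above.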
The Cast.ai platform excels at this optimization challenge by automatically right-sizing your Kubernetes clusters based on actual workload patterns, potentially saving 50-70% on cloud costs while maintaining performance.
3. GPU Math: Understanding Memory vs. Compute Bounds
One of the most valuable insights from the workshop was understanding GPU performance optimization. The instructor shared a practical trick:
Compare your workload’s arithmetic intensity (operations performed per byte of data moved) with the GPU’s own ops-per-byte ratio:
- Workload ratio above the GPU’s ratio → compute-bound → larger batches mostly add latency, not throughput
- Workload ratio below the GPU’s ratio → memory-bound → increase the batch size to amortize weight loads and raise arithmetic intensity
For any GPU card, you can estimate its ops-per-byte ratio (the roofline “ridge point”):
Ops-per-byte ratio = Peak compute (FLOPs/sec) ÷ Memory bandwidth (bytes/sec)
This calculation helps you understand whether you’re bottlenecked by computation or memory access, allowing you to tune your deployment accordingly.
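As a concrete (and heavily simplified) worked example, the sketch below plugs in approximate published A100 80GB SXM specs, roughly 312 TFLOP/s of dense FP16/BF16 compute and about 2 TB/s of HBM bandwidth, plus a crude decode-intensity model that ignores KV-cache traffic. The numbers and helper function are my own assumptions for illustration:

```python
# Back-of-envelope roofline check for LLM decode (simplified sketch).
PEAK_FLOPS = 312e12        # ~312 TFLOP/s dense FP16/BF16 on an A100 80GB SXM (approx.)
MEM_BANDWIDTH = 2.0e12     # ~2.0 TB/s HBM bandwidth (approx.)

ridge_point = PEAK_FLOPS / MEM_BANDWIDTH   # the GPU's ops-per-byte ratio
print(f"GPU ridge point: ~{ridge_point:.0f} FLOPs per byte")

# Decode-step arithmetic intensity: each step streams all weights once and does
# ~2 FLOPs per weight per sequence in the batch (KV-cache reads ignored here).
def decode_ops_per_byte(batch_size: int, bytes_per_weight: float = 2.0) -> float:
    return 2.0 * batch_size / bytes_per_weight

for batch in (1, 16, 64, 256):
    intensity = decode_ops_per_byte(batch)
    regime = "compute-bound" if intensity >= ridge_point else "memory-bound"
    print(f"batch={batch:>4}: ~{intensity:.0f} ops/byte -> {regime}")
```

Under these assumptions, single-request decoding sits far below the ridge point (memory-bound), and only at a batch of a few hundred sequences does the GPU approach compute-bound territory, which is exactly why batching matters so much for throughput.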
4. Deep Dive into KV Cache Mechanics
We engaged in extensive whiteboarding exercises, calculating KV cache operations and computational requirements. This hands-on approach proved invaluable for understanding:
- The inner workings of the attention mechanism
- How quantization affects KV cache size and performance
- Trade-offs between memory usage and model accuracy
- Batch processing strategies for optimal GPU utilization
It felt like being back in school—but with immediate practical applications!
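For a flavor of that whiteboard math, here is a small Python sketch of per-token KV-cache size. The dimensions are Llama-2-7B-style (32 layers, 32 KV heads, head dimension 128) and are my own illustrative assumptions rather than the workshop’s exact exercise:

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int) -> int:
    """Bytes of KV cache per token: keys + values across every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

# Llama-2-7B-style dimensions (assumption for illustration)
layers, kv_heads, head_dim = 32, 32, 128

fp16 = kv_cache_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2)
fp8 = kv_cache_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=1)

print(f"FP16 KV cache: ~{fp16 / 1e6:.2f} MB per token")   # ~0.52 MB
print(f"FP8  KV cache: ~{fp8 / 1e6:.2f} MB per token")    # ~0.26 MB

# A 4096-token context for a batch of 32 sequences at FP16:
print(f"Batch of 32 at 4096 tokens: ~{fp16 * 4096 * 32 / 1e9:.0f} GB")   # ~69 GB
```

Halving the bytes per value directly halves the cache, and grouped-query attention shrinks it further by reducing the number of KV heads, which is why quantization shows up so prominently in the memory-versus-accuracy trade-offs listed above.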
5. Understanding vLLM Architecture
We explored vLLM’s architecture and request flow in depth. The pre-read assignment, Aleksa Gordić’s vLLM deep dive (linked in the resources below), provided essential context. Key architectural components include (a configuration sketch follows this list):
- PagedAttention mechanism for efficient memory management
- Continuous batching for maximizing throughput
- Request scheduling algorithms for balancing latency and efficiency
- KV cache optimization strategies for handling large context windows
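Several of these components surface directly as engine parameters. As a hedged sketch (the values are arbitrary placeholders rather than recommendations, and the model name is an assumption), the offline engine exposes them like this:

```python
from vllm import LLM

# Illustrative engine configuration; values are placeholders, not tuned recommendations.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed example model
    gpu_memory_utilization=0.90,  # fraction of VRAM given to weights plus the paged KV cache
    max_model_len=8192,           # longest context the KV cache must be able to hold
    max_num_seqs=256,             # cap on sequences the continuous batcher runs concurrently
)
```

Trading these off against each other (more concurrent sequences versus longer contexts versus headroom for activations) is where the whiteboard math from the previous section pays off.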
Workshop Experience and Cast.ai Integration
While we unfortunately didn’t get hands-on time deploying models on Cast.ai’s EKS cluster as originally planned, the theoretical foundation and architectural insights more than compensated. The workshop demonstrated how Cast.ai’s platform would streamline the deployment process through:
- Automatic GPU node provisioning based on model requirements
- Cost optimization through intelligent resource scheduling
- Multi-cloud support for flexibility and redundancy
- Real-time monitoring and autoscaling for LLM workloads
Understanding the integration between vLLM’s inference capabilities and Cast.ai’s Kubernetes optimization creates a powerful combination for production LLM deployments.
Next Steps: Putting Knowledge into Practice
Now armed with this knowledge, I’m planning to:
- Benchmark different model sizes on various GPU configurations
- Implement vLLM in a production environment using Kubernetes
- Explore Cast.ai’s platform for cost-optimized GPU cluster management
- Document performance metrics comparing throughput vs. latency trade-offs
- Share detailed implementation guides once deployed
Why This Matters for Your Organization
If your company is considering hosting open-source LLMs, here’s what this workshop taught me:
Cost Optimization: Using platforms like Cast.ai can dramatically reduce infrastructure costs while maintaining performance. For GPU-intensive workloads, this can mean the difference between a viable and prohibitively expensive solution.
Performance at Scale: vLLM’s architecture, combined with proper Kubernetes orchestration, enables production-grade LLM serving that can compete with managed API services in terms of latency and throughput.
Control and Privacy: Hosting your own models provides data sovereignty and customization opportunities that closed APIs cannot match.
Resources and Further Reading
- vLLM Documentation
- Aleksa Gordić’s vLLM Deep Dive (excellent pre-read)
- Cast.ai Platform - Kubernetes cost optimization for AI workloads
- Cast.ai Blog: GPU Optimization - Best practices for GPU workload management
Conclusion
This workshop reinforced that the open-source LLM ecosystem has matured significantly. With frameworks like vLLM and platforms like Cast.ai handling the infrastructure complexity, hosting your own models has become increasingly practical for organizations of all sizes.
The combination of cost optimization, performance, and control makes self-hosted LLMs a compelling alternative to API-based solutions—especially for use cases with high volume, sensitive data, or specific latency requirements.
Once I implement vLLM in production and integrate with Cast.ai’s platform, I’ll share detailed performance metrics, cost analysis, and practical deployment patterns.