Cloudflare

Making Workers AI faster and more efficient: Performance optimization with KV cache compression and speculative decoding

2024-09-26

Jesse Kipp

With a new generation of data center accelerator hardware and optimization techniques such as KV cache compression and speculative decoding, we've made large language model (LLM) inference lightning-fast on the Cloudflare Workers AI platform.