George Williams
"Bigger models lead to better AI. It’s a fundamental “scaling law” that is supercharging innovations across the hardware technology stack, from new chip substrates to radical changes in datacenter architecture. Recent advances notwithstanding, model size continues to outpace today’s hardware accelerators, which impacts latency, throughput, and power consumption for all fundamental model related tasks including inference, training, fine-tuning. This in turn negatively impacts the performance of real-world compound AI system use cases including context retrieval augmentation (RAG), mixture-of-expert (MOE) stochastic routing models, and large-scale model tenancy, just to name a few.
In this talk, I will highlight recent advances in extreme model compression that not only eliminate expensive matrix multiplication but reduce the compute burden of transformer-based large language models (LLMs) to just additions and simple bit-vector operations. As you might expect, the use of base-3 ternary operations significantly reduces model size and compute burden, but it surprisingly also yields better model performance than other compression schemes, including binary models and 4-bit integer quantization, within the same or even smaller memory and power budget. I’ll walk through specific system and zero-shot benchmarks that we acquired on CPU, GPU, and compute-in-memory microprocessors, using well-known MoE LLMs grounded with a RAG vector database to mitigate hallucinations.
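To make the “additions only” claim concrete, here is a minimal NumPy sketch of a linear layer with ternary weights in {-1, 0, +1}, in the spirit of BitNet b1.58-style schemes. The absmean quantization rule and all names are illustrative assumptions, not the implementation benchmarked in the talk.

```python
import numpy as np

def ternary_linear(x, w_ternary, scale):
    """Matmul-free linear layer with ternary weights in {-1, 0, +1}.

    Each output element is a sum of selected inputs minus another sum
    (additions/subtractions only), followed by one scalar rescale.
    x: (d_in,) activations; w_ternary: (d_out, d_in); scale: float.
    """
    pos = (w_ternary == 1)   # bit mask: which inputs to add
    neg = (w_ternary == -1)  # bit mask: which inputs to subtract
    # Equivalent to w_ternary @ x, expressed as masked additions.
    y = (x * pos).sum(axis=1) - (x * neg).sum(axis=1)
    return scale * y

# Quantize full-precision weights to ternary (absmean rounding;
# the threshold choice here is illustrative).
rng = np.random.default_rng(0)
w_fp = rng.standard_normal((4, 8))
scale = np.mean(np.abs(w_fp))
w_t = np.clip(np.round(w_fp / scale), -1, 1).astype(np.int8)

x = rng.standard_normal(8)
print(ternary_linear(x, w_t, scale))  # additions/subtractions only
print(scale * (w_t @ x))              # reference: matches up to float rounding
```

The pos/neg masks are the “simple bit-vector operations” in question: stored as two packed bitplanes they cost 2 bits per weight, close to the log2(3) ≈ 1.58 bits of information a ternary value actually carries.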
Finally, we will revisit decades-old ideas in ternary storage which, combined with ternary bit processors, could pave new directions toward ultra-compact, power-efficient trillion-parameter models on a single substrate, using existing CMOS fabrication techniques.