Accelerating LLM Inference with Speculative Decoding - LinkedIn Engineering Blog

Nov 6, 2025 · Dhyey Mavani · 1 min read
Abstract
In this LinkedIn Engineering Blog post, Dhyey Mavani and the Hiring Assistant team describe how they used speculative decoding, a technique that drafts several tokens ahead and verifies them in parallel, to speed up LLM inference without sacrificing output quality. The article breaks down how n-gram speculation works, why it is particularly effective for structured outputs such as rubric-style summaries, and how LinkedIn achieved nearly 4× higher throughput and a 66% reduction in P90 latency. The team covers the key configuration parameters (num_speculative_tokens, prompt_lookup_max, prompt_lookup_min) and explains when n-gram speculation is preferable to a separate draft model. The post offers practical guidance for teams building real-time AI agents at scale, showing how operational simplicity and performance can be balanced in production GenAI systems.
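To make the n-gram ("prompt lookup") idea and those parameters concrete, the drafting step can be sketched roughly as follows. This is an illustrative approximation of the general prompt-lookup technique, reusing the parameter names the post mentions; it is not LinkedIn's production code or any particular library's API.

# Rough sketch of n-gram (prompt-lookup) drafting. The parameter names
# mirror the ones discussed in the post; the logic is illustrative only.
def propose_ngram_draft(
    tokens: list[int],               # prompt + tokens generated so far
    num_speculative_tokens: int = 5,
    prompt_lookup_max: int = 4,
    prompt_lookup_min: int = 1,
) -> list[int]:
    # Try the longest suffix first; longer n-gram matches are more reliable.
    for n in range(prompt_lookup_max, prompt_lookup_min - 1, -1):
        if len(tokens) <= n:
            continue
        suffix = tokens[-n:]
        # Scan backwards for the most recent earlier occurrence of this n-gram.
        for start in range(len(tokens) - n - 1, -1, -1):
            if tokens[start:start + n] == suffix:
                # Copy the tokens that followed that occurrence as the draft.
                draft = tokens[start + n : start + n + num_speculative_tokens]
                if draft:
                    return draft
    return []  # no match: fall back to ordinary one-token-at-a-time decoding

Structured outputs such as rubric-style summaries repeat the same phrases (section headers, skill labels, candidate details copied from the prompt), so these lookups hit often, which is why n-gram speculation pays off there without the overhead of running a separate draft model.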

Large language models are transforming how we build software, but they come with a fundamental challenge: speed. For real-time applications like LinkedIn’s Hiring Assistant—the company’s first AI agent for recruiters—latency isn’t just a technical metric; it’s critical to user experience. When an agent needs to process long job descriptions and candidate profiles while generating thousands of tokens, every millisecond counts.

In this LinkedIn Engineering Blog post, the Hiring Assistant team shares how they tackled this challenge using speculative decoding, a technique that accelerates text generation without compromising quality. The result? Nearly 4× higher throughput and an average 66% reduction in P90 end-to-end latency.
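The post itself does not include code, but the verification half of speculative decoding can be sketched as follows for the greedy-decoding case (sampled decoding uses a rejection-sampling acceptance rule instead). Here target_logits is a hypothetical stand-in for one batched forward pass of the serving model, not an API from the article.

import numpy as np

# Illustrative verify-and-accept step under greedy decoding. `target_logits`
# is a placeholder for a single forward pass of the full model over the
# candidate sequence, returning next-token logits for every position.
def verify_draft(tokens: list[int], draft: list[int], target_logits) -> list[int]:
    candidate = tokens + draft
    logits = target_logits(candidate)     # shape: [len(candidate), vocab_size]
    accepted: list[int] = []
    for i, drafted in enumerate(draft):
        # The logits at position len(tokens) + i - 1 predict the token at
        # position len(tokens) + i, i.e. the i-th drafted token.
        predicted = int(np.argmax(logits[len(tokens) + i - 1]))
        if predicted != drafted:
            # First mismatch: keep the model's own token and discard the rest.
            accepted.append(predicted)
            return accepted
        accepted.append(drafted)
    # All drafted tokens matched, and the same pass yields one extra token.
    accepted.append(int(np.argmax(logits[len(candidate) - 1])))
    return accepted

Because the entire draft is scored in one pass, a run of accepted tokens costs roughly one model step instead of one step per token, while the target model's own predictions still decide every emitted token. That is how the technique improves throughput and latency without changing the output.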