Karya Inference — Fastest & Cheapest LLM Inference on Indian GPUs

The Problem

Your GPU Is Fast. Your Inference Isn't.

New GPU architectures keep getting faster, but most teams still run vanilla autoregressive decode. At batch size 1, your H200 spends most of each decoding step waiting on memory reads, not computing. You're paying for 100% of the GPU and getting 25-45% utilization.

We fix that. We deploy the proven optimization stack on your existing hardware. Same GPUs. Same models. 2-3x the speed per stream. 60% lower cost per token.
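A rough back-of-envelope shows why single-stream decode is limited by memory bandwidth rather than compute. The sketch below assumes every generated token requires reading all active weights once from HBM, uses the H200's published peak bandwidth of 4.8 TB/s, and takes roughly 3B active parameters (an A3B-style MoE model). It ignores KV cache reads and kernel overheads, so the result is a ceiling, not a prediction.

```python
# Upper bound on batch-size-1 decode speed when weight reads dominate.
# Assumptions (not taken from the benchmarks): each token reads all active
# weights once; KV cache traffic and kernel launch overhead are ignored.

H200_BANDWIDTH_BYTES_PER_S = 4.8e12  # H200 HBM3e peak memory bandwidth


def max_decode_tok_per_s(active_params: float, bytes_per_param: float,
                         bandwidth: float = H200_BANDWIDTH_BYTES_PER_S) -> float:
    """Bandwidth-bound ceiling on tokens per second for a single stream."""
    bytes_per_token = active_params * bytes_per_param
    return bandwidth / bytes_per_token


bf16_ceiling = max_decode_tok_per_s(3e9, 2.0)  # BF16: 2 bytes per weight
fp8_ceiling = max_decode_tok_per_s(3e9, 1.0)   # FP8: 1 byte per weight

print(f"BF16 ceiling: {bf16_ceiling:.0f} tok/s")
print(f"FP8 ceiling:  {fp8_ceiling:.0f} tok/s")
```

Halving the bytes per weight doubles the ceiling, which is why quantization is the first lever. Going further means emitting more than one token per pass over the weights, which is the idea behind speculative decoding.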

Vanilla Setup	25%
+ vLLM / SGLang	45%
+ Speculative Decode	200%
+ Full Stack	300%

Current Cost	₹8.2L/mo
With Karya Inference	₹3.3L/mo
You Save	₹59L/yr

*4x H200 running 24/7 on an Indian GPU cloud (~₹278 per GPU-hour)
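The cost figures can be sanity-checked with a few lines of arithmetic. This is a sketch under two assumptions that the page does not spell out: the ~₹278/hr rate is per GPU, and the fleet is billed for every hour of the month. The 60% reduction is the cost-per-token figure quoted earlier.

```python
# Reproduces the cost figures under stated assumptions:
# the hourly rate is per GPU, and the fleet runs every hour of the month.

LAKH = 100_000  # 1 lakh (L) = 100,000 rupees


def monthly_cost_lakh(gpus: int, rate_per_gpu_hour: float, days: int) -> float:
    """Fleet cost for one month, in lakh rupees."""
    return gpus * rate_per_gpu_hour * 24 * days / LAKH


low = monthly_cost_lakh(4, 278, 30)    # 30-day month
high = monthly_cost_lakh(4, 278, 31)   # 31-day month

current = 8.2                          # lakh per month, as quoted
optimized = current * (1 - 0.60)       # 60% lower cost per token
annual_saving = (current - optimized) * 12

print(f"Monthly fleet cost: INR {low:.1f}L to {high:.1f}L")
print(f"Optimized: INR {optimized:.1f}L/mo, saving INR {annual_saving:.0f}L/yr")
```

The quoted ₹8.2L/mo sits inside the 30-to-31-day range, and a 60% reduction gives the ₹3.3L/mo and ₹59L/yr figures.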

What We Do

Four Ways to Faster, Cheaper Inference

Pick your level. We recommend starting with the free audit.

Foundation

vLLM Optimization

1.3-1.8x

Tune your existing vLLM deployment for maximum throughput. FP8 quantization, CUDA graphs, KV cache management, and smart batching. The baseline every team should have.

single-stream · 23-44% cheaper

FP8 weight + KV cache quantization
CUDA graph compilation
Continuous batching tuning
Memory bandwidth optimization
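As an illustration of what this tier touches, the sketch below lists engine settings as keyword arguments for vLLM's `LLM` class. The model ID is a placeholder and the numeric values are starting points to tune per workload, not recommendations from our benchmarks. Check the argument names against the vLLM version you run.

```python
# Illustrative Foundation-tier settings, expressed as keyword arguments for
# vLLM's `LLM` class. The model ID is hypothetical and the numbers are
# starting points only.

engine_kwargs = {
    "model": "your-org/your-model",   # hypothetical model ID
    "quantization": "fp8",            # FP8 weights
    "kv_cache_dtype": "fp8",          # FP8 KV cache
    "enforce_eager": False,           # keep CUDA graph capture enabled
    "gpu_memory_utilization": 0.90,   # share of HBM the engine may claim
    "max_num_seqs": 64,               # cap on sequences batched per step
    "enable_prefix_caching": True,    # reuse KV blocks for shared prefixes
}

# On a GPU host with vLLM installed, these would be passed as:
#   from vllm import LLM
#   llm = LLM(**engine_kwargs)
for name, value in engine_kwargs.items():
    print(f"{name} = {value!r}")
```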

Serve Layer

SGLang Serve Optimization

1.5-2.0x

Migrate to SGLang for better MoE model performance. Disaggregated prefill/decode, custom kernels, and prefix caching. Built for models like Sarvam and Qwen3.

single-stream · 33-50% cheaper

Disaggregated prefill + decode
Custom kernel scheduling
RadixAttention for prefix caching
Optimized for MoE routing
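Prefix caching is easiest to see with a toy example. The sketch below uses a plain token-level trie as a stand-in for the radix tree. It only tracks which prefixes are cached, whereas a real server stores KV tensors per node and evicts entries under memory pressure. The class and method names are illustrative, not SGLang's API.

```python
# Toy prefix cache: a token-level trie standing in for the radix tree.
# It records only which prefixes are cached; all names are illustrative.

class PrefixCache:
    def __init__(self) -> None:
        self.root: dict = {}

    def match_prefix(self, tokens: list[int]) -> int:
        """Return how many leading tokens are already cached."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched

    def insert(self, tokens: list[int]) -> None:
        """Record that KV entries for this token sequence are now cached."""
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})


cache = PrefixCache()
system_prompt = [101, 7, 7, 42, 9]          # shared system prompt tokens
cache.insert(system_prompt + [5, 6])        # first request fills the cache

second_request = system_prompt + [8, 3, 1]  # same prompt, new question
reused = cache.match_prefix(second_request)
print(f"reused {reused} of {len(second_request)} prompt tokens")
```

Only the tokens after the shared prefix need a fresh prefill, so workloads with long common system prompts benefit the most.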

Speculative Decode

Speculative Decoding

2.0-3.0x

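A small draft model proposes several tokens ahead. Your main model verifies them in a single forward pass and keeps only the ones it agrees with, so the output distribution is the same as running the main model alone. Zero accuracy loss.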

single-stream · 50-65% cheaper

Custom draft model training
Best for MoE and FP8 models
Pre-trained draft for Sarvam-30B
Zero quality loss guaranteed

Advanced

DFlash Parallel Diffusion

1.5-6.0x

For dense models, DFlash diffuses 16 tokens in parallel for massive speedups. For MoE, we tune block size and tree shape to maximize acceptance on your architecture.

varies by model type

Parallel token diffusion
Block size tuning per model
Best on BF16 dense models
FP8-aware calibration for MoE

Measured Performance

Real Numbers from Real Hardware

All numbers are single-stream (batch size 1) — what one user experiences.

Qwen3.6-35B-A3B on H200 BS=1 · Measured

Setup	Throughput	Speedup	Method
Vanilla Autoregressive	~180 tok/s	1.0x	Baseline BF16
+ FP8 Quantization	333 tok/s	1.85x	Foundation
+ DFlash (T=0)	333 tok/s	1.85x	Parallel diffusion
+ Speculative Decode	~550 tok/s	3.0x	Draft model verify

Note: DFlash acceptance drops with FP8 MoE (mean accepted length of about 3.4 tokens per step, versus 5-7 on BF16 dense). Speculative decode with a trained draft model works better for MoE + FP8 because it sees target hidden states, not just token IDs.
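To see why acceptance length matters so much, here is a simple cost model. It treats the acceptance figure as the mean number of tokens emitted per verification step and charges each step one target forward pass plus an overhead fraction for drafting and wider verification. The overhead value of 0.5 is a placeholder, not a measurement. On MoE models, verifying a multi-token block can activate more experts than a single-token pass, so real overheads may be considerably higher.

```python
# Simple cost model for draft-and-verify decoding. The overhead value used
# below is a hypothetical placeholder, not a measured number.

def spec_speedup(tokens_per_step: float, overhead: float) -> float:
    """Speedup over one-token-per-pass decoding.

    tokens_per_step: mean tokens emitted per verification step.
    overhead: extra cost per step (drafting plus wider verification),
              as a fraction of one plain target forward pass.
    """
    return tokens_per_step / (1.0 + overhead)


for accepted in (3.4, 5.0, 7.0):
    print(f"accept {accepted}: {spec_speedup(accepted, overhead=0.5):.2f}x")
```

Under this model the speedup scales linearly with accepted length, so a drop from 5-7 tokens to about 3.4 removes a large share of the gain before any overhead is counted.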

Sarvam-30B Projection BS=1 · 2.75x confirmed

Stack Layer	Throughput	Speedup	Cost vs Baseline
Baseline (vanilla)	~200 tok/s	1.0x	100%
+ vLLM / SGLang tuning	~260 tok/s	1.3x	77%
+ Speculative Decode	~550 tok/s	2.75x	36%

2.75x speedup confirmed with a pre-trained, Apache 2.0-licensed draft model. Sarvam-30B is well-suited for speculative decoding due to its lightweight architecture.
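The "Cost vs Baseline" column follows directly from the speedup if the GPU price per hour stays fixed, since cost per token then scales as one over throughput. A minimal sketch of that relationship, using the throughput figures from the table:

```python
# Cost per token relative to baseline, assuming the same GPU price per hour
# so that cost scales as 1 / throughput.

def cost_vs_baseline(speedup: float) -> float:
    """Cost per token as a percentage of the baseline cost."""
    return 100.0 / speedup


baseline_tok_s = 200
for label, tok_s in [("Baseline (vanilla)", 200),
                     ("+ vLLM / SGLang tuning", 260),
                     ("+ Speculative Decode", 550)]:
    speedup = tok_s / baseline_tok_s
    print(f"{label}: {speedup:.2f}x, {cost_vs_baseline(speedup):.0f}% of baseline cost")
```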

FAQ

Common Questions

Who are we?

We are data scientists and ML engineers from Microsoft — Krishan Bansal and Shivam Mittal — who run, train, optimize, and host large language models every day. This isn't theory for us: we benchmark on real H200 hardware, train draft models from scratch, and deploy production inference stacks. We built Karya Inference because Indian GPU fleets deserve the same optimization expertise that hyperscalers have internally.

What exactly is speculative decoding?

A small draft model proposes several tokens ahead of time. Your main model verifies them in a single forward pass. Accepted tokens are kept and rejected ones are discarded and replaced by the main model's own choice. The output distribution is mathematically identical to standard autoregressive decoding, and with greedy decoding the tokens match one for one. Zero quality loss.
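The draft-then-verify loop can be sketched with toy models. Both "models" below are deterministic stand-in functions rather than real LLMs, and the sketch covers the greedy case only. The function names and the draft length `k` are illustrative.

```python
# Toy greedy speculative decoding. Both "models" are deterministic functions
# from a token context to the next token; they are stand-ins, not real LLMs.

def target_next(ctx: list[int]) -> int:
    """Expensive model (stand-in): depends on the last two tokens."""
    return (ctx[-1] * 3 + ctx[-2] + 1) % 11


def draft_next(ctx: list[int]) -> int:
    """Cheap model (stand-in): agrees with the target except when the
    last token is a multiple of 4."""
    guess = target_next(ctx)
    return (guess + 1) % 11 if ctx[-1] % 4 == 0 else guess


def autoregressive(prompt: list[int], n: int) -> list[int]:
    out = list(prompt)
    for _ in range(n):
        out.append(target_next(out))
    return out[len(prompt):]


def speculative(prompt: list[int], n: int, k: int = 4) -> tuple[list[int], int]:
    """Generate n tokens; return them and the number of verification passes."""
    out, passes = list(prompt), 0
    while len(out) - len(prompt) < n:
        # 1. The draft proposes k tokens autoregressively (cheap).
        proposal, ctx = [], list(out)
        for _ in range(k):
            tok = draft_next(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2. The target checks each position; on a GPU this is one batched pass.
        passes += 1
        ctx = list(out)
        for tok in proposal:
            expected = target_next(ctx)
            ctx.append(expected)          # the target's token is always kept
            if tok != expected:           # first mismatch: discard the rest
                break
        else:
            ctx.append(target_next(ctx))  # all accepted: one bonus token
        out = ctx
    return out[len(prompt):len(prompt) + n], passes


prompt = [1, 2]
reference = autoregressive(prompt, 20)
tokens, passes = speculative(prompt, 20)
print("identical output:", tokens == reference)
print("target passes:", passes, "instead of", len(reference))
```

Every kept token is one the target would have chosen anyway, so the draft only affects how many target passes are needed, never what is generated.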

Is this a completely new or experimental technology?

No — it's built on recent breakthroughs from 2024-2026 that are already in production at top AI labs in the US and China. Methods like speculative decoding and draft-model verification are being deployed at scale by teams at Meta, Google DeepMind, and leading Chinese AI companies. The research is mature, the implementations are battle-tested, and the speedups are real. We bring this same expertise to Indian GPU infrastructure.

Do I need new hardware?

No. Everything runs on your existing H100, H200, A100, or L40 GPUs. The draft model is tiny (0.4-2B params, ~1-5 GB). It fits alongside your main model on the same GPU. No infrastructure changes needed.
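A quick memory estimate shows why the draft model fits. The sketch assumes BF16 weights (2 bytes per parameter) for both models, a 30B-parameter target, and the H200's 141 GB of HBM. It counts weights only, so KV cache and runtime buffers come out of the remaining headroom.

```python
# Weight-memory estimate for co-locating a draft model with the target.
# Assumptions: BF16 weights for both models, a 30B-parameter target, and
# an H200 with 141 GB of HBM. KV cache and activations are not counted.

GB = 1e9


def weight_gb(params: float, bytes_per_param: float = 2.0) -> float:
    """Memory needed for model weights, in GB."""
    return params * bytes_per_param / GB


draft_small = weight_gb(0.4e9)   # smallest draft in the quoted range
draft_large = weight_gb(2e9)     # largest draft in the quoted range
target = weight_gb(30e9)         # assumed 30B-parameter target model

h200_hbm_gb = 141
headroom = h200_hbm_gb - target - draft_large

print(f"draft weights: {draft_small:.1f} to {draft_large:.1f} GB")
print(f"target weights: {target:.0f} GB, headroom left: {headroom:.0f} GB")
```

Weights alone come to 0.8-4 GB for the draft, consistent with the ~1-5 GB range once runtime overhead is included.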

Is there really zero quality loss?

Yes, guaranteed. Speculative decoding uses rejection sampling: the target model decides which draft tokens to accept, and a rejected token is replaced by a sample from a corrected target distribution. The resulting output distribution is mathematically proven to be identical to standard autoregressive decoding, and with greedy decoding the tokens themselves are the same.
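The guarantee can be checked numerically for a single position. The sketch below computes, without any random sampling, the distribution of the emitted token under the standard acceptance rule: accept a draft token x with probability min(1, p(x)/q(x)), otherwise resample from the normalized positive part of p - q. The two four-token distributions are made up for illustration, and the draft is assumed to give every token non-zero probability.

```python
# Exact check (no sampling) that the speculative acceptance rule leaves the
# target distribution unchanged. p and q are made-up 4-token distributions;
# q must be non-zero everywhere for the ratio p/q to be defined.

def speculative_output_dist(p: list[float], q: list[float]) -> list[float]:
    """Distribution of the emitted token when a draft token x ~ q is accepted
    with probability min(1, p[x] / q[x]) and, on rejection, a replacement is
    drawn from the normalized residual max(0, p - q)."""
    accept = [qx * min(1.0, px / qx) for px, qx in zip(p, q)]
    reject_prob = 1.0 - sum(accept)
    residual = [max(0.0, px - qx) for px, qx in zip(p, q)]
    total = sum(residual)
    if total == 0.0:               # p == q: nothing is ever rejected
        return accept
    return [a + reject_prob * r / total for a, r in zip(accept, residual)]


p = [0.50, 0.30, 0.15, 0.05]   # target model probabilities
q = [0.25, 0.25, 0.25, 0.25]   # draft model probabilities
out = speculative_output_dist(p, q)

print("target :", p)
print("emitted:", [round(x, 6) for x in out])
```

The emitted distribution equals the target's regardless of how good the draft is. A weaker draft lowers the acceptance rate, and therefore the speedup, but not the quality.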

What's the typical ROI timeline?

For a 4-GPU H200 fleet, the ₹10L deployment fee pays for itself in about two months at the ₹4.9L/month saving from the cost comparison (₹8.2L down to ₹3.3L). At 2-5x speedup, you need 50-80% fewer GPU hours for the same workload. Annual savings range from ₹40L to ₹80L depending on fleet size and utilization.
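The relationship between speedup, GPU hours, and payback is simple enough to compute directly. This sketch assumes the workload is fixed, so GPU hours scale as one over the speedup, and uses the ₹8.2L and ₹3.3L monthly figures quoted earlier for a 4x H200 fleet. Larger fleets or higher utilization would shorten the payback.

```python
# GPU-hour reduction and payback period under a fixed-workload assumption.
# Monetary values are in lakh rupees (1L = 100,000 rupees).

def fewer_gpu_hours(speedup: float) -> float:
    """Fraction of GPU hours saved for the same workload."""
    return 1.0 - 1.0 / speedup


def payback_months(fee_lakh: float, monthly_saving_lakh: float) -> float:
    """Months until cumulative savings cover the one-time fee."""
    return fee_lakh / monthly_saving_lakh


for s in (2, 3, 5):
    print(f"{s}x speedup: {fewer_gpu_hours(s):.0%} fewer GPU hours")

monthly_saving = 8.2 - 3.3   # quoted monthly cost before and after
print(f"Payback on a 10L fee: {payback_months(10, monthly_saving):.1f} months")
```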

Which Indian GPU cloud providers do you support?

We work with all major Indian GPU clouds — Yotta (9,216+ GPUs), E2E Networks (H200/B200), NxtGen, CtrlS, and on-prem clusters. We also work with hyperscaler deployments on AWS, GCP, and Azure India regions. Our benchmarks are measured on E2E Networks H200 SXM5 hardware.

What happens during the free audit?

We spend 2 hours benchmarking your current setup — measuring throughput, GPU utilization, and memory bandwidth. Then we run all four methods (vLLM, SGLang, speculative decoding, DFlash) on your model and hardware. You get a report with exact speedup projections and cost savings. No commitment required.
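For a sense of what the throughput part of the audit measures, here is a minimal single-stream timing harness. It is a sketch, not our audit tooling: `fake_stream` stands in for a real streaming client, and the injectable `clock` argument exists only so the logic can be tested deterministically.

```python
# Minimal single-stream benchmark harness. `fake_stream` is a stand-in for a
# real streaming client; swap in any callable that yields tokens as they arrive.
import time
from typing import Callable, Iterable


def measure_stream(stream: Callable[[], Iterable[str]],
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Time one streamed generation and report batch-size-1 metrics."""
    start = clock()
    stamps = [clock() for _ in stream()]   # one timestamp per received token
    if not stamps:
        raise ValueError("stream produced no tokens")
    decode_time = stamps[-1] - stamps[0]
    return {
        "tokens": len(stamps),
        "ttft_s": stamps[0] - start,       # time to first token
        "decode_tok_per_s": (len(stamps) - 1) / decode_time
        if decode_time > 0 else float("nan"),
    }


def fake_stream() -> Iterable[str]:
    """Pretend server: 50 tokens arriving roughly 2 ms apart."""
    for i in range(50):
        time.sleep(0.002)
        yield f"tok{i}"


result = measure_stream(fake_stream)
print(f"{result['tokens']} tokens, TTFT {result['ttft_s'] * 1000:.1f} ms, "
      f"{result['decode_tok_per_s']:.0f} tok/s decode")
```

Separating time to first token from steady-state decode speed matters because prefill and decode respond to different optimizations.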

How long does a full deployment take?

4 weeks from audit to production. Week 1: benchmark and method selection. Weeks 2-3: implementation and draft model training. Week 4: load testing, validation, and production deployment. The fee includes 30 days of post-deployment support.

Stop Overpaying for Slow Inference

2-3x Speed on Your Existing GPU Hardware