InferenceMAX

By SemiAnalysis

LLM inference performance is a major concern for anyone providing AI services, yet accurate performance analysis remains elusive.

The fast cadence of software development and model releases makes it difficult to compare performance across setups. Existing benchmarks quickly become obsolete because they are static, and participants often game them with unrealistic, highly specific configurations.

InferenceMAX addresses these issues by benchmarking popular models on major hardware platforms nightly with the latest software.

For each model and hardware combination, InferenceMAX sweeps through different tensor-parallel sizes and maximum concurrent request limits, presenting a throughput vs. latency graph for a complete picture. We keep software configurations broadly applicable across different serving scenarios rather than tuned to any single case, and we open-source the repo to encourage community contributions.
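As a rough illustration of what such a sweep involves, the Python sketch below iterates over tensor-parallel sizes and concurrency limits for a single model/hardware pair and records a (throughput, latency) point for each combination. The `run_benchmark` hook, the metric names, and the parameter grids are illustrative placeholders, not the actual InferenceMAX harness.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class SweepPoint:
    tensor_parallel: int       # number of GPUs the model is sharded across
    max_concurrency: int       # maximum concurrent requests offered to the server
    throughput_tok_s: float    # aggregate output tokens per second
    latency_ms_per_tok: float  # mean time per output token for a single request


def run_benchmark(tensor_parallel: int, max_concurrency: int) -> tuple[float, float]:
    """Hypothetical hook: launch the serving engine with the given tensor-parallel
    size, drive it at the given concurrency, and return (throughput, latency).
    Left as a placeholder here; only the sweep structure is shown."""
    raise NotImplementedError("wire this to an actual serving engine and load generator")


def sweep(tp_sizes: list[int], concurrencies: list[int]) -> list[SweepPoint]:
    """For one model/hardware pair, measure every (tensor parallel, concurrency)
    combination; the resulting points trace out a throughput-vs-latency curve."""
    points = []
    for tp, conc in product(tp_sizes, concurrencies):
        throughput, latency = run_benchmark(tp, conc)
        points.append(SweepPoint(tp, conc, throughput, latency))
    return points
```

Plotting the resulting points with throughput on one axis and latency on the other gives the kind of curve InferenceMAX publishes for each model and hardware combination.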

We hope InferenceMAX gives the community up-to-date, realistic insight into LLM inference performance.