⚡ LLM Inference APIs - Latency

Number of APIs: 63

Prerequisites

  • Qodex

  • API keys for all APIs

Usage

  1. Create a fork

  2. Update collection variables

  3. Send requests

Methodology

The study includes only AI companies whose APIs follow OpenAI's Chat Completions design, and only multi-tenant endpoints.
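Because every provider in the study follows the same Chat Completions request shape, only the base URL, API key, and model name change between calls. A minimal sketch of that shared shape (the base URL and model are taken from the Deep Infra entry below; the prompt, temperature, and max_tokens are illustrative placeholders, not the values used in the benchmark):

```python
# Build an OpenAI-style Chat Completions request for any provider in the study.
# Only base_url, api_key, and model vary; the payload shape is identical everywhere.
def build_chat_request(base_url: str, api_key: str, model: str,
                       prompt: str, stream: bool = False) -> dict:
    return {
        "url": f"{base_url}/chat/completions",
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "json": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0,      # placeholder hyperparameters
            "max_tokens": 256,
            "stream": stream,
        },
    }

req = build_chat_request("https://api.deepinfra.com/v1/openai",
                         "YOUR_API_KEY", "gemma-7b-it",
                         "Explain LPUs in one sentence.")
```

Swapping providers then amounts to swapping the first three arguments, which is what the collection variables in the Usage section control.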

For each LLM inference API, three requests with the same payload (same model, same hyperparameters, and same prompt) were submitted at a given point in time, and the average latency was computed, both with and without streaming.
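The measurement loop can be sketched as follows. This is an illustration of the methodology, not the actual benchmark script: `send_request` stands in for one HTTP call to an endpoint, and the three-sample average mirrors the procedure described above.

```python
import time
from statistics import mean

def average_latency(send_request, runs: int = 3) -> float:
    """Submit the same request `runs` times and return the mean latency in seconds.

    `send_request` is any zero-argument callable that performs one API call
    (streaming or non-streaming); for streaming, it should return only after
    the full response has been consumed.
    """
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        send_request()
        samples.append(time.perf_counter() - start)
    return mean(samples)
```

With streaming on, time-to-first-token could be measured instead by stopping the clock when the first chunk arrives rather than when the response completes.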

Initial Findings

Groq delivers ultra-low LLM inference latency with the world's first Language Processing Unit (LPU), outperforming GPU-based processing (NVIDIA, AMD, Intel, etc.). An outlier is Fireworks AI with Llama 2 70B.

Data: https://github.com/bstraehle/ai-ml-dl/blob/main/apis/LLM%20Inference%20APIs%20-%20Latency.xlsx

  1. Streaming On: Gemma 7B - Chat (Deep Infra - gemma-7b-it) POST https://api.deepinfra.com/v1/openai/chat/completions

  2. Streaming Off: Llama 2 70B - Chat (Fireworks AI - llama-v2-70b-chat) 🏆 POST https://api.fireworks.ai/inference/v1/chat/completions

  3. Streaming On: Mixtral 8x7B - Chat (Deep Infra - mixtral-8x7b-instruct-v0.1) POST https://api.deepinfra.com/v1/openai/chat/completions

  4. Streaming Off: Llama 3 70B - Chat (NVIDIA AI - llama3-70b) POST https://integrate.api.nvidia.com/v1/chat/completions

  5. Streaming Off: Llama 3 70B - Chat (OctoAI - meta-llama-3-70b-instruct) POST https://text.octoai.run/v1/chat/completions

  6. Streaming Off: Llama 3.1 405B - Chat (NVIDIA AI - llama-3.1-405b-instruct) POST https://integrate.api.nvidia.com/v1/chat/completions

  7. Streaming Off: Llama 3.1 405B - Chat (Together AI - Meta-Llama-3.1-405B-Instruct-Turbo) POST https://api.together.xyz/v1/chat/completions

  8. Streaming On: Gemma 7B - Chat (Groq - gemma-7b-it) 🏆 POST https://api.groq.com/openai/v1/chat/completions

  9. Streaming On: Gemma 7B - Chat (NVIDIA AI - gemma-7b) POST https://integrate.api.nvidia.com/v1/chat/completions

  10. Streaming On: Gemma 7B - Chat (OctoAI - gemma-7b-it) POST https://text.octoai.run/v1/chat/completions
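For the streaming requests above, Chat Completions endpoints return Server-Sent Events, where each chunk arrives as a `data: {...}` line and the stream ends with `data: [DONE]`. A minimal parser sketch, assuming the standard OpenAI-style chunk format (the `lines` input stands in for the decoded lines of an HTTP response body):

```python
import json

def iter_stream_content(lines):
    """Yield content deltas from OpenAI-style SSE chat-completion lines."""
    for line in lines:
        if not line.startswith("data: "):
            continue                      # skip keep-alives and blank lines
        data = line[len("data: "):]
        if data.strip() == "[DONE]":
            break                         # end-of-stream sentinel
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]
```

For a streaming latency measurement, the clock stops once this iterator is exhausted (full response) or after the first yielded chunk (time-to-first-token).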