⚡ LLM Inference APIs - Latency
Number of APIs: 63
Prerequisites
- Qodex
- API keys for all APIs
Usage
1. Create a fork
2. Update collection variables
3. Send requests
Methodology
Only AI companies whose APIs follow OpenAI's Chat Completions design are included in the study, and only multi-tenant endpoints are considered.
For each LLM inference API, three requests with an identical payload (same model, same hyperparameters, and same prompt) were submitted at a given point in time, and the average latency was computed both with and without streaming.
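The measurement above can be sketched as a small timing harness. This is a minimal sketch, not the collection's actual test scripts; the `api_key` value is a placeholder, and the endpoint and model shown are taken from the Groq entry in the request list below. The payload follows the standard Chat Completions schema shared by every API in this collection.

```python
import time


def average_latency(send_request, n=3):
    """Submit the same request n times and return the mean latency in seconds."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        send_request()
        latencies.append(time.perf_counter() - start)
    return sum(latencies) / n


def call_chat_completions(api_key, stream=False):
    """One Chat Completions request (OpenAI-style payload, as used by all
    63 APIs in this collection)."""
    import requests  # third-party: pip install requests

    resp = requests.post(
        "https://api.groq.com/openai/v1/chat/completions",  # any endpoint from the list
        headers={"Authorization": f"Bearer {api_key}"},
        json={
            "model": "gemma-7b-it",
            "messages": [{"role": "user", "content": "Hello"}],
            "temperature": 0,
            "stream": stream,
        },
        timeout=60,
    )
    resp.raise_for_status()
    if stream:
        # Consume the streamed body so the timer covers the full response.
        for _ in resp.iter_lines():
            pass
    return resp
```

For example, `average_latency(lambda: call_chat_completions(key, stream=True))` averages three streamed runs against one endpoint; rerunning with `stream=False` gives the non-streaming figure.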
Initial Findings
Groq delivers ultra-low LLM inference latency with the world's first Language Processing Unit (LPU), outperforming GPU-based processing (NVIDIA, AMD, Intel, etc.). The one outlier is Fireworks AI, which leads the non-streaming results for Llama 2 70B.
Data: https://github.com/bstraehle/ai-ml-dl/blob/main/apis/LLM%20Inference%20APIs%20-%20Latency.xlsx
Requests
- Streaming On: Gemma 7B - Chat (Deep Infra - gemma-7b-it) POST https://api.deepinfra.com/v1/openai/chat/completions
- Streaming Off: Llama 2 70B - Chat (Fireworks AI - llama-v2-70b-chat) 🏆 POST https://api.fireworks.ai/inference/v1/chat/completions
- Streaming On: Mixtral 8x7B - Chat (Deep Infra - mixtral-8x7b-instruct-v0.1) POST https://api.deepinfra.com/v1/openai/chat/completions
- Streaming Off: Llama 3 70B - Chat (NVIDIA AI - llama3-70b) POST https://integrate.api.nvidia.com/v1/chat/completions
- Streaming Off: Llama 3 70B - Chat (OctoAI - meta-llama-3-70b-instruct) POST https://text.octoai.run/v1/chat/completions
- Streaming Off: Llama 3.1 405B - Chat (NVIDIA AI - llama-3.1-405b-instruct) POST https://integrate.api.nvidia.com/v1/chat/completions
- Streaming Off: Llama 3.1 405B - Chat (Together AI - Meta-Llama-3.1-405B-Instruct-Turbo) POST https://api.together.xyz/v1/chat/completions
- Streaming On: Gemma 7B - Chat (Groq - gemma-7b-it) 🏆 POST https://api.groq.com/openai/v1/chat/completions
- Streaming On: Gemma 7B - Chat (NVIDIA AI - gemma-7b) POST https://integrate.api.nvidia.com/v1/chat/completions
- Streaming On: Gemma 7B - Chat (OctoAI - gemma-7b-it) POST https://text.octoai.run/v1/chat/completions
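Because every endpoint above follows OpenAI's Chat Completions design, a single response parser covers all of them in the streaming case. The sketch below assumes the standard server-sent-events framing (`data: {...}` lines ending with `data: [DONE]`); the function name is illustrative, not part of the collection.

```python
import json


def first_content_delta(sse_lines):
    """Return the first non-empty content delta from OpenAI-style SSE lines.

    Assumes each event looks like 'data: {json}' and the stream ends with
    'data: [DONE]', as in the Chat Completions streaming format.
    """
    for raw in sse_lines:
        line = raw.decode() if isinstance(raw, (bytes, bytearray)) else raw
        if not line.startswith("data: "):
            continue  # skip keep-alives and blank lines
        payload = line[len("data: "):]
        if payload.strip() == "[DONE]":
            break
        delta = json.loads(payload)["choices"][0]["delta"]
        content = delta.get("content")
        if content:
            return content
    return None
```

Feeding `response.iter_lines()` from a streamed request into this function gives the first token; timing from request start to that point is one common way to report streaming latency.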