⚡ LLM Inference APIs - Latency

Number of APIs: 63

Prerequisites

  • Qodex

  • API keys for all APIs

Usage

  1. Create a fork

  2. Update collection variables

  3. Send requests
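
For readers without Qodex, step 3 boils down to a standard Chat Completions request. Below is a minimal sketch, assuming Python's `requests` library; the variable names are illustrative placeholders standing in for the collection variables, not the collection's actual variable names.

```python
import requests

# Illustrative placeholders standing in for the collection variables:
API_KEY = "YOUR_PROVIDER_API_KEY"
BASE_URL = "https://api.groq.com/openai/v1"  # any chat-completions base URL from the request list below
MODEL = "mixtral-8x7b-32768"                 # a model served at that endpoint

resp = requests.post(
    f"{BASE_URL}/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"model": MODEL, "messages": [{"role": "user", "content": "Hello!"}]},
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```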

Methodology

Only providers whose APIs follow OpenAI's Chat Completions design are included in the study, and only their multi-tenant endpoints.

For each LLM inference API, three requests with an identical payload (same model, same hyperparameters, and same prompt) were submitted at the same point in time, and the average latency was recorded, both with and without streaming.
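
A minimal sketch of that measurement loop, assuming Python's `requests` library. How latency is defined for streaming (time to first chunk here) and non-streaming (time to full response) is an assumption for illustration, not taken from the study.

```python
import time

import requests


def average_latency(url: str, api_key: str, payload: dict, stream: bool, runs: int = 3) -> float:
    """Send `runs` identical requests and return the mean latency in seconds."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        resp = requests.post(
            url,
            headers={"Authorization": f"Bearer {api_key}"},
            json={**payload, "stream": stream},
            stream=stream,
            timeout=120,
        )
        if stream:
            # With stream=True, block only until the first chunk arrives.
            next(resp.iter_lines())
        else:
            # With stream=False, requests has already downloaded the full body.
            resp.raise_for_status()
        latencies.append(time.perf_counter() - start)
        resp.close()
    return sum(latencies) / len(latencies)
```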

Initial Findings

Groq delivers ultra-low LLM inference latency with the world's first Language Processing Unit (LPU), outperforming GPU-based processing (NVIDIA, AMD, Intel, etc.). One outlier is Fireworks AI with Llama 2 70B.
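
To illustrate, a hedged sketch of a streaming call to the Groq endpoint from the request list below (item 7). The chunk parsing assumes the standard OpenAI-style `data: {...}` / `data: [DONE]` server-sent-events format; the API key is a placeholder.

```python
import json

import requests

GROQ_API_KEY = "YOUR_GROQ_API_KEY"  # placeholder

with requests.post(
    "https://api.groq.com/openai/v1/chat/completions",
    headers={"Authorization": f"Bearer {GROQ_API_KEY}"},
    json={
        "model": "mixtral-8x7b-32768",
        "messages": [{"role": "user", "content": "Why does inference latency matter?"}],
        "stream": True,
    },
    stream=True,
    timeout=120,
) as resp:
    for line in resp.iter_lines():
        # Each streamed chunk is an SSE line of the form `data: {...}`; the stream ends with `data: [DONE]`.
        if not line.startswith(b"data: "):
            continue
        chunk = line[len(b"data: "):]
        if chunk == b"[DONE]":
            break
        delta = json.loads(chunk)["choices"][0]["delta"]
        print(delta.get("content", ""), end="", flush=True)
print()
```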

Data: https://github.com/bstraehle/ai-ml-dl/blob/main/apis/LLM%20Inference%20APIs%20-%20Latency.xlsx

  1. Streaming Off: Llama 3 70B - Chat (Lepton AI - llama3-70b) POST https://llama3-70b.lepton.run/api/v1/chat/completions

  2. Streaming Off: Llama 3.1 405B - Chat (Fireworks AI - llama-v3p1-405b-instruct) POST https://api.fireworks.ai/inference/v1/chat/completions

  3. Streaming Off: Llama 3.1 405B - Chat (OctoAI - meta-llama-3.1-405b-instruct) POST https://text.octoai.run/v1/chat/completions

  4. Streaming On: Gemma 7B - Chat (Fireworks AI - gemma-7b-it) POST https://api.fireworks.ai/inference/v1/chat/completions

  5. Streaming On: Mixtral 8x7B - Chat (Anyscale - mixtral-8x7b-instruct-v0.1) POST https://api.endpoints.anyscale.com/v1/chat/completions

  6. Streaming On: Mixtral 8x7B - Chat (Fireworks AI - mixtral-8x7b-instruct) POST https://api.fireworks.ai/inference/v1/chat/completions

  7. Streaming On: Mixtral 8x7B - Chat (Groq - mixtral-8x7b-32768) 🏆 POST https://api.groq.com/openai/v1/chat/completions

  8. Streaming On: Mixtral 8x7B - Chat (Together AI - mixtral-8x7b-instruct-v0.1) POST https://api.together.xyz/v1/chat/completions

  9. Streaming On: Llama 2 70B - Chat (Anyscale - llama-2-70b-chat-hf) POST https://api.endpoints.anyscale.com/v1/chat/completions

  10. Streaming Off: Mixtral 8x7B - Chat (Anyscale - mixtral-8x7b-instruct-v0.1) POST https://api.endpoints.anyscale.com/v1/chat/completions