Generate
POST {{baseUrl}}/v1/generate
The supported Large Language Models (LLMs) are:
Model name |
---|
TinyLlama/TinyLlama-1.1B-Chat-v1.0 |
microsoft/Phi-3-mini-4k-instruct |
meta-llama/Meta-Llama-3-8B-Instruct |
mistralai/Mistral-7B-Instruct-v0.2 |
📘 Learn more about the New Gen LLMs here
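For orientation, here is a minimal sketch of calling this endpoint from Python using the `messages` input style. It is an illustration rather than an official client: it assumes the `requests` library, uses a placeholder `BASE_URL` standing in for `{{baseUrl}}`, and omits any authentication your deployment may require.

```python
import requests

BASE_URL = "https://your-deployment.example.com"  # stands in for {{baseUrl}} (assumption)

payload = {
    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "messages": [
        {"role": "user", "content": "What is your favourite condiment?"},
    ],
    "max_tokens": 256,
    "temperature": 1.0,
}

# Content-Type and Accept as listed under HEADERS below.
headers = {"Content-Type": "application/json", "Accept": "application/json"}

response = requests.post(f"{BASE_URL}/v1/generate", json=payload, headers=headers)
response.raise_for_status()
print(response.json())
```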
Request JSON Object Structure
Parameter | Description | Default Value |
---|---|---|
model | Model to run the input on. The supported models are listed in the table above. | - |
prompt | Formatted prompt to be fed as input to the model. Note: the value is expected to be an already formatted prompt, i.e. text with the model's prompt template applied (see the prompt-formatting sketch after this table). | - |
messages | Optional[List[dict]]. OpenAI-formatted messages. Example: messages = [{"role": "user", "content": "What is your favourite condiment?"}, {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"}, {"role": "user", "content": "Do you have mayonnaise recipes?"}]. When this input format is used, the model's prompt template is applied automatically. Note: this is not supported for microsoft/phi-2. | - |
max_tokens | An integer representing the maximum number of tokens to generate in the output. | - |
n | Number of outputs to generate / number of beams to use. Optional. | - |
best_of | Controls the number of candidate generations to produce from which the best is selected. Optional. | None |
presence_penalty | A float that penalizes new tokens based on their existing presence in the text. Encourages exploration of new topics and ideas. | 0.0 |
frequency_penalty | A float that decreases the likelihood of repetition of previously used words. The higher the penalty, the less likely repetition. | 0.0 |
repetition_penalty | A float that controls the penalty for token repetitions in the output. Values > 1 will penalize and decrease repetition likelihood. | 1.0 |
temperature | A float that controls randomness in the generation. Lower values are more deterministic, higher values encourage diversity. | 1.0 |
top_p | A float in the range [0,1] controlling the nucleus sampling method, which truncates the distribution to the top p%. | 1.0 |
top_k | An integer controlling the number of highest-probability vocabulary tokens to keep for top-k filtering. Set to -1 to disable top-k filtering. | -1 |
min_p | A float in [0, 1] giving the minimum probability a token must have, relative to the probability of the most likely token, to be considered. 0.0 disables this filter. | 0.0 |
use_beam_search | Boolean indicating whether to use beam search for generation, which might provide better quality outputs at the expense of speed. | False |
length_penalty | A float that penalizes or rewards longer sequences. Values < 1 favor shorter sequences, and values > 1 favor longer ones. | 1.0 |
early_stopping | Boolean indicating whether to stop generation early if the end token is predicted. Makes generation faster and prevents long outputs. | False |
mock_response | Boolean indicating if a mock response is generated. Currently, only True is supported. | - |
Ensure that your input adheres to these parameters for optimal generation results. The model processes the request and generates text based on the configuration and content provided in the request body.
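Because the `prompt` field (sent as `formatted_prompt` in the request body template below) expects text with the model's prompt template already applied, one way to produce it is with the Hugging Face `transformers` tokenizer of the chosen model. This is a hedged sketch, not part of the API itself: it assumes `transformers` is installed and that the selected model ships a chat template, which holds for the chat/instruct models listed above.

```python
from transformers import AutoTokenizer

messages = [{"role": "user", "content": "What is your favourite condiment?"}]

# Load the tokenizer for the same model you will pass in the "model" field.
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Render the chat messages into a single prompt string with the model's
# template applied, ending with the assistant turn so generation continues there.
formatted_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

payload = {
    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "formatted_prompt": formatted_prompt,  # or send "messages" and let the API apply the template
    "max_tokens": 256,
}
```

Sending `messages` directly is usually simpler; formatting the prompt yourself is mainly useful when you need a custom template or few-shot structure.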
Request Body
{"model"=>"TinyLlama/TinyLlama-1.1B-Chat-v1.0", "formatted_prompt"=>"<string>", "messages"=>[{"role"=>"<string>", "content"=>"<string>"}, {"role"=>"<string>", "content"=>"<string>"}], "max_tokens"=>256, "n"=>1, "best_of"=>1, "presence_penalty"=>0, "frequency_penalty"=>0, "repetition_penalty"=>1, "temperature"=>1, "top_p"=>1, "top_k"=>-1, "min_p"=>0, "use_beam_search"=>false, "length_penalty"=>1, "early_stopping"=>false, "mock_response"=>false}
HEADERS
Key | Datatype | Required | Description |
---|---|---|---|
Content-Type | string | | |
Accept | string | | |
RESPONSES
status: OK
{}
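The response body shown above is only a placeholder, and the response schema is not documented in this section. The sketch below therefore only checks the HTTP status and prints the returned JSON without assuming any particular fields; `BASE_URL`, `headers`, and `payload` follow the earlier sketches and are assumptions.

```python
import requests

BASE_URL = "https://your-deployment.example.com"  # placeholder for {{baseUrl}} (assumption)
headers = {"Content-Type": "application/json", "Accept": "application/json"}
payload = {
    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "messages": [{"role": "user", "content": "Do you have mayonnaise recipes?"}],
    "max_tokens": 256,
}

response = requests.post(f"{BASE_URL}/v1/generate", json=payload, headers=headers)

if response.status_code == 200:  # corresponds to "status: OK" above
    # Inspect the raw JSON rather than assuming specific fields,
    # since the response schema is not documented here.
    print(response.json())
else:
    print(f"Request failed with {response.status_code}: {response.text}")
```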