Generate

POST {{baseUrl}}/v1/generate

The supported Large Language Models (LLMs) are:

- TinyLlama/TinyLlama-1.1B-Chat-v1.0
- microsoft/Phi-3-mini-4k-instruct
- meta-llama/Meta-Llama-3-8B-Instruct
- mistralai/Mistral-7B-Instruct-v0.2

📘 Learn more about the New Gen LLMs here

Request JSON Object Structure

| Parameter | Description | Default Value |
| --- | --- | --- |
| model | Model to run the input on; supported models are listed above. | - |
| prompt | Formatted prompt to be fed as input to the model. Note: this value is expected to be an already formatted prompt. | - |
| messages | Optional[List[dict]] of OpenAI-formatted messages, e.g. `messages = [{"role": "user", "content": "What is your favourite condiment?"}, {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"}, {"role": "user", "content": "Do you have mayonnaise recipes?"}]`. When this input format is used, the model's prompt template is applied automatically. Note: this is not supported for microsoft/phi-2. | - |
| max_tokens | An integer representing the maximum number of tokens to generate in the output. | - |
| n | Number of outputs to generate / number of beams to use. Optional. | - |
| best_of | Controls the number of candidate generations to produce, from which the best is selected. Optional. | None |
| presence_penalty | A float that penalizes new tokens based on their existing presence in the text. Encourages exploration of new topics and ideas. | 0.0 |
| frequency_penalty | A float that decreases the likelihood of repeating previously used words. The higher the penalty, the less likely repetition becomes. | 0.0 |
| repetition_penalty | A float that controls the penalty for token repetition in the output. Values > 1 penalize and decrease the likelihood of repetition. | 1.0 |
| temperature | A float that controls randomness in generation. Lower values are more deterministic; higher values encourage diversity. | 1.0 |
| top_p | A float in the range [0, 1] controlling nucleus sampling, which truncates the distribution to the top p probability mass. | 1.0 |
| top_k | An integer controlling the number of highest-probability vocabulary tokens to keep for top-k filtering. | -1 |
| min_p | A float controlling the minimum probability a token must have, relative to the most likely token, to be considered for generation. | 0.0 |
| use_beam_search | Boolean indicating whether to use beam search for generation, which may provide better-quality outputs at the expense of speed. | False |
| length_penalty | A float that penalizes or rewards longer sequences. Values < 1 favor shorter sequences; values > 1 favor longer ones. | 1.0 |
| early_stopping | Boolean indicating whether to stop generation early once the end token is predicted. Makes generation faster and prevents overly long outputs. | False |
| mock_response | Boolean indicating whether a mock response is generated. Currently, only True is supported. | - |

Ensure that your input adheres to these parameters for optimal generation results. The model processes the request and generates text based on the sampling configuration and the content supplied in the prompt or messages fields; the sketch below illustrates the difference between the two input styles.
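For chat models, the prompt field is expected to already carry the model's chat template, while messages lets the server apply it for you. Below is a minimal, non-authoritative sketch of the two payload styles. It assumes the Hugging Face tokenizer for the chosen model is available locally; the tokenizer is only used here to build the formatted prompt and is not required by the API itself. Note that the parameter table above names the string field `prompt`, while the request body example below shows `formatted_prompt`; check which key your deployment expects.

```python
from transformers import AutoTokenizer

messages = [
    {"role": "user", "content": "What is your favourite condiment?"},
]

# Option 1: send OpenAI-formatted messages and let the server apply
# the model's prompt template (not supported for microsoft/phi-2).
payload_messages = {
    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "messages": messages,
    "max_tokens": 256,
    "temperature": 0.7,
}

# Option 2: build the formatted prompt yourself and send it as a string.
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
formatted = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
payload_prompt = {
    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "prompt": formatted,  # or "formatted_prompt", depending on your deployment
    "max_tokens": 256,
    "temperature": 0.7,
}
```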

Request Body

{"model"=>"TinyLlama/TinyLlama-1.1B-Chat-v1.0", "formatted_prompt"=>"<string>", "messages"=>[{"role"=>"<string>", "content"=>"<string>"}, {"role"=>"<string>", "content"=>"<string>"}], "max_tokens"=>256, "n"=>1, "best_of"=>1, "presence_penalty"=>0, "frequency_penalty"=>0, "repetition_penalty"=>1, "temperature"=>1, "top_p"=>1, "top_k"=>-1, "min_p"=>0, "use_beam_search"=>false, "length_penalty"=>1, "early_stopping"=>false, "mock_response"=>false}

HEADERS

| Key | Datatype | Required | Description |
| --- | --- | --- | --- |
| Content-Type | string | | |
| Accept | string | | |

RESPONSES

status: OK

```json
{}
```