Generate

POST {{baseUrl}}/v1/generate

The supported Large Language Models (LLMs) are:

- TinyLlama/TinyLlama-1.1B-Chat-v1.0
- microsoft/Phi-3-mini-4k-instruct
- meta-llama/Meta-Llama-3-8B-Instruct
- mistralai/Mistral-7B-Instruct-v0.2

📘 Learn more about the New Gen LLMs here

Request JSON Object Structure

| Parameter | Description | Default Value |
| --- | --- | --- |
| model | Model to run the input on; supported models are listed above. | - |
| prompt | Formatted prompt to be fed as input to the model. Note: this value is expected to be an already formatted prompt. | - |
| messages | Optional[List[dict]] of OpenAI-formatted messages, e.g. `messages = [{"role": "user", "content": "What is your favourite condiment?"}, {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"}, {"role": "user", "content": "Do you have mayonnaise recipes?"}]`. When this input format is used, the model's prompt template is applied automatically. Note: this is not supported for microsoft/phi-2. | - |
| max_tokens | An integer representing the maximum number of tokens to generate in the output. | - |
| n | Number of outputs to generate / number of beams to use. Optional. | - |
| best_of | Controls the number of candidate generations to produce, from which the best is selected. Optional. | None |
| presence_penalty | A float that penalizes new tokens based on their existing presence in the text. Encourages exploration of new topics and ideas. | 0.0 |
| frequency_penalty | A float that decreases the likelihood of repeating previously used words. The higher the penalty, the less likely repetition becomes. | 0.0 |
| repetition_penalty | A float that controls the penalty for token repetition in the output. Values > 1 penalize and decrease the likelihood of repetition. | 1.0 |
| temperature | A float that controls randomness in generation. Lower values are more deterministic; higher values encourage diversity. | 1.0 |
| top_p | A float in the range [0, 1] controlling nucleus sampling, which truncates the distribution to the top p probability mass. | 1.0 |
| top_k | An integer controlling the number of highest-probability vocabulary tokens to keep for top-k filtering. | -1 |
| min_p | A float controlling the minimum probability a token must have, relative to the most likely token, to be considered for generation. | 0.0 |
| use_beam_search | Boolean indicating whether to use beam search for generation, which may provide better-quality outputs at the expense of speed. | False |
| length_penalty | A float that penalizes or rewards longer sequences. Values < 1 favor shorter sequences; values > 1 favor longer ones. | 1.0 |
| early_stopping | Boolean indicating whether to stop generation early once the end token is predicted. Makes generation faster and prevents overly long outputs. | False |
| mock_response | Boolean indicating whether a mock response is generated. Currently, only True is supported. | - |

Ensure that your input adheres to these parameters for optimal generation results. The model processes the request and generates text based on the sampling configuration and the content supplied in the prompt or messages fields; the sketch below illustrates the difference between the two input styles.
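For chat models, the prompt field is expected to already carry the model's chat template, while messages lets the server apply it for you. Below is a minimal, non-authoritative sketch of the two payload styles. It assumes the Hugging Face tokenizer for the chosen model is available locally; the tokenizer is only used here to build the formatted prompt and is not required by the API itself. Note that the parameter table above names the string field `prompt`, while the request body example below shows `formatted_prompt`; check which key your deployment expects.

```python
from transformers import AutoTokenizer

messages = [
    {"role": "user", "content": "What is your favourite condiment?"},
]

# Option 1: send OpenAI-formatted messages and let the server apply
# the model's prompt template (not supported for microsoft/phi-2).
payload_messages = {
    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "messages": messages,
    "max_tokens": 256,
    "temperature": 0.7,
}

# Option 2: build the formatted prompt yourself and send it as a string.
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
formatted = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
payload_prompt = {
    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "prompt": formatted,  # or "formatted_prompt", depending on your deployment
    "max_tokens": 256,
    "temperature": 0.7,
}
```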

Request Body

{"model"=>"TinyLlama/TinyLlama-1.1B-Chat-v1.0", "formatted_prompt"=>"<string>", "messages"=>[{"role"=>"<string>", "content"=>"<string>"}, {"role"=>"<string>", "content"=>"<string>"}], "max_tokens"=>256, "n"=>1, "best_of"=>1, "presence_penalty"=>0, "frequency_penalty"=>0, "repetition_penalty"=>1, "temperature"=>1, "top_p"=>1, "top_k"=>-1, "min_p"=>0, "use_beam_search"=>false, "length_penalty"=>1, "early_stopping"=>false, "mock_response"=>false}

HEADERS

| Key | Datatype | Required | Description |
| --- | --- | --- | --- |
| Content-Type | string | | |
| Accept | string | | |

RESPONSES

status: OK

```json
{}
```