Demystifying LLM Parameters: Optimizing Language AI for Better Outputs

When using Language AI to generate content, there are many options to control the outputs. Let's take a look at them in this post.

In simple words, a Large Language Model is a prediction engine. The model takes a string as input (the prompt) and predicts what should come next. Behind the scenes, it assigns a probability to each word that could follow. The output of the model is a giant list of possible words and their probabilities, and it returns only one of those words based on the parameters you set. It then repeats this step, one word at a time, to build up the rest of the output.
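To make the idea concrete, here is a minimal sketch in Python. The prompt, the candidate words, and their probabilities are made up for illustration, and `pick_next_word` is a hypothetical helper, not part of any real model's API.

```python
import random

# Made-up probabilities for words that could follow the prompt "The sky is".
# A real model scores tens of thousands of candidates, not four.
next_word_probs = {"blue": 0.60, "clear": 0.25, "falling": 0.10, "purple": 0.05}

def pick_next_word(probs, rng):
    """Return one word, chosen in proportion to its probability."""
    words = list(probs)
    weights = [probs[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

rng = random.Random(42)  # seeded so repeated runs give the same picks
picks = [pick_next_word(next_word_probs, rng) for _ in range(10)]
print(picks)
```

Most picks should be the likelier words, with an occasional unlikely one. The parameters in the rest of this post all change how that pick is made.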

In this post, we’re going to cover what those parameters are and how you can tweak them to get the best outputs.

Number of Tokens

We said earlier that the language model builds a list of words and their probabilities as outputs. That is not technically correct: it builds a list of tokens. A token is roughly 4 characters of text, but not always. For example, a short word like “juice” might end up being one token, whereas longer words might be broken up into multiple tokens.
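If you only need a ballpark figure, the 4-characters rule of thumb can be turned into a tiny estimator. This is a rough sketch, not a tokenizer: `estimate_tokens` is a hypothetical helper, and real tokenizers can give quite different counts, especially for rare words, code, or non-English text.

```python
def estimate_tokens(text, chars_per_token=4):
    """Rough token count from the ~4 characters per token rule of thumb."""
    if not text:
        return 0
    # Any non-empty text counts as at least one token.
    return max(1, round(len(text) / chars_per_token))

for word in ("juice", "internationalization"):
    print(word, estimate_tokens(word))
```

For an exact count, use the tokenizer that ships with the model you are calling.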

You probably don’t want the language model to keep generating outputs indefinitely, so the number of tokens parameter lets you set a limit on how many tokens are generated. There’s also a natural limit to the number of tokens the model can produce, which depends on the model. For example, smaller models can go up to 1024 tokens while larger models go up to 2048.

It’s not recommended to hit those limits though. If you’re generating content using a large limit, the model may go off in a direction you’re not expecting. It’s generally recommended to generate in short bursts versus one long burst.
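Generating in short bursts could look something like the sketch below. It assumes a hypothetical `generate(text, max_tokens)` callable that wraps whatever model you use. `fake_generate` is only a stand-in so the example runs on its own.

```python
def generate_in_bursts(generate, prompt, burst_tokens=50, max_bursts=4):
    """Build up text in several short generations instead of one long one.

    `generate` is a hypothetical callable: generate(text, max_tokens) -> str.
    """
    text = prompt
    for _ in range(max_bursts):
        chunk = generate(text, burst_tokens)
        if not chunk:  # the model had nothing more to add
            break
        text += chunk
        # This is the point where you could review or trim `text`
        # before asking for the next burst.
    return text

# Stand-in for a real model call, so the sketch runs on its own.
def fake_generate(text, max_tokens):
    return " more" if text.count(" more") < 2 else ""

print(generate_in_bursts(fake_generate, "Once upon a time"))
```

The pause between bursts is the useful part: it gives you a chance to check the direction of the text before the model continues.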


Temperature

Temperature is a close second to prompt engineering when it comes to controlling the output of the Command model. It determines how creative the model should be.

Consider the phrase "She walked into". You'd probably expect words like "the room" or "the building" to follow, rather than "the forest" or "the ocean". Your mind naturally generates predictions for the next words, leaning towards "the room" as the most likely continuation and "the ocean" as less probable (unless you're envisioning a particularly adventurous scene).

And that’s essentially what the model does. It has probabilities for all the different words that could follow and then selects the next word to output. The Temperature setting controls how strongly it favors the most likely words over the less likely ones.

A Temperature of 0 makes the model deterministic. It limits the model to the word with the highest probability, so you can run it over and over and get the same output. As you increase the Temperature, the limit softens and the model becomes more willing to use words with lower and lower probabilities. At a Temperature of 5, the probabilities are flattened so much that unlikely words come up often, and it might generate something as unexpected as “tarnished” if you run it enough times.
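A common way to implement this is to divide the model's raw scores (logits) by the Temperature before turning them into probabilities with a softmax. The sketch below assumes that formulation. The scores for "She walked into" are made up, and `apply_temperature` is a hypothetical helper and not any particular model's code.

```python
import math

def apply_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # Treat 0 as greedy decoding: all probability on the top-scoring token.
        best = max(logits, key=logits.get)
        return {tok: 1.0 if tok == best else 0.0 for tok in logits}
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Made-up scores for continuations of "She walked into".
logits = {"the room": 4.0, "the building": 3.0, "the forest": 1.5, "the ocean": 1.0}

for t in (0, 0.5, 1.0, 5.0):
    probs = apply_temperature(logits, t)
    print(t, {tok: round(p, 3) for tok, p in probs.items()})
```

As the Temperature rises, the printed probabilities move closer together, which is why "the ocean" starts to show up more often.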

Top-k and Top-p

Aside from Temperature, Top-k and Top-p are the two other ways to pick the output token.

Top-k tells the model to pick the next token from the top ‘k’ tokens in its list, sorted by probability.

Consider the input phrase “The name of that state is”. The next token could be “Andhra Pradesh”, “Tamil Nadu”, “Karnataka”, and so on, with varying probabilities. There may be dozens of potential outputs with decreasing probabilities, but if you set k to 3, you’re telling the model to only pick from the top 3 options.

So if you run the same prompt a bunch of times, you’ll get Andhra Pradesh very often and a smattering of Tamil Nadu or Karnataka, but nothing else.

If you set k to 1, the model will only pick the top token (Andhra Pradesh, in this case).
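Here is a minimal sketch of top-k filtering. The probabilities are invented for illustration, and `top_k_filter` is a hypothetical helper. It keeps the k most probable tokens and rescales their probabilities so they sum to 1 again.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    if k < 1:
        raise ValueError("k must be at least 1")
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = ranked[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

# Invented probabilities for the prompt "The name of that state is".
state_probs = {
    "Andhra Pradesh": 0.50,
    "Tamil Nadu": 0.20,
    "Karnataka": 0.15,
    "Kerala": 0.10,
    "Goa": 0.05,
}

print(top_k_filter(state_probs, 3))  # only the top three states remain
print(top_k_filter(state_probs, 1))  # only the single most likely state
```

The model then samples from the filtered list, so anything outside the top k can never be picked.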

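Top-p works similarly to top-k but selects tokens based on the sum of their probabilities. The model goes down the list from the most probable token and keeps only the tokens whose combined probability adds up to about p. Exactly where the cutoff falls (just under or just over p) depends on the implementation. For instance, with p set to 0.15, it might consider only the top few tokens, whose probabilities sum to approximately 14.7%. This approach is more adaptive than a fixed k and is commonly used to eliminate outputs with lower probabilities. For example, if you set p to 0.75, you effectively exclude the tail of tokens that makes up the least probable 25% of the total probability.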
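Top-p filtering can be sketched in a few lines. This version assumes one common convention: keep adding tokens, most probable first, until the running total reaches or exceeds p. Some implementations stop just before crossing p instead. The probabilities and the `top_p_filter` helper are made up for illustration.

```python
def top_p_filter(probs, p):
    """Keep the most probable tokens until their cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:  # assumption: stop once p is reached or exceeded
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

# Invented probabilities for the prompt "The name of that state is".
state_probs = {
    "Andhra Pradesh": 0.50,
    "Tamil Nadu": 0.20,
    "Karnataka": 0.15,
    "Kerala": 0.10,
    "Goa": 0.05,
}

print(top_p_filter(state_probs, 0.75))  # keeps the first three states
print(top_p_filter(state_probs, 0.40))  # the top state alone already covers 0.40
```

Unlike top-k, the number of tokens that survive changes with the shape of the distribution: a confident model keeps few tokens, an uncertain one keeps many.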

Frequency and Presence Penalties

The final set of parameters is the frequency and presence penalties.

The frequency penalty reduces the likelihood of tokens that have already been used in the preceding text, including the prompt. The penalty is higher for tokens that have appeared more frequently, discouraging their repetition in the generated text.

The presence penalty applies a penalty to tokens if they have appeared at least once before, regardless of how frequently they occurred. These settings are beneficial for minimizing repetition in generated outputs.
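One common formulation subtracts both penalties from a token's raw score before sampling: the frequency penalty is multiplied by how many times the token has appeared, while the presence penalty is applied once if it has appeared at all. The sketch below assumes that formulation. Your model may apply the penalties differently, and `apply_penalties` and the example values are hypothetical.

```python
from collections import Counter

def apply_penalties(logits, previous_tokens, frequency_penalty=0.0, presence_penalty=0.0):
    """Lower the scores of tokens that already appeared in the text so far."""
    counts = Counter(previous_tokens)
    adjusted = {}
    for tok, score in logits.items():
        count = counts.get(tok, 0)
        seen = 1 if count > 0 else 0
        # Frequency penalty grows with each repeat; presence penalty is flat.
        adjusted[tok] = score - frequency_penalty * count - presence_penalty * seen
    return adjusted

# Made-up scores: all three tokens start out equally likely.
logits = {"cat": 2.0, "dog": 2.0, "bird": 2.0}
history = ["cat", "cat", "cat", "dog"]

print(apply_penalties(logits, history, frequency_penalty=0.5, presence_penalty=0.2))
```

In this example "cat" drops the most because it has appeared three times, "dog" drops a little, and "bird" is untouched because it has not appeared yet.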

Play With The Parameters

Determining the optimal parameters for generating text with Large Language Models isn't about finding a right or wrong approach. It's more about aligning the parameters with your specific goals for the model. The most effective method for identifying the ideal parameter set is through experimentation.

In many cases, trying out different Temperature settings is good enough and you won’t need to touch the other parameters. However, if you have some specific output in mind and want finer control over what the model generates, start using the top-k, top-p, and penalties to get it just right.
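A simple way to experiment is to run the same prompt many times at a few Temperature settings and compare how varied the outputs are. The sketch below does this with a toy sampler in place of a real model. The scores for "She walked into" and the `sample` and `sweep` helpers are assumptions made for illustration.

```python
import math
import random
from collections import Counter

def sample(logits, temperature, rng):
    """Pick one token from temperature-scaled scores (0 means always the top token)."""
    tokens = list(logits)
    if temperature <= 0:
        return max(tokens, key=logits.get)
    peak = max(logits.values())
    weights = [math.exp((logits[t] - peak) / temperature) for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

def sweep(logits, temperatures, runs=200, seed=0):
    """Count which tokens come out at each temperature over many runs."""
    rng = random.Random(seed)
    return {t: Counter(sample(logits, t, rng) for _ in range(runs)) for t in temperatures}

# Made-up scores for continuations of "She walked into".
logits = {"the room": 4.0, "the building": 3.0, "the forest": 1.5, "the ocean": 1.0}

results = sweep(logits, [0, 1.0, 5.0])
for temperature, counts in results.items():
    print(temperature, dict(counts))
```

With a real model you would replace `sample` with an API call and read the outputs yourself, but the habit is the same: change one parameter at a time and look at what it does.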
