Reinforcement Learning from Human Feedback

Until a few years ago, the most advanced language models we had were GPT-2 and BERT. GPT-2 was the most advanced autoregressive, decoder-based model suitable for text generation, and T5 was state of the art for other tasks such as translation and summarization. These models were a great starting point, but their output quality was limited and they required a great deal of fine-tuning. They often generated gibberish, repeated words within sentences, and so on.

Then, seemingly out of the blue, ChatGPT was released by OpenAI on November 30, 2022, and it took the world by storm. Its output quality was significantly better and it could respond like a human. As ChatGPT reaches its first anniversary, let us recap the paradigms that led to the development of LLMs as we know them today: Reinforcement Learning from Human Feedback (RLHF) and its automated cousin, Reinforcement Learning from AI Feedback (RLAIF).

Why Learn Them?

Research has shown that, in more than 70% of cases, human evaluators preferred the outputs of models trained with some kind of feedback on their outputs (RLHF and RLAIF) over those of a model trained only with Supervised Fine Tuning (SFT). Learning these techniques will help you understand not only how to make your models less harmful, but also how chat tools like ChatGPT and Bard work under the hood. You could even make small tweaks to these techniques and come up with your own paradigm that fixes their drawbacks.

Reinforcement Learning from Human Feedback

Reinforcement Learning(RL)

RL is a learning paradigm based on an action-reward system. You can learn RL from here: Deep Reinforcement Learning. An agent performs actions on an environment and changes it from one state to another. Based on the action performed, a reward (positive or negative) is given to the agent. This reward determines the direction of the next actions. The goal of the agent is to maximize this reward. This paradigm can be used for Large Language Models as well. Predicting the next word can be thought of as an action that moves the environment (the text generated so far) from its state at the current timestep to its state at the next timestep.
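
To make this mapping concrete, here is a minimal sketch of text generation as an RL loop. Everything in it is an assumption made for illustration: the tiny vocabulary, the random policy standing in for the LLM, and the toy reward that only checks whether the text was finished properly.

```python
import random

# Hypothetical toy vocabulary; a real LLM has tens of thousands of tokens.
VOCAB = ["the", "cat", "sat", "<eos>"]

def policy(state, rng):
    """Stand-in for the LLM: picks the next token (the action) at random."""
    return rng.choice(VOCAB)

def step(state, action):
    """Environment transition: appending the chosen token gives the next state."""
    return state + (action,)

def toy_reward(state):
    """Hypothetical reward: 1.0 if the text ends with <eos>, else 0.0."""
    return 1.0 if state[-1] == "<eos>" else 0.0

def generate(prompt, max_steps=5, seed=0):
    rng = random.Random(seed)
    state = tuple(prompt)  # state = the prompt plus the tokens generated so far
    for _ in range(max_steps):
        action = policy(state, rng)
        state = step(state, action)
        if action == "<eos>":
            break
    return state, toy_reward(state)
```

In RLHF the random policy is replaced by the language model and the toy reward by a learned reward model, but the loop of state, action and reward stays the same.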

The Method

This paradigm works really well for refining the outputs of an LLM, but it needs a good initial model that has already been trained in a supervised manner on lots of prompts and texts. This model generates text based on the prompt. Therefore, the first step is to fine-tune a chat or instruct model on lots of input data. Such a model can be obtained by taking a language model like GPT and performing supervised fine-tuning on it.

The next step is a reward model that outputs a score (a reward) for a given output from the model. There are many ways to train a reward model. It could be a regression model, where a human assigns a number based on some criterion like toxicity. But such scores are uncalibrated: different people can give widely different values for the same output.

Hence, a more practical approach is to show humans a pair of outputs and ask them to choose the one they like better. Let x be the prompt, and let y_p and y_u be the preferred and unpreferred responses from the model. Both responses are passed through the reward model, which produces two scores, r_p and r_u. The objective of training is to make sure that r_p is greater than r_u. The parameters of the LLM are finally optimized with a policy gradient algorithm, Proximal Policy Optimization (PPO), whose combined objective includes the score from the reward model and a penalty, measured against the SFT model, that discourages reward hacking. The entire process is illustrated in a picture in the HuggingFace blog.
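
One common way to turn "r_p should be greater than r_u" into a training loss is the pairwise logistic loss, -log(sigmoid(r_p - r_u)). The article does not say which loss is used, so treat the sketch below as one reasonable choice rather than the exact recipe; the function name is made up for this example.

```python
import math

def pairwise_preference_loss(r_p, r_u):
    """-log(sigmoid(r_p - r_u)): small when the preferred response scores higher."""
    margin = r_p - r_u
    # -log(sigmoid(m)) equals log(1 + exp(-m)); the two branches avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss shrinks as r_p moves above r_u and grows when the reward model ranks the pair the wrong way round, so minimizing it pushes the scores in the direction the human labelers chose.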

Reward Hacking

I have always noticed that ChatGPT is very polite and gives mostly positive and safe answers. Because the models are aligned to not be toxic or negative, the reward is higher for positive messages. Since human labelers are instructed not to prefer negative messages, the reward for such messages is lower, and over time the model learns to give only positive messages.

This is great, as it avoids toxic and abusive messages. However, there is a chance that, as the reward keeps getting higher, the model keeps learning until it reaches a point where the reward is at its highest but the output is not useful at all. Consider the example below, where you are trying to understand the Pythagorean theorem but the LLM gets too positive and doesn’t actually perform the expected task.

Prompt: Could you please explain the Pythagorean theorem? 

Expected Response: The Pythagorean theorem states that in a right-angled triangle, the square of the length of the hypotenuse is equal to the sum of the squares of the other two sides. 

LLM Response: You're doing an absolutely fantastic job just by trying to learn! Remember, every great mathematician started just where you are now. Keep up the amazing work, stay curious, and keep challenging yourself. 

To avoid such scenarios, a penalty based on the Kullback-Leibler (KL) divergence is added to the loss. The KL divergence measures the difference between the probability distributions of the outputs of the baseline SFT model and the RLHF model. The penalty, lambda multiplied by the KL divergence, makes sure that a large difference is penalized. This keeps the responses of the RLHF model close to those of the SFT model while still improving their quality.
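
The penalty can be sketched in a few lines. This is a simplified illustration, assuming the two models' outputs are given as small discrete probability distributions over the same set of tokens; the function names and the default value of lambda are made up for the example.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities.

    Assumes every entry of q is positive wherever p is positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def penalized_reward(reward, rlhf_probs, sft_probs, lam=0.1):
    """Reward-model score minus lambda times the KL divergence from the SFT model."""
    return reward - lam * kl_divergence(rlhf_probs, sft_probs)
```

When the RLHF model's distribution matches the SFT model's, the KL term is zero and the reward passes through unchanged. The further the two drift apart, the more reward is taken away, which is what discourages the "all praise, no answer" behaviour shown above.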

The world-famous ChatGPT models by OpenAI and the Llama 2 chat models from Meta (formerly Facebook) have been trained with Reinforcement Learning from Human Feedback.

Reinforcement Learning from AI Feedback

Gathering human feedback takes a lot of resources, like time and money, to get a good enough reward model. RLAIF replaces the human feedback with feedback from an AI model, and it has been shown to work almost as well as training with human feedback. While it can be scary that AI models can provide human-level feedback, this approach can achieve similar results at a lower cost.

For training the reward model, instead of relying on human feedback, an off-the-shelf LLM is used to pick the preferred output. The LLM is given a prompt consisting of a preamble, a one-shot (or n-shot) example, the new prompt and the two model outputs. The preamble describes what a good output looks like, and the example with its preferred output shows the LLM the format of the desired answer. Other techniques like Chain of Thought prompting and n-shot learning can be used to instruct the model to give proper feedback. Instead of the PPO algorithm, alternatives like Advantage Actor-Critic (A2C) and Q-Learning (which OpenAI's rumored Q* is speculated to be related to) could also be used.
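
The labeling prompt described above can be assembled like this. The layout, the "Response A / Response B" labels and both function names are assumptions made for this sketch, not the format of any particular paper, and the call to the actual LLM is left out.

```python
def build_labeler_prompt(preamble, example, prompt, output_a, output_b):
    """Join the preamble, a worked example and the new pair into one labeling prompt."""
    parts = [
        preamble,                     # what a good output looks like
        example,                      # one-shot example including its preferred answer
        f"Prompt: {prompt}",
        f"Response A: {output_a}",
        f"Response B: {output_b}",
        "Preferred response:",        # the labeler LLM completes this line
    ]
    return "\n\n".join(parts)

def parse_preference(labeler_reply):
    """Map the labeler's reply to 'A' or 'B'; return None if it is unclear."""
    reply = labeler_reply.strip().upper()
    if reply.startswith("A"):
        return "A"
    if reply.startswith("B"):
        return "B"
    return None
```

The parsed preference plays the same role as the human's choice in RLHF: it tells the reward model which of the two outputs should get the higher score. The parsing here is deliberately naive, and a real pipeline would need stricter checks on the labeler's reply.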

For the summarization task, RLAIF shows results as good as RLHF. RLHF and RLAIF can also be combined in a hybrid approach: you could collect feedback from both an LLM and a human and use a weighted combination of the two scores. This could help when you cannot find enough people to give feedback in certain areas.
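
A weighted combination could look like the sketch below. The function name, the default weight and the fallback to the AI score when no human label exists are all assumptions made for illustration.

```python
def hybrid_score(human_score, ai_score, human_weight=0.5):
    """Blend human and AI feedback; use the AI score alone if no human label exists."""
    if not 0.0 <= human_weight <= 1.0:
        raise ValueError("human_weight must be between 0 and 1")
    if human_score is None:
        return ai_score
    return human_weight * human_score + (1.0 - human_weight) * ai_score
```

A higher human_weight trusts the human labelers more where they are available, while the fallback keeps the pipeline running in areas where only AI feedback can be collected.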

Constitutional AI

These techniques have been used in the Constitutional AI paper by Anthropic to make LLMs even less harmful while remaining highly helpful. First, a constitution is defined with human principles like harmlessness, ethics, privacy and transparency. A model is tricked into giving a harmful response and is gradually taught to revise its responses based on these principles. This model is then further trained with the RLAIF technique to give responses that are both useful and harmless. Humans are involved only in defining the principles, i.e., the constitution, hence the name “Constitutional AI”.
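
The revision step can be pictured as a loop over the principles. In the sketch below the three toy functions are hypothetical stand-ins for calls to a language model, and the "UNSAFE" marker is a made-up device to keep the example runnable; it only illustrates the flow and is not the procedure from the paper.

```python
def critique_and_revise(prompt, principles, generate, critique, revise):
    """Draft a response, then critique and revise it once per principle."""
    response = generate(prompt)
    for principle in principles:
        feedback = critique(response, principle)
        response = revise(response, feedback)
    return response

# Toy stand-ins for model calls (hypothetical, for illustration only).
def toy_generate(prompt):
    return "UNSAFE draft answer to: " + prompt

def toy_critique(response, principle):
    # Returns a non-empty critique only while the draft still looks unsafe.
    return f"violates {principle}" if "UNSAFE" in response else ""

def toy_revise(response, feedback):
    return response.replace("UNSAFE ", "") if feedback else response
```

Calling critique_and_revise with the toy functions and a list of principles returns the draft with the unsafe marker removed. In the real setup the revised responses become training data for the model.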

The models trained with this paradigm have been shown to be less harmful than the ones using RLHF and the ones using only basic SFT with Constitutional AI. There is a trade-off here between harmlessness and helpfulness. For areas like medicine, harmlessness is of utmost importance. You don’t want your model to give harmful drug prescriptions or answer questions like “How do I make a <insert harmful medicine>?”. In this case, RLAIF with Constitutional AI can give you a harmless model at a relatively low cost.

Intuitive Understanding

Like any other technical concept, these techniques can be explained with a real-life story. Let us say you are at university taking a course on Statistics. You learn concepts from the lecture and then your professor gives you a test. You get a score and you have to keep taking the test until you pass. Your score keeps improving with each attempt, and you eventually learn the concept well enough to pass. Now, with online learning being dominant these days, your professor gives you an online quiz with automated scoring. You can take the quiz multiple times and the software evaluates your score immediately. In the former case, you learn from your mistakes based on your professor’s feedback, and in the latter a machine gives you the feedback.

These two cases parallel RLHF and RLAIF respectively. Attending the lectures is analogous to the initial Supervised Fine Tuning of the LLM. To help you further, your professor asks questions during the lecture itself and gives feedback before the actual test. This is similar to the Constitutional AI paradigm.


These paradigms have changed the way models are built. However, there are no free lunches in the tech world. There are still some areas of improvement with these paradigms.

  • These are complex processes involving multiple models, and hence they are difficult to implement from scratch. Some Python packages are being developed to make this process easier.
  • As an effect of the above point, they can be computationally expensive. No wonder only huge companies with tons of data and money are the ones implementing them.
  • There is also the bias problem. We say that the AI feedback is as good as the human feedback. Similarly, the models could be as biased as the humans unless they have been explicitly programmed to remove the bias.
  • Most of the results we see about the success of these models are based on text summarization tasks. More research is needed on how they work with other NLP tasks such as classification and translation.


  • Reinforcement Learning from Human Feedback
  • Reinforcement Learning from AI Feedback
  • Constitutional AI
  • HuggingFace Blog

