1 result for “policy gradient”
A reinforcement learning algorithm developed by OpenAI that stabilises policy gradient training by constraining the size of policy updates, widely used for fine-tuning large language models through RLHF.