CyberAgentAILab/filtered-dpo

Introducing Filtered Direct Preference Optimization (fDPO), which enhances language model alignment with human preferences by discarding preference-data samples whose quality is lower than that of samples generated by the learning model.
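The filtering idea in the description can be sketched as follows. This is a minimal illustration of the concept only, assuming access to a per-response quality score; the function names and the toy length-based quality proxy are hypothetical, not the repository's actual API.

```python
# Sketch of the fDPO-style filtering step (hypothetical names; the
# quality proxy below is a stand-in, not the repository's method).

def filter_preference_data(dataset, generate, quality):
    """Keep only pairs whose chosen response scores at least as high
    as a fresh sample from the learning model for the same prompt."""
    kept = []
    for prompt, chosen, rejected in dataset:
        model_sample = generate(prompt)
        # Discard pairs whose chosen response is lower-quality than
        # what the model itself generates for the same prompt.
        if quality(prompt, chosen) >= quality(prompt, model_sample):
            kept.append((prompt, chosen, rejected))
    return kept

# Toy demonstration with a length-based quality proxy (hypothetical).
data = [
    ("q1", "a detailed answer", "bad"),
    ("q2", "ok", "bad"),
]
sample = {"q1": "short", "q2": "a much longer model answer"}.get
score = lambda prompt, response: len(response)

print(filter_preference_data(data, sample, score))
# → [('q1', 'a detailed answer', 'bad')]
```

Here the second pair is dropped because its chosen response scores below the model's own sample, matching the "discard lower-quality samples" behavior described above.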

Stars: 16 | Language: Jupyter Notebook