The code used to filter Common Crawl (CC) data for The Pile