hplt-project/warc2text-runner
Scripts for parallelized extraction of plain text from WARC archives. Aims at a common and reproducible extraction approach.
Stars: 4
Language: Jupyter Notebook