Data filtering methods for training language models | ArxivCSExplorer