arXiv:2605.19407v1 Announce Type: cross
Abstract: We investigate data filtering for large model pretraining via new scaling studies that target the high-compute, data-scarce regime. Despite an apparently common belief that filtering data to include only high-quality information is essential, our experiments suggest that with enough compute, the best data filter is no data filter. We find that sufficiently trained large-parameter models not only tolerate low-quality and distractor data, but in fact benefit from nominally “poor” data.
A maturity model framework for federated networks of trusted research environments
Introduction
A Trusted Research Environment (TRE) is a highly secure computer system where sensitive data is stored that researchers can access remotely and make use of