arXiv:2604.25374v1 Announce Type: cross
Abstract: Background: Dutch medical corpora are scarce, limiting NLP development. Methods: We translated English datasets, identified medical text in generic corpora, and extracted open Dutch medical resources. Results: The resulting corpus comprises approximately 35 billion tokens across the medical domain in about 100 million documents, freely available on Hugging Face. Conclusion: This work establishes the first large-scale Dutch medical language corpus for pre-training and downstream NLP tasks.
