Objectives Publicly available datasets dedicated to clinical speech deidentification tasks remain scarce due to privacy constraints and the complexity of speech-level annotation. To address this gap, we compiled the SREDH-AICup sensitive health information (SHI) speech corpus, a time-aligned clinical speech dataset annotated across 38 SHI categories. Methods Two publicly available English medical-domain datasets were adapted to support speech-level de-identification, including script reformulation and controlled re-recorded by 25 participants. Additional Mandarin Chinese clinical-style materials were incorporated to extend linguistic coverage. All audio data were annotated with million-level, time-aligned SHI spans using Label Studio. Inter-annotator agreement was evaluated using Cohen’s kappa, following iterative calibration rounds. The resulting corpus supports both automatic speech recognition (ASR) and speech-level recognition of SHIs. Results The final dataset comprises 20 hours of annotated audio, divided into training (10 hours, 1,539 files), validation (5 hours, 775 files), and test (5 hours, 710 files) subsets, totalling 7,830 SHI entities. The language distribution reflects the composition of the selected source materials, with 19.36 hours of English and 0.89 hours of Mandarin Chinese speech. Discussion The corpus exhibits a long-tail distribution consistent with clinical documentation patterns and highlights the limited availability of Chinese medical speech resources. These characteristics underscore both the realism of the dataset and structural challenges associated with multilingual speech de-identification. Conclusion The SREDH-AICup SHI speech corpus provides a clinically grounded, time-aligned speech dataset supporting automated medical speech de-identification research and facilitating future development of multilingual speech-based privacy protection systems.

Subscribe for Updates

Copyright 2025 dijee Intelligence Ltd.   dijee Intelligence Ltd. is a private limited company registered in England and Wales at Media House, Sopers Road, Cuffley, Hertfordshire, EN6 4RY, UK registration number 16808844