Evaluating Encoder and Decoder Models for Extended Clinical Concept Recognition in Japanese Clinical Texts: Comparative Study With Weighted Soft Matching

Background: Extracting medical knowledge for secondary purposes, such as diagnostic support, continues to pose a substantial challenge. Conventional named entity recognition has focused on short terms (eg, genes, diseases, and chemicals), whereas extraction and assessment of longer, complex expressions remain underexplored. Clinically vital concepts, such as diseases, pathologies, symptoms, and findings, often appear as long phrases, and accurate extraction is crucial for applications such as constructing causal knowledge from case reports. Consequently, a framework that addresses both short terms and clinically meaningful long phrases, termed extended clinical concept recognition (E-CCR), is essential.

Objective: This study, the first comprehensive investigation of E-CCR model selection, aimed to identify optimal strategies by comparing encoder versus decoder models and general-purpose versus domain-specific pretraining. We analyzed how effectiveness varied with target length and proposed a novel E-CCR evaluation metric.

Methods: We evaluated 17 encoder and decoder models using J-CaseMap, a database of approximately 20,000 Japanese case reports annotated with clinical concepts. Performance was primarily assessed using the weighted soft matching score, which penalizes fragmentation of long extraction targets and weights scores by target length to account for the greater difficulty of extracting longer expressions.

Results: On J-CaseMap, JMedDeBERTa(s), an encoder model pretrained on domain-specific medical text, achieved the highest mean performance (F1-score=0.758, SD 0.002), with similarly strong results from JMedDeBERTa(c), suggesting comparable performance among the top encoder models. As the fragmentation penalty increased, performance generally declined; however, no consistently severe degradation was observed. On the Medical Report Named Entity Recognition for positive disease dataset, the general-domain DeBERTaV2-base yielded the highest mean F1-score, and differences between the medical-domain JMedDeBERTa(s) and JMedDeBERTa(c) variants were small, suggesting limited benefit of domain-specific pretraining. Overall, under our experimental settings (low-rank adaptation fine-tuning for decoders and full fine-tuning for encoders), encoder models outperformed decoder models, and token classification outperformed our instruction tuning setup.

Conclusions: Under our experimental setting, encoder-based token classification achieved the highest mean performance on our internal dataset. Differences among the top encoder models were small and should be interpreted as comparable within the uncertainty implied by our annotation review, whereas decoder-based approaches did not surpass encoder-based models in this setup, suggesting that encoder models can deliver high accuracy with fewer parameters and may offer practical advantages in resource-constrained environments. Token classification outperformed instruction tuning for extracting long expressions, whereas instruction tuning was better suited to short terms. Using the weighted soft matching score, we found that performance did not substantially deteriorate as the fragmentation penalty increased, indicating that extracted spans were rarely fragmented. Similar trends on external validation datasets suggest that findings under our setup may generalize to information extraction tasks on Japanese medical text. Further investigation is needed to determine whether these findings hold across other languages and medical document types.
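To make the idea behind the weighted soft matching score concrete, the sketch below illustrates one plausible formulation: each gold span earns partial credit for character overlap with predictions, a configurable penalty discounts matches split across multiple predicted spans, and per-span scores are weighted by span length. This is a hypothetical illustration under stated assumptions, not the paper's exact definition; the function and parameter names (`weighted_soft_match_score`, `frag_penalty`) are our own.

```python
def span_overlap(pred, gold):
    """Character overlap length between two (start, end) spans."""
    return max(0, min(pred[1], gold[1]) - max(pred[0], gold[0]))


def weighted_soft_match_score(gold_spans, pred_spans, frag_penalty=0.5):
    """Illustrative length-weighted soft-matching score (an assumption,
    not the paper's published formula).

    For each gold span, credit is the fraction of its characters covered
    by predicted spans, discounted by frag_penalty**(k-1) when k > 1
    predicted spans fragment the match. Each gold span's credit is then
    weighted by its length, so long targets dominate the aggregate score.
    """
    total_weight = 0.0
    weighted_credit = 0.0
    for gold in gold_spans:
        length = gold[1] - gold[0]
        hits = [o for o in (span_overlap(p, gold) for p in pred_spans) if o > 0]
        coverage = sum(hits) / length if length else 0.0
        penalty = frag_penalty ** max(0, len(hits) - 1)  # 1.0 if unfragmented
        weighted_credit += length * min(1.0, coverage) * penalty
        total_weight += length
    return weighted_credit / total_weight if total_weight else 0.0
```

Under this toy formulation, an exact match of a single gold span scores 1.0, while recovering the same characters as two fragments scores only 0.5 with `frag_penalty=0.5`, which mirrors the abstract's point that fragmenting long targets should be penalized.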

