Joint Word and Morpheme Segmentation with Bayesian Non-Parametric Models

Shu Okabe; François Yvon

Conference paper, Year: 2023

Modèles bayésiens non-paramétriques pour la segmentation conjointe en mots et morphèmes

Shu Okabe

Role: Author
PersonId: 751379
IdHAL: shu-okabe
ORCID: 0009-0003-0169-3689

Sciences et Technologies des Langues - LISN

Traitement du Langage Parlé - LISN

François Yvon

Role: Author
PersonId: 5347
IdHAL: francois-yvon
ORCID: 0000-0002-7972-7442
IdRef: 057593531

Traitement du Langage Parlé - LISN

Sciences et Technologies des Langues - LISN

Abstract

Language documentation often requires segmenting transcriptions of utterances collected in the field into words and morphemes. While these two tasks are typically performed in succession, here we study Bayesian models that segment utterances at both levels simultaneously. Our aim is twofold: (a) to study the effect of explicitly introducing a hierarchy of units in joint segmentation models; (b) to further assess whether these two levels can be better identified through weak supervision. For this, we first consider a deterministic coupling between independent models; we then design and evaluate hierarchical Bayesian models. Experiments with two under-resourced languages (Japhug and Tsez) allow us to better understand the value of various types of weak supervision. In our analysis, we use these results to revisit the distributional hypotheses behind Bayesian segmentation models and evaluate their validity for language documentation data.

Keywords

Computational Language Documentation, Word Segmentation, Morpheme Segmentation, Bayesian Non-Parametric Model

Domains

Computation and Language [cs.CL]

Main file

2023.findings-eacl.48.pdf (639.31 KB)

Origin: Publisher files allowed on an open archive

https://hal.science/hal-04086368

Submitted on: Tuesday, May 2, 2023, 09:41:48

Last modified on: Tuesday, December 17, 2024, 03:14:34

Long-term archiving on: Thursday, August 3, 2023, 18:17:49

Dates and versions

hal-04086368, version 1 (02-05-2023)

License

Attribution - ShareAlike

Identifiers

HAL Id: hal-04086368, version 1

Cite

Shu Okabe, François Yvon. Joint Word and Morpheme Segmentation with Bayesian Non-Parametric Models. 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2023), Association for Computational Linguistics, May 2023, Dubrovnik, Croatia. pp.628-642. ⟨hal-04086368⟩

Collections

CNRS INRIA PARISTECH CENTRALESUPELEC UNIV-PARIS-SACLAY ANR LISN GS-COMPUTER-SCIENCE LISN-TLP

527 views

123 downloads
