Code-switching refers to the frequent use of non-native words and phrases by speakers while conversing in their native language. Traditionally, training a language model (LM) on code-switching data requires tediously collecting a large text corpus in the respective code-switching domain. Alternatively, we recently proposed a more viable approach that adapts an existing native LM to handle code-switching data. In this work, we present our efforts in language modeling of code-switching data following both the traditional and the proposed approaches. The salient contributions of this paper include: (i) the creation of a Hindi-English code-switching text corpus, (ii) an improved parts-of-speech (POS) labeling scheme for accurate tagging of non-native words embedded in code-switching data, and (iii) the proposal of a novel textual feature, referred to as the code-switching location (CSL) feature, which allows LMs to predict code-switching instances. The proposed features are evaluated on two code-switching datasets: Hindi-English and Mandarin-English. Experimental evaluation shows that the improved POS features achieve a substantial reduction in perplexity. It is also observed that the proposed CSL features provide an independent and additive improvement over the POS features in terms of perplexity.
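To make the CSL idea concrete, here is a minimal sketch of how a code-switching location signal might be derived from language-tagged tokens. The function name, tag values, and the exact feature definition are illustrative assumptions for exposition, not the paper's specification:

```python
# Illustrative sketch (assumed formulation): mark each token by whether the
# token that follows it switches language, so an LM conditioned on this
# signal can anticipate code-switching instances.

def csl_tags(tokens):
    """tokens: list of (word, lang) pairs; returns one CSL tag per token."""
    tags = []
    for i, (_, lang) in enumerate(tokens):
        # "SWITCH" if the next token is in a different language, else "SAME";
        # the final token has no successor, so it is tagged "SAME".
        if i + 1 < len(tokens) and tokens[i + 1][1] != lang:
            tags.append("SWITCH")
        else:
            tags.append("SAME")
    return tags

# Hypothetical Hindi-English example with per-token language labels.
sent = [("mujhe", "hi"), ("ye", "hi"), ("movie", "en"),
        ("bahut", "hi"), ("pasand", "hi"), ("hai", "hi")]
print(csl_tags(sent))
# → ['SAME', 'SWITCH', 'SWITCH', 'SAME', 'SAME', 'SAME']
```

Under this assumed formulation, the switch into and out of the embedded English word "movie" is surfaced as an explicit feature the LM can condition on.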
Link to publication - https://www.sciencedirect.com/science/article/abs/pii/S0885230820300322