Cuando yo lo vi he looked pretty bad.
Codeswitching (CS) is a widely observed phenomenon in social media in which people communicate in two or more languages interchangeably (Spanish and English, for example). Codeswitching is common among bilingual speakers, both in speech and in writing. Identifying the languages in a codeswitched input is a crucial first step before applying other natural language processing algorithms.
This demo showcases my submission to the shared task of the Second Workshop on Computational Approaches to Codeswitching. The system can identify English and Spanish tokens in a codeswitched sentence with high token-level accuracy.
Below are the boring details about how the system works:
The system identifies the language of each token in a codeswitched input in two steps.
In the first step, we use FastText to train a skip-gram word vector model enriched with subword information.
Word vectors are vector representations of words learned from raw text, using models such as Word2Vec. When used as the underlying input representation, word vectors have been shown to boost performance on NLP tasks.
We use FastText word vectors instead of standard Word2Vec because FastText can build representations of out-of-vocabulary words by summing the representations of their character n-grams. This is particularly useful here because the training data is relatively small, so we expect the test dataset to contain words not found in the training dataset. Another motivation for using FastText is its ability to take morphological information into account, which is very important for identifying morphologically rich languages like Spanish.
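The subword idea can be illustrated with a minimal sketch. It is not the project's code: `char_ngrams`, `toy_ngram_vector`, and `word_vector` are hypothetical helpers, and the n-gram vectors here are deterministic pseudo-random values derived from a hash, standing in for the embeddings FastText would learn. The 3 to 6 character range matches FastText's default n-gram lengths.

```python
import hashlib


def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def toy_ngram_vector(gram, dim=8):
    """Hypothetical stand-in for a learned n-gram embedding.

    Values are derived from a hash so the example is deterministic;
    a trained model would look these vectors up instead.
    """
    digest = hashlib.sha256(gram.encode("utf-8")).digest()
    return [(b - 127.5) / 127.5 for b in digest[:dim]]


def word_vector(word, dim=8):
    """Build a word vector by summing the vectors of its character n-grams."""
    vec = [0.0] * dim
    for gram in char_ngrams(word):
        for k, value in enumerate(toy_ngram_vector(gram, dim)):
            vec[k] += value
    return vec


# Words never seen in training still get a vector, and words that
# share n-grams (e.g. "looked" and "looking") share part of their sum.
shared = set(char_ngrams("looked")) & set(char_ngrams("looking"))
```

Because the vector is assembled from n-grams rather than looked up by whole word, an unseen inflection such as "looking" reuses everything learned for n-grams like `look`, which is where the morphological benefit comes from.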
In the second step, we use supervised machine learning to train a linear-chain Conditional Random Field (CRF) classifier that predicts the label of every token. CRFs are naturally suited to sequence labeling tasks and have been shown to perform well in previous work on language identification.
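To make the sequence labeling setup concrete, here is a rough sketch of per-token feature extraction of the kind commonly fed to a linear-chain CRF. The feature set is an assumption for illustration only, since the features the system actually uses are not listed here; `token_features` and `sentence_features` are hypothetical names. In the real system the FastText vector of each token would presumably be added as real-valued features as well.

```python
def token_features(tokens, i):
    """Build a feature dict for the token at position `i` (illustrative features)."""
    token = tokens[i]
    feats = {
        "bias": 1.0,
        "lower": token.lower(),
        "suffix3": token[-3:].lower(),
        "is_title": token.istitle(),
        "has_non_ascii": any(ord(c) > 127 for c in token),
    }
    # Neighbouring tokens give the CRF context about where switches occur.
    if i > 0:
        feats["prev_lower"] = tokens[i - 1].lower()
    else:
        feats["BOS"] = True  # beginning of sentence
    if i < len(tokens) - 1:
        feats["next_lower"] = tokens[i + 1].lower()
    else:
        feats["EOS"] = True  # end of sentence
    return feats


def sentence_features(tokens):
    """Return one feature dict per token, in order."""
    return [token_features(tokens, i) for i in range(len(tokens))]


sentence = "Cuando yo lo vi he looked pretty bad".split()
features = sentence_features(sentence)
```

A CRF trained on such sequences scores whole label sequences rather than tokens in isolation, so the label chosen for "he" can depend on the labels of "vi" and "looked" around it.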
The source code for this project is available on GitHub. Please try not to flood the demo API endpoint.
M. X. Xia, Codeswitching language identification using subword information enriched word vectors
@article{xia2016codeswitching,
title={Codeswitching language identification using subword information enriched word vectors},
author={Xia, Meng Xuan},
journal={EMNLP 2016},
pages={132},
year={2016}
}
My name is Meng Xuan Xia. I graduated from McGill University in 2016 with a major in Honours Computer Science.
I'm a full stack developer and a novice NLP researcher.