Marco Mora Huizar and Dr. Heather Willson, Department of Linguistics and English Language
The purpose of the Marshallese Corpus Project (MCP) is to create an extensive online corpus of Marshallese language texts to be used for research and other applications. Currently, there is a severe lack of resources for the study of the Marshallese language, and it is difficult to gain access to many Marshallese texts. Further, there is little by way of available naturalistic language data in Marshallese. Therefore, the ability of both researchers and language learners to study the language is severely inhibited. This project involves collecting both printed and online text samples, standardizing the spelling of these texts and creating a database of Marshallese sentences, which will constitute the bulk of the corpus. This corpus will be freely available to researchers or anyone else wishing to study Marshallese, and will facilitate syntactic research by allowing linguists to search for different words in context, for different constructions, for different word orders, and for frequencies of words and constructions, as well as to perform diachronic research. It will also provide learners with the ability to see different forms of a word and how they are used in context.
At the outset of this year’s work on the project, we had already gathered the bulk of the texts we are going to use in the corpus. These texts include the Marshallese translation of the Bible and the Marshall Islands Journal, a bilingual periodical based in Majuro, RMI. We had also started a simple tag set. The next steps of the project were therefore to finish designing the tag set, tag the texts, and make the corpus available online.
For the purposes of this project, a tag set is a list of tags and their respective meanings. These tags are used to label each word in the database of texts. The table at the right is an example of a simple tag set. A real Marshallese tag set took time and effort to design because the language is complex: to represent that complexity, we needed to identify linguistic phenomena and find labels for them. For example, some Marshallese nouns change form depending on who or what possesses them. These are called inalienables; they look and behave differently from other nouns, so they needed a separate tag. In addition, we could not simply assign arbitrary tags, because we want the tags to be retrievable; in other words, we want to be able to search for all the words labeled with the same tag. So we came up with the following labels: NN for normal nouns, NNP for proper nouns, and NNI for inalienable nouns. Because NN is part of all three labels, a search for NN will pull up all nouns, but a search for NNI will pull up only the nouns labeled as inalienable.
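The retrieval behavior described above can be sketched as a prefix search over tagged words. This is only an illustration and not the project's actual search interface: the function name `search_by_tag` and the placeholder words are hypothetical, and only the tags NN, NNP, and NNI come from the tag set described here (VV is borrowed from the English example for contrast).

```python
from typing import List, Tuple

# A tagged token is a (word, tag) pair. The words are placeholders,
# not real Marshallese data.
TaggedToken = Tuple[str, str]


def search_by_tag(tokens: List[TaggedToken], tag_prefix: str) -> List[str]:
    """Return every word whose tag starts with tag_prefix.

    Because NN, NNP, and NNI all begin with "NN", searching for "NN"
    retrieves all nouns, while "NNI" retrieves only inalienable nouns.
    """
    return [word for word, tag in tokens if tag.startswith(tag_prefix)]


# Hypothetical sample: one word per noun tag plus one non-noun.
sample = [
    ("noun_a", "NN"),    # normal noun
    ("name_b", "NNP"),   # proper noun
    ("kin_c", "NNI"),    # inalienable noun
    ("verb_d", "VV"),    # not a noun
]

all_nouns = search_by_tag(sample, "NN")
inalienable_only = search_by_tag(sample, "NNI")
```

Building the subclass labels on a shared prefix is what makes the broad search possible; unrelated labels for each noun type would require listing every noun tag in each query.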
Once the design of the tag set was finished, the next step was to tag a word list that will be used to train the tagger. The tagger is a program that can be trained to label the words of a text according to the tag set. For example, in English we would want the tagger to turn a sentence such as “The cat runs slowly” into “The(ART) cat(NN) runs(VV) slowly(ADV)”. To train the tagger to do this, we need to provide it with a list of the most common words in Marshallese and their respective tags. Our list contained just over 2,400 words, and tagging these words by hand took a great deal of time. We have only recently finished tagging this word list, so there is still some work to be done before we can publish the corpus.
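To make the role of the hand-tagged word list concrete, the sketch below shows a simple lookup tagger that labels each word with the tag stored for it in the list. This is a simplified stand-in and not the tagger used in the project, which is not named here; a trainable tagger would also use context to handle ambiguous and unseen words. The function names and the `UNK` tag for unknown words are assumptions made for illustration, and the example uses the English sentence from the text.

```python
from typing import Dict, List, Tuple


def build_lexicon(word_list: List[Tuple[str, str]]) -> Dict[str, str]:
    """Turn a hand-tagged word list into a lowercase word -> tag lookup."""
    return {word.lower(): tag for word, tag in word_list}


def tag_sentence(sentence: str, lexicon: Dict[str, str],
                 unknown_tag: str = "UNK") -> str:
    """Label each whitespace-separated word as word(TAG).

    Words missing from the lexicon receive unknown_tag (an assumed
    placeholder, not part of the project's tag set).
    """
    return " ".join(
        f"{word}({lexicon.get(word.lower(), unknown_tag)})"
        for word in sentence.split()
    )


# Tiny English word list mirroring the example in the text.
lexicon = build_lexicon([
    ("the", "ART"),
    ("cat", "NN"),
    ("runs", "VV"),
    ("slowly", "ADV"),
])

tagged = tag_sentence("The cat runs slowly", lexicon)
print(tagged)  # The(ART) cat(NN) runs(VV) slowly(ADV)
```

A lookup of this kind only covers words that appear in the list, which is one reason the list needs to contain the most common words of the language.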
During the summer, I was involved in the creation of several language materials that are currently being used by the missionaries who are learning Marshallese at the Missionary Training Center. Being able to look at the corpus of Marshallese that we have so far allowed me to make significant contributions to the grammar materials we created.
Although we have accomplished a lot, we still have a lot of work to do. The next steps before we can publish the corpus are 1) training the tagger and tagging the texts that we have and 2) designing an interface through which to publish the corpus. Along the way, in addressing problems we had not foreseen, I have learned a lot about the importance of teamwork and innovation. (I marvel at the willingness of individuals to help in a good cause even though they may be unfamiliar with the type of work at hand.) I am thankful for the opportunity I have had to work on this project and for the opportunity to work with the individuals who have helped out along the way.
References
- This project could not have progressed as far as it has without the support of individuals who work on it without monetary compensation despite the pressures of work and other commitments. Dr. Heather Willson, who first suggested the project, has been a mentor and a great leader. David Fallon, an undergraduate student, has been a part of the team for a few months and has contributed time, effort and great ideas. Jared Meyers is working with us on finding efficient solutions for our problems despite the fact that the project itself is outside his area of research. Others who have contributed by providing examples and guidance are Dr. Mark Davies from the Department of Linguistics and English Language at BYU and Dr. Kevin Scannell from the Department of Mathematics and Computer Science at St. Louis University.