Creating hebrewCorpus: A Vast Online Resource for Modern Hebrew

Justin Parry and Dr. Dilworth Parkinson, Asian and Near Eastern Languages

Main Text

Hebrew in its present form, called Israeli by some scholars because of its unique modern characteristics, has existed for only about 120 years (Zuckermann 1). As a result, scholarship on Israeli Hebrew is in many ways still in its infancy. One area of research in particular needs development and better resources: corpus linguistics, the study of language through corpora. A corpus can be defined as a “systematic collection of speech or writing in a language” (Matthews 78). Corpora are particularly valuable because they allow researchers and language learners to search for particular words and constructions and see them in context. Although a limited number of searchable Hebrew corpora are available, there is a clear need for a comprehensive tool to help scholars study Israeli Hebrew more fully.

Hebrew is especially difficult for natural language processing because its written form is highly ambiguous. Although some basic texts contain full voweling (nikkud), most do not include diacritics, so a given word can often be read in several ways. A recent study showed that Hebrew words have an average of 2.7 morphological analyses per word, compared to only 1.4 in English (Goldberg and Elhadad 2). Some Hebrew words have as many as 13 possible analyses (Carmel and Maarek 313). As a result, few tools exist that accurately tag Hebrew for part of speech.
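
To make this kind of ambiguity concrete, the following is a minimal Python sketch that lists several readings of two commonly cited unvoweled forms. The table, the glosses, and the helper name count_analyses are illustrative assumptions only; they are not drawn from hebrewCorpus or from the studies cited above, and the lists are not exhaustive.

```python
# Toy table of unvoweled forms and some of their possible readings.
# Illustrative and non-exhaustive; not data from hebrewCorpus.
ANALYSES = {
    "ספר": [
        ("sefer", "noun", "book"),
        ("sapar", "noun", "barber"),
        ("safar", "verb", "he counted"),
        ("siper", "verb", "he told"),
    ],
    "הרכבת": [
        ("ha-rakevet", "noun", "the train"),
        ("hirkavta", "verb", "you assembled"),
        ("harkavat", "noun (construct)", "the grafting of"),
    ],
}


def count_analyses(form):
    """Return how many readings the toy table lists for a written form."""
    return len(ANALYSES.get(form, []))


if __name__ == "__main__":
    for form, readings in ANALYSES.items():
        print(form, count_analyses(form))
        for transliteration, pos, gloss in readings:
            print("   ", transliteration, "|", pos, "|", gloss)
```

Without vowel marks, a tagger sees only the written form on the left and must choose among the readings on the right from context alone.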

A highly usable corpus for Israeli Hebrew now exists and is available for free online at http://hebrewcorpus.nmelrc.org. This corpus, called hebrewCorpus, was developed for the National Middle East Language Resource Center (NMELRC). It allows the user to search texts from several genres, including academic journals, newspapers, Wikipedia, movies, and fiction. Together, these texts total over 150 million words.

The idea for this project stemmed from an Arabic corpus that Dr. Dilworth Parkinson created (arabiCorpus). Dr. Shmuel Bolozky, a Hebrew linguist and chair of NMELRC, suggested that the same concept be applied to create a Hebrew corpus. I had the unique opportunity of developing hebrewCorpus in consultation with Dr. Parkinson and adding several texts to it. I am the current maintainer of the corpus. Since it has been placed online, it has been used by students, teachers, and scholars from all over the world.

The texts in hebrewCorpus are not tagged; instead, the program uses filters to try to predict parts of speech based on morphological structure. The program also supports regular expressions, which greatly enhance its searching capabilities. Detailed instructions and a tutorial for the corpus are available on the website.
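
The general idea of searching untagged text with a morphological filter and a regular expression can be sketched in Python as follows. This is not the hebrewCorpus implementation, which is not described in detail here. The sample sentences, the plural-suffix filter, and the function name concordance are all assumptions made for illustration, and the filter is deliberately crude (for example, it would also match present-tense plural verb forms).

```python
import re

# Toy sample sentences (illustrative only; not taken from hebrewCorpus).
SAMPLE = [
    "הילדים הקטנים קראו ספרים",
    "הספר הזה טוב מאוד",
    "היא כתבה מכתבים ארוכים",
]

# Hypothetical filter: whole words ending in the plural suffixes -im or -ot.
# A crude stand-in for a part-of-speech guess based on word shape.
PLURAL_FILTER = re.compile(r"\w+(?:ים|ות)")


def concordance(pattern, lines, width=2):
    """Return (left context, match, right context) for each whole-word hit."""
    hits = []
    for line in lines:
        tokens = line.split()
        for i, token in enumerate(tokens):
            # fullmatch requires the pattern to cover the entire token.
            if pattern.fullmatch(token):
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                hits.append((left, token, right))
    return hits


if __name__ == "__main__":
    for left, word, right in concordance(PLURAL_FILTER, SAMPLE):
        print(left, "[", word, "]", right)
```

The same function accepts any compiled pattern, so a literal word can be searched in context as well, for example concordance(re.compile("מאוד"), SAMPLE). Matching on word shape rather than on stored tags keeps the texts untouched, at the cost of false positives that the user must review by eye.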

I have used this resource a number of times in my own research and will continue to do so. Many of my findings have been included in academic papers, among them a study of Hebrew me’od (‘very’) before and after adjectives, a comparison of internet chats by second language learners and native speakers of Hebrew, and a comparison of Hebrew during its revival with Israeli Hebrew. The data gathered from this corpus have been crucial to these studies because they have supported my claims with real-world examples.

This corpus has helped me in other ways as well. As a second language learner of Hebrew, I have been unsure about certain constructions, and seeing them in Hebrew sources has given me confidence in their correct usage. Both this project and the ORCA grant have also opened doors for me as a beginning graduate student at the University of Texas at Austin and have given me an introduction to computer programming in the humanities. hebrewCorpus has also benefited the scholarly world in numerous ways. Among these, it has opened new avenues of inquiry and has proved to be a testing ground for linguistic phenomena and conjectures. In addition, it has garnered positive attention for the NMELRC and BYU.

This project is the next step toward creating a large multi-genre Israeli Hebrew corpus that is lemmatized (that is, with the different inflected forms of each word grouped together) and tagged for part of speech. It is hoped that such a corpus will eventually exist for Hebrew, comparable to the English corpora that already offer enhanced searching capabilities and usability. The materials developed in this project can then be adapted to fit an improved schema.

Bibliography

Carmel, David and Yoelle S. Maarek. “Morphological Disambiguation for Hebrew Search Systems.” Next Generation Information Technologies and Systems – NGITS (1999): 312-326. Print.
Goldberg, Yoav and Michael Elhadad. “Easy First Dependency Parsing of Modern Hebrew.” Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages (2010): 103-107. Print.
HebrewCorpus. National Middle East Language Resource Center, 19 October 2009. Web. 31 December 2010.
Matthews, P. H. The Concise Oxford Dictionary of Linguistics. New York: Oxford University Press, 1997. Print.
Parkinson, Dilworth. arabiCorpus. Brigham Young University. Web. 31 December 2010.
Zuckermann, Ghil’ad. “A New Vision for Israeli Hebrew.” Journal of Modern Jewish Studies 5.1 (2006): 57-71. Print.