Extraction of Genealogical Information from the Internet

Troy Walker and Dr. David Embley, Computer Science

Data extraction is a rapidly growing area of computer science. It focuses on pulling pertinent data out of large stores of knowledge such as databases or the Internet, which lets us use existing data in new ways. One application is genealogical research. Various commercial and non-profit groups make genealogical data available online, and hundreds of personal web pages contain family trees. I wanted to enable computers to extract information from these sources. BYU’s Data Extraction Group (DEG) has developed tools for extracting data from web pages in HTML format; these tools can be found at www.deg.byu.edu. I developed an ontology (a scheme for extracting and storing data) and related lexicons so that these tools could extract genealogical data.

My first task was to describe what information to extract for each person in the input and how those pieces of information relate to one another. Because the problem is complex, I kept the model simple. Every person must have a name. A person may also have a gender, when the source makes it possible to tell. Each person can have any number of recorded events, including birth, death, and marriage, and each event can contain a date and a location.
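The record structure described above can be sketched as a pair of small data classes. This is only an illustration in Python, not the DEG ontology format itself; the class and field names (`Person`, `Event`, `kind`) and the sample location are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Event:
    kind: str                       # e.g. "birth", "death", or "marriage"
    date: Optional[str] = None      # kept as raw text, since sources format dates differently
    location: Optional[str] = None  # optional: an event may lack a location


@dataclass
class Person:
    name: str                       # required: every person must have a name
    gender: Optional[str] = None    # filled in only when the source states it
    events: List[Event] = field(default_factory=list)  # any number of events


# Illustrative values only; they are not taken from the test page.
person = Person(name="John Doe")
person.events.append(Event(kind="birth", date="1885", location="Utah"))
```

Making the name the only required field mirrors the rule that a record without a name is not a person, while everything else is filled in as the source allows.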

The most difficult part is telling the computer what data to put in each of these fields. When we read web pages, we can easily tell that “John Doe” is probably a person’s name and that “Social Security Index” is probably not. To help the computer put the right items in the record, the DEG tools use a combination of lexicons and regular expressions. A lexicon is a list of words. The DEG already had lists of first names, last names, and months, and I compiled lists of states and countries to supplement them. Lexicons identify words accurately, but they can never be complete. No one can compile a list of every first name ever used, nor reasonably predict every abbreviation people might use for a state name. This is where regular expressions take over. A regular expression tells the computer what combinations of letters, numbers, and symbols to accept. For instance, a capital letter followed by any number of lower-case letters could be accepted as a city name. By combining lexicons and regular expressions, I built an ontology to describe genealogical data.
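A minimal sketch of how a lexicon and a regular expression can work together follows. It is written in Python for illustration and is not the DEG tools' own rule syntax; the tiny lexicons and the function names `looks_like_name` and `could_be_city` are hypothetical.

```python
import re

# Tiny illustrative lexicons; real lexicons hold thousands of entries.
FIRST_NAMES = {"john", "ezra", "erastus"}
LAST_NAMES = {"doe", "walker"}

# A capital letter followed by any number of lower-case letters.
CAPITALIZED_WORD = re.compile(r"[A-Z][a-z]*")


def looks_like_name(text: str) -> bool:
    """Accept text whose words are all capitalized, whose first word is a
    known first name, and whose last word is a known last name."""
    words = text.split()
    if len(words) < 2:
        return False
    if not all(CAPITALIZED_WORD.fullmatch(word) for word in words):
        return False
    return words[0].lower() in FIRST_NAMES and words[-1].lower() in LAST_NAMES


def could_be_city(word: str) -> bool:
    """Accept any single capitalized word as a possible city name."""
    return CAPITALIZED_WORD.fullmatch(word) is not None


print(looks_like_name("John Doe"))               # True
print(looks_like_name("Social Security Index"))  # False
```

The lexicon check is precise but limited to the names it lists. The pattern check is permissive: `could_be_city("Social")` also returns `True`, which suggests how a pattern used alone can admit extraneous matches.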

Building this ontology took hours of trial and error. Once I had the framework in mind, I built it up piece by piece, checking at each step that it worked as I expected. It usually did not. Different sources express dates, names, and locations in different ways. Some people list names with the surname first. Locations in the USA are normally recorded without the country. Sometimes an unknown item such as the county is left blank, leaving two commas in a row. I took many of these possibilities into account, and handling these contingencies took the bulk of my time on this research.
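Two of the contingencies mentioned above, surname-first names and blank location fields, can be illustrated with a short Python sketch. The function names `normalize_name` and `split_location` are hypothetical, and the ontology itself handled these cases through its own rules rather than through code like this.

```python
from typing import List, Optional


def normalize_name(raw: str) -> str:
    """Turn 'Surname, Given Names' into 'Given Names Surname'.
    Names without a comma only have their whitespace tidied."""
    if "," in raw:
        surname, given = (part.strip() for part in raw.split(",", 1))
        if surname and given:
            return f"{given} {surname}"
    return " ".join(raw.split())


def split_location(raw: str) -> List[Optional[str]]:
    """Split a comma-separated place into its parts, keeping a blank
    field as None so the remaining parts stay in position."""
    return [part.strip() or None for part in raw.split(",")]


print(normalize_name("Walker, Ezra Erastus"))   # Ezra Erastus Walker
print(split_location("Salt Lake City,, Utah"))  # ['Salt Lake City', None, 'Utah']
```

Keeping the blank field as a placeholder, rather than dropping it, preserves the position of the parts that follow it.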

I tested my final ontology on a page from www.familysearch.org. I searched for people named Ezra Erastus Walker born within five years of 1885. This provided a variety of records without having so many that the tools would take too long to run. There were fifteen individuals listed.

With this as input, my ontology found all fifteen. It also found ten extraneous people. One was the name from the title of my search. The others had names such as “North America” and “Social Security.” For the people it correctly found, it got their complete names in all but one case. This was when the entry listed a nickname for the person. My ontology correctly identified genders where they were given. All events were found along with correct dates and locations in all but one instance. In that case, the date was correct, but the location was split in two.
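The person-level counts above can be summarized with the standard precision and recall measures. These measures are my framing of the reported counts rather than figures stated in the results, and the function name `precision_recall` is hypothetical.

```python
def precision_recall(correct: int, extraneous: int, missed: int):
    """Return (precision, recall) from counts of extracted records."""
    extracted = correct + extraneous  # everything the ontology returned
    relevant = correct + missed       # everything it should have returned
    return correct / extracted, correct / relevant


# Fifteen people found, ten extraneous matches, none missed.
precision, recall = precision_recall(correct=15, extraneous=10, missed=0)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.60 recall=1.00
```

In these terms the ontology recovered every listed person, but only fifteen of its twenty-five matches were real people.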

There are many possibilities for future work. The biggest problem with my ontology is that it matched so many extraneous names, although this is offset by the fact that it correctly extracted every person in the input. Additional work could exclude these extra names and account for special cases such as nicknames. There are also many ways to extend my model. The number of events of each type could be restricted; for instance, I could allow only one birth event per person. Support for relationships such as mother, father, and child would also be an interesting challenge. This project has piqued my interest in research, and I will continue to investigate this field as I begin to pursue a master’s degree.
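The restriction on event counts suggested above could take a form like the following Python sketch. It is one possible approach, not part of the existing ontology; the `MAX_EVENTS` table and the function name `excess_events` are hypothetical, and only the single-birth limit comes from the text.

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical limits: one birth per person. Other limits could be added.
MAX_EVENTS = {"birth": 1}


def excess_events(event_kinds: Iterable[str]) -> Dict[str, int]:
    """Return each restricted event kind that occurs more often than
    allowed, mapped to the number of times it occurred."""
    counts = Counter(event_kinds)
    return {
        kind: count
        for kind, count in counts.items()
        if kind in MAX_EVENTS and count > MAX_EVENTS[kind]
    }


print(excess_events(["birth", "birth", "marriage", "marriage"]))  # {'birth': 2}
```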

Brigham Young University

Journal of Undergraduate Research
