GedTools Computerized Chinese Genealogy Entry Automation

Nicholas Vrvilo and Dr. Dah-Jye Lee, Electrical and Computer Engineering

Doing family history in Chinese presents many unique challenges. The Chinese writing system is not readable by the average person, which means that only the small number of individuals who both hold temple recommends and are familiar with Chinese characters would be able to perform the family history work for all of the Chinese people. To avoid this problem, the Church requires Romanized[1] versions of all Chinese names to be included in all family history records submitted for temple work. This poses a problem because most Chinese genealogy workers in Taiwan (where a great deal of Chinese genealogy work is done) are not familiar with this Romanization system. Not only must they enter all of their ancestors’ names twice, but they also have to look up the Romanized version of each Chinese character.

A second difficulty presents itself in the way that family histories are traditionally kept in China. The Chinese people greatly respect their ancestors, so most families have a record going back many generations. Although these records include an individual’s name and perhaps place of birth, the date information for births and deaths is often incomplete. Since the Church provides standard methods for estimating[2] these missing dates, filling in a few gaps typically is not that difficult. However, many families in Taiwan have returned to their ancestral villages in China and obtained several thousand years of their genealogy from their family shrines, some stretching all the way back to the Yellow Emperor.[3] Since records of this size tend to have huge gaps with no date information at all, the process of estimating those dates by hand is long, tedious and error-prone. These tasks can take weeks for large records.

Although they are not the only problems associated with doing family history work in Chinese, the two mentioned above consume a great deal of time for family history workers in Taiwan. The purpose of this research project was to find a way to have the computer automatically handle as many of these tedious tasks as possible, thus virtually removing this great hindrance to family history work. I discovered that Personal Ancestral File (PAF) has the option to import and export files in GEDCOM[4] format, which is the standard for sharing genealogical data between different programs. I decided to work with this format since it is plain text, making it easy to parse and modify. In addition, files in this format can be uploaded to the new FamilySearch.org, or they can be directly imported into TempleReady which, although no longer used in the USA, is still widely used in Taiwan due to the lack of support for Chinese on FamilySearch.org. With this process automated, a file that might have taken weeks to process by hand can be completed in a second.
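
To illustrate why a plain-text format is convenient to work with, the following C++ sketch splits a single GEDCOM line into its parts (a level number, an optional cross-reference id, a tag, and an optional value). It is not taken from the GedTools source; the names `GedcomLine` and `parseGedcomLine` are hypothetical, and the sketch ignores character-set handling and continuation lines.

```cpp
#include <sstream>
#include <string>

// Hypothetical record type for one GEDCOM line: "level [xref] tag [value]".
struct GedcomLine {
    int level;
    std::string xref;   // cross-reference id such as "@I1@"; empty if absent
    std::string tag;    // e.g. "INDI", "NAME", "DATE"
    std::string value;  // remainder of the line; may be empty
};

// Parses one line into `out`; returns false if the line is malformed.
bool parseGedcomLine(const std::string& raw, GedcomLine& out) {
    std::istringstream in(raw);
    GedcomLine line;
    line.level = 0;
    if (!(in >> line.level) || line.level < 0) return false;

    std::string token;
    if (!(in >> token)) return false;
    // A token wrapped in '@' signs before the tag is the record's own id.
    if (token.size() >= 3 && token[0] == '@' &&
        token[token.size() - 1] == '@') {
        line.xref = token;
        if (!(in >> line.tag)) return false;
    } else {
        line.tag = token;
    }

    // Everything after the tag is the value; drop a trailing CR (CRLF files).
    std::getline(in >> std::ws, line.value);
    if (!line.value.empty() && line.value[line.value.size() - 1] == '\r') {
        line.value.erase(line.value.size() - 1);
    }
    out = line;
    return true;
}
```

Calling `parseGedcomLine` on each line of an exported file yields a flat list of records that later steps can group by level.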

While researching methods of automatically generating Romanization data, Allen Lee with Family Search support in Taipei referred me to the Unihan Project[5] as a possible source. I was able to adapt much of the character data provided by their database, providing a reference for tens of thousands of characters’ Romanizations. I then applied this data to several thousand names that had already been Romanized by family history workers for comparison. This comparison revealed that the database needed to be biased toward names (some characters are pronounced differently when being used as a name[6]), and that bias was achieved through some trial and error. After the biasing was done, I found that a great deal of the remaining inconsistencies were actually due to human errors in the original versions of the records and that, for the most part, the automatic Romanization was actually more accurate.
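
The name bias described above can be sketched as a two-level lookup: a small table of name-specific readings is consulted first, and a general table (for instance, one derived from the Unihan readings data) is used as the fallback. This is only an illustration under assumptions, not the GedTools implementation: `PinyinTable` and `romanize` are hypothetical names, the readings are written without tone marks, and unknown characters are marked with `?`.

```cpp
#include <map>
#include <string>

// Hypothetical table type: Unicode code point -> toneless Pinyin reading.
typedef std::map<char32_t, std::string> PinyinTable;

// Romanizes `name` one character at a time. Readings in `nameOverrides`
// take priority over `general`, which is how the name bias is modeled here.
std::string romanize(const std::u32string& name,
                     const PinyinTable& general,
                     const PinyinTable& nameOverrides) {
    std::string out;
    for (char32_t ch : name) {
        const std::string* reading = nullptr;
        PinyinTable::const_iterator it = nameOverrides.find(ch);
        if (it != nameOverrides.end()) {
            reading = &it->second;
        } else {
            it = general.find(ch);
            if (it != general.end()) reading = &it->second;
        }
        if (!out.empty()) out += ' ';
        // Assumption: flag characters missing from both tables with "?".
        out += reading ? *reading : std::string("?");
    }
    return out;
}
```

With U+66FE (曾) mapped to "Ceng" in the general table and to "Zeng" in the override table, a name beginning with that character is Romanized with "Zeng", matching the family-name reading.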

The initial problem of appending Romanized names to all entries in a GEDCOM file was relatively straightforward, since each name can be read, converted, and written without depending on any other entry in the file. However, the task of estimating dates is far more complicated and presented many more difficulties. The first task was finding a way to build a family tree structure from the data represented in the GEDCOM file. Variable numbers of children, marriages and even parents made this more difficult, but at last I settled on a structure that I thought contained all of the needed information in an accessible form.
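
One possible shape for such a tree structure is sketched below, assuming the file has already been split into lines. It relies on the standard GEDCOM record tags (`INDI`, `FAM`, `FAMS`, `FAMC`, `HUSB`, `WIFE`, `CHIL`); the types `Line`, `Person`, `Family`, and `Tree` and the functions `buildTree` and `parentsOf` are hypothetical and are not the structure GedTools actually uses.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical pre-parsed GEDCOM line.
struct Line { int level; std::string xref, tag, value; };

struct Person {
    std::string name;
    std::vector<std::string> familiesAsSpouse;  // FAMS: zero or more marriages
    std::vector<std::string> familiesAsChild;   // FAMC: zero or more parent sets
};

struct Family {
    std::string husband, wife;                  // ids; empty when unknown
    std::vector<std::string> children;          // CHIL: any number of children
};

struct Tree {
    std::map<std::string, Person> people;       // keyed by id, e.g. "@I1@"
    std::map<std::string, Family> families;     // keyed by id, e.g. "@F1@"
};

enum class Kind { None, Indi, Fam };

// Groups level-1 lines under the level-0 record that precedes them.
Tree buildTree(const std::vector<Line>& lines) {
    Tree tree;
    std::string current;
    Kind kind = Kind::None;
    for (const Line& l : lines) {
        if (l.level == 0) {
            current = l.xref;
            kind = (l.tag == "INDI") ? Kind::Indi
                 : (l.tag == "FAM")  ? Kind::Fam : Kind::None;
            if (kind == Kind::Indi) tree.people[current];
            if (kind == Kind::Fam) tree.families[current];
        } else if (l.level == 1 && kind == Kind::Indi) {
            Person& p = tree.people[current];
            if (l.tag == "NAME") p.name = l.value;
            else if (l.tag == "FAMS") p.familiesAsSpouse.push_back(l.value);
            else if (l.tag == "FAMC") p.familiesAsChild.push_back(l.value);
        } else if (l.level == 1 && kind == Kind::Fam) {
            Family& f = tree.families[current];
            if (l.tag == "HUSB") f.husband = l.value;
            else if (l.tag == "WIFE") f.wife = l.value;
            else if (l.tag == "CHIL") f.children.push_back(l.value);
        }
    }
    return tree;
}

// Collects the ids of every recorded parent of `personId`.
std::vector<std::string> parentsOf(const Tree& tree,
                                   const std::string& personId) {
    std::vector<std::string> result;
    std::map<std::string, Person>::const_iterator p = tree.people.find(personId);
    if (p == tree.people.end()) return result;
    for (const std::string& famId : p->second.familiesAsChild) {
        std::map<std::string, Family>::const_iterator f = tree.families.find(famId);
        if (f == tree.families.end()) continue;
        if (!f->second.husband.empty()) result.push_back(f->second.husband);
        if (!f->second.wife.empty()) result.push_back(f->second.wife);
    }
    return result;
}
```

Storing ids in vectors rather than fixed fields is one way to accommodate the variable numbers of children, marriages and parents mentioned above.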

One of the greatest difficulties with this task was dealing with the GEDCOM files exported from PAF. I discovered that PAF does not follow the GEDCOM 5.5 standard exactly, which caused several problems during the design, implementation and testing of my algorithms. However, the most complicated part of this project by far was the design and testing of the date estimation algorithm. The complexity results from the interdependent nature of all dates in the family tree, and from the possibility of inaccuracies, conflicts and other errors.
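
The interdependence of dates can be illustrated with a deliberately simplified sketch. It repeatedly propagates known birth years between parents and children until nothing changes, and separately reports children recorded as born no later than a parent. The Church's actual estimation rules are not reproduced here: the fixed 25-year generation gap, the `Person` fields, and the functions `estimateBirthYears` and `findConflicts` are all hypothetical placeholders.

```cpp
#include <map>
#include <string>
#include <vector>

struct Person {
    bool hasYear = false;              // true once a birth year is known or estimated
    bool estimated = false;            // true if the year was filled in by the sketch
    int birthYear = 0;                 // meaningful only when hasYear is true
    std::vector<std::string> parents;  // ids of recorded parents
};

// Assumption for illustration only; not the Church's documented rule.
const int kAssumedGenerationGap = 25;

// Fills in missing birth years from a parent or child whose year is known.
// Returns the number of years estimated. Each pass can only fill empty
// slots, so the loop terminates.
int estimateBirthYears(std::map<std::string, Person>& people) {
    int filled = 0;
    bool changed = true;
    while (changed) {
        changed = false;
        for (auto& entry : people) {
            Person& child = entry.second;
            for (const std::string& pid : child.parents) {
                std::map<std::string, Person>::iterator it = people.find(pid);
                if (it == people.end()) continue;
                Person& parent = it->second;
                if (!child.hasYear && parent.hasYear) {
                    child.birthYear = parent.birthYear + kAssumedGenerationGap;
                    child.hasYear = child.estimated = true;
                    ++filled; changed = true;
                } else if (child.hasYear && !parent.hasYear) {
                    parent.birthYear = child.birthYear - kAssumedGenerationGap;
                    parent.hasYear = parent.estimated = true;
                    ++filled; changed = true;
                }
            }
        }
    }
    return filled;
}

// Returns ids of people whose birth year is not later than a parent's.
std::vector<std::string> findConflicts(const std::map<std::string, Person>& people) {
    std::vector<std::string> bad;
    for (const auto& entry : people) {
        const Person& child = entry.second;
        if (!child.hasYear) continue;
        for (const std::string& pid : child.parents) {
            std::map<std::string, Person>::const_iterator it = people.find(pid);
            if (it != people.end() && it->second.hasYear &&
                it->second.birthYear >= child.birthYear) {
                bad.push_back(entry.first);
                break;
            }
        }
    }
    return bad;
}
```

Even this toy version shows the difficulty: a person with two parents simply takes the estimate from whichever parent is visited first, whereas a realistic algorithm would have to reconcile competing estimates from spouses, siblings and several generations at once.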

I was originally planning on using the Java programming language for this project, and I even completed an initial version of the automatic Romanization tool[7] using it. However, after some initial trial runs by the family history workers in Kaohsiung, I found that there were so many problems stemming from the required Java Virtual Machine that they were seriously hindering the widespread use of this tool. I decided that it would be worth restarting the project from scratch using C++ and providing an automatic installer program for the Windows version. This approach eliminated virtually all problems in distribution, and allowed for widespread use of GedTools.

After an initial design based on the Church’s documentation for estimating dates, the process was mainly trial and error: making slight alterations and then comparing the results to estimations already completed by hand. It took several months, including a lot of help and feedback from family history workers in Taiwan with the testing, but I finally completed a satisfactory working product. GedTools is available for download at ouuuuch.phoenixteam.org along with the source code. The website includes an instruction page in Traditional Chinese, and the user interface of the program has been fully translated into Traditional Chinese. The program is in wide use in Family History Centers throughout Taiwan, saving them hundreds of hours, and I hope it will spread throughout the rest of the Asia area as well.

References

[1] 漢語拼音 (Hanyu Pinyin). See Chapter 2 of the GEDCOM Specification version 5.5 <http://homepages.rootsweb.ancestry.com/~pmcbride/gedcom/55gcch2.htm>
[2] “填寫家庭資料表方法.” 香港家庭歷史服務中心編纂2000年7月. (“Filling out the Pedigree Chart.” Hong Kong Family History Service Center, comp. July 2007.)
[3] Adam C. Olson, “Turning Hearts in a Land of Temples,” Ensign, Oct 2007, 64–69.
[4] Genealogical Data Communications file format <http://www.familysearch.org/eng/Home/FAQ/faq_gedcom.asp>
[5] “Unihan Database.” Unicode.org. The Unicode Consortium. <http://unicode.org/charts/unihan.html>
[6] For example, 曾 is pronounced Céng in common situations, but Zēng when it is a family name.
[7] Pin-Ming. <http://ouuuuch.phoenixteam.org/released/pinMing/>

Brigham Young University

Journal of Undergraduate Research
