Logan Kearsley and Dr. Alan Melby, Linguistics and English Language
Lexicography (the compilation of dictionaries and the study of relations between lexical items) and terminology (the study of technical terms used in specific fields or contexts) are conceptually very similar: both are grounded in the study and categorization of lexical items in human language. Dictionaries and termbases are both essential tools for second-language learners and professional translators, and they contain much the same kind of information on the definitions, grammatical characteristics, and usage of words. Until now, however, tools for creating and curating dictionaries and termbases have existed largely in isolation from each other, with no simple way of sharing data between lexicography and terminology projects, partly because the two fields organize their data according to significantly different models.
Two different ISO standards exist for representing terminological and lexicographical data in a customizable, language-agnostic way: the Terminological Markup Framework (TMF, an abstract model with a concrete representation provided by the TermBase eXchange format, TBX) [1] and the Lexical Markup Framework (LMF) [2]. In this project we created a new data model capable of subsuming the features of both TMF and LMF, sharing as much information as possible between the lexicographical and terminological halves, and exporting data for viewing from either perspective. Additionally, we created reference software that implements this data model and a standard API [3] for manipulating it.
Methods
A previously developed prototype system for alternating between lexicographical and terminological views of lexical information was used as the basis for this project. The data model from this previous project was formalized as a UML diagram and revised several times after consultation with professors involved in lexicography projects and with industry experts, including Kara Warburton, chair of ISO TC37, and Hanne Smaadahl, chair of TerminOrgs [4]. Based on this revised data model, we then designed a REST [5]-inspired API for manipulating lexicons, termbases, and lexical entries as HTTP resources. Sample data files for English and Russian lexicons were created by hand to verify the completeness and usability of the data model and to test the API.
Reference software implementing the database management and the API was written in Python using the Django web framework. Django's built-in ORM [6] system was used to transform our abstract data model into a traditional relational database format. Collaborative development was done using the Git version control system, and the source code for this reference implementation has been made publicly available on GitHub [7], where it remains under development. Code was deployed to a shared server hosting other terminology-related projects at lexterm.gevterm.net for public testing.
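To illustrate what treating lexicons, termbases, and lexical entries as HTTP resources can look like, the following is a minimal in-memory sketch rather than the project's actual API. The URL layout (`/lexicons/<name>`, `/termbases/<name>`, `/<kind>/<name>/entries/<id>`), the `ResourceStore` class, and the choice of status codes are all assumptions made for this example; no web server or Django is involved.

```python
import re

# Hypothetical URL layout (an assumption, not the LexTerm API):
#   /lexicons/<name>, /termbases/<name>, /<kind>/<name>/entries/<id>
_PATH = re.compile(r"/(lexicons|termbases)/([^/]+)(?:/entries/([^/]+))?")


class ResourceStore:
    """In-memory stand-in for a database exposed as HTTP-style resources."""

    def __init__(self):
        self._collections = {"lexicons": {}, "termbases": {}}

    def handle(self, method, path, body=None):
        """Dispatch an HTTP-like request and return (status_code, payload)."""
        match = _PATH.fullmatch(path)
        if match is None:
            return 404, None
        kind, name, entry_id = match.groups()
        containers = self._collections[kind]
        if entry_id is None:
            return self._handle_container(method, containers, name)
        if name not in containers:
            return 404, None
        return self._handle_entry(method, containers[name], entry_id, body)

    @staticmethod
    def _handle_container(method, containers, name):
        # PUT creates the lexicon or termbase if it does not exist yet.
        if method == "PUT":
            created = name not in containers
            containers.setdefault(name, {})
            return (201 if created else 200), {"name": name}
        if name not in containers:
            return 404, None
        if method == "GET":
            return 200, {"name": name, "entries": sorted(containers[name])}
        if method == "DELETE":
            del containers[name]
            return 204, None
        return 405, None

    @staticmethod
    def _handle_entry(method, entries, entry_id, body):
        # PUT stores or replaces a single lexical entry.
        if method == "PUT":
            created = entry_id not in entries
            entries[entry_id] = body
            return (201 if created else 200), body
        if entry_id not in entries:
            return 404, None
        if method == "GET":
            return 200, entries[entry_id]
        if method == "DELETE":
            del entries[entry_id]
            return 204, None
        return 405, None
```

Under these assumptions, `ResourceStore().handle("PUT", "/lexicons/eng")` creates a lexicon, and entries are then addressed beneath it, so every object in the database has exactly one path that any client can read, replace, or delete.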
Results
We succeeded in designing a new data model that merges TMF and LMF data, and we demonstrated a trivial mapping from it to either standard format. Our final data model, however, remained split into two distinct sections along a different boundary: one section containing the actual lexical data, and a second containing language-specific metadata. This is equivalent to the division between data files and XCS configuration files in the TBX standard, and it allows our system to remain completely language-agnostic: users can configure it to properly format data for any new language features (such as additional parts of speech) as needed.
While the complete LexTerm system is not inherently an internet-based application, and its server and client components can be bundled together locally on a single machine, the decision to design an HTTP-based API for the database had several positive consequences. In addition to allowing multiple clients to work on the same database simultaneously over the internet, this architecture made it possible to remove the complexity of TBX and LMF export from the database management code and instead to produce separate data export and import programs that attach to the database like any other client. This required some minor modifications to our original API design, but we were able to demonstrate that all necessary information can be extracted and used to produce TBX exports. This architecture also makes it trivial to add support for additional import and export formats (such as the SIL Toolbox MDF format [8] or printable PDF) without modifying existing code.
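The split between lexical data and language-specific metadata, and the idea of viewing the same data from either perspective, can be sketched as follows. This is an illustrative toy model, not the project's actual schema: the class names `LanguageConfig` and `Sense`, the field names, the concept identifiers, and the sample data are all assumptions. It relies only on the general facts that lexicographical resources are organized around words and terminological resources around concepts.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageConfig:
    """Language-specific metadata (hypothetical): what is valid for a language."""
    code: str
    parts_of_speech: frozenset


@dataclass(frozen=True)
class Sense:
    """Lexical data (hypothetical): one word form linked to one concept."""
    lemma: str
    language: str
    part_of_speech: str
    concept_id: str
    definition: str


def validate(sense, configs):
    """Check a sense against the metadata configured for its language."""
    config = configs.get(sense.language)
    if config is None:
        raise ValueError(f"no configuration for language {sense.language!r}")
    if sense.part_of_speech not in config.parts_of_speech:
        raise ValueError(f"unknown part of speech {sense.part_of_speech!r}")


def lexicographical_view(senses):
    """Group by word: each (language, lemma) lists its definitions."""
    view = defaultdict(list)
    for sense in senses:
        view[(sense.language, sense.lemma)].append(sense.definition)
    return dict(view)


def terminological_view(senses):
    """Group by concept: each concept lists its terms in every language."""
    view = defaultdict(list)
    for sense in senses:
        view[sense.concept_id].append((sense.language, sense.lemma))
    return dict(view)


# Illustrative sample data only.
SAMPLE_CONFIGS = {
    "en": LanguageConfig("en", frozenset({"noun", "verb"})),
    "ru": LanguageConfig("ru", frozenset({"noun", "verb"})),
}
SAMPLE_SENSES = [
    Sense("bank", "en", "noun", "C1", "financial institution"),
    Sense("bank", "en", "noun", "C2", "land beside a river"),
    Sense("банк", "ru", "noun", "C1", "финансовое учреждение"),
]
```

In this sketch, supporting a new part of speech means changing only a `LanguageConfig`, while the stored senses and both view functions stay untouched.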
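An export program that attaches to the database like any other client might be sketched as below. This is a simplified illustration under stated assumptions, not the project's exporter: `export_tbx_body` and its `fetch_concepts` parameter are hypothetical, the callable stands in for whatever HTTP requests a real client would make, and the output is only the `body` fragment of a TBX document (`termEntry`, `langSet`, `tig`, `term`), without the header and without validation against the standard.

```python
import xml.etree.ElementTree as ET

# ElementTree serializes this qualified name as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def export_tbx_body(fetch_concepts):
    """Build a simplified TBX-style <body> fragment as a string.

    fetch_concepts is a hypothetical callable standing in for API requests;
    it must return {concept_id: [(language_code, term), ...]}.
    """
    body = ET.Element("body")
    for concept_id, terms in sorted(fetch_concepts().items()):
        entry = ET.SubElement(body, "termEntry", {"id": concept_id})
        # Collect the terms of this concept per language.
        by_language = {}
        for language, term in terms:
            by_language.setdefault(language, []).append(term)
        for language in sorted(by_language):
            lang_set = ET.SubElement(entry, "langSet", {XML_LANG: language})
            for term in by_language[language]:
                tig = ET.SubElement(lang_set, "tig")
                ET.SubElement(tig, "term").text = term
    return ET.tostring(body, encoding="unicode")
```

Because the exporter only needs a function that returns data, a second format could be supported by writing another such function against the same input, leaving the database code unchanged.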
Discussion
Initially we had planned to implement a much more full-featured terminology management system that could be used in a professional setting. However, implementing all of the user permissions management, revision control, and security features required for a professional product turned out to be a much larger software engineering task than originally expected. Rather than spending time on these peripheral features, we focused on improving the core data management system. We intend to integrate this into a larger and more robust system as part of future work in collaboration with other open-source lexicography and terminology projects. While not yet suited for large-scale organizational use, our system is usable for transferring information between the formats used in lexicographical and terminological projects.
References
1. ISO 30042:2008 (2008). Terminology and other language and content resources – Computer applications in terminology – TermBase eXchange Format Specification (TBX). Geneva: International Organization for Standardization.
2. ISO 24613:2008 (2008). Lexical resource management – Lexical markup framework (LMF). Geneva: International Organization for Standardization.
3. Application Programming Interface: a specification of the actions that a piece of software can perform and the methods for triggering those actions from other software, as distinct from the means by which those actions are implemented.
4. http://www.terminorgs.net/
5. Representational State Transfer, http://www.ics.uci.edu/~fielding/pubs/dissertation/rest_arch_style.htm
6. Object-Relational Mapping
7. https://github.com/LexTerm
8. http://www-01.sil.org/computing/shoebox/MDF.html