Andrew Tate and Dr. Christopher Oscarson, Department of Comparative Arts and Letters
I wanted to understand how Norwegians’ perceptions of their own country, particularly with regard to ecology, changed between the years 1868 and 1921. The Norwegian Tourist Association (Den Norske Turistforening) has been publishing yearbooks since 1868. These publications are packed with articles and poems by various contributors, mostly focused in some way on enjoying nature, along with reports about the organization itself, tourist maps, rules for cabin use, etc.
Dr. Oscarson and I chose these publications for a couple of reasons. Since they are published yearly they carry a dimension of time, which I hoped would allow me to track changes through the years. Since they are about nature, I assumed they would bear upon ecology and the environment.
My methodology was first to gather a corpus of digitized yearbooks and extract the individual articles along with any relevant metadata (such as author name or geographical area). Gathering and preparing the corpus included using a scripting language (Python) to break up the journals by year and by article. Separating the text into distinct sections would let me select only the sections relevant to a given analysis.
Then, I intended to analyze them using a process called topic modelling. During topic modelling, a statistical algorithm is applied to a corpus of text and groups words that commonly occur together into “topics” (e.g. tree, river, deer, waterfall, and fern). These “topics” can be measured and traced through the entire corpus, allowing a researcher to gain insight into how themes develop in large bodies of literature at a macro level. This method contrasts with the traditional, scholarly deep-reading approach, which certainly has its place but can rarely produce better than anecdotal evidence in support of research claims about writing at this scale.
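To make the idea of grouping co-occurring words concrete, here is a minimal sketch of one common topic modelling algorithm, latent Dirichlet allocation (LDA) fitted with collapsed Gibbs sampling. The choice of LDA, the function names `lda_gibbs` and `top_words`, and the parameter values are all assumptions made for illustration; this report does not specify which algorithm or software was used, and a real project would normally rely on an established tool rather than this toy version.

```python
import random
from collections import Counter

def lda_gibbs(docs, n_topics=2, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative only).

    docs is a list of documents, each a list of word tokens.
    Returns (doc_topic, topic_word) count tables.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]        # tokens per topic, per document
    topic_word = [Counter() for _ in range(n_topics)]  # word counts per topic
    topic_total = [0] * n_topics                       # total tokens per topic
    assignments = []

    # Start by assigning every token to a random topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    # Repeatedly resample each token's topic given all other assignments.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word

def top_words(topic_word, n=5):
    """Return the n most frequent words assigned to each topic."""
    return [[w for w, c in counts.most_common(n) if c > 0] for counts in topic_word]

# Invented example documents, already tokenized.
docs = [
    ["tree", "river", "fern", "waterfall"],
    ["river", "deer", "tree", "fern"],
    ["cabin", "map", "rules", "cabin"],
    ["map", "rules", "cabin", "report"],
]
doc_topic, topic_word = lda_gibbs(docs, n_topics=2)
print(top_words(topic_word, n=3))
```

Because the sampler is random, the topics found on such a tiny example vary with the seed. With a real corpus, the per-document topic counts in `doc_topic` could be averaged by yearbook year to trace a theme over time.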
There are plenty of other procedures that I had in mind to undertake, such as analyzing the frequency of place names. This could have clued me in to trends of shifting interest. For example, place name mentions might shift markedly from northern Norway in the early yearbooks to the south as time goes on. This may be a great area of future research.
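A place-name frequency analysis of this kind could be sketched as follows. This is only an illustration: the function name `place_counts_by_year`, the sample texts, and the list of place names are hypothetical and are not drawn from the yearbooks.

```python
import re
from collections import Counter

def place_counts_by_year(texts_by_year, places):
    """Count whole-word, case-insensitive mentions of each place per year."""
    patterns = {
        place: re.compile(r"\b" + re.escape(place) + r"\b", re.IGNORECASE)
        for place in places
    }
    return {
        year: Counter({p: len(pat.findall(text)) for p, pat in patterns.items()})
        for year, text in texts_by_year.items()
    }

# Invented sample data, keyed by yearbook year.
texts_by_year = {
    1870: "Fra Tromsø til Bergen. Tromsø ligger langt mod nord.",
    1920: "Bergen og fjeldene ved Bergen.",
}
counts = place_counts_by_year(texts_by_year, ["Tromsø", "Bergen"])
for year in sorted(counts):
    print(year, dict(counts[year]))
```

Exact matching like this misses inflected forms (a genitive such as "Bergens") and is sensitive to OCR errors, so a fuller analysis would likely need a gazetteer and some spelling normalization.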
While I was not able to finish the topic modelling and move forward to drawing literary conclusions, I invested great time and effort into the first part of my methodology and at last accomplished it.
The trouble was that putting together and preparing the corpus proved a far more involved project than expected. I was able to obtain a few of the digitized texts in .txt format through correspondence with the Oslo National Library, and I found two of them at Project Runeberg (http://runeberg.org/). However, the vast majority of the forty-nine tourist journals were much more difficult to track down and convert to the correct format.
Eventually, after a trip to the U of U to obtain PDFs for the remaining journals from HathiTrust (https://www.hathitrust.org/), I had all of them (in some form or another) except one. I requested a physical copy of the last journal through interlibrary loan from the University of California, Berkeley and digitized it myself. Then, the PDFs needed to be converted into .txt files in preparation for the topic modelling. I sent them to BYU’s archives department, where they were eventually converted using optical character recognition (OCR).
Finally, I undertook to separate the yearbook .txt files into chunks, one chunk for each article. I did this by painstakingly inserting a special marker character between each pair of articles and writing a program to read the markers and chunk the files, also extracting the author name and article title. At last, the chunking was done. My main accomplishment turned out to be preparing the corpus itself, during which I learned a lot. This corpus will allow my mentor, Dr. Oscarson, to run the analysis and draw literary conclusions in the future, augmenting his work on the Swedish tourist journals and the works of Selma Lagerlöf.
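The chunking step can be illustrated with a short sketch. It is not the program actually written for this project: the marker character `§`, the `Article` class, and the assumption that each article's first line is its title and its second line is its author are all hypothetical choices made for the example.

```python
from dataclasses import dataclass

MARKER = "§"  # hypothetical marker; the character actually used is not stated here

@dataclass
class Article:
    title: str
    author: str
    body: str

def chunk_yearbook(text, marker=MARKER):
    """Split a yearbook's text into articles at each marker character.

    Assumes (for illustration) that the first line of each chunk is the
    title and the second line is the author.
    """
    articles = []
    for chunk in text.split(marker):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip empty chunks, e.g. from a trailing marker
        lines = chunk.splitlines()
        title = lines[0].strip()
        author = lines[1].strip() if len(lines) > 1 else ""
        body = "\n".join(lines[2:]).strip()
        articles.append(Article(title, author, body))
    return articles

# Invented sample text with one marker between two articles.
sample = (
    "En tur i fjeldet\nA. Forfatter\nFørste artikkel.\n"
    "§\n"
    "Ved fossen\nB. Forfatter\nAnden artikkel,\nmed to linjer.\n"
)
for article in chunk_yearbook(sample):
    print(article.title, "|", article.author)
```

Keeping the title and author alongside each body means later analyses can filter or group articles by that metadata without parsing the files again.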
I worked as part of a team of ORCA recipients who became the Nordic Digital Humanities Lab (http://nordicdh.org). Together we accomplished more than any of us could have alone. For example, Emily Livingston and Erin Modersitzki collaborated on the process of identifying topics in the Selma Lagerlöf corpus, until I was able to run topic modelling on the Swedish Tourist Association’s journals for Erin so that she could begin the process of naming the topics for that project.
In addition, I was able to assist in the testing of a new topic modelling program which another member of the lab had developed, and provide valuable technical insight throughout the year.
I have very much appreciated the generosity of the donors in providing this magnificent experience. It has allowed me to explore the science and techniques of natural language processing and has given me the opportunity to lay the groundwork for future research in the digital analysis of the Norwegian tourist journals.