Christ Rytting and Dr. Kerk Phillips, Economics Department
Drivendata.org says “In the year 2000, the member states of the United Nations agreed to a set of goals to measure the progress of global development. The aim of these goals was to increase standards of living around the world by emphasizing human capital, infrastructure, and human rights.”
The goals are the following, according to drivendata.org:
• To eradicate extreme poverty and hunger
• To achieve universal primary education
• To promote gender equality and empower women
• To reduce child mortality
• To improve maternal health
• To combat HIV/AIDS, malaria, and other diseases
• To ensure environmental sustainability
• To develop a global partnership for development
I want to ask two questions concerning this data:
1. To what extent is the UN achieving these goals, and to what extent will it achieve them in future years?
2. Which of the variables in the dataset contribute most to these goals?
A secondary question, to answer either in addition to the first two or in the event that we can't answer them, is the following:
1. Given a time series or group of time series, predict which continent the corresponding country belongs to.
These questions matter greatly for the well-being of people across the world. The UN's goals bear directly on the welfare of the world's citizens: reducing disease increases welfare and productivity; making education more universal bolsters economies and well-being; and improving the well-being of mothers and women makes for better lives for a significant share of the population. Quantifying which of these goals most need our attention could therefore increase welfare universally, which makes it an exercise of real importance.
This quantification can be done using two machine learning techniques: random forests and a kind of recurrent neural network called a long short-term memory (LSTM) network. Both are standard tools for forecasting time series.
The data I downloaded is from drivendata.org. It consists of all of the World Bank macroeconomic indicators as a zipped CSV. Now, let’s get started by making some necessary imports.
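As a rough sketch of that setup (not the original notebook's code), the imports and loading step might look like the following; the extracted file name TrainingSet.csv is an assumption, and the sketch uses pandas and TensorFlow's Keras API rather than the exact tooling used in the project.

    # Minimal setup sketch. Assumes the zipped CSV from drivendata.org has been
    # extracted to "TrainingSet.csv" (hypothetical file name).
    import numpy as np
    import pandas as pd
    import tensorflow as tf

    # Load the World Bank indicator data; the first column is assumed to be a row ID.
    data = pd.read_csv("TrainingSet.csv", index_col=0)
    print(data.shape)
    print(data.head())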
The model I developed drew heavily on Taegyun Jeon's work on time-series prediction with LSTMs. He has developed architectures that perform impressively in time-series forecasting, and I studied his code extensively to learn how to build my own.
After laying the groundwork of the model, though, there are a number of ways to tinker with the estimator and the model itself in order to increase accuracy (a sketch of the final configuration appears after this list); these are as follows:
I toyed with different values of n_classes before realizing that there aren't any classes in the target, since we are dealing with continuous data. This parameter is therefore best left at 0 or 1.
The Adagrad optimizer did better than Adam and SGD at certain learning rates, but at smaller learning rates Adam surpassed its competitors by orders of magnitude. Adam is therefore the optimizer we use in the end.
I played around with step size a bit, trying out values of 10e-4, 10e-3, 10e-2, and 10e-1, of which I found 10e-2 to be optimal.
I also found that, in general, larger dense layers improved accuracy on the test set, up to a size of about 25 units. The configuration I eventually settled on was [25, 25], which yields quite impressive accuracy compared to the other configurations I tried.
I also toyed with batch sizes. I tried 100 and 50 before realizing that these batch sizes were too big for the data (because I am not passing in batches of 100 arrays, but rather batches of 100 elements, which is effectively a batch size of 36, since that is how long my time series are). I then tried 20, 15, 10, 5, and 2, and found that a batch size of 10 was optimal. I hypothesize that this is due to the long-term nature of the data: perhaps we want the net to focus more on what has happened in the past 10 years than on the years preceding them. So we pass in a batch size of 10 for best performance.
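Putting these choices together, a minimal sketch of the resulting model looks roughly like the following. This is a Keras rendering rather than the original code; the 5-year lookback window, the 32-unit LSTM layer, the epoch count, and the stand-in series are illustrative assumptions, while the [25, 25] dense layers, the Adam optimizer, the 10e-2 step size, and the batch size of 10 come from the tuning described above.

    import numpy as np
    import tensorflow as tf

    def make_windows(series, lookback=5):
        # Frame a 1-D series as (samples, lookback, 1) inputs with next-year targets.
        X, y = [], []
        for i in range(len(series) - lookback):
            X.append(series[i:i + lookback])
            y.append(series[i + lookback])
        return np.array(X, dtype=np.float32).reshape(-1, lookback, 1), np.array(y, dtype=np.float32)

    series = np.linspace(0.0, 1.0, 36, dtype=np.float32)  # stand-in for one 36-year indicator series
    X, y = make_windows(series, lookback=5)

    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(32, input_shape=(5, 1)),   # recurrent layer (size assumed)
        tf.keras.layers.Dense(25, activation="relu"),   # two dense layers of size [25, 25], as above
        tf.keras.layers.Dense(25, activation="relu"),
        tf.keras.layers.Dense(1),                       # single continuous output (regression)
    ])

    # Adam with the 10e-2 (i.e. 0.1) step size found optimal above.
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=10e-2), loss="mse")
    model.fit(X, y, batch_size=10, epochs=200, verbose=0)  # batch size of 10, as above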
I realized, throughout the process, that I shouldn't have used an LSTM architecture. The strength of LSTMs is that they have both long-term and short-term memory, so they can remember dependencies from much earlier in the series. The nature of my data, however, is short-lived: the measurements are taken yearly rather than daily or monthly, which leaves a small number of data points, and the benefit of the LSTM's long-term memory is outweighed by the cost of the model's added complexity. If I were to do it over again (which I won't right now, seeing as I have already spent over 20 hours on the project), I would use a simpler RNN architecture.
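For illustration only, that simpler alternative would amount to swapping the recurrent layer in the sketch above; in Keras it is roughly a one-line change (the layer size of 32 is again an assumption).

    import tensorflow as tf

    # Hypothetical variant of the sketch above with a plain RNN cell in place of the LSTM.
    simple_model = tf.keras.Sequential([
        tf.keras.layers.SimpleRNN(32, input_shape=(5, 1)),  # fewer parameters than an LSTM cell
        tf.keras.layers.Dense(25, activation="relu"),
        tf.keras.layers.Dense(25, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    simple_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=10e-2), loss="mse")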
Honestly, I was disappointed in my results. I wanted to use multidimensional arrays for these predictions but could only manage it with 1-D input, which makes for poor training because of the small amount of information available to the net. I could not get the network to function with multiple input dimensions. With more time I am sure I could make this work, but for now an LSTM that took multidimensional arrays was too ambitious. I spent the majority of my time getting the network to compile at all and cleaning the data, so I could not develop the topology of the network as thoroughly as I would have liked. Despite all this, however, the method worked well on the 1-dimensional data.
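Since I did not get the multivariate version working, the following is only a sketch of how the input shaping would differ under the same Keras-style setup; the feature count of 8 and the random toy arrays are placeholders, not project data.

    # Sketch only: how a multivariate version might be shaped. With n_features
    # indicators per year, each sample becomes (lookback, n_features) instead of (lookback, 1).
    import numpy as np
    import tensorflow as tf

    lookback, n_features = 5, 8  # both values are illustrative assumptions
    X_multi = np.random.rand(31, lookback, n_features).astype(np.float32)  # toy inputs
    y_multi = np.random.rand(31).astype(np.float32)                        # toy targets

    multi_model = tf.keras.Sequential([
        tf.keras.layers.LSTM(32, input_shape=(lookback, n_features)),
        tf.keras.layers.Dense(25, activation="relu"),
        tf.keras.layers.Dense(25, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    multi_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=10e-2), loss="mse")
    multi_model.fit(X_multi, y_multi, batch_size=10, epochs=10, verbose=0)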