James Carroll and Professor Todd Peterson, Computer Science
Reinforcement learning is a process whereby behaviors are acquired using reinforcement signals. A signal is given to an autonomous agent indicating how well that agent is performing an action, and the agent then attempts to maximize this reinforcement signal. One common method in reinforcement learning is Q-learning, in which the agent attempts to learn the expected temporally discounted value, Q(s,a), of performing an action a in a state s. This function is updated according to the standard Q-learning rule:

Q(s,a) ← Q(s,a) + α [ r + γ max_a' Q(s',a') − Q(s,a) ],

where α is the learning rate, r is the reward received, γ is the discount factor, and s' is the resulting state.
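For concreteness, the sketch below shows the tabular form of this update in Python. The names (ALPHA, GAMMA, q_update, epsilon_greedy) and the epsilon-greedy action selection are illustrative assumptions, not part of the project's implementation.

```python
import random
from collections import defaultdict

ALPHA = 0.1   # learning rate
GAMMA = 0.9   # temporal discount factor

Q = defaultdict(float)  # maps (state, action) pairs to estimated values

def q_update(state, action, reward, next_state, actions):
    """Apply one tabular Q-learning update for the transition (state, action, reward, next_state)."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

def epsilon_greedy(state, actions, epsilon=0.1):
    """Pick a random action with probability epsilon, otherwise the current greedy action."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])
```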
This process is very slow and computationally complex. Furthermore, simple Q-learning cannot generalize to unseen states or to changing goals. The value function is task specific and does not generalize trivially to similar problems, since very similar problems can have considerably different value functions. From a practical standpoint this means that Q-learning is applicable only to very simple tasks and is infeasible in complex situations.
The purpose of our research is to create a task library that contains many tasks learned over the lifetime of a learning agent. Because many of these tasks are similar, information from past tasks can be used to bias the learning of new tasks. This is not a trivial problem because of the very nature of the value functions being learned.
To accomplish this long-term goal it was first necessary to develop methods whereby information from one learned task could bias the learning of another task. Others have developed several methods that we surveyed, and we also developed and published some additional methods.i Research in this area indicated that task transfer was extremely useful when the tasks were sufficiently similar, but when the tasks were not sufficiently similar such methods could do much more harm than good.
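One simple form of such transfer, sketched below, is to warm-start a new task's Q-table from a previously learned one so that early behavior is biased by past experience. The function name and the optional decay parameter are assumptions for illustration and do not correspond to the published methods.

```python
from collections import defaultdict

def transfer_q_values(old_Q, decay=1.0):
    """Initialize a new task's Q-table from a previously learned one.

    A decay below 1.0 weakens the transferred values so that new experience
    can override them more quickly if the tasks turn out to differ.
    """
    new_Q = defaultdict(float)
    for (state, action), value in old_Q.items():
        new_Q[(state, action)] = decay * value
    return new_Q
```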
The next step in the creation of a task library was the generation of a similarity metric that could be applied before a task was fully learned. This was difficult because simple mean squared error on the Q-values (the obvious choice) was not useful until after the task was completely learned. We explored several different metrics. One possibility was mean squared error weighted by a computed estimate of the accuracy of each Q-value (for example, unvisited states have an accuracy of 0). We also explored model similarity, which tended to work well because a simple action model can gain local accuracy before the entire task is learned.
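A rough sketch of the confidence-weighted distance is given below. It assumes each task stores a Q-table together with a per-entry confidence weight in [0, 1], with unvisited state-action pairs weighted 0; the function name and data layout are hypothetical.

```python
def weighted_q_distance(Q_a, Q_b, confidence_b):
    """Mean squared error between two Q-tables, weighted by how accurate
    the partially learned table Q_b is believed to be at each entry.

    confidence_b maps (state, action) -> weight in [0, 1]; unvisited
    entries have weight 0 and so do not contribute to the distance.
    """
    total, weight_sum = 0.0, 0.0
    for key, w in confidence_b.items():
        if w > 0:
            diff = Q_a.get(key, 0.0) - Q_b.get(key, 0.0)
            total += w * diff * diff
            weight_sum += w
    return total / weight_sum if weight_sum > 0 else float("inf")
```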
Once these distance metrics were created, a simple clustering algorithm generated clusters of tasks with related features. When a new task was determined to belong to one of these clusters, the invariants that the clustered tasks all shared, whether model invariants or invariants in the Q-values of certain regions, could be transferred immediately to the new task. This could be done because we assume that when a task is similar in one measured area, it will most likely be similar in other areas. This algorithm allowed Q-learning to generalize to unseen states and to use information from past experience to learn in new situations.
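The clustering step can be as simple as standard agglomerative clustering over the pairwise task distances. The sketch below uses SciPy for illustration; the threshold parameter and function name are assumptions, and the invariant-extraction step is omitted because it depends on which metric produced the tree.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_tasks(distance_matrix, threshold):
    """Group previously learned tasks whose pairwise distances fall below a threshold.

    distance_matrix is a symmetric (n_tasks x n_tasks) array produced by one of
    the distance metrics above; returns a cluster label for each task.
    """
    condensed = squareform(np.asarray(distance_matrix), checks=False)
    tree = linkage(condensed, method="average")        # build the cluster tree
    return fcluster(tree, t=threshold, criterion="distance")
```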
We tested this algorithm in two different simple stochastic maze worlds and found that it worked well. It was robust, avoiding the drawbacks of unrelated transfer while retaining all the advantages of related transfer. We found that different distance metrics generated different cluster trees, and therefore different invariants could be drawn from the previous tasks.
Future work in this area will expand this research into the continuous domain, using either the double-arm pendulum task or the robot soccer task. We also hope to use multiple distance metrics simultaneously, transferring different invariants from each tree generated, and to develop an exploration policy based upon the information gain available within the structure of the past tasks. For example, if half of the past tasks had a particular feature while the other half did not, the agent might try an action specifically to test whether the new task has that feature, provided the expected utility of the invariants that could be transferred if the feature exists outweighs the cost of the experiment, given the prior probability that the task has that feature.
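Such a test can be phrased as a simple expected-value-of-information comparison. The sketch below is purely illustrative of the proposed idea; all quantities and names are assumptions, with the prior and utilities presumed to be estimated from the task library.

```python
def worth_testing(prior_has_feature, utility_if_present, experiment_cost):
    """Decide whether to spend an action testing for a feature.

    prior_has_feature: fraction of past tasks in the cluster that had the feature.
    utility_if_present: estimated value of the invariants that could be
        transferred if the feature exists.
    experiment_cost: estimated cost of the exploratory action.
    """
    expected_gain = prior_has_feature * utility_if_present
    return expected_gain > experiment_cost
```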
This research has the potential to increase the complexity of tasks that can be acquired through reinforcement learning. Three conference papers have been generated by this project to date, and one journal paper and two more conference papers are currently being prepared for submission. The research from this project will be expanded over the next year to become my Master's thesis.
___________________________________
i Todd S. Peterson, Nancy E. Owens, and James L. Carroll, "Towards Automatic Shaping in Robot Navigation," ICRA 2001. James L. Carroll, Todd S. Peterson, and Nancy E. Owens, "Memory-guided Exploration in Reinforcement Learning," IJCNN 2001. James L. Carroll and Todd S. Peterson, "Fixed vs. Dynamic Sub-transfer in Reinforcement Learning," in M. Arif Wani (ed.), ICMLA 2002, Las Vegas, Nevada, USA, June 24-27, 2002, CSREA Press.