James L. Carroll and Professor Todd Peterson, Computer Science
Traditional reinforcement learning techniques learn a single task by giving the agent positive and negative rewards. In one type of reinforcement learning, called Q-learning, the agent stores Q-values, which are the expected rewards for performing a given action in a given state. Task transfer is a method of transferring information learned in one task to another, related task. Most work on transfer has focused on classification; the purpose of our research has been to extend these transfer techniques to reinforcement learning.
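For concreteness, here is a minimal sketch of a tabular Q-learning step in Python. The hyperparameters (alpha, gamma, epsilon) and the dictionary-backed Q-table are illustrative assumptions; this is the standard formulation, not our exact implementation:

    from collections import defaultdict
    import random

    alpha, gamma, epsilon = 0.1, 0.95, 0.1   # assumed hyperparameters
    Q = defaultdict(float)                   # Q[(state, action)] -> expected reward

    def update(state, action, reward, next_state, actions):
        # Move Q(s, a) toward the reward plus the discounted best next value.
        best_next = max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

    def choose_action(state, actions):
        # Epsilon-greedy exploration over the current Q-values.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])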
Information transferred from the source task to the target task can be thought of as an acquired bias for the learning of the target task. There are several ways that the source task can bias the learning of the target task.
1. The source information can bias the Q-value updates on the target task.
2. The source information can bias the initialization of the Q-values on the target task.
3. The source information can bias the exploration of the target task.
4. The source information can bias the function approximators used to learn the target task.
5. The source information can bias the model information used to learn the target task.
In our research we compared task transfer methods that use approaches 1, 2, and 3; we plan to explore the more complicated approaches 4 and 5 in future research. This work has practical applications in shaping, simulator-to-real-world transfer, and multi-agent interactions (or learning in any other non-stationary environment).
Our research compared direct transfer of Q-values, soft transfer of Q-values, and memory-guided exploration. Direct transfer and soft transfer are techniques in which the initialization values of the target task are biased to varying degrees by the source task; these are the simplest ways to transfer information. Memory-guided exploration is a more complex technique that biases the agent's initial exploration of its environment toward areas of the state space that were fruitful in the past.
There are several problems associated with direct transfer. In particular, direct transfer can take longer to unlearn incorrect portions of the policy than it would take to learn those portions from scratch. Soft transfer attempts to solve this problem: a weighted average of the standard initialization value and the final values of the source task is used to initialize the Q-values of the target task. This effectively means that the Q-values of the target task are biased toward the source task while remaining easy to change (hence "softened"). For this reason soft transfer often learned faster than direct transfer.
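As a rough illustration, direct and soft transfer differ only in how the target task's Q-table is seeded. The weight W and the function names below are illustrative assumptions, not the published implementation:

    INIT_VALUE = 0.0   # standard Q-value initialization

    def direct_transfer(Q_source):
        # Target task starts directly from the source task's final Q-values.
        return dict(Q_source)

    def soft_transfer(Q_source, W=0.5):
        # Weighted average of the standard init value and the source values,
        # biasing the target toward the source while staying easy to unlearn.
        return {sa: (1 - W) * INIT_VALUE + W * q for sa, q in Q_source.items()}

With W = 1, soft transfer reduces to direct transfer; with W = 0, it reduces to learning from scratch.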
Memory-guided exploration is another method that attempts to solve the problems with direct transfer. It biases the agent's initial exploration of the environment toward areas of the state space that proved fruitful in the past. Eventual convergence is guaranteed because only the initial exploration is affected. The evaluation function for memory-guided exploration is:
eval(s, a) = (1 - W) * Q(s, a) + W * Q_old(s, a)

where W is a measure of how much the agent trusts the information contained in the source task.
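In code, memory-guided action selection might look like the following sketch, where Q_old holds the source task's final Q-values; the decay schedule for W is an assumed choice, not the one from our experiments:

    def eval_action(Q, Q_old, state, action, W):
        # Evaluation function: blend current and remembered Q-values.
        return (1 - W) * Q[(state, action)] + W * Q_old[(state, action)]

    def choose_action(Q, Q_old, state, actions, W):
        # Act greedily on the biased evaluation; updates still use Q alone.
        return max(actions, key=lambda a: eval_action(Q, Q_old, state, a, W))

    W = 1.0        # full trust in the source task initially
    DECAY = 0.999  # assumed decay rate; W *= DECAY after each step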
There are several divergence issues involved in choosing a reinforcement-learning agent's actions according to a different distribution than the one used for its updates. We initially thought that these divergence issues would not affect our controller, because only the initial exploration was affected. However, we found that the agent's initial behavior was often unproductive: the Q-values skewed toward obstacles (and other unvisited transitions) until W decayed, which prevented the agent from learning during this period. Once W decayed, the agent began learning the task from scratch. Although eventual convergence was guaranteed, all the benefits of transfer were neutralized. It was therefore necessary to deal with these divergence issues.
The first method we used to deal with the divergence issues was to bias the agent's updates by the same amount that its exploration was biased, ensuring that actions were chosen from the same distribution used in the Q-value update equation. This solved many of the convergence problems. However, when the tasks were sufficiently dissimilar, the agent would often move to the location of the old goal and remain there until W decayed, never finding the new goal just a few states away. As before, once W decayed the agent behaved as if learning from scratch, but by then there was no chance to exploit any information from the source task. What was needed was a method that allowed the agent to balance exploration of the new task against exploitation of the information in the source task.

We found that by using a local value for W that decayed in proportion to the agent's experience in that portion of the target task, the agent behaved as we desired: it moved to the location of the old goal and then explored outward in an ever-widening circle, while exploiting the source information in the rest of the state space. Keeping local values for W also solved many of the divergence issues, because in areas of divergence W dropped rapidly while remaining high in the rest of the state space, allowing exploitation.
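A minimal sketch of this per-state scheme follows, assuming a 1/(1 + visits) decay (the exact schedule from our experiments is not reproduced here) and applying the same bias to the update target that is applied to action selection:

    from collections import defaultdict

    visits = defaultdict(int)

    def local_W(state, w0=1.0):
        # Trust in the source task falls off with experience in this state.
        return w0 / (1 + visits[state])

    def step(Q, Q_old, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.95):
        visits[state] += 1
        w = local_W(next_state)
        # The update target is drawn from the same biased distribution
        # used to select actions, avoiding the divergence described above.
        best_next = max((1 - w) * Q[(next_state, a)] + w * Q_old[(next_state, a)]
                        for a in actions)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])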
The initial research was published in "Towards Automatic Shaping in Robot Navigation" (ICRA 2001, San Diego, CA), in which Dr. Peterson was the primary author and Nancy Owens and I were contributing authors. Further research on the divergence issues in memory-guided exploration was published in "Memory-guided Exploration in Reinforcement Learning" (IJCNN 2001, Washington, DC), in which I was the primary author and Dr. Peterson was a contributing author.
Related to this project is research published by Nancy Owens, who explored the effectiveness of various transfer mechanisms in transferring information from a simulated task to the real world. Other researchers [1] have proposed that task transfer in reinforcement learning can be accomplished by fixing related portions of the policy and then allowing other parts of the policy to adapt. I did some work in this area as well, proposing a related method that I called dynamic sub-transfer. These results are published in the technical report "Fixed and Dynamic Subtransfer in Reinforcement Q-Learning" (April 14, 2001), in which I was the primary author and Dr. Peterson was a contributing author. Future research will entail exploring model-based approaches and attempting to expand the transfer paradigm to include multiple source tasks.
References
[1] M. Bowling and M. Veloso. Reusing learned policies between similar problems. In Proceedings of the AI*IA-98 Workshop on New Trends in Robotics, Padua, Italy, 1998.