Charles Parkinson Fry and Professor Todd Peterson, Computer Science
Continuous state spaces can be quite useful in Q-learning. Many real-world problems are simply not discrete. Attempting to represent continuous values with a discrete state space is inherently problematic, as the selected level of discretization will likely be imperfect and unable to adapt to change. Such problems thus favor a continuous representation of the state space. Continuous state spaces, however, introduce new difficulties: they are infinitely large, and Q-learning is no longer guaranteed to converge.
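For reference, the standard tabular Q-learning update can be sketched as follows (the state and action counts, learning rate, and discount factor are illustrative assumptions). Note that both the state and the action must be discrete indices into a table, which is exactly what a real-valued sensor reading cannot provide without discretization.

    import numpy as np

    n_states, n_actions = 100, 8    # illustrative sizes, not from the experiments
    alpha, gamma = 0.1, 0.95        # learning rate and discount factor
    Q = np.zeros((n_states, n_actions))

    def q_update(state, action, reward, next_state):
        # Tabular Q-learning update; requires discrete state and action indices
        best_next = np.max(Q[next_state])
        Q[state, action] += alpha * (reward + gamma * best_next - Q[state, action])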
The drawbacks of continuous state spaces in Q-learning can be overcome, to a degree, through the use of RBF networks, which reduce the size of the effective state space and make it more manageable. One problem for which this approach is especially useful is physical navigation. A robot could be equipped, for example, with a number of sonars that allow it to determine the distance to obstacles in any given direction. The real-valued (or continuous) inputs from these sonars could constitute the state space. Using Q-learning with an RBF network, the robot could learn to navigate around obstacles and to find predefined goals.
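A minimal sketch of how this combination might look, assuming Gaussian radial basis functions over the sonar vector and a linear Q-function per action; the numbers of sonars, actions, and RBF centers below are assumptions for illustration, not the actual experimental setup.

    import numpy as np

    n_sonars, n_actions, n_centers = 8, 8, 50   # assumed sizes
    alpha, gamma, sigma = 0.05, 0.95, 0.2       # learning rate, discount, RBF width
    centers = np.random.uniform(0.0, 1.0, (n_centers, n_sonars))  # RBF centers in sonar space
    weights = np.zeros((n_actions, n_centers))                    # one weight vector per action

    def features(sonars):
        # Gaussian RBF activations over the continuous sonar readings
        dists = np.linalg.norm(centers - sonars, axis=1)
        return np.exp(-dists ** 2 / (2 * sigma ** 2))

    def q_values(sonars):
        return weights @ features(sonars)

    def q_update(sonars, action, reward, next_sonars):
        # Gradient-descent Q-learning update on the RBF weights
        phi = features(sonars)
        td_error = reward + gamma * np.max(q_values(next_sonars)) - weights[action] @ phi
        weights[action] += alpha * td_error * phi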
In the previous example, one key decision that would need to be made is the number of sonars the robot is given, and their placement around the robot. The more sonars (or sensors) the robot has, the more information it has about its environment. However, each sonar adds another dimension to the state space, increasing its complexity and possibly making the problem more difficult to learn with Q-learning. It is thus necessary to balance the information gained by adding sensors against the overhead involved in understanding them and in learning from them.
Another key decision that must be made in this example is the number of actions that the robot should have. Unless a continuous action space is implemented, the robot must choose from a discrete set of actions. Perhaps it could have the ability to turn in any increment of 5 degrees, or of 45 degrees. More actions will obviously allow finer movement, but they may also negatively affect the robot's learning. Again, it is necessary to balance the increased mobility gained by more actions with the overhead involved in learning to use them.
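As a concrete illustration of this tradeoff, the two turning granularities mentioned above could be encoded as discrete action sets like the following (the exact angular range is an assumption).

    # Coarse action set: turns in 45-degree increments (8 actions)
    coarse_actions = list(range(-180, 180, 45))
    # Fine action set: turns in 5-degree increments (72 actions)
    fine_actions = list(range(-180, 180, 5))
    # The finer set allows smoother movement, but gives Q-learning nine
    # times as many action values to estimate for every state.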
Current implementations of this type of problem rely on the manual selection of most parameters, including the number of sensors and actions. The simplest approach is to select what seem like reasonable values for both parameters and work with them. This solution is obviously less than ideal. It is also possible to determine these values by experimentation, though such a process can be quite time-consuming.
Manually selecting values for these parameters is an imperfect solution for several other reasons. The number of sensors and the number of actions may be dependent variables, in which case determining their optimal values would require examining all possible combinations of both parameters. Even if optimal values are selected, fixed values cannot cope with a variable or changing environment. Different degrees of environmental complexity would call for different numbers of sensors and actions, and a manually fixed value cannot adjust itself to current needs.
One way to circumvent the problems inherent in manual sensor and action selection is to allow automatic manipulation of these values. Such a method would allow on-the-fly adjustments to accommodate changing conditions. There are a number of methods by which these parameters could be automatically controlled. One option would be to have an external learner that is in charge of the parameters and adjusts them according to the agent's performance. Another option, which I chose to examine further, is to feed the number of sensors and the number of actions back into the Q-learning algorithm, and to give it additional actions that allow it to adjust those parameters as necessary.
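A rough sketch of this second option, assuming the sensor and action counts are appended to the state and four hypothetical meta-actions adjust them; the names and bookkeeping below are illustrative, not the exact implementation.

    # Hypothetical meta-actions added alongside the robot's ordinary movement actions
    META_ACTIONS = ["add_sonar", "remove_sonar", "add_turn_action", "remove_turn_action"]

    def augmented_state(sonar_readings, n_sonars, n_turn_actions):
        # The configuration parameters become part of the state the agent observes
        return tuple(sonar_readings) + (n_sonars, n_turn_actions)

    def apply_meta_action(meta_action, n_sonars, n_turn_actions):
        # Let the learner add or remove sensors and turning actions on the fly
        if meta_action == "add_sonar":
            n_sonars += 1
        elif meta_action == "remove_sonar" and n_sonars > 1:
            n_sonars -= 1
        elif meta_action == "add_turn_action":
            n_turn_actions += 1
        elif meta_action == "remove_turn_action" and n_turn_actions > 1:
            n_turn_actions -= 1
        return n_sonars, n_turn_actions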
Despite its appeal, this is still an imperfect solution. It introduces additional overhead, and it increases the size of both the state and action spaces. For it to be truly effective, these disadvantages must be outweighed by the increased flexibility and performance that automatic sensor and action selection brings the agent. When applied to an appropriate problem, this algorithm could potentially bring a significant increase in performance.
My experiments showed that Q-learning was able to select an optimal number and spread of sonars and actions. When I manually tried all possible combinations of these parameters, I found that there was a range of optimal values, rather than a single optimal setting. The agent would initially explore various values for all of these parameters, and then settle on the best combination it had encountered. Unsurprisingly, the exact values it converged to varied across different random seeds, but in all cases they fell within the optimal range.
Although the final performance of automatic selection was as good as the performance achieved when those parameters were specified manually, it did take longer to converge. This is to be expected, as the agent had four new parameters to learn. What is more important is that the amount of time it took the agent to learn those parameters using Q-learning was significantly less than the time it took me to optimize them manually: several additional rounds of training as opposed to days.
The benefit of automatic sensor and action selection that I was able to demonstrate was the time savings it introduced over manual selection. This, however, is only a secondary benefit of automatic selection. The most promising advantage is the ability it should have to adjust to variable and changing environments. In an environment whose complexity and granularity are not fixed, it would be impossible to select an effective number of sensors and actions beforehand. Automatic selection could eliminate this need by dynamically adjusting to a variable or changing environment.
There are other, more important and further-reaching implications of Q-learning's ability to select these parameters automatically by inserting them into its feedback loop. The fact that Q-learning was able to learn these parameters suggests that it might be able to learn other parameters in a similar way. Ideal candidates for automatic selection would be parameters that are problematic to select manually, or that depend on potentially changing aspects of the environment.