Michael Rimer and Dr. Tony Martinez, Computer Science
Neural networks are a statistics-based learning paradigm in which specific acquired data samples are studied to form general rules on how to correctly handle new, unseen data. They have been used to make informed decisions in complex systems, such as predicting stock market trends, controlling robots in unpredictable environments, and performing voice or face recognition.
Neural networks consist of a set of nodes (“neurons”). Certain nodes act as inputs, receiving external data signals on whatever is being studied. Others act as outputs, providing decisions about that data. Inputs and outputs are internally linked by weighted connections, with the data signal often passing through one or more layers of internal nodes. The internal nodes collectively aid in making more complex decisions on the data encountered; in effect, their presence increases the decision-making ability of the network.
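To make this structure concrete, here is a minimal Python sketch of a signal passing from inputs through one layer of internal nodes to outputs. The layer sizes, random weights, and sigmoid activation are illustrative assumptions only, not the topology developed below.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def forward(x, w_hidden, w_output):
        """Pass an input vector through one layer of internal nodes."""
        hidden = sigmoid(w_hidden @ x)     # internal node activations
        return sigmoid(w_output @ hidden)  # one activation per output decision

    # Hypothetical sizes: 4 input features, 3 internal nodes, 2 outputs.
    rng = np.random.default_rng(0)
    w_hidden = rng.normal(size=(3, 4))
    w_output = rng.normal(size=(2, 3))
    print(forward(np.array([0.1, 0.5, 0.2, 0.9]), w_hidden, w_output))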
A neural network “learns” by repeatedly evaluating given “training” data, which must be representative to be useful. The network then provides a statistically optimal solution by which to make decisions regarding future “test” data. A solution can exist only within the flexibility, or decision-making power, of a given network. Thus, providing relevant training data and a proper network topology are prime factors in training a network to successfully learn valid concepts on which to base its decisions in the future.
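As a sketch of this train-then-test workflow, the following uses scikit-learn and its digits data set purely as stand-ins; the networks in this work are custom, and the layer size shown is an arbitrary assumption.

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    X, y = load_digits(return_X_y=True)  # stand-in "representative" data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

    # hidden_layer_sizes fixes the topology, and with it the network's
    # decision-making flexibility; it must be chosen for each problem.
    net = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500)
    net.fit(X_train, y_train)            # repeatedly evaluate training data
    print("test accuracy:", net.score(X_test, y_test))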
While neural networks can be very accurate and robust to error, choosing an inappropriate design often results in poor accuracy. It is problematic and time-consuming to develop an optimal network topology for each problem faced. Also, their complex architectures often make it difficult for a human to understand how they arrive at decisions.
I have developed a method of constructing networks that can overcome these difficulties in multicategorical systems. It accomplishes this by integrating the principles of binary decision trees to generate a branching neural network tailor-made to its training data.
First, a minimal network is implemented with the aim of finding the largest feature differences existing among data samples belonging to different output classes. This network, called the splitter, is given little decision-making flexibility so that it learns to detect only the most fundamental differences. It serves as a pattern classifier, collectively grouping the most similar classes of data together while separating the most contrasting ones. The splitter then has internal nodes added to provide a more powerful, accurate means of dividing samples along the most fundamental split.
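The text does not spell out the splitter's mechanics, so the following is one plausible realization, sketched under assumptions: a network with a single internal node is fit to all classes, and each class is then assigned to whichever side of that node's activation its samples mostly fall on. The subsequent step of adding internal nodes to refine the split is omitted here.

    import numpy as np
    from scipy.special import expit
    from sklearn.neural_network import MLPClassifier

    def fundamental_split(X, y):
        """Fit a minimal net (one internal node), then group each class by
        the side of that node's activation its samples mostly fall on."""
        splitter = MLPClassifier(hidden_layer_sizes=(1,),
                                 activation="logistic", max_iter=1000)
        splitter.fit(X, y)
        # Activation of the lone internal node for every training sample.
        h = expit(X @ splitter.coefs_[0] + splitter.intercepts_[0]).ravel()
        means = {c: h[y == c].mean() for c in np.unique(y)}
        cut = np.median(list(means.values()))   # balance the two sides
        left = {c for c, m in means.items() if m <= cut}
        return splitter, cut, left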
Once this is accomplished, roughly half of the possible output classes lie on each side of the partition. The original data set can be filtered into two data sets, one for the classes remaining on each side. Two sub-networks are generated, each making decisions for the data classes on its side. The advantage is that whereas there once were n ways for the network to decide, only n/2 decisions remain for each sub-network to resolve. Hence, each subnet is much simpler than a single network that must decide among all available output classes. According to Occam’s razor, when two systems categorize known data equally well, the simpler of the two will be more accurate in categorizing unknown data. It follows that in this situation the simpler sub-networks provide greater generalization.
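Continuing the hypothetical fundamental_split sketch above, filtering the original data set into the two halves might look like:

    splitter, cut, left = fundamental_split(X, y)
    left_mask = np.isin(y, list(left))
    X_left, y_left = X[left_mask], y[left_mask]      # classes on one side
    X_right, y_right = X[~left_mask], y[~left_mask]  # classes on the other
    # Each subset now poses roughly n/2 of the original n class decisions.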
Following this methodology, the remaining classes on either side can be bisected again by finding the most fundamental split. This process can be continued, creating more and more branches to the tree, until only a single class remains at each terminating branch, at which point the problem is solved.
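Again reusing the hypothetical helper from the earlier sketch, the whole recursive construction fits in a few lines (degenerate ties between classes are ignored here):

    def build_tree(X, y):
        """Recursively bisect the remaining classes until one is left."""
        classes = np.unique(y)
        if len(classes) == 1:
            return classes[0]            # terminating branch: problem solved
        splitter, cut, left = fundamental_split(X, y)
        mask = np.isin(y, list(left))
        return {"splitter": splitter, "cut": cut,
                "left": build_tree(X[mask], y[mask]),
                "right": build_tree(X[~mask], y[~mask])}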
It would be unduly optimistic to believe that a simple split at a given level can completely separate all samples of one set of classes from those of the other. Since each branch of the network tree is only equipped to handle half of the previously possible decision cases, it cannot correctly handle any data erroneously passed down its half of the tree. Thus, this topology is advantageous provided that the sub-branches’ gain in accuracy over the standard single network outweighs the inaccuracy caused by the splitter’s incorrect choices at each branching.
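A back-of-the-envelope illustration of this trade-off, with purely hypothetical figures:

    # If each splitter routes correctly 99% of the time and a sample
    # passes through five splitters on its way to a leaf, the chance it
    # reaches the correct sub-network at all is:
    p_route = 0.99 ** 5                  # ~0.951
    # so the simpler sub-networks must gain back more than the ~4.9%
    # lost to misrouting for the topology to win overall.
    print(p_route)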
I tested this new topology on optical character recognition (OCR) over a large data set. A branching network roughly six levels deep was compared against an array of 46 separate networks, one per character class drawn from the upper- and lowercase letters. I found the new topology to be more accurate in recognizing test character samples than the separate nets. The new network was also superior in that it classified the test samples much faster than the 46 separate networks. This is to be expected, as the binary search down the tree optimally requires only log2(n) decisions to arrive at the correct letter, while n separate nets require n decisions (about 5.5 versus 46).
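Classification under this scheme descends the tree built in the earlier sketch, one splitter decision per level; the routing rule below mirrors the hypothetical fundamental_split above.

    def classify(tree, x):
        """Descend the branching network: about log2(n) splitter
        decisions instead of n one-per-class network evaluations."""
        while isinstance(tree, dict):
            node = tree["splitter"]
            h = expit(x @ node.coefs_[0] + node.intercepts_[0]).item()
            tree = tree["left"] if h <= tree["cut"] else tree["right"]
        return tree                      # the single remaining class

    # e.g., 46 character classes: about log2(46) ~= 5.5 levels versus 46.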
These results are significant for two reasons. First, it has been shown that this method can produce networks with greater accuracy and speed than standard nets for real-world problems. This is possible for any data distribution because, we must conclude, correctly partitioning all data classes simultaneously (as a standard neural network attempts to do) is no simpler than merely dividing the classes in half; the two tasks coincide only when just two data classes are to be separated. Second, the decision-making process of the resultant network is more easily understood, for at least two reasons: (1) rough class organization is easily observed, given the network’s hierarchical nature, and (2) the simplicity of the splitter at each branching permits study of which features proved most salient in determining class divisions.
In the future, I plan to extend empirical testing of the new neural network topology to additional problem domains. I also plan to investigate extending the network to generate more than two branches at each split, and to propose different metrics by which splitters can partition data classes more effectively.