Attribute Extraction for Web Page Clustering in Web People Searches

Joseph Park and Dr. David Embley, Computer Science

The disambiguation of person names in web people searches is a long-standing problem within the semantic search community. A query for a name such as “Henry Eyring” produces thousands of results that refer to more than one entity with that name. To mitigate this problem, person-related attributes such as birth dates are extracted and used to group pages that refer to the same entity. The confidence that pages are grouped correctly therefore depends directly on the confidence that the person-related attributes were extracted correctly and associated with the correct entity.

One information extraction technique currently used to extract and associate person-related attributes is the extraction ontology, a model used to represent objects and their relationships to the world. Ontological commitment, a notion from the field of philosophy, extends this model: the existence of textual objects such as names is used to infer the existence of abstract objects such as a person. Relationships may then be formed by associating abstract objects with textual objects, or by associating abstract objects with other abstract objects.
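The idea of ontological commitment can be sketched in a few lines of Python. This is only an illustration and not the data model of our system: the class names, the `infer_person` helper, the predicate strings, and every value except the name “Henry Eyring” are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TextualObject:
    """A string found in the text, tagged with its kind (e.g. "Name")."""
    kind: str
    text: str


@dataclass
class Person:
    """An abstract object; it never appears in the text itself."""
    ident: str
    relationships: list = field(default_factory=list)

    def relate(self, predicate, target):
        # target may be a TextualObject or another abstract object
        self.relationships.append((predicate, target))


def infer_person(name, ident):
    """Ontological commitment: a name in the text implies a person exists."""
    if name.kind != "Name":
        raise ValueError("in this sketch only a name implies a person")
    person = Person(ident)
    person.relate("has name", name)
    return person


# Hypothetical example values (only the first name comes from the report).
father = infer_person(TextualObject("Name", "Henry Eyring"), "Person1")
father.relate("has birth date", TextualObject("BirthDate", "1 Jan 1900"))

# Abstract-to-abstract relationship: one inferred person related to another.
son = infer_person(TextualObject("Name", "Example Son"), "Person2")
father.relate("has son", son)
```

Here the first two relationships link an abstract object to textual objects, and the last links two abstract objects.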

To perform the disambiguation, the objects and relationships produced using the extraction ontology are converted into triples, each consisting of a subject, a predicate, and an object. These triples are treated as facts, and each is given a confidence factor expressing how likely it is that the person-related attribute was extracted correctly. The facts for each person entity are grouped into mathematical sets by type. For each pair of person entities, sets of the same type are compared, and the Stanford Certainty Theory¹ is used to produce a confidence factor that the two person entities are the same. This process is repeated over all pairs of person entities to generate a confidence matrix. A threshold then determines whether a given person entity is placed in an existing group or forms an entirely new group.
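The following Python sketch illustrates this pipeline under stated assumptions; it is not the algorithm we implemented. Only `combine_cf` follows the standard Stanford certainty factor combination rule. The per-type comparison in `type_evidence`, the greedy grouping strategy, the threshold value, and all facts and confidence factors are hypothetical choices made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def combine_cf(cf1, cf2):
    """Stanford certainty factor combination of two pieces of evidence."""
    if cf1 >= 0 and cf2 >= 0:
        return cf1 + cf2 - cf1 * cf2
    if cf1 < 0 and cf2 < 0:
        return cf1 + cf2 + cf1 * cf2
    denom = 1 - min(abs(cf1), abs(cf2))
    # Assumption: total contradiction (+1 against -1) is treated as unknown.
    return 0.0 if denom == 0 else (cf1 + cf2) / denom


def group_facts(facts):
    """Group (subject, predicate, object, cf) facts by entity and by type."""
    entities = defaultdict(lambda: defaultdict(dict))
    for subject, predicate, obj, cf in facts:
        entities[subject][predicate][obj] = cf
    return entities


def type_evidence(a, b):
    """Hypothetical rule comparing two same-type sets (object -> cf)."""
    if not a or not b:
        return 0.0  # one side has no facts of this type: no evidence
    shared = a.keys() & b.keys()
    if shared:
        return max(min(a[o], b[o]) for o in shared)  # agreement
    return -min(max(a.values()), max(b.values()))  # disagreement


def same_person_cf(e1, e2):
    """Combine the evidence from every fact type into one factor."""
    cf = 0.0
    for fact_type in sorted(e1.keys() | e2.keys()):
        evidence = type_evidence(e1.get(fact_type, {}), e2.get(fact_type, {}))
        cf = combine_cf(cf, evidence)
    return cf


def confidence_matrix(entities):
    """Symmetric matrix of same-person confidence factors."""
    names = sorted(entities)
    matrix = {name: {} for name in names}
    for a, b in combinations(names, 2):
        matrix[a][b] = matrix[b][a] = same_person_cf(entities[a], entities[b])
    return matrix


def group_entities(matrix, threshold):
    """Greedy grouping (assumption): join the best group at or above the
    threshold, otherwise start a new group."""
    groups = []
    for name in sorted(matrix):
        best, best_cf = None, threshold
        for group in groups:
            cf = max(matrix[name][member] for member in group)
            if cf >= best_cf:
                best, best_cf = group, cf
        if best is None:
            groups.append([name])
        else:
            best.append(name)
    return groups


# Hypothetical facts: (subject, predicate, object, extraction confidence).
facts = [
    ("p1", "birthdate", "1 Jan 1900", 0.9),
    ("p2", "birthdate", "1 Jan 1900", 0.8),
    ("p3", "birthdate", "2 Feb 1850", 0.9),
    ("p1", "child", "Example Child", 0.7),
    ("p2", "child", "Example Child", 0.6),
]
matrix = confidence_matrix(group_facts(facts))
groups = group_entities(matrix, threshold=0.7)
```

With these made-up facts, p1 and p2 agree on two fact types and end up in one group, while p3 has a conflicting birth date and forms its own group.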

Originally, this process was to be used before the summer of 2011 in a competition called Web People Search². The competition was not held during the anticipated time frame, so the above process was modified for use in the HIP ’11 workshop. Due to time constraints, we were unable to implement the disambiguation algorithm described above, but we still used the rest of the described process as a proof of concept. For the HIP ’11 workshop, we used our extraction-ontology-based system on a corpus of 830 historical documents known as The Ely Ancestry. We simplified our model to handle person-birthdate, person-deathdate, person-son, person-daughter, and person-child relationships. During the extraction and association of person-related attributes, we found 8,740 person-birthdate facts; 3,803 person-deathdate facts; 2,394 person-son facts; 2,294 person-daughter facts; and 5,020 person-child facts, for a total of 22,251 facts. Of these facts, we chose 100 at random to compare against a gold standard and calculated a precision of 52%. We also chose 2 entire pages at random and calculated a precision of 40% and a recall of 33%.
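For reference, the fact tally and the evaluation measures can be written out as a short Python sketch. The fact counts are those reported above, and 52 correct facts out of 100 follows from the reported 52% precision. The function and variable names are ours for illustration, and the raw counts behind the page-level figures are not reported here, so they are not reproduced.

```python
# Fact counts as reported for The Ely Ancestry corpus.
fact_counts = {
    "person-birthdate": 8740,
    "person-deathdate": 3803,
    "person-son": 2394,
    "person-daughter": 2294,
    "person-child": 5020,
}
total_facts = sum(fact_counts.values())


def precision(true_positives, false_positives):
    """Fraction of extracted facts that are correct."""
    extracted = true_positives + false_positives
    return true_positives / extracted if extracted else 0.0


def recall(true_positives, false_negatives):
    """Fraction of gold-standard facts that were extracted."""
    expected = true_positives + false_negatives
    return true_positives / expected if expected else 0.0


# 100 randomly sampled facts, 52 of which matched the gold standard.
sample_precision = precision(true_positives=52, false_positives=48)
```

Precision can be estimated from a random sample of extracted facts alone, whereas recall requires knowing every fact that should have been found, which is why it was measured on entire pages.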

Based on our results, our model needs improvement before it can be deemed a viable solution. Many of our precision and recall errors resulted from incorrectly extracted person names and incorrectly associated person-related attributes. Our system did show, however, that when given correctly extracted person names it can achieve precision as high as 90% and recall at about the same level.

Though this project did not accomplish its original intent, it was still successful in some respects. It has helped me gain enough understanding of the disambiguation problem to pursue a Master’s in that general area. I have spent the past year and a half attempting to solve this problem and will continue to work on it and its sub-problems. So far, this work has resulted in a presentation given at the Student Research Conference 2011, a paper entitled Enabling search for facts and implied facts in historical documents³, and a poster presentation given to promote ORCA at both the Wilkinson Student Center and the President’s Leadership Council.

I want to thank all of the individuals who worked on our system and those involved in the Data Extraction Research Group. I also want to thank ORCA for its efforts in providing a mentored learning experience for students, and those who contributed to the ORCA grant that made this experience possible for me.

References

1. G. Luger and W. Stubblefield. Artificial Intelligence: Structures and Strategies for Complex Problem
2. Javier Artiles, Julio Gonzalo, and Satoshi Sekine. WePS 2 Evaluation Campaign: Overview of the Web People Search Clustering Task. WePS-2, ’09.
3. David W. Embley, Spencer Machado, Thomas Packer, Joseph Park, Andrew Zitzelberger, Stephen W. Liddle, Nathan Tate, and Deryle W. Lonsdale. 2011. Enabling search for facts and implied facts in historical documents. In Proceedings of the 2011 Workshop on Historical Document Imaging and Processing (HIP ’11). ACM, New York, NY, USA, 59-66. DOI=10.1145/2037342.2037353 http://doi.acm.org/10.1145/2037342.2037353

Brigham Young University

Journal of Undergraduate Research
