Refining a Taxonomy by Using Annotated Suffix Trees and Wikipedia Resources

Details

Authors: Ekaterina Chernyak; Boris Mirkin
Keywords: Taxonomy refinement; String-to-text relevance; Utilizing Wikipedia; Suffix tree
Journal: Annals of Data Science
Year: 2015
Publisher: Springer Berlin Heidelberg
Issue: 1
DOI: 10.1007/s40745-015-0032-1
Source: SpringerLink
Type: Journal

Abstract

A step-by-step approach to taxonomy construction is presented. In the first step, the upper-layer frame of the taxonomy is built manually from educational materials. In the subsequent steps, the frame is refined at a chosen topic using the Wikipedia category tree and articles, both cleaned of noise. The main tool is a naturally defined string-to-text relevance score based on annotated suffix trees. The relevance score is used in several tasks: (1) cleaning the Wikipedia tree or page set of noise; (2) allocating Wikipedia categories to taxonomy topics; (3) deciding whether an allocated category should be included as a child of the taxonomy topic, etc. The resulting taxonomy fragment consists of three parts: the manually set upper-layer topic, the adopted part of the Wikipedia category tree, and Wikipedia articles as leaves. Every leaf is assigned a set of so-called descriptors, which are phrases explaining aspects of the leaf topic. The method is illustrated by its application to two domains in the area of Mathematics: (a) “Probability theory and mathematical statistics” and (b) “Numerical mathematics” (both in Russian).
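
The abstract does not define the string-to-text relevance score itself. As a rough illustration only, the sketch below assumes one common formulation of annotated-suffix-tree scoring: every suffix of each text fragment is inserted into a character trie whose nodes carry occurrence counts, and a string is scored by averaging, over its suffixes, the mean conditional probability (node count divided by parent count) along the matched path. All names here (`Node`, `build_ast`, `suffix_score`, `relevance`) are hypothetical, and the formula may differ in detail from the one used in the paper.

```python
class Node:
    """One character node of an annotated suffix tree, stored as a plain trie."""

    def __init__(self):
        self.children = {}  # character -> Node
        self.freq = 0       # how many inserted suffixes pass through this node


def build_ast(fragments):
    """Insert every suffix of every fragment, counting visits per node."""
    root = Node()
    for fragment in fragments:
        for start in range(len(fragment)):
            node = root
            for ch in fragment[start:]:
                node = node.children.setdefault(ch, Node())
                node.freq += 1
    # Assumption: the root is annotated with the sum of its children's counts.
    root.freq = sum(child.freq for child in root.children.values())
    return root


def suffix_score(root, suffix):
    """Mean conditional probability along the longest matching prefix of `suffix`."""
    node, total, matched = root, 0.0, 0
    for ch in suffix:
        child = node.children.get(ch)
        if child is None:
            break
        total += child.freq / node.freq  # conditional probability of this step
        node = child
        matched += 1
    return total / matched if matched else 0.0


def relevance(root, string):
    """Average the suffix scores over all suffixes of `string` (0.0 if empty)."""
    if not string:
        return 0.0
    scores = [suffix_score(root, string[i:]) for i in range(len(string))]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    tree = build_ast(["probability theory", "numerical mathematics"])
    print(relevance(tree, "probability"))  # shares many substrings with the text
    print(relevance(tree, "zzz"))          # shares none
```

Under this assumed formulation the score lies between 0 and 1 and rewards partial, fuzzy matches, which is the property that makes such a score usable for tasks like noise cleaning and category allocation. In practice the text would be split into short fragments before insertion to keep the tree small; that choice is also an assumption here.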