Quality evaluation of topics identification algorithms
Abstract
The need for effective text retrieval tools, such as search engines, is omnipresent in the corporate marketplace and the defence industry alike. The task of indexing large quantities of text from various sources, such as news and social media, is too enormous to be accomplished by humans alone. Automatically identifying keywords, or topics, from unstructured text is therefore an important challenge. Extensive computational experiments were conducted using three topic identification methods: the Retrieval Activation and Decay (ReAD) algorithm, the Priming Activation Indexing (PAI) algorithm and the Term Frequency-Inverse Document Frequency (TFIDF) method. These experiments used a subset of the well-known Reuters financial dataset. Their aim was to identify the parameters that return higher quality topics according to several well-known topic quality evaluation measures: F1, precision, recall and Normalized Mutual Information (NMI). Two novel evaluation measures were also proposed: Simple Match Five (SM5) and Expanded Match Five (EM5). Results were then generated using the parameters that returned high quality topics according to the different computational measures, and an online survey with volunteer evaluators was conducted to validate these results. The parameters that yielded higher topic quality were inconsistent from one measure to the next. For the chosen parameters, human evaluation found that TFIDF produced higher quality topics than PAI, and PAI produced higher quality topics than ReAD. Neither the proposed measures nor the established F1 measure was an adequate indicator of topic quality.

Keywords: Topics Identification, Topics Evaluation, Topics Quality
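
For readers unfamiliar with the precision, recall and F1 measures named in the abstract, the following is a minimal sketch of how they might be computed for one document. It assumes topics are compared as exact-match sets of lowercase strings, which the abstract does not specify. The function name and the sample topics are hypothetical and are not taken from the paper.

```python
def precision_recall_f1(predicted, reference):
    """Score predicted topics against reference topics.

    Assumption: a predicted topic counts as correct only if it matches
    a reference topic exactly. The paper may use a different rule.
    """
    predicted, reference = set(predicted), set(reference)
    matches = len(predicted & reference)

    # Precision: fraction of predicted topics that are correct.
    precision = matches / len(predicted) if predicted else 0.0
    # Recall: fraction of reference topics that were found.
    recall = matches / len(reference) if reference else 0.0

    # F1: harmonic mean of precision and recall (0 when both are 0).
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: topics returned by an algorithm vs. editor-assigned topics.
predicted_topics = ["oil", "opec", "prices", "trade"]
reference_topics = ["oil", "opec", "crude"]
p, r, f1 = precision_recall_f1(predicted_topics, reference_topics)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Because these scores reward only overlap with a reference list, a topic that a human reader would accept but that is absent from the reference set counts as an error, which is one way such measures can diverge from human judgment.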
