Quality evaluation of topics identification algorithms

Details

Author: Decarie, Francois Andre Martin
Degree: Master
Year: 2013
Keywords: Communication and the arts; Applied sciences
Institution: Royal Military College of Canada
Discipline: Information science; Computer science
ISBN: 9780499002327
CBH: MS00232
Country: Canada
Language: English
FileSize: 6412152
Pages: 150

Abstract

The need for effective text retrieval tools, such as search engines, is omnipresent in the corporate marketplace and defence industry alike. The task of indexing large quantities of text from various sources, such as news and social media, is too enormous to be accomplished by humans alone. Automatically identifying keywords, or topics, from unstructured text is an important challenge. Extensive computational experiments were conducted using three topic identification methods: the Retrieval Activation and Decay (ReAD) algorithm, the Priming Activation Indexing (PAI) algorithm and the Term Frequency-Inverse Document Frequency (TFIDF) method. These experiments were conducted with a subset of the well-known Reuters financial dataset. Their purpose was to identify the parameters that would return higher quality topics according to several well-known topic quality evaluation methods: the F1, precision, recall and Normalized Mutual Information (NMI) measures. Two novel evaluation measures were also proposed: Simple Match Five (SM5) and Expanded Match Five (EM5). The results were generated using the parameters that would return high quality topics according to the different computational measures. An online survey with volunteer evaluators was conducted in order to validate these results. The parameters that yielded higher topic quality were inconsistent from one type of measurement to the next. For the chosen parameters, it was found that TFIDF produced higher quality topics than PAI, and PAI produced higher quality topics than ReAD, when submitted to human evaluation. It was found that neither the proposed measures nor the established F1 measure were adequate indicators of topic quality.

Keywords: Topics Identification, Topics Evaluation, Topics Quality
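
The precision, recall and F1 measures named in the abstract can be illustrated with a minimal sketch. It assumes that the topics identified for a document and the reference topics are compared as plain sets of exact-match strings. The abstract does not state the matching rules used in the thesis, so the function name `precision_recall_f1` and the sample topics are hypothetical.

```python
def precision_recall_f1(predicted, reference):
    """Set-based precision, recall and F1 for one document's topics.

    Assumption: topics match only when the strings are identical.
    """
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)

    # Precision: share of predicted topics that are in the reference set.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of reference topics that were predicted.
    recall = true_positives / len(reference) if reference else 0.0

    # F1: harmonic mean of precision and recall (0 when both are 0).
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: 4 predicted topics, 3 reference topics, 2 in common.
p, r, f1 = precision_recall_f1(
    ["oil", "opec", "price", "market"],
    ["oil", "opec", "crude"],
)
print(p, r, f1)
```

In this example two of the four predicted topics are correct and two of the three reference topics are found, so precision is 1/2, recall is 2/3 and F1 is 4/7.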
