面向不确定文本数据的余弦相似性查询方法

设为首页

收藏本站

网站地图 | English | 公务邮箱

NSTL服务站

面向不确定文本数据的余弦相似性查询方法

详细信息查看全文 | 推荐本文 |

英文篇名：Methods for Similarity Query on Uncertain Data with Cosine Similarity Constraints
作者：朱命冬 ; 徐立新 ; 申德荣 ; 寇月 ; 聂铁铮
英文作者：ZHU Mingdong;XU Lixin;SHEN Derong;KOU Yue;NIE Tiezheng;Department of Computer Science and Technology, Henan Institute of Technology;School of Computer Science and Engineering, Northeastern University;
关键词：不确定数据 ; 分布式算法 ; 余弦相似度 ; 相似性查询
英文关键词：uncertain data;;distributed algorithm;;cosine similarity;;similarity query
中文刊名：KXTS
英文刊名：Journal of Frontiers of Computer Science and Technology
机构：河南工学院计算机科学与技术系;东北大学计算机科学与工程学院;
出版日期：2017-03-07 13:59
出版单位：计算机科学与探索
年：2018
期：v.12;No.112
基金：国家自然科学基金No.61472070;; 国家重点基础研究发展计划(973计划)No.2012CB316201~~
语种：中文;
页：KXTS201801007
页数：16
CN：01
ISSN：11-5602/TP
分类号：54-69

摘要

最近邻查询在多个领域具有广泛的应用,如组合过滤、基于位置的服务、决策支持系统等。而且随着Web信息实体抽取、隐私保护信息转化、图像识别等技术的发展和普及,在诸多领域,不确定性文本数据普遍存在,基于信息论的TF-IDF算法,可以将文本型的相似匹配转化为数值型的向量的计算,具有严密性和有效性。但TF-IDF信息的余弦距离不属于度量空间,难于构建索引。为此主要研究了面向不确定文本数据基于余弦相似度的相似性查询方法。通过分析不确定性余弦相似度计算的特性,提出了快速相似度计算方法。通过对余弦距离的计算进行转换,构建改进的索引结构s MVP-tree(statistic multiple vantage point tree),并给出了基于余弦相似度面向不确定性数据的相似度计算方法。最后,结合该相似度计算方法提出了分布式环境下k NN查询和Rk NN查询算法。大量的基于真实数据的实验验证了算法的正确性和有效性。
Nearest neighbor queries have been used in a wide variety of applications such as collaborative filtering,location-based services and decision support systems. Meanwhile, with the development of entity extraction in Web information, information transformation in privacy protection, text recognition in images, in many fields, uncertain text information is ubiquitous. In the field of information theory, the calculation of textual similarity is transformed to the computation of vector similarity by TF-IDF algorithm, which is rigorous and efficient. However, cosine distance based on TF-IDF does not belong to metric distance function, and it is difficult to build indices on cosine similarity. To this end, this paper studies methods for nearest neighbor queries on uncertain data with cosine similarity constraints. Existing methods are efficient either for numerical data or for certain data, but there is no method that can efficiently support uncertain and character data. So this paper first analyzes the property of cosine similarity to boost up similarity computation. Secondly, this paper proposes an efficient method for similarity queries on uncertain data by transforming cosine similarity computation, and designs an improved tree index for metric space,s MVP-tree(statistic multiple vantage point tree). Lastly, this paper extends the framework to a distributed environment and presents k NN query and Rk NN algorithms. The experimental results show that the proposed method is effective and efficient.

引文

[1]Zhang Peiwu,Cheng R,Mamoulis N,et al.Voronoi-based nearest neighbor search for multi-dimensional uncertain databases[C]//Proceedings of the 29th International Conference on Data Engineering,Brisbane,Apr 8-12,2013.Washington:IEEE Computer Society,2013:158-169.
    [2]Peng Liping,Diao Yanlei,Liu Anna.Optimizing probabilistic query processing on continuous uncertain data[C]//Proceedings of the 37th International Conference on Very Large Data Bases,Seattle,Aug 29-Sep 3,2011.New York:ACM,2011:1169-1180.
    [3]Ma Youzhong,Meng Xiaofeng.Set similarity join on massive probabilistic data using Map Reduce[J].Distributed and Parallel Databases,2014,32(3):447-464.
    [4]Ye Mao,Lee W C,Lee D L,et al.Distributed processing of probabilistic top-K queries in wireless sensor networks[J].IEEE Transactions on Knowledge&Data Engineering,2013,25(1):76-91.
    [5]Yi Ke,Li Feifei,Kollios G,et al.Efficient processing of topK queries in uncertain databases with X-relations[J].IEEE Transactions on Knowledge&Data Engineering,2008,20(12):1669-1682.
    [6]Dallachiesa M,Palpanas T,Ilyas I F.Top-k nearest neighbor search in uncertain data series[J].Proceedings of the VLDB Endowment,2014,8(1):13-24.
    [7]Huang Chenghui,Yin Jian,Hou Fang.A text similarity measurement combining word semantic information with TF-IDF method[J].Chinese Journal of Computers,2011,34(5):856-864.
    [8]Robertson S.Understanding inverse document frequency:on theoretical arguments for IDF[J].Journal of Documentation,2004,60(5):503-520.
    [9]Han Min,Tang Changjie,Duan Lei,et al.TF-IDF similarity based method for tag clustering[J].Journal of Frontiers of Computer Science and Technology,2010,4(3):240-246.
    [10]Wu H C,Luk R W P,Wong K F,et al.Interpreting TF-IDF term weights as making relevance decisions[J].ACM Transactions on Information Systems,2008,26(3):55-59.
    [11]Agarwal P K,Efrat A,Sankararaman S,et al.Nearest-neighbor searching under uncertainty[C]//Proceedings of the 31st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems,Scottsdale,May 20-24,2012.New York:ACM,2012:225-236.
    [12]Agarwal P K,Aronov B,Har-Peled S,et al.Nearest neighbor searching under uncertainty II[C]//Proceedings of the32nd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems,New York,Jun 22-27,2013.New York:ACM,2013:115-126.
    [13]Zheng Kai,Fung G P C,Zhou Xiaofang.K-nearest neighbor search for fuzzy objects[C]//Proceedings of the 2010 International Conference on Management of Data,Indianapolis,Jun 6-10,2010.New York:ACM,2010:699-710.
    [14]Li Chen,Shen Derong,Zhu Mingdong,et al.k NN query processing approach for content with location and time tags[J].Journal of Software,2016,27(9):2278-2289.
    [15]Bernecker T,Emrich T,Kriegel H P,et al.Efficient probabilistic reverse nearest neighbor query processing on uncertain data[C]//Proceedings of the 37th International Conference on Very Large Data Bases,Seattle,Aug 29-Sep 3,2011.New York:ACM,2011:669-680.
    [16]Abeywickrama T,Cheema M A,Taniar D.k-nearest neighbors on road networks:a journey in experimentation and inmemory implementation[J].Proceedings of the VLDB Endowment,2016,9(6):492-503.
    [17]Wang Pei,Xiao Chuan,Qin Jianbin,et al.Local similarity search for unstructured text[C]//Proceedings of the 2016 International Conference on Management of Data,San Francisco,Jun 26-Jul1,2016.New York:ACM,2016:1991-2005.
    [18]Jestes J,Li Feifei,Yan Zhepeng,et al.Probabilistic string similarity joins[C]//Proceedings of the 2010 International Conference on Management of Data,Indianapolis,Jun 6-10,2010.New York:ACM,2010:327-338.
    [19]Wen Qingfu,Wang Jianmin,Zhu Han,et al.Distributed learning to hash for approximate nearest neighbor search[J].Chinese Journal of Computers,2017,40(1):192-206.
    [20]Wang Jiannan,Li Guoliang,Feng Jianhua.Extending string similarity join to tolerant fuzzy token matching[J].ACM Transactions on Database Systems,2014,39(1):7.
    [21]Akdogan A,Demiryurek U,Kashani F B,et al.Voronoibased geospatial query processing with Map Reduce[C]//Proceedings of the 2nd International Conference on Cloud Computing,Indianapolis,Nov 30-Dec 3,2010.Washington:IEEE Computer Society,2010:9-16.
    [22]Manning C D,Raghavan P,Schtze H.Introduction to information retrieval[M].New York:Cambridge University Press,2008:113-133.
    [23]Bozkaya T,Ozsoyoglu M.Indexing large metric spaces for similarity search queries[J].ACM Transactions on Database Systems,1999,24(3):361-404.
    [24]Bustos B,Navarro G,Chávez E.Pivot selection techniques for proximity searching in metric spaces[J].Pattern Recognition Letters,2003,24(14):2357-2366.
    [25]Yang Shiyu,Cheema M A,Lin Xuemin,et al.Reverse k nearest neighbors query processing:experiments and analysis[J].Proceedings of the VLDB Endowment,2015,8(5):605-616.
    [26]Shu Liangcai,Long Bo,Meng Weiyi.A latent topic model for complete entity resolution[C]//Proceedings of the 25th International Conference on Data Engineering,Shanghai,Mar 29-Apr 2,2009.Washington:IEEE Computer Society,2009:880-891.
    [7]黄承慧,印鉴,侯昉.一种结合词项语义信息和TF-IDF方法的文本相似度量方法[J].计算机学报,2011,34(5):856-864.
    [9]韩敏,唐常杰,段磊,等.基于TF-IDF相似度的标签聚类方法[J].计算机科学与探索,2010,4(3):240-246.
    [14]李晨,申德荣,朱命冬,等.一种对时空信息的k NN查询处理方法[J].软件学报,2016,27(9):2278-2289.
    [19]文庆福,王建民,朱晗,等.面向近似近邻查询的分布式哈希学习方法[J].计算机学报,2017,40(1):192-206.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700