基于minwise哈希的文档复制检测的研究及应用

英文题名：Reserch and Application on Document Similarity Detection Based on Minwise Hashing
作者：袁鑫攀
论文级别：博士
学科专业名称：计算机应用技术
中文关键词：文档复制检测 ; minwise哈希 ; 相似度估计 ; 指纹 ; 分数位 ; 连接位
英文关键词：document similarity detection ; minwise hashing ; similarity estimation ; fingerprint ; fractional bit ; connected bit
学位年度：2012
导师：桂卫华 ; 张祖平 ; 龙军
学科代码：081203
学位授予单位：中南大学
论文提交日期：2012-11-01

摘要

WEB正经历着爆炸性增长,海量文档中存在大量的相似信息,这些相似性文档一方面消耗了高额的检索资源,另一方面影响了用户的使用。文档的数字化和易获性也使得非法复制、剽窃等行为越来越猖獗。为保护知识产权和提高信息检索效率,文档复制检测技术应运而生并得到迅速发展。
     文档复制检测就是判断给定文档是否抄袭、剽窃或者相似于一篇或多篇文档的内容。论文以某基金项目相似性检测为实际应用背景,为了在海量数据环境中快速而准确地检测出文档的相似性,主要研究相似性检测系统中涉及的关键技术,重点研究相似度估计算法、相似度检索算法和基于SIMD优化的相似度比对等关键技术,具体进行了如下的研究工作：
     (1)针对文档相似性检测系统中精度和存储空间只能取离散值、粒度过粗的问题,提出了分数位minwise哈希算法,验证了分数位minwise哈希算法的可行性,构造了使得估计方差最小的最优分数位。分数位minwise哈希算法将整数位minwise哈希扩展到分数位,突破了b整数位的限制,使得相似度可以使用分数位来估计,不仅完善了minwise哈希算法的理论体系,也为实际系统中的用户对于相似度估计的精度和存储空间更加细粒度可选择性需求提供支撑。
     (2)针对文档相似性检测效率不高的问题,提出了连接位minwise哈希算法。连接位minwise哈希算法将位连接起来进行相似性度量,证明与推导及实验结果显示算法虽然牺牲5%精度,却能成倍地减少比对的次数,大大提升算法的时间性能。一方面,连接位无需任何复杂的操作,方便构建；另一方面,亿万级文档的相似度的估计,通过损失一定的精度误差,获得了性能的成倍提升具有很强的实际应用意义。
     (3)针对海量文档相似性检索中相似度阂值不能设置过低,初始指纹数少等问题,提出了指纹分组合并检索算法,理论推导及实验结果表明算法能够在低相似度阈值(比如70%)下快速地从已有的文档集中检索目标文档,从而实现相似性文档查询的实时性,并且由于降低了相似度阈值,也增大了相似性检索的应用范围。
     (4)针对某基金对相似性证据快速采集和清晰呈现的特殊需求,提出了基于SIMD优化的相似性比对算法。通过使用SIMD指令集和OpenCL框架对相似度比对算法进行了一系列的优化,实验结果表明优化算法提升了可提升11.6%-170%的时间性能,一方面使得相似性有迹可循;另一方面也有利于人工复审工作。
     (5)针对某基金项目相似性检测系统中存在的项目数据难以准确快速提取、海量项目数据比对时间超长、比对结果难以清晰呈现等关键问题,论文论述了如何采用所研究的关键技术形成完整的基金项目相似性检测系统,并为基金项目形式审查提供支持。
With the explosive information growth of Web, there are a large number of massive web of similar information. On the one hand, these similar documents consumed high resources of index, the other affected users. Document similarity detection technology is an important topic in the information processing field, and it is a powerful tool to protect the author's intellectual property and to improve the efficiency of information retrieval.
     Duplicate Document Detection (DDD) is widely used to find out similar documents, or in other words, to detect plagiarism in documents. Plagiarism does not only include intact copy, but also close imitation of the language and thoughts of another author and the representation of them as one's own original work. Takeing fund projects similarity detection as a research background, In order to quickly and accurately detect the similar documents in the environment of massive amounts of data, this paper focuses on the theories and methods of document similarity detection for a more in-depth study, especially on the similarity estimation algorithm, the similarity retrieval algorithm, and similarity matching techniques based on SIMD optimization. Research work has been done as following:
     (1)f-fractional bit minwise hash algorithm is proposed for a wider range of selectivity for accuracy and storage space requirements. This paper studied the feasibility of f-fractional bit minwise hash algorithm, and constructs the optimal fractional bit to make the minimum variances of estimator. The algorithm's innovation is extending the b bit into f-fractional bit. It broke through the limit of b integer; the similarity could be estimated by fractional bit. It not only improves the theoretical system of minwise hash algorithm, but also provides support for the diverse needs of precision and storage space in the actual system.
     (2) Connected bit minwise hash algorithm is proposed to improve the efficiency of similarity estimation since the half of comparisons is greatly reduced with5%loss of accuracy. Connected bit is convenient to be built and the performance increases exponentially with a strong practical significance in the environment of massive amounts of data.
     (3) Fingerprint group merging retrieval algorithm is proposed in large part to address both sides of a problem:similarity threshold could not be too low and fewer fingerprints could lead to low accuracy. Fingerprint group merging retrieval algorithm could quickly find documents with higher similarity in existing documentation set with the lower similarity threshold. Due to the reduction of the similarity threshold, the application level is wider.
     (4) For the problem that comparison's results are difficult to clearly show, an optimized similarity comparison algorithm is proposed and implemented by using Intel SIMD technology and GPUs, based on the analysis of the whole non-parallel algorithm and the data statistics. Experimental results demonstrate that certain performance improvements could be obtained. As evidence of similarity to be significant expressed in the system. On one hand, the similarity can be tracked, on the other hand, in favor of manual review.
     (5)To solve the key problems of fund projects similarity detection system that the existing project data is difficult to extract quickly and accurately; the time of mass projects data comparision is too long; comparision's results are difficult to clearly show, this paper uses key technologies to form a complete fund projects similarity detection system for the fund projects formal review.

引文

[1]Shi vakumar N, Molina HG. Finding Near-replicas of Documents and Servers on the Web. Proceedings of the International Workshop on World Wide Web and Databases, London, UK:Springer-Verlag,1999.204-212
    [2]麻会东,刘国华,李旭,等.基于提取关键词的中文文档复制检测研究,计算机工程与科学,2007,29(10)：63-64,88
    [3]李明德,许超.著作权法.北京：法律出版社,2003：25-42
    [4]鲍军鹏,沈钧毅,刘晓东,等.自然语言文档复制检测研究综述.软件学报,2003,14(10)：1753-1760
    [5]宋擒豹,杨向荣,沈钧毅,等.数字商品非法复制的检测算法.计算机学报,2002,25(11)：1206-1211
    [6]Terada M, Kuno H, Hanadate M, et al. Copy Prevention Scheme for Rights Trading Infrastructure. Proceedings of the Fourth Working Conference on Smart Card Research and Advanced Applications. MA, USA:Kluwer Academic Publishers Norwell,2001.20-22
    [7]Brassil J, Low S, Maxemchuk N, et al. Electronic Marking and Identification Techniques to Discourage Document Copying. IEEE Journal on Selected Areas in Communications,1995,13(8):1495-1504
    [8]Popek GJ, Kline CS. Encryption and Secure Computer Networks. ACM Computer Surveys,1979,11(4):331-356
    [9]Griswold GN. A Method for Protecting Copyright on Networks. Proceedings of the Joint Harvard Workshop on Technology Strategies for Protecting Intellectual Property in the Networked Multimedia Environment. USA:Boston,1993.31-35
    [10]Luerssen MH, Powers DM. Evolving Encapsulated Programs as Shared Grammars. Genetic Programming and Evolvable Machines,2008,9(3); 203-228
    [11]Wang HX, Lu ZM. Robust Blind Video Watermarking with Adaptive Embedding Mechanism. International Journal of Innovative Computing, Information and Control,2005,1(2):247-259
    [12]Shivakumar N, Garcia-Molina H. SCAM:A copy detection mechanism for digital documents. Proceedings of the 2nd International Conference in Theory and Practice of Digital Libraries (DL'95).1995.398-409
    [13]Shivakumar N, Garcia-Molina H. Building a scalable and accurate copy detection mechanism. Proceedings of the 1st ACM Conference on Digital Libraries (DL'96). NY, USA:ACM.1996
    [14]Si A, Leong HV, Lau RWH. CHECK:A document plagiarism detection system. Proceedings of the ACM Symposium for Applied Computing. NY, USA: ACM,1997.70-77
    [15]Song QB, Shen JY. On illegal coping and distributing detection mechanism for digital goods. Journal of Computer Research and Development,2001,38(1): 121-125
    [16]Bao JP, Shen JY, Liu HY. A Fast Document Copy Detection Model. Soft Computing,2006,10(1):41-46
    [17]Manber U. Finding similar files in a large file system. Proceedings of the Winter USENIX Conference. CA,USA:USENIX Association Berkeley,1994.1-10.
    [18]Heintze N. Scalable document fingerprinting. Proceedings of the 2nd USENIX Workshop on Electronic Commerce.1996
    [19]Broder AZ. On the resemblance and containment of documents. Proceedings of Compression and Complexity of Sequences. Washington, DC, USA:IEEE Computer Society,1997.21-29
    [20]Broder AZ, Glassman SC, Manasse MS, et al. Syntactic clustering of the web. Journal Computer Networks and ISDN Systems,1997,23(8):1157-1166
    [21]Ye SZ, Wen JR, Ma WY. A systematic study on parameter correlation in large scale duplicate document detection. Knowledge and Information Systems,2008, 14(2):217-232
    [22]张祖平,徐昕,龙军,等.文本相似性度量中参数相关性及优化配置,小型微型计算机系统,2011,(05)：983-988
    [23]徐昕.文本相似性度量中参数相关性与优化配置研究：[硕士学位论文].长沙：中南大学,2009
    [24]Wise MJ. YAP3:Improved detection of similarities in computer programs and other texts. Proceedings of the SIGCSE'96. NY, USA:ACM,1996.130-134.
    [25]Monostori K, Zaslavsky A, Schmidt H. MatchDetectReveal:Finding overlapping and similar digital documents. Proceedings of the Information Resources Management Association International Conference (IRMA2000). PA, USA:IGI Publishing Hershey.2000.955-957
    [26]Monostori K, Zaslavsky A, Schmidt H. Parallel overlap and similarity detection in semi-structured document collections. Proceedings of the 6th Annual Australasian Conference on Parallel And Real-Time Systems (PART'99).1999
    [27]Monostori K, Zaslavsky A, Schmidt H. Document overlap detection system for distributed digital libraries. Proceedings of the ACM Digital Libraries 2000 (DL2000). NY, USA:ACM,2000.226-227
    [28]Monostori K, Zaslavsky A, Schmidt H. Parallel and distributed overlap detection on the Web. Proceedings of the Workshop on Applied Parallel Computing (PARA2000). UK:Springer-Verlag London,2000.206-214
    [29]Santosh Vempala. The Random Projection Method. Providence, RI:American Mathematical Society,2004
    [30]Broder AZ, Charikar M, Frieze AM, et al. Min-wise independent permutations. Journal of Computer Systems and Sciences,2000,60(3):630-659
    [31]Charikar MS. Similarity estimation techniques from rounding algorithms. STOC '02 Proceedings of the thiry-fourth annual ACM symposium on Theory of computing. New York, USA:ACM,2002.380-388
    [32]Li P, Konig AC. b-bit minwise hashing. WWW'10 Proceedings of the 19th international conference on World wide web. New York, USA:ACM, 2010.671-680
    [33]Monostori K, Zaslavsky A, Vajk I. Suffix vector:A space-efficient representation of a suffix tree. Technical Report,2001
    [34]Kalpakis K, Tang S. Collaborative data gathering in wireless sensor networks using measurement co-occurrence. Computer Communications,2008,31(10): 1979-1992
    [35]Dourisboure Y, Geraci F, Pellegrini M. Extraction and classification of dense implicit communities in the web graph. ACM Transactions on the Web (TWEB), 2009,3(2):1-36
    [36]Bendersky M, Croft W B. Finding text reuse on the web. WSDM '09 Proceedings of the Second ACM International Conference on Web Search and Data Mining. New York, USA:ACM,2009.262-271
    [37]Buehrer G, Chellapilla K. A scalable pattern mining approach to web graph compression with communities. WSDM '08 Proceedings of the international conference on Web search and web data mining. New York, USA:ACM, 2008.95-106
    [38]Indyk P. A small approximately min-wise independent family of hash functions. Journal of Algorithms,2001,38(1):84-90
    [39]Toshiya I, Yoshinori T, Jun Tarui. On the sample size of k-restricted min-wise independent permutations and other k-wise distributions. NY, USA:ACM, 2003.710-718
    [40]Li P, Church KW. A sketch algorithm for estimating two-way and multi-way associations. Computational Linguistics (Preliminary results appeared in HLT/EMNLP 2005),2007,33(3):305-354
    [41]Li P, Church KW, Hastie TJ. One sketch for all:Theory and applications of conditional random sampling. NIPS (Preliminary results appeared in NIPS 2006), BC, Canada:Vancouver.2008
    [42]Manasse M, McSherry F, Talwar K. Consistent weighted sampling. Technical Report, MSR-TR-2010-73,2010
    [43]Feigenblat G, Porat E, Shiftan A. Exponential time improvement for min-wise based algorithms. SODA'11 Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms, SIAM,2011.57-66
    [44]Li P, Konig AC. Theory and Applications of b-Bit Minwise Hashing. Communications of the ACM,2011,54(8):101-109
    [45]Cohen E. Size-estimation framework with applications to transitive closure and reachability. J. Comput. Syst.1997,55(3):441-453
    [46]Cohen E, Kaplan H. Tighter estimation using bottom k sketches. Journal Proceedings of the VLDB Endowment,2008,1(1):213-224
    [47]Cohen E. Kaplan H. Summarizing data using bottom-k sketches. PODC '07 Proceedings of the twenty-sixth annual ACM symposium on Principles of distributed computing. NY, USA:ACM,2007.225-234
    [48]Datar M, Muthukrishnan S. Estimating rarity and similarity over data stream windows. Proceedings of 10th Annual European Symposium on Algorithms volume 2461 of Lecture Notes in Computer Science. London, UK: Springer-Verlag,2002.323-334
    [49]Broder AZ. Identifying and filtering near-duplicate documents. COM '00 Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching. London, UK:Springer-Verlag,2000.1-10
    [50]Haveliwala TH, Gionis A, Klein D. et al. Evaluating strategies for similarity search on the web. WWW '02:Proceedings of the 11th international conference on World Wide Web. NY, USA:ACM,2002.432-442
    [51]Bachrach Y, Herbrich R, Porat E. Sketching Algorithms for Approximating Rank Correlations in Collaborative Filtering Systems. SPIRE '09:Proceedings of the 16th International Symposium on String Processing and Information Retrieval. London, UK:Springer-Verlag,2009.344-352
    [52]Bachrach Y, Porat E, Rosenschein JS. Sketching techniques for collaborative filtering. IJCAI '09 Proceedings of the 21st international jont conference on Artifical intelligence. San Francisco, CA, USA:Morgan Kaufinann Publishers Inc,2009.2016-2021
    [53]Manku GS, Jain A, Sarma AD. Detecting near-duplicates for web crawling. WWW '07 Proceedings of the 16th international conference on World Wide Web. NY, USA:ACM,2007.141-150
    [54]Cormode G, Muthukrishnan S. What's new:finding significant differences in network data streams. IEEE/ACM Transactions on Networking,2005,13(6): 1219-1232
    [55]Ganguly S, Garofalakis M, Rastogi R. Processing set expressions over continuous update streams. SIGMOD '03 Proceedings of the 2003 ACM SIGMOD international conference on Management of data. NY, USA:ACM, 2003.265-276
    [56]Das AS, Datar M, Garg A, et al. Google news personalization:scalable online collaborative filtering. WWW '07:Proceedings of the 16th international conference on World Wide Web. NY, USA:ACM,2007.271-280
    [57]Gibbons P.B, Tirthapura S. Estimating simple functions on the union of data streams. SPAA '01:Proceedings of the thirteenth annual ACM symposium on Parallel algorithms and architectures. NY, USA:ACM,2001.281-291
    [58]Yang H, Callan J. Near-duplicate detection by instance-level constrained clustering. SIGIR'06:Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval. NY, USA: ACM,2006.421-428
    [59]Henzinger M. Finding near-duplicate web pages:a large-scale evaluation of algorithms. SIGIR'06:Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval. NY, USA: ACM,2006.284-291
    [60]Krauthgamer R, Mehta A, Raman V, et al. Greedy list intersection. ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering. Washington, DC, USA:IEEE Computer Society,2008.1033-1042
    [61]Bachrach Y, Herbrich R. Fingerprinting Ratings For Collaborative Filtering Theoretical and Empirical Analysis. SPIRE'10 Proceedings of the 17th international conference on String processing and information retrieval. Berlin, Heidelberg:Springer-Verlag,2010.25-36
    [62]Cohen E, Datar M, Fujiwara S, et al. Finding interesting associations without support pruning. IEEE Transactions on Knowledge and Data Engineering,2001, 13(1):64-78
    [63]Saks M, Srinivasan A, Zhou S, et al. Low discrepancy sets yield approximate min-wise independent permutation families. RANDOM-APPROX '99 Proceedings of the Third International Workshop on Approximation Algorithms for Combinatorial Optimization Problems:Randomization, Approximation, and Combinatorial Algorithms and Techniques. London, UK:Springer-Verlag, 1999.29-32
    [64]Patrascu M, Thorup M. On the k-independence required by linear probing and minwise independence. ICALP'10 Proceedings of the 37th international colloquium conference on Automata, languages and programming. Berlin, Heidelberg:Springer-Verlag,2010.715-726
    [65]Broder AZ, Charikar M, Mitzenmacher M. A derandomization using min-wise independent permutations. Journal of Discrete Algorithms,2003.1(1):11-20
    [66]Aho AV, Hopcroft JE. The Design and Analysis of Computer Algorithms. Boston MA USA:Addison-Wesley Longman Publishing Co. Inc,1974
    [67]Bar-Yossef Z, Jayram TS, Kumar R, et al. Counting distinct elements in a data stream. RANDOM '02:Proceedings of the 6th International Workshop on Randomization and Approximation Techniques. London, UK:Springer-Verlag, 2002.1-10
    [68]Kane DM, Nelson J, Woodruff DP. An optimal algorithm for the distinct elements problem. PODS'10 Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. NY, USA:ACM,2010.41-52
    [69]Indyk P, Rajeev M. Approximate nearest neighbors:towards removing the curse of dimensionality. STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing. New York, USA:ACM,1998.604-613
    [70]Hamming RW. Error detecting and error correcting codes. Bell System Technical Journal,1950,29(2):147-160
    [71]Li P, Konig AC, Gui W. b-bit minwise hashing for estimating three-way similarities. Neural Information Processing Systems (NIPS) BC, Canada: Vancouver.2010
    [72]Li P, Moore J, K6nig AC. b-bit minwise hashing for large-scale linear SVM. Neural Information Processing Systems (NIPS) BC, Canada: Vancouver.2011
    [73]Li P, Konig AC. Accurate Estimators for Improving Minwise Hashing and b-Bit Minwise Hashing. Technical Report, New York:Cornell university library,2011
    [74]Wilson RA, Keil FC. The MIT Encyclopedia of the Cognitive Sciences. Cambridge, MA:MIT Press,2006.763-765
    [75]Ejarque P, Hernando J. Score Equalization in SVM Multimodal Fusion for Person Recognition. E-business and Telecommunications, Communications in Computer and Information Science,2009,23:152-161
    [76]Faloutsos C, Barber R, Flickner M, et al. Efficient and effective querying by image content. Journal of Intelligent Information Systems,1994,3(3-4):231-262
    [77]Hafner J, Sawhney HS, Equitz W, et al. Efficient Color Histogram Indexing for Quadratic Form Distance Functions. IEEE Transactions on Pattern Analysis and Machine Intelligence,1995,17(7):729-736
    [78]Levenshtein Ⅵ. Binary codes capable of correcting spurious insertions and deletions of ones. Problems of Information Transmission,1965,1(1):8-17
    [79]Huttenlocher DP, Klanderman GA, Rucklidge WA. Comparing Images Using the Hausdorff Distance. IEEE Transactions on Pattern Analysis and Machine Intelligence,1993,15(9):850-863
    [80]Broder AZ. Some applications of Rabin's fingerprinting method. Sequences II: Methods in Communications, Security, and Computer Science. Springer-Verlag, 1993.143-152
    [81]Yuan XP, Long J, Zhang ZP, et al. f-Fractional bit minwise hashing. Journal of Software,2012,7(1):228-236
    [82]Davison AC. Statistical models. Cambridge, Britain:Cambridge University Press, 2003.37-41
    [83]Chavez E, Navarro G, Baeza-Yates R, et al. Searching in metric spaces. ACM Computing Surveys (CSUR),2001,33(3):273-321
    [84]Zezula P, Amato G, Dohnal V, et al. The Metric Space Approach. NY, USA: Springer Science &Business Media, Inc,2006
    [85]Burkhard WA, Keller RM. Some approaches to best-match file searching. Communications of the ACM,1973,16(4):230-236
    [86]Baeza-Yates RA, Cunto W, Manber U, et al. Proximity Matching Using Fixed-Queries Trees. CPM '94 Proceedings of the 5th Annual Symposium on Combinatorial Pattern Matching. London, UK:Springer-Verlag,1994.198-212
    [87]Chavez E, Marroquin JL, Navarro G. Fixed Queries Array:A Fast and Economical Data Structure for Proximity Searching. Multimedia Tools and Applications, 2001,14(2):113-135
    [88]Yianilos PN. Data structures and algorithms for nearest neighbor search in general metric spaces. SODA '93 Proceedings of the fourth annual ACM-SIAM Symposium on Discrete algorithms. PA, USA:Society for Industrial and Applied Mathematics,1993.311-321
    [89]Yianilos PN. Excluded middle vantage point forests for nearest neighbor search. Technical report. Princeton:NEC Research Institute,1999
    [90]Kalantari I, McDonald G. A Data Structure and an Algorithm for the Nearest Point Problem. IEEE Transactions on Software Engineering,1983,9(5):631-634
    [91]Uhlmann JK. Satisfying General Proximity/Similarity Queries with Metric Trees. Information Processing Letters,1991,40(4):175-179
    [92]Ruiz EV. An algorithm for finding nearest neighbours in (approximately) constant average time. Pattern Recognition Letters,1986,4(3):145-157
    [93]Vidal E. New formulation and improvements of the nearest-neighbour approximating and eliminating search algorithm (AESA). Pattern Recognition Letters,1994,15(1):1-7
    [94]Mico ML, Oncina J, Enrique Vidal, et al. A new version of the nearest-neighbour approximating and eliminating search algorithm (AESA) with linear preprocessing time and memory requirements. Pattern Recognitions Letters, 1994,15(1):9-17
    [95]Chavez E, Marroquin JL, Baeza-Yates R. Spaghettis:An Array Based Algorithm for Similarity Queries in Metric Spaces. SPIRE'99:Proceedings of the String Processing and Information Retrieval Symposium & International Workshop on Groupware. Washington, DC, USA:IEEE Computer Society,1999.38
    [96]Bozkaya T, Ozsoyoglu M. Indexing large metric spaces for similarity search queries. ACM Transactions on Database Systems,1999,24(3):361-404
    [97]Brin S. Near Neighbor Search in Large Metric Spaces. VLDB'95:Proceedings of the 21th International Conference on Very Large Data Bases. San Francisco, CA, USA:Morgan Kaufmann Publishers Inc,1995.574-584
    [98]Navarro G. Searching in Metric Spaces by Spatial Approximation. SPIRE'99: Proceedings of the String Processing and Information Retrieval Symposium & International Workshop on Groupware. Washington, DC, USA:IEEE Computer Society,1999.141
    [99]Navarro G. Searching in metric spaces by spatial approximation. The VLDB Journal—The International Journal on Very Large Data Bases,2002,11(1): 28-46
    [100]Ciaccia P, Patella M, Zezula P. M-tree:An Efficient Access Method for Similarity Search in Metric Spaces. VLDB'97:Proceedings of the 23rd International Conference on Very Large Data Bases. San Francisco, CA, USA:Morgan Kaufinann Publishers Inc,1997.426-435
    [101]Gennaro C, Savino P, Zezula P. Similarity search in metric databases through hashing. MULTIMEDIA '01 Proceedings of the 2001 ACM workshops on Multimedia:multimedia information retrieval. NY, USA:ACM Press,2001.1-5
    [102]Dohnal V, Gennaro C, Savino P. D-Index:Distance Searching Index for Metric Data Sets. Multimedia Tools and Applications,2003,21(1):9-33
    [103]Dohnal V. d-index:Indexing Structures for Searching in Metric Spaces:[PH.D. THESIS]. Brno, Czech Republic:Faculty of Informatics Masaryk University, 2004
    [104]Falchi F, Gennaro C, Zezula P. A content-addressable network for similarity search in metric spaces. DBISP2P'05/06 Proceedings of the 2005/2006 international conference on Databases, information systems, and peer-to-peer computing. Berlin, Heidelberg:Springer-Verlag,2007.98-110
    [105]Falchi F. A Content-Addressable Network for Similarity Search in Metric Spaces: [PH.D. THESIS]. Brno, Czech Republic:Faculty of Informatics Masaryk University,2007
    [106]Jagadish HV, Ooi BC, Tan KL, et al. iDistance:An Adaptive B+-Tree Based Indexing Method for Nearest Neighbor Search. ACM Transactions on Database Systems (TODS),2005,30(2):364-397
    [107]Novak D, Batko M, Zezula P. Metric Index:An efficient and scalable solution for precise and approximate similarity search. Information Systems,2011,36(4): 721-733
    [108]Novak D, Batko M. Metric Index:An Efficient and Scalable Solution for Similarity Search. SISAP '09 Proceedings of the 2009 Second International Workshop on Similarity Search and Applications. Washington, DC, USA:IEEE Computer Society,2009.65-73
    [109]Novak D, Zezula P. M-Chord:a scalable distributed similarity search structure. InfoScale '06 Proceedings of the 1st international conference on Scalable information systems. New York, NY, USA:ACM,2006
    [110]Tao YF, Yi K, Sheng C. Quality and efficiency in high dimensional nearest neighbor search. SIGMOD '09 Proceedings of the 35th SIGMOD international conference on Management of data. NY, USA:ACM,2009.563-576
    [111]Diefendorff K, Dubey DK. How Multimedia Workloads Will Change Processor Design. IEEE Computer,1997,30(9):43-45
    [112]魏芳,李学明.H.264中变换和量化的SIMD优化.计算机工程与应用,2004,(17)：24-27
    [113]林进,张兆庆,祝明发.基于SIMD机器的优化数据传输的并行循环分割.计算机学报,1998,21(7)：577-585
    [114]Senda Y, Harasaki H. A realtime software MPEG transcoder using a novel motion vector reuse and a SIMD optimization techniques. ICASSP '99 Proceedings of the Acoustics, Speech, and Signal Processing. Washington, DC, USA:IEEE Computer Society,1999.2359-2362
    [115]Stephenson M, Babb J, Amarasinghe S. Bitwidth Analysis with Application to Silicon Compilation. ACM SIGPLAN conference on Programming Language Design and Implementation. British Columbia:Vancouver.2000.108-120
    [116]Bik AJC, Girkae M, Grey PM, et al. Automatic Intra-Register Vectorization for Intel Architecture. International Journal of Parallel Programming,2002,30(2): 65-98
    [117]Jiang WH, Zhu JH, Zang BY, et al. Boosting the performance of multimedia applications by using SIMD instructions. Proceedings of 14th International Conference on Compiler Construction (CC). Edinburgh:Springer-Verlag, 2005.59-75
    [118]Intel Corporation. The ia-32 intel architecture software developer's manual[EB/OL]. http://www.intel.com/content/www/us/en/processors/architect ures-software-developer-manuals.html,2001
    [119]Intel corporation. Intel vtune performance analyzer [EB/OL]. http://www.intel.com/software/products/vtune/,2010-04-25
    [120]Intel corporation._mm_movemask_epi8[EB/OL]. http://msdn.microsoft.com /en-us/library/s090c8fk%28VS.80%29.aspx/,2009-04-22
    [121]Intel corporation. Pentium(?) Processors with MMXTM Technology [EB/OL]. http://msdn.microsoft.com/en-us/library/698bxz2w(v=vs.80).aspx, 2011-03-11
    [122]Nvidia corporation. OpenCL Programming for the CUD A Architecture[EB/OL]. http://www.nvidia.com/content/cudazone/download/Ope nCL/NVIDIA_OpenCL_ProgrammingOverview.pdf,2009-08-31
    [123]Govindaraju NK, Lloyd B, Wang W, et al. Fast computation of database operations using graphics processors. SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data. New York, NY, USA:ACM,2005.215-226
    [124]Govindaraju N, Gray J, Kumar R, et al. Gputerasort:high performance graphics co-processor sorting for large database management. Proceedings of the 2006 ACM SIGMOD international conference on Management of data. New York, NY, USA:ACM,2006.325-336

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700