Design and Implementation of an Intelligent Search Engine for Color Ring Back Tones
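Abstract
Color Ring Back Tone (CRBT) is a service, subscribed to by the called (or calling) party, that plays a pleasant piece of music or a greeting to the caller in place of the ordinary ringback tone. After subscribing to the CRBT service, users can set their own personalized ringback tones; when they are called, the caller hears the customized music or recording instead of the ordinary ringback tone.
     In recent years, with the rapid growth of the CRBT service, the number of tones on CRBT platforms has increased daily. Users now face tens of thousands of tones, and the bewildering variety produced by tone providers leaves them at a loss and makes selection difficult. The traditional tone lookup methods offered by the existing access channels no longer meet users' needs. Meanwhile, the wave of search innovation led by the search giant Google has advanced the field rapidly: word segmentation, indexing, and ranking algorithms keep emerging, open-source search engine tools such as Lucene and Nutch have appeared, and search technology has steadily matured.
     Vertical search is currently one of the key development directions in the search field. It is a specialization and extension of the search engine: it consolidates a particular class of specialized information from the web page repository, extracts the required data field by field, processes it, and returns it to the user in some form. The biggest difference between a vertical search engine and a general web search engine is that the former performs structured information extraction on web page content, turning unstructured data into specific structured data. Web search takes the web page as its smallest unit, whereas vertical search takes the structured record as its smallest unit. The extracted data is then stored in a database for further processing.
     The CRBT intelligent search engine presented in this thesis applies existing search technology to build an efficient, intelligent vertical search engine for the CRBT platform. Chapter 1, the introduction, briefly reviews the current state of vertical search engines. Chapter 2 gives an overview of the CRBT platform and analyzes its characteristics in terms of network architecture, data, and access methods. Chapter 3 introduces the key technologies used in search engines today, together with future trends. Chapter 4 is one of the focal points of the thesis: after a statistical analysis of the data on the CRBT platform, it examines the feasibility of applying search engine technology to the platform and states the capabilities the target system should have. It then designs the search flow for each search method and, after a full analysis of system functions, proposes a fairly detailed system framework design and defines the interaction protocols with external functional entities. Chapter 5 explains the key techniques used in CRBT intelligent search, including word segmentation, fuzzy matching, and weighting algorithms. The SKM algorithm, a fuzzy matching algorithm developed for the data characteristics of the CRBT platform, is discussed in detail in that chapter. Section 3 of the chapter focuses on the distinctive weighting algorithm used when ranking search results and describes in detail how weights are computed for single characters, keywords, tones, and other objects. Chapter 6 uses existing test data to compare the efficiency of the algorithm with that of known algorithms and discusses its performance in detail.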
Color Ring Back Tone (CRBT) is a service, customized by the called user, that provides pleasant music or a greeting in place of the ordinary ringback tone. After registering for the CRBT service, customers can set their own personalized ringback tones, which are played to the caller instead of the ordinary ringback tone whenever they are called.
     In recent years, with the rapid development of the CRBT service, the number of ring tones on the CRBT platform has kept growing. Tens of thousands of tones now confront the user, and the sheer variety produced by individual tone providers makes selection increasingly difficult. The search approaches available through all existing access methods can no longer meet users' needs. On the other hand, search innovation led by the search giant Google has driven rapid progress in search technology: word segmentation, indexing, and ranking algorithms are constantly emerging, open-source search engine tools represented by Lucene and Nutch have appeared, and search technology is maturing.
     Vertical search is one of the key development directions in search technology. It is a specialized extension of the search engine that integrates certain types of specialized information from web pages, extracts the required data by field, processes it, and returns it to the user in some form. The biggest difference between vertical search engines and general web search engines is that the former extract structured information from web pages, converting unstructured data into specific structured data. For a web search engine the web page is the smallest unit, while for a vertical search engine the smallest unit is structured data. The data is then stored in a database for further processing.
     This thesis introduces the CRBT intelligent search engine, a highly efficient and intelligent vertical search engine developed for the CRBT platform using existing search technology. Chapter one briefly describes the current state of vertical search engine development. Chapter two describes the CRBT platform as a whole from the viewpoints of network, data type, and access method. Chapter three presents the key technologies in the search engine field, as well as future development trends. Chapter four is one of the focal points of the thesis: after a statistical analysis of CRBT data, it studies the feasibility of using search engine technology on the CRBT platform and sets out the capabilities the target system should have. It then designs the search processes for the different access methods and, following a comprehensive analysis of the system, proposes a fairly detailed system framework design and defines the interaction protocols with external functional entities. Chapter five focuses on the key techniques of CRBT intelligent search: word segmentation, fuzzy matching, and weighting algorithms. The SKM algorithm, a fuzzy matching algorithm developed for the characteristics of CRBT data, is described in detail. The third section of that chapter focuses on the distinctive weighting algorithm used to rank search results, with a detailed account of how the weights of single characters, keywords, ring tones, and other objects are calculated. Chapter six uses existing test data to compare the efficiency of the algorithm with well-known algorithms and discusses its performance in detail.
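The abstract's distinction between page-level and record-level search, and its mention of fuzzy matching, can be illustrated with a small sketch. It is not the thesis's method: the SKM algorithm and the weighting scheme are not specified in the abstract, so the example substitutes the classic Levenshtein edit distance and ranks only by that distance. The `RingTone` fields, the `search` function, and the sample records are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RingTone:
    # Hypothetical fields; the abstract does not give the platform's schema.
    tone_id: str
    title: str
    artist: str


def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance by dynamic programming (a stand-in, not SKM)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def search(tones, query, field="title", max_distance=1):
    """Return records whose `field` is within `max_distance` edits of `query`,
    closest match first. The structured record, not a web page, is the unit."""
    scored = []
    for tone in tones:
        d = edit_distance(query.lower(), getattr(tone, field).lower())
        if d <= max_distance:
            scored.append((d, tone))
    scored.sort(key=lambda pair: pair[0])
    return [tone for _, tone in scored]


if __name__ == "__main__":
    catalog = [
        RingTone("001", "Moonlight", "Artist A"),
        RingTone("002", "Moonlit", "Artist B"),
        RingTone("003", "Sunrise", "Artist C"),
    ]
    # A misspelled query still finds the intended title.
    print([t.tone_id for t in search(catalog, "moonlght")])  # ['001']
```

Because each record carries named fields, the same function can match on `artist` instead of `title` by changing the `field` argument, which is the practical benefit of searching structured data rather than whole pages.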
