分布式关系数据库上的关键字查询

英文题名：Keyword Search on Distributed Relational Databases
作者：李一罡
论文级别：硕士
学科专业名称：计算机科学与技术
中文关键词：分布式数据库 ; 关键字 ; top-k
英文关键词：distributed databases ; keyword search ; top-k
学位年度：2009
导师：高宏
学科代码：081202
学位授予单位：哈尔滨工业大学
论文提交日期：2009-06-01

摘要

在信息飞速增长的时代,分布式数据库成为大型企业存储信息的首选方式,方便快速的查询关系数据成为一个科研难题。随着网络技术和搜索技术的兴起,关键字查询与传统的SQL查询相比,显示出巨大的优势。首先用户不需要知道数据库的模式信息;其次用户不需要掌握复杂的数据库查询语言,如SQL等。如何将关键字查询技术运用到分布式数据库上就变得格外重要。本文主要研究分布式数据库上的关键字查询问题。
     本文首先提出单数据库上的关键字查询算法。该算法首先给出了一种新的相关性评价函数,新的评价函数重新定义了元组对关键字的包含关系,通过分析数据库模式与查询内容的语义信息来评价元组与查询关键字的相关性;接着基于新的评价函数,提出基于数据块迭代的TOP-K查询算法,该算法通过对未产生结果分值的估计有效的降低了算法的IO时间。
     本文接着在单数据库查询算法的基础上,提出了分布式数据库上的关键字查询算法。该算法首先给出了分布式数据库的数据模型,该模型之上关键字查询的结果定义以及适应于分布式环境的结果评价函数;接着提出扩展的连接表达式生成算法;为了降低分布式环境下查询的执行代价,设计了过滤无效查询的可达性索引以及索引的更新策略;最后给出了分布式环境下的TOP-K查询算法。
     基于以上提出的算法,设计并实现了真实分布式数据库环境下的关键字查询系统。该系统可以有效的支持单节点以及多节点上的查询。在该系统下,我们从多个角度设计了实验内容,实验结果表明本文算法在精确性和高效性都有所提高。
For the information overload problem is acute day by day, distributed database become the first choice for large enterprises. How to query the relational data from it in a convenient and efficient way turns to be an important research problem in the database area. With the popularization of networking and searching technique, keyword search shows great advantage comparing with the traditional SQL query. By using keyword search, the users don't need to know the schema of the database, neither to learn the complex query language such as SQL. Therefore, it is an essential job to apply keyword search onto the distributed databases. This paper mainly focuses on the keyword search problem on distributed relational databases.
     This paper first gives the semantic keyword search algorithm on single relational database. The algorithm adopts a new correlation ranking method which redefines containing relationship of the keywords. It evaluates the correlation between data tuple and query keyword by analyzing the semantic information of database schema and querying content. According the ranking function, we propose a top-k query algorithm based on iterating data block. This algorithm reduces execute time by estimating the score of candidate result.
     Based on the single database search algorithm, we then propose the keyword search algorithm on distributed relational databases. First we define the data model of the distributed relational database, the keyword search result and the result style. Then we give the ranking method of searching result. On the schema graph, we propose expending connection expression generation algorithm. In order to prune the invalid connection expression and reduce the cost of query executing, we design the reachability index and its update strategy. At last, we propose a top-k query processing algorithm in the distributed environment.
     In the end, by using the proposed algorithm, we design and implement a keyword search system in the real distributed databases environment. This system supports keyword search query on both single database and multi databases. We design a lot of test cases in all kinds of aspects to demonstrate our algorithms effectiveness and efficiency.

引文

1. Sanjay Agrawal, Surajit Chaudhuri, Gautam Das. DBXplorer: System for Keyword-Based Search over Relational Databases. Proceedings of 18th International Conference on Data Engineering, CA, United states, 2002:5-16
    2. Vagelis Hristidis, Yannis Papakonstantinou. DISCOVER. Keyword Search in Relational Databases. Proceedings of the 28th international conference on Very Large Data Bases, Hong Kong, China, 2002:670-681
    3. Vagelis Hristidis, Luis Gravano, Yannis Papakonstantinou. Efficient IR-Style Keyword Search over Relational Databases. Proceedings of the 29th international conference on Very large data bases, Berlin, Germany, 2003:850-861
    4. Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe. Keyword Searching and Browsing in Databases using BANKS. Proceedings of the 18th International Conference on Data Engineering, CA, United states, 2002:431-440
    5. Varun Kacholia, Shashank Pandit, Soumen Chakrabarti. Bidirectional Expansion For Keyword Search on Graph Databases. Proceedings of the 31st international conference on Very large data bases, Trondheim, Norway, 2005: 505-516
    6. Bolin Ding, Jeffrey Xu Yu, Shan Wang. Finding Top-k Min-Cost Connected Trees in Databases. Proceedings of the 23rd International Conference on Data Engineering (ICDE 2007),Istanbul,Turkey,2007:836-845
    7. Y.Luo, X.M.Lin, W.Wang,X.F.Zhou. SPARK: Top-k keyword query in relational databases. Proceedings of the 2007 ACM SIGMOD international conference on Management of data, Beijing, China, 2007:115-126
    8. Fang Liu, Clement T. Yu, Weiyi Meng, Abdur Chowdhury. Effective keyword search in relational databases. Proceedings of the 2006 ACM SIGMOD international conference on Management of data, Chicago, IL, USA, 2006:563-574
    9. Andrey Balmin, Vagelis Hristidis, Yannis Papakonstantinou. ObjectRank: Authority-Based Keyword Search in Databases. Proceedings of the Thirtieth international conference on Very large data bases, Toronto, Canada, 2004:564-575
    10. B. Kimelfeld and Y. Sagiv. Finding and Approximating Top-k Answers in Keyword Proximity Search. Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, Chicago, IL, USA, 2006:203-214
    11. Konstantin Golenberg, Benny Kimelfeld, Yehoshua Sagiv. Keyword proximity search in complex data graphs. Proceedings of the 2008 ACM SIGMOD international conference on Management of data, Vancouver, Canada, 2008:927-940
    12. Qi Su, Jennifer Widom. Indexing Relational Database Content Offline for Efficient Keyword-Based Search. Proceedings of the 9th International Database Engineering & Application Symposium, DC, USA, 2005:297-306
    13. Guoliang Li, Jianhua Feng, Lizhu Zhou. Retune: Retrieving and Materializing Tuple Units for Effective Keyword Search over Relational Databases. ER 2008:469-483
    14. Ken Q. Pu, Xiaohui Yu: FRISK. Keyword Query Cleaning and Processing in Action. Proceedings of the VLDB Endowment , Vancouver, Canada,2008:1531-1534
    15. Manish Bhide, Venkatesan T. Chakaravarthy, Krithi Ramamritham, Prasan Roy. Keyword Search over Dynamic Categorized Information. ICDE 2009:258-269
    16. Sandeep Tata, Guy M. Lohman. SQAK: doing more with keywords. Proceedings of the 2008 ACM SIGMOD international conference on Management of data, Vancouver, Canada, 2008:889-902
    17. Mayssam Sayyadian, Hieu LeKhac, AnHai Doan, Luis Gravano. Efficient Keyword Search Across Heterogeneous Relational Databases. ICDE 2007:346-355
    18. Guoliang Li, Beng Chin Ooi, Jianhua Feng, Jianyong Wang, Lizhu Zhou. EASE: Efficient and Adaptive KeywordSearch on Unstructured, Semi-structured and Structured Data. Proceedings of the 2008 ACM SIGMOD international conference on Management of data, Vancouver, Canada, 2008:1289-1300
    19. Guoliang Li, Jianhua Feng, Jianyong Wang, Lizhu Zhou. An effective and versatile keyword search engine on heterogenous data sources. Proceedings ofthe VLDB Endowment, 1(2). 2008:1452-1455
    20. Bei Yu, Guoliang Li, Karen R. Sollins, Anthony K. H. Tung. Effective keyword-based selection of relational databases. Proceedings of the 2007 ACM SIGMOD international conference on Management of data, Beijing, China, 2007:139-150
    21. Quang Hieu Vu, Beng Chin Ooi, Dimitris Papadias, Anthony K. H. Tung. A graph method for keyword-based selection of the top-K databases. Proceedings of the 2008 ACM SIGMOD international conference on Management of data, Vancouver, Canada, 2008:915-926
    22. Guoliang Li, Jianhua Feng, Jianyong Wang, Xiaoming Song, Lizhu Zhou. Sailer: an effective search engine for unified retrieval of heterogeneous xml and web documents. Proceeding of the 17th international conference on World Wide Web, Beijing, China,2008:1061-1062
    23. Jungkee Kim, Geoffrey Fox. Scalable Hybrid Search on Distributed Databases. International Conference on Computational Science 2005:431-438
    24. Mohammad Hassan, Reda Alhajj, Mick J. Ridley, Ken Barker. Simplified access to structured databases by adapting keyword search and database selection. Proceedings of the 2004 ACM symposium on Applied computing, Nicosia, Cyprus, 2004:674-678
    25. Bhavana Bharat Dalvi, Meghana Kshirsagar, S. Sudarshan. Keyword search on external memory data graphs. Proceedings of the VLDB Endowment, 2008:1189-1204
    26. Thanh Tran, Haofen Wang, Sebastian Rudolph, Philipp Cimiano. Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-Shaped (RDF) Data. Proceedings of the 2009 IEEE International Conference on Data Engineering, 2009:405-416
    27. Alexander Markowetz, Yin Yang, Dimitris Papadias. Keyword search on relational data streams. Proceedings of the 2007 ACM SIGMOD international conference on Management of data, Beijing, China, 2007:605-616
    28. Vagelis Hristidis, Oscar Valdivia, Michail Vlachos, Philip S. Yu. Continuous keyword search on multiple text streams. Proceedings of the 15th ACM international conference on Information and knowledge management, Arlington, Virginia, USA, 2006:802-803
    29. Ian De Felipe, Vagelis Hristidis, Naphtali Rishe. Keyword Search on Spatial Databases. ICDE 2008:656-665
    30. Dongxiang Zhang, Yeow Meng Chee, Anirban Mondal, Anthony K. H. Tung, Masaru Kitsuregawa. Keyword Search in Spatial Databases Proceedings of the 29th International Conference on Data Engineering, 2009:688-699
    31. Lin Guo, Feng Shao, Chavdar Botev, Jayavel Shanmugasundaram. XRANK: Ranked Keyword Search over XML Documents. Proceedings of the 2003 ACM SIGMOD international conference on Management of data, San Diego, California, 2003:16-27
    32. Vagelis Hristidis, Yannis Papakonstantinou, Andrey Balmin. Keyword Proximity Search on XML Graphs. ICDE 2003:367-378
    33. Vagelis Hristidis, Nick Koudas, Yannis Papakonstantinou, Divesh Srivastava. Keyword Proximity Search in XML Trees. IEEE Transactions on Knowledge and Data Engineering(TKDE) 18(4):525-539 (2006)
    34. Sara Cohen, Jonathan Mamou, Yaron Kanza, Yehoshua Sagiv: XSEarch. A Semantic Search Engine for XML. Proceedings of the 29th international conference on Very large data bases, Berlin, Germany, 2003:45-56
    35. Yu Xu, Yannis Papakonstantinou. Efficient Keyword Search for Smallest LCAs in XML Databases. Proceedings of the 2005 ACM SIGMOD international conference on Management of data, Baltimore, Maryland, 2005:537-538
    36. Ziyang Liu, Yi Chen. Identifying meaningful return information for XML keyword search. Proceedings of the 2007 ACM SIGMOD international conference on Management of data, Beijing, China, 2007:329-340
    37. Ziyang Liu and Yi Chen. Reasoning and Identifying Relevant Matches for XML Keyword Search In 34th International Conference on Very Large Data Bases (VLDB) / Journal of Data Management Research, Vol. 1, 2008.
    38. Andrey Balmin, Vagelis Hristidis, Nick Koudas, Yannis Papakonstantinou. A System for Keyword Proximity Search on XML Databases. Proceedings of the 29th international conference on Very large data bases, Berlin, Germany, 2003:1069-1072
    39. Ziyang Liu, Yi Chen. Answering Keyword Queries on XML Using Materialized Views. ICDE 2008:1501-1503
    40. Zhifeng Bao, Tok Wang Ling, Bo Chen. Effective XML Keyword Search withRelevance Oriented Ranking. Proceedings of the 29th International Conference on Data Engineering, 2009:517-528
    41. Feng Shao, Lin Guo, Chavdar Botev, Anand Bhaskar. Efficient Keyword Search over Virtual XML Views. Proceedings of the 33rd international conference on Very large data bases, Vienna, Austria, 2007:1057-1068
    42. Sihem Amer-Yahia, Irini Fundulaki, Laks V. S. Lakshmanan. Personalizing XML Search in PIMENTO. Proceedings of the 29th International Conference on Data Engineering, 2007:906-915
    43. Sihem Amer-Yahia, Nick Koudas, Amélie Marian. Structure and Content Scoring for XML. Proceedings of the 31st international conference on Very large data bases, Trondheim, Norway, 2005:361-372
    44. Yunyao Li, Cong Yu, H. V. Jagadish. Schema-Free XQuery. Proceedings of the Thirtieth international conference on Very large data bases, Toronto, Canada, 2004:72-83
    45. Fernando Farfán, Vagelis Hristidis, Anand Ranganathan. XOntoRank: Ontology-Aware Search of Electronic Medical Records. Proceedings of the 29th International Conference on Data Engineering, 2009:820-831
    46. Jungkee Kim, Geoffrey Fox. A Hybrid Keyword Search across Peer-to-Peer Federated Databases. ADBIS 2004:1023-1034
    47. Zhigang Chen, Zhongding Huang, Bo Ling, Jiang Li. P2P-Join: A Keyword Based Join Operation in Relational Database Enabled Peer-to-Peer Systems. Proceedings of the 17th International Conference on Database and Expert Systems Applications, Washington, DC, USA, 2006:637-641
    48. Hanhua Chen, Hai Jin, Jiliang Wang, Lei Chen. Efficient multi-keyword search over p2p web. Proceeding of the 17th international conference on World Wide Web, Beijing, China, 2008:989-998
    49. Y. Huhtala et al. TANE: An efficient algorithm for discovering functional and approximate dependencies. The Computer Journal,1999,42(2): 100-111
    50. T. Dasu, T. Johnson, S. Muthukrishnan, V. Shkapenyuk. Mining database structure; or, how to build a data quality browser. Proceedings of the 2002 ACM SIGMOD international conference on Management of data,. Madison, Wisconsin, 2002:435-447
    51.文继军,王珊. SEEKER:基于关键词的关系数据库信息检索.软件学报,2005,16(7):1270-1279.
    52.蔡宏艳,姚佳丽,王珊. DETECTOR:基于关系数据库通用的在线关键词查询系统.计算机研究与发展,2007,44(1):119-125.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700