Design and Implementation of the Data Preprocessing Subsystem in a Digital Library

Author: Tian Yanfang
Degree level: Master's
Discipline: Computer Science and Technology
Keywords: Digital Library ; Text Document ; Classification ; XML ; Metadata ; Fuzzy Clustering ; Naive Bayes
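Degree year: 2001
Advisor: Deng Shenglan
Discipline code: 081203
Degree-granting institution: National University of Defense Technology
Thesis submission date: 2001-12-01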

Abstract

The rapid development of computer networks has provided the technical foundation for disseminating and retrieving information. However, the volume of information to be stored and disseminated keeps growing, its types and forms keep diversifying, and it is updated ever more quickly, so existing models of resource management and use fall far short of what users need. The digital library, as a new-generation model for managing information resources on the Internet, has become one of the research hotspots in high-performance network information technology.
Existing digital library software platforms do not handle well the basic work that must be done before data is stored in the database, so this thesis presents a detailed design and implementation of that data preprocessing work. It first introduces the research background of digital libraries, the overall architecture of a digital library, and the structure of the data preprocessing subsystem. It then describes in detail the techniques used by each module of the subsystem and how they are implemented. The key techniques are: determining the classification standard; researching and implementing intelligent classification; determining the metadata; extracting text, images, and metadata; applying XML in the digital library; and implementing automatic loading of data into the database.
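
The abstract names intelligent classification as a key technique and the keywords mention naive Bayes, but this record does not describe the thesis's actual algorithm. As a rough illustration only, the following Python sketch shows a plain multinomial naive Bayes text classifier with add-one smoothing. The function names `train` and `classify`, the toy documents, and the two category labels are all assumptions made for this example, and the fuzzy-clustering component listed in the keywords is not modeled.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """docs: list of (tokens, label) pairs. Returns the counts used for prediction."""
    label_counts = Counter()            # number of training documents per label
    word_counts = defaultdict(Counter)  # per-label word frequencies
    vocab = set()
    for tokens, label in docs:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab


def classify(tokens, model):
    """Return the label with the highest log-posterior under add-one smoothing."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for tok in tokens:
            if tok not in vocab:
                continue  # skip words never seen in training
            # Add-one (Laplace) smoothed log likelihood of the word given the label.
            score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data, made up for illustration only.
training = [
    (["xml", "metadata", "schema"], "computing"),
    (["network", "protocol", "xml"], "computing"),
    (["poem", "novel", "author"], "literature"),
    (["novel", "essay", "poem"], "literature"),
]
model = train(training)
```

In a preprocessing pipeline of the kind the abstract outlines, a call such as `classify(["xml", "schema"], model)` would assign an incoming document to a category before it is loaded into the database. Real Chinese-language input would first need word segmentation, which this sketch leaves out.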

