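Research on Data Stream Management Techniques Based on User Behavior Mining

Author: Li Jun (李军)
Degree level: Doctoral
Discipline: Computer Science and Technology
Chinese keywords: 流分析; 流管理; 用户行为; 集成学习; 过滤器排序
English keywords: Data Stream Analysis; Data Stream Management; User Behavior; Ensemble Learning; Shared Filter Ordering
Degree year: 2012
Advisor: Fang Binxing (方滨兴)
Discipline code: 0812
Degree-granting institution: Beijing University of Posts and Telecommunications (北京邮电大学)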

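Abstract

Growing network security threats have given rise to a variety of defense mechanisms, most of which must analyze network data streams to detect policy violations and harmful information. Current data stream security management relies mainly on keyword analysis and does not fully account for the network context in which a keyword appears, such as the content semantics of the page containing the keyword and the category of the user browsing that page. This dissertation therefore studies three factors that affect data stream security management: user behavior, content analysis, and management scheduling. The main contributions are:
     1. In the user dimension, a user behavior prediction model is proposed that lets a data stream security management system apply differentiated management by user category. Specifically, the model collects users' web click data and search behavior data and builds a predictive model that relates user behavior to user category. Compared with earlier user behavior analysis methods, the model offers the following innovations: (i) it builds a comprehensive behavior category system and behavior feature space and, drawing on probabilistic latent semantic analysis, proposes a method for discovering users' latent behavioral tendencies that mines the tendency semantics in "user-behavior" co-occurrences; (ii) because tendencies alone have weak descriptive power in the security management setting, it introduces a learning algorithm for the "tendency-category" mapping, together with a theoretical analysis of the information-transformation equivalence of that algorithm; (iii) it defines metrics and an evaluation procedure for the prediction results. Experiments show that the model accurately predicts users' behavior categories without requiring labeled users.
     2. In the stream content dimension, a fast multi-classifier content classification model is proposed. Each tuple arriving at high speed is judged jointly by multiple classifiers. This improves accuracy and stability but severely slows classification. The model therefore exploits the parts shared among the classifiers to speed up the decision. Specifically, two ensemble index structures (E-Tree and SVM-Index) are designed and shown theoretically to achieve sub-linear classification time (O(log N) and O(1), respectively). Experiments on public UCI datasets further confirm that prediction cost falls on average to roughly 25% and 3% of the original, respectively.
     3. For overall scheduling, data mining and machine learning methods are used to build adaptive filter ordering models. For relatively stable data stream environments, a hierarchical clustering ordering model, KHO, is proposed based on the K-means idea to improve the robustness of filter ordering. For non-stationary environments, an adaptive smoothing ordering model, AHES, is proposed based on exponential smoothing and hierarchical decision-making. These methods address the inability of current data stream filter ordering algorithms to adapt to the stream's context. Extensive experiments show that the proposed models deliver good performance and context awareness.
     4. Building on these key techniques, a user behavior data security management engine, IceStream, is designed and implemented, and the main functions and design rationale of its core modules are described in detail.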
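
The cost trade-off described in contribution 2 can be illustrated with a small sketch. It is not E-Tree or SVM-Index, whose structures are not given in this abstract. It only assumes, hypothetically, that each base classifier is a conjunction of threshold tests and that many tests recur across classifiers. Evaluating each distinct test once per record, rather than once per classifier, shows how shared parts can reduce prediction work. All names (`Test`, `Rule`, `predict_naive`, `predict_shared`) and the sample data are illustrative.

```python
from collections import Counter
from typing import Dict, NamedTuple, Sequence, Tuple


class Test(NamedTuple):
    """A threshold test `record[feature] > threshold` (hypothetical rule form)."""
    feature: int
    threshold: float


class Rule(NamedTuple):
    """One base classifier: vote `label` if every test passes, else `default`."""
    tests: Tuple[Test, ...]
    label: str
    default: str


def predict_naive(rules: Sequence[Rule], record: Sequence[float]) -> Tuple[str, int]:
    """Majority vote where every classifier re-evaluates its own tests."""
    votes: Counter = Counter()
    evaluations = 0
    for rule in rules:
        passed = True
        for t in rule.tests:
            evaluations += 1
            if not record[t.feature] > t.threshold:
                passed = False
                break
        votes[rule.label if passed else rule.default] += 1
    return votes.most_common(1)[0][0], evaluations


def predict_shared(rules: Sequence[Rule], record: Sequence[float]) -> Tuple[str, int]:
    """Same vote, but each distinct test is evaluated at most once per record."""
    cache: Dict[Test, bool] = {}
    votes: Counter = Counter()
    evaluations = 0
    for rule in rules:
        passed = True
        for t in rule.tests:
            if t not in cache:
                evaluations += 1
                cache[t] = record[t.feature] > t.threshold
            if not cache[t]:
                passed = False
                break
        votes[rule.label if passed else rule.default] += 1
    return votes.most_common(1)[0][0], evaluations


# Three toy classifiers that share tests on features 0, 1 and 2.
RULES = [
    Rule((Test(0, 0.5), Test(1, 10.0)), "block", "allow"),
    Rule((Test(0, 0.5), Test(2, 3.0)), "block", "allow"),
    Rule((Test(0, 0.5), Test(1, 10.0), Test(2, 3.0)), "block", "allow"),
]
RECORD = [0.9, 12.0, 1.0]

naive_label, naive_cost = predict_naive(RULES, RECORD)
shared_label, shared_cost = predict_shared(RULES, RECORD)
```

On this sample record both functions return the same vote, while the cached version performs 3 test evaluations instead of 7. The saving grows with the amount of overlap among the base classifiers, which is the property an ensemble index is meant to exploit.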

English abstract

Growing network threats pose a variety of security challenges that call for analysis of network data streams. Network behavior data streams contain not only large volumes of continuous data but also browsing and search behavior data. Analyzing user behavior data streams therefore involves three challenges: user information, content information, and system management. Specifically, the contributions of the dissertation are fourfold:
     1. From the user dimension, a new user behavior prediction model is proposed that categorizes users into different behavior categories. Specifically, the model collects user behavior data, including web click data and search keywords, to mine the relationship between user behavior and security class label. Compared with existing user behavior analysis models, the proposed method makes the following contributions: (i) it predicts users' behavior categories and uses probabilistic latent semantic analysis to discover the tendencies of user behaviors; (ii) it builds a mapping function between the user tendency label and the behavior label; (iii) it uses a new metric to measure the utility of the model. Experiments demonstrate that the model can accurately predict the user behavior class label without labeling expense.
     2. From the content dimension, a fast ensemble prediction model is proposed. The model uses multiple classifiers to predict the class label of each incoming stream record. Although ensembles are accurate and stable, prediction efficiency drops sharply as the number of base classifiers grows. We therefore propose an ensemble indexing method that uses the patterns shared among the base classifiers to reduce prediction cost. Specifically, two indexing models (E-Tree and SVM-Index) are proposed that achieve sub-linear time costs (O(log N) and O(1), respectively). Experiments on UCI data demonstrate that the models reduce prediction cost to about 25% and 3% of the original, respectively.
     3. From the system perspective, data mining and machine learning methods are used to construct an adaptive filter ordering framework. For stable stream environments, the K-means method is used to build a hierarchical ordering model, KHO, which improves the robustness of the filter ordering algorithm. For unstable data streams, a new model, AHES, is proposed based on hierarchical decision-making (AHP) and exponential smoothing. These two methods allow filter ordering to incorporate context information from the data stream. Experiments demonstrate the utility of the models.
     4. Based on the three key techniques, a new data stream management engine, IceStream, is designed. The key modules and functions of IceStream are introduced.
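
As a rough illustration of the idea behind contribution 3, the sketch below keeps an exponentially smoothed pass rate for each filter and reorders filters by the classic rank cost / (1 - pass rate), which minimizes expected per-tuple cost when filters are independent. It is not the AHES model itself, whose details are not given in this abstract. The class `SmoothedFilter`, the smoothing factor `alpha`, the cost values, and the toy stream are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SmoothedFilter:
    name: str
    cost: float                        # assumed per-tuple evaluation cost
    predicate: Callable[[dict], bool]  # True means the tuple passes
    pass_rate: float = 0.5             # smoothed estimate of P(tuple passes)

    def observe(self, passed: bool, alpha: float) -> None:
        """Exponential smoothing: new = alpha * observation + (1 - alpha) * old."""
        observation = 1.0 if passed else 0.0
        self.pass_rate = alpha * observation + (1.0 - alpha) * self.pass_rate

    def rank(self) -> float:
        """Lower rank runs first: cheap filters that drop many tuples."""
        drop_rate = max(1.0 - self.pass_rate, 1e-9)  # guard against division by zero
        return self.cost / drop_rate


def current_order(filters: List[SmoothedFilter]) -> List[str]:
    """Names of the filters in the order they would run right now."""
    return [f.name for f in sorted(filters, key=SmoothedFilter.rank)]


def process(filters: List[SmoothedFilter], tup: dict, alpha: float = 0.2) -> bool:
    """Run one tuple through the filters in rank order, stopping at the first drop."""
    for f in sorted(filters, key=SmoothedFilter.rank):
        passed = f.predicate(tup)
        f.observe(passed, alpha)  # only filters that actually ran are updated
        if not passed:
            return False
    return True


# Toy stream: the cheap keyword filter passes everything, the costly one drops everything.
filters = [
    SmoothedFilter("keyword", cost=1.0, predicate=lambda t: "bad" in t["text"]),
    SmoothedFilter("classifier", cost=5.0, predicate=lambda t: t["score"] > 0.5),
]
order_before = current_order(filters)
results = [process(filters, {"text": "bad page", "score": 0.1}) for _ in range(10)]
order_after = current_order(filters)
```

In this toy stream the expensive classifier moves to the front once its smoothed pass rate falls low enough, because the cheap keyword filter never drops anything. Note that statistics are updated only for filters that were evaluated, so the estimates are conditional on the filters ahead of them; a production system would need some form of sampling to keep all estimates fresh.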

引文

[1]Knuth D, Morris Jr J, Pratt V. Fast pattern matching in strings [J]. SIAM journal on computing,1977, 6:323.
    [2]Boyer R, Moore J. A fast string searching algorithm [J]. Communications of the ACM,1977,20 (10): 762-772.
    [3]Aho A, Corasick M. Efficient string matching:an aid to bibliographic search [J]. Communications of the ACM,1975,18 (6):333-340.
    [4]Wu S, Manber U. A fast algorithm for multi-pattern searching [R].1994.
    [5]Tsymbal A. The Problem of Concept Drift:Definitions and Related Work [R].2004.
    [6]Zhang P, Zhu X, Tan J, et al. Classifier and cluster ensembles for mining concept drifting data streams [C]. In Data Mining (ICDM),2010 IEEE 10th International Conference on,2010:1175-1180.
    [7]Zhang P, XZhu, Shi Y, et al. Robust Ensemble Learning for Mining Noisy Data Streams [J]. Decision Support Systems.2011,50 (2):469-479.
    [8]Babu S, Widom J. StreaMon:an adaptive engine for stream query processing [C]. In Proceedings of the 2004 ACM SIGMOD international conference on Management of data,2004:931-932.
    [9]Liu Z, Parthasarathy S. Ranganathan A, et al. Near-optimal algorithms for shared filter evaluation in data stream systems [C]. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data,2008:133-146.
    [10]Quinlan J. C4.5:Programs for Machine Learning [M]. Morgan Kaufmann Publishers,1993.
    [11]Muthukrishnan S. Data streams:Algorithms and applications [M]. Now Publishers Inc,2005.
    [12]Natsev A, Tesic J, Xie L, et al. IBM multimedia search and retrieval system [C]. In Proceedings of the 6th ACM international conference on Image and video retrieval,2007:645-645.
    [13]Munagala K, Babu S, Motwani R, et al. The pipelined set cover problem [J]. Database Theory-ICDT 2005,2005:83-98.
    [14]Munagala K, Srivastava U, Widom J. Optimization of continuous queries with shared expensive fil-ters [C]. In Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Prin-ciples of database systems,2007:215-224.
    [15]Babcock B, Babu S, Datar M, et al. Models and issues in data stream systems [C]. In Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, 2002:1-16.
    [16]Arvind D, Arasu A, Babcock B, et al. Stream:The stanford stream data manager [C]. In IEEE Data Engineering Bulletin,2003.
    [17]Arasu A, Babu S, Widom J. The CQL continuous query language:semantic foundations and query execution [J]. The VLDB Journalal The International Journal on Very Large Data Bases,2006,15 (2): 121-142.
    [18]Babcock B, Babu S, Datar M, et al. Operator scheduling in data stream systems [J]. The VLDB Journalal The International Journal on Very Large Data Bases,2004,13 (4):333-353.
    [19]Arasu A, Widom J. Resource sharing in continuous sliding-window aggregates [C]. In Proceedings of the Thirtieth international conference on Very large data bases-Volume 30,2004:336-347.
    [20]Babcock B, Datar M, Motwani R. Load shedding for aggregation queries over data streams [C]. In Data Engineering,2004. Proceedings.20th International Conference on,2004:350-361.
    [21]Abadi D, Carney D, Cetintemel U, et al. Aurora:a new model and architecture for data stream management [J]. The VLDB Journal,2003,12 (2):120-139.
    [22]Carney D, Cetintemel U, Cherniack M, et al. Monitoring streams:a new class of data management applications [C]. In Proceedings of the 28th international conference on Very Large Data Bases,2002: 215-226.
    [23]Abadi D, Ahmad Y, Balazinska M, et al. The design of the borealis stream processing engine [C]. In Second Biennial Conference on Innovative Data Systems Research (CIDR 2005). Asilomar. CA. 2005:277-289.
    [24]Ahmad Y, Berg B, Cetintemel U, et al. Distributed operation in the borealis stream processing engine [C]. In Proceedings of the 2005 ACM SIGMOD international conference on Management of data, 2005:882-884.
    [251 Hwang J, Balazinska M, Rasin A, et al. High-availability algorithms for distributed stream pro-cessing [C]. In Data Engineering,2005. ICDE 2005. Proceedings.21st International Conference on, 2005:779-790.
    [26]Balazinska M, Balakrishnan H, Madden S, et al. Fault-tolerance in the Borealis distributed stream processing system [J]. ACM Transactions on Database Systems (TODS),2008,33 (1):3.
    [27]Hwang J, Xing Y, Cetintemel U, et al. A cooperative, self-configuring high-availability solution for stream processing [C]. In Data Engineering,2007. ICDE 2007. IEEE 23rd International Conference on,2007:176-185.
    [28]Chandrasekaran S, Cooper O, Deshpande A, et al. TelegraphCQ:continuous dataflow processing [C]. In Proceedings of the 2003 ACM SIGMOD international conference on Management of data, 2003:668-668.
    [29]Madden S, Shah M, Hellerstein J, et al. Continuously adaptive continuous queries over streams [C]. In Proceedings of the 2002 ACM SIGMOD international conference on Management of data,2002: 49-60.
    [30]Urhan T, Franklin M. Dynamic pipeline scheduling for improving interactive query performance [C]. In Proceedings of the International Conference on Very Large Data Bases,2001:501-510.
    [31]Avnur R, Hellerstein J. Eddies:Continuously adaptive query processing [C]. In ACM SIGMoD Record,2000:261-272.
    [32]Cranor C, Gao Y, Johnson T, et al. Gigascope:High performance network monitoring with an SQL interface [C]. In Proceedings of the 2002 ACM SIGMOD international conference on Management of data,2002:623-623.
    [33]Cranor C, Johnson T, Spataschek O, et al. Gigascope:A stream database for network applications [C]. In Proceedings of the 2003 ACM SIGMOD international conference on Management of data, 2003:647-651.
    [34]Johnson T, Muthukrishnan S, Shkapenyuk V, et al. A heartbeat mechanism and its application in gigascope [C]. In Proceedings of the 31st international conference on Very large data bases,2005: 1079-1088.
    [35]Altinel M, Franklin M J. Efficient filtering of XML documents for selective dissemination of infor-mation [J]. Proc. of VLDB 2000.
    [36]Nguyen B, Abiteboul S, Cobena G, et al. Monitoring XML data on the web [C]. In ACM SIGMOD Record,2001:437-448.
    [37]Vaidya P, Lee J, Bowen F. et al. Symbiote:a reconfigurable logic assisted data streammanagement system (RLADSMS) [C]. In Proceedings of the 2010 international conference on Management of data,2010:1147-1150.
    [38]Bonnet P, Gehrke J, Seshadri P. Towards sensor database systems [C]. In Mobile Data Management, 2001:3-14.
    [39]Zhu Y, Shasha D. Statstream:Statistical monitoring of thousands of data streams in real time[C]. In Proceedings of the 28th international conference on Very Large Data Bases,2002:358-369.
    [40]Arasu A. Manku G. Approximate counts and quantiles over sliding windows [C]. In Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, 2004:286-296.
    [41]Babcock B, Datar M, Motwani R, et al. Maintaining variance and k-medians over data stream win-dows [C]. In Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems,2003:234-243.
    [42]Guha S, McGregor A. Approximate quantiles and the order of the stream [C]. In Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems,2006: 273-279.
    [43]Cormode G, Korn F, Muthukrishnan S, et al. Space-and time-efficient deterministic algorithms for biased quantiles over data streams [C]. In Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems,2006:263-272.
    [44]Zhang L, Guan Y. Variance estimation over sliding windows [C]. In Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems,2007:225-232.
    [45]Giannella C, Han J, Pei J, et al. Mining frequent patterns in data streams at multiple time granulari-ties [J]. Next generation data mining,2003,212:191-212.
    [46]Manku G, Motwani R. Approximate frequency counts over data streams [C]. In Proceedings of the 28th international conference on Very Large Data Bases,2002:346-357.
    [47]Chang J, Lee W. Finding recent frequent itemsets adaptively over online data streams [C]. In Pro-ceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining,2003:487-492.
    [48]Cormode G, Muthukrishnan S. What's hot and what's not:tracking most frequent items dynam-ically [C]. In Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems,2003:296-306.
    [49]Jin C, Qian W, Sha C, et al. Dynamically maintaining frequent items over a data stream [C]. In Proceedings of the twelfth international conference on Information and knowledge management, 2003:287-294.
    [50]Mozafari B, Thakkar H, Zaniolo C. Verifying and mining frequent patterns from large windows over data streams [C]. In Data Engineering,2008. ICDE 2008. IEEE 24th International Conference on, 2008:179-188.
    [51]Domingos P, Hulten G. Mining high-speed data streams [C]. In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining.2000:71-80.
    [52]Hulten G, Spencer L, Domingos P. Mining time-changing data streams [C]. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining,2001: 97-106.
    [53]Khan M, Ding Q, Perrizo W. k-nearest neighbor classification on spatial data streams using P-trees [J]. Advances in Knowledge Discovery and Data Mining.2002:517-528.
    [54]Aggarwal C. Han J. Wang J. et al. On demand classification of data streams [C]. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining,2004: 503-508.
    [55]Guha S, Mishra N, Motwani R. et al. Clustering data streams [C]. In Foundations of computer science,2000. proceedings,41st annual symposium on.2000:359-366.
    [56]Aggarwal C. A framework for diagnosing changes in evolving data streams [C]. In Proceedings of the 2003 ACM SIGMOD international conference on Management of data,2003:575-586.
    [57]Aggarwal C, Han J, Wang J, et al. A framework for projected clustering of high dimensional data streams [C]. In Proceedings of the Thirtieth international conference on Very large data bases-Volume 30,2004:852-863.
    [58]Nasraoui O, Rojas C. Robust clustering for tracking noisy evolving data streams [C]. In Proc.2006 SIAM Conf. on Data Mining (SDM 2006),2006:80-99.
    [59]Cao F, Ester M, Qian W, et al. Density-based clustering over an evolving data stream with noise [C]. In Proceedings of the 2006 SIAM International Conference on Data Mining,2006:328-339.
    [60]Zhou A, Cao F, Yan Y, et al. Distributed data stream clustering:A fast EM-based approach [C]. In Data Engineering,2007. ICDE 2007. IEEE 23rd International Conference on,2007:736-745.
    [61]Chen Y, Tu L. Density-based clustering for real-time stream data [C]. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining,2007:133-142.
    [62]Subramaniam S, Palpanas T, Papadopoulos D, et al. Online outlier detection in sensor data using non-parametric models [C]. In Proceedings of the 32nd international conference on Very large data bases,2006:187-198.
    [63]Angiulli F, Fassetti F. Detecting distance-based outliers in streams of data [C]. In Proceedings of the sixteenth ACM conference on Conference on information and knowledge management,2007: 811-820.
    [64]Pokrajac D, Lazarevic A, Latecki L. Incremental local outlier detection for data streams [C]. In Computational Intelligence and Data Mining,2007. CIDM 2007. IEEE Symposium on,2007:504-515.
    [65]Kleinberg J. Bursty and hierarchical structure in streams [J]. Data Mining and Knowledge Discov-ery,2003,7 (4):373-397.
    [66]Zhu Y, Shasha D. Efficient elastic burst detection in data streams [C]. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining,2003:336-345.
    [67]Fung G, Yu J, Yu P, et al. Parameter free bursty events detection in text streams [C]. In Proceedings of the 31st international conference on Very large data bases,2005:181-192.
    [68]Siebes A, Vreeken J, van Leeuwen M. Item sets that compress [C]. In SDM,2006:393-404.
    [69]Wang X, Zhai C, Hu X, et al. Mining correlated bursty topic patterns from coordinated text streams [C]. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining,2007:784-793.
    [70]Metwally A, Agrawal D, El Abbadi A. Efficient computation of frequent and top-k elements in data streams [J]. Database Theory-ICDT 2005.2005:398-412.
    [71]Das G. Gunopulos D, Koudas N, et al. Ad-hoc top-k query answering for data streams [C]. In Proceedings of the 33rd international conference on Very large data bases,2007:183-194.
    [72]Lin X, Yuan Y, Wang W, et al. Stabbing the sky:Efficient skyline computation over sliding windows [C]. In Data Engineering,2005. ICDE 2005. Proceedings.21st International Conference on,2005: 502-513.
    [73]Tao Y, Papadias D. Maintaining sliding window skylines on data streams [J]. Knowledge and Data Engineering, IEEE Transactions on,2006,18 (3):377-391.
    [74]Gao L, Yao Z, Wang X. Evaluating continuous nearest neighbor queries for streaming time series via pre-fetching [C]. In Proceedings of the eleventh international conference on Information and knowledge management,2002:485-492.
    [75]Liu X, Ferhatosmanoglu H. Efficient k-NN search on streaming data series [J]. Advances in Spatial and Temporal Databases,2003:83-101.
    [76]Koudas N, Ooi B, Tan K, et al. Approximate NN queries on streams with guaranteed er-ror/performance bounds [C]. In Proceedings of the Thirtieth international conference on Very large data bases-Volume 30,2004:804-815.
    [77]Bohm C, Ooi B, Plant C, et al. Efficiently processing continuous k-nn queries on data streams [C]. In Data Engineering,2007. ICDE 2007. IEEE 23rd International Conference on,2007:156-165.
    [78]Sakurai Y, Faloutsos C, Yamamuro M. Stream monitoring under the time warping distance [C]. In Data Engineering,2007. ICDE 2007. IEEE 23rd International Conference on,2007:1046-1055.
    [79]Kwon D, Lee S, Lee S. Indexing the current positions of moving objects using the lazy update R-tree [C]. In Mobile Data Management,2002. Proceedings. Third International Conference on,2002: 113-120.
    [80]Lee M, Hsu W, Jensen C, et al. Supporting frequent updates in R-trees:a bottom-up approach [C]. In Proceedings of the 29th international conference on Very large data bases-Volume 29,2003:608-619.
    [81]Lin B, Su J. Handling frequent updates of moving objects [C]. In Proceedings of the 14th ACM international conference on Information and knowledge management,2005:493-500.
    [82]Xiong X, Aref W. R-trees with update memos [C]. In Data Engineering,2006. ICDE'06. Proceed-ings of the 22nd International Conference on.2006:22-22.
    [83]Biveinis L, Saltenis S, Jensen C. Main-memory operation buffering for efficient R-tree update [C]. In Proceedings of the 33rd international conference on Very large data bases,2007:591-602.
    [84]Gagie T. Bounds for Compression in Streaming Models [J]. Arxiv preprint arXiv:0711.3338,2007.
    [85]Cocci R, Tran T, Diao Y, et al. Efficient data interpretation and compression over RFID streams [C[. In Data Engineering.2008. ICDE 2008. IEEE 24th International Conference on,2008:1445-1447.
    [86]Zhou A. Cai Z, Wei L, et al. M-Kernel merging:Towards density estimation over data streams [C]. In Database Systems for Advanced Applications.2003.(DASFAA 2003). Proceedings. Eighth International Conference on,2003:285-292.
    [87]Procopiuc C, Procopiuc O. Density estimation for spatial data streams [J]. Advances in Spatial and Temporal Databases,2005:923-923.
    [88]Heinz C, Seeger B. Cluster kernels:Resource-aware kernel density estimators over streaming data [J]. Knowledge and Data Engineering, IEEE Transactions on,2008,20 (7):880-893.
    [89]Quinlan J. Induction of decision trees [J]. Machine learning,1986,1 (1):81-106.
    [90]Quinlan J. Generating production rules from decision trees [C]. In Proceedings of the Tenth Inter-national Joint Conference on Artificial Intelligence,1987:304-307.
    [91]Quinlan J. Simplifying decision trees [J]. International journal of man-machine studies,1987,27 (3):221-234.
    [92]Quinlan. The ID3 Algorithm [J]. Available online:http://www.cise.ufl.edu/ddd/cap6635/Fall-97/Short-papers/2. htm.
    [93]Quinlan J. C4.5:programs for machine learning [M]. Morgan kaufmann,1993.
    [94]Vapnik V. The nature of statistical learning theory [M]. Springer-Verlag New York Inc,2000.
    [95]Vapnik V Statistical learning theory.1998.
    [96]Vapnik V, Golowich S, Smola A. Support vector method for function approximation, regression estimation, and signal processing [C]. In Advances in Neural Information Processing Systems 9, 1996.
    [97]Bottou L, Cortes C, Denker J, et al. Comparison of classifier methods:a case study in handwritten digit recognition [C]. In Pattern Recognition,1994. Vol.2-Conference B:Computer Vision & Image Processing., Proceedings of the 12th IAPR International. Conference on,1994:77-82.
    [98]Friedman J. Another approach to polychotomous classifcation [R].1996.
    [99]Knerr S, Personnaz L, Dreyfus G, et al. Single-layer learning revisited:A stepwise procedure for building and training a neural network [J]. Optimization Methods and Software,1990,1:23-34.
    [100]Kim H, Pang S, Je H, et al. Constructing support vector machine ensemble [J]. Pattern recognition, 2003,36(12):2757-2767.
    [101]Platt J, Cristianini N, Shawe-Taylor J. Large margin DAGs for multiclass classification [J]. Ad-vances in neural information processing systems,2000,12 (3):547-553.
    [102]Takahashi F, Abe S. Decision-tree-based multiclass support vector machines [C]. In Neural In-formation Processing,2002. ICONIP'02. Proceedings of the 9th International Conference on,2002: 1418-1422.
    [103]Bose R, Ray-Chaudhuri D. On a class of error correcting binary group codes* [J]. Information and control,1960,3(1):68-79.
    [104]Weston J, Watkins C. Multi-class support vector machines [R].1998.
    [105]John S, Rakesh A, Manish M, et al. SPRINT:A Scalable Parallel Classifier for Data Mining [J].
    [106]Gehrke J, Ganti V, Ramakrishnan R, et al. BOATaloptimistic decision tree construction [C]. In ACM SIGMOD Record,1999:169-180.
    [107]Gehrke J, Ramakrishnan R, Ganti V. RainForestala framework for fast decision tree construction of large datasets [J]. Data Mining and Knowledge Discovery,2000,4 (2):127-162.
    [108]Chen R, Sivakumar K, Kargupta H. Distributed web mining using Bayesian networks from multiple data streams [C]. In Data Mining,2001. ICDM 2001, Proceedings IEEE International Conference on, 2001:75-82.
    [109]Chen R, Sivakumar K, Kargupta H. An approach to online Bayesian learning from multiple data streams [C]. In Proceedings of Workshop on Mobile and Distributed Data Mining, PKDD,2001: 31-45.
    [110]Gao J, Fan W, Han J. On appropriate assumptions to mine data streams:Analysis and practice [C]. In Data Mining,2007. ICDM 2007. Seventh IEEE International Conference on,2007:143-152.
    [111]Guha S, Gunopulos D, Koudas N. Correlating synchronous and asynchronous data streams [C]. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining,2003:529-534.
    [112]Syed N, Liu H, Sung K. Handling concept drifts in incremental learning with support vector ma-chines [C]. In Proceedings of the fifth ACM SIGKDD international conference on Knowledge dis-covery and data mining,1999:317-321.
    [113]Wang H, Fan W, Yu P, et al. Mining concept-drifting data streams using ensemble classifiers [C]. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining,2003:226-235.
    [114]Street W, Kim Y. A streaming ensemble algorithm (SEA) for large-scale classification [C]. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining,2001:377-382.
    [115]Kolter J, Maloof M. Dynamic weighted majority:A new ensemble method for tracking concept drift [C]. In Data Mining,2003. ICDM 2003. Third IEEE International Conference on,2003:123-130.
    [116]Wang H, Yin J, Pei J, et al. Suppressing model overfitting in mining concept-drifting data streams [C]. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining,2006:736-741.
    [117]Kolter J, Maloof M. Using additive expert ensembles to cope with concept drift [C]. In Proceedings of the 22nd international conference on Machine learning,2005:449-456.
    [118]Hansen L. Liisberg C, Salamon P. Ensemble methods for handwritten digit recognition [C]. In Neural Networks for Signal Processing [1992] Ⅱ., Proceedings of the 1992 IEEE-SP Workshop, 1992:333-342.
    [119]Krogh P. Learning with ensembles:How over-fitting can be useful [C]. In Proceedings of the 1995 conference,1996:190.
    [120]Aha D. UC Irvine Machine Learning Repository [J]. Available online:http://archive.ics.uci.edu/ml/.
    [121]Gutta S, Wechsler H. Face recognition using hybrid classifiers [J]. Pattern Recognition.1997,30 (4):539-553.
    [122]Collobert R, Bengio S, Bengio Y, A parallel mixture of SVMs for very large scale problems [J]. Neural computation,2002,14 (5):1105-1114.
    [123]Liu X. Hall L. Bowyer K. Comments on arA Parallel Mixture of SVMs for Very Large Scale Problemsas [J]. Neural computation,2004,16 (7):1345-1351.
    [124]Street W N, Kim Y. A Streaming Ensemble Algorithm (SEA) for Large-Scale Classification [J|. Proc. of KDD 2001.
    [125]Zhang P, Zhu X, Shi Y. Categorizing and mining concept drifting data streams [C]. In Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining,2008: 812-820.
    [126]Dong Y, Han K. Boosting SVM classifiers by ensemble [C]. In Special interest tracks and posters of the 14th international conference on World Wide Web,2005:1072-1073.
    [127]Breiman L. Bagging predictors [J]. Machine learning,1996,24 (2):123-140.
    [128]Freund Y, Schapire R. A desicion-theoretic generalization of on-line learning and an application to boosting [C]. In Computational learning theory,1995:23-37.
    [129]Freund Y, Schapire R. Experiments with a new boosting algorithm [C]. In MACHINE LEARNING-INTERNATIONAL WORKSHOP THEN CONFERENCE-,1996:148-156.
    [130]Freund Y, Schapire R, Abe N. A short introduction to boosting [J]. Journal-Japanese Society For Artificial Intelligence,1999,14(771-780):1612.
    [131]Schapire R, Singer Y. Improved boosting algorithms using confidence-rated predictions [J]. Ma-chine learning,1999,37 (3):297-336.
    [132]Srihari S. Reliability analysis of majority vote systems [J]. Information Sciences,1982,26 (3): 243-256.
    [133]Pampel F. Logistic regression:A primer [M]. Sage Publications, Inc,2000.
    [134]Opitz D, Shavlik J. Actively searching for an effective neural network ensemble [J]. Connection Science,1996,8 (3-4):337-354.
    [135]Kim D, Kim C. Forecasting time series with genetic fuzzy predictor ensemble [J]. Fuzzy Systems, IEEE Transactions on,1997,5 (4):523-535.
    [136]Xu L, Krzyzak A, Suen C. Methods of combining multiple classifiers and their applications to handwriting recognition [J]. Systems, Man and Cybernetics, IEEE Transactions on,1992,22 (3): 418-435.
    [137]Sugeno M, of Technology T I. Theory of fuzzy integrals and its applications [M]. Tokyo Institute of Technology,1974.
    [138]Murofushi T, Sugeno M. An interpretation of fuzzy measures and the Choquet integral as an inte-gral with respect to a fuzzy measure[J]. Fuzzy sets and Systems.1989,29 (2):201-227.
    [139]Tahani H, Keller J. Information fusion in computer vision using the fuzzy integral [J]. Systems, Man and Cybernetics, IEEE Transactions on,1990.20 (3):733-741.
    [140]Kuncheva L, Bezdek J, Duin R. Decision templates for multiple classifier fusion:an experimental comparison [J]. Pattern Recognition.2001,34 (2):299-314.
    [141]Ruta D, Gabrys B. An overview of classifier fusion methods [J]. Computing and Information sys-tems,2000,7 (1):1-10.
    [142]Hinton G. Training products of experts by minimizing contrastive divergence [J]. Neural compu-tation,2002,14(8):1771-1800.
    [143]Pang S, Kim D, Bang S. Fraud detection using support vector machine ensemble [J]. ICONIP2001, 2001:1344-1349.
    [144]Kim H, Pang S, Je H, et al. Pattern classification using support vector machine ensemble [C]. In Pattern Recognition,2002. Proceedings.16th International Conference on,2002:160-163.
    [145]Belkin N, Croft W. Information filtering and information retrieval:two sides of the same coin? [J]. Communications of the ACM,1992,35 (12):29-38.
    [146]Maron M, Kuhns J. On relevance, probabilistic indexing and information retrieval [J]. Journal of the ACM (JACM),1960,7 (3):216-244.
    [147]Salton G, Buckley C. Improving retrieval performance by relevance feedback [J]. Readings in information retrieval,1997:355-364.
    [148]Sheth B. A learning approach to personalized information filtering [D]. [S.l.]:Massachusetts Institute of Technology,1994.
    [149]Sarwar B, Karypis G, Konstan J, et al. Analysis of recommendation algorithms for e-commerce [C]. In Proceedings of the 2nd ACM conference on Electronic commerce,2000:158-167.
    [150]Sarwar B, Karypis G, Konstan J, et al. Item-based collaborative filtering recommendation algo-rithms [C]. In Proceedings of the 10th international conference on World Wide Web,2001:285-295.
    [151]Aggarwal C, Wolf J, Wu K, et al. Horting hatches an egg:A new graph-theoretic approach to collaborative filtering [C]. In Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining,1999:201-212.
    [152]Yu K, Wen Z, Xu X, et al. Feature weighting and instance selection for collaborative filtering [C]. In Database and Expert Systems Applications,2001. Proceedings.12th International Workshop on, 2001:285-290.
    [153]Cao H, Jiang D, Pei J, et al. Context-aware query suggestion by mining click-through and session data [C]. In Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining,2008:875-883.
    [154]Cao H, Jiang D, Pei J, et al. Towards context-aware search by learning a very large variable length hidden markov model from search logs [C]. In Proceedings of the 18th international conference on World wide web.2009:191-200.
    [155]Cao H, Hu D, Shen D, et al. Context-aware query classification [C]. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval,2009: 3-10.
    [156]He Q, Jiang D, Liao Z, et al. Web query recommendation via sequential query prediction [C]. In Data Engineering,2009. ICDE'09. IEEE 25th International Conference on,2009:1443-1454.
    [157]Beitzel S, Jensen E, Frieder O, et al. Improving automatic query classification via semi-supervised learning [C]. In Data Mining, Fifth IEEE International Conference on,2005:8-pp.
    [158]Wu X, Yan J, Liu N, et al. Probabilistic latent semantic user segmentation for behavioral targeted advertising [C]. In Proceedings of the Third International Workshop on Data Mining and Audience Intelligence for Advertising.2009:10-17.
    [159]Tu S, Lu C. Topic-based user segmentation for online advertising with latent dirichlet allocation [J]. Advanced Data Mining and Applications,2010:259-269.
    [160]Hofmann T. Probabilistic latent semantic indexing [C]. In Proceedings of the 22nd annual interna-tional ACM SIGIR conference on Research and development in information retrieval,1999:50-57.
    [161]Group G. Gensim 1C Topic Modelling for Humans [J]. Available online: http://radimrehurek.com/gensim/.
    [162]Shen D, Sun J, Yang Q, et al. Building bridges for web query classification [C]. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval,2006:131-138.
    [163]Shen D, Pan R, Sun J, et al. Query enrichment for web-query classification [J]. ACM Transactions on Information Systems (TOIS),2006,24 (3):320-352.
    [164]Beitzel S, Jensen E, Lewis D, et al. Automatic classification of web queries using very large unlabeled query logs [J]. ACM Transactions on Information Systems (TOIS),2007,25 (2):9.
    [165]Li X, Wang Y, Acero A. Learning query intent from regularized click graphs [C]. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval,2008:339-346.
    [166]Ganti V, Konig A, Li X. Precomputing search features for fast and accurate query classification [C]. In Proceedings of the third ACM international conference on Web search and data mining,2010: 61-70.
    [167]Broder A, Fontoura M, Gabrilovich E, et al. Robust classification of rare queries using web knowledge [C]. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval,2007:231-238.
    [168]Rose D, Levinson D. Understanding user goals in web search [C]. In Proceedings of the 13th international conference on World Wide Web,2004:13-19.
    [169]Hu J, Wang G, Lochovsky F, et al. Understanding user's query intent with wikipedia [C]. In Proceedings of the 18th international conference on World wide web,2009:471-480.
    [170]Sutton C, McCallum A. Collective segmentation and labeling of distant entities in information extraction [R].2004.
    [171]Hu D, Shen D, Sun J, et al. Context-aware online commercial intention detection [J]. Advances in Machine Learning,2009:135-149.
    [172]Peng J, Bo L, Xu J. Conditional neural fields [J]. Advances in Neural Information Processing Systems,2009,22:1419-1427.
    [173]Quattoni A, Collins M, Darrell T. Conditional random fields for object recognition [C]. In NIPS,2004.
    [174]Gunawardana A, Mahajan M, Acero A, et al. Hidden conditional random fields for phone classification [C]. In Ninth European Conference on Speech Communication and Technology,2005.
    [175]Wang S, Quattoni A, Morency L, et al. Hidden conditional random fields for gesture recognition [C]. In Computer Vision and Pattern Recognition,2006 IEEE Computer Society Conference on, 2006:1521-1527.
    [176]Morency L, Quattoni A, Darrell T. Latent-dynamic discriminative models for continuous gesture recognition [C]. In Computer Vision and Pattern Recognition,2007. CVPR'07. IEEE Conference on, 2007:1-8.
    [177]Shen Y, Yan J, Yan S, et al. Sparse hidden-dynamics conditional random fields for user intent understanding [C]. In Proceedings of the 20th international conference on World wide web,2011: 7-16.
    [178]Zhu X, Zhang P, Lin X, et al. Active learning from data streams [C]. In Data Mining,2007. ICDM 2007. Seventh IEEE International Conference on,2007:757-762.
    [179]Zhu X, Zhang P, Wu X, et al. Cleansing noisy data streams [C]. In Data Mining,2008. ICDM'08. Eighth IEEE International Conference on,2008:1139-1144.
    [180]Tsymbal A. The problem of concept drift:definitions and related work [J]. Computer Science Department, Trinity College Dublin,2004.
    [181]Bifet A, Holmes G, Pfahringer B, et al. New ensemble methods for evolving data streams [C]. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining,2009:139-148.
    [182]Tsymbal A. The problem of concept drift:definitions and related work [J]. Available online: http://www.scss.tcd.ie/publications/tech-reports/reports.04/TCD-CS-2004-15.pdf.
    [183]Guttman A. R-trees:a dynamic index structure for spatial searching [M]. ACM,1984.
    [184]Beckmann N, Kriegel H, Schneider R, et al. The R*-tree:an efficient and robust access method for points and rectangles [M]. ACM,1990.
    [185]Sellis T, Roussopoulos N, Faloutsos C. The R+-tree:A dynamic index for multi-dimensional objects [J],1987.
    [186]Zezula P. Similarity search:The metric space approach [M]. Springer-Verlag New York Inc,2006.
    [187]Muthukrishnan S. Efficient algorithms for document retrieval problems [C]. In Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms,2002:657-666.
    [188]Grossman D, Frieder O. Information retrieval:Algorithms and heuristics [M]. Kluwer Academic Pub,2004.
    [189]Comer D. Ubiquitous B-tree [J]. ACM Computing Surveys (CSUR),1979,11 (2):121-137.
    [190]Jensen C, Lin D, Ooi B. Query and update efficient B+-tree based indexing of moving objects [C]. In Proceedings of the Thirtieth international conference on Very large data bases-Volume 30,2004: 768-779.
    [191]Hsu C, Lin C. A comparison of methods for multiclass support vector machines [J]. Neural Networks, IEEE Transactions on,2002,13 (2):415-425.
    [192]Fan W. Systematic data selection to mine concept-drifting data streams [C]. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining,2004: 128-137.
    [193]Klinkenberg R. Learning drifting concepts:Example selection vs. example weighting [J]. Intelligent Data Analysis,2004,8 (3):281-300.
    [194]Lebanon G, Zhao Y. Local likelihood modeling of temporal text streams [C]. In Proceedings of the 25th international conference on Machine learning,2008:552-559.
    [195]Klawonn F, Angelov P. Evolving extended naive Bayes classifiers [C]. In Data Mining Workshops, 2006. ICDM Workshops 2006. Sixth IEEE International Conference on,2006:643-647.
    [196]Gama J, Rocha R, Medas P. Accurate decision trees for mining high-speed data streams [C]. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining,2003:523-528.
    [197]Jin R, Agrawal G. Efficient decision tree construction on streaming data [C]. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining,2003: 571-576.
    [198]Scholz M, Klinkenberg R. Boosting classifiers for drifting concepts [J]. Intelligent Data Analysis, 2007,11 (1):3-28.
    [199]Fan W, Huang Y, Wang H, et al. Active mining of data streams [C]. In Proc. of the 4th SIAM International Conference on Data Mining,2004:457-461.
    [200]White D, Jain R. Similarity indexing with the SS-tree [C]. In Data Engineering,1996. Proceedings of the Twelfth International Conference on,1996:516-523.
    [201]Berchtold S, Keim D, Kriegel H. The X-tree:An index structure for high-dimensional data [M]. Bibliothek der Universitat Konstanz,1996.
    [202]Baeza-Yates R, Ribeiro-Neto B, et al. Modern information retrieval [M]. Addison-Wesley New York,1999.
    [203]Bast H, Chitea A, Suchanek F, et al. ESTER:efficient search on text, entities, and relations [C]. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval,2007:671-678.
    [204]Arge L, Samoladas V, Vitter J. On two-dimensional indexability and optimal range search indexing [C]. In Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems,1999:346-357.
    [205]Bast H, Weber I. Type less, find more:fast autocompletion search with a succinct index [C]. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval,2006:364-371.
    [206]Bast H, Majumdar D, Weber I. Efficient interactive query expansion with complete search [C]. In Proceedings of the sixteenth ACM conference on Conference on information and knowledge management,2007:857-860.
    [207]Hartigan J, Wong M. Algorithm AS 136:A k-means clustering algorithm [J]. Journal of the Royal Statistical Society. Series C (Applied Statistics),1979,28 (1):100-108.
    [208]Saaty T. Analytic hierarchy process [M]. Wiley Online Library,1980.
    [209]Kamra A, Terzi E, Bertino E. Detecting anomalous access patterns in relational databases [J]. The VLDB Journal,2008,17(5):1063-1077.
    [210]Dalvi N, Sanghai S, Roy P, et al. Pipelining in multi-query optimization [C]. In Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems,2001: 59-70.
    [211]Mistry H, Roy P, Sudarshan S, et al. Materialized view selection and maintenance using multi-query optimization [J]. ACM SIGMOD Record,2001,30 (2):307-318.
    [212]Chen J, DeWitt D, Naughton J. Design and evaluation of alternative selection placement strategies in optimizing continuous queries [C]. In Data Engineering,2002. Proceedings.18th International Conference on,2002:345-356.
    [213]Chaudhuri S, Shim K. Optimization of queries with user-defined predicates [J]. ACM Transactions on Database Systems (TODS),1999,24 (2):177-228.
    [214]Babu S, Motwani R, Munagala K, et al. Adaptive ordering of pipelined stream filters [C]. In Proceedings of the 2004 ACM SIGMOD international conference on Management of data,2004:407-418.
    [215]Condon A, Deshpande A, Hellerstein L, et al. Flow algorithms for two pipelined filter ordering problems [C]. In Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems,2006:193-202.
    [216]Hellerstein J, Stonebraker M. Predicate migration:Optimizing queries with expensive predicates [M]. ACM,1993.
    [217]Carzaniga A, Rosenblum D, Wolf A. Design and evaluation of a wide-area event notification service [J]. ACM Transactions on Computer Systems (TOCS),2001,19 (3):332-383.
    [218]Aguilera M, Strom R, Sturman D, et al. Matching events in a content-based subscription system [C]. In Proceedings of the eighteenth annual ACM symposium on Principles of distributed computing,1999:53-61.
    [219]Cugola G, Di Nitto E, Fuggetta A. Exploiting an event-based infrastructure to develop complex distributed systems [C]. In Software Engineering,1998. Proceedings of the 1998 International Conference on,1998:261-270.
    [220]Cilia M, Bornhovd C, Buchmann A. Cream:An infrastructure for distributed, heterogeneous event-based applications [J]. On The Move to Meaningful Internet Systems 2003:CoopIS, DOA, and ODBASE,2003:482-502.
    [221]Petrovic M, Burcea I, Jacobsen H. S-topss:Semantic toronto publish/subscribe system [C]. In Proceedings of the 29th international conference on Very large data bases-Volume 29,2003:1101-1104.
    [222]Petrovic M, Liu H, Jacobsen H. G-ToPSS:fast filtering of graph-based metadata [C]. In Proceedings of the 14th international conference on World Wide Web,2005:539-547.
    [223]Kemper A, Moerkotte G, Steinbrunn M. Optimizing boolean expressions in object bases [C]. In Proceedings of the International Conference on Very Large Data Bases,1992:79-79.
    [224]Ross K. Conjunctive selection conditions in main memory [C]. In Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems,2002:109-120.
    [225]Cohen E, Fiat A, Kaplan H. Efficient sequences of trials [C]. In Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms,2003:737-746.
    [226]Feige U, Lovasz L, Tetali P. Approximating min sum set cover [J]. Algorithmica,2004,40 (4): 219-234.