不确定时间序列的相似性匹配问题研究

英文题名：Research on Uncertain Time Series Similarity Matching
作者：左彦飞
论文级别：硕士
学科专业名称：计算机软件与理论
中文关键词：不确定时间序列 ; 不确定数据模型 ; 相似性匹配 ; 概率范围查询 ; 时间序列距离
英文关键词：uncertain time series ; uncertain data model ; similarity matching ; probabilistic range query ; time series distance
学位年度：2012
导师：刘国华
学科代码：081202
学位授予单位：东华大学
论文提交日期：2012-02-01

摘要

时间序列,就是按照时间先后顺序排列的记录序列。相似性匹配是时间序列的聚类、异常检测、模式发现等任务的基础操作之一。目前对时间序列相似性匹配的研究主要针对确定性数据,随着物联网、隐私保护等技术的发展,不确定时间序列将大量涌现,时间序列的相似性匹配技术面临新的挑战。
     在不确定时间序列的情况下,两条序列之间的距离也是不确定的,所以无法直接利用确定性时间序列的相似性匹配方法。为了解决不确定时间序列相似性匹配问题,我们建立了一种描述不确定时间序列的数据模型,在该模型下,不确定时间序列在每一时刻的数据点均由一个取样点(sample observations)的集合组成,并且每个取样点出现的概率相等,即服从离散型均匀分布；并且,时间序列中不同时刻的点相对独立。在此模型下,两条不确定时间序列之间的真实距离是由大量的可能距离(以一定的概率值出现)组成的,并且这些可能距离的数量为指数大小。所以,直接计算所有的可能距离的效率将非常低。因此,在所提出模型的基础上,本文提出了两种不确定时间序列相似性匹配算法：α-PRQ(均值法)和k-PRQ(聚类法)。
     (1)α-PRQ
     根据查询序列和数据库中所存储的时序数据是否为确定性数据,将不确定时间序列相似性查询分为三种不同的类型；然后,对于每种类型,通过均值法(averaging method)从不确定序列中提取出一条确定性序列来代表原序列,然后,采取确定性时间序列相似性匹配方法进行查询。
     (2)k-PRQ
     此算法主要通过两个步骤进行剪枝以降低计算复杂性：
     1)通过聚类减小取样大小(sample size),以聚类后的每一个簇为单位计算距离,从而大大降低了计算复杂度。
     2)通过预先计算出小于给定阈值ε的距离个数的上界与下界,就能够得到这些距离出现概率的上下界,从而通过概率的上下界过滤掉不必要的计算,减少计算量。
     实验表明,我们提出的两个不确定时间序列相似性匹配算法具有较好的性能和准确性。
A time series is a sequence records according with the chronological order. Similarity matching is one of the underlying operations for time series clustering, outlier detection and pattern discovery tasks.
     Currently, study of time series similarity matching mainly focuses on deterministic data. With the development of the Internet of things and privacy protection technology, uncertain time series will be in large numbers and time series similarity matching technology is facing new challenges.
     In the case of uncertain time series, the distance between the two sequences is uncertain, so the way of similarity matching on deterministic time series cannot use directly.
     In order to solve the problem of uncertain time series similarity matching, we have established a data model to describe the uncertain time-series. Under this model, the data point at each time slot was built up by the set of one sample observations. Each sampling point has the same probability of occurrence, that is uniformly distributed and different time points of the time series is relatively independent. In this model, the true distance between two uncertain time series are consisting of a large number of possible distance (with a certain probability value). Therefore, on the basis of the model proposed by this paper, two algorithms have be proposed for uncertain time series similarity matching:a-PRQ (mean method) and k-PRQ (cluster method).
     (1) a-PRQ
     According to the query sequence and time series data stored in database are whether deterministic, The uncertain timing sequence similarity query is divided into three different types; Then, for each type, by the means method (averaging method) extracted from the sequence of uncertainty out of a deterministic sequence to represent the original sequence take the deterministic time series similarity matching the query.
     (2)k-PRQ
     This algorithm is mainly through a two-step pruning to reduce the computational complexity:
     1) Through the cluster to reduce the sample size (sample size) to calculate the distance to each cluster after clustering as a unit, thereby greatly reducing the computational complexity.
     2) Pre-calculated a given thresholdε, from the number of upper and lower bounds, we can get the distance to the probability of the upper and lower bounds through probability of the upper and lower bounds, it filters out unnecessary calculations and thus reduce the computational complexity.
     The experiments show that the two uncertain time series similarity matching algorithm has better performance and accuracy.

引文

[1]J. Han, K. Michael. Data Mining:Concepts and Technique. Morgan Kaufmann Publisher,2000.1-14
    [2]何书元.应用时间序列分析[M].北京：北京大学出版社,2004.1-15
    [3]E. Keogh and S. Kasetty. On the need for time series data mining benchmarks:a survey and empirical demonstration. Data Mining and Knowledge Discovery, 2003,7(4):349-371
    [4]Agrawal R., Faloutsos C., Swami A.. Efficient Similarity Search in Sequence Databases, In FODO,69-84,1993
    [5]Bill P. Buckles, Frederick E. Petry. A Fuzzy Model for Relational Databases. International Journal of Fuzzy Sets and Systems,1982,7:213-226
    [6]Tomasz Imielinski, Witold Lipski Jr. Incomplete Information in Relational databases. Journal of the ACM,1984,31(4):761-791
    [7]Gosta Grahne. Horn Tables-An Efficient Tool for Handling Incomplete Information in Databases. In Proc. Of ACM PODS,1989
    [8]Serge Abiteboul, Paris Kamellakis, Gosta Grahne. On the Representation and Querying of Sets of Possible Worlds. Theoretical Computer Science,1991
    [9]Norbert Fuhr, Thomas Rolleke. A Probabilistic NF2 Relational Algebra for Imprecision in Databases. Unpublished Manunuscript
    [10]Lakshmanan, L.V.S., Leone, N., Ross, R., Subrahmanian, V.. ProbView:a flexible probabilistic database system. ACM TODS,1997,22(3):419-469
    [11]Barbara D., Garcia-Molina H., Porter D.. The management of probabilistic data. IEEE Trans. Knowl. Data Eng.1992,4(5):487-502
    [12]Das Sarma A., Nabar S., Widom J.. Representing uncertain data:uniqueness, equivalence, minimization, and approximation. Tech.rep., Stanford InfoLab (2005). http://dbpubs.stanford.edu/pub/2005-38
    [13]Das Sarma A., Benjelloun O., Halevy A., Widom J.. Working models for uncertain data. In:Proc. of ICDE (2006)
    [14]Widom J.. Trio:a system for integrated management of data, accuracy, and lineage. In:Proc. of CIDR (2005)
    [15]Benjelloun O., Das Sarma A., Halevy A., Widom J.. ULDBs:databases with uncertainty and lineage. In:Proc. of VLDB (2006)
    [16]Das Sarma A., Benjelloun O., Halevy A., Nabar S., Widom J.. Representing Uncertain Data:Models, Properties, and Algorithms. The VLDB Journal,2009, 18:989-1019
    [17]R. Cheng, D. Kalashnikov, S. Prabhakar. Evaluating probabilistic queries over imprecise data. In Proc. of SIGMOD(2003)
    [18]M. Hua, J. Pei, W. Zhang, X. Lin. Ranking Queries on Uncertain Data:A Probabilistic Threshold Approach. In:Proc. of SIGMOD(2008)
    [19]R. Cheng, D. V. Kalashnikov, S. Prabhakar. Querying Imprecise Data in Moving Object Environments. IEEE Trans. on Knowledge and Data Eng., 2004,16(9):1112-1127
    [20]A. Fuxman, E. Fazli,R. J. Miller. Conquer:Efficient Management of Inconsistent Databases. In:Proc. of SIGMOD(2005)
    [21]P. Andritsos A. Fuxman, R. J. Miller. Clean Answers over Dirty Databases:A Probabilistic Approach. In:Proc. of ICDE(2006)
    [22]M. A. Soliman, I. F. Ilyas, K. C.-C. Chang. Top-k Query Processing in Uncertain Databases. In:Proc. of ICDE(2007)
    [23]C. Re, N. Dalvi, D. Suciu. Efficient Top-k Query Evaluation on Probabilistic Data. In:Proc. of ICDE(2007)
    [24]K. Yi, F. Li, G. Kollios, D. Srivastava. Efficient Processing of Top-k Queries in Uncertain Databases with x-Relations. In:Proc. of TKDE(2008)
    [25]C. Jin, K. Yi, L. Chen, J. Yu, X. Lin. Sliding-Window Top-k Queries on Uncertain Streams. In:Proc. of VLDB(2008)
    [26]X. Zhang, J. Chomicki. On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases. In:Proc. of DBRank(08)
    [27]Tingjian Ge, Zdonik S., Madden S.. Top-k Queries on Uncertain Data:On Score Distribution and Typical Answers. In:Proc. of SIGMOD(2009)
    [28][26] M. Chau, R. Cheng, B. Kao, J. Ngai. Uncertain Data Mining:An Example in Clustering Location Data. In:Proc. of PAKDD(2006)
    [29]W. K. Ngai, B. Kao, C. K. Chui, R. Cheng, M. Chau, K. Y. Yip. Efficient Clustering of Uncertain Data. In:Proc. of ICDE(2006)
    [30]T. S. Jayram, A. McGregor, S. Muthukrishnan, E. Vee. Estimating Statistical Aggregates on Probabilistic Data Streams. In:Proc. of PODS(2007)
    [31]G. Cormode, M. N. Garofalakis. Sketching Probabilistic Data streams. In:Proc. of SIGMOD(2007)
    [32]C. Aggarwal, P. S. Yu. Framework for Clustering Uncertain Data Streams. In: Proc. of ICDE(2008)
    [33]G. Cormode, A. McGregor. Approximation Algorithms for Clustering Uncertain Data. In:Proc. of PODS(2008)
    [34]Guha S., Munagala K., Exceeding Expectations and Clustering Uncertain Data. In: Proc. of PODS(2009)
    [35]Das A., Ullman J., Widom J.. Schema Design for Uncertain Databases, In: Alberto Mendelzon Workshop on Foundations of Data Management,2007
    [36]X. Lian, L. Chen, and J. X. Yu. Pattern matching over cloaked time series. In Proc. of IEEE ICDE,2008.
    [37]J. Aβfalg, H. Kriegel, P. Kroger, and M. Renz. Probabilistic similarity search for uncertain time series. In Proceedings of the 21st International Conference on Scientific and Statistical Database Management,2009.
    [38]M. Yeh, K. Wu, P. S. Yu, and M. Chen. PROUD:A probabilistic approach to processing similarity queries over uncertain data streams. In Proceedings of the 12th International Conference on Extending Database Technology:Advances in Database Technology,2009.
    [39]Smruti R. Sarangi, Karin Murthy:DUST:a generalized notion of similarity between uncertain time series. KDD 2010:383-392
    [40]Yuchen Zhao, Charu C. Aggarwal, Philip S. Yu:On wavelet decomposition of uncertain time series data sets. CIKM 2010:129-138
    [41]周傲英,金澈清,王国仁,李建中.不确定性数据管理技术研究综述.计算机学报,2009,31(1)：10-16
    [42]Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, Bernhard Seeger:The R*-Tree:An Efficient and Robust Access Method for Points and Rectangles. SIGMOD Conference 1990:322-331
    [43]Chan K.P. et al.. Haar Wavelets for Efficient Similarity Search of Time-Series: With and Without Time Warping, In IEEE TKDE,2003,686-705
    [44]H. Shatkay, H. S. Zdonik. Approximate queries and representations for large data sequences. in:Y. Stanley, W. Su eds., Proceedings of 12th International Conference on Data Engineering (ICDE), New Orleans, Louisiana, USA,1996. IEEE Computer Society,1996.536-545
    [45]E. Keogh, M. Pazzani. A simple dimensionality reduction technique for fast similarity search in large time series databases. in:T. Terano, H. Liu, AL. Chen eds., Proceedings of 4th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Kyoto, Japan,2000. Springer-Verlag,2000.122-133
    [46]Keogh E. et al.. Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases, In SIGMOD,2001,151-162
    [47]F. Korn, H. Jagadish, C. Faloutsos. Efficiently supporting ad hoc queries in large datasets of time sequences. in:P. Joan ed., Proceedings of ACM SIGMOD International Conference on Management of Data, Tuescon, AZ. USA,1997. Morgan Kaufmann Publisher,1997.289-300
    [48]E. Keogh, S. Lonardi, A. C. Ratanamahatana. Towards Parameter-Free Data Mining. in:S. Evangelos, J. Han, U. M. Fayyad eds., Proceedings of the 10th ACM Knowledge Discovery in Database (KDD), Seattle, WA. USA.2004. ACM Press,2004.206-215
    [49]J. Lin, E. Keogh, S. Londardi, et al. A Symbolic Representation of Time Series, with Implications for Streaming Algorithms. in:J. Z. Mohammed, A. C. Charu eds., Proceedings of the 8th ACM SIGMOD workshop on Research(DMKD), San Diego, CA, USA.2003. ACM Press,2003.55-68
    [50]曾海泉.时间序列挖掘与相似性查找技术研究：[博士学位论文].上海：复旦大学,2003
    [51]李俊奎.时间序列相似性问题研究：[博士学位论文].武汉：华中科技大学,2008
    [52]C. Shahabi, X. Tian, W. Zhao. TSA-tree:a wavelet-based approach to improve the efficiency of multilevel surprise and trend queries. In Proceedings of 12th International Conference of Scientific and Statistical Database Management, Berlin, Germany,2000. IEEE Computer Society,2000:55-68
    [53]Rafiei D., Mendelzon A.O.. Querying Time Series Data Based on Similarity, IEEE TKDE,2000,12(5):675-693
    [54]Spiros Papadimitriou, Feifei Li, George Kollios, Philip S. Yu:Time Series Compressibility and Privacy. VLDB 2007:459-470
    [55]D.J.Berndt, J.Clifford. Finding Patterns in Time Series:A Dynamic Programming Approach, in:D.Weld, B.Clancey eds., Advances in Knowledge Discovery and Data Mining, AAAI/MIT,Oregon,Portland,1996.The MIT Press,1996.229-248
    [56]Keogh E.. Exact indexing of dynamic time warping, In VLDB,2002,406-417
    [57]Eric Sven Ristad, Peter N. Yianilos:Learning String Edit Distance. ICML 1997: 287-295
    [58]F.R.John,S.Myra.A survey of temporal knowledge discovery paradigms and methods.IEEE Transactions on Knowledge and Data Engineering,2002,14(4): 750-767
    [59]L. Wu, C. Faloutsos, K. Sycara,et al. FALCON:feedback adaptive loop for content-based retrieval. in:A.E. Amr, B.L. Michael, C. Sharma, et al. eds., Proceedings of the 26th International Conference on Very Large Databases(VLDB), Cairo, Egypt,2000. Morgan Kaufmann Press,2000.296-306
    [60]Kim S.W. et al.. An Index-Based Approach for Similarity Search Supporting Time Warping in Large Sequence Database, In ICDE,2001 607-614
    [61]Hee-Sun Won, Sang-Wook Kim, Ju-Won Song:An Efficient Algorithm for Bulk-Loading of the Multilevel Grid File. JCIS 2002:382-386
    [62]Zhu Y., Shasha D..Warping Indexes with Envelope Transforms for Query by Humming,In SIGMOD,2003,181-192
    [63]V. Ljosa and A. K. Singh. APLA:Indexing arbitrary probability distributions. In Proc. of the 23rd Int. Conf. on Data Engineering (ICDE 2007),2007
    [64]H.-P. Kriegel, P. Kunath, M. Pfeifle, and M. Renz. Probabilistic Similarity Join on Uncertain Data. In Proc.11th Int. Conf. on Database Systems for Advanced Applications (DASFAA'06), Singapore, Singapore,2006
    [65]E. Keogh, X. Xi, L. Wei, and C. A. Ratanamahatana. The UCR time series classification/clustering homepage. www.cs.ucr.edu/-eamonn/time_series_data

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700