Research on Time Series Analysis Techniques
Abstract
Data mining is the analysis of observed data sets in order to find suitable models and to summarize the data in new ways that are easier to understand and use. Data that arrive in time order appear in many fields, such as physics, finance, medicine, and music. Time series are an important class of temporal data objects, and they can easily be obtained from financial and scientific applications. Time series analysis comprises methods and techniques for analyzing time series data in order to extract meaningful statistics and other characteristics of the data. Given the widespread appearance of time series data and the exponential growth of databases, time series data mining has become an area of intense interest. As extremely large time series data sets become more prevalent in a wide variety of settings, this thesis faces the significant challenge of developing efficient analysis methods. The research in this thesis addresses this problem by designing fast, scalable algorithms for time series analysis.
     For large data sets, research on time series analysis tasks such as preprocessing and transforming data for prediction purposes is both meaningful and common practice. If data, and time series data in particular, can be preprocessed, the efficiency of the mining and discovery processes can be improved and their difficulties reduced. Many data preprocessing techniques exist: cleaning techniques are applied to remove noise and correct inconsistencies in the data; integration techniques are used to merge data from multiple sources into a coherent data store; and transformation techniques are used to normalize the data. Data reduction is a meaningful technique in the preprocessing stage of time series analysis; it can reduce the data size by aggregating the data and eliminating redundant features. In general, time series predictability is a measure of how well the future values of a time series can be predicted, where a time series is a sequence of observations; predictability indicates the extent to which the past can determine the future. A time series generated by a deterministic linear process has high predictability, and its future values can be predicted very accurately from its past values. A time series generated by an uncorrelated process has low predictability, and its past values provide only a statistical characterization of the future values.
     Previous techniques for mining and discovering time series, such as time series clustering, classification, prediction, and other applications in the financial domain, are also reviewed in this thesis. In brief, the main goal of this thesis is the study of time series data mining techniques in the following aspects: (1) data preprocessing such as dimensionality reduction; (2) short-term prediction of time series data, a process referred to as trend analysis; (3) prediction of future values by training and testing on historical samples in an environment with many data streams; and (4) a business intelligence model for time series analysis. For each of these studies, experimental evaluation and analysis are provided to verify the effectiveness of the proposed methods.
     Specifically, this thesis makes the following four main contributions:
     First, we propose a data preprocessing method that reduces the dimensionality of a time series while preserving its shape relative to the original data. The method is based on the idea of turning points in a time series, points that mark a change in the trend of the data: they separate two adjacent trends and have the shortest distance from the release time of announcements. Only some of the critical points are preserved; those regarded as interference factors are removed. The method considers only the critical points of each time series within a certain period, which reduces the data size and eliminates redundant features. When this preprocessing is applied before the mining process, it significantly improves both the overall quality of the mined patterns and the time required for the actual mining. All dimensionality reduction techniques are valuable for preprocessing large data sets that are then analyzed to discover knowledge. This first contribution therefore proposes a turning-point-based method for reducing the dimensionality of time series data, which makes the prediction process in a data stream environment faster. It focuses on turning points extracted from the local maximum and minimum points of the time series data, which prove more efficient for data preprocessing in time series analysis. A time series contains a sequence of local maximum and minimum points, some of which reflect reversals of the data trend; these local maxima and minima are called critical points, so a time series can in other words be viewed as a sequence of critical points. They are usually called turning points because they reveal changes in the trend of the time series data, and they are widely used in data mining analysis because they carry more information than other points: they describe changes in the trend of a time series and can be used to identify the beginning and end of business cycles. We consider that in a time series Ti = {t1, t2, …, tn}, a turning point tj is marked in either of two cases: the first is when the point tj ends an upward trend and begins a downward one, and the second, symmetrically, when it ends a downward trend and begins an upward one. After constructing the initial critical-point series Ti', a selection criterion is applied to filter out the critical points that correspond to noise; Ti and Ti' are referred to as the original series and the preprocessed series, respectively, and the first and last data points of Ti are retained as the first and last points of Ti'. The selection is guided by a volatility function v and a time-duration threshold t; in our method the duration threshold t is five consecutive points. In a multiple time series environment, for a given series Ti = {ti1, ti2, …, tim} of the i-th stream at any time period j, a turning point (a peak or a trough) is defined as a downward or upward change in the observations of the series, after specific thresholds on volatility and time duration are taken into account, because whether an individual point lies on an upward or a downward trend is otherwise uncertain. To complete this with less time and memory, we propose three exclusion strategies covering six cases based on turning points. In each strategy, the decision to keep or exclude a point depends on the volatility parameter v and the time threshold t, which means that the specific volatility and duration of each time series in the stream environment are taken into account. To guarantee changes in both time and value between observed turning points, a step-size range is used to exclude unimportant points. To avoid producing a false turning point (a point that still lies on an upward or downward trend) and to recognize a genuine one, our strategy requires the current point to be of the opposite type to the previously identified turning point: a peak must be followed by a trough, with no other peak between them. This work thus addresses the time series prediction problem by reducing the dimensionality of large historical data and computing future values with data mining techniques. After the turning-point-based dimensionality reduction is applied, the time series produced by our method still preserves the trend shape of the original data, and the proposed method is very effective for processing large data sets.
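To make the turning-point selection concrete, the following is a minimal Python sketch under stated assumptions, not the thesis's exact algorithm: the volatility threshold v is treated as a minimum relative change from the last kept point, the duration threshold t as a minimum spacing in points (five in the text), and the alternation rule forces peaks and troughs to alternate as described above.

```python
# Minimal sketch of turning-point extraction (illustrative, not the thesis's algorithm).
def turning_points(series, v=0.01, t=5):
    """Return indices of retained points: the endpoints plus alternating peaks/troughs."""
    kept = [0]                  # the first point of the original series is always kept
    last_kind = 0               # +1 = last kept extremum was a peak, -1 = trough, 0 = none yet
    for i in range(1, len(series) - 1):
        prev, cur, nxt = series[i - 1], series[i], series[i + 1]
        if prev < cur > nxt:
            kind = +1           # local maximum: candidate peak
        elif prev > cur < nxt:
            kind = -1           # local minimum: candidate trough
        else:
            continue            # the point lies inside a monotone run
        base = series[kept[-1]]
        if abs(cur - base) < v * max(abs(base), 1e-12):
            continue            # volatility filter: change too small to matter
        if i - kept[-1] < t:
            continue            # duration filter: too close to the last kept point
        if kind == last_kind:
            # same kind as the previous turning point: keep only the more extreme one
            if (kind == +1 and cur > series[kept[-1]]) or (kind == -1 and cur < series[kept[-1]]):
                kept[-1] = i
        else:
            kept.append(i)      # alternation preserved: a peak follows a trough and vice versa
            last_kind = kind
    kept.append(len(series) - 1)    # the last point is always kept
    return kept

# Example (small thresholds for this short demo series): endpoints plus dominant extrema survive.
prices = [10, 11, 13, 12, 12.5, 9, 8, 8.5, 12, 14, 13, 11]
print([prices[i] for i in turning_points(prices, v=0.05, t=2)])
```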
     Second, the next contribution of this thesis is a trend analysis method for time series whose function is short-term prediction, related to one-step-ahead (one-day-ahead) forecasting. The result of the combined method is a predicted value that can be used, through trading rules, to make decisions. In this work, clustering is the first step and groups the data into clusters, so that objects within a cluster are more similar to one another than to objects in other clusters. We then consider a classification step, in which a classifier is constructed to predict trend labels such as "upward", "no-trend", or "downward" for the financial data. The classification process for trend prediction comprises two sub-processes, learning and classification: the learning sub-process analyzes the data with a support vector machine (SVM) and represents the learned classifier as classification rules, and the following sub-process estimates the accuracy on the test set according to these rules; if the measured accuracy is acceptable, the rules can be applied to classify new future values. In detail, this contribution proposes a new technique that applies supervised and unsupervised machine learning algorithms, together with trading rules, to predict the trend of financial time series. The approach uses K-Means, which groups the data by similarity, and SVM classification, which is trained and tested on historical data, to perform one-step-ahead trend prediction. To assess its efficiency, we compare the results with a conventional back-propagation (BP) neural network and with an SVM used alone. For the experiments, we collected and filtered data from financial time series websites and then extracted indicators for the stock time series streams; in this case we used the exponential moving average (EMA) as the indicator function, because the EMA is a good compromise between the overly sensitive weighted moving average and the overly slow simple moving average. In the trend prediction stage, we combine K-Means clustering with SVM training on the samples, and the result of the combined method is used for decision making through predetermined trading rules. The idea draws on the advantages of K-SVMeans, a clustering algorithm for multi-type interrelated data sets that couples K-Means clustering with SVM: clustering of one data type is accompanied by learning a classifier on another type, and that classifier in turn influences the clustering decisions. We chose K-Means as part of the method because it is a well-known non-hierarchical clustering algorithm and requires the user to specify the number of clusters present in the data set. K-Means gathers the training samples for each cluster of similar objects, and for each cluster we train a subset with the RBF kernel and the regularization parameter C (chosen by cross-validation). In general, data classification consists of a learning phase and a classification phase. In our learning phase, the training data set is analyzed by a classification algorithm, the class label attribute is used for the decisions, and the learned model (the classifier) is presented in the form of classification rules. In our classification phase, the test data are used to estimate the accuracy of the rules; if the accuracy is acceptable, the rules can be used to classify new data. In addition, to make training faster, we adopt the "one-against-one" strategy for multi-class SVM classification: for an N-class problem, N(N-1)/2 support vector machines are trained, each distinguishing the samples of one class from the samples of another class, and an unknown pattern is then classified by max-wins voting, with each SVM voting for one class. Our method predicts the trend and outputs the corresponding class label. Training and testing the samples with the SVM takes five steps. In the first step we consider the input parameters, including the kernel parameter γ, the regularization parameter C, and the number of clusters K. The second step runs the K-Means clustering algorithm on the original data, and all cluster centers are considered for constructing the classifiers. The third step builds the SVM classifiers on the clustered data. The fourth step adjusts the input parameters through a heuristic search strategy. The fifth step tests accuracy and response time: if the combined method is acceptable the procedure stops; otherwise the algorithm returns to the first step to test a new combination of input parameters. This work studies the problem of time series trend analysis with supervised and unsupervised learning machines and, with this combined technique, proposes a method for the trend prediction problem that couples the K-Means clustering algorithm with SVM training: the method clusters the input data with K-Means, then trains an SVM classifier from each cluster, and predicts the trend of a time series as the output of the analysis. If the actual trend is upward but the predicted value is downward, or vice versa, we call it a prediction error, and the accuracy of the model is defined as the percentage of correctly classified samples over the total number of samples. The experimental results show that the proposed combined method achieves higher accuracy than the other methods.
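The sketch below illustrates the K-Means-then-per-cluster-SVM structure described above, assuming scikit-learn. The feature matrix X (for example, EMA-based indicators, where EMA_t = α·x_t + (1-α)·EMA_{t-1}) and the trend labels are placeholders rather than the thesis's exact pipeline, and the parameters k, C, and gamma stand in for the K, C, and γ tuned in step one.

```python
# Minimal sketch of K-Means clustering followed by one RBF SVM per cluster (assumes scikit-learn).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def fit_cluster_svms(X, y, k=3, C=1.0, gamma="scale"):
    """Cluster the training data with K-Means, then train one RBF-kernel SVM per cluster."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    svms = {}
    for c in range(k):
        idx = km.labels_ == c
        if len(np.unique(y[idx])) > 1:        # an SVM needs at least two classes to train
            # scikit-learn's SVC handles multi-class problems with one-against-one internally
            svms[c] = SVC(kernel="rbf", C=C, gamma=gamma).fit(X[idx], y[idx])
    return km, svms

def predict_trend(km, svms, X_new, fallback="no-trend"):
    """Route each new sample to its nearest cluster and classify it with that cluster's SVM."""
    X_new = np.asarray(X_new, dtype=float)
    labels = []
    for x, c in zip(X_new, km.predict(X_new)):
        clf = svms.get(c)
        labels.append(clf.predict(x.reshape(1, -1))[0] if clf is not None else fallback)
    return np.array(labels)
```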
     Third, the next contribution of the thesis is a method for predicting future values from historical values in a multiple time series environment. We regard this as an important component of operations research, because such results often supply the foundation for decision-making models. Modeling time series data is a statistical problem, and time series prediction techniques have been used in many real-world applications: prediction techniques are used in computational procedures to estimate the parameters of a model that is then used to allocate limited resources or to describe random processes such as those mentioned above. The problem of predictive time series analysis in an environment with multiple time series is also addressed in this thesis. In the machine learning approach, the support vector machine used for regression is called support vector regression; support vector regression has been applied successfully to stream time series analysis, but its optimization is usually built on quadratic programming packages. For large quadratic programming data sets, a sequential minimal optimization algorithm based on the support vector machine can increase the operation speed and reduce the long running time. In detail, suppose the data stream contains n time series {T1, T2, …, Tn} and that at the current timestamp (m-1) each Ti holds m ordered values, that is, Ti = {ti0, ti1, …, ti(m-1)}, where tij is the value of Ti at timestamp j. Assume the n streams will only receive data after F further timestamps; in other words, for every series Ti the future values tim, ti(m+1), ti(m+2), …, ti(m+F-1), corresponding to timestamps m, (m+1), (m+2), …, (m+F-1), arrive together in one batch at timestamp (m+F). During the period from timestamp m to (m+F), the system does not know the F future values of each series. The goal of the method is to predict the n·F values of the n streams efficiently, with a prediction error as low as possible and an accuracy as high as possible. As a statistical baseline for comparison over multiple time series, a linear regression model is used to represent the time series data stream: a set of series {T1, T2, …, Tn} with Ti = {ti0, ti1, ti2, …, ti(m-1)}, i ≤ n. For each series in the stream, the set of historical values {ti0, ti1, ti2, …, tiH} provides the explanatory (independent) variables, and the value to be predicted is the dependent variable of the linear regression model; when the independent values are known, the model predicts the mean of the dependent variable. Linear regression implements a statistical model that gives good results when the relationship between the independent and dependent variables is approximately linear, and a further reason for choosing it as a baseline is that it is a simple regression analysis that predicts numeric outputs reasonably well. Moreover, in a multiple time series environment, if every series loaded its own kernel matrix into main memory, memory would overflow; we adopt the SVM-based sequential minimal optimization (SMO) algorithm because it only accesses the kernel matrix iteratively, which improves the execution. For large data sets the plain SVM becomes slow, so we select SMO to obtain better execution time and better accuracy on the future data. SMO is an iterative algorithm for solving the SVM optimization problem: it decomposes the problem into a series of the smallest possible sub-problems, which can be solved analytically. Because the linear equality constraint involves the Lagrange multipliers, the smallest possible sub-problem involves two such multipliers. For each series in our method, SMO repeatedly performs two steps: the first step finds a Lagrange multiplier α, and the second step picks a second multiplier α* and optimizes the pair (α, α*); these two steps are repeated until convergence. The problem is solved when all Lagrange multipliers satisfy the Karush-Kuhn-Tucker (KKT) conditions within a user-defined tolerance. Although the algorithm is guaranteed to converge, heuristics are used to choose the pair of multipliers in order to accelerate convergence. To choose the first Lagrange multiplier, the outer loop of SMO first iterates over the non-bound subset of training examples, and any example that violates the KKT conditions can be optimized immediately; if no such example exists, the loop iterates over the entire training set. Once a violating example is found, the second heuristic selects a second multiplier and the two multipliers are optimized jointly; the support vector machine is then updated, and the outer loop resumes searching for KKT violators. During the joint optimization, SMO selects the second Lagrange multiplier so as to maximize the step size taken, and |E1 - E2| is used to approximate this step size. The selection of the second multiplier proceeds in three stages: first, loop over the non-bound examples and choose the one that maximizes |E1 - E2|; second, if that yields no positive progress, SMO loops over the non-bound examples searching for any example that does make positive progress; third, if that also fails, SMO loops over the entire training set until an example that makes positive progress is found. The second and third loops start at random positions. The experimental results show the effectiveness of the method.
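As an illustration of this prediction setting, here is a minimal sketch that forecasts F future values from the last H historical values with both support vector regression and linear regression, assuming scikit-learn (whose SVR wraps LIBSVM, an SMO-type solver). H, F, and the recursive multi-step strategy are illustrative assumptions, not the thesis's exact configuration.

```python
# Minimal sketch: predict F future values of one series from lagged history (assumes scikit-learn).
import numpy as np
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression

def make_lagged(series, H):
    """Build (X, y) pairs in which the previous H values predict the next value."""
    X = np.array([series[i:i + H] for i in range(len(series) - H)])
    y = np.array(series[H:])
    return X, y

def forecast(series, H=10, F=5):
    """Recursively predict F future values with SVR and with a linear regression baseline."""
    X, y = make_lagged(series, H)
    models = {"svr": SVR(kernel="rbf", C=1.0).fit(X, y),
              "linreg": LinearRegression().fit(X, y)}
    preds = {}
    for name, model in models.items():
        window = list(series[-H:])
        out = []
        for _ in range(F):
            nxt = float(model.predict(np.array(window).reshape(1, -1))[0])
            out.append(nxt)
            window = window[1:] + [nxt]   # slide the window over the new prediction
        preds[name] = out
    return preds

# Example usage: preds = forecast(list_of_prices, H=10, F=5); compare preds["svr"] and preds["linreg"].
```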
     Fourth, another contribution of the thesis is a proposed approach to business intelligence management. The approach solves the problems of collecting and filtering stock time series streams and then reducing their dimensionality, so that different features can quickly be optimized, combined, and tested to execute a fast similarity search based on the application's requirements. Any operation that prepares and gathers data can be called data collection; its purpose is to obtain information, keep it on record, or pass it on to others. When collecting data it is extremely important to obtain high-quality information, because high-quality information is a reliable basis for making correct decisions. In this approach, data are collected primarily to provide information, and the collected data can not only be stored but also used for monitoring and evaluation. Business intelligence plays an important role in making effective judgments: through the systematic processing of information it characterizes the environment in which a business organization operates, and these accurate judgments can be used to improve business performance and increase business opportunities. We take the business intelligence model to be a group of tasks that gather historical data, filter the necessary data, and use them to predict future values; the model can be used to improve the performance of the organization, and business intelligence realizes its full potential by turning this into the organization's knowledge base. The main purpose of this study is to provide a business intelligence model that, based on the model and the preprocessed data, can predict business performance; a prediction algorithm is then used to generate the expected data. Business intelligence techniques can provide historical, current, and predictive views of business performance. The method in this thesis comprises four main procedures: data collection, preprocessing, future prediction, and evaluation. Once the goal of the investigation is determined, the data collection procedure is executed; because the data may come from different sources, data integration may be needed. For this purpose, the first step of the proposed method collects data from business websites: it selects the names of securities companies and retrieves their historical data over a period of time, and the corresponding data summaries are also collected and stored in this procedure. To reduce the time and storage space the method requires, pattern matching is used to compress some of the time series data points, which are called non-important data points. Suppose that after matching and compressing data points the initial time series Ti is transformed into Ti'. The selection and compression of data points are based not only on the volatility parameter v but also on the time-duration parameter t: the volatility parameter v is defined as the average of the values of the points within the time interval, and the time-duration parameter is defined by a sliding window whose width equals the number of consecutive points, t = w. In a stream environment that handles many time series, for a given stream series Ti = {ti1, ti2, …, tim}, a window examined on Ti is defined as a segment of width w of the i-th stream at time period j. There are four cases in which the parameters t and v are satisfied simultaneously, which means that the specifics of volatility and time duration are considered together. To illustrate one of these cases, suppose the sliding window contains five consecutive points pF, p2, p3, p4, pL; we check whether p3 is less than the average of (p2, p3, p4) and whether the corresponding condition on (p2, p4) holds, and if so we keep pF, p3, pL and discard p2 and p4. To summarize this contribution, the approach addresses the multiple time series environment and supports business intelligence through the following work: gathering, filtering, and storing the data, and then preprocessing them before they are used for prediction. Our approach also predicts the future values of the time series streams that support business intelligence, using the SMO-based support vector regression technique, and provides evaluation measures of accuracy and generality. The method reduces large historical data to a smaller data set that can be matched against predefined patterns, so performance improves significantly; after this pattern-matching-based point reduction, the time series generated by the method still preserves the original trend shape.
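A small sketch of the window-based point reduction follows. Because the exact condition on (p2, p4) is not recoverable from the text above, the dip test used here is an illustrative assumption, and the window is fixed at the five points of the example.

```python
# Minimal sketch of sliding-window point reduction (illustrative assumption, not the thesis's exact rule).
def reduce_points(series):
    """Scan consecutive five-point windows; when the middle point is a dip, keep only
    the first, middle, and last points of the window and drop the other two."""
    kept, i = [], 0
    while i + 5 <= len(series):
        pF, p2, p3, p4, pL = series[i:i + 5]
        if p3 < (p2 + p3 + p4) / 3 and p3 < min(p2, p4):   # assumed dip condition on p2, p3, p4
            kept.extend([pF, p3, pL])                      # drop p2 and p4
        else:
            kept.extend([pF, p2, p3, p4, pL])              # keep the window unchanged
        i += 5
    kept.extend(series[i:])                                # keep any trailing points as-is
    return kept
```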
     In this thesis we adopt a framework for carrying out the experiments that demonstrate the results of this research. Many frameworks exist for analyzing a single large time series; we propose an analysis framework for multiple time series data. The framework is intended for situations in which decisions must be made to improve a company's business efficiency and in which the organizational environment must be understood through information system processes. In general, the main purpose of the proposed time series analysis framework is prediction. Its first two steps follow the usual systematic data mining process: data collection, then data transformation and filtering, followed by dimensionality reduction of the prepared data; the next two processes normalize the data and output the information used for decision making. The information is evaluated and interpreted according to business rules, and the accuracy of the predictive analysis is computed with statistical methods. To exercise the proposed techniques, the experiments use financial time series data from the Yahoo Finance website, and the results verify the effectiveness of the methods. The first process of the framework is data collection, which begins with gathering the initial data and becoming familiar with them; the main goals are to assess the quality of the data, gain a preliminary understanding of them, and identify interesting subsets. Data understanding can be further divided into initial data collection, data description, data exploration, and data quality verification. For time series data mining, the second step is data preprocessing, which covers all the activities needed to construct the final data set that is fed into the mining tools or algorithms of the next step: selection of tables, records, and attributes; data cleaning; construction of new attributes; and data transformation. In this framework, the main purpose of the preprocessing work is dimensionality reduction through turning points and pattern matching. The third process mines the data and discovers information; it contains two sub-processes that supply the data needed for short-term and long-term prediction. The trend analysis sub-process predicts the trend of the next day's financial data (called one-day-ahead prediction): it uses supervised and unsupervised machine learning to cluster the data and then to train and test samples so as to predict future values, and the predicted value is checked against the predefined trading rules before it is output. The remaining sub-process performs predictive analysis in the multiple time series environment with two algorithms, the SVM-based sequential minimal optimization algorithm and linear regression, and the statistical and computational intelligence approaches are compared in order to choose the better method for the streaming data. The fourth process evaluates accuracy: statistical computations are used to assess the accuracy of the predicted values, and this value may vary with several parameters.
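As a concrete example of the evaluation step, the sketch below computes a few common statistical accuracy measures. The thesis does not list its exact formulas here, so MAE, RMSE, MAPE, and the trend-classification accuracy (the percentage of correctly classified samples, as defined in the second contribution) are offered as an assumption rather than as the thesis's own metrics.

```python
# Minimal sketch of the statistical accuracy evaluation step (assumed metric choices).
import numpy as np

def regression_accuracy(actual, predicted):
    """Common error measures for predicted numeric values."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    err = actual - predicted
    return {
        "MAE": float(np.mean(np.abs(err))),                     # mean absolute error
        "RMSE": float(np.sqrt(np.mean(err ** 2))),              # root mean squared error
        "MAPE": float(np.mean(np.abs(err / actual)) * 100.0),   # assumes no zero actual values
    }

def trend_accuracy(actual_labels, predicted_labels):
    """Percentage of correctly classified trend labels."""
    actual_labels, predicted_labels = np.asarray(actual_labels), np.asarray(predicted_labels)
    return float(np.mean(actual_labels == predicted_labels) * 100.0)
```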
Data mining is the analysis of observed data sets in order to find suitable models and to summarize the data in new ways that are both understandable and useful. Data arriving in time order arise in many fields, such as physics, finance, medicine, and music. Time series are an important class of temporal data objects, and they can be easily obtained from financial and scientific applications. Time series analysis comprises methods and techniques for analyzing time series data in order to extract meaningful statistics and other characteristics of the data. Given the widespread appearance of time series data and the exponentially growing sizes of databases, there has recently been an explosion of interest in time series data mining. As extremely large time series data sets grow more prevalent in a wide variety of settings, this thesis faces the significant challenge of developing efficient analysis methods. The research in this thesis addresses the problem by designing fast, scalable algorithms for the analysis of time series.
     Research on time series analysis tasks such as preprocessing and transforming data for prediction purposes is meaningful and widely practiced in the case of large data. If data, and time series data in particular, can be preprocessed, the efficiency of the mining and discovery processes can be improved and their difficulty reduced. There are many data preprocessing techniques: to remove noise and correct inconsistencies in data, cleaning techniques can be applied; to merge data from multiple sources into coherent data storage, integration techniques can be used; and to normalize data, transformation techniques can be applied. Data reduction, one of the meaningful techniques in the preprocessing stage of time series analysis, can reduce the data size by aggregating data and eliminating redundant features. In general, time series predictability is a measure of how well future values of a time series can be predicted, where a time series is a sequence of observations. Time series predictability indicates to what extent the past can be used to determine the future in a time series. A time series generated by a deterministic linear process has high predictability, and its future values can be predicted very well from the past values. A time series generated by an uncorrelated process has low predictability, and its past values provide only a statistical characterization of the future values.
     This thesis makes four major contributions:
     Firstly, we propose a data preprocessing method to reduce the dimensionality of time series while keeping their shape compared to the original data. The method is based on the idea of turning points in a time series; these points are defined as changes in the trend of the time series data. Turning points in a time series are defined as the points that separate two adjacent trends and have the shortest distance from the release time of announcements. Only some of the critical points are preserved; those critical points that are considered interference factors are removed. This method considers only the critical points of each time series in a certain period in order to reduce the data size by eliminating redundant features. This data preprocessing method, when applied before the mining process, can significantly improve the overall quality of the patterns mined and the time required for the actual mining. All dimensionality reduction techniques are very meaningful for preprocessing large data sets, which can then be analyzed to discover knowledge.
     Secondly, the next contribution mentioned in this thesis is the proposed method for analyzing the trend of a time series. This function is a short-term prediction related to one-step-ahead prediction. The results of the combination method are predicted values that can be used for making decisions through trading rules. In this task, clustering is first the procedure of grouping the data into clusters, so that objects within a cluster have higher similarity to one another and are very dissimilar to objects in other clusters. After that, we consider the data classification procedure, where a classifier is constructed to predict trend labels, such as "upward", "no-trend" or "downward", for the financial data. The classification process for trend prediction is implemented in two sub-processes: learning and classification. The learning sub-process analyzes data with a support vector machine, and the learned classifier is represented in the form of classification rules. The next sub-process then estimates the accuracy on the test data according to the classification rules. If the measured accuracy is suitable, the rules can be applied to the classification of new future values.
     Thirdly, the next contribution is the proposed method for predicting future values from historical values in the multiple time series environment. We consider it an important component of operations research, because these results often supply the foundation for decision-making models. Modeling time series data is a type of statistical problem, and time series prediction techniques have been used in many real-world applications. Prediction techniques are used in computational procedures to estimate the parameters of a model being used to allocate limited resources or to describe random processes such as those mentioned above. The problem of predictive time series analysis in an environment with multiple time series is also addressed in this thesis. In the machine learning approach, the support vector machine used for regression is called support vector regression; support vector regression has been applied successfully to stream time series analysis, but its optimization algorithm is usually built on quadratic programming packages. A sequential minimal optimization algorithm based on the support vector machine can improve operation speed and reduce the long run time of quadratic programming on large data sets.
     Fourthly, the next contribution of this thesis is the proposed approach for business intelligence management. The approach solves the issues of collecting and filtering stock time series streams and then reducing dimensionality, so that different features can easily be optimized, combined, and tested to execute a fast similarity search based on the application's requirements. Data collection is any process of preparing and collecting data, and its purpose is to obtain information to keep on record or to pass on to others. When collecting data, it is important that the collected data are of high quality so that they can be reliably used as the basis for decisions. In this approach, data are primarily collected to provide information. The collected data can not only be stored in storage space but also analyzed and used for monitoring or evaluation purposes. Business intelligence plays an important role in effective decision making to improve business performance and opportunities by understanding the organization's environment through the systematic processing of information. We consider the business intelligence model to be a group of tasks: gathering historical data, filtering the necessary data, and using them to predict future values. This model helps to improve the performance of the organization.