Abstract
With the development of location-based social networks, the volume of trajectory data with spatio-temporal and textual attributes has grown exponentially, and the problem of low data quality has become increasingly prominent. High-quality check-in data lets researchers extract rich and meaningful knowledge, so data preprocessing is essential for using check-in big data effectively. Check-in data suffers from several quality problems: high redundancy, simultaneous check-ins, and large temporal and spatial spans between check-ins. As a result, existing data preprocessing pipelines and methods cannot be applied directly. Based on these characteristics, we propose a preprocessing pipeline tailored to check-in data. Averaging eliminates simultaneous check-ins from a check-in trajectory. An entropy-based timestamp interval threshold is learned and used to split check-in trajectories, which addresses their large time span. A density-based clustering method layers the check-in trajectories, which addresses their large spatial span. Experiments on real check-in trajectory data evaluate the preprocessing in terms of two aspects, outliers and layering quality. The results show that the pipeline separates check-in trajectories at different spatial granularities, which lays a foundation for subsequent trajectory analysis and mining.
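The first two steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tuple layout `(timestamp, lat, lon)`, the function names `merge_simultaneous` and `split_by_gap`, and the fixed `max_gap` value are all assumptions made for illustration. In particular, the abstract describes learning the interval threshold from entropy, whereas this sketch simply takes the threshold as a given parameter, and the density-based layering step is omitted.

```python
from collections import defaultdict


def merge_simultaneous(checkins):
    """Average the coordinates of check-ins that share a timestamp.

    `checkins` is an iterable of (timestamp, lat, lon) tuples (assumed layout).
    Returns one averaged point per timestamp, sorted by time.
    """
    groups = defaultdict(list)
    for t, lat, lon in checkins:
        groups[t].append((lat, lon))
    merged = []
    for t in sorted(groups):
        pts = groups[t]
        lat = sum(p[0] for p in pts) / len(pts)
        lon = sum(p[1] for p in pts) / len(pts)
        merged.append((t, lat, lon))
    return merged


def split_by_gap(trajectory, max_gap):
    """Split a time-sorted trajectory wherever the gap between
    consecutive timestamps exceeds `max_gap`.

    `max_gap` is a hypothetical fixed threshold; the paper learns it
    from entropy, which is not reproduced here.
    """
    segments = []
    current = []
    for point in trajectory:
        if current and point[0] - current[-1][0] > max_gap:
            segments.append(current)
            current = []
        current.append(point)
    if current:
        segments.append(current)
    return segments


# Toy data: two check-ins at t=0, one at t=60, one much later.
checkins = [(0, 10.0, 20.0), (0, 12.0, 22.0), (60, 11.0, 21.0), (10000, 50.0, 60.0)]
merged = merge_simultaneous(checkins)
segments = split_by_gap(merged, max_gap=3600)
```

With this toy input, the two check-ins at timestamp 0 collapse into a single averaged point, and the one-hour threshold separates the late check-in into its own sub-trajectory.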