Entity identification based on trade names in Web 2.0-based e-commerce

English title: Entity identification based on trade name in e-commerce-based Web2.0
Authors: 安先喜; 田英鑫; 郭子阳; 石胜飞
English authors: AN Xianxi; TIAN Yingxin; GUO Ziyang; SHI Shengfei; School of Economics and Management, Harbin University of Engineering; School of Computer Science and Technology, Harbin Institute of Technology
Keywords: entity; entity identification; e-commerce; algorithm; database; semantics; transaction description; data model
English keywords: entity; entity identification; electronic commerce; algorithm; databases; semantics; transaction description; data models
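Chinese journal name: HEBG
English journal name: Journal of Harbin Engineering University
Affiliations: School of Economics and Management, Harbin Engineering University; School of Computer Science and Technology, Harbin Institute of Technology
Publication date: 2019-06-28 16:47
Publisher: Journal of Harbin Engineering University
Year: 2019
Issue: v.40; No.273
Funding: National Key R&D Program of China (2016YFB1000703); National Natural Science Foundation of China (U1509216, U1866602, 61472099, 61602129)
Language: Chinese
Pages: HEBG201907022
Page count: 6
CN: 07
ISSN: 23-1390/U
Classification number: 152-157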

Abstract

With the emergence of Web 2.0, e-commerce data are often entered by different websites and different users, so the same commodity may have many descriptions. This makes it difficult for users to search for and compare commodities. To address this, the paper classifies commodities based on their trade names so that each class describes one real-world commodity. The proposed system splits each trade name into a set of keywords and classifies commodities by the similarity of their keyword sets. The paper studies keyword-splitting methods, set-based classification methods, keyword weight-setting methods, and relevance feedback. The experimental results show that the proposed method classifies commodities quickly and effectively, and that the weight-setting and relevance-feedback strategies effectively improve the accuracy of entity identification.
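
To make the keyword-set idea concrete, here is a minimal Python sketch of grouping product names by the similarity of their keyword sets. It is not the authors' algorithm, which the abstract does not specify. The tokenizer, the IDF-style weights, the weighted Jaccard measure, the greedy grouping, the 0.5 threshold, and every function name are assumptions made for illustration, and relevance feedback is not modeled.

```python
import re
from collections import Counter
from math import log


def split_keywords(name):
    """Split a product name into a set of lowercase word tokens.

    Assumption: whitespace/punctuation tokenization; Chinese names would
    need a real word segmenter instead.
    """
    return set(re.findall(r"\w+", name.lower()))


def idf_weights(keyword_sets):
    """Assumed weighting scheme: rarer keywords get larger weights."""
    n = len(keyword_sets)
    df = Counter(k for s in keyword_sets for k in s)
    return {k: log(1 + n / c) for k, c in df.items()}


def weighted_jaccard(a, b, w):
    """Weight of shared keywords divided by weight of all keywords."""
    union = sum(w[k] for k in a | b)
    return sum(w[k] for k in a & b) / union if union else 0.0


def group_names(names, threshold=0.5):
    """Greedily put each name into the first group whose representative
    (the group's first member) is at least `threshold` similar."""
    sets = [split_keywords(n) for n in names]
    w = idf_weights(sets)
    groups = []  # lists of indices into `names`
    for i, s in enumerate(sets):
        for g in groups:
            if weighted_jaccard(s, sets[g[0]], w) >= threshold:
                g.append(i)
                break
        else:
            groups.append([i])
    return [[names[i] for i in g] for g in groups]
```

Under these assumptions, `group_names(["Apple iPhone 6 16GB", "iphone 6 apple 16gb", "Samsung Galaxy S5"])` places the two iPhone listings in one group, because their keyword sets are identical, and leaves the Samsung listing in a group of its own.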

