From Eliza to XiaoIce: challenges and opportunities with social chatbots


English title: From Eliza to XiaoIce: challenges and opportunities with social chatbots
Authors: Heung-yeung SHUM; Xiao-dong HE; Di LI (Microsoft Corporation)
Keywords: Conversational system; Social chatbot; Intelligent personal assistant; Artificial intelligence; XiaoIce
Journal title (Chinese): JZUS
Journal title (English): 信息与电子工程前沿(英文) (Frontiers of Information Technology & Electronic Engineering, English edition)
Affiliation: Microsoft Corporation
Publication date: 2018-01-03
Publisher: Frontiers of Information Technology & Electronic Engineering
Year: 2018
Issue: v.19
Language: English
Pages: JZUS201801004
Page count: 17
CN: 01
ISSN: 33-1389/TP
Classification number: 13-29

Abstract

Conversational systems have come a long way since their inception in the 1960s. After decades of research and development, we have seen progress from Eliza and Parry in the 1960s and 1970s, to task-completion systems as in the Defense Advanced Research Projects Agency (DARPA) Communicator program in the 2000s, to intelligent personal assistants such as Siri in the 2010s, to today's social chatbots like XiaoIce. Social chatbots' appeal lies not only in their ability to respond to users' diverse requests, but also in their ability to establish an emotional connection with users. The latter is done by satisfying users' need for communication, affection, and social belonging. To further the advancement and adoption of social chatbots, their design must focus on user engagement and take both intellectual quotient (IQ) and emotional quotient (EQ) into account. Users should want to engage with a social chatbot; as such, we define the success metric for social chatbots as conversation-turns per session (CPS). Using XiaoIce as an illustrative example, we discuss key technologies in building social chatbots, from core chat to visual awareness to skills. We also show how XiaoIce can dynamically recognize emotion and engage the user throughout long conversations with appropriate interpersonal responses. As we become the first generation of humans ever living with artificial intelligence (AI), we have a responsibility to design social chatbots to be both useful and empathetic, so they will become ubiquitous and help society as a whole.
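The CPS metric named in the abstract can be illustrated with a small sketch. The abstract does not say how a turn is counted or how sessions are delimited, so the code below assumes that a session is a list of (user message, bot reply) pairs, that each pair counts as one conversation turn, and that CPS is the mean turn count over sessions. The function names `conversation_turns` and `cps` and the sample data are hypothetical and not taken from the paper.

```python
from typing import List, Tuple

# Assumption: one session is a list of (user_message, bot_reply) pairs.
Session = List[Tuple[str, str]]


def conversation_turns(session: Session) -> int:
    """Count turns in one session, treating each exchange pair as one turn."""
    return len(session)


def cps(sessions: List[Session]) -> float:
    """Average conversation-turns per session over a non-empty list of sessions."""
    if not sessions:
        raise ValueError("CPS is undefined for an empty list of sessions")
    total_turns = sum(conversation_turns(s) for s in sessions)
    return total_turns / len(sessions)


# Made-up sample data: three sessions with 2, 1, and 3 turns.
sessions = [
    [("hi", "hello!"), ("how are you?", "doing well, and you?")],
    [("tell me a joke", "why did the chatbot cross the road?")],
    [("good morning", "morning!"), ("any plans?", "chatting with you"), ("bye", "see you")],
]

print(cps(sessions))  # 2.0
```

Under these assumptions a higher CPS means users stay in the conversation longer, which is the engagement signal the abstract proposes as the success metric.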

