Automatic Proofreading of Wrong Characters for the New Media Field

Chinese title: 面向新媒体领域的错别字自动校对
Authors: GONG Yong-gang (龚永罡); WANG Xin-yu (汪昕宇); FU Jun-ying (付俊英); WANG Yun-qi (王蕴琪)
Keywords: n-gram model; confusion set; support degree; wrongly written characters
Journal code: SDDZ
Journal: Information Technology and Informatization (信息技术与信息化)
Affiliation: School of Computer and Information Engineering, Beijing Technology and Business University
Publication date: 2018-10-25
Publisher: Information Technology and Informatization (信息技术与信息化)
Year: 2018
Issue: No.223
Language: Chinese
Page: SDDZ201810031
Number of pages: 3
CN: 10
ISSN: 37-1423/TN
Classification number: 78-80

Abstract

New media platforms publish a huge volume of original news every day, so manually reviewing content for wrongly written characters is no longer practical. This paper proposes a method that combines an n-gram model with rules. Hundreds of millions of news articles are collected as the training corpus, and the word-segmented corpus is statistically analyzed to build a trigram model library. A confusion set of wrongly written characters is constructed from context, and an optimization method computes the support degree of each confusion word in the target context, enabling automatic detection and correction of wrongly written characters. In the experiments, article error detection reached a recall of 78.9% and an accuracy rate of 85.1%, which the authors take as evidence that the method is practical and broadly applicable.
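The pipeline described in the abstract (trigram statistics over a segmented corpus, a confusion set, and a support score for each candidate in context) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names `train_trigrams`, `support`, and `proofread`, the toy corpus, and the scoring rule, which simply sums the counts of the three trigrams covering the checked position. It stands in for, and is not, the optimization method the paper uses to compute support.

```python
from collections import Counter

PAD = 2  # two boundary tokens on each side so every word sits in three trigrams


def pad(words):
    """Wrap a segmented sentence in boundary tokens."""
    return ["<s>"] * PAD + list(words) + ["</s>"] * PAD


def train_trigrams(sentences):
    """Count every trigram in a corpus of already-segmented sentences."""
    counts = Counter()
    for words in sentences:
        padded = pad(words)
        for i in range(len(padded) - 2):
            counts[tuple(padded[i:i + 3])] += 1
    return counts


def support(counts, padded, i, candidate):
    """Hypothetical support score: total count of the three trigrams that
    cover position i once `candidate` is substituted there."""
    window = padded[:i] + [candidate] + padded[i + 1:]
    return sum(counts[tuple(window[j:j + 3])] for j in range(i - 2, i + 1))


def proofread(counts, confusion, words):
    """Return (index, original, suggestion) for each word whose confusion
    candidate has strictly higher support than the word itself."""
    padded = pad(words)
    corrections = []
    for k, word in enumerate(words):
        candidates = confusion.get(word)
        if not candidates:
            continue
        # The original word is listed first, so ties keep it unchanged.
        options = [word] + sorted(candidates)
        best = max(options, key=lambda c: support(counts, padded, k + PAD, c))
        if best != word:
            corrections.append((k, word, best))
    return corrections


# Toy data, invented for this example only.
corpus = [["我们", "已经", "完成", "工作"], ["他", "已经", "离开", "了"]]
confusion_set = {"以经": {"已经"}}
model = train_trigrams(corpus)
print(proofread(model, confusion_set, ["我们", "以经", "完成", "工作"]))
# prints [(1, '以经', '已经')]
```

A raw count sum is used here only because it is easy to follow; a production system trained on a corpus of the size the abstract mentions would more plausibly use smoothed probabilities and a tuned decision threshold.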

