Abstract
This paper proposes a filter-based attribute selection and de-correlation method for choosing modeling attributes, which greatly reduces the collinearity among attributes. When assigning classes, logistic regression uses 0.5 as its default decision threshold, but this rule performs poorly on imbalanced data sets. The paper therefore uses a rescaling strategy to study the application of logistic regression to imbalanced data sets, and carries out an empirical analysis on sample data, provided by an operator, that describes the behavior and consumption characteristics of spam-SMS users. The results show that, once the modeling variables have been selected by filter-based attribute selection and de-correlation, a logistic regression learner based on the rescaling strategy achieves good accuracy and broad applicability.
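The rescaling idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common formulation in which a sample is labeled positive when its predicted odds p / (1 - p) exceed the observed class ratio m+ / m-, which is equivalent to replacing the 0.5 threshold with m+ / (m+ + m-). The function names and the toy numbers below are hypothetical and are not taken from the paper's data or implementation.

```python
def rescaled_threshold(train_labels):
    """Return the threshold m+ / (m+ + m-) implied by rescaling.

    p / (1 - p) > m+ / m-  is equivalent to  p > m+ / (m+ + m-).
    """
    m_pos = sum(1 for y in train_labels if y == 1)
    m_neg = sum(1 for y in train_labels if y == 0)
    if m_pos == 0 or m_neg == 0:
        raise ValueError("both classes must appear in the training labels")
    return m_pos / (m_pos + m_neg)


def predict_with_threshold(proba_pos, threshold):
    """Label a sample 1 when its predicted positive-class probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in proba_pos]


# Hypothetical imbalanced training labels: 5 positives, 95 negatives.
train_labels = [1] * 5 + [0] * 95

# Hypothetical positive-class probabilities from an already fitted logistic regression.
proba_pos = [0.03, 0.08, 0.40, 0.60]

threshold = rescaled_threshold(train_labels)                   # 5 / 100 = 0.05
rescaled_labels = predict_with_threshold(proba_pos, threshold)
default_labels = predict_with_threshold(proba_pos, 0.5)
```

With these toy values the default 0.5 rule flags only the last sample, whereas the rescaled rule also flags the second and third. This is the behavior one would expect when the minority class is rare and predicted probabilities are pulled toward the majority class.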