Teacher, why do I get a MemoryError when running grid search on the MNIST dataset?

Source: 9-8 OvR and OvO

scientist272

2018-09-12

Here is the code:

import numpy as np
from sklearn.datasets import fetch_mldata

# Reduce the dimensionality of the data with PCA
# (note: fetch_mldata was removed in newer scikit-learn releases;
#  fetch_openml('mnist_784') is the modern replacement)
mnist = fetch_mldata('MNIST original')
X, y = mnist['data'], mnist['target']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)

from sklearn.decomposition import PCA
pca = PCA(0.9)  # keep enough components to explain 90% of the variance
pca.fit(X_train)
X_train_reduction = pca.transform(X_train)
X_test_reduction = pca.transform(X_test)

from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
def PolynomialLogisticRegression(degree = 1, C = 0.1):
    return Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('std_scaler', StandardScaler()),
        ('log_reg', LogisticRegression(C=C))
    ])

# The algorithm to be tuned with grid search
poly_log_reg = PolynomialLogisticRegression()

# Parameter grid to search over
C_PARM = [0.1, 0.2, 0.3, 0.4, 0.5]
param_grid = [
    {
        'poly__degree': [i for i in range(1, 11)],
        'log_reg__C': C_PARM
    }
]

# Instantiate GridSearchCV and run the search
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(poly_log_reg, param_grid)
grid_search.fit(X_train_reduction, y_train)


After running for 17 minutes it crashed with a MemoryError.


1 Answer

liuyubobobo

2018-09-12

MNIST is 28*28 = 784-dimensional data. With polynomial features and poly__degree going up to 10, you get on the order of 784^10 = 87732524600823436081182539776 features. Even a single sample would carry that many features. Assuming each feature takes only 8 bits, work out roughly how much memory that needs.


======


I did a quick estimate: on the order of 10^20 GB. Never mind memory, even your disk is nowhere near big enough :)
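The instructor's back-of-the-envelope arithmetic can be checked in a few lines (using his simplified count of 784^10 feature values; the exact combinatorial count of degree-10 monomials differs, but the order of magnitude is similarly astronomical):

```python
# Rough memory estimate for degree-10 polynomial features on 784 raw inputs,
# for a SINGLE sample, at 1 byte (8 bits) per feature value.
n_features = 784 ** 10               # ~8.77e28 feature values
bytes_per_feature = 1                # generous assumption: one byte each
total_gb = n_features * bytes_per_feature / 1024**3
print(f"{total_gb:.3e} GB")          # on the order of 1e20 GB
```

No machine's RAM or disk comes anywhere close, which is why the process dies with a MemoryError long before the grid search finishes.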

liuyubobobo
replied to
scientist272
With that many features you should not be using polynomial features as a preprocessing step! You can barely reduce the dimensionality fast enough, and polynomial features do the opposite: they increase it :)
2018-09-12
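The fix the instructor suggests, dropping PolynomialFeatures and searching only the regularization strength of a plain logistic regression on PCA-reduced data, can be sketched as follows. This is a minimal illustration, not the course's exact code; it uses scikit-learn's small built-in load_digits dataset as a quick stand-in for MNIST, and the parameter values are illustrative:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Small 8x8 digits dataset: a fast stand-in for the full MNIST data
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# PCA keeps enough components for 90% explained variance: dimensionality
# goes DOWN, unlike PolynomialFeatures, which blows it up combinatorially.
pipe = Pipeline([
    ('pca', PCA(0.9)),
    ('std_scaler', StandardScaler()),
    ('log_reg', LogisticRegression(max_iter=1000)),
])

param_grid = {'log_reg__C': [0.1, 0.2, 0.3, 0.4, 0.5]}  # search C only
grid_search = GridSearchCV(pipe, param_grid, cv=3)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_, grid_search.score(X_test, y_test))
```

With only 5 candidate models instead of 50, and no feature explosion, this finishes in seconds on the toy data and stays tractable on the real MNIST arrays.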

Course: Python 3 Introduction to Machine Learning: Classic Algorithms and Applications
