This is my first Kaggle competition. Advice online says to start with the "101"-style tutorial competitions, so I picked an easy one to get going.
For some reason my NumPy build wasn't parallelizing matrix operations automatically, and Python's multithreading can't run on multiple cores at once anyway (the GIL serializes Python bytecode), so I wrote a multiprocess version instead.
Here is the code:
import multiprocessing
import csv
from numpy import *

def readTrainFile():
    # Load train.csv; the first column is the label, the rest are pixels.
    m = []
    with open('train.csv', 'rb') as trainer:
        rawfile = csv.reader(trainer)
        for row in rawfile:
            m.append(row)
    del m[0]  # drop the header row
    for row in m:
        row[0] = int(row[0])
        # Binarize the pixels: any non-zero value becomes 1.
        for i in xrange(1, size(row)):
            row[i] = 1 if row[i] != '0' else 0
    return mat(m)

def readTestFile():
    # Same as readTrainFile, but test.csv has no label column.
    m = []
    with open('test.csv', 'rb') as tester:
        rawfile = csv.reader(tester)
        for row in rawfile:
            m.append(row)
    del m[0]  # drop the header row
    for row in m:
        for i in xrange(size(row)):
            row[i] = 1 if row[i] != '0' else 0
    return mat(m)

def worker(trainMat, unknownMat, k, cpu_id, pipe):
    # Classify every sample in unknownMat by brute-force KNN and
    # send the predicted labels back through the pipe.
    trainSize = trainMat.shape[0]
    unknownSize = unknownMat.shape[0]
    sorter = mat(zeros((trainSize, 2)))
    result = []
    for no in xrange(unknownSize):
        voter = zeros((10, 1))
        sample = unknownMat[no, :]
        # Squared Euclidean distance to every training sample.
        comMat = tile(sample, (trainSize, 1)) - trainMat[:, 1:]
        comMat = mat(array(comMat) ** 2)
        sorter[:, 0] = trainMat[:, 0]       # labels
        sorter[:, 1] = comMat.sum(axis=1)   # distances
        sortedId = sorter[:, 1].argsort(axis=0)
        # The k nearest neighbours vote for a digit.
        for i in xrange(k):
            vote = int(sorter[:, 0][sortedId[i]])
            voter[vote] = voter[vote] + 1
        result.append(voter.argmax())
        print "This is sample", no, "on CPU No.", cpu_id
    pipe.send(array(result))

def saveRes(result):
    # Write the submission file: ImageId, Label.
    with open('res.csv', 'wb') as resFile:
        writer = csv.writer(resFile)
        writer.writerow(['ImageId', 'Label'])
        for i, label in enumerate(result.tolist()):
            writer.writerow([i + 1, int(label)])

def collector(r_pipe, pipes):
    # Gather each worker's partial result, in order, and forward the whole.
    res = []
    for p in pipes:
        res.extend(p.recv().tolist())
    r_pipe.send(array(res))

if __name__ == "__main__":
    train_set = readTrainFile()
    test_set = readTestFile()
    k = 1
    print "Read file ok! k =", k
    cpu_n = multiprocessing.cpu_count()
    b_size = test_set.shape[0] / cpu_n  # integer division in Python 2
    process = []
    pipes = []
    print b_size, cpu_n
    for i in xrange(cpu_n):
        pipe = multiprocessing.Pipe()
        # The last process also takes the remainder, so no test sample is
        # silently dropped when the test set size is not divisible by cpu_n.
        hi = (i + 1) * b_size if i < cpu_n - 1 else test_set.shape[0]
        p = multiprocessing.Process(target=worker,
                args=(train_set, test_set[i * b_size:hi, :], k, i, pipe[0]))
        process.append(p)
        pipes.append(pipe[1])
        p.start()
    pipe = multiprocessing.Pipe()
    r_pipe = pipe[1]
    collect_process = multiprocessing.Process(target=collector,
            args=(pipe[0], pipes))
    collect_process.start()
    res = r_pipe.recv()
    saveRes(res)
    # Workers exit on their own once their results are received,
    # so a clean join replaces forcibly terminating them.
    for p in process:
        p.join()
    collect_process.join()
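The tile/square/sum distance step inside worker can be written more compactly with NumPy broadcasting, which subtracts a row vector from every row of a matrix without materializing the tiled copy. A toy sketch on made-up data (the 4-sample "training set" and function name are mine), in Python 3 syntax:

```python
import numpy as np

# Toy training set: first column is the label, the rest are features.
train = np.array([
    [0, 0, 0],
    [0, 0, 1],
    [1, 5, 5],
    [1, 5, 6],
])
labels, feats = train[:, 0], train[:, 1:]

def knn_predict(sample, k=1):
    # Broadcasting: (n_train, d) minus (d,) subtracts row-wise; no tile needed.
    dists = ((feats - sample) ** 2).sum(axis=1)
    # The k nearest neighbours vote; bincount+argmax picks the majority label.
    nearest = labels[np.argsort(dists)[:k]]
    return np.bincount(nearest).argmax()

print(knn_predict(np.array([0, 2])))  # -> 0, closest to the label-0 points
print(knn_predict(np.array([5, 5])))  # -> 1, closest to the label-1 points
```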
Using all of the training samples, a full run takes about 4 hours. It uses roughly 2.5 GB of memory, with about 500 MB of fluctuation while running (probably my sloppy Python: objects keep getting created and released in the middle; I will revise this code later). It is about 3x faster than the single-process version. My machine is a quad-core AMD A8 laptop with 8 GB of RAM, and NumPy came from the Ubuntu package manager. Would rebuilding it against Intel's MKL help? That is something else to try later. This code scores 0.96271 accuracy, which places fairly low on the leaderboard. The accuracy is identical whether k is 1 or 3; maybe a larger k would do better? But since KNN is so inefficient, I do not plan to keep pushing on this approach. Perhaps I should try some other algorithms. I saw people getting the accuracy to 100%, which is just absurd.
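On the question of k: rather than resubmitting to Kaggle for each candidate value, a slice of train.csv can be held out as a validation set and the candidates compared locally. A hedged sketch with small synthetic stand-in data (the data, knn_accuracy, and best_k are all my inventions for illustration; the real run would load train.csv), in Python 3 syntax:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in data: 200 samples, 10 binary features, labels in 0-2.
X = rng.integers(0, 2, size=(200, 10))
y = X[:, :3].sum(axis=1) % 3  # synthetic label tied to the features

# Hold out the last 50 samples for validation.
X_tr, y_tr = X[:150], y[:150]
X_val, y_val = X[150:], y[150:]

def knn_accuracy(k):
    # Fraction of held-out samples the k-NN vote classifies correctly.
    correct = 0
    for sample, true_label in zip(X_val, y_val):
        dists = ((X_tr - sample) ** 2).sum(axis=1)
        votes = y_tr[np.argsort(dists)[:k]]
        correct += (np.bincount(votes).argmax() == true_label)
    return correct / len(y_val)

scores = {k: knn_accuracy(k) for k in (1, 3, 5, 7)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```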
This first Kaggle attempt was quite rewarding; the main takeaway was getting hands-on experience with Python multiprocessing in a real application. This was only an exploratory test, and there is plenty of room to improve it later.