Parsing the MNIST Handwritten Digit Dataset

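1. Introduction

I have recently been working on MNIST handwritten digit recognition. The official MNIST dataset is distributed in the binary .idx3-ubyte format (.idx1-ubyte for the labels), which a program cannot read directly as images, so the files have to be parsed first.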

MNIST dataset: http://yann.lecun.com/exdb/mnist/

2. Dataset description

The official site lists the following files:

train-images-idx3-ubyte.gz: training set images (9912422 bytes)
train-labels-idx1-ubyte.gz: training set labels (28881 bytes)
t10k-images-idx3-ubyte.gz: test set images (1648877 bytes)
t10k-labels-idx1-ubyte.gz: test set labels (4542 bytes)

There are four files in total: the training images, the training labels, the test images, and the test labels.
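The downloads are gzip-compressed, so they either have to be unpacked first or read through Python's gzip module. The following is a minimal sketch of the second option. The helper name read_idx_bytes and the tiny stand-in file are assumptions made for illustration; they are not part of the official site or of the code later in this post.

```python
import gzip
import os
import tempfile


def read_idx_bytes(path):
    """Return the raw bytes of an IDX file, decompressing .gz files on the fly.

    Hypothetical helper: picks gzip.open for paths ending in .gz,
    and the built-in open for already unpacked files.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as f:
        return f.read()


# Self-contained demonstration on a tiny stand-in file (not real MNIST data).
payload = bytes([0, 0, 8, 1, 0, 0, 0, 2, 5, 0])
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, "demo-labels-idx1-ubyte.gz")
with gzip.open(gz_path, "wb") as f:
    f.write(payload)

restored = read_idx_bytes(gz_path)
print(restored == payload)
```

With the real dataset, the same helper would be pointed at a path such as train-labels-idx1-ubyte.gz, wherever the download was saved.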

The official site describes the label and image file formats as follows:

TRAINING SET LABEL FILE (train-labels-idx1-ubyte):

[offset] [type]          [value]          [description]
0000     32 bit integer  0x00000801(2049) magic number (MSB first)
0004     32 bit integer  60000            number of items
0008     unsigned byte   ??               label
0009     unsigned byte   ??               label
……

The labels values are 0 to 9.
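The label layout above can be sketched in a few lines: an 8-byte header (magic number and item count, both big-endian), followed by one byte per label. The three-label byte string below is a hypothetical stand-in built in memory, not real MNIST data.

```python
import struct

# Hypothetical three-label file built in memory:
# magic number 2049, item count 3, then the labels 5, 0, 4.
label_bytes = struct.pack(">ii", 2049, 3) + bytes([5, 0, 4])

# ">ii" reads two big-endian (MSB first) 32-bit integers from offset 0.
magic, count = struct.unpack_from(">ii", label_bytes, 0)
if magic != 0x00000801:
    raise ValueError("not an IDX1 label file")

# Labels start at offset 8, one unsigned byte each.
labels = list(label_bytes[8:8 + count])
print(magic, count, labels)
```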

TRAINING SET IMAGE FILE (train-images-idx3-ubyte):

[offset] [type]          [value]          [description]
0000     32 bit integer  0x00000803(2051) magic number
0004     32 bit integer  60000            number of images
0008     32 bit integer  28               number of rows
0012     32 bit integer  28               number of columns
0016     unsigned byte   ??               pixel
0017     unsigned byte   ??               pixel
……

Pixels are organized row-wise. Pixel values are 0 to 255. 0 means background (white), 255 means foreground (black).
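Because the header is 16 bytes and pixels are stored row by row, the byte offset of any pixel follows directly from the image index, row, and column. The sketch below works through that arithmetic. The function pixel_offset and the two 2x3 demo images are assumptions for illustration only.

```python
import struct

HEADER_SIZE = 16  # four big-endian 32-bit integers: magic, images, rows, columns


def pixel_offset(index, row, col, rows, cols):
    """Byte offset of pixel (row, col) in image number `index` (all zero-based)."""
    return HEADER_SIZE + index * rows * cols + row * cols + col


# Hypothetical file with two 2x3 images whose pixel values are 0..11.
image_bytes = struct.pack(">iiii", 2051, 2, 2, 3) + bytes(range(12))
magic, n_images, rows, cols = struct.unpack_from(">iiii", image_bytes, 0)

# Third pixel of the first row of the second image.
value = image_bytes[pixel_offset(1, 0, 2, rows, cols)]
print(value)

# For 28x28 MNIST images, the second image starts at 16 + 784.
second_image_start = pixel_offset(1, 0, 0, 28, 28)
print(second_image_start)
```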

3. Parsing the binary files

Binary files can be parsed in Python with the standard struct module. For the basic functions of struct, see: https://wnxy.xyz/2020/10/17/Daily-usage-of-Struct-module-in-Python/
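As a quick illustration of the struct calls that matter for this format, the sketch below decodes the first eight bytes of an image file header. The byte string is typed in by hand from the magic number and image count given in the format description; the variable names are hypothetical.

```python
import struct

# First 8 bytes of an image file header: 0x00000803 (2051) and 0x0000EA60 (60000).
raw = bytes([0x00, 0x00, 0x08, 0x03, 0x00, 0x00, 0xEA, 0x60])

# ">" selects big-endian (MSB first), which is what the IDX format uses.
big = struct.unpack_from(">i", raw, 0)[0]

# "<" selects little-endian and yields a wrong value for the same bytes.
little = struct.unpack_from("<i", raw, 0)[0]

# The third argument of unpack_from is the byte offset to start reading at.
count = struct.unpack_from(">i", raw, 4)[0]

# calcsize reports how many bytes a format string consumes.
header_size = struct.calcsize(">ii")

print(big, little, count, header_size)
```

Reading with the wrong byte order does not raise an error, so checking the magic number is a cheap way to catch it.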

The full parsing code for the training set is as follows:

import numpy as np
import matplotlib.pyplot as plt
import struct

file1 = "./MNIST_data/train-images.idx3-ubyte"
file2 = "./MNIST_data/train-labels.idx1-ubyte"

def train_images_ana(filepath):
    """Parse an image file in .idx3-ubyte format."""
    # Read the whole file in binary mode
    with open(filepath, 'rb') as fbj:
        bin_data = fbj.read()
    offset = 0
    # Header: four big-endian 32-bit integers
    magic_num, image_num, rows_num, column_num = struct.unpack_from('>iiii', bin_data, offset)
    offset += struct.calcsize('>iiii')
    imgsize = image_num * rows_num * column_num
    fmt_image = '>' + str(imgsize) + 'B'  # the training set holds 60000*28*28 unsigned bytes
    images = struct.unpack_from(fmt_image, bin_data, offset)
    img = np.reshape(images, (image_num, rows_num * column_num))  # build a 60000 x 784 matrix
    return img

def train_labels_ana(filepath):
    """Parse a label file in .idx1-ubyte format."""
    # Read the whole file in binary mode
    with open(filepath, 'rb') as fbj:
        bin_data = fbj.read()
    offset = 0
    # Header: two big-endian 32-bit integers
    magic_num, items_num = struct.unpack_from('>ii', bin_data, offset)
    offset += struct.calcsize('>ii')
    fmt_label = '>' + str(items_num) + 'B'  # one unsigned byte per label
    labels = struct.unpack_from(fmt_label, bin_data, offset)
    label = np.reshape(labels, [items_num])
    return label

if __name__ == '__main__':
    imgs = train_images_ana(file1)
    # Show the first 10 handwritten digits as grayscale images
    for i in range(10):
        img = np.reshape(imgs[i], [28, 28])
        plt.imshow(img, cmap='gray')
        plt.show()
    labels = train_labels_ana(file2)
    # print(labels)
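Unpacking every pixel through struct builds a very large Python tuple before NumPy sees the data. One possible alternative, sketched here, is to parse only the header with struct and let np.frombuffer read the pixel bytes directly. This swaps in a different technique from the code above. The function parse_idx_images and the synthetic demo bytes are assumptions for illustration.

```python
import struct

import numpy as np


def parse_idx_images(bin_data):
    """Return an (images, rows*cols) uint8 array from raw IDX3 bytes.

    Hypothetical helper: header via struct, pixel data via np.frombuffer.
    """
    magic, count, rows, cols = struct.unpack_from(">iiii", bin_data, 0)
    if magic != 2051:
        raise ValueError("not an IDX3 image file")
    pixels = np.frombuffer(
        bin_data,
        dtype=np.uint8,
        count=count * rows * cols,
        offset=struct.calcsize(">iiii"),  # skip the 16-byte header
    )
    return pixels.reshape(count, rows * cols)


# Synthetic stand-in: two 2x3 images with pixel values 0..11 (not real MNIST data).
demo = struct.pack(">iiii", 2051, 2, 2, 3) + bytes(range(12))
images = parse_idx_images(demo)
print(images.shape, images.dtype)
```

The array returned this way is a read-only view of the input bytes, so call .copy() on it before modifying pixels in place.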

Author: wnxy
Link: https://wnxy.xyz/2020/10/18/Data-analysis-of-MNIST-handwritten-character-set/
Copyright Notice: All articles in this blog are licensed under CC BY-NC-SA 4.0 unless stating additionally.