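1. Preface
I have recently been working on MNIST handwritten digit recognition. The official MNIST dataset is distributed in the IDX binary format (.idx3-ubyte for images, .idx1-ubyte for labels), which a program cannot read directly, so the files have to be parsed first.
MNIST dataset: http://yann.lecun.com/exdb/mnist/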
2. Dataset description
The official site lists the following files:
train-images-idx3-ubyte.gz: training set images (9912422 bytes)
train-labels-idx1-ubyte.gz: training set labels (28881 bytes)
t10k-images-idx3-ubyte.gz: test set images (1648877 bytes)
t10k-labels-idx1-ubyte.gz: test set labels (4542 bytes)
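There are four files in total: the training images, the training labels, the test images, and the test labels.
The official site describes the label and image file formats as follows: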
TRAINING SET LABEL FILE (train-labels-idx1-ubyte):
[offset] | [type] | [value] | [description] |
---|---|---|---|
0000 | 32 bit integer | 0x00000801(2049) | magic number (MSB first) |
0004 | 32 bit integer | 60000 | number of items |
0008 | unsigned byte | ?? | label |
0009 | unsigned byte | ?? | label |
…… |
The labels values are 0 to 9.
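The label layout above can be read with a few lines of Python. The following is a minimal sketch, assuming the file has already been decompressed and read into a bytes object; the function name parse_idx1_labels is hypothetical and used only for illustration.

```python
import struct


def parse_idx1_labels(data: bytes) -> list:
    """Parse the raw bytes of an uncompressed idx1-ubyte label file.

    Hypothetical helper for illustration; not part of any library.
    """
    # Header: two big-endian 32-bit unsigned integers (magic number, item count).
    magic, count = struct.unpack_from(">II", data, 0)
    if magic != 2049:
        raise ValueError(f"unexpected magic number {magic}, expected 2049")
    # Labels start at offset 8, one unsigned byte each.
    labels = data[8:8 + count]
    if len(labels) != count:
        raise ValueError("file is shorter than the header claims")
    # Iterating over a bytes object yields integers.
    return list(labels)


# Synthetic three-label file built in memory, so no download is needed.
sample = struct.pack(">II", 2049, 3) + bytes([5, 0, 4])
print(parse_idx1_labels(sample))  # [5, 0, 4]
```

For the real training file, the header should report 60000 items, matching the table.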
TRAINING SET IMAGE FILE (train-images-idx3-ubyte):
[offset] | [type] | [value] | [description] |
---|---|---|---|
0000 | 32 bit integer | 0x00000803(2051) | magic number |
0004 | 32 bit integer | 60000 | number of images |
0008 | 32 bit integer | 28 | number of rows |
0012 | 32 bit integer | 28 | number of columns |
0016 | unsigned byte | ?? | pixel |
0017 | unsigned byte | ?? | pixel |
…… |
Pixels are organized row-wise. Pixel values are 0 to 255. 0 means background (white), 255 means foreground (black).
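Because the pixels are stored row by row after a 16-byte header, the byte offset of any pixel follows directly from the table. The sketch below works through that arithmetic on a tiny synthetic file; read_image_header and pixel_at are hypothetical helper names, and the input is assumed to be the uncompressed file contents.

```python
import struct


def read_image_header(data: bytes):
    """Return (count, rows, cols) from an idx3-ubyte header."""
    # Four big-endian 32-bit unsigned integers occupy offsets 0 to 15.
    magic, count, rows, cols = struct.unpack_from(">IIII", data, 0)
    if magic != 2051:
        raise ValueError(f"unexpected magic number {magic}, expected 2051")
    return count, rows, cols


def pixel_at(data: bytes, image_index: int, row: int, col: int) -> int:
    """Look up one pixel using the row-wise layout described above."""
    count, rows, cols = read_image_header(data)
    if not (0 <= image_index < count and 0 <= row < rows and 0 <= col < cols):
        raise IndexError("pixel coordinates out of range")
    # Skip the header, then whole images, then whole rows, then columns.
    offset = 16 + image_index * rows * cols + row * cols + col
    return data[offset]


# Synthetic file: 2 images of 2 x 3 pixels with values 0..11.
sample = struct.pack(">IIII", 2051, 2, 2, 3) + bytes(range(12))
print(pixel_at(sample, 1, 1, 2))  # 11 (last pixel of the second image)
```

For MNIST itself, rows and cols are both 28, so each image occupies 784 consecutive bytes.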
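3. Parsing the binary files
In Python, binary files can be parsed with the struct module from the standard library. For an overview of its basic functions, see: https://wnxy.xyz/2020/10/17/Daily-usage-of-Struct-module-in-Python/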
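As a quick illustration of why the byte order matters, the sketch below unpacks the first eight bytes of an image file header. The bytes are written out by hand from the values in the format table (magic number 2051, 60000 images) rather than read from disk.

```python
import struct

# First 8 bytes of train-images-idx3-ubyte, reconstructed from the format table.
header = bytes([0x00, 0x00, 0x08, 0x03, 0x00, 0x00, 0xEA, 0x60])

# ">" selects big-endian (MSB first); "I" is a 4-byte unsigned integer.
magic, count = struct.unpack(">II", header)
print(magic, count)  # 2051 60000

# Reading the same bytes as little-endian gives a meaningless value.
(wrong_magic,) = struct.unpack("<I", header[:4])
print(wrong_magic)  # 50855936

# calcsize reports how many bytes a format string consumes.
print(struct.calcsize(">II"))  # 8
```

The explicit ">" prefix is the important part: without it, struct uses the machine's native byte order, which is little-endian on most desktop hardware.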
The relevant code is as follows:
    import numpy as np
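A minimal sketch of how such a parser could look with NumPy is given here; it is not necessarily the original listing. It relies on the IDX convention described on the MNIST page: the magic number consists of two zero bytes, a data type code (0x08 for unsigned byte), and the number of dimensions, followed by one big-endian 32-bit size per dimension. The names parse_idx and load_idx are hypothetical.

```python
import gzip
import struct

import numpy as np


def parse_idx(data: bytes) -> np.ndarray:
    """Convert the raw bytes of an unsigned-byte IDX file to a NumPy array."""
    # Magic number: 2 zero bytes, 1 type-code byte, 1 dimension-count byte.
    zero, type_code, ndim = struct.unpack_from(">HBB", data, 0)
    if zero != 0 or type_code != 0x08:
        raise ValueError("not an unsigned-byte IDX file")
    # One big-endian 32-bit size per dimension follows the magic number.
    dims = struct.unpack_from(f">{ndim}I", data, 4)
    offset = 4 + 4 * ndim
    # frombuffer does not copy, so the resulting array is read-only.
    flat = np.frombuffer(data, dtype=np.uint8, offset=offset)
    # reshape raises ValueError if the payload size does not match dims.
    return flat.reshape(dims)


def load_idx(path: str) -> np.ndarray:
    """Read an IDX file from disk, decompressing it if the name ends in .gz."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as f:
        return parse_idx(f.read())


# Synthetic data: 2 images of 2 x 3 pixels, and 3 labels.
images = parse_idx(struct.pack(">IIII", 2051, 2, 2, 3) + bytes(range(12)))
labels = parse_idx(struct.pack(">II", 2049, 3) + bytes([5, 0, 4]))
print(images.shape, labels.shape)  # (2, 2, 3) (3,)
```

With the real files, a call such as load_idx("train-images-idx3-ubyte.gz") would be expected to return an array of shape (60000, 28, 28), assuming the file sits in the working directory. Call .copy() on the result if the array needs to be modified.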