The MNIST is a set of handwritten digits images. You can download it by using Makefile
:
%.gz:
wget http://yann.lecun.com/exdb/mnist/$*.gz
%.idx: %.gz
gzip -d $*.gz
mv $* $*.idx
prepare: t10k-images-idx3-ubyte.idx t10k-labels-idx1-ubyte.idx train-images-idx3-ubyte.idx train-labels-idx1-ubyte.idx
The size of each image is 28×28 pixels.
The IDX File Format
The MNIST dataset are stored in IDX file format. The basic format is:
magic number
size in dimension 0
size in dimension 1
size in dimension 2
.....
size in dimension N
data
The first 4 bytes of a IDX file is magic number
:
typedef struct magic_number_s {
char byte1; // always 0
char byte2; // always 0
char type;
char dimentions;
} magic_number_t;
The first 2 bytes of magic number
are always 0. The third byte codes the type of the data:
0x08: unsigned byte
0x09: signed byte
0x0B: short (2 bytes)
0x0C: int (4 bytes)
0x0D: float (4 bytes)
0x0E: double (8 bytes)
For MNIST dataset, the type is unsigned byte
.
The 4-th byte codes the number of dimensions of the vector/matrix. For image, the number of dimension is 3; for label, the number of dimension is 1.
The handwrite digits
is formated as:
typedef struct images_s{
magic_number_t magic_number;
unsigned int number_of_images;
unsigned int number_of_rows;
unsigned int number_of_columns;
unsigned char images[1]; // 28x28x[number of images]
} images_t;
The size of a single image in MNIST dataset is 28*28
. Get the image date of a given index
by:
#define get_img(head, index) (head + 28*28*index)
The label, represent the results(digits) of handwrite digits
, is formated as:
typedef struct labels_s{
magic_number_t magic_number;
unsigned int number_of_items;
unsigned char labels[1]; // [number of labels]
} labels_t;
Bigendian
The MNIST dataset is stored in bigendian. If you want to get the right number of items
, which is unsigned int
, you may need to convert it to littleendian.
// The sizes in each dimension are 4-byte integers (MSB first, high endian, like in most non-Intel processors).
int is_bigendian() {
int i = 1;
char *p = (char *)&i;
if (p[0] == 1)
return 0;
else
return 1;
}
unsigned int bit32conversion(unsigned int num) {
return ((num>>24)&0xff) | ((num<<8)&0xff0000) | ((num>>8)&0xff00) | ((num<<24)&0xff000000);
}
if(!is_bigendian()) {
number_of_items = bit32conversion(*number_of_items);
}
The Distance of Images
There are multiple ways to caculate the distance of two images, one simple way is to using Euclidean distance.
// caculate the euclidean distance
double distance(unsigned char img1[], unsigned char img2[]) {
int i,j;
double sum = 0, value;
for(i = 0, j= 28*28; i < j; i++) {
value = img1[i] - img2[i];
sum = sum + value*value;
}
return sum;
}
Multi-thread Design
In this design, finding the k-Nearest Neighbors
in the train set for all the images in the test set is a time consuming process. We can design multi-thread program, in which each thread compute only a subset of the test set. The working structure for the thread can be designed as:
// the k-nearest neighbors is here.
typedef struct knn_s{
int index;
double distance;
} knn_t;
typedef struct cfg_s{
knn_t * knn;
unsigned int number_of_neighbor;
unsigned int k;
unsigned int train_start, train_stop, test_start, test_stop;
unsigned int total;
unsigned int * hit;
unsigned int thread;
} cfg_t;
In the multi-thread design, the program can utilize all the cores of the CUPs.
Performance Evaluate
The performance we evaluate here is the hit ratio
, which indicate the percentage the program success in telling the right digit from the image.
k | hit rate (%) |
---|---|
1 | 97.91 |
2 | 96.27 |
3 | 97.05 |
4 | 96.82 |
5 | 96.88 |
6 | 96.77 |
7 | 96.94 |
8 | 96.70 |
9 | 96.59 |
10 | 96.65 |
The best performance is get when k = 3
.
The complete source code can be got from: https://github.com/janzhou/cap5610