Last week, we analyzed the mobile touch data using Tableau. This week, we will do a simple exercise that uses machine learning to analyze the same data. The classification task we will consider is "whether a touch event is left or right," using sensor measurements as features.
The dataset you will use is:
raw_touch_data.csv
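If you want to sanity-check the dataset before running the full script, the short snippet below (a minimal sketch, relying on the same assumptions the script makes: one header line, the touch-event index in column 28, and the left/right label in column 29) prints its dimensions, the number of touch events, and the label values.

% Quick sanity check of the dataset (assumes one header line, touch
% index in column 28, and label in column 29, as in the main script)
raw = csvread('raw_touch_data.csv', 1, 0);
fprintf('Rows: %d, Columns: %d\n', size(raw,1), size(raw,2));
fprintf('Touch events: %d\n', raw(end,28));
fprintf('Label values: %s\n', mat2str(unique(raw(:,29))'));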
Checkpoints
Checkpoint 1
We have developed a simple Matlab script that tries a variety of classifiers and can include a variety of sensor-reading features.
clear all; close all;
%%%%%%%%%%%%%% Change path %%%%%%%%%%%%%%%%
filename_path = '/change/path/to/file/raw_touch_data.csv';
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%% Choose feature source %%%%%%%%%%%
add_accelerometer = false; % Adds accelerometer features
add_gyroscope = false; % Adds gyroscope features
add_magneticField = true; % Adds magnetic field features
add_gravity = false; % Adds gravity features
add_linearAcceleration = false; % Adds linear acceleration features
add_orientation = false; % Adds azimuth, pitch and roll features
add_light = true; % Adds light value
add_proximity = false; % Adds proximity value
add_studentID = false; % Adds student ID
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%% Choose machine learning classifier parameters %%%%%%%%%%%%%%%
numTrees = 1; % Try different number of trees for the Random Forest classifier
sigma = 1; % Try different values of sigma for the Support Vector Machine classifier
dist = 'normal'; % Try different distributions = {'normal', 'kernel', 'mvmn', 'mn'} for Naive Bayes classifier
K = 20; % Try different values of K for the K-nearest Neighbor classifier
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Main code starts here (you don't need to change anything here)
data = csvread(filename_path,1,0);
numTouch = data(end,28);
feat = []; feat_acc = []; feat_gyr = []; feat_mg = []; feat_grav = []; feat_lacc = []; feat_or = []; feat_light = []; feat_prox = []; feat_sID = [];
for i = 1:numTouch
    temp = find(data(:,28) == i);
    if add_accelerometer == true
        feat_acc = compute3axialFeat(data(temp, [2 3 4]));
    end
    if add_gyroscope == true
        feat_gyr = compute3axialFeat(data(temp, [5 6 7]));
    end
    if add_magneticField == true
        feat_mg = compute3axialFeat(data(temp, [8 9 10]));
    end
    if add_gravity == true
        feat_grav = compute3axialFeat(data(temp, [11 12 13]));
    end
    if add_linearAcceleration == true
        feat_lacc = compute3axialFeat(data(temp, [14 15 16]));
    end
    if add_orientation == true
        feat_or = compute3axialFeat(data(temp, [21 22 23]));
    end
    if add_light == true
        feat_light = mean(data(temp,24));
    end
    if add_proximity == true
        feat_prox = mean(data(temp,25));
    end
    if add_studentID == true
        feat_sID = data(temp(1),26);
    end
    feat = [feat ; feat_acc feat_gyr feat_mg feat_grav feat_lacc feat_or feat_light feat_prox feat_sID];
    label(i) = data(temp(1),29);
end
% Divide data into training and test sets
temp = 1:numTouch;
temp = temp(randperm(length(temp))); % Shuffle data points
train = temp(1:round(0.50*length(temp))); % train samples (50% of data samples)
test = temp(round(0.50*length(temp))+1:end); % test samples (remaining 50% of data samples)
% Classify using K-nearest neighbor
prediction = knnclassify(feat(test,:), feat(train,:), label(train), K);
accuracyKNN = numel(find(prediction == label(test)'))/length(test)*100
% Classify using Naive Bayes
NB = NaiveBayes.fit(feat(train,:), label(train), 'Distribution', dist);
prediction = NB.predict(feat(test,:));
accuracyNB = numel(find(prediction == label(test)'))/length(test)*100
% Classify using SVM
SVMstruct = svmtrain(feat(train,:), label(train), 'kernel_function','rbf','rbf_sigma',sigma);
prediction = svmclassify(SVMstruct, feat(test,:));
accuracySVM = numel(find(prediction == label(test)'))/length(test)*100
% Classify using Random Forest
b = TreeBagger(numTrees, feat(train,:), label(train)');
prediction_cell = predict(b, feat(test,:));
for i = 1:length(prediction_cell)
    prediction(i) = str2num(prediction_cell{i});
end
accuracyRF = numel(find(prediction == label(test)'))/length(test)*100
The script above depends on the utility function below, which computes 16 features (the mean, standard deviation, max, and min of each axis and of the overall magnitude) from a block of 3-axis sensor readings.
function feat = compute3axialFeat(data)
X = data(:,1); % X axis
Y = data(:,2); % Y axis
Z = data(:,3); % Z axis
XYZ = sqrt(X.^2 + Y.^2 + Z.^2); % Magnitude of the 3-axis vector
% Compute mean, standard deviation, max and min features
feat = [mean(X) mean(Y) mean(Z) mean(XYZ) std(X) std(Y) std(Z) std(XYZ) max(X) max(Y) max(Z) max(XYZ) min(X) min(Y) min(Z) min(XYZ)];
end
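To get a feel for what this function returns, you can call it on the readings of a single touch event, for example the accelerometer columns (2-4) of the first touch (a minimal sketch, assuming data has already been loaded as in the main script):

% Example: the 16 features for the first touch event's accelerometer
% readings (columns 2-4, as in the main script)
rows = find(data(:,28) == 1);
f = compute3axialFeat(data(rows, [2 3 4]));
disp(f) % means, stds, maxes, mins of X, Y, Z, and the magnitude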
Download these two Matlab files and store them in the same folder. Also, copy the dataset file (i.e., raw_touch_data.csv) into the same folder.
Get this Matlab script to run on a computer. The expected output in the Matlab Command Window should look something like the example below. Take a screenshot and submit it.
The four classifiers are: (1) K-nearest neighbor (KNN), (2) Naive Bayes (NB), (3) Support Vector Machine (SVM), and (4) Random Forest (RF). As you can see, the performance with the default parameters is above chance (50%), but we can definitely improve.
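Note that the script uses a single random 50/50 split, so the accuracy numbers will vary a little from run to run. If you want more stable numbers while experimenting, one optional approach (not required for the checkpoint) is to average over several random splits. A minimal sketch for KNN, to be run after the main script so that feat, label, numTouch, and K already exist:

% Optional: average KNN accuracy over several random 50/50 splits to
% smooth out the run-to-run variance of a single split
numRuns = 10;
acc = zeros(1, numRuns);
for r = 1:numRuns
    idx = randperm(numTouch);
    tr = idx(1:round(0.5*numTouch));
    te = idx(round(0.5*numTouch)+1:end);
    pred = knnclassify(feat(te,:), feat(tr,:), label(tr), K);
    acc(r) = mean(pred == label(te)') * 100;
end
fprintf('KNN over %d splits: mean %.1f%%, std %.1f%%\n', numRuns, mean(acc), std(acc));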
Checkpoint 2
Let’s try changing some parameters and see what happens to the accuracy performance. Change the parameter for the K-nearest neighbor classifier (i.e., K) from 20 to 10, and make sure the “light” measurement is included as a feature (add_light set to true, which is already the default in the script above). Run the script again. Has the accuracy improved? Take a screenshot of the new performance numbers.
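In the configuration block at the top of the script, that amounts to:

% Checkpoint 2 settings
K = 10;           % was 20; number of neighbors for K-NN
add_light = true; % include the mean light value as a feature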
Challenges
Test different combinations of features and training parameters. For each of the four classification algorithms, see if you can find a combination of features and parameter values that achieves really good accuracy performance. (A parameter-sweep sketch you can adapt appears after this list.)
1. K-NN
Report the highest accuracy number you’ve managed to achieve. Report the features and parameters you used.
2. Naive Bayes
Report the highest accuracy number you’ve managed to achieve. Report the features and parameters you used.
3. Support Vector Machine
Report the highest accuracy number you’ve managed to achieve. Report the features and parameters you used.
4. Random Forest
Report the highest accuracy number you’ve managed to achieve. Report the features and parameters you used.
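One systematic way to search is to loop over candidate parameter values on a fixed train/test split. Below is a minimal sketch for K-NN that reuses the variables from the main script (feat, label, train, test); the same pattern can be adapted to sigma for the SVM, numTrees for the Random Forest, and dist for Naive Bayes.

% Optional helper: sweep K for the K-NN classifier on the existing
% train/test split and report the best value found
bestAcc = 0; bestK = NaN;
for Ktry = 1:2:49 % odd K values avoid ties in a two-class problem
    pred = knnclassify(feat(test,:), feat(train,:), label(train), Ktry);
    acc = mean(pred == label(test)') * 100;
    if acc > bestAcc
        bestAcc = acc; bestK = Ktry;
    end
end
fprintf('Best K = %d with accuracy %.1f%%\n', bestK, bestAcc);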