Monday, April 19, 2010
Bias Vs Variance Dilemma
The bias versus variance dilemma is almost universal across all kinds of data modelling methods. As per Wikipedia, the variance of a random variable is the expected squared deviation of that variable from its expected value; in other words, it measures how widely the values of the variable spread around the mean, taking their probabilities into account. The bias of an estimator is the difference between the estimator's expected value and the true value of the parameter being estimated. A very good article on this topic is available here. Without repeating much of its content, I will simply highlight the key points that make this topic easier to understand.
- Var(x) = E(x^2) - [E(x)]^2, where E(.) is the expectation operator. This can be rewritten as
E(x^2) = Var(x) + [E(x)]^2
If we replace x by e (the approximation error of an estimator), the equation becomes
E(e^2) = Var(e) + [E(e)]^2
MSE = Var(e) + Bias^2
Hence, for a desired mean square error there is a trade-off between the variance and the bias: if one increases, the other must decrease.
- The complexity of an estimator model is expressed in terms of its number of parameters. The effect of complexity on the bias and the variance of the estimator is explained here. In brief:
- A low-complexity model leads to low variance but large bias.
- A highly complex model leads to low bias but large variance.
- Large variance implies that the estimator is too sensitive to the particular data set, so the model has a low generalization capability: excellent performance on the design (or training) data, but poor performance on test data. This is the case of overfitting. Large variance is observed for complex models with a large number of parameters (over-parameterization). Note that, because of the high complexity, the model fits the design data set very well; the error at individual points in the data set is therefore very low and the model has a low bias.
- Large bias implies that the model is too simple, so very few data points lie on the regression curve. The errors at individual points (in a given data set) are therefore high, leading to a large MSE. But since very few points participate in shaping the model, the performance does not differ much across test data sets; hence the model has a low variance. The performance remains roughly the same over design and test data sets.
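To make the trade-off concrete, here is a small MATLAB sketch (my own illustration, not from the article linked above) that repeatedly fits a simple and a complex polynomial to noisy samples of a sine curve and estimates the bias^2 and variance of each model's prediction at a single test point. All names and parameter choices (degrees 1 and 7, 200 repetitions, noise level 0.3, test point 0.25) are assumptions made for this example only.
% Illustrative sketch: bias/variance of a simple vs. a complex polynomial fit.
% All parameter choices below are assumptions for illustration only.
nRuns  = 200;                 % number of independent training sets
nTrain = 20;                  % points per training set
sigma  = 0.3;                 % noise standard deviation
x0     = 0.25;                % test point at which bias/variance are measured
f      = @(x) sin(2*pi*x);    % true underlying function

predSimple  = zeros(nRuns,1); % degree-1 (low complexity) predictions at x0
predComplex = zeros(nRuns,1); % degree-7 (high complexity) predictions at x0

for r = 1:nRuns
    x = rand(nTrain,1);                  % random training inputs in [0,1]
    y = f(x) + sigma*randn(nTrain,1);    % noisy targets
    predSimple(r)  = polyval(polyfit(x, y, 1), x0);
    predComplex(r) = polyval(polyfit(x, y, 7), x0);
end

bias2_simple  = (mean(predSimple)  - f(x0))^2;   var_simple  = var(predSimple);
bias2_complex = (mean(predComplex) - f(x0))^2;   var_complex = var(predComplex);
mse_simple    = mean((predSimple  - f(x0)).^2);
mse_complex   = mean((predComplex - f(x0)).^2);

fprintf('Degree 1: bias^2 = %.4f, variance = %.4f, MSE = %.4f\n', bias2_simple, var_simple, mse_simple);
fprintf('Degree 7: bias^2 = %.4f, variance = %.4f, MSE = %.4f\n', bias2_complex, var_complex, mse_complex);
On a typical run the degree-1 model shows the larger bias^2 and the degree-7 model the larger variance, and for each model the empirical MSE at x0 is approximately bias^2 + variance, matching the decomposition above.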
Wednesday, March 17, 2010
Removing some rows from a matrix, Matlab
>> A = magic(4)
A =
    16     2     3    13
     5    11    10     8
     9     7     6    12
     4    14    15     1
>> b = [2 3];
>> A(b,:) = []
A =
    16     2     3    13
     4    14    15     1
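Not part of the original post, but the same empty-assignment idiom deletes columns as well:
>> A = magic(4);
>> b = [2 3];
>> A(:,b) = []
A =
    16    13
     5     8
     9    12
     4     1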
More on manipulating 1-D arrays
Tuesday, March 16, 2010
Matlab Path
This is how I add a folder to the MATLAB path from the command window:
if(~exist('varycolor'))
addpath(genpath('~/MatlabAddon'));
end
The if condition checks for the existence of the function 'varycolor'. If it does not exist, the folder containing that function is added to the path.
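An optional extension (my suggestion, not part of the original post): to run this check automatically in every session, the same snippet can be placed in a startup.m file on the MATLAB path. Passing 'file' to exist restricts the check to files and functions on disk.
% startup.m -- executed automatically when MATLAB starts
if ~exist('varycolor', 'file')
    addpath(genpath('~/MatlabAddon'));   % folder containing varycolor (path taken from the post)
end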
Thursday, March 4, 2010
Deleting empty rows in a matrix
>> a = [1,2,3;4,5,6;0,7,8;0,0,0;0,0,0]
a =
1 2 3
4 5 6
0 7 8
0 0 0
0 0 0
>> b = a(any(a,2),:)
b =
1 2 3
4 5 6
0 7 8
>> b = a(all(a,2),:)
b =
1 2 3
4 5 6
Here any(a,2) returns a logical column vector that is true for every row containing at least one nonzero element, so the first command keeps all rows that are not entirely zero. all(a,2) is true only for rows with no zero elements at all, so the second command also drops the row containing a single zero.
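A related variant (my addition, not from the post) deletes the all-zero rows in place, using the same empty-assignment idiom as the "Removing some rows from a matrix" post above:
>> a(~any(a,2),:) = []
a =
1 2 3
4 5 6
0 7 8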
Wednesday, January 27, 2010
K-Medoid Algorithm in Matlab
According to Wikipedia, the k-medoids algorithm is a clustering algorithm related to the k-means algorithm. In contrast to k-means, k-medoids chooses actual data points as cluster centres (medoids). I wrote a MATLAB program implementing this algorithm and am posting it here in the hope that people may find it useful.
function [IDX, Cluster, Err] = kmedoid(data, NC, maxIter, varargin)
% Performs clustering using the Partition Around Medoids (PAM) algorithm
% Initial clusters must be selected from the available data points
% Usage
%   [IDX, C, cost] = kmedoid(data, NC, maxIter, [init_cluster], [init_idx])
%
% Input :
%   data         - input data matrix
%   NC           - number of clusters
%   maxIter      - maximum number of iterations
%   init_cluster - initial cluster centres (optional)
%   init_idx     - indices of the initial clusters (optional)
%
% Output :
%   IDX - indices of the data points which became cluster centres
%   C   - cluster centres
%   Err - cost associated with the clusters for all iterations
% ------------------------------------------

% Size of data
dsize = size(data);
% Number of data points
L = dsize(1);
% Number of features
NF = dsize(2);
% Cluster size
csize = [NC, NF];

if (L < NC)
    error('Too few data points for making k clusters');
end

if (nargin > 5)
    error('Usage : [IDX,C,error] = kmedoid(data, NC, maxIter, [init_cluster], [init_idx])');
elseif (nargin == 5)
    vsize1 = size(varargin{1});
    vsize2 = length(varargin{2});
    if (isequal(vsize1, csize))
        Cluster = varargin{1};
    else
        error('Incorrect size for initial cluster');
    end
    if (vsize2 == NC)
        IDX = varargin{2};
    else
        error('Incorrect size for initial cluster index');
    end
elseif (nargin == 4)
    error('You must provide two optional arguments: init_cluster, init_idx');
elseif (nargin == 3)
    % No initial cluster provided: initialize the cluster centres randomly
    IDX = randint(NC,1,L) + 1;
    Cluster = data(IDX,:);
else
    display('Usage : [IDX,C] = kmedoid(data, NC, maxIter, [init_cluster], [init_idx])');
    error('Function takes at least 3 arguments');
end

% Array for storing the cost for each iteration
Err = zeros(maxIter,1);

Z = zeros(NC, NF, L);
for i = 1:L
    Z(:,:,i) = repmat(data(i,:), NC, 1);
end

FIRST = 1;
total_cost = 0;

for Iteration = 1:maxIter
    %disp(Iteration);
    if (FIRST)
        % First time: compute the cost associated with the initial cluster
        C = repmat(Cluster, [1,1,L]);
        B = sqrt(sum((abs(Z-C).^2), 2));   % Euclidean distance
        B1 = squeeze(B);
        [Bmin, Bidx] = min(B1, [], 1);
        cost = zeros(1,NC);
        for k = 1:NC
            cost(k) = sum(Bmin(Bidx==k));
        end
        total_cost = sum(cost);
        FIRST = 0;   % Reset the FIRST flag
    else
        % Not the first time: change one cluster centre randomly and see if
        % the cost decreases. If yes, accept the change, otherwise reject it.
        while (1)
            Tidx = randint(1,1,L) + 1;        % random number between 1 and L
            if (isempty(find(IDX == Tidx)))   % avoid previously selected centres
                break;
            end
        end
        % Randomly change one cluster
        pos = randint(1,1,NC) + 1;
        OLD_IDX = IDX;        % Preserve the old index list
        IDX(pos) = Tidx;
        % Assign cluster centres
        PCluster = Cluster;
        Cluster = data(IDX,:);
        % For each data point, find the cluster which is closest to it
        % and compute the cost obtained from this association
        C = repmat(Cluster, [1,1,L]);
        %B = sum(abs(Z-C),2);               % Manhattan distance
        B = sqrt(sum((abs(Z-C).^2), 2));    % Euclidean distance
        % Rows    - clusters (NC)
        % Columns - data points 1, 2, ..., L
        B1 = squeeze(B);
        % For each data point, find the nearest cluster; size(Bmin) = 1xL
        [Bmin, Bidx] = min(B1, [], 1);
        cost = zeros(1,NC);
        for k = 1:NC
            cost(k) = sum(Bmin(Bidx==k));
        end
        previous_cost = total_cost;
        total_cost = sum(cost);
        if (total_cost > previous_cost)
            % Cost increased: bad choice, restore the previous index list
            IDX = OLD_IDX;
            total_cost = previous_cost;
            Cluster = PCluster;
        end
    end
    Err(Iteration) = total_cost;
end

if (nargout == 0)
    disp(IDX);
elseif (nargout > 3)
    error('Function gives only 3 outputs');
end
return
Save this file as "kmedoid.m". Now we can create clusters for a given data set as follows:
[Idx, C, err] = kmedoid(data, 5, 1000);
The above command iterates 1000 times to create 5 clusters for the given data.
Idx - the indices of the data points selected as cluster centres.
C - the cluster centres selected from the data set.
err - the cost for each iteration; it can be plotted to check that the error is decreasing.
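A minimal usage sketch (my own, not from the post), assuming kmedoid.m above is on the path and that the randint function it relies on (Communications Toolbox) is available; it clusters a small synthetic 2-D data set and plots the selected medoids and the cost history:
% Synthetic 2-D data: three blobs (illustrative values only)
data = [randn(50,2); randn(50,2) + 5; randn(50,2) - 5];

[Idx, C, err] = kmedoid(data, 3, 1000);          % 3 clusters, 1000 iterations

figure; hold on;
plot(data(:,1), data(:,2), '.');                 % all data points
plot(C(:,1), C(:,2), 'rs', 'MarkerSize', 12);    % medoids are actual data points
figure;
plot(err);                                       % cost history; should be non-increasing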
Tuesday, January 26, 2010
Cycle through Markers in Matlab Plot
figure(2);
markers = 'ox+*sdv^<>ph';     % marker styles to cycle through
colorset = varycolor(NC);     % one colour per cluster (varycolor is a File Exchange function)
hold all;
for i = 1 : NC
    plot(C(:,1), C(:,2), 'LineStyle', 'none', 'Marker', markers(i), 'MarkerSize', 10);
end
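A slightly fuller sketch of the same idea (my own, with hypothetical variable names): plot each cluster's points with its own marker and colour. Here X is assumed to be an N-by-2 data matrix, idx an N-by-1 vector of cluster labels, NC the number of clusters, and varycolor the File Exchange function mentioned in the Matlab Path post (assumed to return an NC-by-3 RGB matrix).
% Hypothetical inputs: X (N-by-2 data matrix), idx (N-by-1 cluster labels), NC clusters
markers  = 'ox+*sdv^<>ph';
colorset = varycolor(NC);                         % assumed NC-by-3 RGB matrix
figure; hold on;
for i = 1:NC
    m = markers(mod(i-1, length(markers)) + 1);   % wrap around if NC > 12
    plot(X(idx==i,1), X(idx==i,2), 'LineStyle', 'none', ...
         'Marker', m, 'MarkerSize', 8, 'Color', colorset(i,:));
end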
Monday, January 18, 2010
Avoid loops in Matlab
The following examples demonstrate how one can avoid for loops in MATLAB to speed up execution:
- Vector assignment
x = rand(1,100);
% Instead of:
for k = 1:100
    y(k) = sin(x(k));
end
% we can use:
y = sin(x);
- Subtracting a vector from each row/column of a matrix
a = [1 2 3];
b = magic(3);
c = b - repmat(a,3,1);   % subtract a from each row of b
- Use logical indexing to operate on a part of a matrix; the "find" command can often be dropped entirely (as the second example below shows).
a = 200:-10:50;
% replace the value 120 by 0
a(a == 120) = 0;
a =
   200   190   180   170   160   150   140   130     0   110   100    90    80    70    60    50
Another example:
% [idx1, idx2] = find(angle < 0);
% idx = [idx1 idx2];
% for j = 1:length(idx)
%     angle(idx(j,1),idx(j,2)) = 360 + angle(idx(j,1),idx(j,2));
% end
can be replaced by the following code:
angle(angle<0) = angle(angle<0)+360;
- Useful commands for matrices and vectors:
sum, norm, diff, abs, max, min, sort
A useful link in this direction:
http://www.ee.columbia.edu/~marios/matlab/matlab_tricks.html
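To see the pay-off, here is a small timing sketch (my own, not from the original post) comparing the loop and vectorized versions of the sin example above with tic/toc; the exact numbers will vary with the machine and MATLAB version:
x = rand(1,1e6);

tic;
y1 = zeros(size(x));            % preallocate for the loop version
for k = 1:numel(x)
    y1(k) = sin(x(k));
end
tLoop = toc;

tic;
y2 = sin(x);                    % vectorized version
tVec = toc;

fprintf('loop: %.3f s, vectorized: %.3f s, max difference: %g\n', ...
        tLoop, tVec, max(abs(y1 - y2)));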