Wednesday, January 27, 2010

K-Medoid Algorithm in Matlab

According to Wikipedia, k-medoids algorithm is a clustering algorithm related to the k-means algorithm. In contrast to k-means algorithm, k-medoids chooses data points as centres.  I wrote a Matlab program for implementing this algorithm. I am posting it here in the hope that people may find it useful.

function [IDX, Cluster, Err] = kmedoid2(data, NC, maxIter, varargin)
% Performs clustering using Partition Around Medoids (PAM) algorithm
% Initial clusters must be selected from the available data points

% Usage
% [IDX,C,cost] = kmedoid(data, NC, maxIter, [init_cluster])
% 
% Input : 
%       data - input data matrix
%       NC - number of clusters
%       maxIter - Maximum number of iterations
%       init_cluster - Initial cluster centers (optional)
%       init_index - Index of initial clusters
%
% Output :
%       IDX - data point index which became cluster centres
%       C -  Cluster centers
%       Err - cost associated with the clusters for all iterations
% ------------------------------------------

% Size of data
dsize = size(data);

% Number of data points
L = dsize(1);

% No. of features
NF = dsize(2); 

% Cluster size
csize = [NC, NF];

if (L < NC)
    error('Too few data points for making k clusters');
end

if(nargin > 5)
    error('Usage : [IDX,C,error] = kmedoid(data, NC, maxIter, [init_cluster], [init_idx])');
elseif(nargin == 5)
    vsize1 = size(varargin{1});
    vsize2 = length(varargin{2});
    
    if(isequal(vsize1,csize))
        Cluster = varargin{1};
    else
        error('Incorrect size for initial cluster');
    end
    
    if(vsize2 == NC)
        IDX = varargin{2};
    else
        error('Incorrect size for initial cluster index');
    end
elseif(nargin == 4)
   error('You must provide two optional arguments: init_cluster, init_idx');
elseif(nargin == 3) % no initial cluster provided
    IDX = randint(NC,1,L)+1;
    Cluster = data(IDX,:); % Initialize the initial clusters randomly
else
    display('Usage : [IDX,C] = kmedoid(data, NC, maxIter, [init_cluster], [init_idx]');
    error('Function takes at least 3 arguments');
end

%Array for storing cost for each iteration
Err = zeros(maxIter,1);

Z = zeros(NC, NF, L);
for i = 1:L
    Z(:,:,i) = repmat(data(i,:),NC,1);
end


FIRST = 1;
total_cost = 0;


for Iteration = 1:maxIter
    %disp(Iteration);
    
    if(FIRST) 
        % First time, compute the cost associated with the initial cluster
        
        C = repmat(Cluster,[1,1,L]);
        B = sqrt(sum((abs(Z-C).^2),2)); %Euclidean distance
        B1 = squeeze(B);
        [Bmin,Bidx] = min(B1,[],1);

        cost = zeros(1,NC);
        for k = 1:NC
            cost(k) = sum(Bmin(Bidx==k));
        end
        total_cost = sum(cost);
        
        FIRST = 0; % Reset the FIRST flag
        
    else % Not first time
        
        % change one cluster center randomly and see if the cost is
        % decreased. If yes, accept the change, otherwise reject the change

        while(1)
            Tidx = randint(1,1,L)+1; % find a random number betn 1 to L
            if(isempty(find(IDX==Tidx))) % Avoid previously selected centers
                break;
            end
        end
        % randomly change one cluster
        pos = randint(1,1,NC) + 1;
        OLD_IDX = IDX; % Preserve the old index list
        IDX(pos) = Tidx;

        %Assign cluster centers
        PCluster = Cluster;
        Cluster = data(IDX,:);

        % For each data point, find the cluster which is closest to it
        % and compute the cost obtained from this association

        C = repmat(Cluster,[1,1,L]);

        %B = sum(abs(Z-C),2); %Manhattan distance
        B = sqrt(sum((abs(Z-C).^2),2)); %Euclidean distance

        % Row - cluster (NC)
        % Column - each data point 1, 2, ... L
        B1 = squeeze(B);

        % For each data point, find the nearest cluster
        % size(Bmin) = 1xL
        [Bmin,Bidx] = min(B1,[],1);

        cost = zeros(1,NC);
        for k = 1:NC
            cost(k) = sum(Bmin(Bidx==k));
        end

        previous_cost = total_cost;
        total_cost = sum(cost);


        if(total_cost > previous_cost) % cost increases
            % Bad choice, restore the previous index list
            IDX = OLD_IDX;
            total_cost = previous_cost;
            Cluster = PCluster;
        end
    end
    Err(Iteration) = total_cost;
end

if(nargout == 0)
    disp(IDX);
elseif(nargout > 3)
    error('Function gives only 3 outputs');
end   
    
return  

Save this file as "kmedoid.m". Now we can create clusters for a given data set as follows :

[Idx, C,err] = kmedoid(data, 5, 1000);

Above command iterates for 1000 times to create 5 clusters for a given data.

Idx - the indices of data points selected as cluster centres.
C - cluster centres selected from the data sets
err - cost for each iteration. It can plotted to see if the error is decreasing.


Tuesday, January 26, 2010

Cycle through Markers in Matlab Plot

figure(2);
markers = 'ox+*sdv^<>ph';
colorset = varycolor(NC);

hold all;
for i = 1 : NC
    plot(C(:,1), C(:,2), 'LineStyle', 'None', 'Marker', markers(i), 'Markersize', 10);
end

Monday, January 18, 2010

Avoid loops in Matlab

Following examples demonstrate how one can avoid for loops in MATLAB to speed up execution :

  1. Vector assignment

    x = rand(1,100);
    %In spite of :
    for k=1:100
    y(k) = sin(x(k));
    end
    % We can use :
    y(:) = sin(x(:));


  2. Subtracting a vector from each row/column of a matrix

    a = [1 2 3]
    b = magic(3);
    c = repmat(a,3,1) - b;

  3. Use logical indexing to operate on a part of a matrix. Use "find" command wherever possible.

    a = 200:-10:50;
    % replace the value 120 by 0
    a(find(a == 120)) = 0;
    a =   200   190   180   170   160   150   140   130     0   110   100    90    80    70    60    50

    Another example :

    % [idx1, idx2] = find(angle < 0);
    % idx = [idx2 idx2];
    % for j = 1:length(idx)
    %     angle(idx(j,1),idx(j,2))=360+angle(idx(j,1),idx(j,2));
    % end
    can be replaced by following code :

    angle(angle<0) = angle(angle<0)+360;


  4. Useful commands for matrices and vectors :

    sum, norm, diff, abs, max, min, sort


     
    Useful link in this direction
    http://www.ee.columbia.edu/~marios/matlab/matlab_tricks.html  
     
     
     
     
     
     

 
pre { margin: 5px 20px; border: 1px dashed #666; padding: 5px; background: #f8f8f8; white-space: pre-wrap; /* css-3 */ white-space: -moz-pre-wrap; /* Mozilla, since 1999 */ white-space: -pre-wrap; /* Opera 4-6 */ white-space: -o-pre-wrap; /* Opera 7 */ word-wrap: break-word; /* Internet Explorer 5.5+ */ }