Wednesday, January 27, 2010

K-Medoid Algorithm in Matlab

According to Wikipedia, k-medoids algorithm is a clustering algorithm related to the k-means algorithm. In contrast to k-means algorithm, k-medoids chooses data points as centres.  I wrote a Matlab program for implementing this algorithm. I am posting it here in the hope that people may find it useful.

function [IDX, Cluster, Err] = kmedoid2(data, NC, maxIter, varargin)
% Performs clustering using Partition Around Medoids (PAM) algorithm
% Initial clusters must be selected from the available data points

% Usage
% [IDX,C,cost] = kmedoid(data, NC, maxIter, [init_cluster])
% 
% Input : 
%       data - input data matrix
%       NC - number of clusters
%       maxIter - Maximum number of iterations
%       init_cluster - Initial cluster centers (optional)
%       init_index - Index of initial clusters
%
% Output :
%       IDX - data point index which became cluster centres
%       C -  Cluster centers
%       Err - cost associated with the clusters for all iterations
% ------------------------------------------

% Size of data
dsize = size(data);

% Number of data points
L = dsize(1);

% No. of features
NF = dsize(2); 

% Cluster size
csize = [NC, NF];

if (L < NC)
    error('Too few data points for making k clusters');
end

if(nargin > 5)
    error('Usage : [IDX,C,error] = kmedoid(data, NC, maxIter, [init_cluster], [init_idx])');
elseif(nargin == 5)
    vsize1 = size(varargin{1});
    vsize2 = length(varargin{2});
    
    if(isequal(vsize1,csize))
        Cluster = varargin{1};
    else
        error('Incorrect size for initial cluster');
    end
    
    if(vsize2 == NC)
        IDX = varargin{2};
    else
        error('Incorrect size for initial cluster index');
    end
elseif(nargin == 4)
   error('You must provide two optional arguments: init_cluster, init_idx');
elseif(nargin == 3) % no initial cluster provided
    IDX = randint(NC,1,L)+1;
    Cluster = data(IDX,:); % Initialize the initial clusters randomly
else
    display('Usage : [IDX,C] = kmedoid(data, NC, maxIter, [init_cluster], [init_idx]');
    error('Function takes at least 3 arguments');
end

%Array for storing cost for each iteration
Err = zeros(maxIter,1);

Z = zeros(NC, NF, L);
for i = 1:L
    Z(:,:,i) = repmat(data(i,:),NC,1);
end


FIRST = 1;
total_cost = 0;


for Iteration = 1:maxIter
    %disp(Iteration);
    
    if(FIRST) 
        % First time, compute the cost associated with the initial cluster
        
        C = repmat(Cluster,[1,1,L]);
        B = sqrt(sum((abs(Z-C).^2),2)); %Euclidean distance
        B1 = squeeze(B);
        [Bmin,Bidx] = min(B1,[],1);

        cost = zeros(1,NC);
        for k = 1:NC
            cost(k) = sum(Bmin(Bidx==k));
        end
        total_cost = sum(cost);
        
        FIRST = 0; % Reset the FIRST flag
        
    else % Not first time
        
        % change one cluster center randomly and see if the cost is
        % decreased. If yes, accept the change, otherwise reject the change

        while(1)
            Tidx = randint(1,1,L)+1; % find a random number betn 1 to L
            if(isempty(find(IDX==Tidx))) % Avoid previously selected centers
                break;
            end
        end
        % randomly change one cluster
        pos = randint(1,1,NC) + 1;
        OLD_IDX = IDX; % Preserve the old index list
        IDX(pos) = Tidx;

        %Assign cluster centers
        PCluster = Cluster;
        Cluster = data(IDX,:);

        % For each data point, find the cluster which is closest to it
        % and compute the cost obtained from this association

        C = repmat(Cluster,[1,1,L]);

        %B = sum(abs(Z-C),2); %Manhattan distance
        B = sqrt(sum((abs(Z-C).^2),2)); %Euclidean distance

        % Row - cluster (NC)
        % Column - each data point 1, 2, ... L
        B1 = squeeze(B);

        % For each data point, find the nearest cluster
        % size(Bmin) = 1xL
        [Bmin,Bidx] = min(B1,[],1);

        cost = zeros(1,NC);
        for k = 1:NC
            cost(k) = sum(Bmin(Bidx==k));
        end

        previous_cost = total_cost;
        total_cost = sum(cost);


        if(total_cost > previous_cost) % cost increases
            % Bad choice, restore the previous index list
            IDX = OLD_IDX;
            total_cost = previous_cost;
            Cluster = PCluster;
        end
    end
    Err(Iteration) = total_cost;
end

if(nargout == 0)
    disp(IDX);
elseif(nargout > 3)
    error('Function gives only 3 outputs');
end   
    
return  

Save this file as "kmedoid.m". Now we can create clusters for a given data set as follows :

[Idx, C,err] = kmedoid(data, 5, 1000);

Above command iterates for 1000 times to create 5 clusters for a given data.

Idx - the indices of data points selected as cluster centres.
C - cluster centres selected from the data sets
err - cost for each iteration. It can plotted to see if the error is decreasing.


6 comments:

Unknown said...

Hello! How can I extract a vector containing the number of the corresponding cluster for each data point? (size=[n data point,1])
Thank you!

Unknown said...

Thank you very much for sharing the code, but you should note that your implementation is Clarans (Ng & Han, 1994) and not Pam (Kaufman & Rousseeuw, 1990). While both are k-medoid algorithms, the latter uses an exhaustive search and different initialisation. As can be clearly seen in your code you restrict yourself to removing a medoid and evaluate the total cost as proposed by Ng & Han.

rajendra kumar said...

Thank you for it...
please tell me what will be the structure (size of )of the data matrix in the matlab function.... i'm not getting,please elavorate it with example..

thanks
Rajendra Kumar

pvgajjar said...

Dear sir,
i have tried to run this kmedoid code in matlab 7.13.0.564, R2011b.
i have save file as kmedoid2.m as the function name is kmedoid2.
but it gives error:
function [IDX, Cluster, Err] = kmedoid2(data, NC, maxIter, varargin)
|
Error: Function definitions are not permitted in this context.

plz give me solution so that i can run this prog.

Unknown said...

hi,
I am trying to cluster the digital circuit using clustering algorithm,so whether ur k medoid algorithm can be used my work??

Unknown said...

Thank you ,It is perfect.

 
pre { margin: 5px 20px; border: 1px dashed #666; padding: 5px; background: #f8f8f8; white-space: pre-wrap; /* css-3 */ white-space: -moz-pre-wrap; /* Mozilla, since 1999 */ white-space: -pre-wrap; /* Opera 4-6 */ white-space: -o-pre-wrap; /* Opera 7 */ word-wrap: break-word; /* Internet Explorer 5.5+ */ }