Monday, April 19, 2010
Bias Vs Variance Dilemma
The bias versus variance dilemma is almost universal across all kinds of data modelling methods. As per Wikipedia, the variance of a random variable is the expected squared deviation of that variable from its expected value; in other words, it measures how widely the values of the variable spread around the mean, taking their probabilities into account. The bias of an estimator is the difference between the estimator's expected value and the true value of the parameter being estimated. A very good article on this topic is available here. Without repeating much of its content, I will simply highlight the key points that make this topic easier to understand.
- Var(x) = E(x^2) - [E(x)]^2, where E(.) is the expectation operator. This can be rewritten as
E(x^2) = Var(x) + [E(x)]^2
If we replace x by e (the approximation error of an estimator), the equation becomes
E(e^2) = Var(e) + [E(e)]^2
MSE = Var(e) + Bias^2
Hence, for a desired mean square error there is a trade-off between the variance and the bias: if one increases, the other must decrease.
- The complexity of an estimator model is expressed in terms of its number of parameters. The effect of complexity on the bias and the variance of the estimator is explained here. In brief:
- A low-complexity model leads to low variance but large bias.
- A highly complex model leads to low bias but large variance.
- Large variance implies that the estimator is too sensitive to the particular data set, so the model has a low generalization capability: excellent performance on the design (or training) data, but poor performance on test data. This is the case of overfitting. Large variance is observed for complex models with a large number of parameters (over-parameterization). Note that, because of the high complexity, the model fits the design data set very well; the error at individual points in the data set is therefore very low and the model has a low bias.
- Large bias implies that the model is too simple, so very few data points lie on the regression curve. The errors at individual points (in a given data set) are therefore high, leading to a large MSE. But since very few points participate in shaping the model, the performance does not differ much across test data sets; hence the model has a low variance. The performance remains roughly the same over design and test data sets.
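To make the trade-off concrete, here is a small MATLAB sketch (my own illustration, not from the article linked above) that repeatedly fits a simple and a complex polynomial to noisy samples of a sine curve and estimates the bias^2 and variance of each model's prediction at a single test point. All names and parameter choices (degrees 1 and 7, 200 repetitions, noise level 0.3, test point 0.25) are assumptions made for this example only.
% Illustrative sketch: bias/variance of a simple vs. a complex polynomial fit.
% All parameter choices below are assumptions for illustration only.
nRuns  = 200;                 % number of independent training sets
nTrain = 20;                  % points per training set
sigma  = 0.3;                 % noise standard deviation
x0     = 0.25;                % test point at which bias/variance are measured
f      = @(x) sin(2*pi*x);    % true underlying function

predSimple  = zeros(nRuns,1); % degree-1 (low complexity) predictions at x0
predComplex = zeros(nRuns,1); % degree-7 (high complexity) predictions at x0

for r = 1:nRuns
    x = rand(nTrain,1);                  % random training inputs in [0,1]
    y = f(x) + sigma*randn(nTrain,1);    % noisy targets
    predSimple(r)  = polyval(polyfit(x, y, 1), x0);
    predComplex(r) = polyval(polyfit(x, y, 7), x0);
end

bias2_simple  = (mean(predSimple)  - f(x0))^2;   var_simple  = var(predSimple);
bias2_complex = (mean(predComplex) - f(x0))^2;   var_complex = var(predComplex);
mse_simple    = mean((predSimple  - f(x0)).^2);
mse_complex   = mean((predComplex - f(x0)).^2);

fprintf('Degree 1: bias^2 = %.4f, variance = %.4f, MSE = %.4f\n', bias2_simple, var_simple, mse_simple);
fprintf('Degree 7: bias^2 = %.4f, variance = %.4f, MSE = %.4f\n', bias2_complex, var_complex, mse_complex);
On a typical run the degree-1 model shows the larger bias^2 and the degree-7 model the larger variance, and for each model the empirical MSE at x0 is approximately bias^2 + variance, matching the decomposition above.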
Wednesday, March 17, 2010
Removing some rows from a matrix, Matlab
>> A = magic(4)
A =
    16     2     3    13
     5    11    10     8
     9     7     6    12
     4    14    15     1
>> b = [2 3];
>> A(b,:) = []
A =
    16     2     3    13
     4    14    15     1
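Not part of the original post, but the same empty-assignment idiom deletes columns as well:
>> A = magic(4);
>> b = [2 3];
>> A(:,b) = []
A =
    16    13
     5     8
     9    12
     4     1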
More on manipulating 1-D arrays
Tuesday, March 16, 2010
Matlab Path
This is how I add a folder to the MATLAB path from the command window:
if(~exist('varycolor'))
addpath(genpath('~/MatlabAddon'));
end
The if condition checks for the existence of the function 'varycolor'. If it does not exist, the folder containing that function is added to the path.
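An optional extension (my suggestion, not part of the original post): to run this check automatically in every session, the same snippet can be placed in a startup.m file on the MATLAB path. Passing 'file' to exist restricts the check to files and functions on disk.
% startup.m -- executed automatically when MATLAB starts
if ~exist('varycolor', 'file')
    addpath(genpath('~/MatlabAddon'));   % folder containing varycolor (path taken from the post)
end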
Thursday, March 4, 2010
Deleting empty rows in a matrix
>> a = [1,2,3;4,5,6;0,7,8;0,0,0;0,0,0]
a =
1 2 3
4 5 6
0 7 8
0 0 0
0 0 0
>> b = a(any(a,2),:)
b =
1 2 3
4 5 6
0 7 8
>> b = a(all(a,2),:)
b =
1 2 3
4 5 6
Here any(a,2) returns a logical column vector that is true for every row containing at least one nonzero element, so the first command keeps all rows that are not entirely zero. all(a,2) is true only for rows with no zero elements at all, so the second command also drops the row containing a single zero.
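A related variant (my addition, not from the post) deletes the all-zero rows in place, using the same empty-assignment idiom as the "Removing some rows from a matrix" post above:
>> a(~any(a,2),:) = []
a =
1 2 3
4 5 6
0 7 8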
Wednesday, January 27, 2010
K-Medoid Algorithm in Matlab
According to Wikipedia, the k-medoids algorithm is a clustering algorithm related to the k-means algorithm. In contrast to k-means, k-medoids chooses actual data points as cluster centres (medoids). I wrote a MATLAB program implementing this algorithm and am posting it here in the hope that people may find it useful.
function [IDX, Cluster, Err] = kmedoid(data, NC, maxIter, varargin)
% Performs clustering using the Partition Around Medoids (PAM) algorithm
% Initial clusters must be selected from the available data points
% Usage
%   [IDX, C, cost] = kmedoid(data, NC, maxIter, [init_cluster], [init_idx])
%
% Input :
%   data         - input data matrix
%   NC           - number of clusters
%   maxIter      - maximum number of iterations
%   init_cluster - initial cluster centres (optional)
%   init_idx     - indices of the initial clusters (optional)
%
% Output :
%   IDX - indices of the data points which became cluster centres
%   C   - cluster centres
%   Err - cost associated with the clusters for all iterations
% ------------------------------------------

% Size of data
dsize = size(data);
% Number of data points
L = dsize(1);
% Number of features
NF = dsize(2);
% Cluster size
csize = [NC, NF];

if (L < NC)
    error('Too few data points for making k clusters');
end

if (nargin > 5)
    error('Usage : [IDX,C,error] = kmedoid(data, NC, maxIter, [init_cluster], [init_idx])');
elseif (nargin == 5)
    vsize1 = size(varargin{1});
    vsize2 = length(varargin{2});
    if (isequal(vsize1, csize))
        Cluster = varargin{1};
    else
        error('Incorrect size for initial cluster');
    end
    if (vsize2 == NC)
        IDX = varargin{2};
    else
        error('Incorrect size for initial cluster index');
    end
elseif (nargin == 4)
    error('You must provide two optional arguments: init_cluster, init_idx');
elseif (nargin == 3)
    % No initial cluster provided: initialize the cluster centres randomly
    IDX = randint(NC,1,L) + 1;
    Cluster = data(IDX,:);
else
    display('Usage : [IDX,C] = kmedoid(data, NC, maxIter, [init_cluster], [init_idx])');
    error('Function takes at least 3 arguments');
end

% Array for storing the cost for each iteration
Err = zeros(maxIter,1);

Z = zeros(NC, NF, L);
for i = 1:L
    Z(:,:,i) = repmat(data(i,:), NC, 1);
end

FIRST = 1;
total_cost = 0;

for Iteration = 1:maxIter
    %disp(Iteration);
    if (FIRST)
        % First time: compute the cost associated with the initial cluster
        C = repmat(Cluster, [1,1,L]);
        B = sqrt(sum((abs(Z-C).^2), 2));   % Euclidean distance
        B1 = squeeze(B);
        [Bmin, Bidx] = min(B1, [], 1);
        cost = zeros(1,NC);
        for k = 1:NC
            cost(k) = sum(Bmin(Bidx==k));
        end
        total_cost = sum(cost);
        FIRST = 0;   % Reset the FIRST flag
    else
        % Not the first time: change one cluster centre randomly and see if
        % the cost decreases. If yes, accept the change, otherwise reject it.
        while (1)
            Tidx = randint(1,1,L) + 1;        % random number between 1 and L
            if (isempty(find(IDX == Tidx)))   % avoid previously selected centres
                break;
            end
        end
        % Randomly change one cluster
        pos = randint(1,1,NC) + 1;
        OLD_IDX = IDX;        % Preserve the old index list
        IDX(pos) = Tidx;
        % Assign cluster centres
        PCluster = Cluster;
        Cluster = data(IDX,:);
        % For each data point, find the cluster which is closest to it
        % and compute the cost obtained from this association
        C = repmat(Cluster, [1,1,L]);
        %B = sum(abs(Z-C),2);               % Manhattan distance
        B = sqrt(sum((abs(Z-C).^2), 2));    % Euclidean distance
        % Rows    - clusters (NC)
        % Columns - data points 1, 2, ..., L
        B1 = squeeze(B);
        % For each data point, find the nearest cluster; size(Bmin) = 1xL
        [Bmin, Bidx] = min(B1, [], 1);
        cost = zeros(1,NC);
        for k = 1:NC
            cost(k) = sum(Bmin(Bidx==k));
        end
        previous_cost = total_cost;
        total_cost = sum(cost);
        if (total_cost > previous_cost)
            % Cost increased: bad choice, restore the previous index list
            IDX = OLD_IDX;
            total_cost = previous_cost;
            Cluster = PCluster;
        end
    end
    Err(Iteration) = total_cost;
end

if (nargout == 0)
    disp(IDX);
elseif (nargout > 3)
    error('Function gives only 3 outputs');
end
return
Save this file as "kmedoid.m". Now we can create clusters for a given data set as follows:
[Idx, C, err] = kmedoid(data, 5, 1000);
The above command iterates 1000 times to create 5 clusters for the given data.
Idx - the indices of the data points selected as cluster centres.
C - the cluster centres selected from the data set.
err - the cost for each iteration; it can be plotted to check that the error is decreasing.
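A minimal usage sketch (my own, not from the post), assuming kmedoid.m above is on the path and that the randint function it relies on (Communications Toolbox) is available; it clusters a small synthetic 2-D data set and plots the selected medoids and the cost history:
% Synthetic 2-D data: three blobs (illustrative values only)
data = [randn(50,2); randn(50,2) + 5; randn(50,2) - 5];

[Idx, C, err] = kmedoid(data, 3, 1000);          % 3 clusters, 1000 iterations

figure; hold on;
plot(data(:,1), data(:,2), '.');                 % all data points
plot(C(:,1), C(:,2), 'rs', 'MarkerSize', 12);    % medoids are actual data points
figure;
plot(err);                                       % cost history; should be non-increasing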
Tuesday, January 26, 2010
Cycle through Markers in Matlab Plot
figure(2);
markers = 'ox+*sdv^<>ph';     % marker styles to cycle through
colorset = varycolor(NC);     % one colour per cluster (varycolor is a File Exchange function)
hold all;
for i = 1 : NC
    plot(C(:,1), C(:,2), 'LineStyle', 'none', 'Marker', markers(i), 'MarkerSize', 10);
end
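A slightly fuller sketch of the same idea (my own, with hypothetical variable names): plot each cluster's points with its own marker and colour. Here X is assumed to be an N-by-2 data matrix, idx an N-by-1 vector of cluster labels, NC the number of clusters, and varycolor the File Exchange function mentioned in the Matlab Path post (assumed to return an NC-by-3 RGB matrix).
% Hypothetical inputs: X (N-by-2 data matrix), idx (N-by-1 cluster labels), NC clusters
markers  = 'ox+*sdv^<>ph';
colorset = varycolor(NC);                         % assumed NC-by-3 RGB matrix
figure; hold on;
for i = 1:NC
    m = markers(mod(i-1, length(markers)) + 1);   % wrap around if NC > 12
    plot(X(idx==i,1), X(idx==i,2), 'LineStyle', 'none', ...
         'Marker', m, 'MarkerSize', 8, 'Color', colorset(i,:));
end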
Monday, January 18, 2010
Avoid loops in Matlab
The following examples demonstrate how one can avoid for loops in MATLAB to speed up execution:
- Vector assignment
x = rand(1,100);
% Instead of:
for k = 1:100
    y(k) = sin(x(k));
end
% we can use:
y = sin(x);
- Subtracting a vector from each row/column of a matrix
a = [1 2 3];
b = magic(3);
c = b - repmat(a,3,1);   % subtract a from each row of b
- Use logical indexing to operate on a part of a matrix; the "find" command can often be dropped entirely (as the second example below shows).
a = 200:-10:50;
% replace the value 120 by 0
a(a == 120) = 0;
a =
   200   190   180   170   160   150   140   130     0   110   100    90    80    70    60    50
Another example:
% [idx1, idx2] = find(angle < 0);
% idx = [idx1 idx2];
% for j = 1:length(idx)
%     angle(idx(j,1),idx(j,2)) = 360 + angle(idx(j,1),idx(j,2));
% end
can be replaced by the following code:
angle(angle<0) = angle(angle<0)+360;
- Useful commands for matrices and vectors:
sum, norm, diff, abs, max, min, sort
A useful link in this direction:
http://www.ee.columbia.edu/~marios/matlab/matlab_tricks.html
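To see the pay-off, here is a small timing sketch (my own, not from the original post) comparing the loop and vectorized versions of the sin example above with tic/toc; the exact numbers will vary with the machine and MATLAB version:
x = rand(1,1e6);

tic;
y1 = zeros(size(x));            % preallocate for the loop version
for k = 1:numel(x)
    y1(k) = sin(x(k));
end
tLoop = toc;

tic;
y2 = sin(x);                    % vectorized version
tVec = toc;

fprintf('loop: %.3f s, vectorized: %.3f s, max difference: %g\n', ...
        tLoop, tVec, max(abs(y1 - y2)));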