Sunday, July 31, 2016

Ternary Search Tree in C/C++

The task is to store an array of integers in a ternary search tree, where each element of the array can take one of three values: 0, 1 or 2.

The C/C++ code is available on my GitLab page. The resulting tree can be plotted with the Graphviz software from a dot file, and the code also generates this dot file. Once you have the dot file, use the following commands to render the tree:

$ dot -Tps tst_data.dot -o tst_data.ps
$ gv tst_data.ps
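
The full implementation is in the GitLab repository; as a quick reference, here is a minimal sketch of the node structure and insertion routine in the classic Bentley-Sedgewick style (the names Node, splitval and insert are mine and may differ from the repository code):

#include <cstddef>

// Each node splits on one element of the key: lokid, eqkid and hikid
// hold elements smaller than, equal to, and larger than splitval.
struct Node
{
  int  splitval;               // element value at this node (0, 1 or 2)
  Node *lokid, *eqkid, *hikid;
};

// Insert the length-n integer array 'key' into the tree rooted at 'p'.
Node* insert(Node *p, const int *key, int n)
{
  if(n == 0)                   // whole key consumed
    return p;

  if(p == NULL)
  {
    p = new Node;
    p->splitval = *key;
    p->lokid = p->eqkid = p->hikid = NULL;
  }

  if(*key < p->splitval)
    p->lokid = insert(p->lokid, key, n);
  else if(*key > p->splitval)
    p->hikid = insert(p->hikid, key, n);
  else
    p->eqkid = insert(p->eqkid, key + 1, n - 1);

  return p;
}

Calling root = insert(root, a, len) once per array builds the tree that the dot file then renders.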





Saturday, June 18, 2016

Data Generation for Tic-Tac-Toe Game: Example of Recursion in C++


Problem Statement: I need to generate a data set for a 3x3 Tic-Tac-Toe game. The board has 9 cells and each can take one of the values in the set {0, 1, 2}, where 0 denotes an empty cell while 1 and 2 correspond to the two players. So, we need to generate a 9x1 state vector where each element can have one of three possible states. The total number of states is 3^9 = 19683.


I have written a generic program that can generate such data sets irrespective of the vector dimension or the number of possible state values:

A helper function, find_match, is used to skip redundant vectors.


#include <iostream>
#include <fstream>
#include <cmath>

using namespace std;

#define N 3        // length of the state vector (use 9 for the full 3x3 board)
#define M 3        // number of values each element can take
#define MAX 20000  // maximum number of rows (3^9 = 19683 fits)
int Data[MAX][N];

std::ofstream f1("dataset2.txt");
//-------------------------------------------

// Return true if 'state' already appears among the first 'rowcnt' rows of Data.
bool find_match(int state[N], int rowcnt)
{
  for(int q = 0; q < rowcnt; q++)
  {
    int simcnt = 0;
    for(int r = 0; r < N; r++)
    {
      if(Data[q][r] == state[r])
        simcnt++;
    }

    if(simcnt == N)
      return true;
  }

  return false;
}

//------------------------------------

// Copy 'state' into row 'rowcnt' of Data.
void add_row(int state[N], int rowcnt)
{

  for(int r = 0; r < N; r++)
    Data[rowcnt][r] = state[r];
}

//-----------------------------------------

// Recursively enumerate every length-N vector over {0, ..., M-1},
// writing each distinct vector to f1 the first time it is seen.
void genstate(int state[], int k, int &rowcnt)
{
  if(k >= N)   // all positions have been filled
    return;

  for(int j = 0; j < M; j++)
  {
    state[k] = j;

    if(find_match(state, rowcnt) == false)
    {
      add_row(state, rowcnt);
      rowcnt++;
      for(int p = 0; p < N; p++)
        f1 << state[p] << "\t";
      f1 << endl;
    }

    genstate(state, k+1, rowcnt);
  }
}

//-------------------------

int main()
{
  int state[N];
  for(int i = 0; i < N; i++)
    state[i] = 0;   // start from the all-zero vector

  int rowcnt = 0;

  genstate(state, 0, rowcnt);

  f1.close();

  return 0;
}
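
With N = 3 as defined above, the program writes all 3^3 = 27 distinct vectors to dataset2.txt; setting N to 9 (with MAX kept above 3^9 = 19683) produces the full Tic-Tac-Toe state set.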


       
 

Wednesday, September 19, 2012

Systematic Resampling

The idea is simple, but I don't know why it took me so much time to understand it. I am still not sure if I got it right, but let me explain what I understood.

Please refer to this page to understand the basics of systematic sampling. The starting point is chosen at random and the remaining points are selected at regular intervals.

Let us assume that we have a population X[N] with N elements and an associated weight vector W[N]. We need to draw a sample Y[n] of n elements out of X using systematic sampling. The algorithm, as I understand it from Section 8.6 of [1], is as follows:
  1. Compute the sum of all weights $ \displaystyle S = \sum_{i=1}^{N} W_i $
  2. Compute the selection interval I = S/n.
  3. To initiate the random selection process, select a uniform random number R in the half-open interval (0, I].
  4. The n samples are given as : R, R+I, R+2I, ... , R+(n-1)I

#include <iostream>
#include <cstdlib>
#include <gsl/gsl_rng.h>

using namespace std;

// Draw m samples out of src[n] into dest[m], with selection probability
// proportional to the weights w[n].
void systematic_resampling1(gsl_rng *r, double *dest, size_t m, double *src, size_t n, double *w)
{
  if(m > n)
  {
    cerr << "choose m <= n" << endl;
    exit(-1);
  }

  double sum_w = 0.0;
  for(size_t i = 0; i < n; ++i)
    sum_w += w[i];

  // selection interval
  double I = sum_w / m;

  // generate a random number strictly between 0 and I
  double u = gsl_rng_uniform_pos(r) * I;

  size_t j = 0;
  double C = 0.0;   // running cumulative sum of the weights
  for(size_t i = 0; i < n && j < m; ++i)
  {
    C = C + w[i];

    // a heavily weighted element may be selected more than once
    while(j < m && u <= C)
    {
      dest[j] = src[i];
      j = j + 1;
      u = u + I;      // advance to the next selection point
    }
  }

  // guard against floating-point round-off on the last selection point
  while(j < m)
  {
    dest[j] = src[n-1];
    j = j + 1;
  }
}

The other algorithm [2], which seems to achieve the same result, is as follows:

void systematic_resampling2(gsl_rng *r, double *dest, size_t m, double *src, size_t n, double *w)
{
  if(m > n)
  {
    cerr << "choose m <= n" << endl;
    exit(-1);
  }

  double sum_w = 0.0;
  for(size_t i = 0; i < n; ++i)
    sum_w += w[i];

  // generate a random number strictly between 0 and 1
  double u = gsl_rng_uniform_pos(r);

  size_t j = 0;
  double C = 0.0;   // running cumulative sum of the normalized weights
  for(size_t i = 0; i < n; ++i)
  {
    C = C + w[i]/sum_w;
    if(C > 1.0)     // guard against floating-point round-off
      C = 1.0;

    // the j-th selection point is (u + j)/m
    while(j < m && (u + j)/m <= C)
    {
      dest[j] = src[i];
      j = j + 1;
    }
  }
}
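
Either function can be exercised with a small driver; here is a minimal sketch (the example values and the seed are arbitrary):

#include <gsl/gsl_rng.h>

int main()
{
  gsl_rng *r = gsl_rng_alloc(gsl_rng_mt19937);
  gsl_rng_set(r, 42);   // arbitrary seed

  double src[5] = {1.0, 2.0, 3.0, 4.0, 5.0};
  double w[5]   = {0.1, 0.4, 0.2, 0.2, 0.1};  // weights need not be normalized
  double dest[3];

  systematic_resampling2(r, dest, 3, src, 5, w);  // draw 3 samples out of 5

  gsl_rng_free(r);
  return 0;
}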



Reference:

[1]  Kirk M. Wolter, "Introduction to Variance Estimation", Statistics for Social and Behavioral Sciences series, Springer, 2nd Edition, 2007.

[2] Murray Pollock, "Introduction to Particle Filtering", Algorithms & Computationally Intensive Inference Reading Group Discussion Notes, 2010. 

Monday, September 17, 2012

Weighted Random Sampling With Replacement in C


The problem is to select k items with replacement from a list of n items, with a probability that depends on the weights associated with the items in the original list.

Input:  x[n], w[n]
Output: x[n]   (x is overwritten)

void wrswr(gsl_rng *r, std::vector<double> &x, const std::vector<double> &w)
{
  if(x.size() != w.size())
  {
    cerr << "w and x must have same length ..." << endl;
    exit(-1);
  }
  else
  {
    size_t N = x.size();
    std::vector<double> tempx(N);

    double sum_w = 0.0;
    for(size_t i = 0; i < N; ++i)
    {
      sum_w += w.at(i);
      tempx.at(i) = x.at(i);  // copy of source x
    }

    for(size_t i = 0; i < N; ++i)
    {
      double u = gsl_rng_uniform(r);

      // walk the cumulative distribution until it exceeds u
      size_t j = 0;
      double c = w[0] / sum_w;
      while (u > c && j < N - 1)
      {
        j = j + 1;
        c = c + w[j] / sum_w;
      }
      x.at(i) = tempx.at(j);
    }
  }
}




You need to compile it with the GSL library. I implemented the idea available on this post. A plain random sampling with replacement (without weights) is already available in GSL itself; refer to this page for details on "Shuffling and Sampling" with GSL.
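
Assuming the function is saved in a file wrswr.cpp together with a main(), a typical compile line is:

$ g++ wrswr.cpp -o wrswr -lgsl -lgslcblas -lm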

Monday, April 19, 2010

Latex on Blogger

I have been on Blogger for a long time now, and I don't want to move over to Wordpress, which is known to provide better support for mathematical equations. There are many people in the same situation, and several solutions exist; this solution is the one that worked for me.

Just a test:

$ \displaystyle \int_{0}^{1}\frac{x^{4}\left(1-x\right)^{4}}{1+x^{2}}\,dx = \frac{22}{7}-\pi $

Another solution is available here. It requires you to install a script into your browser (Chrome or Firefox). After that, two extra menu buttons appear in the Blogger editor: Latex and UnLatex. LaTeX code placed between a pair of 'double dollars' is converted into an image of the equation, and you can undo the operation by clicking on UnLatex.

Bias Vs Variance Dilemma

The bias versus variance dilemma is almost universal in all kinds of data modelling methods. As per Wikipedia, the variance of a random variable is the expected squared deviation of that variable from its expected value; in other words, it is a measure of the spread of the variable's values, weighted by their probabilities. The bias of an estimator is the difference between the estimator's expected value and the true value of the parameter being estimated. A very good article on this topic is available here. Without repeating much of the content, I will simply highlight the key points which make this topic easier to understand.

  1. $ \mathrm{Var}(x) = E(x^2) - [E(x)]^2 $, where $E(\cdot)$ is the expected value. This can be rewritten as

    $ E(x^2) = \mathrm{Var}(x) + [E(x)]^2 $

    If we replace x by e, the approximation error of an estimator, we can rewrite the above equation as

    $ E(e^2) = \mathrm{Var}(e) + [E(e)]^2 $

    Since $E(e^2)$ is the mean square error (MSE) and $E(e)$ is the bias of the estimator, this becomes

    $ \mathrm{MSE} = \mathrm{Var}(e) + \mathrm{Bias}^2 $

    Hence we can see that for a desired mean square error there is a trade-off between the variance and the bias: if one increases, the other must decrease. (A two-line derivation is given after this list.)

  2. The complexity of an estimator model is expressed in terms of its number of parameters. The effect of complexity on the bias and the variance of the estimator is explained here. In brief, it can be said that
    1. A low-complexity model leads to low variance but large bias.
    2. A highly complex model leads to low bias but large variance.

  3. Large variance implies that the estimator is too sensitive to the particular data set, so the model has low generalization capability: excellent performance on the design (or training) data, but poor performance on test data. This is the case of overfitting. Large variance is observed with complex models that have a large number of parameters (over-parameterization). Note that, because of its high complexity, the model fits the design data set very well; the error at individual points in the data set is therefore very low, and hence the model has a low bias.

  4. Large bias implies that the model is too simple, so very few data points lie on the regression curve. The error at individual points in a given data set is high, leading to a large MSE. But since few points participate in forming the model, the performance does not differ much across data sets, so the variance is low: the performance remains roughly the same over design and test data sets.
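
For completeness, the decomposition in point 1 can be derived in two lines. For an estimator $\hat{\theta}$ of a fixed parameter $\theta$, write the error as $e = \hat{\theta} - \theta$; then

$ \mathrm{MSE} = E[(\hat{\theta}-\theta)^2] = E(e^2) = \mathrm{Var}(e) + [E(e)]^2 = \mathrm{Var}(\hat{\theta}) + [E(\hat{\theta}) - \theta]^2 = \mathrm{Var}(\hat{\theta}) + \mathrm{Bias}^2 $

where $\mathrm{Var}(e) = \mathrm{Var}(\hat{\theta})$ because $\theta$ is a constant.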

Wednesday, March 17, 2010

Removing some rows from a matrix, Matlab

>> A = magic(4)
A =
    16     2     3    13
     5    11    10     8
     9     7     6    12
     4    14    15     1
>> b = [ 2     3];
>> A(b,:) = []
A =
    16     2     3    13
     4    14    15     1
More on manipulating 1-D array
 