Homework #2 – Maximum likelihood parameter estimation - Solutions

CAP 5638, Pattern Recognition, Fall, 2005

Department of Computer Science, Florida State University


Points: 100

Due: Thursday, October 6, 2005

 

Problem 1 (15 points) Problem 1 (Chapter 3 in the textbook).

 

(a)  See the following plots.

(c) Note that this problem itself is unclear and confusing. For large n, since we assume that the samples are generated according to the distribution, the maximum-likelihood estimate approaches the true parameter value. It should be marked on the plot of p(x|θ) with x = 2. Note that it does not coincide with the maximum of that curve as one might expect, because the curve is for one particular sample only.

 

 

Problem 2 (15 points) Problem 2 (Chapter 3 in the textbook).

 

 

Problem 3 (15 points) Problem 3 (Chapter 3 in the textbook).

 

 

Problem 4 (25 points) Problem 7 (Chapter 3 in the textbook).

(e) This example shows that the parametric form is very important for maximum-likelihood estimation. If the assumed form is far from the true underlying model, the ML estimate can give a larger error than other models in the same assumed family. To get good results with the ML estimate, one needs to find the most accurate form for the unknown underlying model based on prior knowledge, experience, or experiments on some data. If several models are available, they should be evaluated and compared using some test data.

 

 

Problem 5 As we did in class, suppose that there are C classes (ω1, …, ωC) and d features (x1, …, xd). The features are assumed to be statistically independent and normally distributed with unknown mean and variance, and the parameters are estimated using the maximum likelihood estimation method.

 

1)     (15 points) Write down the training steps and a set of discriminant functions for minimum error rate classification. You need to include the details.

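First we fix the notation: let $x_{ik}^{(j)}$ represent the kth feature of the ith training sample of class $\omega_j$. Assume that we have n training samples in total, with $n_1$ for $\omega_1$, …, and $n_C$ for $\omega_C$.

There are different choices of discriminant functions for minimum error rate classification. Here I choose

$$g_j(x) = \ln p(x \mid \omega_j) + \ln P(\omega_j), \qquad j = 1, \ldots, C.$$

Here, for class j, we need to estimate both $P(\omega_j)$ and $p(x \mid \omega_j)$. For $P(\omega_j)$, using maximum likelihood estimation, we have

$$\hat{P}(\omega_j) = \frac{n_j}{n}.$$

To estimate $p(x \mid \omega_j)$, based on the assumptions that the features are statistically independent and normally distributed, we have

$$p(x \mid \omega_j) = \prod_{k=1}^{d} \frac{1}{\sqrt{2\pi}\,\sigma_{jk}} \exp\left(-\frac{(x_k - \mu_{jk})^2}{2\sigma_{jk}^2}\right).$$

Here class $\omega_j$ has 2d parameters, and they are estimated according to maximum likelihood estimation by

$$\hat{\mu}_{jk} = \frac{1}{n_j} \sum_{i=1}^{n_j} x_{ik}^{(j)}, \qquad \hat{\sigma}_{jk}^2 = \frac{1}{n_j} \sum_{i=1}^{n_j} \left(x_{ik}^{(j)} - \hat{\mu}_{jk}\right)^2.$$

Plugging the results into the discriminant function given above and ignoring the common constant, we have

$$g_j(x) = \ln\frac{n_j}{n} - \sum_{k=1}^{d} \left[ \ln\hat{\sigma}_{jk} + \frac{(x_k - \hat{\mu}_{jk})^2}{2\hat{\sigma}_{jk}^2} \right].$$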
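The training steps can be sketched in Matlab on a small made-up data set. This is only an illustration under stated assumptions: it uses the log-form discriminant g_j(x) = ln p(x|ω_j) + ln P(ω_j) with the common constant dropped, and the data and the names X, labels, mu, sigma2, prior, and gfun are hypothetical and not taken from the course materials.

```matlab
% Toy training data (made up): one sample per row, one feature per column.
X = [1 2; 2 4; 3 3; 7 8; 9 8; 8 11];
labels = [1; 1; 1; 2; 2; 2];
C = 2;
[n, d] = size(X);

mu = zeros(C, d);       % mu(j,k): ML mean of feature k in class j
sigma2 = zeros(C, d);   % sigma2(j,k): ML variance (divides by n_j, not n_j - 1)
prior = zeros(C, 1);    % prior(j): ML estimate n_j / n
for j = 1:C
    Xj = X(labels == j, :);
    nj = size(Xj, 1);
    prior(j) = nj / n;
    mu(j, :) = sum(Xj, 1) / nj;
    sigma2(j, :) = sum((Xj - repmat(mu(j, :), nj, 1)).^2, 1) / nj;
end

% Discriminant g_j(x); the common constant -(d/2)*log(2*pi) is dropped.
gfun = @(x, j) log(prior(j)) - 0.5 * sum(log(sigma2(j, :))) ...
               - sum((x - mu(j, :)).^2 ./ (2 * sigma2(j, :)));

x = [2 3];                      % a hypothetical test point
g = [gfun(x, 1), gfun(x, 2)];
[gmax, predicted] = max(g);     % assign x to the class with the largest g
```

Note that 0.5*log(sigma2) equals log(sigma), so the expression matches a discriminant written in terms of the standard deviation. A variance of exactly zero would make the logarithm diverge, so real data may need a small positive floor on sigma2.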

 

2)     (15 points) Implement your steps using a programming language of your choice and then apply your program to the wine dataset for leave-one-out classification (available on the course web page and http://www.ics.uci.edu/~mlearn/MLRepository.html). You need to include the results from your program and your source code.

 

Given the equations above, to do leave-one-out recognition on the wine dataset, we need to do the following:

-        For each sample in the dataset

o      We form a training set by removing it from the entire dataset (177 samples).

o      We estimate the 26 parameters of each class (13 means and 13 variances), which gives 78 parameters in total.

o      For the sample that was left out, we compute g1, g2, and g3 and assign it to the class with the largest g.

o      We compare the classification result with the true label: if they are different, it is a mistake; otherwise, the sample is correctly classified.

-        Output the classification rate

 

Here is a Matlab program

 

C=3;                        % number of classes
wine_uci                    % read the dataset into imgWine (column 1 is the class label)

% Count the samples in each class
ns=zeros(1,C);
for i=1:C,
    ns(i)=sum(imgWine(:,1)==i);
end

% startInd(i) is the number of rows that precede class i once the rows are grouped by class
startInd=zeros(1,C+1);
for i=2:(C+1),
    startInd(i)=startInd(i-1)+ns(i-1);
end

% Group the feature rows by class so that each class occupies a contiguous block
nsInd=zeros(1,C);
imgMat=imgWine(:,2:size(imgWine,2));
for i=1:size(imgWine,1),
    k=imgWine(i,1);
    nsInd(k)=nsInd(k)+1;
    imgMat(nsInd(k)+startInd(k),:)=imgWine(i,2:size(imgWine,2));
end

clf;
correct=0;
wrong=0;
total=0;
for c=1:C,
    for k=1:ns(c),
    % Leave-one-out classification of the kth sample of class c
    testVec=imgMat(startInd(c)+k,:);
    gx=zeros(1,C);
    for i=1:C,
      subInd=(startInd(i)+1):startInd(i+1);
      if i==c,
        % Remove the test sample from the training rows of its own class
        subInd=startInd(i)+[1:(k-1) (k+1):ns(i)];
      end
      subMat=imgMat(subInd,:);
      % ML estimates of the mean and variance of each feature
      mean_vec=sum(subMat)/size(subInd,2);
      var_vec=sum(subMat.^2)/size(subInd,2)-mean_vec.^2;
      % Keep the variances strictly positive
      if min(var_vec) <= 0.000000001,
          var_vec=var_vec-min(var_vec)+0.000000001;
      end
      % Log-likelihood of the test sample with the common constant dropped.
      % The log prior is not included here; add log(size(subInd,2)) to include it.
      gx_marg=-0.5*log(var_vec)-(testVec-mean_vec).^2./(2*var_vec);
      gx(i)=sum(gx_marg);
    end
    [Y, c1]=max(gx);        % assign the sample to the class with the largest g
    p=plot(1:C,gx,'-');
    set(p,'LineWidth',2);
    gxStr=sprintf('%6.2f  ',gx);
    disp(sprintf('Test sample %d from class %d: Number of features %d, classified as %d with\n\t gx=[%s]',k,c,size(imgMat,2),c1,gxStr));
    if c1==c,
        resStr='correct';
        correct=correct+1;
    else
        resStr=sprintf('wrong (class %d)', c);
        wrong=wrong+1;
    end
    total=total+1;
    title(sprintf('Classified as \\omega%d, which is %s (total %d (%d correct %d wrong), which is %4.2f%%)',c1,resStr,total,correct, wrong, correct*100/total),'FontSize',12);
    pause(2)                % pause so that the plot can be inspected
    end
end
disp(sprintf('Total %d (%d correct %d wrong), which is %4.2f%%',total,correct, wrong, correct*100/total));

 

3)     (Extra credit, 10 points) Implement linear dimension reduction assuming that a linear transformation is known. Then apply your program to the wine dataset for leave-one-out classification using either 1) a randomly generated linear transformation, 2) principal component analysis, or 3) Fisher discriminant analysis. You need to include the results from your program and your source code.

 

 

This is similar to the original program except that we reduce the dimension first by multiplying the data matrix with W, the dimension reduction matrix.
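The projection step can be sketched as follows for two of the listed choices of W. This is a minimal illustration on made-up data, assuming the data matrix stores one sample per row; the names X, m, W_rand, W_pca, Xp_rand, and Xp_pca are hypothetical, and the reduced dimension m = 2 is an arbitrary choice.

```matlab
% Toy data (made up): 5 samples, 3 features, one sample per row.
X = [2 0 1; 4 1 3; 6 1 5; 8 2 7; 10 1 9];
[n, d] = size(X);
m = 2;                                  % reduced dimension (hypothetical choice)

% Option 1: a randomly generated linear transformation.
W_rand = randn(d, m);
Xp_rand = X * W_rand;                   % n-by-m projected data

% Option 2: principal component analysis.
mean_vec = sum(X, 1) / n;
Xc = X - repmat(mean_vec, n, 1);        % centered data
S = (Xc' * Xc) / n;                     % d-by-d sample covariance matrix
S = (S + S') / 2;                       % enforce exact symmetry before eig
[V, D] = eig(S);                        % eigenvectors of a symmetric matrix are orthonormal
[vals, order] = sort(diag(D), 'descend');
W_pca = V(:, order(1:m));               % eigenvectors with the m largest eigenvalues
Xp_pca = X * W_pca;                     % n-by-m projected data
```

The projected matrix then takes the place of the original data matrix in the classification loop. For a leave-one-out estimate with PCA or Fisher discriminant analysis, it is usually safer to compute W from the training samples only, so that the left-out sample does not influence the transformation.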