Next, use the second library I found, and implement a random forest with it.

The uncompiled archive

randomforest-matlab.zip
This archive contains code for both regression and classification; the specific files are listed in Figure 1.
Figure 1. File list of the uncompiled archive
To use these files, run the tutorial script tutorial_ClassRF.m shown in the figure. However, doing so raises a MEX compilation error, which I have not yet resolved.

Workaround

MEX files are used when mixing C/C++ code with MATLAB: they let MATLAB call compiled C/C++ routines, which speeds up execution.
There is a tutorial on developing MATLAB MEX files in Microsoft Visual Studio and generating mexw64 or mexw32 binaries.
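As a quick smoke test of the MEX toolchain itself, it can help to build the example MEX source that ships with MATLAB; a minimal sketch (it assumes a supported C compiler such as Visual Studio is installed, and uses the yprime.c example bundled under matlabroot):

  % Select/verify a C compiler for building MEX files
  mex -setup

  % Copy MATLAB's shipped example MEX source and compile it
  copyfile(fullfile(matlabroot,'extern','examples','mex','yprime.c'), '.');
  mex yprime.c    % produces yprime.mexw64 on 64-bit Windows
  yprime(1, 1:4)  % call the compiled MEX function to confirm it runs

If this builds and runs, the compiler setup is fine and the earlier error is specific to the package's own compile scripts.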

The precompiled archive

So instead, use the precompiled package directly: Windows-Precompiled-RF_MexStandalone-v0.02-.zip.
Its file list is shown in Figure 2:
Figure 2. File list of the precompiled archive
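Before running the tutorial, the unzipped folder has to be on the MATLAB path; a minimal sketch, assuming the archive was extracted to a folder named RF_MexStandalone-v0.02 (the actual folder name on disk may differ):

  % Add the package and its subfolders to the search path
  addpath(genpath('RF_MexStandalone-v0.02'));
  which classRF_train   % should resolve to the package's classRF_train.m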
Then open tutorial_ClassRF.m directly; its code is as follows:

  % A simple tutorial file to interface with RF
  % Options copied from http://cran.r-project.org/web/packages/randomForest/randomForest.pdf

  % run plethora of tests
  clc
  close all

  % compile everything
  if strcmpi(computer,'PCWIN') || strcmpi(computer,'PCWIN64')
      compile_windows
  else
      compile_linux
  end
  total_train_time = 0;
  total_test_time = 0;

  % load the twonorm dataset
  load data/twonorm

  % modify so that training data is NxD and labels are Nx1, where N = # of
  % examples, D = # of features
  X = inputs';
  Y = outputs;

  [N, D] = size(X);

  % randomly split into 250 examples for training and 50 for testing
  randvector = randperm(N);
  X_trn = X(randvector(1:250),:);
  Y_trn = Y(randvector(1:250));
  X_tst = X(randvector(251:end),:);
  Y_tst = Y(randvector(251:end));

  % example 1: simply use with the defaults
  model = classRF_train(X_trn,Y_trn);
  Y_hat = classRF_predict(X_tst,model);
  fprintf('\nexample 1: error rate %f\n', length(find(Y_hat~=Y_tst))/length(Y_tst));

  % example 2: set to 100 trees
  model = classRF_train(X_trn,Y_trn, 100);
  Y_hat = classRF_predict(X_tst,model);
  fprintf('\nexample 2: error rate %f\n', length(find(Y_hat~=Y_tst))/length(Y_tst));

  % example 3: set to 100 trees, mtry = 2
  model = classRF_train(X_trn,Y_trn, 100, 2);
  Y_hat = classRF_predict(X_tst,model);
  fprintf('\nexample 3: error rate %f\n', length(find(Y_hat~=Y_tst))/length(Y_tst));

  % example 4: set to default trees and mtry by specifying values as 0
  model = classRF_train(X_trn,Y_trn, 0, 0);
  Y_hat = classRF_predict(X_tst,model);
  fprintf('\nexample 4: error rate %f\n', length(find(Y_hat~=Y_tst))/length(Y_tst));

  % example 5: set sampling without replacement (default is with replacement)
  extra_options.replace = 0;
  model = classRF_train(X_trn,Y_trn, 100, 4, extra_options);
  Y_hat = classRF_predict(X_tst,model);
  fprintf('\nexample 5: error rate %f\n', length(find(Y_hat~=Y_tst))/length(Y_tst));
  % example 6: using classwt (priors of classes)
  clear extra_options;
  extra_options.classwt = [1 1]; % for the [-1 +1] classes in twonorm
  % if you sort the training labels in ascending order, twonorm has the
  % classes -1 and +1; here both classes are assigned a weight of 1.
  % When classwt is specified, the priors are combined with the observed
  % frequency of the labels in the data. If you are confused, look at the
  % normClassWt() function in src/rfutils.cpp.
  model = classRF_train(X_trn,Y_trn, 100, 4, extra_options);
  Y_hat = classRF_predict(X_tst,model);
  fprintf('\nexample 6: error rate %f\n', length(find(Y_hat~=Y_tst))/length(Y_tst));

  % example 7: modify to make class(es) more IMPORTANT than the others
  % extra_options.cutoff (classification only) = a vector of length equal to
  % the number of classes. The 'winning' class for an observation is the one
  % with the maximum ratio of proportion of votes to cutoff. Default is 1/k
  % where k is the number of classes (i.e., majority vote wins).
  clear extra_options;
  extra_options.cutoff = [1/4 3/4]; % for the [-1 +1] classes in twonorm
  % if you sort the training labels in ascending order, twonorm has the
  % classes -1 and +1, assigned cutoffs of 1/4 and 3/4 respectively;
  % thus the second class needs far fewer votes to win than the first class
  model = classRF_train(X_trn,Y_trn, 100, 4, extra_options);
  Y_hat = classRF_predict(X_tst,model);
  fprintf('\nexample 7: error rate %f\n', length(find(Y_hat~=Y_tst))/length(Y_tst));
  fprintf(' y_trn is almost 50/50 but y_hat now has %f/%f split\n', length(find(Y_hat~=-1))/length(Y_tst), length(find(Y_hat~=1))/length(Y_tst));
  % extra_options.strata = (not yet stable in code) variable that is used for
  % stratified sampling. I don't yet know how this works.

  % example 8: sampsize example
  % extra_options.sampsize = size(s) of sample to draw. For classification,
  % if sampsize is a vector of length equal to the number of strata, then
  % sampling is stratified by strata, and the elements of sampsize indicate
  % the numbers to be drawn from the strata.
  clear extra_options
  extra_options.sampsize = size(X_trn,1)*2/3;
  model = classRF_train(X_trn,Y_trn, 100, 4, extra_options);
  Y_hat = classRF_predict(X_tst,model);
  fprintf('\nexample 8: error rate %f\n', length(find(Y_hat~=Y_tst))/length(Y_tst));

  % example 9: nodesize
  % extra_options.nodesize = minimum size of terminal nodes. Setting this
  % number larger causes smaller trees to be grown (and thus take less time).
  % Note that the default values are different for classification (1) and
  % regression (5).
  clear extra_options
  extra_options.nodesize = 2;
  model = classRF_train(X_trn,Y_trn, 100, 4, extra_options);
  Y_hat = classRF_predict(X_tst,model);
  fprintf('\nexample 9: error rate %f\n', length(find(Y_hat~=Y_tst))/length(Y_tst));

  % example 10: calculating importance
  clear extra_options
  extra_options.importance = 1; % (0 = (default) don't, 1 = calculate)
  model = classRF_train(X_trn,Y_trn, 100, 4, extra_options);
  Y_hat = classRF_predict(X_tst,model);
  fprintf('\nexample 10: error rate %f\n', length(find(Y_hat~=Y_tst))/length(Y_tst));
  % model now has three extra fields: importance, importanceSD and localImp
  % importance = a matrix with nclass + 2 (for classification) or two (for
  % regression) columns. For classification, the first nclass columns are the
  % class-specific measures computed as mean decrease in accuracy. The
  % (nclass+1)-st column is the mean decrease in accuracy over all classes.
  % The last column is the mean decrease in the Gini index. For regression,
  % the first column is the mean decrease in accuracy and the second the mean
  % decrease in MSE. If importance=FALSE, the last measure is still returned
  % as a vector.
  figure('Name','Importance Plots')
  subplot(2,1,1);
  bar(model.importance(:,end-1)); xlabel('feature'); ylabel('magnitude');
  title('Mean decrease in Accuracy');
  subplot(2,1,2);
  bar(model.importance(:,end)); xlabel('feature'); ylabel('magnitude');
  title('Mean decrease in Gini index');

  % importanceSD = the 'standard errors' of the permutation-based importance
  % measure. For classification, a D by nclass + 1 matrix corresponding to
  % the first nclass + 1 columns of the importance matrix. For regression, a
  % length D vector.
  model.importanceSD
  % example 11: calculating local importance
  % extra_options.localImp = should casewise importance measures be computed?
  % (Setting this to TRUE will override importance.)
  % localImp = a D by N matrix containing the casewise importance measures,
  % the [i,j] element of which is the importance of the i-th variable on the
  % j-th case. NULL if localImp=FALSE.
  clear extra_options
  extra_options.localImp = 1; % (0 = (default) don't, 1 = calculate)
  model = classRF_train(X_trn,Y_trn, 100, 4, extra_options);
  Y_hat = classRF_predict(X_tst,model);
  fprintf('\nexample 11: error rate %f\n', length(find(Y_hat~=Y_tst))/length(Y_tst));
  model.localImp

  % example 12: calculating proximity
  % extra_options.proximity = should the proximity measure among the rows be
  % calculated?
  clear extra_options
  extra_options.proximity = 1; % (0 = (default) don't, 1 = calculate)
  model = classRF_train(X_trn,Y_trn, 100, 4, extra_options);
  Y_hat = classRF_predict(X_tst,model);
  fprintf('\nexample 12: error rate %f\n', length(find(Y_hat~=Y_tst))/length(Y_tst));
  model.proximity

  % example 13: use only OOB for proximity
  % extra_options.oob_prox = should proximity be calculated only on
  % 'out-of-bag' data?
  clear extra_options
  extra_options.proximity = 1; % (0 = (default) don't, 1 = calculate)
  extra_options.oob_prox = 0;  % (default = 1 if proximity is enabled, 0 = don't)
  model = classRF_train(X_trn,Y_trn, 100, 4, extra_options);
  Y_hat = classRF_predict(X_tst,model);
  fprintf('\nexample 13: error rate %f\n', length(find(Y_hat~=Y_tst))/length(Y_tst));

  % example 14: to see what is going on behind the scenes
  % extra_options.do_trace = if set to TRUE, give a more verbose output as
  % randomForest is run. If set to some integer, then running output is
  % printed for every do_trace trees.
  clear extra_options
  extra_options.do_trace = 1; % (default = 0)
  model = classRF_train(X_trn,Y_trn, 100, 4, extra_options);
  Y_hat = classRF_predict(X_tst,model);
  fprintf('\nexample 14: error rate %f\n', length(find(Y_hat~=Y_tst))/length(Y_tst));
  % example 15: keeping track of which samples are in-bag
  % extra_options.keep_inbag = should an n by ntree matrix be returned that
  % keeps track of which samples are 'in-bag' in which trees (but not how
  % many times, if sampling with replacement)?
  clear extra_options
  extra_options.keep_inbag = 1; % (default = 0)
  model = classRF_train(X_trn,Y_trn, 100, 4, extra_options);
  Y_hat = classRF_predict(X_tst,model);
  fprintf('\nexample 15: error rate %f\n', length(find(Y_hat~=Y_tst))/length(Y_tst));
  model.inbag

  % example 16: getting the OOB rate. model has a field errtr whose first
  % column is the OOB error rate; the second column is for the 1st class,
  % and so on
  model = classRF_train(X_trn,Y_trn);
  Y_hat = classRF_predict(X_tst,model);
  fprintf('\nexample 16: error rate %f\n', length(find(Y_hat~=Y_tst))/length(Y_tst));
  figure('Name','OOB error rate');
  plot(model.errtr(:,1)); title('OOB error rate'); xlabel('iteration (# trees)'); ylabel('OOB error rate');

  % example 17: getting the prediction per tree, votes etc. for the test set
  model = classRF_train(X_trn,Y_trn);
  test_options.predict_all = 1;
  [Y_hat, votes, prediction_per_tree] = classRF_predict(X_tst,model,test_options);
  fprintf('\nexample 17: error rate %f\n', length(find(Y_hat~=Y_tst))/length(Y_tst));

Figure 3. Result of example 10 (importance plots)
Figure 4. Result of example 16 (OOB error rate)
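As a side note, the error-rate expression repeated throughout the tutorial can be written more compactly; an equivalent one-liner:

  % the mean of the 0/1 mismatch vector is the misclassification rate
  err = mean(Y_hat ~= Y_tst);
  fprintf('error rate %f\n', err);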