Resources

books

  • Zhou Zhihua, *Machine Learning*, Chapter 7: Bayesian Classifiers
  • Li Hang, *Statistical Learning Methods*, Chapter 4: Naive Bayes
  • *Machine Learning in Action*, Chapter 4: Classifying with probability theory: naive Bayes

documents

  • https://zhuanlan.zhihu.com/p/26262151

    Example: Block insulting comments

2 Prepare data

1 Argument analyse

  • CreateDataSet():

    • describe: create the data set (or import one)
    • returns: the data set and its class label vector
  • CreateWordList():
    • describe: build a vocabulary from the words of all documents
    • returns: word set (the vocabulary list)
  • BuildWordVector():
    • describe: convert the input words into a word vector
    • arguments:
      • wordSet: the vocabulary list
      • inputSet: the words of the input document
    • returns: word vector

2 Process analyse

  • CreateDataSet():
  • CreateWordList():
  • BuildWordVector():

3 Code analyse

```python
def CreateDataSet():
    postingList = [
        ['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
        ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
        ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
        ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
        ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
        ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    # 1 marks an insulting document, 0 a non-insulting one
    classVector = [0, 1, 0, 1, 0, 1]
    return postingList, classVector


def CreateWordList(dataSet):
    # create an empty set
    wordSet = set()
    for data in dataSet:
        # | computes the union of two sets
        wordSet = wordSet | set(data)
    return list(wordSet)


def BuildWordVector(wordSet, inputSet):
    result = [0] * len(wordSet)
    for word in inputSet:
        if word in wordSet:
            result[wordSet.index(word)] = 1
    return result
```
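A quick sanity check of the three steps together. Compact copies of the two helper functions are repeated here so the snippet runs on its own, and the two posts below are a subset of the data set above:

```python
# Compact copies of CreateWordList and BuildWordVector from above.
def CreateWordList(dataSet):
    wordSet = set()
    for data in dataSet:
        wordSet = wordSet | set(data)
    return list(wordSet)


def BuildWordVector(wordSet, inputSet):
    result = [0] * len(wordSet)
    for word in inputSet:
        if word in wordSet:
            result[wordSet.index(word)] = 1
    return result


posts = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
         ['stop', 'posting', 'stupid', 'worthless', 'garbage']]
vocab = CreateWordList(posts)           # 12 unique words
vec = BuildWordVector(vocab, posts[0])  # one 0/1 slot per vocabulary word
print(sum(vec))  # 7: every word of the first post is in the vocabulary
```

The vector has one position per vocabulary word, so documents of different lengths all map to vectors of the same size.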

3 Train algorithm

1 Arguments analyse

  • describe: train the naive Bayes classifier and compute the word-probability vectors
  • arguments:
    • trainText: documents to be trained on (as word vectors)
    • trainCategory: document category labels
  • returns:
    • p0Vector: word-appearance probability vector for non-insulting documents
    • p1Vector: word-appearance probability vector for insulting documents
    • appearProb: the probability that a document is insulting

2 Process analyse

  • Calculate the number of documents and the number of words per document
  • Calculate the probability that a document is insulting
  • Initialize the word-appearance count vectors and total word counts
  • Accumulate the word-appearance counts and total word counts (a loop with a condition statement)
  • Calculate the probabilities (in log form in the optimized version)

3 Code implementation

```python
import numpy as np


def TrainAlgorithm(trainText, trainCategory):
    documentsNum = len(trainText)
    documentWordsNum = len(trainText[0])
    appearProb = np.sum(trainCategory) / documentsNum
    p0Num = np.zeros(documentWordsNum)
    p1Num = np.zeros(documentWordsNum)
    p0TotalNum = 0
    p1TotalNum = 0
    for i in range(documentsNum):
        if trainCategory[i] == 0:
            p0Num += trainText[i]
            p0TotalNum += np.sum(trainText[i])
        else:
            p1Num += trainText[i]
            p1TotalNum += np.sum(trainText[i])
    p0Vector = p0Num / p0TotalNum
    p1Vector = p1Num / p1TotalNum
    return p0Vector, p1Vector, appearProb
```

4 Optimized algorithm code implementation

  • Initializing the word counts to ones and the totals to 2.0 applies Laplace smoothing, so a word that never appears in one class does not force a zero probability; taking logarithms avoids floating-point underflow when many small probabilities are combined.

```python
import numpy as np


def TrainAlgorithm(trainText, trainCategory):
    documentsNum = len(trainText)
    documentWordsNum = len(trainText[0])
    appearProb = np.sum(trainCategory) / documentsNum
    # Laplace smoothing: start every word count at 1 and each total at 2.0
    p0Num = np.ones(documentWordsNum)
    p1Num = np.ones(documentWordsNum)
    p0TotalNum = 2.0
    p1TotalNum = 2.0
    for i in range(documentsNum):
        if trainCategory[i] == 0:
            p0Num += trainText[i]
            p0TotalNum += np.sum(trainText[i])
        else:
            p1Num += trainText[i]
            p1TotalNum += np.sum(trainText[i])
    # log probabilities avoid underflow when summed at classification time
    p0Vector = np.log(p0Num / p0TotalNum)
    p1Vector = np.log(p1Num / p1TotalNum)
    return p0Vector, p1Vector, appearProb
```
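The effect of the two changes can be seen on a small made-up count vector (the 4-word vocabulary and counts below are hypothetical, chosen so one word never appears in the class):

```python
import numpy as np

# Hypothetical counts for a 4-word vocabulary; the word at index 2
# never appeared in this class.
raw_counts = np.array([3.0, 1.0, 0.0, 2.0])

# Unsmoothed: the unseen word gets probability 0, and log(0) = -inf
# would dominate the sum of log-probabilities at classification time.
unsmoothed = raw_counts / raw_counts.sum()

# Laplace smoothing as in the optimized code: counts start at 1 and the
# denominator at 2, so every word keeps a small nonzero probability.
smoothed = (raw_counts + 1.0) / (raw_counts.sum() + 2.0)

print(unsmoothed[2])     # 0.0
print(np.log(smoothed))  # all entries finite
```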

4 Test algorithm

1 Arguments analyse

  • describe: classify a word vector with the trained naive Bayes model
  • arguments:
    • classifyVector: the word vector to be classified
    • p0Vector: word-appearance log-probability vector for non-insulting documents
    • p1Vector: word-appearance log-probability vector for insulting documents
    • appearProb: the probability that a document is insulting
  • returns: the predicted class (1 for insulting, 0 for non-insulting)

2 Process analyse

  • Calculate the formula: log(P(F1|C)) + log(P(F2|C)) + … + log(P(Fn|C)) + log(P(C))
  • (classifyVector * p1Vector) pairs each word with its log probability: the elementwise product keeps only the log-probabilities of the words that actually appear in the document

3 Code implementation

```python
import numpy as np


def Naive_Bayes_Classify(classifyVector, p0Vector, p1Vector, appearProb):
    p1 = np.sum(classifyVector * p1Vector) + np.log(appearProb)
    p0 = np.sum(classifyVector * p0Vector) + np.log(1 - appearProb)
    if p1 > p0:
        return 1
    else:
        return 0
```
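Putting the pieces together, a minimal end-to-end check. Compact copies of the functions above are repeated so the snippet runs on its own, and the four short posts are made-up toy data, not the full data set:

```python
import numpy as np


def CreateWordList(dataSet):
    wordSet = set()
    for data in dataSet:
        wordSet = wordSet | set(data)
    return list(wordSet)


def BuildWordVector(wordSet, inputSet):
    result = [0] * len(wordSet)
    for word in inputSet:
        if word in wordSet:
            result[wordSet.index(word)] = 1
    return result


def TrainAlgorithm(trainText, trainCategory):
    documentsNum = len(trainText)
    documentWordsNum = len(trainText[0])
    appearProb = np.sum(trainCategory) / documentsNum
    p0Num = np.ones(documentWordsNum)       # Laplace smoothing
    p1Num = np.ones(documentWordsNum)
    p0TotalNum = 2.0
    p1TotalNum = 2.0
    for i in range(documentsNum):
        if trainCategory[i] == 0:
            p0Num += trainText[i]
            p0TotalNum += np.sum(trainText[i])
        else:
            p1Num += trainText[i]
            p1TotalNum += np.sum(trainText[i])
    p0Vector = np.log(p0Num / p0TotalNum)
    p1Vector = np.log(p1Num / p1TotalNum)
    return p0Vector, p1Vector, appearProb


def Naive_Bayes_Classify(classifyVector, p0Vector, p1Vector, appearProb):
    p1 = np.sum(classifyVector * p1Vector) + np.log(appearProb)
    p0 = np.sum(classifyVector * p0Vector) + np.log(1 - appearProb)
    return 1 if p1 > p0 else 0


# Toy training data: two friendly posts (0) and two insulting posts (1).
posts = [['my', 'dog', 'is', 'cute'],
         ['stupid', 'worthless', 'dog'],
         ['love', 'my', 'dog'],
         ['stop', 'posting', 'stupid', 'garbage']]
labels = [0, 1, 0, 1]
vocab = CreateWordList(posts)
trainMatrix = np.array([BuildWordVector(vocab, p) for p in posts])
p0V, p1V, pAb = TrainAlgorithm(trainMatrix, labels)

print(Naive_Bayes_Classify(
    np.array(BuildWordVector(vocab, ['love', 'my', 'dog'])), p0V, p1V, pAb))  # 0
print(Naive_Bayes_Classify(
    np.array(BuildWordVector(vocab, ['stupid', 'garbage'])), p0V, p1V, pAb))  # 1
```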

5 Process chart

(Flowchart screenshot: 屏幕快照 2020-07-12 下午11.02.02.png)