Origin

I have recently been reading "My First Algorithm Book" (by Moriteru Ishida and Shuichi Miyazaki).
This series of notes works through the book's algorithms in Go.

PageRank and Random Walks

  1. PageRank (named after Larry Page) is an algorithm for ordering the results of a web search.
  2. PageRank computes the value of a page from the link structure between pages.
  3. In PageRank, the more in-links a page has, the more important it is.
  4. A page with no in-links is assigned a weight of 1.
  5. The weight of a page with in-links is the sum of the weights passed in by those links.
  6. If a page links out to several pages, its weight is split equally among the pages it links to.
  7. In PageRank, the more in-links a page has, the more valuable the links it sends out.
  8. The "random walk model" can be used to handle pages that link to one another (cycles in the link graph).
  9. A user's browsing behavior is modeled as follows:
  10. with probability 1-α, the user follows one of the current page's out-links, chosen uniformly at random;
  11. with probability α, the user jumps to one of the other pages, chosen uniformly at random.
  12. Simulate this random surfing process:
  13. each visit to a page adds 1 to its weight,
  14. until the total number of visits reaches N.
  15. Each page's final weight then approximates "the probability that a user is viewing this page at a given moment",
  16. and can be used directly as the page's rank.
  17. Quoted from "My First Algorithm Book" by Moriteru Ishida and Shuichi Miyazaki.
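Rules 4-6 above can be illustrated with a small one-pass calculation on an acyclic link graph (a minimal sketch of the weight-propagation rules only, not the random-walk algorithm implemented later; the `links` and `order` values are ad hoc for this example):

```go
package main

import "fmt"

func main() {
	// Out-links: A and B both link to C; C links to D and E.
	links := map[string][]string{
		"A": {"C"},
		"B": {"C"},
		"C": {"D", "E"},
	}
	// Process pages in an order where all in-links are resolved first
	// (a topological order; hard-coded here for brevity).
	order := []string{"A", "B", "C", "D", "E"}

	weight := map[string]float64{}
	inbound := map[string]float64{} // weight accumulated from in-links
	for _, p := range order {
		w := inbound[p]
		if w == 0 { // rule 4: no in-links => weight 1
			w = 1
		}
		weight[p] = w
		// rule 6: split this page's weight equally among its out-links
		for _, q := range links[p] {
			inbound[q] += w / float64(len(links[p]))
		}
	}
	fmt.Println(weight) // C accumulates 1+1 = 2; D and E each receive 2/2 = 1
}
```

This only works for acyclic graphs; as soon as pages link back to each other (as p31 and p32 do below), the random walk model is needed.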

Goal

  • Implement PageRank based on the random walk model, and verify its effectiveness and stability (page weights should settle after enough simulated visits)

Design

  • IPage: page model interface
  • IPageRanking: page ranking algorithm interface
  • tPage: page model implementation
  • tRandomWalkPageRanking: random-walk-based PageRank, implementing the IPageRanking interface

Unit test

  • page_rank_test.go verifies the effectiveness and stability of the ranking algorithm
  • First, a simple case checks effectiveness
  • Then a large batch of randomly interlinked pages is generated, to check that page weights stay stable across multiple rounds of random walks

```go
package others

import (
	"fmt"
	pr "learning/gooop/others/page_rank"
	"math/rand"
	"sort"
	"testing"
	"time"
)

func Test_PageRank(t *testing.T) {
	fnAssertTrue := func(b bool, msg string) {
		if !b {
			t.Fatal(msg)
		}
	}

	t.Log("testing simple case")
	p11 := pr.NewPage("p11")
	p12 := pr.NewPage("p12")
	p13 := pr.NewPage("p13")
	p21 := pr.NewPage("p21")
	p22 := pr.NewPage("p22")
	p31 := pr.NewPage("p31")
	p32 := pr.NewPage("p32")

	p11.AddLink(p21)
	p11.AddLink(p22)
	p12.AddLink(p21)
	p12.AddLink(p22)
	p13.AddLink(p21)
	p13.AddLink(p22)
	p21.AddLink(p31)
	p22.AddLink(p31)
	p31.AddLink(p32)
	p32.AddLink(p31)

	samples := []pr.IPage{
		p11, p12, p13, p21, p22, p31, p32,
	}
	pr.RandomWalkPageRanking.RankAll(samples, 1000)
	sort.Sort(sort.Reverse(pr.IPageSlice(samples)))
	for _, p := range samples {
		t.Log(p)
	}
	fnAssertTrue(samples[0].ID() == "p31", "expecting top.1 = p31")
	fnAssertTrue(samples[1].ID() == "p32", "expecting top.2 = p32")
	fnAssertTrue(samples[2].ID() == "p21" || samples[2].ID() == "p22", "expecting top.3 in (p21,p22)")
	fnAssertTrue(samples[3].ID() == "p21" || samples[3].ID() == "p22", "expecting top.4 in (p21,p22)")

	// generate 1000 random pages
	iPageCount := 1000
	pages := make([]pr.IPage, iPageCount)
	for i := range pages {
		pages[i] = pr.NewPage(fmt.Sprintf("p%02d.com", i))
	}
	r := rand.New(rand.NewSource(time.Now().UnixNano()))
	for i, p := range pages {
		// add random page links
		for j, iPageLinks := 0, r.Intn(10); j < iPageLinks; {
			n := r.Intn(iPageCount)
			if n != i {
				j++
				p.AddLink(pages[n])
			}
		}
	}

	// rank pages and track the top 100
	mapTop100 := make(map[string]bool)
	fnTestRanking := func(rounds int) {
		t.Logf("testing page rank with %v rounds", rounds)
		bFirstRound := len(mapTop100) == 0
		// page ranking
		pr.RandomWalkPageRanking.RankAll(pages, rounds)
		// sort pages by ranking
		sort.Sort(sort.Reverse(pr.IPageSlice(pages)))
		hits := 0
		for i, p := range pages {
			if i < 10 {
				t.Log(p)
			}
			if i < 100 {
				if bFirstRound {
					mapTop100[p.ID()] = true
				} else if _, ok := mapTop100[p.ID()]; ok {
					hits++
				}
			} else {
				break
			}
		}
		if !bFirstRound {
			// note: prints the percent sign before the number, e.g. "%98"
			t.Logf("hit rate: %s%v", "%", hits)
		}
	}

	// test stability under different rounds
	fnTestRanking(3000)
	fnTestRanking(10000)
	fnTestRanking(30000)
	fnTestRanking(90000)
}
```

Test output

  • The tests suggest that the ranking is fairly stable once the total number of random-walk visits reaches at least (page count × average out-links per page)
```
$ go test -v page_rank_test.go
=== RUN   Test_PageRank
    page_rank_test.go:19: testing simple case
    page_rank_test.go:47: p(p31, 0.4206, [p32])
    page_rank_test.go:47: p(p32, 0.3673, [p31])
    page_rank_test.go:47: p(p21, 0.0639, [p31])
    page_rank_test.go:47: p(p22, 0.0604, [p31])
    page_rank_test.go:47: p(p11, 0.0300, [p21 p22])
    page_rank_test.go:47: p(p12, 0.0291, [p21 p22])
    page_rank_test.go:47: p(p13, 0.0287, [p21 p22])
    page_rank_test.go:77: testing page rank with 3000 rounds
    page_rank_test.go:89: p(p604.com, 0.0039, [])
    page_rank_test.go:89: p(p807.com, 0.0035, [p709.com p328.com p303.com p110.com p858.com p394.com p231.com p731.com p83.com])
    page_rank_test.go:89: p(p72.com, 0.0034, [p249.com p347.com p604.com p533.com p416.com p958.com p966.com p385.com])
    page_rank_test.go:89: p(p712.com, 0.0033, [p485.com p451.com p236.com p141.com p168.com p693.com])
    page_rank_test.go:89: p(p452.com, 0.0032, [p588.com p344.com p618.com p258.com p394.com p285.com p780.com p606.com p89.com])
    page_rank_test.go:89: p(p709.com, 0.0031, [p666.com p554.com p103.com p202.com p230.com])
    page_rank_test.go:89: p(p975.com, 0.0029, [])
    page_rank_test.go:89: p(p840.com, 0.0029, [p974.com p698.com p49.com p851.com p73.com])
    page_rank_test.go:89: p(p867.com, 0.0028, [p588.com p196.com p931.com p381.com p621.com p848.com])
    page_rank_test.go:89: p(p633.com, 0.0028, [p840.com])
    page_rank_test.go:77: testing page rank with 10000 rounds
    page_rank_test.go:89: p(p604.com, 0.0039, [])
    page_rank_test.go:89: p(p807.com, 0.0034, [p709.com p328.com p303.com p110.com p858.com p394.com p231.com p731.com p83.com])
    page_rank_test.go:89: p(p72.com, 0.0034, [p249.com p347.com p604.com p533.com p416.com p958.com p966.com p385.com])
    page_rank_test.go:89: p(p452.com, 0.0033, [p588.com p344.com p618.com p258.com p394.com p285.com p780.com p606.com p89.com])
    page_rank_test.go:89: p(p712.com, 0.0033, [p485.com p451.com p236.com p141.com p168.com p693.com])
    page_rank_test.go:89: p(p709.com, 0.0031, [p666.com p554.com p103.com p202.com p230.com])
    page_rank_test.go:89: p(p975.com, 0.0029, [])
    page_rank_test.go:89: p(p840.com, 0.0029, [p974.com p698.com p49.com p851.com p73.com])
    page_rank_test.go:89: p(p612.com, 0.0028, [p116.com p562.com p179.com p37.com p761.com])
    page_rank_test.go:89: p(p319.com, 0.0028, [p726.com p63.com p558.com p301.com p185.com p944.com])
    page_rank_test.go:104: hit rate: %98
    page_rank_test.go:77: testing page rank with 30000 rounds
    page_rank_test.go:89: p(p604.com, 0.0039, [])
    page_rank_test.go:89: p(p807.com, 0.0034, [p709.com p328.com p303.com p110.com p858.com p394.com p231.com p731.com p83.com])
    page_rank_test.go:89: p(p72.com, 0.0034, [p249.com p347.com p604.com p533.com p416.com p958.com p966.com p385.com])
    page_rank_test.go:89: p(p452.com, 0.0033, [p588.com p344.com p618.com p258.com p394.com p285.com p780.com p606.com p89.com])
    page_rank_test.go:89: p(p712.com, 0.0032, [p485.com p451.com p236.com p141.com p168.com p693.com])
    page_rank_test.go:89: p(p709.com, 0.0031, [p666.com p554.com p103.com p202.com p230.com])
    page_rank_test.go:89: p(p975.com, 0.0029, [])
    page_rank_test.go:89: p(p840.com, 0.0029, [p974.com p698.com p49.com p851.com p73.com])
    page_rank_test.go:89: p(p319.com, 0.0028, [p726.com p63.com p558.com p301.com p185.com p944.com])
    page_rank_test.go:89: p(p612.com, 0.0028, [p116.com p562.com p179.com p37.com p761.com])
    page_rank_test.go:104: hit rate: %98
    page_rank_test.go:77: testing page rank with 90000 rounds
    page_rank_test.go:89: p(p604.com, 0.0039, [])
    page_rank_test.go:89: p(p807.com, 0.0034, [p709.com p328.com p303.com p110.com p858.com p394.com p231.com p731.com p83.com])
    page_rank_test.go:89: p(p72.com, 0.0034, [p249.com p347.com p604.com p533.com p416.com p958.com p966.com p385.com])
    page_rank_test.go:89: p(p452.com, 0.0033, [p588.com p344.com p618.com p258.com p394.com p285.com p780.com p606.com p89.com])
    page_rank_test.go:89: p(p712.com, 0.0032, [p485.com p451.com p236.com p141.com p168.com p693.com])
    page_rank_test.go:89: p(p709.com, 0.0031, [p666.com p554.com p103.com p202.com p230.com])
    page_rank_test.go:89: p(p975.com, 0.0029, [])
    page_rank_test.go:89: p(p840.com, 0.0029, [p974.com p698.com p49.com p851.com p73.com])
    page_rank_test.go:89: p(p612.com, 0.0028, [p116.com p562.com p179.com p37.com p761.com])
    page_rank_test.go:89: p(p319.com, 0.0028, [p726.com p63.com p558.com p301.com p185.com p944.com])
    page_rank_test.go:104: hit rate: %98
--- PASS: Test_PageRank (13.93s)
PASS
ok      command-line-arguments  13.936s
```

IPage.go

Page model interface

```go
package page_rank

import "fmt"

// IPage models a web page with an ID, a ranking weight, and out-links.
type IPage interface {
	fmt.Stringer

	ID() string
	GetWeight() float64
	SetWeight(float64)
	GetLinks() []IPage
	AddLink(IPage)
}

// IPageSlice implements sort.Interface, ordering pages by weight.
type IPageSlice []IPage

func (me IPageSlice) Len() int {
	return len(me)
}

func (me IPageSlice) Less(i, j int) bool {
	return me[i].GetWeight() < me[j].GetWeight()
}

func (me IPageSlice) Swap(i, j int) {
	me[i], me[j] = me[j], me[i]
}
```
IPageRanking.go

Page ranking algorithm interface

```go
package page_rank

// IPageRanking ranks a set of pages over a given number of rounds.
type IPageRanking interface {
	RankAll(pages []IPage, rounds int)
}
```

tPage.go

Page model implementation

```go
package page_rank

import (
	"fmt"
	"strings"
)

type tPage struct {
	id     string
	weight float64
	links  []IPage
}

func NewPage(id string) IPage {
	return &tPage{
		id:     id,
		weight: 0,
		links:  []IPage{},
	}
}

func (me *tPage) ID() string {
	return me.id
}

func (me *tPage) GetWeight() float64 {
	return me.weight
}

func (me *tPage) SetWeight(w float64) {
	me.weight = w
}

func (me *tPage) GetLinks() []IPage {
	return me.links
}

func (me *tPage) AddLink(p IPage) {
	me.links = append(me.links, p)
}

// String renders the page as p(id, weight, [out-link ids]).
func (me *tPage) String() string {
	linkStrings := make([]string, len(me.links))
	for i, p := range me.links {
		linkStrings[i] = p.ID()
	}
	return fmt.Sprintf("p(%v, %8.4f, [%v])", me.id, me.weight, strings.Join(linkStrings, " "))
}
```

tRandomWalkPageRanking.go

Random-walk-based PageRank, implementing the IPageRanking interface

```go
package page_rank

import (
	"math/rand"
	"time"
)

type tRandomWalkPageRanking struct {
}

// probability (in percent) of following an out-link instead of jumping
// to a random unlinked page, i.e. 1-α = 85%
var gPossiblityToLinkedPage = 85

func newRandomWalkPageRanking() IPageRanking {
	return &tRandomWalkPageRanking{}
}

func (me *tRandomWalkPageRanking) RankAll(pages []IPage, rounds int) {
	iPageCount := len(pages)
	if iPageCount <= 0 {
		return
	}

	r := rand.New(rand.NewSource(time.Now().UnixNano()))
	current := pages[0]
	iVisitCount := iPageCount * rounds
	for i := 0; i < iVisitCount; i++ {
		// visit current page
		current.SetWeight(current.GetWeight() + 1)

		possibility := r.Intn(100)
		if possibility < gPossiblityToLinkedPage && len(current.GetLinks()) > 0 {
			// goto linked page
			current = me.VisitLinkedPage(current, r)
		} else {
			// goto unlinked page
			current = me.VisitUnlinkedPage(current, pages, r)
		}
	}

	// normalize visit counts into probabilities
	fVisitCount := float64(iVisitCount)
	for _, p := range pages {
		p.SetWeight(p.GetWeight() / fVisitCount)
	}
}

// VisitLinkedPage follows one of the current page's out-links, chosen uniformly.
func (me *tRandomWalkPageRanking) VisitLinkedPage(current IPage, r *rand.Rand) IPage {
	links := current.GetLinks()
	next := links[r.Intn(len(links))]
	return next
}

// VisitUnlinkedPage jumps to a random page that is neither the current page
// nor one of its out-links.
func (me *tRandomWalkPageRanking) VisitUnlinkedPage(current IPage, pages []IPage, r *rand.Rand) IPage {
	mapLinks := make(map[string]bool)
	mapLinks[current.ID()] = true
	for _, p := range current.GetLinks() {
		mapLinks[p.ID()] = true
	}

	n := len(pages)
	for {
		next := pages[r.Intn(n)]
		if _, ok := mapLinks[next.ID()]; !ok {
			return next
		}
	}
}

// RandomWalkPageRanking is the shared singleton instance.
var RandomWalkPageRanking = newRandomWalkPageRanking()
```
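As a cross-check on the random-walk results, the stationary distribution that the walk approximates can also be computed deterministically by power iteration, the standard PageRank formulation. This sketch is not part of the code above; `powerIteration` is an ad hoc helper, the 0.85 damping factor mirrors gPossiblityToLinkedPage, and the exact values differ slightly because the walk above jumps only to unlinked pages:

```go
package main

import "fmt"

// powerIteration computes PageRank scores deterministically: each step
// redistributes damped rank along out-links plus a uniform random-jump
// term. It assumes every page has at least one out-link.
func powerIteration(links [][]int, damping float64, iters int) []float64 {
	n := len(links)
	rank := make([]float64, n)
	for i := range rank {
		rank[i] = 1.0 / float64(n)
	}
	for it := 0; it < iters; it++ {
		next := make([]float64, n)
		for i := range next {
			next[i] = (1 - damping) / float64(n) // random jump
		}
		for i, out := range links {
			share := damping * rank[i] / float64(len(out))
			for _, j := range out {
				next[j] += share
			}
		}
		rank = next
	}
	return rank
}

func main() {
	// Same graph as the unit test; indices 0..6 = p11,p12,p13,p21,p22,p31,p32.
	links := [][]int{{3, 4}, {3, 4}, {3, 4}, {5}, {5}, {6}, {5}}
	rank := powerIteration(links, 0.85, 100)
	fmt.Printf("%.4f\n", rank)
	// p31 (index 5) ranks first, then p32 (index 6), matching the test's ordering.
}
```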

(end)