Basic framework code

The basic idea is to use the computer's multiple cores at the same time, so that pages are crawled and downloaded in parallel.

```python
import multiprocessing as mp
import time
from urllib.request import urlopen
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import re

base_url = 'https://yulizi123.github.io/'


def crawl(url):
    response = urlopen(url)
    time.sleep(0.1)
    return response.read().decode()


def parse(html):
    soup = BeautifulSoup(html, features='html.parser')
    urls = soup.find_all('a', {'href': re.compile('^/.+?/$')})
    title = soup.find('h1').get_text().strip()
    page_urls = set([urljoin(base_url, url['href']) for url in urls])
    url = soup.find('meta', {'property': 'og:url'})['content']
    return title, page_urls, url


unseen = set([base_url, ])
seen = set()

count, t1 = 1, time.time()
while len(unseen) != 0:             # still have some urls to visit
    if len(seen) > 20:
        break
    print('\nDistributed Crawling...')
    htmls = [crawl(url) for url in unseen]

    print('\nDistributed Parsing...')
    results = [parse(html) for html in htmls]

    print('\nAnalysing...')
    seen.update(unseen)             # mark the crawled urls as seen
    unseen.clear()                  # nothing left unseen

    for title, page_urls, url in results:
        print(count, title, url)
        count += 1
        unseen.update(page_urls - seen)     # collect new urls to crawl

print('Total time: %.1f s' % (time.time() - t1,))   # 53 s
```

The code above is still an ordinary single-process crawler. Because of various issues with the modules, the multiprocessing version cannot be debugged for now; it will be revised after the Python multiprocessing material has been studied.
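For reference, below is a minimal sketch of how the crawl/parse loop could later be parallelized with `multiprocessing.Pool`. It is not the tutorial's final version: it assumes the `crawl`/`parse` functions and the `seen`/`unseen` sets defined above, that both functions are defined at module level (so they can be pickled), and that the loop is guarded by `if __name__ == '__main__':`.

```python
import multiprocessing as mp

if __name__ == '__main__':
    pool = mp.Pool(4)                       # e.g. 4 worker processes

    while len(unseen) != 0:
        if len(seen) > 20:
            break

        # download all unseen urls in parallel across the worker processes
        crawl_jobs = [pool.apply_async(crawl, args=(url,)) for url in unseen]
        htmls = [job.get() for job in crawl_jobs]

        # parse the downloaded pages in parallel as well
        parse_jobs = [pool.apply_async(parse, args=(html,)) for html in htmls]
        results = [job.get() for job in parse_jobs]

        seen.update(unseen)
        unseen.clear()
        for title, page_urls, url in results:
            unseen.update(page_urls - seen)

    pool.close()
    pool.join()
```

The structure of the loop is unchanged; only the two list comprehensions that call `crawl` and `parse` are replaced by jobs submitted to the process pool.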

Asynchronous loading with Asyncio

The idea is to use a single-threaded program to control every step of the crawler, switching to other tasks while waiting on slow network I/O instead of blocking.
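As a minimal sketch of this idea (the `asyncio.sleep()` call below is only a stand-in for a real non-blocking download, which a real crawler would perform with an async HTTP client), a single event-loop thread can drive several download tasks concurrently:

```python
import asyncio
import time


async def crawl(url):
    # placeholder for a non-blocking download; a real crawler would
    # await an async HTTP request here instead of sleeping
    print('start crawling', url)
    await asyncio.sleep(1)          # pretend the download takes 1 second
    print('finished', url)
    return url


async def main():
    urls = ['page-%d' % i for i in range(3)]
    # the single event-loop thread switches between the three tasks while
    # each one is waiting, so the total time is ~1 s rather than ~3 s
    results = await asyncio.gather(*[crawl(u) for u in urls])
    print('crawled:', results)


t1 = time.time()
asyncio.run(main())
print('Total time: %.1f s' % (time.time() - t1))
```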