I. Preparation

1. Background

Novel site: 新笔趣阁 (Xinbiquge)

2. Scraping steps

The job breaks down into roughly three steps:

  • Make the request: figure out how to issue the HTTP request and get the raw data;
  • Parse the data: the response comes back messy, so clean out the parts you need;
  • Save the data: write it out in whatever format you want.

For making requests we use requests.
For parsing there are XPath, Beautiful Soup, regular expressions, and so on; this article uses BeautifulSoup.
For saving we stick to plain text files here; later posts will save to docx and xlsx.
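To make the three steps concrete before touching the real site, here is a minimal end-to-end sketch. The URL and the div id in it are placeholders chosen for illustration, not the actual site's markup:

import requests
from bs4 import BeautifulSoup

# 1. make the request (placeholder URL, for illustration only)
resp = requests.get('https://example.com/chapter-1.html')
resp.encoding = 'utf-8'

# 2. parse: pull the text out of a container div (the id is an assumption)
soup = BeautifulSoup(resp.text, 'lxml')
node = soup.find('div', id='content')
text = node.text if node else resp.text  # fall back to the raw page if the div is absent

# 3. save in the format we want: plain text
with open('chapter-1.txt', 'w', encoding='utf-8') as f:
    f.write(text)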

3. Beautiful Soup

pip install beautifulsoup4 lxml

(The PyPI name bs4 is just an alias for beautifulsoup4; lxml is needed because the scripts below pass 'lxml' as the parser.)

Official Chinese tutorial
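Most of what the scripts below need from BeautifulSoup is find, find_all, .string/.text, and get for attributes. This toy example parses an inline HTML string, so it runs without touching any website:

from bs4 import BeautifulSoup

html = '''
<div id="list">
  <a href="/1.html">Chapter 1</a>
  <a href="/2.html">Chapter 2</a>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
div = soup.find('div', id='list')      # first tag matching the filter
for a in div.find_all('a'):            # every matching tag
    print(a.string, a.get('href'))     # tag text and the href attribute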

4. A quick first try

Download some chapters of 《斗破苍穹》 (Battle Through the Heavens).
First inspect the page elements and work out the URL of the first chapter.
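Before writing the full script, it is worth probing the table-of-contents page on its own to confirm the selector. This sketch assumes the site is still online and still keeps its chapter list in div id="list", as the script below does:

import requests
from bs4 import BeautifulSoup

req = requests.get('https://www.vbiquge.com/1_1413/', verify=False)
req.encoding = 'utf-8'
soup = BeautifulSoup(req.text, 'lxml')
# print the first few chapter links to confirm the structure
for a in soup.find('div', id='list').find_all('a')[:5]:
    print(a.string, a.get('href'))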

import requests
from bs4 import BeautifulSoup
import sys


def get_contents(server, target):
    """Fetch one chapter page and return its cleaned text."""
    url = server + target
    req = requests.get(url=url)
    req.encoding = 'utf-8'
    html = req.text
    bf = BeautifulSoup(html, 'lxml')
    texts = bf.find('div', id='content')
    # the site pads paragraphs with four non-breaking spaces; turn them into blank lines
    content = texts.text.replace('\xa0' * 4, '\n\n')
    return content


def get_urls(target):
    """Collect chapter links and titles from the table of contents."""
    chapters = []
    urls = []
    req = requests.get(url=target, verify=False)  # verify=False skips SSL certificate verification on the GET
    req.encoding = 'utf-8'  # avoid garbled text
    html = req.text
    bs = BeautifulSoup(html, 'lxml')
    a = bs.find('div', id='list')
    a = a.find_all('a')[100:200]  # only a slice of the chapters for this demo
    nums = len(a)
    for each in a:
        urls.append(each.get('href'))
        chapters.append(each.string)
    return urls, chapters, nums


def writer(path, name, text):
    """Append one chapter (title + body) to the output file."""
    with open(path, 'a', encoding='utf-8') as f:
        f.write(name + '\n')
        f.write(text)
        f.write('\n\n')


if __name__ == '__main__':
    server = 'https://www.vbiquge.com'
    target = 'https://www.vbiquge.com/1_1413/'
    book_name = '斗破苍穹.txt'
    urls, chapters, nums = get_urls(target)
    for i in range(nums):
        writer(book_name, chapters[i], get_contents(server, urls[i]))
        sys.stdout.write("Downloaded: {0}/{1}{2}".format(i + 1, nums, '\r'))
        sys.stdout.flush()
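One side effect of verify=False: requests emits an InsecureRequestWarning on every call. If the warning noise bothers you, urllib3 can silence it; note that this only hides the warning, the certificate is still not verified:

import urllib3

# suppress the InsecureRequestWarning triggered by verify=False
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)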
  • Object-oriented version
from bs4 import BeautifulSoup
import requests
import sys


class downloader(object):
    """Download the novel 《一念永恒》 from 笔趣网.

    Modified: 21.2.18
    """

    def __init__(self):
        self.server = 'https://www.bqkan.com'
        self.target = 'https://www.bqkan.com/1_1094/'
        self.chapters = []  # chapter titles
        self.urls = []      # chapter links
        self.nums = 0       # chapter count

    def get_download_url(self):
        """Collect every chapter's download link."""
        req = requests.get(url=self.target)
        # The page source says the site is GBK-encoded. The soup BeautifulSoup
        # produces is actually correct (decoded from the original GB2312 bytes
        # into Unicode); if printing it looks garbled, that is only because
        # printing calls __str__, which emits UTF-8, and a GBK console such as
        # Windows cmd then misrenders those bytes.
        req.encoding = 'gb18030'  # make sure the content is not garbled
        div_bf = BeautifulSoup(req.text, 'lxml')
        div = div_bf.find_all('div', class_='listmain')
        a_bf = BeautifulSoup(str(div[0]), 'lxml')
        a = a_bf.find_all('a')
        self.nums = len(a[15:])  # drop the first 15 entries
        for each in a[15:]:
            self.chapters.append(each.string)   # chapter title inside the <a> tag
            self.urls.append(each.get('href'))  # link inside the <a> tag

    def get_contents(self, target):
        """Fetch one chapter.

        parameters:
            target - chapter link (string)
        returns:
            texts - chapter text (string)
        """
        url = self.server + target
        req = requests.get(url)
        bf = BeautifulSoup(req.text, 'lxml')
        texts = bf.find_all('div', class_='showtxt')
        # replace the eight non-breaking spaces between paragraphs with newlines
        texts = texts[0].text.replace('\xa0' * 8, '\n\n')
        return texts

    def writer(self, name, path, text):
        """Append one scraped chapter to the output file.

        parameters:
            name - chapter title (string)
            path - file name to save under in the current directory (string)
            text - chapter text (string)
        """
        with open(path, 'a', encoding='utf-8') as f:
            f.write(name + '\n')
            f.write(text)
            f.write('\n\n')


if __name__ == '__main__':
    dl = downloader()
    dl.get_download_url()
    book_name = '一念永恒.txt'
    print('Starting download:')
    for i in range(dl.nums):
        dl.writer(dl.chapters[i], book_name, dl.get_contents(dl.urls[i]))
        sys.stdout.write("Downloaded: {0:.2%}{1}".format((i + 1) / dl.nums, '\r'))
        sys.stdout.flush()
    print("Download finished")
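If you would rather not hard-code gb18030 for each site, requests can guess the charset from the response bytes via apparent_encoding. A small sketch of that alternative, using the same table-of-contents URL:

import requests

req = requests.get('https://www.bqkan.com/1_1094/')
# apparent_encoding runs charset detection on the raw bytes;
# for a GBK page it should report a GB-family encoding
req.encoding = req.apparent_encoding
print(req.encoding)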