I. Project Background

Download comics from 动漫之家 (dmzj.com).

II. Downloading the Comic

Download the comic 《欢乐懒朋友》.
URL: https://www.dmzj.com/info/huanlelanpengyou.html

Goal: save the images of every chapter to the local disk.

1. Get chapter names and links

```python
import requests
from bs4 import BeautifulSoup

target = 'https://www.dmzj.com/info/huanlelanpengyou.html'
req = requests.get(target)
html = req.text
bs = BeautifulSoup(html, 'lxml')
# the chapter list lives in <ul class="list_con_li">; every <a> inside it is one chapter
list_con_li = bs.find('ul', class_="list_con_li")
a = list_con_li.find_all('a')
for each in a:
    print(each.text, each.get('href'))
```
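Each <a> in the chapter list carries the chapter title as its text and the chapter page address as its href, which is exactly the pair of values the later download steps need.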

2. Get the comic image addresses

  • First look at the chapter 1 URL: https://www.dmzj.com/view/huanlelanpengyou/94870.html#@page=1
    The comic is stored page by page: page 1 appends page=1, page 2 appends page=2, and so on.
  • When you open the page, right-clicking to inspect an element is blocked. This is only a simple anti-scraping trick: pressing F12 still opens the developer tools, and if F12 does not work you can prepend view-source: to the URL.
  • The image address is hard to locate in the element inspector, but it is easy to find in the browser's Network panel: Headers shows the request headers and Preview shows the returned content.
  • The real image address is https://images.dmzj1.com/img/chapterpic/30997/116345/1573149593515.jpg; next, search for it in the HTML page.
  • To tell whether the content is dynamically loaded: the image address cannot be found via view-source:, so it is generated dynamically.

    • JavaScript has only two kinds of dynamic loading:
      • external loading, where the HTML page pulls in a .js file by reference;
      • internal loading, where the JS script is written directly inside the HTML.
    • Here the script sits directly in the page, as shown below:
```html
<script type="text/javascript">
var arr_img = new Array();
var page = '';
eval(function(p,a,c,k,e,d){e=function(c){return c.toString(36)};if(!''.replace(/^/,String)){while(c--){d[c.toString(a)]=k[c]||c.toString(a)}k=[function(e){return d[e]}];e=function(){return'\\w+'};c=1};while(c--){if(k[c]){p=p.replace(new RegExp('\\b'+e(c)+'\\b','g'),k[c])}}return p}('k g=\'{"j":"e","f":"0","h":"d\\/4\\/2\\/1\\/c.3\\r\\5\\/4\\/2\\/1\\/8.3\\r\\5\\/4\\/2\\/1\\/6.3\\r\\5\\/4\\/2\\/1\\/9.3\\r\\5\\/4\\/2\\/1\\/a.3\\r\\5\\/4\\/2\\/1\\/b.3\\r\\5\\/4\\/2\\/1\\/u.3","o":"7","l":"m","n":"\\v\\i \\p\\t\\s\\q"}\';',32,32,'|132297|30997|jpg|chapterpic|nimg|16127590575194||16127590571358|16127590580377|16127590582676|16127590583266|16127590571361|img|113978|hidden|pages|page_url|u8bdd|id|var|chapter_order|38|chapter_name|sum_pages|u4e00|u51fa||u5f39|u952e|16127590586897|u7b2c33'.split('|'),0,{}))
</script>
```
  • Comparing this with the image source address https://images.dmzj1.com/img/chapterpic/30997/116345/1573149593515.jpg, it is easy to see that the address is pieced together from the numbers inside the JS code above. Define a get_urls function that extracts the image source addresses of a chapter, as follows:

```python
import re
import requests
from bs4 import BeautifulSoup

def get_urls(target):
    req = requests.get(url=target)
    bs = BeautifulSoup(req.text, 'lxml')
    # the image file names are hidden in the first <script> of the chapter page
    script_info = str(bs.script)
    # file names are 13- or 14-digit numbers and come out of the page in the wrong
    # order, so sort them numerically (13-digit names are padded with a trailing 0
    # only inside the sort key so they compare on the same scale as 14-digit ones)
    pics = re.findall(r'\d{13,14}', script_info)
    pics = sorted(pics, key=lambda x: int(x + '0' if len(x) == 13 else x))
    # the two path segments of the image URL sit between '|' separators
    chapter_qian = re.findall(r'\|(\d{5})\|', script_info)[0]
    chapter_hou = re.findall(r'\|(\d{6})\|', script_info)[0]
    urls = []
    for pic in pics:
        url = 'https://images.dmzj1.com/img/chapterpic/' + chapter_qian + '/' + chapter_hou + '/' + pic + '.jpg'
        urls.append(url)
    return urls
```
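As a quick sanity check, the function can be called on the first-chapter page from earlier (a hypothetical usage sketch, not part of the original script):

```python
# Hypothetical usage sketch: print the image URLs of chapter 1.
chapter_url = 'https://www.dmzj.com/view/huanlelanpengyou/94870.html'
for pic_url in get_urls(chapter_url):
    print(pic_url)
```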

3. Download the images

Use the urlretrieve function from urllib.request to download an image:

```python
from urllib.request import urlretrieve

dn_url = 'https://images.dmzj1.com/img/chapterpic/30997/116345/1573149593515.jpg'
urlretrieve(dn_url, '1.jpg')
```

The site has no other anti-scraping measures here, so the image can be downloaded directly. In many cases, however, a direct download fails or breaks off halfway; then you need to send request headers (above all a Referer) with the download:

```python
import requests
from contextlib import closing

download_header = {
    'Referer': 'https://www.dmzj.com/view/huanlelanpengyou/111947.html'
}

dn_url = 'https://images.dmzj1.com/img/chapterpic/30997/130584/16084654761554.jpg'
with closing(requests.get(dn_url, headers=download_header, stream=True)) as response:
    chunk_size = 1024
    content_size = int(response.headers['content-length'])
    if response.status_code == 200:
        print('File size: %0.2f KB' % (content_size / chunk_size))
        with open('1.jpg', "wb") as file:
            for data in response.iter_content(chunk_size=chunk_size):
                file.write(data)
    else:
        print('Bad link')
print('Download finished!')
```
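Here stream=True keeps requests from reading the whole body into memory at once, iter_content then writes the image to disk in 1 KB chunks, and closing() guarantees the connection is released even if the transfer is interrupted.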

4. Download the whole comic

Putting the pieces above together:

```python
import requests
from bs4 import BeautifulSoup
from contextlib import closing
import re, os, time
from tqdm import tqdm

def download_pics(target, names, urls, save_dir):
    for i, url in enumerate(tqdm(urls)):
        # the chapter page itself serves as the Referer for its images
        download_header = {
            'Referer': url
        }
        name = names[i]
        # remove '.' from the chapter name so it can be used as a folder name
        while '.' in name:
            name = name.replace('.', '')
        # create the chapter directory
        chapter_dir = os.path.join(save_dir, name)
        if name not in os.listdir(save_dir):
            os.mkdir(chapter_dir)
        # get the image source addresses (same logic as get_urls above)
        req = requests.get(url)
        bs = BeautifulSoup(req.text, 'lxml')
        script_info = str(bs.script)
        pics = re.findall(r'\d{13,14}', script_info)
        # sort the pages numerically; 13-digit names are padded only inside the sort key
        pics = sorted(pics, key=lambda x: int(x + '0' if len(x) == 13 else x))
        chapter_qian = re.findall(r'\|(\d{5})\|', script_info)[0]
        chapter_hou = re.findall(r'\|(\d{6})\|', script_info)[0]
        for idx, pic in enumerate(pics):
            pic_url = 'https://images.dmzj1.com/img/chapterpic/' + chapter_qian + '/' + chapter_hou + '/' + pic + '.jpg'
            pic_name = '{0}.jpg'.format(idx + 1)
            pic_save_path = os.path.join(chapter_dir, pic_name)
            # download the image in chunks
            with closing(requests.get(pic_url, headers=download_header, stream=True)) as response:
                chunk_size = 1024
                content_size = int(response.headers['content-length'])
                if response.status_code == 200:
                    with open(pic_save_path, "wb") as file:
                        for data in response.iter_content(chunk_size=chunk_size):
                            file.write(data)
                else:
                    print('Bad link')
        # rest for a while after each chapter to avoid hammering the server
        time.sleep(10)

# get chapter names and links
def get_chapters(target):
    req = requests.get(target)
    html = req.text
    bs = BeautifulSoup(html, 'lxml')
    list_con_li = bs.find('ul', class_="list_con_li")
    a = list_con_li.find_all('a')
    chapter_names = []
    chapter_urls = []
    for each in a:
        chapter_names.append(each.text)
        chapter_urls.append(each.get('href'))
    return chapter_names, chapter_urls

if __name__ == '__main__':
    save_dir = '欢乐懒朋友'
    if not os.path.exists(save_dir):
        os.mkdir(save_dir)
    target = 'https://www.dmzj.com/info/huanlelanpengyou.html'
    names, urls = get_chapters(target)
    download_pics(target, names, urls, save_dir)
```
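When the script finishes, the 欢乐懒朋友 directory contains one sub-directory per chapter, and each chapter directory holds its pages as 1.jpg, 2.jpg, … in reading order; tqdm shows the chapter-level progress while it runs.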