
1 Welcome crawlers

1.1 SEO: search engine optimization

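Search engines such as Google and Baidu run legitimate crawlers, which identify themselves with a clearly recognizable User-Agent.
Google desktop UA:
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)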

Baidu desktop UA:
Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
Mozilla/5.0 (compatible; Baiduspider-render/2.0; +http://www.baidu.com/search/spider.html)

Bing:
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
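
As a rough illustration of how a server might recognize these crawlers, the sketch below checks a request's User-Agent for the tokens that appear in the strings above. The function name `identify_search_bot` and the token table are assumptions made for this example, not part of any library. Because a User-Agent header is trivial to forge, a match here is only a hint and not proof of identity.

```python
# Crawler tokens taken from the User-Agent strings listed above.
KNOWN_BOTS = {
    "Googlebot": "Google",
    "Baiduspider": "Baidu",   # also matches Baiduspider-render
    "bingbot": "Bing",
}

def identify_search_bot(user_agent):
    """Return the search engine name if the UA contains a known token, else None.

    Hypothetical helper: the match is a case-insensitive substring test.
    """
    ua = user_agent.lower()
    for token, engine in KNOWN_BOTS.items():
        if token.lower() in ua:
            return engine
    return None

if __name__ == "__main__":
    ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    print(identify_search_bot(ua))
```

A stricter setup would additionally verify the client IP, for example with a reverse DNS lookup, before treating the request as coming from a real search engine.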

1.2 robots.txt: turning search engines away

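robots.txt is an ASCII-encoded text file stored in a website's root directory. It tells search engine robots (also called web spiders) which parts of the site they should not fetch and which parts they may fetch.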
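
To make the mechanism concrete, here is a minimal sketch of how a well-behaved crawler could consult such a file, using Python's standard `urllib.robotparser`. The rule text is a small sample written for this example (one crawler blocked site-wide, one path blocked for everyone else), and the helper `allowed` and the path `/dp/B000` are hypothetical. robots.txt is advisory: it only works for crawlers that choose to honor it.

```python
from urllib.robotparser import RobotFileParser

# Sample rules for illustration only: block one crawler completely,
# and block a single path for every other crawler.
RULES = """\
User-agent: EtaoSpider
Disallow: /

User-agent: *
Disallow: /gp/cart
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

def allowed(agent, path):
    """Hypothetical helper: may `agent` fetch `path` under RULES?"""
    return parser.can_fetch(agent, path)

if __name__ == "__main__":
    print(allowed("EtaoSpider", "/dp/B000"))  # False: blocked site-wide
    print(allowed("Googlebot", "/gp/cart"))   # False: blocked for all agents
    print(allowed("Googlebot", "/dp/B000"))   # True: no rule matches
```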

Taobao: blocks Baidu search across the entire site

JD.com: blocks commercial price-comparison crawlers such as Etao

Amazon: a fairly comprehensive rule set, which ends by blocking Etao entirely
User-agent: EtaoSpider
Disallow: /

User-agent: *
Disallow: /exec/obidos/account-access-login
Disallow: /exec/obidos/change-style
Disallow: /exec/obidos/flex-sign-in
Disallow: /exec/obidos/handle-buy-box
Disallow: /exec/obidos/tg/cm/member/
Disallow: /gp/aw/help/id=sss
Disallow: /gp/cart
Disallow: /gp/flex
Disallow: /gp/product/e-mail-friend
Disallow: /gp/product/product-availability
Disallow: /gp/product/rate-this-item
Disallow: /gp/sign-in
Disallow: /gp/reader
Disallow: /gp/sitbv3/reader
Disallow: /gp/richpub/syltguides/create
Disallow: /gp/gfix
Disallow: /gp/associations/wizard.html
Disallow: /gp/dmusic/order
Disallow: /gp/legacy-handle-buy-box.html
Disallow: /gp/aws/ssop
Disallow: /gp/yourstore
Disallow: /gp/gift-central/organizer/add-wishlist
Disallow: /gp/vote
Disallow: /gp/voting/
Disallow: /gp/music/wma-pop-up
Disallow: /gp/customer-images
Disallow: /gp/richpub/listmania/createpipeline
Disallow: /gp/content-form
Disallow: /gp/pdp/invitation/invite
Disallow: /gp/customer-reviews/common/du
Disallow: /gp/customer-reviews/write-a-review.html
Disallow: /gp/associations/wizard.html
Disallow: /gp/music/clipserve
Disallow: /gp/customer-media/upload
Disallow: /gp/history
Disallow: /gp/item-dispatch
Disallow: /gp/dmusic/order/handle-buy-box.html
Disallow: /gp/recsradio
Disallow: /gp/slredirect
Disallow: /dp/shipping/
Disallow: /dp/twister-update/
Disallow: /dp/manual-submit/
Disallow: /dp/e-mail-friend/
Disallow: /dp/product-availability/
Disallow: /dp/rate-this-item/
Disallow: /gp/registry/wishlist/*/reserve
Disallow: /gp/structured-ratings/actions/get-experience.html
Disallow: /gp/twitter/
Disallow: /ap/signin
Disallow: /gp/registry/wishlist/
Disallow: /wishlist/
Allow: /wishlist/universal*
Allow: /wishlist/vendor-button*
Allow: /wishlist/get-button*
Disallow: /gp/wishlist/
Allow: /gp/wishlist/universal*
Allow: /gp/wishlist/vendor-button*
Allow: /gp/wishlist/ipad-install*
Disallow: /registry/wishlist/
Disallow: /review/common/du
Disallow: /gp/registry/search.html
Disallow: /product-reviews/B0069IY63Y
Disallow: /gp/orc/rml/
Disallow: */gcrnsts
Disallow: /gp/gc/widget
Disallow: /gp/dmusic/mp3/player
Disallow: /gp/entity-alert/external
Disallow: /gp/customer-reviews/dynamic/sims-box
Disallow: /review/dynamic/sims-box
Disallow: /gp/redirect.html
Disallow: /gp/twister/ajaxv2
Disallow: /ss/twister/ajax
Disallow: /b?*node=7454917011
Disallow: /b?*node=7454927011
Disallow: /b?*node=7454939011
Disallow: /b?*node=7454898011
Disallow: /gp/customer-media/actions/delete/
Disallow: /gp/customer-media/actions/edit-caption/
Disallow: /gp/dmusic/
Allow: /gp/dmusic/promotions/PrimeMusic
Allow: /gp/dmusic/promotions/AmazonMusicUnlimited
Disallow: /gp/offer-listing/
Disallow: /b?*node=9052533011
Disallow: /lm/R1XIHQVKXSKBNJ
Disallow: /lm/R3HQ5WJSZK6QSO
Disallow: /surprise/
Disallow: /local/ajax/
Disallow: */B00M3E1NYI
Disallow: */B00M3E1Q5Y
Disallow: */B00M3E1TOM
Disallow: */B00M3E1WYO
Disallow: */B00M3E204K
Disallow: */B00M3E236A
Disallow: */B00M3E260I
Disallow: */B00M3E28WO
Disallow: */B00M3E2BC6
Disallow: */B00M3E2DPQ
Disallow: */B00M3E2GU8
Disallow: */B00M3E2J14
Disallow: */B00M3E2LOE
Disallow: */B00M3E1HJY
Disallow: /gp/socialmedia/giveaways
Disallow: /gp/b2b-rd
Disallow: /gp/aw/so.html
Disallow: /gp/rentallist
Disallow: /gp/video/dvd-rental/settings
Disallow: /gp/rl/settings
Disallow: /gp/video/settings
Disallow: /gp/video/library
Disallow: /gp/video/watchlist
Disallow: /reviews/iframe
Disallow: /gp/switch-language
Disallow: /ga/p/
Disallow: /gp/profile/
Disallow: /giveaway/host/setup/
Disallow: /ss/customer-reviews/lighthouse/
Disallow: /ospublishing/story/*
Disallow: /gp/aw/ol/
Disallow: /gp/promotion/
Disallow: /hz/leaderboard/top-reviewers/
Disallow: /creatorhub
Disallow: /creatorhub/*
Disallow: /slp/s$
Disallow: /-/
Allow: /-/es/
Disallow: /hz/help/contact/*/message/$
Disallow: /gp/aw/shoppingAids/
Disallow: /rss/people/*/reviews
Disallow: /gp/pdp/rss/*/reviews
Disallow: /gp/cdp/member-reviews/
Disallow: /gp/aw/cr/
Disallow: */sim/B001132UEE
Allow: /gp/offer-listing/B000
Allow: /gp/offer-listing/9000
Disallow: /gp/aag
Allow: /gp/aag/main?*seller=ABVFEJU8LS620
Disallow: /gp/pdp/profile/
Disallow: /gp/help/customer/express/c2c/
Disallow: /slp/*/b$
Disallow: /hz/contact-us/ajax/initiate-trusted-contact/
User-agent: EtaoSpider
Disallow: /

DHgate: relies heavily on Google search. Its rule set is fairly comprehensive and blocks the major Chinese search engines outright.

User-Agent: *
Allow: /*?f=bm
Allow: /*?utm_source
Disallow: /sellerPromise.do
Disallow: /search.do
Disallow: /promoproduct.do
Disallow: /disc.do
Disallow: /dcp/
Disallow: /mydhgate/
Disallow: /loadShipList.do?
Disallow: /logodata.do
Disallow: /viewcart.do
Disallow: /omniture/omniture.do
Disallow: /im/dhtalk.do
Disallow: /product/brandreport.do
Disallow: /product/productdisplay.do
Disallow: /product/shippingpayment.do
Disallow: /product/mobileprict.do
Disallow: /product/productguaranteedajax.do
Disallow: /product/inventory.do
Disallow: /product/getstprodinfo.do
Disallow: /product/productreview.do
Disallow: /product/displayorder.do
Disallow: /product/getonlinestatus.do
Disallow: /product/getqrcode.do
Disallow: /cart/
Disallow: /csmsg/
Disallow: /dhrec/
Disallow: /product-images/
Disallow: /wholesale/store/
Disallow: /w/
Disallow: /promotionproduct/
Disallow: /seller-feedback/
Disallow: /!forums/
Disallow: /*?
#
# Crawlers which we'd rather not have.
#
User-Agent: Googlebot
Allow: /*?f=
Allow: /*?utm_source=
Allow: /*?_escaped_fragment_=
Allow: /*?dspm=
Allow: /*?d1_page_num=
Allow: /*?m=
Allow: /*?yangfaninfo=
Allow: /*?admsg=
Allow: /*?shareToken=
Allow: /*?invitorid=
Allow: /*?from=
Allow: /*?utm_terms=
Disallow: /w/*?dspm=
Disallow: /dcp/*?dspm=
Disallow: /dhrec/*?dspm=
Disallow: /disc.do*?dspm=
Disallow: /sellerPromise.do
Disallow: /search.do
Disallow: /promoproduct.do
Disallow: /disc.do
Disallow: /dcp/
Disallow: /mydhgate/
Disallow: /loadShipList.do?
Disallow: /logodata.do
Disallow: /viewcart.do
Disallow: /!forums/
Disallow: /dhrec/
Disallow: /w/
Disallow: /product/shippingpayment.do
Disallow: /product/mobileprict.do
Disallow: /product/productguaranteedajax.do
Disallow: /product/inventory.do
Disallow: /product/getstprodinfo.do
Disallow: /product/productreview.do
Disallow: /product/displayorder.do
Disallow: /product/getonlinestatus.do
Disallow: /product/getqrcode.do
Disallow: /*?
User-Agent: YandexBot
Allow: /*?f=bm
Allow: /*?utm_source
Allow: /*?_escaped_fragment_=
Disallow: /sellerPromise.do
Disallow: /search.do
Disallow: /promoproduct.do
Disallow: /disc.do
Disallow: /dcp/
Disallow: /mydhgate/
Disallow: /loadShipList.do?
Disallow: /logodata.do
Disallow: /viewcart.do
Disallow: /!forums/
Disallow: /dhrec/
Disallow: /w/
Disallow: /product/shippingpayment.do
Disallow: /product/mobileprict.do
Disallow: /product/productguaranteedajax.do
Disallow: /product/inventory.do
Disallow: /product/getstprodinfo.do
Disallow: /product/productreview.do
Disallow: /product/displayorder.do
Disallow: /product/getonlinestatus.do
Disallow: /product/getqrcode.do
Disallow: /*?
User-Agent: bingbot
Allow: /*?f=bm
Allow: /*?utm_source
Allow: /*?_escaped_fragment_=
Disallow: /sellerPromise.do
Disallow: /search.do
Disallow: /promoproduct.do
Disallow: /disc.do
Disallow: /dcp/
Disallow: /mydhgate/
Disallow: /loadShipList.do?
Disallow: /logodata.do
Disallow: /viewcart.do
Disallow: /!forums/
Disallow: /dhrec/
Disallow: /w/
Disallow: /product/shippingpayment.do
Disallow: /product/mobileprict.do
Disallow: /product/productguaranteedajax.do
Disallow: /product/inventory.do
Disallow: /product/getstprodinfo.do
Disallow: /product/productreview.do
Disallow: /product/displayorder.do
Disallow: /product/getonlinestatus.do
Disallow: /product/getqrcode.do
Disallow: /*?
User-agent: ia_archiver
Allow: /*?f=bm
Allow: /*?utm_source
Allow: /*?_escaped_fragment_=
Disallow: /sellerPromise.do
Disallow: /search.do
Disallow: /promoproduct.do
Disallow: /disc.do
Disallow: /dcp/
Disallow: /mydhgate/
Disallow: /loadShipList.do?
Disallow: /logodata.do
Disallow: /viewcart.do
Disallow: /!forums/
Disallow: /dhrec/
Disallow: /w/
Disallow: /product/shippingpayment.do
Disallow: /product/mobileprict.do
Disallow: /product/productguaranteedajax.do
Disallow: /product/inventory.do
Disallow: /product/getstprodinfo.do
Disallow: /product/productreview.do
Disallow: /product/displayorder.do
Disallow: /product/getonlinestatus.do
Disallow: /product/getqrcode.do
Disallow: /*?
#
# Naverbot Yeti
#
User-agent: NaverBot
Allow: /*?f=bm
Allow: /*?utm_source
Allow: /*?_escaped_fragment_=
Disallow: /sellerPromise.do
Disallow: /search.do
Disallow: /promoproduct.do
Disallow: /disc.do
Disallow: /dcp/
Disallow: /mydhgate/
Disallow: /loadShipList.do?
Disallow: /logodata.do
Disallow: /viewcart.do
Disallow: /!forums/
Disallow: /dhrec/
Disallow: /w/
Disallow: /product/shippingpayment.do
Disallow: /product/mobileprict.do
Disallow: /product/productguaranteedajax.do
Disallow: /product/inventory.do
Disallow: /product/getstprodinfo.do
Disallow: /product/productreview.do
Disallow: /product/displayorder.do
Disallow: /product/getonlinestatus.do
Disallow: /product/getqrcode.do
Disallow: /*?
User-agent: Yeti
Allow: /*?f=bm
Allow: /*?utm_source
Allow: /*?_escaped_fragment_=
Disallow: /sellerPromise.do
Disallow: /search.do
Disallow: /promoproduct.do
Disallow: /disc.do
Disallow: /dcp/
Disallow: /mydhgate/
Disallow: /loadShipList.do?
Disallow: /logodata.do
Disallow: /viewcart.do
Disallow: /!forums/
Disallow: /dhrec/
Disallow: /w/
Disallow: /product/shippingpayment.do
Disallow: /product/mobileprict.do
Disallow: /product/productguaranteedajax.do
Disallow: /product/inventory.do
Disallow: /product/getstprodinfo.do
Disallow: /product/productreview.do
Disallow: /product/displayorder.do
Disallow: /product/getonlinestatus.do
Disallow: /product/getqrcode.do
Disallow: /*?
User-Agent: Googlebot-Image
Allow: /
User-Agent: BaiduSpider
Disallow: /
User-Agent: Sosospider
Disallow: /
User-Agent: HaosouSpider
Disallow: /
User-Agent: Sogou web spider
Disallow: /
User-Agent: Sogou inst spider
Disallow: /
User-Agent: Sogou spider
Disallow: /
User-Agent: Sogou wap spider
Disallow: /
User-Agent: almaden
Disallow: /
User-Agent: ASPSeek
Disallow: /
User-Agent: Axmo
Disallow: /
User-Agent: BecomeBot
Disallow: /
User-Agent: CherryPicker
Disallow: /
User-Agent: Crescent Internet ToolPak
Disallow: /
User-Agent: DISCo Pump
Disallow: /
User-agent: Download Ninja
Disallow: /
User-Agent: DTS Agent
Disallow: /
User-Agent: EmailCollector
Disallow: /
User-Agent: EmailSiphon
Disallow: /
User-Agent: EmailWolf
Disallow: /
User-Agent: Expired Domain Sleuth
Disallow: /
User-Agent: Franklin Locator
Disallow: /
User-Agent: Gaisbot
Disallow: /
User-Agent: grub
Disallow: /
User-Agent: htdig
Disallow: /
User-Agent: HTTrack
Disallow: /
User-Agent: iaea.org
Disallow: /
User-Agent: IconSurf
Disallow: /
User-Agent: Iltrovatore-Setaccio
Disallow: /
User-Agent: Indy Library
Disallow: /
User-agent: InternetSeer.com
Disallow: /
User-Agent: IUPUI
Disallow: /
User-Agent: larbin
Disallow: /
User-agent: libwww
Disallow: /
User-Agent: LNSpiderGuy
Disallow: /
User-Agent: lwp-trivial
Disallow: /
User-Agent: MetaTagRobot
Disallow: /
User-Agent: Missigua Locator
Disallow: /
User-Agent: mozDex
Disallow: /
User-agent: MSIECrawler
Disallow: /
User-Agent: NaverBot
Disallow: /
User-Agent: NextGenSearch
Disallow: /
User-Agent: NPbot
Disallow: /
User-Agent: Nutch
Disallow: /
User-agent: Offline Explorer
Disallow: /
User-Agent: Oracle Ultra Search
Disallow: /
User-Agent: PictureOfInternet
Disallow: /
User-Agent: PlantyNet
Disallow: /
User-Agent: psbot
Disallow: /
User-Agent: QuepasaCreep
Disallow: /
User-Agent: RPT-HTTPClient
Disallow: /
User-agent: sitecheck.internetseer.com
Disallow: /
User-agent: SiteSnagger
Disallow: /
User-Agent: spider.acont.de
Disallow: /
User-Agent: Sqworm
Disallow: /
User-Agent: SSM Agent
Disallow: /
User-Agent: szukacz
Disallow: /
User-Agent: TAMU
Disallow: /
User-agent: Teleport
Disallow: /
User-agent: TeleportPro
Disallow: /
User-Agent: Telesoft
Disallow: /
User-Agent: TurnitinBot
Disallow: /
User-Agent: TutorGig
Disallow: /
User-Agent: Ultraseek
Disallow: /
User-Agent: WebCopier
Disallow: /
User-Agent: WebEMailExtractor
Disallow: /
User-Agent: WebReaper
Disallow: /
User-Agent: Webster Pro
Disallow: /
User-Agent: WebStripper
Disallow: /
User-Agent: WebZIP
Disallow: /
User-Agent: Wotbox
Disallow: /
User-Agent: Xenu
Disallow: /
User-agent: Zao
Disallow: /
User-agent: Zealbot
Disallow: /
User-Agent: ZipppBot
Disallow: /
User-agent: ZyBORG
Disallow: /
User-agent: AdsBot-Google
Disallow:
#
# Sitemap
#
Sitemap:https://www.dhgate.com/sitemapseo/product_sitemap__index.xml
Sitemap:https://www.dhgate.com/sitemapseo/pc_item_sitemap_index.xml
Sitemap:https://www.dhgate.com/sitemapseo/pc_p_sitemap_index.xml

2 Rejecting crawlers

Most websites welcome search engines, but they do not welcome malicious, commercially motivated crawling.

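3 Crawler problems encountered in the project

3.1 Server load caused by crawlers

Crawler traffic produces an excessive volume of logs.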
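
One way to see how much of the log volume comes from crawlers is to tally access-log lines by User-Agent. The sketch below is only an illustration: it assumes the common "combined" access-log layout, in which the User-Agent is the last double-quoted field on each line, and the function name `count_by_user_agent` and the sample lines are made up for this example.

```python
import re
from collections import Counter

# Assumption: "combined" log format, where the User-Agent is the last
# double-quoted field on the line.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')

def count_by_user_agent(lines):
    """Count log lines per User-Agent string; lines without a match are skipped."""
    counts = Counter()
    for line in lines:
        match = UA_FIELD.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

if __name__ == "__main__":
    # Hypothetical sample lines, not taken from a real server.
    bing = "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
    sample = [
        f'203.0.113.7 - - [01/Jan/2024:00:00:01 +0000] "GET /a HTTP/1.1" 200 512 "-" "{bing}"',
        f'203.0.113.7 - - [01/Jan/2024:00:00:02 +0000] "GET /b HTTP/1.1" 200 512 "-" "{bing}"',
        '198.51.100.4 - - [01/Jan/2024:00:00:03 +0000] "GET / HTTP/1.1" 200 128 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
    ]
    for agent, hits in count_by_user_agent(sample).most_common():
        print(hits, agent)
```

Sorting the result by count shows which agents dominate the logs, which is a starting point for deciding what to throttle or block.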

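3.2 E-commerce operations platform (large volumes of product, price, and ranking data): anti-crawling

Reference:

3.3 Cloud-drive search: anti-crawling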
