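1 Welcoming crawlers
1.1 SEO (search engine optimization)
Search engines such as Google and Baidu run legitimate crawlers that identify themselves with a clearly recognizable User-Agent.
Google PC UA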
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Baidu PC UA
Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
Mozilla/5.0 (compatible; Baiduspider-render/2.0; +http://www.baidu.com/search/spider.html)
Bing PC UA
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
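A minimal sketch of how a server might recognize these crawlers from the User-Agent header. The function name `identify_search_bot` and the token table are illustrative assumptions, not part of any library. The header is trivially spoofed, so a match only shows what the client claims to be.

```python
from typing import Optional

# Tokens taken from the UA strings listed above (lower-cased for matching).
KNOWN_BOT_TOKENS = {
    "googlebot": "Google",
    "baiduspider": "Baidu",
    "bingbot": "Bing",
}


def identify_search_bot(user_agent: str) -> Optional[str]:
    """Return the search engine name a UA string claims, or None."""
    ua = user_agent.lower()
    for token, engine in KNOWN_BOT_TOKENS.items():
        if token in ua:
            return engine
    return None


if __name__ == "__main__":
    sample = "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
    print(identify_search_bot(sample))
```

Substring matching is used because the token sits in the middle of the UA string and variants such as Baiduspider-render share the same prefix.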
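1.2 robots.txt: turning search engines away
robots.txt is an ASCII-encoded text file stored in a website's root directory. It typically tells search engine robots (also called web spiders) which content on the site they should not fetch and which content they may fetch.
Taobao: blocks Baidu search across the whole site
JD: blocks price-comparison and similar commercial crawlers such as Etao
amazon: a fairly comprehensive configuration; at the end it also blocks Etao entirely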
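How a well-behaved crawler might consult such a file can be sketched with Python's standard `urllib.robotparser`. The rules below are a short hand-written excerpt in the style of an e-commerce robots.txt, and `www.example.com` and `MyCrawler` are placeholders. As far as I know, the standard-library parser does plain prefix matching and does not interpret `*` or `$` inside paths, so the excerpt avoids wildcards.

```python
from urllib.robotparser import RobotFileParser

# Hand-written excerpt for illustration only (not a complete real file).
ROBOTS_TXT = """\
User-agent: EtaoSpider
Disallow: /

User-agent: *
Disallow: /gp/cart
Disallow: /wishlist/
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt content that is already in memory."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# EtaoSpider is shut out of the whole site.
etao_allowed = parser.can_fetch("EtaoSpider", "https://www.example.com/")

# Any other crawler falls under the "*" group.
cart_allowed = parser.can_fetch("MyCrawler", "https://www.example.com/gp/cart")
product_allowed = parser.can_fetch("MyCrawler", "https://www.example.com/dp/B000")

if __name__ == "__main__":
    print(etao_allowed, cart_allowed, product_allowed)
```

robots.txt is only a convention: nothing forces a crawler to run a check like this, which is why sites that want to keep crawlers out need additional measures.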
User-agent: EtaoSpider
Disallow: /
User-agent: *
Disallow: /exec/obidos/account-access-login
Disallow: /exec/obidos/change-style
Disallow: /exec/obidos/flex-sign-in
Disallow: /exec/obidos/handle-buy-box
Disallow: /exec/obidos/tg/cm/member/
Disallow: /gp/aw/help/id=sss
Disallow: /gp/cart
Disallow: /gp/flex
Disallow: /gp/product/e-mail-friend
Disallow: /gp/product/product-availability
Disallow: /gp/product/rate-this-item
Disallow: /gp/sign-in
Disallow: /gp/reader
Disallow: /gp/sitbv3/reader
Disallow: /gp/richpub/syltguides/create
Disallow: /gp/gfix
Disallow: /gp/associations/wizard.html
Disallow: /gp/dmusic/order
Disallow: /gp/legacy-handle-buy-box.html
Disallow: /gp/aws/ssop
Disallow: /gp/yourstore
Disallow: /gp/gift-central/organizer/add-wishlist
Disallow: /gp/vote
Disallow: /gp/voting/
Disallow: /gp/music/wma-pop-up
Disallow: /gp/customer-images
Disallow: /gp/richpub/listmania/createpipeline
Disallow: /gp/content-form
Disallow: /gp/pdp/invitation/invite
Disallow: /gp/customer-reviews/common/du
Disallow: /gp/customer-reviews/write-a-review.html
Disallow: /gp/associations/wizard.html
Disallow: /gp/music/clipserve
Disallow: /gp/customer-media/upload
Disallow: /gp/history
Disallow: /gp/item-dispatch
Disallow: /gp/dmusic/order/handle-buy-box.html
Disallow: /gp/recsradio
Disallow: /gp/slredirect
Disallow: /dp/shipping/
Disallow: /dp/twister-update/
Disallow: /dp/manual-submit/
Disallow: /dp/e-mail-friend/
Disallow: /dp/product-availability/
Disallow: /dp/rate-this-item/
Disallow: /gp/registry/wishlist/*/reserve
Disallow: /gp/structured-ratings/actions/get-experience.html
Disallow: /gp/twitter/
Disallow: /ap/signin
Disallow: /gp/registry/wishlist/
Disallow: /wishlist/
Allow: /wishlist/universal*
Allow: /wishlist/vendor-button*
Allow: /wishlist/get-button*
Disallow: /gp/wishlist/
Allow: /gp/wishlist/universal*
Allow: /gp/wishlist/vendor-button*
Allow: /gp/wishlist/ipad-install*
Disallow: /registry/wishlist/
Disallow: /review/common/du
Disallow: /gp/registry/search.html
Disallow: /product-reviews/B0069IY63Y
Disallow: /gp/orc/rml/
Disallow: */gcrnsts
Disallow: /gp/gc/widget
Disallow: /gp/dmusic/mp3/player
Disallow: /gp/entity-alert/external
Disallow: /gp/customer-reviews/dynamic/sims-box
Disallow: /review/dynamic/sims-box
Disallow: /gp/redirect.html
Disallow: /gp/twister/ajaxv2
Disallow: /ss/twister/ajax
Disallow: /b?*node=7454917011
Disallow: /b?*node=7454927011
Disallow: /b?*node=7454939011
Disallow: /b?*node=7454898011
Disallow: /gp/customer-media/actions/delete/
Disallow: /gp/customer-media/actions/edit-caption/
Disallow: /gp/dmusic/
Allow: /gp/dmusic/promotions/PrimeMusic
Allow: /gp/dmusic/promotions/AmazonMusicUnlimited
Disallow: /gp/offer-listing/
Disallow: /b?*node=9052533011
Disallow: /lm/R1XIHQVKXSKBNJ
Disallow: /lm/R3HQ5WJSZK6QSO
Disallow: /surprise/
Disallow: /local/ajax/
Disallow: */B00M3E1NYI
Disallow: */B00M3E1Q5Y
Disallow: */B00M3E1TOM
Disallow: */B00M3E1WYO
Disallow: */B00M3E204K
Disallow: */B00M3E236A
Disallow: */B00M3E260I
Disallow: */B00M3E28WO
Disallow: */B00M3E2BC6
Disallow: */B00M3E2DPQ
Disallow: */B00M3E2GU8
Disallow: */B00M3E2J14
Disallow: */B00M3E2LOE
Disallow: */B00M3E1HJY
Disallow: /gp/socialmedia/giveaways
Disallow: /gp/b2b-rd
Disallow: /gp/aw/so.html
Disallow: /gp/rentallist
Disallow: /gp/video/dvd-rental/settings
Disallow: /gp/rl/settings
Disallow: /gp/video/settings
Disallow: /gp/video/library
Disallow: /gp/video/watchlist
Disallow: /reviews/iframe
Disallow: /gp/switch-language
Disallow: /ga/p/
Disallow: /gp/profile/
Disallow: /giveaway/host/setup/
Disallow: /ss/customer-reviews/lighthouse/
Disallow: /ospublishing/story/*
Disallow: /gp/aw/ol/
Disallow: /gp/promotion/
Disallow: /hz/leaderboard/top-reviewers/
Disallow: /creatorhub
Disallow: /creatorhub/*
Disallow: /slp/s$
Disallow: /-/
Allow: /-/es/
Disallow: /hz/help/contact/*/message/$
Disallow: /gp/aw/shoppingAids/
Disallow: /rss/people/*/reviews
Disallow: /gp/pdp/rss/*/reviews
Disallow: /gp/cdp/member-reviews/
Disallow: /gp/aw/cr/
Disallow: */sim/B001132UEE
Allow: /gp/offer-listing/B000
Allow: /gp/offer-listing/9000
Disallow: /gp/aag
Allow: /gp/aag/main?*seller=ABVFEJU8LS620
Disallow: /gp/pdp/profile/
Disallow: /gp/help/customer/express/c2c/
Disallow: /slp/*/b$
Disallow: /hz/contact-us/ajax/initiate-trusted-contact/
User-agent: EtaoSpider
Disallow: /
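DHgate: relies heavily on the Google search engine; the configuration is fairly comprehensive and blocks the major Chinese search engines entirely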
User-Agent: *
Allow: /*?f=bm
Allow: /*?utm_source
Disallow: /sellerPromise.do
Disallow: /search.do
Disallow: /promoproduct.do
Disallow: /disc.do
Disallow: /dcp/
Disallow: /mydhgate/
Disallow: /loadShipList.do?
Disallow: /logodata.do
Disallow: /viewcart.do
Disallow: /omniture/omniture.do
Disallow: /im/dhtalk.do
Disallow: /product/brandreport.do
Disallow: /product/productdisplay.do
Disallow: /product/shippingpayment.do
Disallow: /product/mobileprict.do
Disallow: /product/productguaranteedajax.do
Disallow: /product/inventory.do
Disallow: /product/getstprodinfo.do
Disallow: /product/productreview.do
Disallow: /product/displayorder.do
Disallow: /product/getonlinestatus.do
Disallow: /product/getqrcode.do
Disallow: /cart/
Disallow: /csmsg/
Disallow: /dhrec/
Disallow: /product-images/
Disallow: /wholesale/store/
Disallow: /w/
Disallow: /promotionproduct/
Disallow: /seller-feedback/
Disallow: /!forums/
Disallow: /*?
#
# Crawlers which we'd rather not have.
#
User-Agent: Googlebot
Allow: /*?f=
Allow: /*?utm_source=
Allow: /*?_escaped_fragment_=
Allow: /*?dspm=
Allow: /*?d1_page_num=
Allow: /*?m=
Allow: /*?yangfaninfo=
Allow: /*?admsg=
Allow: /*?shareToken=
Allow: /*?invitorid=
Allow: /*?from=
Allow: /*?utm_terms=
Disallow: /w/*?dspm=
Disallow: /dcp/*?dspm=
Disallow: /dhrec/*?dspm=
Disallow: /disc.do*?dspm=
Disallow: /sellerPromise.do
Disallow: /search.do
Disallow: /promoproduct.do
Disallow: /disc.do
Disallow: /dcp/
Disallow: /mydhgate/
Disallow: /loadShipList.do?
Disallow: /logodata.do
Disallow: /viewcart.do
Disallow: /!forums/
Disallow: /dhrec/
Disallow: /w/
Disallow: /product/shippingpayment.do
Disallow: /product/mobileprict.do
Disallow: /product/productguaranteedajax.do
Disallow: /product/inventory.do
Disallow: /product/getstprodinfo.do
Disallow: /product/productreview.do
Disallow: /product/displayorder.do
Disallow: /product/getonlinestatus.do
Disallow: /product/getqrcode.do
Disallow: /*?
User-Agent: YandexBot
Allow: /*?f=bm
Allow: /*?utm_source
Allow: /*?_escaped_fragment_=
Disallow: /sellerPromise.do
Disallow: /search.do
Disallow: /promoproduct.do
Disallow: /disc.do
Disallow: /dcp/
Disallow: /mydhgate/
Disallow: /loadShipList.do?
Disallow: /logodata.do
Disallow: /viewcart.do
Disallow: /!forums/
Disallow: /dhrec/
Disallow: /w/
Disallow: /product/shippingpayment.do
Disallow: /product/mobileprict.do
Disallow: /product/productguaranteedajax.do
Disallow: /product/inventory.do
Disallow: /product/getstprodinfo.do
Disallow: /product/productreview.do
Disallow: /product/displayorder.do
Disallow: /product/getonlinestatus.do
Disallow: /product/getqrcode.do
Disallow: /*?
User-Agent: bingbot
Allow: /*?f=bm
Allow: /*?utm_source
Allow: /*?_escaped_fragment_=
Disallow: /sellerPromise.do
Disallow: /search.do
Disallow: /promoproduct.do
Disallow: /disc.do
Disallow: /dcp/
Disallow: /mydhgate/
Disallow: /loadShipList.do?
Disallow: /logodata.do
Disallow: /viewcart.do
Disallow: /!forums/
Disallow: /dhrec/
Disallow: /w/
Disallow: /product/shippingpayment.do
Disallow: /product/mobileprict.do
Disallow: /product/productguaranteedajax.do
Disallow: /product/inventory.do
Disallow: /product/getstprodinfo.do
Disallow: /product/productreview.do
Disallow: /product/displayorder.do
Disallow: /product/getonlinestatus.do
Disallow: /product/getqrcode.do
Disallow: /*?
User-agent: ia_archiver
Allow: /*?f=bm
Allow: /*?utm_source
Allow: /*?_escaped_fragment_=
Disallow: /sellerPromise.do
Disallow: /search.do
Disallow: /promoproduct.do
Disallow: /disc.do
Disallow: /dcp/
Disallow: /mydhgate/
Disallow: /loadShipList.do?
Disallow: /logodata.do
Disallow: /viewcart.do
Disallow: /!forums/
Disallow: /dhrec/
Disallow: /w/
Disallow: /product/shippingpayment.do
Disallow: /product/mobileprict.do
Disallow: /product/productguaranteedajax.do
Disallow: /product/inventory.do
Disallow: /product/getstprodinfo.do
Disallow: /product/productreview.do
Disallow: /product/displayorder.do
Disallow: /product/getonlinestatus.do
Disallow: /product/getqrcode.do
Disallow: /*?
#
# Naverbot Yeti
#
User-agent: NaverBot
Allow: /*?f=bm
Allow: /*?utm_source
Allow: /*?_escaped_fragment_=
Disallow: /sellerPromise.do
Disallow: /search.do
Disallow: /promoproduct.do
Disallow: /disc.do
Disallow: /dcp/
Disallow: /mydhgate/
Disallow: /loadShipList.do?
Disallow: /logodata.do
Disallow: /viewcart.do
Disallow: /!forums/
Disallow: /dhrec/
Disallow: /w/
Disallow: /product/shippingpayment.do
Disallow: /product/mobileprict.do
Disallow: /product/productguaranteedajax.do
Disallow: /product/inventory.do
Disallow: /product/getstprodinfo.do
Disallow: /product/productreview.do
Disallow: /product/displayorder.do
Disallow: /product/getonlinestatus.do
Disallow: /product/getqrcode.do
Disallow: /*?
User-agent: Yeti
Allow: /*?f=bm
Allow: /*?utm_source
Allow: /*?_escaped_fragment_=
Disallow: /sellerPromise.do
Disallow: /search.do
Disallow: /promoproduct.do
Disallow: /disc.do
Disallow: /dcp/
Disallow: /mydhgate/
Disallow: /loadShipList.do?
Disallow: /logodata.do
Disallow: /viewcart.do
Disallow: /!forums/
Disallow: /dhrec/
Disallow: /w/
Disallow: /product/shippingpayment.do
Disallow: /product/mobileprict.do
Disallow: /product/productguaranteedajax.do
Disallow: /product/inventory.do
Disallow: /product/getstprodinfo.do
Disallow: /product/productreview.do
Disallow: /product/displayorder.do
Disallow: /product/getonlinestatus.do
Disallow: /product/getqrcode.do
Disallow: /*?
User-Agent: Googlebot-Image
Allow: /
User-Agent: BaiduSpider
Disallow: /
User-Agent: Sosospider
Disallow: /
User-Agent: HaosouSpider
Disallow: /
User-Agent: Sogou web spider
Disallow: /
User-Agent: Sogou inst spider
Disallow: /
User-Agent: Sogou spider
Disallow: /
User-Agent: Sogou wap spider
Disallow: /
User-Agent: almaden
Disallow: /
User-Agent: ASPSeek
Disallow: /
User-Agent: Axmo
Disallow: /
User-Agent: BecomeBot
Disallow: /
User-Agent: CherryPicker
Disallow: /
User-Agent: Crescent Internet ToolPak
Disallow: /
User-Agent: DISCo Pump
Disallow: /
User-agent: Download Ninja
Disallow: /
User-Agent: DTS Agent
Disallow: /
User-Agent: EmailCollector
Disallow: /
User-Agent: EmailSiphon
Disallow: /
User-Agent: EmailWolf
Disallow: /
User-Agent: Expired Domain Sleuth
Disallow: /
User-Agent: Franklin Locator
Disallow: /
User-Agent: Gaisbot
Disallow: /
User-Agent: grub
Disallow: /
User-Agent: htdig
Disallow: /
User-Agent: HTTrack
Disallow: /
User-Agent: iaea.org
Disallow: /
User-Agent: IconSurf
Disallow: /
User-Agent: Iltrovatore-Setaccio
Disallow: /
User-Agent: Indy Library
Disallow: /
User-agent: InternetSeer.com
Disallow: /
User-Agent: IUPUI
Disallow: /
User-Agent: larbin
Disallow: /
User-agent: libwww
Disallow: /
User-Agent: LNSpiderGuy
Disallow: /
User-Agent: lwp-trivial
Disallow: /
User-Agent: MetaTagRobot
Disallow: /
User-Agent: Missigua Locator
Disallow: /
User-Agent: mozDex
Disallow: /
User-agent: MSIECrawler
Disallow: /
User-Agent: NaverBot
Disallow: /
User-Agent: NextGenSearch
Disallow: /
User-Agent: NPbot
Disallow: /
User-Agent: Nutch
Disallow: /
User-agent: Offline Explorer
Disallow: /
User-Agent: Oracle Ultra Search
Disallow: /
User-Agent: PictureOfInternet
Disallow: /
User-Agent: PlantyNet
Disallow: /
User-Agent: psbot
Disallow: /
User-Agent: QuepasaCreep
Disallow: /
User-Agent: RPT-HTTPClient
Disallow: /
User-agent: sitecheck.internetseer.com
Disallow: /
User-agent: SiteSnagger
Disallow: /
User-Agent: spider.acont.de
Disallow: /
User-Agent: Sqworm
Disallow: /
User-Agent: SSM Agent
Disallow: /
User-Agent: szukacz
Disallow: /
User-Agent: TAMU
Disallow: /
User-agent: Teleport
Disallow: /
User-agent: TeleportPro
Disallow: /
User-Agent: Telesoft
Disallow: /
User-Agent: TurnitinBot
Disallow: /
User-Agent: TutorGig
Disallow: /
User-Agent: Ultraseek
Disallow: /
User-Agent: WebCopier
Disallow: /
User-Agent: WebEMailExtractor
Disallow: /
User-Agent: WebReaper
Disallow: /
User-Agent: Webster Pro
Disallow: /
User-Agent: WebStripper
Disallow: /
User-Agent: WebZIP
Disallow: /
User-Agent: Wotbox
Disallow: /
User-Agent: Xenu
Disallow: /
User-agent: Zao
Disallow: /
User-agent: Zealbot
Disallow: /
User-Agent: ZipppBot
Disallow: /
User-agent: ZyBORG
Disallow: /
User-agent: AdsBot-Google
Disallow:
#
# Sitemap
#
Sitemap:https://www.dhgate.com/sitemapseo/product_sitemap__index.xml
Sitemap:https://www.dhgate.com/sitemapseo/pc_item_sitemap_index.xml
Sitemap:https://www.dhgate.com/sitemapseo/pc_p_sitemap_index.xml
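2 Rejecting crawlers
Most websites welcome search engines but do not welcome malicious, commercially motivated crawling.
3 Crawler problems encountered in the project
3.1 Crawler load on the servers
3.2 E-commerce operations platform (large volumes of product, price, and ranking data): anti-crawling
References: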