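A fast, elegant scraping framework for Go developers
Colly provides a clean interface for writing any kind of web crawler.
With Colly you can easily extract structured data from websites and use it for a wide range of applications, such as data mining, data processing, or archiving.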
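Features
- Clean API
- Fast (>1k requests/sec on a single core)
- Manages request delays and maximum concurrency per domain
- Automatic cookie and session handling
- Sync/async/parallel scraping
- Caching
- Automatic encoding of non-unicode responses
- Robots.txt support
- Distributed scraping
- Configuration via environment variables
- Extensions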
Example
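    package main

    import (
        "fmt"

        "github.com/gocolly/colly/v2"
    )

    func main() {
        c := colly.NewCollector()

        // Find and visit all links
        c.OnHTML("a[href]", func(e *colly.HTMLElement) {
            e.Request.Visit(e.Attr("href"))
        })

        // Log each request before it is sent
        c.OnRequest(func(r *colly.Request) {
            fmt.Println("Visiting", r.URL)
        })

        c.Visit("http://go-colly.org/")
    }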
See the examples folder for more detailed examples.
Installation
Add colly to your go.mod file:
    module github.com/x/y

    go 1.14

    require (
        github.com/gocolly/colly/v2 latest
    )
Bugs
Bugs or suggestions? Visit the issue tracker or join #colly on freenode.
Other Projects Using Colly
Below is a list of public, open source projects that use Colly:
- greenpeace/check-my-pages Scraping script to test the Spanish Greenpeace web archive.
- altsab/gowap Wappalyzer implementation in Go.
- jesuiscamille/goquotes A quotes scraper, making your day a little better!
- jivesearch/jivesearch A search engine that doesn’t track you.
- Leagify/colly-draft-prospects A scraper for future NFL Draft prospects.
- lucasepe/go-ps4 Search playstation store for your favorite PS4 games using the command line.
- yringler/inside-chassidus-scraper Scrapes Rabbi Paltiel’s web site for lesson metadata.
- gamedb/gamedb A database of Steam games.
- lawzava/scrape CLI for email scraping from any website.
- eureka101v/WeiboSpiderGo A Sina Weibo (Chinese Twitter) scraper.
- Go-phie/gophie Search, Download and Stream movies from your terminal
- imthaghost/goclone Clone websites to your computer within seconds.
- superiss/spidy Crawl the web and collect expired domains.
- docker-slim/docker-slim Optimize your Docker containers to make them smaller and better.
