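Lightning fast and elegant scraping framework for Gophers
Colly provides a clean interface for writing any kind of crawler, scraper, or spider.
With Colly you can easily extract structured data from websites, which can then be used for a wide range of applications such as data mining, data processing, or archiving.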
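Features
- Clean API
- Fast (>1k requests/sec on a single core)
- Manages request delays and maximum concurrency per domain
- Automatic cookie and session handling
- Sync/async/parallel scraping
- Caching
- Automatic encoding of non-unicode responses
- Robots.txt support
- Distributed scraping
- Configuration via environment variables
- Extensions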
Example
package main

import (
	"fmt"
	"github.com/gocolly/colly/v2"
)

func main() {
	c := colly.NewCollector()
	// Find and visit all links
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		e.Request.Visit(e.Attr("href"))
	})
	// Log every request before it is sent
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})
	c.Visit("http://go-colly.org/")
}
See the examples folder for more detailed examples.
Installation
Add colly to your go.mod file:
module github.com/x/y

go 1.14

require (
	github.com/gocolly/colly/v2 latest
)
Bugs
Bugs or suggestions? Visit the issue tracker or join #colly on freenode.
Other Projects Using Colly
Below is a list of public, open source projects that use Colly:
- greenpeace/check-my-pages Scraping script to test the Spanish Greenpeace web archive.
- altsab/gowap Wappalyzer implementation in Go.
- jesuiscamille/goquotes A quotes scraper, making your day a little better!
- jivesearch/jivesearch A search engine that doesn’t track you.
- Leagify/colly-draft-prospects A scraper for future NFL Draft prospects.
- lucasepe/go-ps4 Search playstation store for your favorite PS4 games using the command line.
- yringler/inside-chassidus-scraper Scrapes Rabbi Paltiel’s web site for lesson metadata.
- gamedb/gamedb A database of Steam games.
- lawzava/scrape CLI for email scraping from any website.
- eureka101v/WeiboSpiderGo A Sina Weibo (Chinese Twitter) scraper.
- Go-phie/gophie Search, Download and Stream movies from your terminal
- imthaghost/goclone Clone websites to your computer within seconds.
- superiss/spidy Crawl the web and collect expired domains.
- docker-slim/docker-slim Optimize your Docker containers to make them smaller and better.