I read an interview about Brave Search

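    Brave is a new search engine plus browser. Unlike DuckDuckGo and Neeva, which rely on the Google / Bing APIs for their search results, Brave mostly builds its own index. The article mentions an independence score, said to be 92, meaning that 92% of search results come from its own index.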

    Some excerpts

    The WDP helps with ranking, but the biggest contribution to ranking is the index itself, which is intentionally smaller than that of Google or Bing. We do not aim to index the whole Web, rather only the Web that is worth indexing. The biggest problem of search engines is noise-reduction. As with other complex machine learning systems, they suffer from the garbage-in, garbage-out problem. Google puts all the effort in trying to minimize the garbage-out by using very sophisticated models, trained with data. That’s a very good approach, but it’s something that we cannot replicate, not only because privacy-sensitive data is a no-go for Brave, but also because it requires a lot of resources. Brave’s approach to reduce the garbage-out is to be careful with what is being ingested, and that makes the algorithmic part of recall and rankings less resource-intensive.

    Independence is not something directly actionable, but it’s a fundamental property.

    Independence means that Brave Search would continue to work even if Google and Microsoft opposed it. Independence means choice and diversity: if results are drawn from the same provider, they are inherently limited by that provider.

    Independence is freedom to do as we see fit and to own our mistakes. If we were to censor Russia Today or CNBC, which we wouldn’t, it would be our choice, not our provider’s decision.

    The whole argument of “if you are not paying for the product then you are the product”, is problematic when the product is your privacy, because it’s a dangerous currency. It might be cheap now but can be very expensive in the future. Attention, however, is time-bounded and well understood by whoever is paying. Ads of course will always be a nuisance, and there is a tendency to show more and more ads. The solution to that is competition.

    Appending Reddit to every query would not work for the majority of searches, however, it works very well on certain types of queries.

    A few thoughts after reading it

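    The first is about self-reliance and control (自主可控). “Independence is not something directly actionable, but it’s a fundamental property” is the core idea of the whole article. I also largely agree with the general direction the country has been promoting lately, self-reliance and domestic circulation (内循环), although a fair amount of chaos along the way is inevitable.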

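    The second thought is about Reddit. I have been hanging around some specialist topics on Reddit lately, and the quality of the questions and answers there is quite high, which is where the trick of appending site:reddit.com to a query comes from. In China you can likewise append site:zhihu.com to search Zhihu; for specialist topics this usually gives somewhat better results than searching directly on baidu.com.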
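
    The appending trick is easy to script. Below is a minimal sketch, not anything from the interview: the helper names `with_site` and `search_url` are made up for illustration, and the `q` query parameter and the example base URL are assumptions that need adjusting for whichever search engine you actually use.

```python
from urllib.parse import urlencode


def with_site(query: str, site: str) -> str:
    """Append a `site:` filter to a query, unless it already contains one."""
    query = query.strip()
    if "site:" in query:
        # Respect a filter the user typed by hand.
        return query
    return f"{query} site:{site}"


def search_url(base: str, query: str) -> str:
    """Build a search URL.

    Assumption: the engine reads the query from a `q` parameter.
    """
    return f"{base}?{urlencode({'q': query})}"


if __name__ == "__main__":
    # Hypothetical base URL, replace with a real search endpoint.
    base = "https://example.com/search"
    for site in ("reddit.com", "zhihu.com"):
        print(search_url(base, with_site("rust async runtime", site)))
```

    As the excerpt above points out, this only helps for certain types of queries, so it makes more sense as an opt-in shortcut than as a default.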

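    The third thought is about acquiring information. The article mentions noise-reduction, which Brave achieves by indexing fewer pages. A friend recently asked me how to find high-quality information sources. Thinking it over, the core point is the same: keep the number of sources under control and block the bad ones immediately. “Block at the first disagreement” is often a good policy. Energy is limited, and there is no need to spend it on unnecessary people and things. Of course, the precondition for being this willful on a regular basis is, again, being “self-reliant and in control”.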

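    The last thought is about the search business itself. The article complains about Google's search, but compared with search quality in China, Google absolutely counts as unspoiled ground. Judging from my own recent experience producing content, it is still quite hard for good content to stand out in domestic search results without paid promotion. 360 and Tencent made their big, noisy pushes into search engines more than 10 years ago. Although everyone uses apps now, the search engine plus browser market is still large in itself. This is especially true for ByteDance, which has no business boundaries: personally I think a search engine is a more promising bet than its current Volcano Engine cloud computing business, and ByteDance is also more competitive there in terms of technology and content reserves.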

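    I hope the giants enter the field, because that could raise search quality, just as the article says: “The solution to that is competition”. Self-reliance matters, but it only counts if the quality is solid. Building a solid, self-reliant product still comes down to people, so the intellectual nourishment fed to those people becomes especially important.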