介绍

Seaborn库是开源库并且代码托管在GitHub上
Seaborn库
代码托管地址

Seaborn库中的kdeplot方法是用来打印核密度图(概率密度图)的
Seaborn是这么介绍这个方法的:
Fit and plot a univariate or bivariate kernel density estimate.

做直方图或概率密度图使都会用到这个方法

  1. import seaborn as sns
  2. sns.distplot(data)

image.png

  1. import seaborn as sns
  2. sns.kdeplot(data)

image.png

平时作图的时候都是只见曲线不见数值,因此对概率密度曲线是怎么计算的很好奇。

今天就通过Seaborn的源码来看看概率密度曲线是怎么画出来的吧

解读

Seabron项目在GitHub中的样子,其中主要代码都在seaborn文件夹下了。
image.png

文件的命名很直观,直方图与核密度图的代码大概就在seaborn中的distributions.py中
**

kdeplot()

找到kdeplot定义的位置

  1. @_deprecate_positional_args
  2. def kdeplot(
  3. *,
  4. x=None, y=None,
  5. shade=False, vertical=False, kernel="gau",
  6. bw="scott", gridsize=100, cut=3, clip=None, legend=True,
  7. cumulative=False, shade_lowest=True, cbar=False, cbar_ax=None,
  8. cbar_kws=None, ax=None,
  9. data=None, data2=None, # TODO move data once * is enforced
  10. **kwargs,
  11. ):

**
首先,一堆参数用到再说

  1. x_passed_as_data = (
  2. x is None
  3. and data is not None
  4. and np.ndim(data) == 1
  5. )
  6. if x_passed_as_data:
  7. x = data

如果传入的data是1维的并且参数x保持默认值None,那么将data赋值给x

  1. if data2 is not None:
  2. msg = "The `data2` param is now named `y`; please update your code."
  3. warnings.warn(msg)
  4. y = data2

如果data2不为None那么发一个警告,将data2赋值给y

  1. if isinstance(x, list):
  2. x = np.asarray(x)
  3. if isinstance(y, list):
  4. y = np.asarray(y)

如果x、y的数据类型为list那么将数据转变为numpy.array类型

  1. bivariate = x is not None and y is not None
  2. if bivariate and cumulative:
  3. raise TypeError("Cumulative distribution plots are not"
  4. "supported for bivariate distributions.")

如果x、y都存在并且cumulative为真时报错。
前面作的核密度图是1维的,只有x的值不为None,因此bivariate为False
因为如果x、y都存在那么画出的图类似等高线形式的分布图,与累计分布的形式不兼容
例如:

  1. mean, cov = [0, 1], [(1, .5), (.5, 1)]
  2. data = np.random.multivariate_normal(mean, cov, 200)
  3. df = pd.DataFrame(data, columns=["x", "y"])
  4. f, ax = plt.subplots(figsize=(6, 6))
  5. sns.kdeplot(df.x, df.y, ax=ax)

image.png
回到Seaborn的代码

  1. if ax is None:
  2. ax = plt.gca()
  3. if bivariate:
  4. ax = _bivariate_kdeplot(x, y, shade, shade_lowest,
  5. kernel, bw, gridsize, cut, clip, legend,
  6. cbar, cbar_ax, cbar_kws, ax, **kwargs)
  7. else:
  8. ax = _univariate_kdeplot(x, shade, vertical, kernel, bw,
  9. gridsize, cut, clip, legend, ax,
  10. cumulative=cumulative, **kwargs)
  11. return ax

首先为作图做了一下准备。
然后根据bivariate的值决定是二元kde作图还是一元的kde作图。
由于bivariate为False,因此调用一元kde的函数_univariate_kdeplot。

_univariate_kdeplot()

_univariate_kdeplot在同文件中

  1. def _univariate_kdeplot(data, shade, vertical, kernel, bw, gridsize, cut,
  2. clip, legend, ax, cumulative=False, **kwargs):

又是一堆参数

  1. # Sort out the clipping
  2. if clip is None:
  3. clip = (-np.inf, np.inf)
  4. # Preprocess the data
  5. data = remove_na(data)

第一句是如果clip没有设置就给了clip一个默认值

第二句的作用是去除na值

下面进入计算概率密度曲线的代码了

  1. # Calculate the KDE
  2. if np.nan_to_num(data.var()) == 0:
  3. # Don't try to compute KDE on singular data
  4. msg = "Data must have variance to compute a kernel density estimate."
  5. warnings.warn(msg, UserWarning)
  6. x, y = np.array([]), np.array([])
  7. elif _has_statsmodels:
  8. # Prefer using statsmodels for kernel flexibility
  9. x, y = _statsmodels_univariate_kde(data, kernel, bw,
  10. gridsize, cut, clip,
  11. cumulative=cumulative)
  12. else:
  13. # Fall back to scipy if missing statsmodels
  14. if kernel != "gau":
  15. kernel = "gau"
  16. msg = "Kernel other than `gau` requires statsmodels."
  17. warnings.warn(msg, UserWarning)
  18. if cumulative:
  19. raise ImportError("Cumulative distributions are currently "
  20. "only implemented in statsmodels. "
  21. "Please install statsmodels.")
  22. x, y = _scipy_univariate_kde(data, bw, gridsize, cut, clip)

代码是一个分支结构
np.nan_to_num的作用是
Replace NaN with zero and infinity with large finite numbers.
目的是防止数据中都是相同的值。
_has_statsmodels的定义是

  1. try:
  2. import statsmodels.nonparametric.api as smnp
  3. _has_statsmodels = True
  4. except ImportError:
  5. _has_statsmodels = False

我的环境里是有statsmodels包的因此进入第二个分支,否则进入第三个分支,使用scipy包来作为后端。
接下来进入_statsmodels_univariate_kde函数

_statsmodels_univariate_kde()

用statsmodels包来计算kde

  1. def _statsmodels_univariate_kde(data, kernel, bw, gridsize, cut, clip,
  2. cumulative=False):
  3. """Compute a univariate kernel density estimate using statsmodels."""

一堆参数

  1. # statsmodels 0.8 fails on int type data
  2. data = data.astype(np.float64)
  3. fft = kernel == "gau"

第一句,是避免0.8版本对int数据的报错
第二句,如果使用的是高斯核则使用快速傅里叶变换加快计算速度。

  1. kde = smnp.KDEUnivariate(data)
  2. try:
  3. kde.fit(kernel, bw, fft, gridsize=gridsize, cut=cut, clip=clip)
  4. except RuntimeError as err: # GH#1990
  5. if stats.iqr(data) > 0:
  6. raise err
  7. msg = "Default bandwidth for data is 0; skipping density estimation."
  8. warnings.warn(msg, UserWarning)
  9. return np.array([]), np.array([])
  10. if cumulative:
  11. grid, y = kde.support, kde.cdf
  12. else:
  13. grid, y = kde.support, kde.density
  14. return grid, y

接下来就是用数据fit一个概率密度曲线了。
我们找到前面的这些fit参数,
试一下直接fit我们的数据会得到什么数据。

  1. kernel = 'gau'
  2. bw = 'scott'
  3. fft = True
  4. gridsize = 100
  5. cut = 3
  6. clip = (-np.inf, np.inf)
  7. kde = smnp.KDEUnivariate(data)
  8. kde.fit(kernel, bw, fft, gridsize=gridsize, cut=cut, clip=clip)
  9. grid, y = kde.support, kde.density
  10. plt.plot(grid,y)

image.png
🆗,到现在我们已经知道了概率密度曲线是怎么计算的了。
可以尝试一下改变参数会有什么变化。
其中,gridsize控制拟合点的数量,值越大曲线会越平滑。

_univariate_kdeplot()的绘图部分

接下来就是对数据的简单处理、验证跟绘图部分了,就不细说了。

  1. # Make sure the density is nonnegative
  2. y = np.amax(np.c_[np.zeros_like(y), y], axis=1)
  3. # Flip the data if the plot should be on the y axis
  4. if vertical:
  5. x, y = y, x
  6. # Check if a label was specified in the call
  7. label = kwargs.pop("label", None)
  8. # Otherwise check if the data object has a name
  9. if label is None and hasattr(data, "name"):
  10. label = data.name
  11. # Decide if we're going to add a legend
  12. legend = label is not None and legend
  13. label = "_nolegend_" if label is None else label
  14. # Use the active color cycle to find the plot color
  15. facecolor = kwargs.pop("facecolor", None)
  16. line, = ax.plot(x, y, **kwargs)
  17. color = line.get_color()
  18. line.remove()
  19. kwargs.pop("color", None)
  20. facecolor = color if facecolor is None else facecolor
  21. # Draw the KDE plot and, optionally, shade
  22. ax.plot(x, y, color=color, label=label, **kwargs)
  23. shade_kws = dict(
  24. facecolor=facecolor,
  25. alpha=kwargs.get("alpha", 0.25),
  26. clip_on=kwargs.get("clip_on", True),
  27. zorder=kwargs.get("zorder", 1),
  28. )
  29. if shade:
  30. if vertical:
  31. ax.fill_betweenx(y, 0, x, **shade_kws)
  32. else:
  33. ax.fill_between(x, 0, y, **shade_kws)
  34. # Set the density axis minimum to 0
  35. if vertical:
  36. ax.set_xlim(0, auto=None)
  37. else:
  38. ax.set_ylim(0, auto=None)
  39. # Draw the legend here
  40. handles, labels = ax.get_legend_handles_labels()
  41. if legend and handles:
  42. ax.legend(loc="best")
  43. return ax