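• Sampling and estimation are needed in two situations:

    • The population is too large to measure in full
    • The test is destructive (for example, to measure the pass rate of matches you cannot strike every match, because a tested match can no longer be used)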

1. Sampling and estimation, sampling error

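  • Sampling and estimation

    • Simple random sampling
      • Draw at random so that every member of the population has the same probability of being selected (the fairness principle)
    • Stratified random sampling
      • Separate the population into smaller groups (strata) based on one or more distinguishing characteristics, then sample from each stratum.
      • Strata and cells: classifying on two characteristics with M and N categories gives M * N cells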
  • Sampling error
    • Sampling error of the mean = sample mean - population mean, i.e. $\bar{X} - \mu$
  • The sample statistic itself is a random variable and has a probability distribution.

    • Sample statistic: for example the sample variance or the sample mean
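The two sampling schemes and the sampling error of the mean can be sketched in a few lines of Python. This is an illustration only: the population, its two strata, the sample size of 200 and the helper names `simple_random_sample` and `stratified_sample` are all assumptions made up for the example.

```python
import random
import statistics

random.seed(42)  # fixed seed so the illustration is reproducible

# Hypothetical population made of two strata of different size and level.
stratum_a = [random.gauss(50, 10) for _ in range(8000)]
stratum_b = [random.gauss(80, 5) for _ in range(2000)]
population = stratum_a + stratum_b
population_mean = statistics.fmean(population)

def simple_random_sample(values, n):
    """Every member has the same chance of selection (no replacement)."""
    return random.sample(values, n)

def stratified_sample(strata, n):
    """Draw from each stratum in proportion to its share of the population."""
    total = sum(len(s) for s in strata)
    sample = []
    for s in strata:
        k = round(n * len(s) / total)
        sample.extend(random.sample(s, k))
    return sample

srs = simple_random_sample(population, 200)
strat = stratified_sample([stratum_a, stratum_b], 200)

# Sampling error of the mean = sample mean - population mean
srs_error = statistics.fmean(srs) - population_mean
strat_error = statistics.fmean(strat) - population_mean
print(f"population mean:              {population_mean:.2f}")
print(f"simple random sampling error: {srs_error:+.2f}")
print(f"stratified sampling error:    {strat_error:+.2f}")
```

Rerunning with a different seed gives a different sampling error each time, which is the sense in which a sample statistic is itself a random variable.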

2. Types of data: time-series data, cross-sectional data

  • Time-series data

    • A collection of observations at equally spaced intervals of time.
      • A collection of data recorded over a period of time.
      • Example: data for one company at different points in time
  • Cross-sectional data

    • A collection of observations at a single point in time.
      • Example: data for different companies at the same point in time
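The difference between the two data types is easiest to see on one small table. The sketch below uses made-up revenue numbers keyed by (company, year); the figures and the helper names `time_series` and `cross_section` are hypothetical.

```python
# Hypothetical revenue figures (illustrative numbers only), keyed by (company, year).
revenue = {
    ("A", 2021): 10.0, ("A", 2022): 11.5, ("A", 2023): 12.1,
    ("B", 2021): 7.2,  ("B", 2022): 7.0,  ("B", 2023): 8.3,
    ("C", 2021): 3.4,  ("C", 2022): 4.1,  ("C", 2023): 4.0,
}

def time_series(data, company):
    """One company observed at equally spaced points in time."""
    return {year: v for (c, year), v in sorted(data.items()) if c == company}

def cross_section(data, year):
    """Many companies observed at a single point in time."""
    return {c: v for (c, y), v in sorted(data.items()) if y == year}

print(time_series(revenue, "A"))     # {2021: 10.0, 2022: 11.5, 2023: 12.1}
print(cross_section(revenue, 2022))  # {'A': 11.5, 'B': 7.0, 'C': 4.1}
```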

3. Central limit theorem

  • What distribution does the sample mean $\bar{X}$ follow?

  • Central limit theorem (a result established by statisticians)
    • Describes the distribution that the sample mean $\bar{X}$ follows
    • For a sufficiently large sample size ($n \ge 30$), for any underlying distribution of a random variable with known population mean $\mu$ and variance $\sigma^2$, the sampling distribution of the sample mean:
      • will be approximately normal
      • has a mean equal to the population mean, $\mu$
      • has a variance equal to the population variance of the variable divided by the sample size, $\sigma^2 / n$
    • Conditions:
      • $n \ge 30$
      • The population mean and variance are known
    • Conclusions:
      • $\bar{X}$ is approximately normally distributed
      • $E(\bar{X}) = \mu$
      • $\mathrm{Var}(\bar{X}) = \sigma^2 / n$
      • $\sigma_{\bar{X}} = \sigma / \sqrt{n}$
      • $\bar{X} \sim N(\mu, \sigma^2 / n)$
  • Standard error of the sample mean

    • The standard deviation of the sample mean
    • Known population variance:
      • $\sigma_{\bar{X}} = \sigma / \sqrt{n}$
    • Unknown population variance:
      • $s_{\bar{X}} = s / \sqrt{n}$, where $s$ is the sample standard deviation
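The central limit theorem and the standard error formula can be checked with a small simulation. The sketch below is illustrative only: the exponential population (so $\mu = 1$ and $\sigma = 1$), the sample size of 50 and the 5,000 repetitions are assumptions chosen for the demonstration, not values from these notes.

```python
import math
import random
import statistics

random.seed(0)

# Assumed population: exponential with rate 1, so mu = 1 and sigma = 1 (strongly skewed).
mu, sigma = 1.0, 1.0
n = 50          # sample size (at least 30)
trials = 5000   # number of repeated samples

sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(trials)
]

theoretical_se = sigma / math.sqrt(n)           # standard error with known sigma
observed_mean = statistics.fmean(sample_means)  # should be close to mu
observed_sd = statistics.stdev(sample_means)    # should be close to sigma / sqrt(n)

# Share of sample means within 1.96 standard errors of mu;
# roughly 95% if the sampling distribution is approximately normal.
within = sum(abs(m - mu) <= 1.96 * theoretical_se for m in sample_means) / trials

print(f"mean of sample means: {observed_mean:.4f} (population mean {mu})")
print(f"sd of sample means:   {observed_sd:.4f} (sigma/sqrt(n) = {theoretical_se:.4f})")
print(f"share within 1.96 SE: {within:.3f}")
```

Even though the population is far from normal, the sample means cluster around $\mu$ with a spread close to $\sigma / \sqrt{n}$.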

4. Desirable properties of an estimator

  • Unbiasedness

    • The expected value of the estimator equals the population parameter.
    • Example: $E(\bar{X}) = \mu$, so the sample mean is an unbiased estimator of the population mean
  • Efficiency
    • Among all unbiased estimators of the same parameter, the efficient one has the smallest variance.
  • Consistency

    • The probability of accurate estimates increases as the sample size $n$ increases.
    • The standard deviation of the parameter estimate decreases as the sample size increases.
    • If the sample size rises, the standard error of the sample mean falls.
      • $\sigma_{\bar{X}} = \sigma / \sqrt{n}$, so a larger $n$ gives a smaller standard error
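The three properties can be illustrated by simulation. The sketch below assumes a normal population with mean 10 and standard deviation 2 (made-up values) and compares two estimators of the population mean, the sample mean and the sample median; the helper `simulate` is a hypothetical name.

```python
import random
import statistics

random.seed(1)
MU, SIGMA = 10.0, 2.0  # assumed normal population parameters

def simulate(estimator, n, trials=2000):
    """Return (average estimate, standard deviation of estimates) over repeated samples."""
    estimates = [
        estimator([random.gauss(MU, SIGMA) for _ in range(n)])
        for _ in range(trials)
    ]
    return statistics.fmean(estimates), statistics.stdev(estimates)

results = {}
for n in (10, 40, 160):
    results[n] = {
        "mean": simulate(statistics.fmean, n),
        "median": simulate(statistics.median, n),
    }
    avg, sd_mean = results[n]["mean"]
    _, sd_median = results[n]["median"]
    print(f"n={n:4d}  avg of sample means={avg:.3f}  "
          f"sd(mean)={sd_mean:.3f}  sd(median)={sd_median:.3f}")
```

The average of the sample means stays near 10 (unbiasedness), the spread of the sample mean is smaller than that of the median at every $n$ (efficiency, for this normal population), and both spreads shrink as $n$ grows (consistency).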

5. Point estimate, confidence interval estimate

  • Point estimate

    • The statistic, computed from sample information, that is used to estimate the population parameter.
  • Confidence interval estimate

    • A range of values constructed from sample data so that the parameter falls within that range with a specified probability.
    • $\alpha$ (alpha): the level of significance
      • The level of significance is the combined area of the two tails outside the confidence interval
    • Degree of confidence ($1 - \alpha$)
      • The degree of confidence, $1 - \alpha$, is the area inside the confidence interval
    • Confidence interval = point estimate $\pm$ reliability factor × standard error
      • i.e. $\bar{X} \pm z_{\alpha/2} \cdot \sigma / \sqrt{n}$
        • The larger $n$ is, the more precise the estimate and the narrower the interval
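The confidence interval formula translates directly into code. This is a minimal sketch for the known-variance (z) case using only the Python standard library; the function name `z_confidence_interval` and the inputs (sample mean 5.0, $\sigma$ = 2.0) are hypothetical.

```python
import math
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Two-sided CI: point estimate +/- reliability factor * standard error."""
    alpha = 1.0 - confidence                     # level of significance
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # reliability factor z_{alpha/2}
    standard_error = sigma / math.sqrt(n)
    half_width = z * standard_error
    return sample_mean - half_width, sample_mean + half_width

# Hypothetical inputs: sample mean 5.0, known population sigma 2.0.
for n in (25, 100, 400):
    low, high = z_confidence_interval(5.0, 2.0, n)
    print(f"n={n:3d}: [{low:.3f}, {high:.3f}]")
```

Quadrupling the sample size halves the standard error, so each interval printed is half as wide as the one before it.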

6. Student's t-distribution

  • Why is it called Student's t-distribution? Because its inventor published under the pen name "Student".

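  • Properties of Student's t-distribution:
    • Symmetrical (so skewness = 0)
    • Every t-distribution has a mean of 0
    • Degrees of freedom (df): $n - 1$
      • Each value of df gives a different t-distribution
    • Less peaked than a normal distribution ("fatter tails")
      • Compared with the standard normal distribution, the t-distribution has a lower peak and fatter tails
        • Note: the "high peak, fat tails" shape discussed earlier assumed the same dispersion (the same variance $\sigma^2$) as the normal distribution; the comparison here carries no such assumption.
        • The t-distribution is more dispersed both in the middle and in the tails (fat tails)
        • Its variance is larger than the standard normal's: $\sigma^2 = df / (df - 2) > 1$ for $df > 2$
        • The lower peak comes from this larger variance, not from low kurtosis: the fat tails give kurtosis > 3 (excess kurtosis $6 / (df - 4)$ for $df > 4$)
    • Student's t-distribution converges to the standard normal distribution as the degrees of freedom go to infinity.
    • At the same level of significance (same $\alpha$), the t-distribution's confidence interval (point estimate $\pm\, k \times$ standard error) is wider because of the fatter tails, so the reliability factor $k$ is larger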
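The claims about wider intervals and convergence to the normal distribution can be checked numerically. The sketch below is one possible standard-library approach, not a reference implementation: it integrates the t density with Simpson's rule and finds the two-sided reliability factor by bisection. The helper names `t_pdf`, `t_cdf` and `t_critical` are made up for the example.

```python
import math
from statistics import NormalDist

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=1000):
    """P(T <= x) for x >= 0: Simpson's rule on [0, x] plus symmetry about 0."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_critical(alpha, df):
    """Two-sided reliability factor t_{alpha/2}, found by bisection on the CDF."""
    target, lo, hi = 1 - alpha / 2, 0.0, 100.0
    for _ in range(40):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

z = NormalDist().inv_cdf(0.975)
for df in (5, 10, 30, 1000):
    print(f"df={df:5d}  t_0.025={t_critical(0.05, df):.3f}  (z_0.025={z:.3f})")
```

The reliability factor is largest for small df and approaches the z value as df grows, which is why a t-based interval is wider at the same $\alpha$.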


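  • t-distribution vs. normal distribution
    • Use z when the population variance is known; use t when it is unknown
      • This rests on the central limit theorem: if $n \ge 30$ and the population mean and variance are known, then $\bar{X} \sim N(\mu, \sigma^2 / n)$
    • For a non-normal population with a small sample ($n < 30$), no estimate is available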
    • Known variance: $\bar{X} \pm z_{\alpha/2} \cdot \sigma / \sqrt{n}$
    • Unknown variance: $\bar{X} \pm t_{\alpha/2} \cdot s / \sqrt{n}$
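The selection rule stated above (z for known variance, t for unknown variance, nothing for a small sample from a non-normal population) fits in a short function. The name `reliability_statistic` and its signature are hypothetical, and the rule is only as detailed as these notes make it.

```python
def reliability_statistic(normal_population, variance_known, n):
    """Pick the reliability factor for a confidence interval on the mean.

    Returns "z", "t", or None when no estimate is available.
    """
    if not normal_population and n < 30:
        return None  # non-normal population, small sample: not available
    return "z" if variance_known else "t"

print(reliability_statistic(True, True, 10))    # z
print(reliability_statistic(True, False, 10))   # t
print(reliability_statistic(False, True, 20))   # None
print(reliability_statistic(False, False, 50))  # t
```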


7. Types of bias

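  • Data-mining bias
    • Refers to results where the statistical significance of the pattern is overestimated because the results were found through data mining.
    • "Mistaking coincidence for a real pattern"
    • Example: a model is built on data from one database and then tested on data from that same database
      • Remedy: build the model on one database and test it on a different one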
  • Sample selection bias
    • Some data is systematically excluded from the analysis, usually because of the lack of availability.
  • Survivorship bias
    • Usually arises from sample selection in which only portfolios that still exist are included.
    • Hedge fund returns show pronounced survivorship bias: hedge funds are risky, and the many funds that failed no longer exist, so they fall outside the statistics
  • Look-ahead bias
    • Occurs when a study tests a relationship using sample data that was not available on the test date.
  • Time-period bias

    • The time period over which the data is gathered is either too short or too long. If the time period is too short, research results may reflect phenomena specific to that time period, or perhaps even data mining. If it is too long, the economic relationships behind the results may have changed over the period.
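Survivorship bias in particular is easy to reproduce in a simulation. Everything in the sketch below is an assumption made for illustration: 1,000 hypothetical funds, five annual returns each drawn from a normal distribution with mean 5% and standard deviation 20%, and a closure rule under which a fund disappears after any single-year loss worse than 30%.

```python
import random
import statistics

random.seed(7)

# Hypothetical universe: 1,000 funds, 5 annual returns each.
funds = [[random.gauss(0.05, 0.20) for _ in range(5)] for _ in range(1000)]

def survived(returns, floor=-0.30):
    """Assumed rule: a fund closes if any single year loses more than 30%."""
    return all(r > floor for r in returns)

all_funds_mean = statistics.fmean([statistics.fmean(f) for f in funds])
survivors = [f for f in funds if survived(f)]
survivors_mean = statistics.fmean([statistics.fmean(f) for f in survivors])

print(f"funds surviving:            {len(survivors)} of {len(funds)}")
print(f"average return, all funds:  {all_funds_mean:.2%}")
print(f"average return, survivors:  {survivors_mean:.2%}")
```

A database that contains only the surviving funds would report the second average, which overstates what an investor in the full universe actually earned.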

9. Example questions

(1) Sampling and estimation, types of data

      image.png
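  • The question does not clearly involve time-series data (time-series data would be, for example, the same company at different times)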

(2) Central limit theorem: standard error of the sample mean

    image.png

  • (06)Sampling and Estimation - 图32

(3) Properties of a good estimator

    image.png

  • The third statement (about consistency) is incorrect

(4) Properties of confidence intervals

    image.png

(5) Properties of the t-distribution

    image.png

(6) Types of bias

    image.png