Estimation from a sample is needed in two situations:
Sampling and estimation
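- Simple random sampling
  - Draw at random; every member of the population must have the same probability of being selected (fairness principle).
- Stratified random sampling
  - Separate the population into smaller groups (strata) based on one or more distinguishing characteristics.
  - Strata and cells: classifying on two characteristics with M and N categories gives M × N cells.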
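The two sampling schemes above can be sketched as follows. This is a minimal illustration, not a prescribed method: the bond universe, the helper names `simple_random_sample` and `stratified_random_sample`, and the proportional allocation rule are all assumptions made for the example.

```python
import random

def simple_random_sample(population, k, rng):
    """Draw k members without replacement; every member is equally likely."""
    return rng.sample(population, k)

def stratified_random_sample(population, stratum_of, k, rng):
    """Split the population into strata, then sample each stratum
    in proportion to its share of the population."""
    strata = {}
    for item in population:
        strata.setdefault(stratum_of(item), []).append(item)
    sample = []
    for members in strata.values():
        # Proportional allocation, with at least one draw per stratum.
        size = max(1, round(k * len(members) / len(population)))
        sample.extend(rng.sample(members, min(size, len(members))))
    return sample

rng = random.Random(42)
# Hypothetical bond universe: 3 ratings x 2 maturity buckets = 6 cells.
ratings = ["AAA", "AA", "A"]
maturities = ["short", "long"]
bonds = [(i, ratings[i % 3], maturities[i % 2]) for i in range(600)]

srs = simple_random_sample(bonds, 60, rng)
strat = stratified_random_sample(bonds, lambda b: (b[1], b[2]), 60, rng)
cells_covered = {(b[1], b[2]) for b in strat}  # stratified sample hits every cell
```

The stratified sample is guaranteed to represent every cell, while a simple random sample only does so by chance.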
- Sampling error
  - sampling error of the mean = sample mean - population mean
The sample statistic itself is a random variable and has a probability distribution.
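A small sketch of this idea, assuming a made-up population of returns: each new sample produces a different sample mean and therefore a different sampling error. The population parameters and the helper name `sampling_error_of_mean` are illustrative only.

```python
import random
import statistics

rng = random.Random(0)
# Hypothetical population of 10,000 returns (values are made up).
population = [rng.gauss(0.05, 0.20) for _ in range(10_000)]
population_mean = statistics.fmean(population)

def sampling_error_of_mean(sample, population_mean):
    """Sample mean minus population mean."""
    return statistics.fmean(sample) - population_mean

# Five independent samples of size 50 give five different sampling errors.
errors = [
    sampling_error_of_mean(rng.sample(population, 50), population_mean)
    for _ in range(5)
]
```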
Time-series data
- A collection of observations at equally spaced intervals of time.
- A collection of data recorded over a period of time.
- Example: data for one company at different points in time.
Cross-sectional data
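What distribution does the sample mean follow?
- Central Limit Theorem (a result derived by statisticians)
  - Describes the distribution that the sample mean follows.
  - For a sufficiently large sample size $n$, and for any underlying distribution of a random variable with known (finite) population mean $\mu$ and variance $\sigma^2$, the sampling distribution of the sample mean:
    - will be approximately normal
    - has mean equal to the population mean $\mu$
    - has variance equal to the population variance divided by the sample size, which equals $\sigma^2 / n$
  - Conditions:
    - the sample size is sufficiently large ($n \ge 30$)
    - the population mean and variance are known
  - Conclusion:
    - the sample mean is approximately normally distributed: $\bar{X} \sim N(\mu, \sigma^2 / n)$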
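The theorem can be checked with a quick simulation. The sketch below assumes an exponential population (mean 1, variance 1, strongly skewed), a sample size of 50 and 5,000 repeated samples; all of these choices are arbitrary and only for illustration.

```python
import math
import random
import statistics

rng = random.Random(1)
n = 50               # sample size (assumed "sufficiently large")
num_samples = 5_000  # number of repeated samples

# Exponential population with rate 1: mean 1, variance 1, far from normal.
pop_mean, pop_var = 1.0, 1.0

sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)    # close to pop_mean
var_of_means = statistics.pvariance(sample_means) # close to pop_var / n

# If the sample mean is roughly normal, about 95% of the sample means
# should lie within 1.96 standard errors of the population mean.
se = math.sqrt(pop_var / n)
share_within = sum(abs(m - pop_mean) <= 1.96 * se for m in sample_means) / num_samples
```

Even though the population is skewed, the sample means cluster around the population mean with variance close to the population variance divided by n.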
Standard error of the sample mean
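For reference, the standard error of the sample mean is usually written as below. The symbols are assumptions introduced here, since the notes give no formula at this point: $\sigma$ is the population standard deviation, $s$ the sample standard deviation, and $n$ the sample size.

$$
\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \quad \text{(population variance known)}, \qquad
s_{\bar{X}} = \frac{s}{\sqrt{n}} \quad \text{(population variance unknown)}
$$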
Unbiasedness
- The expected value of the estimator equals the population parameter.
- For example, the expected value of the estimated mean equals the target (population) mean.
Efficiency
- Among unbiased estimators of the same parameter, the efficient one has the smallest variance.
Consistency
- The probability of accurate estimates increases as the sample size increases (as the sample size $n$ grows, the estimate becomes more accurate).
- The standard deviation of the parameter estimate decreases as the sample size increases.
- If the sample size rises, the standard error of the sample mean falls.
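Two of these properties can be illustrated numerically. The sketch below assumes a normal population with variance 4 and very small samples (n = 5), chosen only to make the effect visible: the sample variance with divisor n - 1 averages out to the population variance (unbiased), the divisor-n version does not, and the standard error of the mean shrinks as n grows (consistency).

```python
import math
import random
import statistics

rng = random.Random(2)
true_var = 4.0   # population is N(0, 2^2), chosen for illustration
n = 5
trials = 20_000

unbiased, biased = [], []
for _ in range(trials):
    sample = [rng.gauss(0.0, 2.0) for _ in range(n)]
    unbiased.append(statistics.variance(sample))   # divides by n - 1
    biased.append(statistics.pvariance(sample))    # divides by n

avg_unbiased = statistics.fmean(unbiased)  # close to true_var
avg_biased = statistics.fmean(biased)      # close to true_var * (n - 1) / n

# Consistency: the standard error of the sample mean falls as n rises.
standard_errors = {size: 2.0 / math.sqrt(size) for size in (25, 100, 400)}
```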
Point estimate
- The statistic, computed from sample information, that is used to estimate the population parameter.
Confidence interval estimate
- A range of values constructed from sample data so that the parameter lies within that range with a specified probability.
- $\alpha$: the level of significance
  - The significance level is the combined area of the two tails outside the confidence interval.
- Degree of confidence ($1 - \alpha$)
  - The degree of confidence is $1 - \alpha$, i.e. the area covered by the confidence interval.
- Confidence interval = point estimate ± reliability factor × standard error
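The confidence interval formula can be sketched in a few lines. This example assumes the population standard deviation is known, so the reliability factor comes from the standard normal distribution; the function name `z_confidence_interval` and the input values are hypothetical.

```python
import math
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, alpha):
    """Two-sided interval: point estimate ± reliability factor × standard error."""
    reliability_factor = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 5%
    standard_error = sigma / math.sqrt(n)
    half_width = reliability_factor * standard_error
    return sample_mean - half_width, sample_mean + half_width

# Hypothetical inputs: sample mean 10%, known sigma 20%, n = 100, alpha = 5%.
lower, upper = z_confidence_interval(0.10, 0.20, 100, 0.05)
# lower is roughly 0.0608 and upper roughly 0.1392
```

A smaller $\alpha$ (higher degree of confidence) gives a larger reliability factor and therefore a wider interval.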
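Why is the t-distribution called Student's t-distribution? Because its inventor published under the pen name "Student".
- Properties of Student's t-distribution (5):
  - Symmetrical (so skewness = 0)
    - Every t-distribution has a mean of 0.
  - Degrees of freedom (df): $n - 1$
    - Different degrees of freedom give different t-distributions.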
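  - Less peaked than a normal distribution ("fatter tails")
    - Compared with the normal distribution, the t-distribution has a lower peak and fatter tails.
    - Note: the "high peak, fat tails" shape described earlier assumes the same dispersion (variance) as the normal distribution; the comparison here makes no such assumption.
    - The t-distribution has a larger variance than the standard normal, so it is more spread out both in the middle (lower peak) and in the tails (fat tails).
    - The lower peak therefore reflects the larger variance, not low kurtosis: because of the fat tails, kurtosis is greater than 3 (it equals $3 + 6/(df - 4)$ for $df > 4$).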
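  - Student's t-distribution converges to the standard normal distribution as the degrees of freedom go to infinity.
  - At the same significance level (same $\alpha$), the t-distribution gives a wider confidence interval because it is lower-peaked with fatter tails, so the reliability factor k is larger.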
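- t-distribution vs. normal distribution
  - Use z when the population variance is known; use t when it is unknown.
  - This rests on the central limit theorem (CLT: if $n \ge 30$ and the population mean and variance are known, then $\bar{X} \sim N(\mu, \sigma^2 / n)$).
  - For a non-normal population with a small sample ($n < 30$), no estimate can be made.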
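The shape comparison between the t-distribution and the standard normal can be checked directly from the densities. The sketch below implements the textbook t density with `math.lgamma`; the helper name `t_pdf` and the chosen degrees of freedom (5 and 1000) are assumptions for illustration.

```python
import math
from statistics import NormalDist

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    # lgamma avoids overflow of the gamma function for large df.
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

z = NormalDist()  # standard normal

peak_t, peak_z = t_pdf(0.0, 5), z.pdf(0.0)   # t has the lower peak
tail_t, tail_z = t_pdf(3.0, 5), z.pdf(3.0)   # t has the fatter tail

# With many degrees of freedom the t density is almost the normal density.
gap_large_df = abs(t_pdf(0.0, 1000) - z.pdf(0.0))
```

With 5 degrees of freedom the peak is lower and the density at 3 is several times the normal density, which is why t-based reliability factors exceed z-based ones at the same significance level.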
7. Types of bias
- Data-mining bias
  - Refers to results where the statistical significance of the pattern is overestimated because the results were found through data mining.
  - In short: mistaking coincidence for a real pattern.
  - Example: the model is fitted on data from one database and then tested on data from that same database.
  - Improvement: fit the model on data from one database and test it on data from a different one.
- Sample selection bias
  - Some data are systematically excluded from the analysis, usually because they are unavailable.
- Survivorship bias
  - Usually arises from sample selection, because only portfolios that still exist are included.
  - Hedge fund returns show pronounced survivorship bias: hedge funds are risky, and many that failed no longer exist and so fall outside the data.
- Look-ahead bias
  - Occurs when a study tests a relationship using sample data that were not available on the test date.
- Time-period bias
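Survivorship bias from the list above is easy to reproduce in a toy simulation. Everything in this sketch is assumed for illustration: the fund universe, the return distribution, and the rule that funds losing more than 20% close and vanish from the database.

```python
import random
import statistics

rng = random.Random(3)
# Hypothetical fund universe: 1,000 annual returns, mean 5%, volatility 25%.
all_returns = [rng.gauss(0.05, 0.25) for _ in range(1000)]

# Assumed rule: funds that lose more than 20% shut down and drop out of the data.
survivors = [r for r in all_returns if r > -0.20]

mean_all = statistics.fmean(all_returns)
mean_survivors = statistics.fmean(survivors)

# Positive by construction: the worst performers were removed from the sample.
survivorship_bias = mean_survivors - mean_all
```

An analyst who only sees the surviving funds would overstate the average return of the whole universe.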
It is not explicitly stated to be time-series data (time-series data: e.g. the same company observed at different times).
(2) Central limit theorem: standard error of the sample mean
- $\sigma_{\bar{X}} = \sigma / \sqrt{n}$
(3) Properties of a good estimator
- Unbiasedness, efficiency, consistency
(4) Properties of confidence intervals
(5) Properties of the t-distribution
(6) Types of bias