Hive 中所有的操作符、函数和聚合函数可见官方文档《Hive Operators and User-Defined Functions》
Hive 的其他使用方法,可以先看一遍《Hive Language Manual》,看过后处理一些问题比直接搜索引擎要有效率。
Return Type | Name(Signature) | Description |
---|---|---|
BIGINT | count(*), count(expr), count(DISTINCT expr[, expr…]) | 计数、去重计数 count(*) - Returns the total number of retrieved rows, including rows containing NULL values. 包含 NULL 的值 count(expr) - Returns the number of rows for which the supplied expression is non-NULL. 不含 NULL 的值 count(DISTINCT expr[, expr]) - Returns the number of rows for which the supplied expression(s) are unique and non-NULL. Execution of this can be optimized with hive.optimize.distinct.rewrite. |
DOUBLE | sum(col), sum(DISTINCT col) | 求和,去重后求和 Returns the sum of the elements in the group or the sum of the distinct values of the column in the group. |
DOUBLE | avg(col), avg(DISTINCT col) | 平均,去重后去平均 Returns the average of the elements in the group or the average of the distinct values of the column in the group. |
DOUBLE | min(col) | 取最小值 Returns the minimum of the column in the group. |
DOUBLE | max(col) | 取最大值 Returns the maximum value of the column in the group. |
DOUBLE | variance(col), var_pop(col) | 取方差 Returns the variance of a numeric column in the group. |
DOUBLE | var_samp(col) | Returns the unbiased sample variance of a numeric column in the group. |
DOUBLE | stddev_pop(col) | 取标准差 Returns the standard deviation of a numeric column in the group. |
DOUBLE | stddev_samp(col) | Returns the unbiased sample standard deviation of a numeric column in the group. |
DOUBLE | covar_pop(col1, col2) | 取两列的协方差 Returns the population covariance of a pair of numeric columns in the group. |
DOUBLE | covar_samp(col1, col2) | Returns the sample covariance of a pair of a numeric columns in the group. |
DOUBLE | corr(col1, col2) | Returns the Pearson coefficient of correlation of a pair of a numeric columns in the group. |
DOUBLE | percentile(BIGINT col, p) | 取百分位数 Returns the exact pth percentile of a column in the group (does not work with floating point types). p must be between 0 and 1. NOTE: A true percentile can only be computed for integer values. Use PERCENTILE_APPROX if your input is non-integral. |
array |
percentile(BIGINT col, array(p1 [, p2]…)) | 取一组百分位数 Returns the exact percentiles p1, p2, … of a column in the group (does not work with floating point types). pi must be between 0 and 1. NOTE: A true percentile can only be computed for integer values. Use PERCENTILE_APPROX if your input is non-integral. |
DOUBLE | percentile_approx(DOUBLE col, p [, B]) | Returns an approximate pth percentile of a numeric column (including floating point types) in the group. The B parameter controls approximation accuracy at the cost of memory. Higher values yield better approximations, and the default is 10,000. When the number of distinct values in col is smaller than B, this gives an exact percentile value. |
array |
percentile_approx(DOUBLE col, array(p1 [, p2]…) [, B]) | Same as above, but accepts and returns an array of percentile values instead of a single one. |
double | regr_avgx(independent, dependent) | Equivalent to avg(dependent). As of Hive 2.2.0. |
double | regr_avgy(independent, dependent) | Equivalent to avg(independent). As of Hive 2.2.0. |
double | regr_count(independent, dependent) | Returns the number of non-null pairs used to fit the linear regression line. As of Hive 2.2.0. |
double | regr_intercept(independent, dependent) | Returns the y-intercept of the linear regression line, i.e. the value of b in the equation dependent = a * independent + b. As of Hive 2.2.0. |
double | regr_r2(independent, dependent) | Returns the coefficient of determination for the regression. As of Hive 2.2.0. |
double | regr_slope(independent, dependent) | Returns the slope of the linear regression line, i.e. the value of a in the equation dependent = a * independent + b. As of Hive 2.2.0. |
double | regr_sxx(independent, dependent) | Equivalent to regr_count(independent, dependent) * var_pop(dependent). As of Hive 2.2.0. |
double | regr_sxy(independent, dependent) | Equivalent to regr_count(independent, dependent) * covar_pop(independent, dependent). As of Hive 2.2.0. |
double | regr_syy(independent, dependent) | Equivalent to regr_count(independent, dependent) * var_pop(independent). As of Hive 2.2.0. |
array |
histogram_numeric(col, b) | Computes a histogram of a numeric column in the group using b non-uniformly spaced bins. The output is an array of size b of double-valued (x,y) coordinates that represent the bin centers and heights |
array | collect_set(col) | 取集合(去重复) Returns a set of objects with duplicate elements eliminated. |
array | collect_list(col) | 取列表(不去重) Returns a list of objects with duplicates. (As of Hive 0.13.0.) |
INTEGER | ntile(INTEGER x) | Divides an ordered partition into x groups called buckets and assigns a bucket number to each row in the partition. This allows easy calculation of tertiles, quartiles, deciles, percentiles and other common summary statistics. (As of Hive 0.11.0.) |