1. 水库抽样
1.1 水库抽样问题(海量数据随机抽样问题)
输入:一组数据,大小未知 输出:这组数据的K个均匀抽取 要求:仅扫描一次 总体要求:从N个元素中随机的抽取k个元素,其中N无法确定,保证每个元素抽到的概率相同
1.2 解决方法
n为数据总量,k为水库大小(要抽样个数),i为当前遍历的元素的位置。 当i
k时,以概率k/i选取该数据随机替换掉水库中的任一数据。通常实现的时候是生成一个0到i的随机数,跟k相比。 int r = random.nextInt(i + 1);
if (r < K) {
result[r] = pool[i];
}
1.3 证明
2. shuffle的实现
def shuffle(self, x, random=None, int=int):
"""x, random=random.random -> shuffle list x in place; return None.
Optional arg random is a 0-argument function returning a random
float in [0.0, 1.0); by default, the standard random.random.
"""
if random is None:
random = self.random
for i in reversed(xrange(1, len(x))):
# pick an element in x[:i+1] with which to exchange x[i]
j = int(random() * (i+1))
x[i], x[j] = x[j], x[i]
要证明算法的正确性也很简单,即任何一个元素shuffle之后出现在任意位置的概率都是1/N。任意一个元素,放在第N-1个位置的概率是1/N, 放在pos N-2的位置是 (N-1)/N * 1 / (N-1) = 1/N 。需要注意的是,一个元素一旦被交换到了序列的尾部,那么就不会再被选中,这也是算法一目了然的原因。
3. Sample的实现
def sample(self, population, k):
"""Chooses k unique random elements from a population sequence.
Returns a new list containing elements from the population while
leaving the original population unchanged. The resulting list is
in selection order so that all sub-slices will also be valid random
samples. This allows raffle winners (the sample) to be partitioned
into grand prize and second place winners (the subslices).
Members of the population need not be hashable or unique. If the
population contains repeats, then each occurrence is a possible
selection in the sample.
To choose a sample in a range of integers, use xrange as an argument.
This is especially fast and space efficient for sampling from a
large population: sample(xrange(10000000), 60)
"""
# Sampling without replacement entails tracking either potential
# selections (the pool) in a list or previous selections in a set.
# When the number of selections is small compared to the
# population, then tracking selections is efficient, requiring
# only a small set and an occasional reselection. For
# a larger number of selections, the pool tracking method is
# preferred since the list takes less space than the
# set and it doesn't suffer from frequent reselections.
n = len(population)
if not 0 <= k <= n:
raise ValueError("sample larger than population")
random = self.random
_int = int
result = [None] * k
setsize = 21 # size of a small set minus size of an empty list
if k > 5:
setsize += 4 ** _ceil(_log(k * 3, 4)) # table size for big sets
if n <= setsize or hasattr(population, "keys"):
# An n-length list is smaller than a k-length set, or this is a
# mapping type so the other algorithm wouldn't work.
pool = list(population)
for i in xrange(k): # invariant: non-selected at [0,n-i)
j = _int(random() * (n-i))
result[i] = pool[j]
pool[j] = pool[n-i-1] # move non-selected item into vacancy
else:
try:
selected = set()
selected_add = selected.add
for i in xrange(k):
j = _int(random() * n)
while j in selected:
j = _int(random() * n)
selected_add(j)
result[i] = population[j]
except (TypeError, KeyError): # handle (at least) sets
if isinstance(population, list):
raise
return self.sample(tuple(population), k)
return result