1. 水库抽样

1.1 水库抽样问题(海量数据随机抽样问题)

输入:一组数据,大小未知 输出:这组数据的K个均匀抽取 要求:仅扫描一次 总体要求:从N个元素中随机的抽取k个元素,其中N无法确定,保证每个元素抽到的概率相同

1.2 解决方法

n为数据总量,k为水库大小(要抽样个数),i为当前遍历的元素的位置。 当ik时,以概率k/i选取该数据随机替换掉水库中的任一数据。通常实现的时候是生成一个0到i的随机数,跟k相比。 int r = random.nextInt(i + 1);
if (r < K) {
result[r] = pool[i];
}

1.3 证明

截屏2021-06-09 下午2.09.32.png

2. shuffle的实现

  1. def shuffle(self, x, random=None, int=int):
  2. """x, random=random.random -> shuffle list x in place; return None.
  3. Optional arg random is a 0-argument function returning a random
  4. float in [0.0, 1.0); by default, the standard random.random.
  5. """
  6. if random is None:
  7. random = self.random
  8. for i in reversed(xrange(1, len(x))):
  9. # pick an element in x[:i+1] with which to exchange x[i]
  10. j = int(random() * (i+1))
  11. x[i], x[j] = x[j], x[i]

要证明算法的正确性也很简单,即任何一个元素shuffle之后出现在任意位置的概率都是1/N。任意一个元素,放在第N-1个位置的概率是1/N, 放在pos N-2的位置是 (N-1)/N * 1 / (N-1) = 1/N 。需要注意的是,一个元素一旦被交换到了序列的尾部,那么就不会再被选中,这也是算法一目了然的原因。

3. Sample的实现

  1. def sample(self, population, k):
  2. """Chooses k unique random elements from a population sequence.
  3. Returns a new list containing elements from the population while
  4. leaving the original population unchanged. The resulting list is
  5. in selection order so that all sub-slices will also be valid random
  6. samples. This allows raffle winners (the sample) to be partitioned
  7. into grand prize and second place winners (the subslices).
  8. Members of the population need not be hashable or unique. If the
  9. population contains repeats, then each occurrence is a possible
  10. selection in the sample.
  11. To choose a sample in a range of integers, use xrange as an argument.
  12. This is especially fast and space efficient for sampling from a
  13. large population: sample(xrange(10000000), 60)
  14. """
  15. # Sampling without replacement entails tracking either potential
  16. # selections (the pool) in a list or previous selections in a set.
  17. # When the number of selections is small compared to the
  18. # population, then tracking selections is efficient, requiring
  19. # only a small set and an occasional reselection. For
  20. # a larger number of selections, the pool tracking method is
  21. # preferred since the list takes less space than the
  22. # set and it doesn't suffer from frequent reselections.
  23. n = len(population)
  24. if not 0 <= k <= n:
  25. raise ValueError("sample larger than population")
  26. random = self.random
  27. _int = int
  28. result = [None] * k
  29. setsize = 21 # size of a small set minus size of an empty list
  30. if k > 5:
  31. setsize += 4 ** _ceil(_log(k * 3, 4)) # table size for big sets
  32. if n <= setsize or hasattr(population, "keys"):
  33. # An n-length list is smaller than a k-length set, or this is a
  34. # mapping type so the other algorithm wouldn't work.
  35. pool = list(population)
  36. for i in xrange(k): # invariant: non-selected at [0,n-i)
  37. j = _int(random() * (n-i))
  38. result[i] = pool[j]
  39. pool[j] = pool[n-i-1] # move non-selected item into vacancy
  40. else:
  41. try:
  42. selected = set()
  43. selected_add = selected.add
  44. for i in xrange(k):
  45. j = _int(random() * n)
  46. while j in selected:
  47. j = _int(random() * n)
  48. selected_add(j)
  49. result[i] = population[j]
  50. except (TypeError, KeyError): # handle (at least) sets
  51. if isinstance(population, list):
  52. raise
  53. return self.sample(tuple(population), k)
  54. return result

4. 随机数的实现、模拟及变体