机器学习（算法与数据结构篇） - 水库抽样/shuffle/sample/洗牌算法 - 《机器学习》

1. 水库抽样
2. shuffle的实现
3. Sample的实现
4. 随机数的实现、模拟及变体

1. 水库抽样

1.1 水库抽样问题(海量数据随机抽样问题)

输入：一组数据，大小未知输出：这组数据的K个均匀抽取要求：仅扫描一次总体要求：从N个元素中随机的抽取k个元素，其中N无法确定，保证每个元素抽到的概率相同

1.2 解决方法

n为数据总量，k为水库大小(要抽样个数)，i为当前遍历的元素的位置。当ik时，以概率k/i选取该数据随机替换掉水库中的任一数据。通常实现的时候是生成一个0到i的随机数，跟k相比。 int r = random.nextInt(i + 1);
if (r < K) {
result[r] = pool[i];
}

1.3 证明

截屏2021-06-09 下午2.09.32.png

2. shuffle的实现

def shuffle(self, x, random=None, int=int):
        """x, random=random.random -> shuffle list x in place; return None.
        Optional arg random is a 0-argument function returning a random
        float in [0.0, 1.0); by default, the standard random.random.
        """
        if random is None:
            random = self.random
        for i in reversed(xrange(1, len(x))):
            # pick an element in x[:i+1] with which to exchange x[i]
            j = int(random() * (i+1))
            x[i], x[j] = x[j], x[i]

要证明算法的正确性也很简单，即任何一个元素shuffle之后出现在任意位置的概率都是1/N。任意一个元素，放在第N-1个位置的概率是1/N，放在pos N-2的位置是 (N-1)/N * 1 / (N-1) = 1/N 。需要注意的是，一个元素一旦被交换到了序列的尾部，那么就不会再被选中，这也是算法一目了然的原因。

3. Sample的实现

def sample(self, population, k):
        """Chooses k unique random elements from a population sequence.
        Returns a new list containing elements from the population while
        leaving the original population unchanged.  The resulting list is
        in selection order so that all sub-slices will also be valid random
        samples.  This allows raffle winners (the sample) to be partitioned
        into grand prize and second place winners (the subslices).
        Members of the population need not be hashable or unique.  If the
        population contains repeats, then each occurrence is a possible
        selection in the sample.
        To choose a sample in a range of integers, use xrange as an argument.
        This is especially fast and space efficient for sampling from a
        large population:   sample(xrange(10000000), 60)
        """
        # Sampling without replacement entails tracking either potential
        # selections (the pool) in a list or previous selections in a set.
        # When the number of selections is small compared to the
        # population, then tracking selections is efficient, requiring
        # only a small set and an occasional reselection.  For
        # a larger number of selections, the pool tracking method is
        # preferred since the list takes less space than the
        # set and it doesn't suffer from frequent reselections.
        n = len(population)
        if not 0 <= k <= n:
            raise ValueError("sample larger than population")
        random = self.random
        _int = int
        result = [None] * k
        setsize = 21        # size of a small set minus size of an empty list
        if k > 5:
            setsize += 4 ** _ceil(_log(k * 3, 4)) # table size for big sets
        if n <= setsize or hasattr(population, "keys"):
            # An n-length list is smaller than a k-length set, or this is a
            # mapping type so the other algorithm wouldn't work.
            pool = list(population)
            for i in xrange(k):         # invariant:  non-selected at [0,n-i)
                j = _int(random() * (n-i))
                result[i] = pool[j]
                pool[j] = pool[n-i-1]   # move non-selected item into vacancy
        else:
            try:
                selected = set()
                selected_add = selected.add
                for i in xrange(k):
                    j = _int(random() * n)
                    while j in selected:
                        j = _int(random() * n)
                    selected_add(j)
                    result[i] = population[j]
            except (TypeError, KeyError):   # handle (at least) sets
                if isinstance(population, list):
                    raise
                return self.sample(tuple(population), k)
        return result

水库抽样/shuffle/sample/洗牌算法

1. 水库抽样

1.1 水库抽样问题(海量数据随机抽样问题)

1.2 解决方法

1.3 证明

2. shuffle的实现

3. Sample的实现

4. 随机数的实现、模拟及变体