Generating Random Data in Python

How Random Is Random?
What Is “Cryptographically Secure?”
What You’ll Cover Here
PRNGs in Python
- The random Module
- PRNGs for Arrays: numpy.random
CSPRNGs in Python
- os.urandom(): About as Random as It Gets
- Python’s Best Kept secrets
One Last Candidate: uuid
Recap
References:

How Random Is Random?

Most random data generated with Python is not fully random in the scientific sense of the word. Rather, it is pseudorandom: generated with a pseudorandom number generator (PRNG), which is essentially any algorithm for generating seemingly random but still reproducible data.

“True” random numbers can be generated by, you guessed it, a true random number generator (TRNG). One example is to repeatedly pick up a die off the floor, toss it in the air, and let it land how it may.

PRNGs are usually implemented in software rather than hardware, and they work slightly differently. They start with a random number, known as the seed, and then use an algorithm to generate a pseudorandom sequence of bits based on it.
You’ve probably seen random.seed(999), random.seed(1234), or the like, in Python. This function call seeds the underlying random number generator used by Python’s random module. It is what makes subsequent calls to generate random numbers deterministic: input A always produces output B. This blessing can also be a curse if it is used maliciously.
The example below uses a deliberately simple toy generator rather than the random module (hence the integer output; random.random() returns floats), but the principle is the same. If you use the seed value 1234, the subsequent sequence of calls to random() will always be identical:

>>> seed(1234)
>>> [random() for _ in range(10)]
[16, 10, 11, 14, 4, 12, 17, 13, 1, 3]
>>> seed(1234)
>>> [random() for _ in range(10)]
[16, 10, 11, 14, 4, 12, 17, 13, 1, 3]
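The integer output above is consistent with a toy multiplicative generator along the following lines. Treat this as a sketch under that assumption: the class name ToyGenerator and the constants 3 and 19 are illustrative choices, picked because they reproduce the sequence shown. Note that the block binds the names seed and random to the toy generator, so it shadows the standard library's random module within this snippet only.

```python
class ToyGenerator:
    """A deliberately weak PRNG: each output is (previous state * 3) % 19."""

    def seed(self, a=3):
        # The seed is simply the starting state.
        self.state = a

    def random(self):
        # Advance the state and return it. The output is fully
        # determined by the previous state, so the sequence repeats.
        self.state = (self.state * 3) % 19
        return self.state


_inst = ToyGenerator()
seed = _inst.seed
random = _inst.random

seed(1234)
first = [random() for _ in range(10)]
seed(1234)
second = [random() for _ in range(10)]

print(first)            # [16, 10, 11, 14, 4, 12, 17, 13, 1, 3]
print(first == second)  # True
```

Re-seeding with the same value resets the state, which is why both runs match exactly.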

What Is “Cryptographically Secure?”

CSPRNGs, or cryptographically secure PRNGs, are suitable for generating sensitive data such as passwords, authenticators, and tokens. Given a random string, there is realistically no way for Malicious Joe to determine what string came before or after that string in a sequence of random strings. A key point about CSPRNGs is that they are still pseudorandom. They are engineered in some way that is internally deterministic, but they add some other variable or have some property that makes them “random enough” to prohibit backing into whatever function enforces determinism.
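To make the distinction concrete, here is a minimal sketch of why a plain PRNG is a poor fit for secrets: anyone who obtains the generator's internal state can replay its future output exactly. It relies only on getstate() and setstate() from the standard library's random.Random class; the variable names (victim, attacker) are illustrative.

```python
import random

victim = random.Random()            # seeded from OS entropy by default
state = victim.getstate()           # snapshot of the internal state
original = [victim.getrandbits(32) for _ in range(5)]

# A second generator restored to the same state produces the same values.
attacker = random.Random()
attacker.setstate(state)
predicted = [attacker.getrandbits(32) for _ in range(5)]

print(predicted == original)        # True
```

In practice an attacker would not be handed the state directly, but the Mersenne Twister's state can be reconstructed from enough consecutive outputs, which is the property a CSPRNG is designed to rule out.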

What You’ll Cover Here

In practical terms, this means that you should use plain PRNGs for statistical modeling, simulation, and to make random data reproducible. They’re also significantly faster than CSPRNGs, as you’ll see later on. Use CSPRNGs for security and cryptographic applications where data sensitivity is imperative.
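If you'd like a rough check of the speed claim on your own machine, a sketch like the following compares the two approaches with timeit. The helper name time_calls and the choice of 8-byte values are assumptions made for illustration, and the numbers will vary by platform and Python version.

```python
import os
import random
import timeit


def time_calls(n=10_000):
    """Return the seconds taken to produce n 64-bit values each way."""
    prng = timeit.timeit(lambda: random.getrandbits(64), number=n)
    csprng = timeit.timeit(lambda: os.urandom(8), number=n)
    return prng, csprng


prng_s, csprng_s = time_calls()
print(f"random.getrandbits(64): {prng_s:.4f} s")
print(f"os.urandom(8):          {csprng_s:.4f} s")
```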

In addition to expanding on the use cases above, in this tutorial, you’ll delve into Python tools for using both PRNGs and CSPRNGs:

- PRNG options include the random module from Python’s standard library and its array-based NumPy counterpart, numpy.random.
- Python’s os, secrets, and uuid modules contain functions for generating cryptographically secure objects.

You’ll touch on all of the above and wrap up with a high-level comparison.

PRNGs in Python

The `random` Module

Probably the most widely known tool for generating random data in Python is its random module, which uses the Mersenne Twister PRNG algorithm as its core generator.

Earlier, you touched briefly on random.seed(), and now is a good time to see how it works. First, let’s build some random data without seeding. The random.random() function returns a random float in the interval [0.0, 1.0). The result will always be less than the right-hand endpoint (1.0). This is also known as a semi-open range:

>>> # Don't call `random.seed()` yet
>>> import random
>>> random.random()
0.35553263284394376
>>> random.random()
0.6101992345575074

If you run this code yourself, I’ll bet my life savings that the numbers returned on your machine will be different. The default when you don’t seed the generator is to use a “randomness source” from your OS if one is available, falling back to the current system time otherwise.
With random.seed(), you can make results reproducible, and the chain of calls after random.seed() will produce the same trail of data:

>>> random.seed(444)
>>> random.random()
0.3088946587429545
>>> random.random()
0.01323751590501987
>>> random.seed(444)  # Re-seed
>>> random.random()
0.3088946587429545
>>> random.random()
0.01323751590501987

Notice the repetition of “random” numbers. The sequence of random numbers becomes deterministic, or completely determined by the seed value, 444.
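random.seed() affects the single generator shared by the whole module. If you'd rather keep reproducible streams isolated from other code, one option is to create your own random.Random instances, each with its own seed. The following is a small sketch of that idea; the names rng_a and rng_b are illustrative.

```python
import random

# Two independent generators seeded with the same value.
rng_a = random.Random(444)
rng_b = random.Random(444)

seq_a = [rng_a.random() for _ in range(3)]
seq_b = [rng_b.random() for _ in range(3)]
print(seq_a == seq_b)  # True

# Drawing from the module-level generator does not disturb the instances.
random.random()
print(rng_a.random() == rng_b.random())  # True
```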

Above, you generated a random float. You can generate a random integer between two endpoints in Python with the random.randint() function. This spans the full [x, y] interval and may include both endpoints:

>>> random.randint(0, 10)
7
>>> random.randint(500, 50000)
18601

With random.randrange(), you can exclude the right-hand side of the interval, meaning the generated number always lies within [x, y) and will always be smaller than the right endpoint:

>>> random.randrange(1, 10)
5
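A quick way to convince yourself of the endpoint behavior is to draw many values and look at which ones show up. This sketch uses a seeded random.Random instance (an illustrative choice) and also demonstrates the optional step argument of randrange():

```python
import random

rng = random.Random(0)

closed = {rng.randint(1, 3) for _ in range(1000)}        # [1, 3]
half_open = {rng.randrange(1, 3) for _ in range(1000)}   # [1, 3)
evens = {rng.randrange(0, 10, 2) for _ in range(1000)}   # 0, 2, 4, 6, 8

print(sorted(closed))     # values never fall outside 1..3
print(sorted(half_open))  # 3 never appears
print(sorted(evens))      # only multiples of 2 below 10
```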

If you need to generate random floats that lie within a specific [x, y] interval, you can use random.uniform(), which plucks from the continuous uniform distribution:

>>> random.uniform(20, 30)
27.42639687016509
>>> random.uniform(30, 40)
36.33865802745107

To pick a random element from a non-empty sequence (like a list or a tuple), you can use random.choice(). There is also random.choices() for choosing multiple elements from a sequence with replacement (duplicates are possible):

>>> items = ['one', 'two', 'three', 'four', 'five']
>>> random.choice(items)
'four'
>>> random.choices(items, k=2)
['three', 'three']
>>> random.choices(items, k=3)
['three', 'five', 'four']
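random.choices() also accepts a weights argument if some elements should be picked more often than others. As a hedged illustration (the weights below are made up), this draws 1,000 items where 'one' is five times as likely as each of the others and tallies the result:

```python
import random
from collections import Counter

rng = random.Random(444)
items = ['one', 'two', 'three', 'four', 'five']
weights = [5, 1, 1, 1, 1]   # relative weights; they need not sum to 1

draws = rng.choices(items, weights=weights, k=1000)
counts = Counter(draws)
print(counts.most_common())
```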

To mimic sampling without replacement, use random.sample():

>>> random.sample(items, 4)
['one', 'five', 'four', 'three']

You can randomize a sequence in-place using random.shuffle(). This will modify the sequence object and randomize the order of elements:

>>> random.shuffle(items)
>>> items
['four', 'three', 'two', 'one', 'five']

If you’d rather not mutate the original list, you’ll need to make a copy first and then shuffle the copy. You can create copies of Python lists with the copy module, or just x[:] or x.copy(), where x is the list.
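Both approaches to getting a shuffled copy can be sketched as follows. The second one leans on random.sample(), which returns a new list and leaves its input untouched; the variable names are illustrative.

```python
import random

items = ['one', 'two', 'three', 'four', 'five']

# Option 1: copy first, then shuffle the copy in place.
shuffled_copy = items.copy()
random.shuffle(shuffled_copy)

# Option 2: sample every element without replacement.
sampled = random.sample(items, k=len(items))

print(items)          # the original order is preserved
print(shuffled_copy)
print(sampled)
```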

Application one: generating a sequence of unique random strings of uniform length.

import random
import string

def unique_strings(k: int, ntokens: int,
                   pool: str = string.ascii_letters) -> set:
    """Generate a set of unique string tokens.

    k: Length of each token
    ntokens: Number of tokens
    pool: Iterable of characters to choose from

    For a highly optimized version:
    https://stackoverflow.com/a/48421303/7954504
    """
    seen = set()
    # An optimization for tightly-bound loops:
    # Bind these methods outside of a loop
    join = ''.join
    add = seen.add
    # Note: this loops forever if ntokens exceeds the number of
    # distinct tokens of length k that can be built from `pool`.
    while len(seen) < ntokens:
        token = join(random.choices(pool, k=k))
        add(token)
    return seen

Examples:

>>> unique_strings(k=4, ntokens=5)
{'AsMk', 'Cvmi', 'GIxv', 'HGsZ', 'eurU'}
>>> unique_strings(5, 4, string.printable)
{"'O*1!", '9Ien%', 'W=m7<', 'mUD|z'}

PRNGs for Arrays: `numpy.random`

One thing you might have noticed is that a majority of the functions from random return a scalar value (a single int, float, or other object). If you wanted to generate a sequence of random numbers, one way to achieve that would be with a Python list comprehension:

>>> [random.random() for _ in range(5)]
[0.021655420657909374,
 0.4031628347066195,
 0.6609991871223335,
 0.5854998250783767,
 0.42886606317322706]

But there is another option that is specifically designed for this. You can think of NumPy’s own numpy.random package as being like the standard library’s random, but for NumPy arrays. (It also comes loaded with the ability to draw from a lot more statistical distributions.)

Take note that numpy.random uses its own PRNG that is separate from plain old random. You won’t produce deterministically random NumPy arrays with a call to Python’s own random.seed():

>>> import numpy as np
>>> np.random.seed(444)
>>> np.set_printoptions(precision=2)  # Output decimal fmt.

Without further ado, here are a few examples to whet your appetite:

>>> # Return samples from the standard normal distribution
>>> np.random.randn(5)
array([ 0.36,  0.38,  1.38,  1.18, -0.94])
>>> np.random.randn(3, 4)
array([[-1.14, -0.54, -0.55,  0.21],
       [ 0.21,  1.27, -0.81, -3.3 ],
       [-0.81, -0.36, -0.88,  0.15]])
>>> # `p` is the probability of choosing each element
>>> np.random.choice([0, 1], p=[0.6, 0.4], size=(5, 4))
array([[0, 0, 1, 0],
       [0, 1, 1, 1],
       [1, 1, 1, 0],
       [0, 0, 0, 1],
       [0, 1, 0, 1]])

In the syntax for randn(d0, d1, ..., dn), the parameters d0, d1, ..., dn are optional and indicate the shape of the final object. Here, np.random.randn(3, 4) creates a 2d array with 3 rows and 4 columns. The data will be i.i.d. (independent and identically distributed), meaning that each data point is drawn independently of the others and from the same distribution.
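NumPy 1.17 and later also provide a Generator-based interface through np.random.default_rng(), which is not covered in the rest of this tutorial. As a rough sketch of how the calls above might look with it (note that it uses a different underlying algorithm, so the same seed will not reproduce the numbers shown earlier):

```python
import numpy as np

rng = np.random.default_rng(444)    # a self-contained, seeded generator

normals = rng.standard_normal(5)            # like np.random.randn(5)
grid = rng.standard_normal((3, 4))          # shape is passed as a tuple
picks = rng.choice([0, 1], p=[0.6, 0.4], size=(5, 4))
rolls = rng.integers(1, 7, size=10)         # upper bound is exclusive

print(normals.shape, grid.shape, picks.shape)  # (5,) (3, 4) (5, 4)
print(rolls.min() >= 1 and rolls.max() <= 6)   # True
```

Because the generator is an object rather than module-level state, you can pass it around explicitly and keep separate streams for separate parts of a program.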

Another common operation is to create a sequence of random Boolean values, True or False. One way to do this would be with np.random.choice([True, False]). However, it’s actually about 4x faster to choose from (0, 1) and then view-cast these integers to their corresponding Boolean values:

>>> # NumPy's `randint` is [inclusive, exclusive), unlike `random.randint()`
>>> np.random.randint(0, 2, size=25, dtype=np.uint8).view(bool)
array([ True, False,  True,  True, False,  True, False, False, False,
       False, False,  True,  True, False, False, False,  True, False,
        True, False,  True,  True,  True, False,  True])

What about generating correlated data? Let’s say you want to simulate two correlated time series. One way of going about this is with NumPy’s multivariate_normal() function, which takes a covariance matrix into account. Recall that to draw from a single normally distributed random variable, you need to specify its mean and variance (or standard deviation).

To sample from the multivariate normal distribution, you specify the means and covariance matrix, and you end up with multiple, correlated series of data that are each approximately normally distributed. However, rather than covariance, correlation is a measure that is more familiar and intuitive to most. It’s the covariance normalized by the product of standard deviations, and so you can also define covariance in terms of correlation and standard deviation:
$$\operatorname{cov}(x, y) = \operatorname{corr}(x, y) \cdot \sigma_x \cdot \sigma_y$$
So, could you draw random samples from a multivariate normal distribution by specifying a correlation matrix and standard deviations? Yes, but you’ll need to get the above into matrix form first. Here, S is a vector of the standard deviations, P is their correlation matrix, and C is the resulting (square) covariance matrix:
$$C = \operatorname{diag}(S) \, P \, \operatorname{diag}(S)$$

This can be expressed in NumPy as follows:

def corr2cov(p: np.ndarray, s: np.ndarray) -> np.ndarray:
    """Covariance matrix from correlation & standard deviations"""
    d = np.diag(s)
    return d @ p @ d
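As a small worked check of this function (restated here so that the block runs on its own), take a correlation of -0.40 and standard deviations of 6 and 1. The diagonal of the result should hold the variances, 36 and 1, and the off-diagonal entries should be -0.40 * 6 * 1 = -2.4. The element-wise form using np.outer() is an equivalent formulation added for comparison.

```python
import numpy as np


def corr2cov(p: np.ndarray, s: np.ndarray) -> np.ndarray:
    """Covariance matrix from correlation & standard deviations"""
    d = np.diag(s)
    return d @ p @ d


corr = np.array([[1.0, -0.40],
                 [-0.40, 1.0]])
stdev = np.array([6.0, 1.0])

cov = corr2cov(corr, stdev)
print(cov)

# Element-wise equivalent: C[i, j] = P[i, j] * S[i] * S[j]
elementwise = corr * np.outer(stdev, stdev)
print(np.allclose(cov, elementwise))    # True

# Round trip: dividing out the standard deviations recovers P.
sd = np.sqrt(np.diag(cov))
recovered = cov / np.outer(sd, sd)
print(np.allclose(recovered, corr))     # True
```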

Now, you can generate two time series that are correlated but still random:

>>> # Start with a correlation matrix and standard deviations.
>>> # -0.40 is the correlation between A and B, and the correlation
>>> # of a variable with itself is 1.0.
>>> corr = np.array([[1., -0.40],
...                  [-0.40, 1.]])
>>> # Standard deviations/means of A and B, respectively
>>> stdev = np.array([6., 1.])
>>> mean = np.array([2., 0.5])
>>> cov = corr2cov(corr, stdev)
>>> # `size` is the length of time series for 2d data
>>> # (500 months, days, and so on).
>>> data = np.random.multivariate_normal(mean=mean, cov=cov, size=500)
>>> data[:10]
array([[ 0.58,  1.87],
       [-7.31,  0.74],
       [-6.24,  0.33],
       [-0.77,  1.19],
       [ 1.71,  0.7 ],
       [-3.33,  1.57],
       [-1.13,  1.23],
       [-6.58,  1.81],
       [-0.82, -0.34],
       [-2.32,  1.1 ]])
>>> data.shape
(500, 2)

You can think of data as 500 pairs of inversely correlated data points. Here’s a sanity check that you can back into the original inputs, which approximate corr, stdev, and mean from above:

>>> np.corrcoef(data, rowvar=False)
array([[ 1.  , -0.39],
       [-0.39,  1.  ]])
>>> data.std(axis=0)
array([5.96, 1.01])
>>> data.mean(axis=0)
array([2.13, 0.49])

Before we move on to CSPRNGs, it might be helpful to summarize some random functions and their numpy.random counterparts:

Python `random` Module	NumPy Counterpart	Description
`random()`	`rand()`	Random float in [0.0, 1.0)
`randint(a, b)`	`random_integers()` (deprecated; use `randint(a, b + 1)`)	Random integer in [a, b]
`randrange(a, b[, step])`	`randint()`	Random integer in [a, b)
`uniform(a, b)`	`uniform()`	Random float in [a, b]
`choice(seq)`	`choice()`	Random element from `seq`
`choices(seq, k=1)`	`choice()`	Random `k` elements from `seq` with replacement
`sample(population, k)`	`choice()` with `replace=False`	Random `k` elements from `population` without replacement
`shuffle(x)`	`shuffle()`	Shuffle the sequence `x` in place
`normalvariate(mu, sigma)` or `gauss(mu, sigma)`	`normal()`	Sample from a normal distribution with mean `mu` and standard deviation `sigma`

Note: NumPy is specialized for building and manipulating large, multidimensional arrays. If you just need a single value, random will suffice and will probably be faster as well. For small sequences, random may even be faster too, because NumPy does come with some overhead.

CSPRNGs in Python

`os.urandom()`: About as Random as It Gets

Python’s os.urandom() function is used by both secrets and uuid (both of which you’ll see here in a moment). Without getting into too much detail, os.urandom() generates operating-system-dependent random bytes that can safely be called cryptographically secure:

- On Unix operating systems, it reads random bytes from the special file /dev/urandom, which in turn allows “access to environmental noise collected from device drivers and other sources.” (Thank you, Wikipedia.) This is garbled information that is particular to your hardware and system state at an instant in time but at the same time sufficiently random.
- On Windows, the Windows API function CryptGenRandom() is used (Python 3.11 and later use its successor, BCryptGenRandom()). This function is still technically pseudorandom, but it works by generating a seed value from variables such as the process ID, memory status, and so on.

With os.urandom(), there is no concept of manually seeding. While still technically pseudorandom, this function better aligns with how we think of randomness. The only argument is the number of bytes to return:

>>> import os
>>> os.urandom(3)
b'\xa2\xe8\x02'
>>> x = os.urandom(6)
>>> x
b'\xce\x11\xe7"!\x84'
>>> type(x), len(x)
(bytes, 6)
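Raw bytes are not always the most convenient form to work with. As a brief sketch, here are two standard ways to turn the result into a hex string or an integer, using only built-in bytes and int methods (the variable names are illustrative):

```python
import os

raw = os.urandom(8)                             # 8 random bytes = 64 bits

as_hex = raw.hex()                              # two hex digits per byte
as_int = int.from_bytes(raw, byteorder="big")   # unsigned integer

print(len(raw), len(as_hex))    # 8 16
print(0 <= as_int < 2 ** 64)    # True
```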

Python’s Best Kept `secrets`

Introduced in Python 3.6 by one of the more colorful PEPs out there, the secrets module is intended to be the de facto Python module for generating cryptographically secure random bytes and strings.

You can check out the source code for the module, which is short and sweet at about 25 lines of code. secrets is basically a wrapper around os.urandom(). It exports just a handful of functions for generating random numbers, bytes, and strings. Most of these examples should be fairly self-explanatory:

>>> import secrets
>>> n = 16
>>> # Generate secure tokens
>>> secrets.token_bytes(n)
b'A\x8cz\xe1o\xf9!;\x8b\xf2\x80pJ\x8b\xd4\xd3'
>>> secrets.token_hex(n)
'9cb190491e01230ec4239cae643f286f'  
>>> secrets.token_urlsafe(n)
'MJoi7CknFu3YN41m88SEgQ'
>>> # Secure version of `random.choice()`
>>> secrets.choice('rain')
'a'
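Putting a few of these functions together, here is one way you might build a random password and compare tokens safely. This is a sketch rather than a vetted password policy: the helper name make_password, the alphabet, and the default length are all assumptions made for illustration.

```python
import secrets
import string


def make_password(length=16):
    """Build a password by drawing each character with secrets.choice()."""
    alphabet = string.ascii_letters + string.digits
    return ''.join(secrets.choice(alphabet) for _ in range(length))


password = make_password()
pin = secrets.randbelow(10_000)     # integer in [0, 10000)
token = secrets.token_hex(16)       # 16 bytes -> 32 hex digits

# compare_digest() is intended to reduce timing side channels
# when checking a supplied token against the expected one.
matches = secrets.compare_digest(token, token)

print(len(password), len(token), matches)   # 16 32 True
```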

One Last Candidate: `uuid`

One last option for generating a random token is the uuid4() function from Python’s uuid module. A UUID is a Universally Unique Identifier, a 128-bit value (commonly written as 32 hexadecimal digits) designed to “guarantee uniqueness across space and time.” uuid4() is one of the module’s most useful functions, and this function also uses os.urandom():

>>> import uuid
>>> uuid.uuid4()
UUID('3e3ef28d-3ff0-4933-9bba-e5ee91ce0e7b')
>>> uuid.uuid4()
UUID('2e115fcb-5761-4fa1-8287-19f4ee2877ac')

The nice thing is that all of uuid’s functions produce an instance of the UUID class, which encapsulates the ID and has properties like .int, .bytes, and .hex:

>>> tok = uuid.uuid4()
>>> tok.bytes
b'.\xb7\x80\xfd\xbfIG\xb3\xae\x1d\xe3\x97\xee\xc5\xd5\x81'
>>> len(tok.bytes)
16
>>> len(tok.bytes) * 8  # In bits
128
>>> tok.hex
'2eb780fdbf4947b3ae1de397eec5d581'
>>> tok.int
62097294383572614195530565389543396737

You may also have seen some other variations: uuid1(), uuid3(), and uuid5(). The key difference between these and uuid4() is that those three functions all take some form of input and therefore don’t meet the definition of “random” to the extent that a Version 4 UUID does:

- uuid1() uses your machine’s host ID and current time by default. Because of the reliance on current time down to 100-nanosecond resolution, this version is where UUID derives the claim “guaranteed uniqueness across time.”
- uuid3() and uuid5() both take a namespace identifier and a name. The former uses an MD5 hash and the latter uses SHA-1.
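The practical consequence of taking input is that name-based UUIDs are repeatable. A short sketch, using the standard uuid.NAMESPACE_DNS namespace and an arbitrary example name:

```python
import uuid

# Same namespace and name always give the same UUID.
a = uuid.uuid5(uuid.NAMESPACE_DNS, "example.com")
b = uuid.uuid5(uuid.NAMESPACE_DNS, "example.com")
c = uuid.uuid3(uuid.NAMESPACE_DNS, "example.com")

print(a == b)                  # True
print(a.version, c.version)    # 5 3
print(a == c)                  # False: different hash functions
```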

uuid4(), conversely, is entirely pseudorandom (or random). It consists of getting 16 bytes via os.urandom(), converting this to a big-endian integer, and doing a number of bitwise operations to comply with the formal specification.
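The steps just described can be approximated by hand. The sketch below passes 16 random bytes to the UUID constructor with version=4, which sets the version and variant bits for you; it is meant to illustrate the idea, not to replace uuid.uuid4().

```python
import os
import uuid

raw = os.urandom(16)                    # 128 random bits
tok = uuid.UUID(bytes=raw, version=4)   # bytes are read as big-endian

print(tok.version)                      # 4
print(tok.variant == uuid.RFC_4122)     # True

# Only the 4 version bits and 2 variant bits are overwritten,
# so 122 of the 128 bits remain random.
fixed_bits = (0xF000 << 64) | (0xC000 << 48)
changed = tok.int ^ int.from_bytes(raw, byteorder="big")
print(changed & ~fixed_bits == 0)       # True
```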

One common use of uuid is in Django, which has a UUIDField that is often used as a primary key in a model’s underlying relational database.

Recap

You’ve covered a lot of ground in this tutorial. To recap, here is a high-level comparison of the options available to you for engineering randomness in Python:

Package/Module	Description	Cryptographically Secure
`random`	Fast & easy random data using Mersenne Twister	No
`numpy.random`	Like `random` but for (possibly multidimensional) arrays	No
`os`	Contains `urandom()`, the base of other functions covered here	Yes
`secrets`	Designed to be Python’s de facto module for generating secure random numbers, bytes, and strings	Yes
`uuid`	Home to a handful of functions for building 128-bit identifiers	Yes, `uuid4()`

References:

Generating Random Data in Python (Guide)

Generating Random Data in Python