Uniques in a bootstrap
This post introduces the bootstrap as a sampling technique and works out the expected number of unique elements in a bootstrap sample.
What is the bootstrap?
In statistical machine learning, the bootstrap refers to drawing $n$ items from a dataset of $n$ items, with replacement.
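As a minimal sketch of the mechanics (using NumPy, with a made-up eight-element dataset), a bootstrap sample can be drawn like this:

```python
import numpy as np

data = np.array([3, 1, 4, 1, 5, 9, 2, 6])  # a dataset of n = 8 items
# Draw n items uniformly at random with replacement:
# some items will repeat and others will be left out.
sample = np.random.choice(data, size=len(data), replace=True)
print(sample)
```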
Why is the bootstrap relevant?
In certain scenarios, such as clinical research trials, data is not always available in abundance, yet point estimates still need to be calculated. The bootstrap is a commonly used technique in such cases.
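To illustrate one such use, here is a minimal sketch (the data, sample size, and number of resamples are made up for illustration) of the percentile bootstrap, which attaches a confidence interval to a point estimate such as the mean:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small study: 25 measurements.
data = rng.normal(loc=5.0, scale=2.0, size=25)

# Recompute the mean on many bootstrap samples of the data.
boot_means = [rng.choice(data, size=len(data), replace=True).mean()
              for _ in range(10_000)]

# Percentile bootstrap 95% confidence interval for the mean.
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {data.mean():.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```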
Where is the bootstrap used?
It is used in the Random Forests algorithm: each decision tree is trained on a bootstrapped sample of the dataset.
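For example, in scikit-learn (assuming it is available) the resampling happens automatically: RandomForestClassifier draws a bootstrap sample for each tree when its bootstrap flag is enabled, which is the default. A minimal sketch on a made-up dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# A synthetic classification dataset, just for illustration.
X, y = make_classification(n_samples=500, random_state=0)

# Each of the 100 trees is fit on its own bootstrap sample of the 500 rows.
model = RandomForestClassifier(n_estimators=100, bootstrap=True, random_state=0)
model.fit(X, y)
```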
How many uniques are there in a bootstrap sample?
Suppose the dataset has $n$ elements and we select $n$ elements from it with replacement. We assume that each element is equally likely to be selected, i.e. we sample uniformly at random, and that the $n$ draws are independent. Let $x$ be an arbitrary element of the dataset. Then,
\[\begin{array}{lcl} \Pr(\textrm{selecting}\; x) & = & \frac{1}{n} \\ \Pr(\textrm{not selecting}\; x) & = & 1 - \frac{1}{n} \\ \Pr(\textrm{not selecting}\; x\; \textrm{in bootstrap sample}) & = & \bigg(1 - \frac{1}{n}\bigg)^n \\ \Pr(\textrm{selecting}\; x\; \textrm{in bootstrap sample}) & = & 1 - \bigg(1 - \frac{1}{n}\bigg)^n \end{array}\]

Note that
\[\lim_{n \to \infty} \bigg[1 - \bigg(1-\frac{1}{n}\bigg)^n\bigg] = 1 - \frac{1}{e} \approx 0.632\]

where $e$ is Euler's number $2.718281828 \dots$
The above analysis holds for any element $x$: when we draw $n$ points, every element of the dataset has roughly a $0.632$ chance of appearing in the bootstrap sample. In other words, the expected number of uniques will be approximately $0.632 \cdot n$, or $63.2\%$ of the number of elements in the original dataset.
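As a quick sanity check, the probability $1 - (1 - \frac{1}{n})^n$ can be evaluated directly for a few values of $n$:

```python
# Evaluate 1 - (1 - 1/n)^n for increasing n; it approaches 1 - 1/e ≈ 0.6321.
for n in [10, 100, 1_000, 10_000, 100_000]:
    print(n, 1 - (1 - 1 / n) ** n)
```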
For large values of $n$ we expect approximately 63% unique values in a bootstrap sample, but what does large mean in this context?
To answer this, I ran simulations with mock datasets of sizes from 10 to 1,000,000. For each dataset size, 30 simulations were run. The median, average, and standard deviation of the number of uniques observed across the trials are reported below.
Median, average, and standard deviation of the number of uniques:

| Dataset size | Median | Average | Standard deviation |
|---|---|---|---|
| 10 | 7 | 6.5 | 0.92 |
| 100 | 63 | 63.2 | 3.41 |
| 1000 | 636 | 631.16 | 9.94 |
| 10000 | 6321 | 6322.86 | 27.79 |
| 100000 | 63220 | 63206.86 | 97.69 |
| 1000000 | 632080 | 632090.13 | 284.57 |
Here’s the code used to run the simulations:
```python
import numpy as np


def uniques_in_bootstrap(list_size: int) -> int:
    # Select uniformly at random with replacement and return the number of unique elements.
    samples = np.random.choice(list_size, size=list_size, replace=True)
    return len(set(samples))


def main() -> None:
    dataset_sizes = [10 ** i for i in range(1, 7)]
    # Iterate over the different dataset sizes.
    for size in dataset_sizes:
        uniques = []
        # Do 30 simulations for each dataset size.
        for _ in range(30):
            uniques.append(uniques_in_bootstrap(size))
        print(
            f"Dataset size = {size}, Median = {np.median(uniques)}, "
            f"Average = {np.mean(uniques)}, Std. Dev. = {np.std(uniques)}"
        )


if __name__ == "__main__":
    main()
```
To summarize, the experiments show that for datasets of size 100 and larger, the empirical fraction of unique elements is already close to the theoretical value of $63.2\%$.