
Bootstrapping and Its Usage in Machine Learning

Updated: Feb 11, 2023



Bootstrapping is a statistical method that resamples a small sample of data to make inferences about a larger population. It is called "bootstrapping" because it involves "pulling oneself up by one's own bootstraps": using only one's own data to estimate population parameters, without needing external information or assumptions.


In Python, you can quickly bootstrap your data using NumPy's np.random.choice function; here is a quick example:

import numpy as np
np.random.seed(42) # enforce reproducibility

sample = ['a', 'b', 'c', 'd', 'e', 'f', 'g'] 

# generate 3 bootstrap samples of size 4
for i in range(3):
    print(np.random.choice(a=sample, size=4, replace=True))

>>> ['g' 'd' 'e' 'g'] # 1st bootstrap
>>> ['c' 'e' 'e' 'g'] # 2nd bootstrap 
>>> ['b' 'c' 'g' 'c'] # 3rd bootstrap

You can easily notice that each element of the list can appear in the bootstrapped sample multiple times; this is achieved by setting replace=True, i.e., sampling with replacement.


In practice, bootstrapping involves generating a large number of samples with replacement from the original sample and then using these samples to calculate estimates of various statistical quantities, such as means, standard deviations, and confidence intervals. These estimates can then be used to make inferences about the larger population from which the original sample was drawn.
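For instance, here is a minimal sketch of this idea using a synthetic sample (the data and the choice of B are purely illustrative): the mean of each bootstrap replicate is recorded, and the spread of those means gives a standard error and a 95% percentile confidence interval.

import numpy as np

rng = np.random.default_rng(42)

# a small synthetic sample standing in for the observed data
data = rng.normal(loc=10, scale=2, size=30)

# draw B bootstrap samples (with replacement, same size as the data)
# and record the mean of each one
B = 5000
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(B)
])

print("bootstrap estimate of the mean:", boot_means.mean())
print("bootstrap standard error:", boot_means.std(ddof=1))
print("95% percentile interval:", np.percentile(boot_means, [2.5, 97.5]))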


Within a Machine Learning framework, the bootstrap method provides a direct computational way of assessing uncertainty by sampling from the training data. However, it is often confused with cross-validation as, at first sight, they seem to be doing the same thing.

K-Fold cross-validation: part of the available data is used to fit the model and a different part to test its performance. In this example, the data are split into K=5 folds.

Bootstrapping samples the data at random with replacement, and each bootstrap sample can be as large as the original dataset. Cross-validation, in contrast, samples without replacement: no element is repeated within a subset, and the dataset is split into K unique folds, so each training set (made of K − 1 folds) is always smaller than the original dataset.
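A quick sketch on a toy array of indices makes the contrast concrete (the array size and K=5 are chosen purely for illustration):

import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
indices = np.arange(10)

# bootstrap: sampled with replacement, same size as the original, repeats allowed
print(rng.choice(indices, size=indices.size, replace=True))

# K-Fold: the data are split into K disjoint folds, no repeats within a split
for train_idx, test_idx in KFold(n_splits=5).split(indices):
    print("train:", train_idx, "test:", test_idx)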


Now that you understand the main differences between bootstrapping and cross-validation, let's focus on a practical case where both techniques are used in an ML context; you will find the Python script in this link.

 

Let us generate a dataset with N = 101 data points, with x denoting a one-dimensional vector and y the corresponding continuous outcome (it could also be categorical). The goal is to train an ML model μ(x) capable of learning the relationship between x and y. In order to estimate the generalization performance of a learning method on independent test data, let us set aside 25% of the data (26 data points).
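The exact data-generating process lives in the linked script; as a stand-in, here is a minimal sketch that assumes a simple noisy non-linear relationship and reproduces the 75/26 split:

import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

N = 101
x = np.sort(rng.uniform(-3, 3, size=N))           # one-dimensional input
y = np.sin(x) + rng.normal(scale=0.5, size=N)     # continuous outcome (assumed form)

# hold out 25% of the data as an independent test set: 26 points, 75 for training
x_train, x_test, y_train, y_test = train_test_split(
    x.reshape(-1, 1), y, test_size=0.25, random_state=42
)
print(x_train.shape, x_test.shape)   # (75, 1) (26, 1)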



Suppose we decide to fit a Support Vector Machine (SVM) to the data, and we optimize the hyperparameter set {degree, C} via cross-validated grid search with shuffle split (number of folds k=5), using the mean absolute error regression loss as the optimization metric. The procedure delivers an SVM with C=40 and degree=2. For this example, the Root Mean Square Errors (RMSE) are 1.528 and 1.852 for the training and test sets, respectively.
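Continuing from the split above, the tuning step could look like the following sketch; the polynomial kernel and the grid values are assumptions for illustration, not necessarily what the linked script uses.

import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.metrics import mean_squared_error

# hypothetical hyperparameter grid
param_grid = {"degree": [1, 2, 3, 4], "C": [1, 10, 40, 100]}

search = GridSearchCV(
    SVR(kernel="poly"),
    param_grid,
    scoring="neg_mean_absolute_error",     # mean absolute error regression loss
    cv=ShuffleSplit(n_splits=5, test_size=0.2, random_state=42),
)
search.fit(x_train, y_train)
print(search.best_params_)                 # e.g. {'C': 40, 'degree': 2}

svm_best = search.best_estimator_
rmse_train = np.sqrt(mean_squared_error(y_train, svm_best.predict(x_train)))
rmse_test = np.sqrt(mean_squared_error(y_test, svm_best.predict(x_test)))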


Here is how we could apply the bootstrap in this example. Let us draw B = 10, 50, and 200 datasets with replacement from the training data, each of the same size as the training set, i.e. N = 75. For each bootstrap dataset, we fit the SVM derived in the previous step with the same hyperparameters (C=40 and degree=2).
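A minimal sketch of this step, reusing x_train, y_train, and the hyperparameters found above (the polynomial kernel is again an assumption), stores the prediction of every bootstrap fit on a common grid:

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(42)

B = 200                                   # likewise for B = 10 and B = 50
n_train = len(x_train)                    # 75 training points
x_grid = np.linspace(x_train.min(), x_train.max(), 200).reshape(-1, 1)

boot_preds = np.empty((B, len(x_grid)))
for b in range(B):
    idx = rng.choice(n_train, size=n_train, replace=True)   # bootstrap indices
    svm_b = SVR(kernel="poly", C=40, degree=2)               # same hyperparameters
    svm_b.fit(x_train[idx], y_train[idx])
    boot_preds[b] = svm_b.predict(x_grid)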


Bootstrap during model fit
From left to right: 10, 50, and 200 bootstrap replicates produced from the training set.

As B increases, it becomes possible to derive a reliable approximation of the 95% pointwise confidence bands in a nonparametric fashion. The method is "model-free" in the sense that it uses the data itself rather than a specific parametric model.
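Under the same assumptions as the sketches above, the 95% pointwise band is simply the 2.5th and 97.5th percentiles of the bootstrap predictions at every grid point:

import numpy as np
import matplotlib.pyplot as plt

lower = np.percentile(boot_preds, 2.5, axis=0)    # pointwise 2.5th percentile
upper = np.percentile(boot_preds, 97.5, axis=0)   # pointwise 97.5th percentile

plt.scatter(x_train, y_train, s=15, label="training data")
plt.fill_between(x_grid.ravel(), lower, upper, alpha=0.3, label="95% bootstrap band")
plt.legend()
plt.show()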



In general, cross-validation is used to test the ML model's generalization capabilities and to optimize hyperparameters with grid-search techniques, while bootstrapping is used to assess the model's uncertainty via confidence bands.


Instead of using cross-validation techniques, you might think of estimating the prediction error with the bootstrap, e.g., using each bootstrap dataset as the training sample and the original sample as the validation sample. However, because each bootstrap replicate significantly overlaps the original data (on average, about 63.2% of the original observations appear in each replicate), you will inevitably underestimate the true prediction error.
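You can check the size of this overlap yourself: on average, only about 63.2% of the original observations appear in a bootstrap sample of the same size, as the toy computation below shows.

import numpy as np

rng = np.random.default_rng(0)
n = 75   # size of the training set in our example

# fraction of distinct original points that end up in one bootstrap sample,
# averaged over many replicates
fractions = [
    np.unique(rng.choice(n, size=n, replace=True)).size / n
    for _ in range(1000)
]
print(np.mean(fractions))     # ≈ 0.63
print(1 - np.exp(-1))         # theoretical limit, ≈ 0.632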



Finally, it is worth highlighting a few potential disadvantages of using bootstrapping methods:

  1. The bootstrapping process can be computationally intensive, especially when generating a large number of samples;

  2. If the sample taken into consideration is not representative of the larger population, the estimates produced by bootstrapping may be biased;

  3. In more complex data situations, it might not be straightforward to generate bootstrap samples, e.g., in the case of time series data (see the sketch below).
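For time series, one common workaround is a block bootstrap, which resamples contiguous blocks instead of individual points so that short-range dependence is preserved. Here is a rough sketch with a hypothetical helper, moving_block_bootstrap (the block size and the toy series are illustrative only):

import numpy as np

def moving_block_bootstrap(series, block_size, rng):
    # concatenate randomly chosen contiguous blocks until the original length is reached
    n = len(series)
    n_blocks = int(np.ceil(n / block_size))
    starts = rng.integers(0, n - block_size + 1, size=n_blocks)
    blocks = [series[s:s + block_size] for s in starts]
    return np.concatenate(blocks)[:n]

rng = np.random.default_rng(42)
ts = np.cumsum(rng.normal(size=100))      # toy autocorrelated series (random walk)
boot_ts = moving_block_bootstrap(ts, block_size=10, rng=rng)
print(boot_ts[:5])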



