Balanced Data Partition

  library(caret)
  set.seed(3456)  # fix the seed so the random split is reproducible
  trainIndex <- createDataPartition(iris$Species, p = .8,
                                    list = FALSE,
                                    times = 1)
  head(trainIndex)

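p: the proportion of the data that goes to training (.8 above, i.e. 80%)
list: a logical; if FALSE, the indices are returned as a matrix rather than a list
times: the number of partitions to create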
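As a minimal sketch of how the returned indices are typically used, the rows named in trainIndex can be taken as the training set and the remaining rows as the test set. The names irisTrain and irisTest are illustrative choices, not something fixed by caret.

```r
library(caret)

set.seed(3456)
trainIndex <- createDataPartition(iris$Species, p = .8,
                                  list = FALSE,
                                  times = 1)

# Rows listed in trainIndex form the training set; the rest are held out.
irisTrain <- iris[trainIndex, ]
irisTest  <- iris[-trainIndex, ]

# Class counts in each subset, to check that the split is balanced.
table(irisTrain$Species)
table(irisTest$Species)
```

Because the sampling is done within each level of iris$Species, the class proportions in both subsets should match those of the full data.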

Similarly, createResample can be used to make simple bootstrap samples, and createFolds can be used to generate balanced cross-validation groupings from a set of data.

createFolds(y, k = 10, list = TRUE, returnTrain = FALSE)
createMultiFolds(y, k = 10, times = 5)
createTimeSlices(y, initialWindow, horizon = 1, fixedWindow = TRUE,
skip = 0)
groupKFold(group, k = length(unique(group)))
createResample(y, times = 10, list = TRUE)
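The signatures above can be exercised on the iris data as a rough illustration. This is only a sketch: the variable names y, folds and boot are assumptions made for the example, and the seed is arbitrary.

```r
library(caret)

set.seed(3456)
y <- iris$Species

# 10 folds; each list element holds the row indices held out in that fold.
folds <- createFolds(y, k = 10, list = TRUE, returnTrain = FALSE)
sapply(folds, length)

# Class counts inside the first held-out fold.
table(y[folds[[1]]])

# 10 bootstrap samples; each is drawn with replacement and
# contains length(y) indices, so some rows appear more than once.
boot <- createResample(y, times = 10, list = TRUE)
sapply(boot, length)
```

With returnTrain = FALSE the folds are disjoint and together cover every row once, whereas each bootstrap sample from createResample may repeat rows and omit others.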

Data Splitting for Time Series

Simple random sampling is probably not the best way to resample time series data. Hyndman and Athanasopoulos (2013) discuss rolling forecasting origin techniques that move the training and test sets in time. caret contains a function called createTimeSlices that can create the indices for this type of splitting.
The three parameters for this type of splitting are:

  • initialWindow: the initial number of consecutive values in each training set sample
  • horizon: the number of consecutive values in each test set sample
  • fixedWindow: a logical; if FALSE, the training set always starts at the first sample and the training set size varies across data splits
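To make the three parameters concrete, here is a small sketch on a toy series of ten time points. The series y and the names fixed and growing are assumptions for illustration only.

```r
library(caret)

y <- 1:10  # toy series: ten consecutive time points

# Fixed-size training window that slides forward one step at a time.
fixed <- createTimeSlices(y, initialWindow = 5, horizon = 2,
                          fixedWindow = TRUE)

# Growing training window that always starts at the first observation.
growing <- createTimeSlices(y, initialWindow = 5, horizon = 2,
                            fixedWindow = FALSE)

# Both results are lists with two components, train and test,
# each holding one index vector per slice.
fixed$train[[1]]    # indices 1 to 5
fixed$test[[1]]     # indices 6 and 7

# Second slice: the fixed window drops its oldest point,
# while the growing window keeps it.
fixed$train[[2]]    # indices 2 to 6
growing$train[[2]]  # indices 1 to 6
```

In both cases the test indices immediately follow the training window, so each model is only ever evaluated on observations that come later in time than the ones it was fit on.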