PyTorch
PyTorch's two classes Dataset and DataLoader are responsible, respectively, for building datasets that PyTorch can consume and for feeding data to the training loop. If you want a customized dataset or a customized way of delivering data, you can also write your own subclasses.
A Dataset is one of the arguments used to instantiate a **DataLoader**.

When should you use a Dataset?

CIFAR10 is a dataset frequently used in computer-vision training. PyTorch ships CIFAR10 as a ready-made Dataset; using it takes only the following code:

```python
data = datasets.CIFAR10("./data/", transform=transform, train=True, download=True)
```

datasets.CIFAR10 is a Dataset subclass, and data is an instance of that class.
Sometimes you need to use your own data stored in a folder as the dataset. In that case, the convenient ImageFolder API does the job:

```python
FaceDataset = datasets.ImageFolder('./data', transform=img_transform)
```

How do you define a custom dataset?

**torch.utils.data.Dataset** is an abstract class representing a dataset. Any custom dataset must inherit from it and override the relevant methods.
A dataset, in essence, is simply a class responsible for mapping an index to a sample.
PyTorch provides two kinds of datasets: Map-style datasets and Iterable-style datasets.

Map-style datasets

A Map-style dataset must override the two special methods `__getitem__(self, index)` and `__len__(self)`, which express the mapping from index to sample (hence "Map").
For such a dataset `dataset`, for example, `dataset[idx]` reads the idx-th image of the dataset from disk together with its label (if there is one), and `len(dataset)` returns the size of the dataset.
A custom class looks roughly like this:

```python
class CustomDataset(data.Dataset):  # must inherit from data.Dataset
    def __init__(self):
        # TODO
        # 1. Initialize file path or list of file names.
        pass

    def __getitem__(self, index):
        # TODO
        # 1. Read one data item from file (e.g. using numpy.fromfile, PIL.Image.open).
        # 2. Preprocess the data (e.g. torchvision.Transform).
        # 3. Return a data pair (e.g. image and label).
        # Note: step 1 reads ONE sample, not the whole dataset.
        pass

    def __len__(self):
        # You should change 0 to the total size of your dataset.
        return 0
```
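To see the contract in action, here is a tiny runnable sketch that fills in the template; the ToyDataset class and its in-memory samples are illustrative, not part of any library (real datasets would read from disk instead):

```python
from torch.utils.data import Dataset

class ToyDataset(Dataset):
    def __init__(self, samples, labels):
        # Step 1 of the template: here the "files" are just in-memory lists.
        self.samples = samples
        self.labels = labels

    def __getitem__(self, index):
        # Map one integer index to one (sample, label) pair.
        return self.samples[index], self.labels[index]

    def __len__(self):
        return len(self.samples)

dataset = ToyDataset([10, 20, 30], ['a', 'b', 'c'])
print(len(dataset))   # 3
print(dataset[1])     # (20, 'b')
```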

Example 1, a dataset written for one of my own experiments: the image files are stored under "./data/faces/". The image names do not simply start from 1; instead, both the image ids and the label information are read from the dictionary saved in final_train_tag_dict.txt. You can read this code alongside the comments in the template above.

```python
from torch.utils import data
import numpy as np
from PIL import Image


class face_dataset(data.Dataset):
    def __init__(self):
        self.file_path = './data/faces/'
        f = open("final_train_tag_dict.txt", "r")
        self.label_dict = eval(f.read())
        f.close()

    def __getitem__(self, index):
        label = list(self.label_dict.values())[index - 1]
        img_id = list(self.label_dict.keys())[index - 1]
        img_path = self.file_path + str(img_id) + ".jpg"
        img = np.array(Image.open(img_path))
        return img, label

    def __len__(self):
        return len(self.label_dict)
```

Now let's look at the official MNIST dataset as an example:

```python
class MNIST(data.Dataset):
    """`MNIST <http://yann.lecun.com/exdb/mnist/>`_ Dataset.

    Args:
        root (string): Root directory of dataset where ``processed/training.pt``
            and ``processed/test.pt`` exist.
        train (bool, optional): If True, creates dataset from ``training.pt``,
            otherwise from ``test.pt``.
        download (bool, optional): If true, downloads the dataset from the internet and
            puts it in root directory. If dataset is already downloaded, it is not
            downloaded again.
        transform (callable, optional): A function/transform that takes in an PIL image
            and returns a transformed version. E.g, ``transforms.RandomCrop``
        target_transform (callable, optional): A function/transform that takes in the
            target and transforms it.
    """
    urls = [
        'http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz',
        'http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz',
        'http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz',
        'http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz',
    ]
    raw_folder = 'raw'
    processed_folder = 'processed'
    training_file = 'training.pt'
    test_file = 'test.pt'
    classes = ['0 - zero', '1 - one', '2 - two', '3 - three', '4 - four',
               '5 - five', '6 - six', '7 - seven', '8 - eight', '9 - nine']
    class_to_idx = {_class: i for i, _class in enumerate(classes)}

    @property
    def targets(self):
        if self.train:
            return self.train_labels
        else:
            return self.test_labels

    def __init__(self, root, train=True, transform=None, target_transform=None, download=False):
        self.root = os.path.expanduser(root)
        self.transform = transform
        self.target_transform = target_transform
        self.train = train  # training set or test set
        if download:
            self.download()
        if not self._check_exists():
            raise RuntimeError('Dataset not found.' +
                               ' You can use download=True to download it')
        if self.train:
            self.train_data, self.train_labels = torch.load(
                os.path.join(self.root, self.processed_folder, self.training_file))
        else:
            self.test_data, self.test_labels = torch.load(
                os.path.join(self.root, self.processed_folder, self.test_file))

    def __getitem__(self, index):
        """
        Args:
            index (int): Index

        Returns:
            tuple: (image, target) where target is index of the target class.
        """
        if self.train:
            img, target = self.train_data[index], self.train_labels[index]
        else:
            img, target = self.test_data[index], self.test_labels[index]
        # doing this so that it is consistent with all other datasets
        # to return a PIL Image
        img = Image.fromarray(img.numpy(), mode='L')
        if self.transform is not None:
            img = self.transform(img)
        if self.target_transform is not None:
            target = self.target_transform(target)
        return img, target

    def __len__(self):
        if self.train:
            return len(self.train_data)
        else:
            return len(self.test_data)

    def _check_exists(self):
        return os.path.exists(os.path.join(self.root, self.processed_folder, self.training_file)) and \
            os.path.exists(os.path.join(self.root, self.processed_folder, self.test_file))

    def download(self):
        """Download the MNIST data if it doesn't exist in processed_folder already."""
        from six.moves import urllib
        import gzip

        if self._check_exists():
            return

        # download files
        try:
            os.makedirs(os.path.join(self.root, self.raw_folder))
            os.makedirs(os.path.join(self.root, self.processed_folder))
        except OSError as e:
            if e.errno == errno.EEXIST:
                pass
            else:
                raise

        for url in self.urls:
            print('Downloading ' + url)
            data = urllib.request.urlopen(url)
            filename = url.rpartition('/')[2]
            file_path = os.path.join(self.root, self.raw_folder, filename)
            with open(file_path, 'wb') as f:
                f.write(data.read())
            with open(file_path.replace('.gz', ''), 'wb') as out_f, \
                    gzip.GzipFile(file_path) as zip_f:
                out_f.write(zip_f.read())
            os.unlink(file_path)

        # process and save as torch files
        print('Processing...')
        training_set = (
            read_image_file(os.path.join(self.root, self.raw_folder, 'train-images-idx3-ubyte')),
            read_label_file(os.path.join(self.root, self.raw_folder, 'train-labels-idx1-ubyte'))
        )
        test_set = (
            read_image_file(os.path.join(self.root, self.raw_folder, 't10k-images-idx3-ubyte')),
            read_label_file(os.path.join(self.root, self.raw_folder, 't10k-labels-idx1-ubyte'))
        )
        with open(os.path.join(self.root, self.processed_folder, self.training_file), 'wb') as f:
            torch.save(training_set, f)
        with open(os.path.join(self.root, self.processed_folder, self.test_file), 'wb') as f:
            torch.save(test_set, f)
        print('Done!')

    def __repr__(self):
        fmt_str = 'Dataset ' + self.__class__.__name__ + '\n'
        fmt_str += '    Number of datapoints: {}\n'.format(self.__len__())
        tmp = 'train' if self.train is True else 'test'
        fmt_str += '    Split: {}\n'.format(tmp)
        fmt_str += '    Root Location: {}\n'.format(self.root)
        tmp = '    Transforms (if any): '
        fmt_str += '{0}{1}\n'.format(tmp, self.transform.__repr__().replace('\n', '\n' + ' ' * len(tmp)))
        tmp = '    Target Transforms (if any): '
        fmt_str += '{0}{1}'.format(tmp, self.target_transform.__repr__().replace('\n', '\n' + ' ' * len(tmp)))
        return fmt_str
```

Iterable-style datasets

An Iterable-style dataset is a subclass of the abstract class **data.IterableDataset** that overrides the `__iter__` method to become an iterator. This kind of dataset is mainly used when the size of the data is unknown, or the data arrives as a stream and the set of local files is not fixed, so samples must be obtained by iteration rather than by index.
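A minimal sketch of such a dataset (the StreamDataset class is illustrative, not a library class; any Python iterable stands in here for a real stream such as a file handle or network socket):

```python
from torch.utils.data import IterableDataset, DataLoader

class StreamDataset(IterableDataset):
    def __init__(self, source):
        self.source = source  # any iterable; a stand-in for a real data stream

    def __iter__(self):
        # No __getitem__ / __len__: samples are produced in order from the stream.
        for item in self.source:
            yield item

# The DataLoader consumes the stream and still assembles batches:
loader = DataLoader(StreamDataset(range(5)), batch_size=2)
batches = [batch.tolist() for batch in loader]
print(batches)  # [[0, 1], [2, 3], [4]]
```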

DataLoader

Data loader. Combines a dataset and a sampler, and provides an iterable over the given dataset. —PyTorch documentation

In general, a PyTorch deep-learning training pipeline works like this:

  1. Create a Dataset
  2. Pass the Dataset to a DataLoader
  3. Iterate the DataLoader to produce training batches and feed them to the model

Correspondingly, the code usually contains these three parts:

```python
# Create the Dataset (may be custom)
dataset = face_dataset()  # the custom face_dataset from the Dataset section
# Pass the Dataset to a DataLoader
dataloader = torch.utils.data.DataLoader(dataset, batch_size=64, shuffle=False, num_workers=8)
# Iterate the DataLoader to produce training batches for the model
for i in range(epoch):
    for index, (img, label) in enumerate(dataloader):
        pass
```
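The same three steps can be made fully runnable with TensorDataset standing in for a custom dataset (the tensor shapes and the `epochs` value are made up for illustration):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# 1. Create the Dataset
images = torch.randn(100, 3, 32, 32)   # 100 fake RGB images
labels = torch.randint(0, 10, (100,))  # 100 fake class labels
dataset = TensorDataset(images, labels)

# 2. Pass the Dataset to a DataLoader
dataloader = DataLoader(dataset, batch_size=16, shuffle=True, num_workers=0)

# 3. Iterate the DataLoader to feed batches to the model
epochs = 1
for epoch in range(epochs):
    for index, (img, label) in enumerate(dataloader):
        pass  # the forward/backward pass would go here

# 100 samples at batch_size=16 -> 7 batches; the last holds 100 % 16 = 4 samples
print(len(dataloader), img.shape)
```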

At this point, PyTorch's dataset and data-feeding mechanism should be fairly clear: the Dataset is responsible for the index-to-sample mapping, while the **DataLoader** is responsible for iteratively producing batches of samples from the dataset according to a given strategy. During the enumerate loop, the dataloader calls its dataset's `__getitem__` method following the policy specified by its sampler parameter.

Parameters

Let's look at the parameters needed to instantiate a DataLoader; only a few of them deserve close attention.

```python
DataLoader(dataset, batch_size=1, shuffle=False, sampler=None,
           batch_sampler=None, num_workers=0, collate_fn=None,
           pin_memory=False, drop_last=False, timeout=0,
           worker_init_fn=None)
```

Parameter descriptions:

  • dataset (Dataset) – a previously defined Map-style or Iterable-style dataset.
  • batch_size (python:int, optional) – how many samples per batch (default: 1).
  • shuffle (bool, optional) – whether the data is reshuffled at every epoch (default: False).
  • sampler (Sampler, optional) – defines the strategy for drawing samples from the dataset. If specified, shuffle must be False.
  • batch_sampler (Sampler, optional) – like sampler, but returns the indices of a whole batch at a time. Mutually exclusive with batch_size, shuffle, sampler, and drop_last.
  • num_workers (python:int, optional) – how many subprocesses load data in parallel; 0 means data is loaded in the main process (default: 0).
  • collate_fn (callable, optional) – merges a list of samples to form a mini-batch.
  • pin_memory (bool, optional) – if True, the data loader copies tensors into CUDA pinned memory before returning them.
  • drop_last (bool, optional) – set to True to drop the last incomplete batch when the dataset size is not divisible by batch_size. If False and the size is not divisible, the last batch is simply smaller (default: False).
  • timeout (numeric, optional) – if positive, the time to wait for collecting a batch from a worker process; a batch not collected within this time is abandoned. Should always be non-negative (default: 0).
  • worker_init_fn (callable, optional) – a function run to initialize each worker (default: None).

dataset needs little explanation, but it is essential: it must be one of the two dataset types described above, with the relevant methods overridden.
batch_size is simply the number of samples in one training batch.
shuffle controls whether the order of training samples changes from epoch to epoch; usually True.
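A small demonstration of batch_size, drop_last, and collate_fn on a 10-sample dataset (a plain Python list serves as a Map-style dataset here; the dataset itself is illustrative):

```python
from torch.utils.data import DataLoader

data = list(range(10))  # a plain list already satisfies the Map-style contract

# drop_last=False keeps the final, smaller batch:
keep = [b.tolist() for b in DataLoader(data, batch_size=3)]
print(keep)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]

# drop_last=True discards it:
drop = [b.tolist() for b in DataLoader(data, batch_size=3, drop_last=True)]
print(drop)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8]]

# collate_fn replaces the default tensor-stacking logic; here every batch
# of samples is collapsed into the sum of its elements:
sums = list(DataLoader(data, batch_size=4, collate_fn=sum))
print(sums)  # [6, 22, 17]
```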

Samplers

sampler is a key parameter: a sampler is an iterator over indices. PyTorch provides several built-in samplers, and users can also define their own.
All samplers inherit from the abstract class torch.utils.data.sampler.Sampler.

```python
class Sampler(object):
    """Base class for all Samplers.

    Every Sampler subclass has to provide an __iter__ method, providing a way
    to iterate over indices of dataset elements, and a __len__ method that
    returns the length of the returned iterators.
    """
    # the base class for every index iterator

    def __init__(self, data_source):
        pass

    def __iter__(self):
        raise NotImplementedError

    def __len__(self):
        raise NotImplementedError
```
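Any subclass that yields valid indices from `__iter__` works. For instance, a custom sampler that walks the dataset backwards could look like this (ReverseSampler is an illustrative name, not a PyTorch class):

```python
from torch.utils.data import Sampler

class ReverseSampler(Sampler):
    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        # Yield indices len-1, len-2, ..., 0
        return iter(range(len(self.data_source) - 1, -1, -1))

    def __len__(self):
        return len(self.data_source)

sampler = ReverseSampler(range(5))
order = list(sampler)
print(order)  # [4, 3, 2, 1, 0]
```

Passing such an object as the sampler argument makes the DataLoader visit samples in exactly this order.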

PyTorch's built-in Samplers

  • SequentialSampler
  • RandomSampler
  • SubsetRandomSampler
  • WeightedRandomSampler

**SequentialSampler** is easy to understand: it samples sequentially.
It receives the dataset data_source at initialization; its __iter__ method then returns a range iterator of the same length as data_source, yielding one index at a time.

```python
class SequentialSampler(Sampler):
    r"""Samples elements sequentially, always in the same order.

    Arguments:
        data_source (Dataset): dataset to sample from
    """
    # yields indices in sequential order

    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        return iter(range(len(self.data_source)))

    def __len__(self):
        return len(self.data_source)
```
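A quick check of what this sampler actually yields (the dataset here is just a range):

```python
from torch.utils.data import SequentialSampler

sampler = SequentialSampler(range(5))  # any sized dataset works
indices = list(sampler)
print(indices)       # [0, 1, 2, 3, 4]
print(len(sampler))  # 5
```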

The parameters of RandomSampler (covered next) are:

  • data_source: as above.
  • num_samples: the number of samples to draw; defaults to the full dataset size.
  • replacement: if True, sampling is done with replacement, i.e. the same sample may be drawn more than once, which means some samples may never be drawn. In that case you can raise num_samples so that every sample has a chance of being drawn.

RandomSampler

```python
class RandomSampler(Sampler):
    r"""Samples elements randomly. If without replacement, then sample from a shuffled dataset.
    If with replacement, then user can specify ``num_samples`` to draw.

    Arguments:
        data_source (Dataset): dataset to sample from
        num_samples (int): number of samples to draw, default=len(dataset)
        replacement (bool): samples are drawn with replacement if ``True``, default=False
    """

    def __init__(self, data_source, replacement=False, num_samples=None):
        self.data_source = data_source
        self.replacement = replacement
        self.num_samples = num_samples

        if self.num_samples is not None and replacement is False:
            raise ValueError("With replacement=False, num_samples should not be specified, "
                             "since a random permute will be performed.")

        if self.num_samples is None:
            self.num_samples = len(self.data_source)

        if not isinstance(self.num_samples, int) or self.num_samples <= 0:
            raise ValueError("num_samples should be a positive integeral "
                             "value, but got num_samples={}".format(self.num_samples))
        if not isinstance(self.replacement, bool):
            raise ValueError("replacement should be a boolean value, but got "
                             "replacement={}".format(self.replacement))

    def __iter__(self):
        n = len(self.data_source)
        if self.replacement:
            return iter(torch.randint(high=n, size=(self.num_samples,), dtype=torch.int64).tolist())
        return iter(torch.randperm(n).tolist())

    def __len__(self):
        return len(self.data_source)
```
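Without replacement, RandomSampler yields a permutation of all indices, so every index appears exactly once; with replacement, indices may repeat and num_samples controls how many are drawn (the seed below is only for reproducibility):

```python
import torch
from torch.utils.data import RandomSampler

torch.manual_seed(0)

# Without replacement: a shuffled permutation of 0..4
indices = list(RandomSampler(range(5)))
print(sorted(indices))  # [0, 1, 2, 3, 4] -- every index exactly once

# With replacement: draw 8 indices from a 5-element dataset, repeats allowed
draws = list(RandomSampler(range(5), replacement=True, num_samples=8))
print(len(draws))  # 8
```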

SubsetRandomSampler

A common use case for this sampler is splitting a training set into training and validation subsets:

```python
class SubsetRandomSampler(Sampler):
    r"""Samples elements randomly from a given list of indices, without replacement.

    Arguments:
        indices (sequence): a sequence of indices
    """

    def __init__(self, indices):
        self.indices = indices

    def __iter__(self):
        return (self.indices[i] for i in torch.randperm(len(self.indices)))

    def __len__(self):
        return len(self.indices)
```
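The train/validation split mentioned above can be sketched as follows: shuffle the index range once, carve it into two disjoint pieces, and hand each piece to its own DataLoader via SubsetRandomSampler (the dataset, the 80/20 ratio, and the batch size are all illustrative choices):

```python
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler

dataset = list(range(100))                        # stand-in for a real dataset
indices = torch.randperm(len(dataset)).tolist()   # one global shuffle
split = int(0.8 * len(dataset))                   # 80/20 split
train_idx, val_idx = indices[:split], indices[split:]

train_loader = DataLoader(dataset, batch_size=10,
                          sampler=SubsetRandomSampler(train_idx))
val_loader = DataLoader(dataset, batch_size=10,
                        sampler=SubsetRandomSampler(val_idx))

print(len(train_loader), len(val_loader))  # 8 2
```

Because both loaders draw from disjoint index lists, no sample appears in both splits.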

batch_sampler
The samplers above return only one index at a time, but training operates on batches of data, and that is the job of BatchSampler. In other words, BatchSampler collects the index values produced by an underlying Sampler and, once it has accumulated a batch's worth, returns them as the indices of one batch.

```python
class BatchSampler(Sampler):
    r"""Wraps another sampler to yield a mini-batch of indices.

    Args:
        sampler (Sampler): Base sampler.
        batch_size (int): Size of mini-batch.
        drop_last (bool): If ``True``, the sampler will drop the last batch if
            its size would be less than ``batch_size``

    Example:
        >>> list(BatchSampler(SequentialSampler(range(10)), batch_size=3, drop_last=False))
        [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
        >>> list(BatchSampler(SequentialSampler(range(10)), batch_size=3, drop_last=True))
        [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
    """
    # batches the indices of a base sampler

    def __init__(self, sampler, batch_size, drop_last):
        if not isinstance(sampler, Sampler):
            raise ValueError("sampler should be an instance of "
                             "torch.utils.data.Sampler, but got sampler={}"
                             .format(sampler))
        if not isinstance(batch_size, _int_classes) or isinstance(batch_size, bool) or \
                batch_size <= 0:
            raise ValueError("batch_size should be a positive integeral value, "
                             "but got batch_size={}".format(batch_size))
        if not isinstance(drop_last, bool):
            raise ValueError("drop_last should be a boolean value, but got "
                             "drop_last={}".format(drop_last))
        self.sampler = sampler
        self.batch_size = batch_size
        self.drop_last = drop_last

    def __iter__(self):
        batch = []
        for idx in self.sampler:
            batch.append(idx)
            if len(batch) == self.batch_size:
                yield batch
                batch = []
        if len(batch) > 0 and not self.drop_last:
            yield batch

    def __len__(self):
        if self.drop_last:
            return len(self.sampler) // self.batch_size
        else:
            return (len(self.sampler) + self.batch_size - 1) // self.batch_size
```
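The docstring's own examples can be run directly: BatchSampler groups the indices coming out of a base sampler into lists of batch_size indices, and drop_last decides the fate of the final partial batch.

```python
from torch.utils.data import BatchSampler, SequentialSampler

with_tail = list(BatchSampler(SequentialSampler(range(10)), batch_size=3, drop_last=False))
print(with_tail)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]

no_tail = list(BatchSampler(SequentialSampler(range(10)), batch_size=3, drop_last=True))
print(no_tail)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
```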

Parallel data loading

The num_workers parameter specifies how many worker subprocesses load data in parallel (DataLoader uses multiprocessing rather than threads). Loading with multiple workers can speed up data reading and improve GPU/CPU utilization.