https://github.com/rasbt/stat479-deep-learning-ss19/blob/master/L09_mlp/code/custom-dataloader/custom-dataloader-example.ipynb

Custom DataLoader Example


2) Custom Dataset Class

torch.utils.data.Dataset is an abstract class.
A custom dataset class must inherit from it and implement two methods: __len__ (which returns the size of the dataset) and __getitem__ (which returns a sample and its label for a given index).

    import torch
    from PIL import Image
    from torch.utils.data import Dataset
    import os
    import pandas as pd


    class MyDataset(Dataset):
        def __init__(self, csv_path, img_dir, transform=None):
            df = pd.read_csv(csv_path)
            self.img_dir = img_dir
            self.img_names = df['File Name']
            self.y = df['Class Label']
            self.transform = transform

        def __getitem__(self, index):
            img = Image.open(os.path.join(self.img_dir,
                                          self.img_names[index]))
            if self.transform is not None:
                img = self.transform(img)
            label = self.y[index]
            return img, label

        def __len__(self):
            return self.y.shape[0]

The class defined above gives us the dataset we need, and we can fetch each sample one at a time by indexing into it. However, this makes it hard to take batches, shuffle, or load data with multiple worker processes, so PyTorch provides a simple way to handle all of that.
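For example, a minimal sketch of the one-sample-at-a-time access (assuming the mnist_train.csv file and mnist_train image folder used later in this note):

    dataset = MyDataset(csv_path='mnist_train.csv', img_dir='mnist_train')

    print(len(dataset))        # calls __len__
    img, label = dataset[0]    # calls __getitem__ with index 0

    # one sample per step -- no batching, shuffling, or worker processes
    for index in range(len(dataset)):
        img, label = dataset[index]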

3) Custom DataLoader

__getitem__ only fetches one sample at a time, so we use torch.utils.data.DataLoader to define a new iterator that handles batching.

    from torchvision import transforms
    from torch.utils.data import DataLoader

    custom_transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize([0.5], [0.5])
    ])

    train_dataset = MyDataset(csv_path='mnist_train.csv',
                              img_dir='mnist_train',
                              transform=custom_transform)

    train_loader = DataLoader(dataset=train_dataset,
                              batch_size=32,
                              shuffle=True,
                              num_workers=4)  # number of processes/CPUs to use
  1. transforms.Compose() chains several transform functions together.
  2. transforms.Normalize([0.5], [0.5]) normalizes the tensor; the two 0.5 values are the per-channel mean and standard deviation used for normalization. The image here is grayscale with a single channel, so one value each suffices; with multiple channels you need one value per channel, e.g. for 3 channels it should be Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]) (see the sketch below).
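As a quick check of the arithmetic (a standalone sketch, not from the notebook): Normalize computes (x - mean) / std per channel, so with mean 0.5 and std 0.5 a tensor of values in [0, 1] is mapped to [-1, 1].

    import torch
    from torchvision import transforms

    t = torch.tensor([[[0.0, 0.5, 1.0]]])        # shape (1, 1, 3): one channel, 1x3 "image"
    norm = transforms.Normalize([0.5], [0.5])    # applies (x - 0.5) / 0.5
    print(norm(t))                               # tensor([[[-1., 0., 1.]]])

    # for a 3-channel image, pass one mean/std per channel
    norm_rgb = transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])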
    # assuming a test_dataset built with MyDataset, analogous to train_dataset above
    test_loader = DataLoader(test_dataset, batch_size=2, shuffle=True, num_workers=2)

    for i, batch in enumerate(test_loader):
        images, labels = batch

The DataLoader reads data in batches and can be used like an iterable, e.g. by looping over it as above.
It is not itself an iterator, though; we can turn it into one with the built-in iter().

    dataiter = iter(test_loader)
    imgs, labels = next(dataiter)

Visualization

    import numpy as np
    import matplotlib.pyplot as plt
    import torchvision


    def imshow(inp, title=None):
        """Imshow for a Tensor."""
        inp = inp.numpy().transpose((1, 2, 0))   # (C, H, W) -> (H, W, C)
        mean = np.array([0.485, 0.456, 0.406])
        std = np.array([0.229, 0.224, 0.225])
        inp = std * inp + mean                   # undo the normalization
        inp = np.clip(inp, 0, 1)
        plt.imshow(inp)
        if title is not None:
            plt.title(title)
        plt.pause(0.001)  # pause a bit so that plots are updated


    # Get a batch of training data (assumes a dataloaders dict and class_names list)
    inputs, classes = next(iter(dataloaders['train']))

    # Make a grid from the batch
    out = torchvision.utils.make_grid(inputs)
    imshow(out, title=[class_names[x] for x in classes])
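Note that the snippet above assumes 3-channel images normalized with ImageNet statistics and pre-existing dataloaders / class_names variables. A minimal sketch adapted to the grayscale train_loader defined earlier in this note (undoing Normalize([0.5], [0.5]) instead) could look like this:

    import matplotlib.pyplot as plt
    import numpy as np
    import torchvision

    images, labels = next(iter(train_loader))    # one batch from our MNIST loader
    grid = torchvision.utils.make_grid(images)   # make_grid returns a 3-channel (C, H, W) tensor

    grid = grid.numpy().transpose((1, 2, 0))     # -> (H, W, C) for matplotlib
    grid = np.clip(grid * 0.5 + 0.5, 0, 1)       # undo Normalize([0.5], [0.5])

    plt.imshow(grid)
    plt.title(labels.tolist())
    plt.show()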

4) Iterating Through the Dataset

    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    torch.manual_seed(0)

    num_epochs = 2
    for epoch in range(num_epochs):
        for batch_idx, (x, y) in enumerate(train_loader):
            print("Epoch:", epoch + 1, end='')
            print(" | Batch index:", batch_idx, end='')
            print(" | Batch size:", y.size()[0])

            x = x.to(device)
            y = y.to(device)
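To confirm the loader yields the expected tensors, a quick shape check (the shapes shown assume 28x28 single-channel MNIST images and batch_size=32):

    x, y = next(iter(train_loader))
    print(x.shape)   # e.g. torch.Size([32, 1, 28, 28])
    print(y.shape)   # torch.Size([32])
    print(x.dtype)   # torch.float32, produced by ToTensor()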