Scenarios that call for weight adjustment
- Misclassification costs are asymmetric. For example, when classifying legitimate vs. fraudulent users, labeling a fraudulent user as legitimate is very costly. We would rather flag a legitimate user as fraudulent, since that case can still be screened manually, than let a fraudulent user slip through as legitimate.
- The classes are highly imbalanced. Suppose we have 10,000 binary-labeled samples of legitimate and fraudulent users, 9,995 of them legitimate and only 5 fraudulent. Without any weighting, a model could predict every test sample as legitimate and reach 99.95% accuracy in theory, yet be completely useless.
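To see why plain accuracy is misleading in the imbalanced case, here is a minimal sketch (with hypothetical labels) of the degenerate "always predict legitimate" baseline:

```python
# 9,995 legitimate users (label 0) and 5 fraudulent users (label 1)
labels = [0] * 9995 + [1] * 5

# A degenerate model that predicts "legitimate" for everyone
predictions = [0] * len(labels)

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)
print(accuracy)  # 0.9995, yet not a single fraudulent user is caught

# Recall on the fraudulent class exposes the failure
caught = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = caught / 5
print(recall)    # 0.0
```

High accuracy coexists with zero recall on the class we actually care about, which is exactly why the weighting techniques below are needed.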
Handling this in PyTorch
Classical machine-learning libraries (scikit-learn, for instance) usually cover the scenarios above by adjusting either class weights or sample weights. In PyTorch, the only built-in mechanism I have found for adjusting sample weights is WeightedRandomSampler.
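For comparison, scikit-learn's `class_weight='balanced'` option computes each class weight as n_samples / (n_classes * count[c]); a pure-Python sketch of that formula (not scikit-learn itself) looks like this:

```python
from collections import Counter

def balanced_class_weights(y):
    # Mirrors scikit-learn's class_weight='balanced' heuristic:
    # weight[c] = n_samples / (n_classes * count[c])
    count = Counter(y)
    n_samples = len(y)
    n_classes = len(count)
    return {c: n_samples / (n_classes * n) for c, n in count.items()}

# 9,995 legitimate (0) vs. 5 fraudulent (1) samples, as in the example above
y = [0] * 9995 + [1] * 5
w = balanced_class_weights(y)
print(w)  # class 0 gets ~0.5, class 1 gets 1000.0
```

The rarer a class, the larger its weight, so each class contributes equally to the weighted loss.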
```python
# Assign each sample a weight inversely proportional to the size of its class
def make_weights_for_balanced_classes(images, nclasses):
    count = [0] * nclasses
    for item in images:
        count[item[1]] += 1          # item is a (path, class_index) pair
    weight_per_class = [0.] * nclasses
    N = float(sum(count))
    for i in range(nclasses):
        weight_per_class[i] = N / float(count[i])
    weight = [0] * len(images)
    for idx, val in enumerate(images):
        weight[idx] = weight_per_class[val[1]]
    return weight
```
```python
train_set = ImageFolder(opt.train_data_root, train_tf, loader=pil_loader)
valid_set = ImageFolder(opt.val_data_root, valid_tf, loader=pil_loader)
weights = make_weights_for_balanced_classes(train_set.imgs, len(train_set.classes))
# Alternatively, assign a fixed weight per class:
# weights = [2 if label == 1 else 1 for data, label in train_set]
weights = t.DoubleTensor(weights)
sampler = t.utils.data.sampler.WeightedRandomSampler(weights, len(weights))
train_dataloader = DataLoader(train_set, opt.batch_size, sampler=sampler,
                              num_workers=opt.num_workers)
```
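WeightedRandomSampler draws dataset indices with replacement, with probability proportional to each index's weight. Because make_weights_for_balanced_classes gives every class the same total weight, the sampled batches become roughly class-balanced. A pure-Python simulation of that effect (with hypothetical label counts, using `random.choices` in place of the sampler):

```python
import random

random.seed(0)

# 9,995 majority-class samples (0) and 5 minority-class samples (1)
labels = [0] * 9995 + [1] * 5
counts = {0: 9995, 1: 5}
N = len(labels)

# Per-sample weight = N / count[class], as in make_weights_for_balanced_classes
weights = [N / counts[y] for y in labels]

# Weighted sampling with replacement, the same scheme WeightedRandomSampler uses
drawn = random.choices(range(N), weights=weights, k=10000)
drawn_labels = [labels[i] for i in drawn]

frac_minority = drawn_labels.count(1) / len(drawn_labels)
print(frac_minority)  # close to 0.5: the minority class is heavily oversampled
```

Each minority sample appears many times per epoch while each majority sample appears rarely, which is the trade-off this sampler makes in exchange for balanced batches.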
