Scenarios that call for weight adjustment
- Misclassification costs are asymmetric. For example, when classifying legitimate vs. fraudulent users, labeling a fraudulent user as legitimate is very costly. We would rather flag a legitimate user as fraudulent, since that case can still be screened manually, than let a fraudulent user slip through as legitimate.
- The classes are highly imbalanced. Suppose we have 10,000 binary-labeled samples of legitimate and fraudulent users, 9,995 of them legitimate and only 5 fraudulent. Without any weighting, a model could predict every test sample as legitimate and reach 99.95% accuracy in theory, yet be completely useless.
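To see why plain accuracy is misleading in the imbalanced case, here is a minimal sketch (with hypothetical labels) of the degenerate "always predict legitimate" baseline:

```python
# 9,995 legitimate users (label 0) and 5 fraudulent users (label 1)
labels = [0] * 9995 + [1] * 5

# A degenerate model that predicts "legitimate" for everyone
predictions = [0] * len(labels)

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)
print(accuracy)  # 0.9995, yet not a single fraudulent user is caught

# Recall on the fraudulent class exposes the failure
caught = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = caught / 5
print(recall)    # 0.0
```

High accuracy coexists with zero recall on the class we actually care about, which is exactly why the weighting techniques below are needed.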
Handling this in PyTorch
Classical machine-learning libraries (scikit-learn, for instance) usually cover the scenarios above by adjusting either class weights or sample weights. In PyTorch, the only built-in mechanism I have found for adjusting sample weights is WeightedRandomSampler.
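For comparison, scikit-learn's `class_weight='balanced'` option computes each class weight as n_samples / (n_classes * count[c]); a pure-Python sketch of that formula (not scikit-learn itself) looks like this:

```python
from collections import Counter

def balanced_class_weights(y):
    # Mirrors scikit-learn's class_weight='balanced' heuristic:
    # weight[c] = n_samples / (n_classes * count[c])
    count = Counter(y)
    n_samples = len(y)
    n_classes = len(count)
    return {c: n_samples / (n_classes * n) for c, n in count.items()}

# 9,995 legitimate (0) vs. 5 fraudulent (1) samples, as in the example above
y = [0] * 9995 + [1] * 5
w = balanced_class_weights(y)
print(w)  # class 0 gets ~0.5, class 1 gets 1000.0
```

The rarer a class, the larger its weight, so each class contributes equally to the weighted loss.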
```python
# Assign each sample a weight inversely proportional to the size of its class
def make_weights_for_balanced_classes(images, nclasses):
    count = [0] * nclasses
    for item in images:
        count[item[1]] += 1          # item is a (path, class_index) pair
    weight_per_class = [0.] * nclasses
    N = float(sum(count))
    for i in range(nclasses):
        weight_per_class[i] = N / float(count[i])
    weight = [0] * len(images)
    for idx, val in enumerate(images):
        weight[idx] = weight_per_class[val[1]]
    return weight
```
```python
train_set = ImageFolder(opt.train_data_root, train_tf, loader=pil_loader)
valid_set = ImageFolder(opt.val_data_root, valid_tf, loader=pil_loader)
weights = make_weights_for_balanced_classes(train_set.imgs, len(train_set.classes))
# Alternatively, assign a fixed weight per class:
# weights = [2 if label == 1 else 1 for data, label in train_set]
weights = t.DoubleTensor(weights)
sampler = t.utils.data.sampler.WeightedRandomSampler(weights, len(weights))
train_dataloader = DataLoader(train_set, opt.batch_size, sampler=sampler,
                              num_workers=opt.num_workers)
```
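WeightedRandomSampler draws dataset indices with replacement, with probability proportional to each index's weight. Because make_weights_for_balanced_classes gives every class the same total weight, the sampled batches become roughly class-balanced. A pure-Python simulation of that effect (with hypothetical label counts, using `random.choices` in place of the sampler):

```python
import random

random.seed(0)

# 9,995 majority-class samples (0) and 5 minority-class samples (1)
labels = [0] * 9995 + [1] * 5
counts = {0: 9995, 1: 5}
N = len(labels)

# Per-sample weight = N / count[class], as in make_weights_for_balanced_classes
weights = [N / counts[y] for y in labels]

# Weighted sampling with replacement, the same scheme WeightedRandomSampler uses
drawn = random.choices(range(N), weights=weights, k=10000)
drawn_labels = [labels[i] for i in drawn]

frac_minority = drawn_labels.count(1) / len(drawn_labels)
print(frac_minority)  # close to 0.5: the minority class is heavily oversampled
```

Each minority sample appears many times per epoch while each majority sample appears rarely, which is the trade-off this sampler makes in exchange for balanced batches.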
