- The paper's method provides two trained models: sup-simcse-bert-base-uncased (trained with NLI supervision, referred to below as sup_simcse) and unsup-simcse-bert-base-uncased (trained unsupervised, referred to below as unsup_simcse).
- The following analyzes unsup_simcse, sup_simcse, and several other trained models.
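All the models compared below score a sentence pair by the cosine similarity of the two sentence embeddings (as SimCSE does). A minimal sketch of that scoring step, with toy vectors standing in for real model outputs:

```python
import numpy as np

def cosine_sim(u, v):
    # cosine similarity between two embedding vectors
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# toy 4-dimensional "embeddings" standing in for real model outputs
emb_sent0 = np.array([0.2, 0.7, 0.1, 0.4])
emb_sent1 = np.array([0.25, 0.65, 0.05, 0.5])
print(round(cosine_sim(emb_sent0, emb_sent1), 3))  # a score close to 1.0
```

The real models obtain the embeddings from BERT; the paper's checkpoints are published on Hugging Face as princeton-nlp/sup-simcse-bert-base-uncased and princeton-nlp/unsup-simcse-bert-base-uncased.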
# 1、Analysis script
```python
import os

import requests
from datasets import load_dataset

SUPPORTED_MODEL = [
    # jy: model trained on biomedical paper data (English titles and
    #     abstracts only);
    "en",
    # jy: domain-specific word stems extracted from biomedical papers and
    #     added to the vocab.txt of bert-base-uncased;
    "en_addToken",
    # jy: domain-specific word stems extracted from biomedical papers,
    #     stripped of any prefixes/suffixes that already exist as tokens,
    #     then added to the vocab.txt of bert-base-uncased;
    "en_addSubToken",
    # jy: model trained on biomedical paper data (English and Chinese titles
    #     and abstracts);
    "encn",
    # jy: sup-simcse-bert-base-uncased from the paper;
    "sup_simcse",
    # jy: unsup-simcse-bert-base-uncased from the paper;
    "unsup_simcse",
]


def get_sents_similarity(ls_sents1, ls_sents2, model):
    if model not in SUPPORTED_MODEL:
        raise Exception("Unsupported model: [%s]" % model)
    headers = {}
    json_data = {'ls_sents1': ls_sents1,
                 'ls_sents2': ls_sents2,
                 'model': model}
    # jy: API endpoint for the model; see the previous section for how the
    #     service is deployed;
    url_ = 'http://192.1xx.x.xxx:5005/get_sents_similarity'
    response = requests.post(url_, headers=headers, json=json_data)
    return response.text


def analysis_by_nli_data(ls_model_name, topK=100):
    """
    ls_model_name: models published as API endpoints, to be called here;
    topK: analyze the first top-k NLI examples;
    """
    DIR_ = "/home/huangjiayue/04_SimCSE/SimCSE/data/"
    # jy: call the load_dataset function from /datasets/load.py;
    #     the first argument is the loading script for csv files (downloaded
    #     locally to avoid intermittent remote-access errors);
    #     data_files is the NLI training data used by SimCSE; each row has
    #     the format [sent0, sent1, hard_neg], where sent0 and sent1 have
    #     the same meaning and hard_neg has the opposite meaning;
    ds = load_dataset(os.path.join(DIR_, "dataset_script_jy/csv.py"),
                      data_files=os.path.join(DIR_, "nli_for_simcse.csv"))
    ds_train = ds["train"]
    len_ds = len(ds_train)
    for i in range(len_ds):
        # jy: only analyze the first topK examples;
        if i >= topK:
            break
        sent0 = ds_train[i]['sent0']
        sent1 = ds_train[i]['sent1']
        hard_neg = ds_train[i]['hard_neg']
        print("\n" + "=" * 88)
        print("sent0: %s" % sent0)
        print("-" * 66)
        print("sent1: %s" % sent1)
        print("-" * 66)
        print("hard_neg: %s" % hard_neg)
        print("-" * 66)
        ls_sents1 = [sent0]
        ls_sents2 = [sent1, hard_neg]
        for model in ls_model_name:
            str_sim_score_matrix = get_sents_similarity(ls_sents1, ls_sents2, model)
            sim_score = [s.strip()[:5] for s in
                         str_sim_score_matrix.strip("[]").split()]
            print("【{:15} model similarity】 positive: {}, negative: {}".format(
                model, sim_score[0], sim_score[1]))


ls_model_name = SUPPORTED_MODEL
analysis_by_nli_data(ls_model_name)
```
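The script assumes the service replies with the similarity matrix serialized as a numpy-style string. The bracket-stripping parse in the loop can be exercised on its own (the response text below is made up for illustration):

```python
# hypothetical response text, in the numpy print format the script assumes
str_sim_score_matrix = "[[0.46231 0.04853]]"
# same parsing as in analysis_by_nli_data: strip the brackets, split on
# whitespace, keep the first 5 characters of each score
sim_score = [s.strip()[:5] for s in str_sim_score_matrix.strip("[]").split()]
print(sim_score)  # ['0.462', '0.048']
```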
<a name="fvk9p"></a>
# 2、Sample results for each model
- The analysis results for the first 100 examples are in the attachment: [analysis.txt](https://www.yuque.com/attachments/yuque/0/2022/txt/25833371/1656567812491-d4c59fe1-5caf-4dc8-a191-6aaaaa634875.txt)
- The attachment shows that, overall, the models trained after adding new tokens to the vocabulary outperform the original model (en_addToken and en_addSubToken are generally better than en; all three were trained on the same in-domain data, so they can be compared directly). Which of en_addToken and en_addSubToken is better is harder to judge; that comparison should be made on data from the same domain.
- Representative examples are excerpted below (see the comments in the code above for what each model is).
- Note: only unsup_simcse and sup_simcse belong to the same domain as the NLI data; the results of the other models (trained on the titles and abstracts of biomedical papers) are for reference only.
- Also, the training set of sup_simcse includes the test examples below (no other test set has been found yet), so its scores are flattered.
<a name="mb0Oq"></a>
## (1) Large length gap
- When a pair's lengths differ greatly, the positive-pair similarity is generally low (even for the sup method).
- Preliminary guess: positive pairs with such large length gaps are rare in the training data.
```
========================================================================================
sent0: you know during the season and i guess at at your level uh you lose them to the next level if if they decide to recall the the parent team the Braves decide to call to recall a guy from triple A then a double A guy goes up to replace him and a single A guy goes up to replace him
sent1: You lose the things to the following level if the people recall.
hard_neg: They never perform recalls on anything.
【en              model similarity】 positive: 0.245, negative: 0.107
【en_addToken     model similarity】 positive: 0.389, negative: 0.235
【en_addSubToken  model similarity】 positive: 0.411, negative: 0.164
【encn            model similarity】 positive: 0.383, negative: 0.242
【sup_simcse      model similarity】 positive: 0.462, negative: 0.048
【unsup_simcse    model similarity】 positive: 0.557, negative: 0.428
```
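The length gap in the pair above can be quantified with a simple word count (sentences copied verbatim from the example):

```python
sent0 = ("you know during the season and i guess at at your level uh you "
         "lose them to the next level if if they decide to recall the the "
         "parent team the Braves decide to call to recall a guy from triple "
         "A then a double A guy goes up to replace him and a single A guy "
         "goes up to replace him")
sent1 = "You lose the things to the following level if the people recall."
# sent0 is roughly five times longer than its positive pair sent1
print(len(sent0.split()), len(sent1.split()))
```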
<a name="PBbe6"></a>
## (2) Very short sentences
- When the sentences are short and the positive and negative examples have similarly high word overlap with sent0:
   - the unsup method shows no separation;
   - for extremely short examples, even the sup method separates poorly (presumably also because such training data is scarce).
```
========================================================================================
sent0: How do you know? All this is their information again.
sent1: This information belongs to them.
hard_neg: They have no information at all.
【en              model similarity】 positive: 0.769, negative: 0.770
【en_addToken     model similarity】 positive: 0.751, negative: 0.759
【en_addSubToken  model similarity】 positive: 0.646, negative: 0.676
【encn            model similarity】 positive: 0.683, negative: 0.708
【sup_simcse      model similarity】 positive: 0.713, negative: 0.479
【unsup_simcse    model similarity】 positive: 0.727, negative: 0.679
```
```
========================================================================================
sent0: (Read for Slate ‘s take on Jackson’s findings.)
sent1: Slate had an opinion on Jackson’s findings.
hard_neg: Slate did not hold any opinion on Jackson’s findings.
【en              model similarity】 positive: 0.870, negative: 0.819
【en_addToken     model similarity】 positive: 0.771, negative: 0.624
【en_addSubToken  model similarity】 positive: 0.839, negative: 0.767
【encn            model similarity】 positive: 0.824, negative: 0.776
【sup_simcse      model similarity】 positive: 0.850, negative: 0.440
【unsup_simcse    model similarity】 positive: 0.810, negative: 0.760
```
```
========================================================================================
sent0: Gays and lesbians.
sent1: Homosexuals.
hard_neg: Heterosexuals.
【en              model similarity】 positive: 0.805, negative: 0.705
【en_addToken     model similarity】 positive: 0.751, negative: 0.693
【en_addSubToken  model similarity】 positive: 0.214, negative: 0.348
【encn            model similarity】 positive: 0.839, negative: 0.603
【sup_simcse      model similarity】 positive: 0.822, negative: 0.737
【unsup_simcse    model similarity】 positive: 0.848, negative: 0.720
```
```
========================================================================================
sent0: Fun for adults and children.
sent1: Fun for both adults and children.
hard_neg: Fun for only children.
【en              model similarity】 positive: 0.982, negative: 0.936
【en_addToken     model similarity】 positive: 0.974, negative: 0.805
【en_addSubToken  model similarity】 positive: 0.964, negative: 0.899
【encn            model similarity】 positive: 0.970, negative: 0.902
【sup_simcse      model similarity】 positive: 0.983, negative: 0.609
【unsup_simcse    model similarity】 positive: 0.986, negative: 0.825
```
```
========================================================================================
sent0: well it’s been very interesting
sent1: It has been very intriguing.
hard_neg: It hasn’t been interesting.
【en              model similarity】 positive: 0.787, negative: 0.800
【en_addToken     model similarity】 positive: 0.770, negative: 0.838
【en_addSubToken  model similarity】 positive: 0.824, negative: 0.787
【encn            model similarity】 positive: 0.602, negative: 0.851
【sup_simcse      model similarity】 positive: 0.878, negative: 0.626
【unsup_simcse    model similarity】 positive: 0.738, negative: 0.786
```
<a name="SMe4S"></a>
## (3) Special cases: even the sup method shows no separation
- Such examples have at least one of the following traits:
   - the negative example's wording has a higher word overlap with the original sentence, e.g. it contains just as many negation words (even though the meaning is opposite);
   - the negative example's length is closer to the original sentence's.
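The overlap trait above can be made concrete with a rough Jaccard overlap over lowercased word sets (the sentences here are made up for illustration; real subword tokenization would differ):

```python
def word_overlap(a, b):
    # Jaccard overlap between the lowercased word sets of two sentences
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

sent0 = "the answer is not clear"
sent1 = "nobody knows the answer"     # same meaning, different wording
hard_neg = "the answer is clear"      # opposite meaning, heavy word overlap
pos, neg = word_overlap(sent0, sent1), word_overlap(sent0, hard_neg)
print(round(pos, 2), round(neg, 2))   # the hard negative overlaps far more
```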
```
========================================================================================
sent0: It’s not that the questions they asked weren’t interesting or legitimate (though most did fall under the category of already asked and answered).
sent1: Most of the questions that were asked had already been answered.
hard_neg: Every query that came up wasn’t interesting or legitimate, but no one had asked them.
【en              model similarity】 positive: 0.732, negative: 0.748
【en_addToken     model similarity】 positive: 0.698, negative: 0.769
【en_addSubToken  model similarity】 positive: 0.740, negative: 0.787
【encn            model similarity】 positive: 0.885, negative: 0.711
【sup_simcse      model similarity】 positive: 0.618, negative: 0.648
【unsup_simcse    model similarity】 positive: 0.821, negative: 0.813
```
```
========================================================================================
sent0: I don’t mean to be glib about your concerns, but if I were you, I might be more concerned about the near-term rate implications of this $1.
sent1: The near-term implications are concerning.
hard_neg: I am concerned more about your issues than the near-term rate implications.
【en              model similarity】 positive: 0.354, negative: 0.636
【en_addToken     model similarity】 positive: 0.399, negative: 0.712
【en_addSubToken  model similarity】 positive: 0.264, negative: 0.629
【encn            model similarity】 positive: 0.377, negative: 0.720
【sup_simcse      model similarity】 positive: 0.463, negative: 0.727
【unsup_simcse    model similarity】 positive: 0.475, negative: 0.771
```