1、基本介绍、安装
- （1）基本介绍
- （2）安装
2、使用示例
- （1）使用 /simcse/tool.py 中的 SimCSE 类进行模型导入
jy: 即初始化 /simcse/tool.py 中的 SimCSE 类(内部同样是基于 transformers 包进行模型管理);
jy: 如果安装了 GPU 版本的 torch 包, 需确保有空闲 GPU 可用, 否则报错如下:
RuntimeError: CUDA error: out of memory
jy: torch.Size([768])
jy:
jy: array([[0.01262083, 0.34469506],
[0.89384234, 0.04842842]], dtype=float32)
jy: 如果环境中有安装 faiss 包，则以下的 build_index 方法（在 /simcse/tool.py 的 SimCSE 类
中定义）会自动导入 faiss 包加速运算; 注意: faiss did not well support Nvidia AMPERE
GPUs (3090 and A100). In that case, you should change to other GPUs or install
the CPU version of faiss package.
jy: [(‘A man is playing a guitar.’, 0.8938424)]
- （3）使用示例汇总
- （4）示例代码流程分析（包含 transformers 的包管理流程）
3、参考

1、基本介绍、安装

（1）基本介绍

SimCSE：Simple Contrastive Learning of Sentence Embeddings
- a simple contrastive learning framework that greatly advances the state-of-the-art sentence embeddings.
论文：【2021-09-09】SimCSE：Simple Contrastive Learning of Sentence Embeddings

（2）安装
创建虚拟环境，安装指定版本的 setuptools 包
- conda create --name jy-simCSE_py36 python==3.6.5
- pip install setuptools==49.3.0
  - 注意：低于此版本后续安装可能会报错（版本参考自项目 github 中的 requirements.txt 说明）

安装 torch

尽管 pip 安装simcse时会自动安装该依赖包；但相应版本可能与当前机器的 cuda 环境不符（带 GPU 的机器环境），如：

      ![image.png](https://cdn.nlark.com/yuque/0/2022/png/25833371/1650894520381-f8bfd5dc-cb21-4950-899b-8c9b2e40eda2.png#clientId=uf43c355b-139b-4&crop=0&crop=0&crop=1&crop=1&from=paste&height=72&id=u8c6d6e07&margin=%5Bobject%20Object%5D&name=image.png&originHeight=152&originWidth=509&originalType=binary&ratio=1&rotation=0&showTitle=false&size=22956&status=done&style=none&taskId=u903c2598-741e-4b98-9aa7-e589cd83b6e&title=&width=242.38096338821427)

但 cuda 环境如下：

     ![image.png](https://cdn.nlark.com/yuque/0/2022/png/25833371/1650894558571-b84765c3-9e9e-4b65-a40f-26f455ac5b92.png#clientId=uf43c355b-139b-4&crop=0&crop=0&crop=1&crop=1&from=paste&height=63&id=udef55f30&margin=%5Bobject%20Object%5D&name=image.png&originHeight=133&originWidth=1095&originalType=binary&ratio=1&rotation=0&showTitle=false&size=59818&status=done&style=none&taskId=uae62749a-96b6-41fb-87d2-72b0e653815&title=&width=521.4285951082409)

此时在使用 simcse 时可能产生如下错误：

     ![image.png](https://cdn.nlark.com/yuque/0/2022/png/25833371/1650894655015-a1fbedde-8c27-44d6-b379-afb29e8c65a2.png#clientId=uf43c355b-139b-4&crop=0&crop=0&crop=1&crop=1&from=paste&height=341&id=ub0554453&margin=%5Bobject%20Object%5D&name=image.png&originHeight=717&originWidth=1331&originalType=binary&ratio=1&rotation=0&showTitle=false&size=536876&status=done&style=none&taskId=u1c2e53ec-146f-4d59-9619-ca4ea2a4ef2&title=&width=633.8095525927567)

因此，建议根据机器环境手动安装 torch 包

pip install --upgrade pip
- pip 版本太低会影响 torch 依赖包 pillow 的安装
GPU 版本（CUDA > 11）
- pip install torch==1.7.1+cu110 -f https://download.pytorch.org/whl/torch_stable.html

CPU 版本（或 CUDA < 11）

pip install torch==1.7.1

参考官网：https://pytorch.org/get-started/locally/

               ![image.png](https://cdn.nlark.com/yuque/0/2022/png/25833371/1650894793648-9384bb47-cef7-4a7f-bee5-544de6d48d20.png#clientId=uf43c355b-139b-4&crop=0&crop=0&crop=1&crop=1&from=paste&height=204&id=u26a33de9&margin=%5Bobject%20Object%5D&name=image.png&originHeight=429&originWidth=1223&originalType=binary&ratio=1&rotation=0&showTitle=false&size=127494&status=done&style=none&taskId=u67235d02-0bc5-4e43-9001-2db17ce45c8&title=&width=582.3809788286562)

安装 simcse
- pip install simcse
  2、使用示例
注意：以下两种方式导入模型最终均为基于 huggingface（transformers）导入。
模型使用时的代码问题：https://github.com/princeton-nlp/SimCSE/issues/186
（1）使用 /simcse/tool.py 中的 SimCSE 类进行模型导入
```python from simcse import SimCSE

jy: 即初始化 /simcse/tool.py 中的 SimCSE 类(内部同样是基于 transformers 包进行模型管理);

model = SimCSE(“princeton-nlp/sup-simcse-bert-base-uncased”)

jy: 如果安装了 GPU 版本的 torch 包, 需确保有空闲 GPU 可用, 否则报错如下:

RuntimeError: CUDA error: out of memory

embeddings = model.encode(“A woman is reading.”)

jy: torch.Size([768])

print(embeddings.shape)

jy:

print(type(embeddings))

sentences_a = [‘A woman is reading.’, ‘A man is playing a guitar.’] sentences_b = [‘He plays guitar.’, ‘A woman is making a photo.’] similarities = model.similarity(sentences_a, sentences_b)

jy: array([[0.01262083, 0.34469506],

[0.89384234, 0.04842842]], dtype=float32)

print(similarities)

sentences = [‘A woman is reading.’, ‘A man is playing a guitar.’]

jy: 如果环境中有安装 faiss 包，则以下的 build_index 方法（在 /simcse/tool.py 的 SimCSE 类

中定义）会自动导入 faiss 包加速运算; 注意: faiss did not well support Nvidia AMPERE

GPUs (3090 and A100). In that case, you should change to other GPUs or install

the CPU version of faiss package.

model.build_index(sentences) results = model.search(“He plays guitar.”)

jy: [(‘A man is playing a guitar.’, 0.8938424)]

print(results)

<a name="k4Yfa"></a>
## （2）基于 transformers 包进行模型导入
```python
import torch
from scipy.spatial.distance import cosine
from transformers import AutoModel, AutoTokenizer
# Import our models. The package will take care of downloading the models automatically
tokenizer = AutoTokenizer.from_pretrained("princeton-nlp/sup-simcse-bert-base-uncased")
model = AutoModel.from_pretrained("princeton-nlp/sup-simcse-bert-base-uncased")
# Tokenize input texts
texts = [
    "There's a kid on a skateboard.",
    "A kid is skateboarding.",
    "A kid is inside the house."
]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
# Get the embeddings
with torch.no_grad():
    embeddings = model(**inputs, output_hidden_states=True, return_dict=True).pooler_output
# Calculate cosine similarities
# Cosine similarities are in [-1, 1]. Higher means more similar
cosine_sim_0_1 = 1 - cosine(embeddings[0], embeddings[1])
cosine_sim_0_2 = 1 - cosine(embeddings[0], embeddings[2])
print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[1], cosine_sim_0_1))
print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[2], cosine_sim_0_2))
"""
Cosine similarity between "There's a kid on a skateboard." and "A kid is skateboarding." is: 0.943
Cosine similarity between "There's a kid on a skateboard." and "A kid is inside the house." is: 0.439
"""

（3）使用示例汇总

from simcse import SimCSE



def get_sents_similarity(model, ls_sents1, ls_sents2):
    similarities = model.similarity(ls_sents1, ls_sents2)
    return similarities

def get_sent_embedding(model, sentence):
    return model.encode(sentence)


def similarity_search(model, ls_sents, sent_or_ls_sents, threshold=0.6, top_k=5):
    # jy: 如果环境中有安装 faiss 包，则以下的 build_index 方法（在 /simcse/tool.py 
    #     的 SimCSE 类>中定义）会自动导入 faiss 包加速运算。
    #     注意: faiss did not well support Nvidia AMPERE GPUs (3090 and A100). In 
    #           that case, you should change to other GPUs or install the CPU 
    #           version of faiss package.
    model.build_index(ls_sents)
    results = model.search(sent_or_ls_sents, threshold=threshold, top_k=top_k)
    return results


def remove_cn_punct(str_cn):
    pattern_cn_punct = "[，|≥|。|、|...|？|=|’|‘|“|”|；|：|！|（|）|%|,|\-|:| |/|\]|\[|(|)|>|<]"
    #pattern_cn_punct = "[，|。|、|...|？|’|‘|“|”|；|：|！| |（|）"
    str_cn = re.sub(pattern_cn_punct, "", str_cn).strip()
    return str_cn


# jy: simcse 官方模型;
#model = SimCSE("princeton-nlp/sup-simcse-bert-base-uncased")
#model = SimCSE("princeton-nlp/unsup-simcse-bert-base-uncased")


# jy: 官方 supervised 模型本地路径;
# a) supervised:
#model_path = "/home/huangjiayue/04_SimCSE/jy_model/sup-simcse-bert-base-uncased"
# b) unsupervised:
model_path = "/home/huangjiayue/04_SimCSE/jy_model/unsup-simcse-bert-base-uncased"
model = SimCSE(model_path)


# 注意：如果安装了 GPU 版本的 torch 包，需确保有空闲 GPU 可用，否则报错 OOM
# ============= (1) 句子向量化 ===========================================
#"""
sentence = "A woman is reading."
embeddings = get_sent_embedding(model, sentence)
# <class 'torch.Tensor'>
print(type(embeddings))
# torch.Size([768])
print(embeddings.shape)

sentence2 = "A man is playing a guitar."
embeddingss = get_sent_embedding(model, sentence2)
#"""

# ============= (2) 句子相似度计算 ========================================
"""
sentences_a = ['A woman is reading.', 'A man is playing a guitar.']
sentences_b = ['He plays guitar.', 'A woman is making a photo.']
# array([[0.01266468, 0.3446282 ],
#        [0.8938298 , 0.04850736]], dtype=float32)
similarities = get_sents_similarity(model, sentences_a, sentences_b)
"""


# ============= (3) 相似句子搜索 ========================================
"""
ls_sents = ['A woman is reading.', 'A man is playing a guitar.', "what are you doing"]
#sentences = ["她在阅读", "他在弹吉他"]
#sent_or_ls_sents = "He plays guitar."
sent_or_ls_sents = ["He plays guitar.", "I'm reading"]
#sent_or_ls_sents = ["I'm reading", "He plays guitar.", "I play guitar.", "I read and play guitar."]

similarities = get_sents_similarity(model, sent_or_ls_sents, ls_sents)
print(similarities)

res = similarity_search(model, ls_sents, sent_or_ls_sents, threshold=-1, top_k=5)
print(res)
"""

（4）示例代码流程分析（包含 transformers 的包管理流程）

SimCSE 代码流程分析.doc

3、参考

官方 github：https://github.com/princeton-nlp/SimCSE
demo：https://github.com/princeton-nlp/SimCSE/tree/main/demo

【03】机器学习、深度学习

（01）基本介绍、安装使用