[96] Trie Applications - (01) Finding English Synonyms (Based on Prefixes of a Given Length) - "[03] Machine Learning and Deep Learning"

1. Simple functional test
2. Simulated real-world test
3. Large text file test

import io
import re
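def remove_en_punct(text):
    # jy: Replace common English punctuation with spaces. A character class in a
    #     raw string avoids having to escape "." and "?", which are regex
    #     metacharacters outside a class;
    text = re.sub(r"[.,;:?!]", " ", text)
    return text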
class TrieNode(object):
    def __init__(self):
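        # jy: Child nodes are also TrieNode objects. A defaultdict(TrieNode) could be
        #     used here, but a plain dict works just as well: a default factory does
        #     not constrain the value type, it only makes the intent more explicit;
        self.children = {}
        # jy: True if the path from the root to this node spells a complete word;
        self.end_of_word = False
        # jy: Number of inserted words (duplicates included) that start with the
        #     prefix spelled by the path from the root to this node;
        self.prefix_count = 0
        # jy: Number of times the word spelled by the path from the root to this
        #     node has been inserted;
        self.word_count = 0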
class Trie_v1(object):
    def __init__(self, is_lower_case=False, min_prefix_len=5):
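        # jy: The root of the trie is a TrieNode whose children hold the first
        #     characters of all inserted words;
        # jy: The attributes of TrieNode could be inlined here, which removes the
        #     need for a separate node class. Each node would then be a plain dict
        #     {...}, so the end of a word can no longer be read from a TrieNode's
        #     end_of_word attribute and has to be marked with a dedicated key
        #     inside the dict instead (see the Trie_v2 implementation);
        self.root = TrieNode()
        # jy: Whether to convert every word to lower case on insert (off by default);
        self.is_lower_case = is_lower_case
        # jy: Minimum prefix length;
        self.min_prefix_len = min_prefix_len
        # jy: Maps each prefix to the set of words that start with it:
        #     {prefix: set_word, ...}
        self.prefix_dict = {}
        # jy: Candidate prefixes collected during insert;
        self.prefix_set = set()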
    def insert(self, word):
        node = self.root
        word = word.lower() if self.is_lower_case else word
        tmp_prefix = ""
        # jy: Walk through the characters of the word being inserted;
        for char in word:
            # jy: Record every prefix of at least min_prefix_len characters as a
            #     candidate prefix;
            tmp_prefix += char
            if len(tmp_prefix) >= self.min_prefix_len and tmp_prefix not in self.prefix_set:
                self.prefix_set.add(tmp_prefix)
            # jy: Follow the existing path as long as the prefix ending at char is
            #     already in the trie; once a character is missing, create a new
            #     child node for it (a leaf node has children == {} and
            #     end_of_word == True);
            # jy: dict.get() is required here because char may not be in the trie
            #     yet, and node.children[char] would raise KeyError;
            child = node.children.get(char)
            # jy: If char is not among the children of the previous character's
            #     node, add it;
            if child is None:
                node.children[char] = TrieNode()
            node.children[char].prefix_count += 1
            node = node.children[char]
        # jy: After the loop, node is the TrieNode of the last character of word, so
        #     mark it with end_of_word = True to record that a complete word ends here;
        # jy: The children of this node are not necessarily empty: word may itself be
        #     a prefix of other words, in which case the node keeps its children;
        node.end_of_word = True
        node.word_count += 1
    def search(self, word):
        """
        Check whether word is in the trie.
        Returns a tuple (True/False, word_count): whether the word exists and how
        many times it was inserted.
        """
        node = self.root
        # jy: Walk down from the root one character at a time; every character of
        #     word must be found along the path, and the node of the last character
        #     must have end_of_word == True (a word ends there);
        for char in word:
            node = node.children.get(char)
            if not node:
                return (False, 0)
        # jy: After the loop, node is the TrieNode of the last character of word. If
        #     its end_of_word is True the word is in the trie; otherwise word is only
        #     a prefix of other words;
        return (node.end_of_word, node.word_count)
    def startsWith(self, prefix):
        """
        Check whether any word in the trie starts with prefix.
        Returns a tuple (True/False, prefix_word_count): whether such a word exists
        and how many inserted words (duplicates included) start with prefix.
        """
        node = self.root
        # jy: Same walk as in search, except that the last character of prefix does
        #     not have to end a word: it is enough that every character of prefix is
        #     found, in order, along a path from the root;
        for char in prefix:
            node = node.children.get(char)
            if not node:
                return (False, 0)
        return (True, node.prefix_count)
    def prefix_words(self, prefix, is_get_synonyms=False):
        """
        Return the set of all words in the trie that start with prefix, or an
        empty set if there are none (a variation on startsWith).
        """
        set_prefix_words = set()
        # jy: If prefix is not in the trie, return the empty set right away;
        node = self.root
        for char in prefix:
            node = node.children.get(char)
            if not node:
                # jy: This method is not recursive, so the set is still empty here;
                return set_prefix_words
        # jy: When looking for synonyms, at least 2 words must share the prefix;
        #     a single word is of no use.
        #     Note that node.prefix_count >= 2 does not guarantee several distinct
        #     words, because the same word may have been inserted more than once.
        #     set_prefix_words may therefore still hold a single word, and callers
        #     must filter on its length;
        if is_get_synonyms:
            if node.prefix_count < 2:
                return set_prefix_words
        #print(prefix, " == ", "node.prefix_count: ", node.prefix_count)
        def _dfs(node, set_, prefix):
            '''
            Depth-first traversal that adds every word starting with prefix to set_;
            node is the TrieNode of the last character of prefix.
            '''
            # jy: If end_of_word is True on this node, prefix itself is a word in the
            #     trie, so add it to set_;
            if node.end_of_word:
                set_.add(prefix)
            # jy: Recurse into every child, appending the child's character to prefix;
            for char, child_node in node.children.items():
                _dfs(child_node, set_, prefix + char)
        _dfs(node, set_prefix_words, prefix)
        return set_prefix_words
    def get_prefix_words(self, is_get_synonyms=False):
        """
        Collect the words for every candidate prefix and store them in
        self.prefix_dict (nothing is returned).
        """
        for prefix in self.prefix_set:
            set_word = self.prefix_words(prefix, is_get_synonyms=is_get_synonyms)
            # jy: self.prefix_words already skips the traversal when
            #     node.prefix_count < 2, which avoids work that would be filtered
            #     out here anyway.
            #     Note: that check alone is not sufficient when a word can occur
            #           more than once, because node.prefix_count >= 2 may come from
            #           one repeated word. set_word can then still have length 1,
            #           so the length filter below is required;
            if is_get_synonyms:
                if len(set_word) > 1:
                    self.prefix_dict[prefix] = set_word
            else:
                if set_word:
                    self.prefix_dict[prefix] = set_word
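
The comment in `Trie_v1.__init__` notes that the node class can be dropped in favour of plain nested dicts, with an end-of-word marker stored inside each dict. The `Trie_v2` code is not shown in this section, so the following is only a minimal sketch of that idea and not the author's implementation; the class name `DictTrie`, the method name `starts_with` and the sentinel key `END` are assumptions made for illustration.

```python
class DictTrie:
    """Trie built from nested dicts; a sentinel key marks the end of a word."""

    # Hypothetical sentinel: any key that can never be a single character works.
    END = "__end__"

    def __init__(self):
        self.root = {}

    def insert(self, word):
        node = self.root
        for char in word:
            # setdefault creates the child dict only when it is missing.
            node = node.setdefault(char, {})
        # Store the number of times the word was inserted under the sentinel.
        node[self.END] = node.get(self.END, 0) + 1

    def search(self, word):
        node = self.root
        for char in word:
            node = node.get(char)
            if node is None:
                return (False, 0)
        count = node.get(self.END, 0)
        return (count > 0, count)

    def starts_with(self, prefix):
        node = self.root
        for char in prefix:
            node = node.get(char)
            if node is None:
                return False
        return True


trie = DictTrie()
for w in ["sub", "subject", "sub"]:
    trie.insert(w)
print(trie.search("sub"))        # (True, 2)
print(trie.search("subj"))       # (False, 0)
print(trie.starts_with("subj"))  # True
```

The trade-off is less code at the cost of mixing character keys and bookkeeping keys in the same dict, which is why the sentinel has to be something no single character can equal.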

1. Simple functional test

# jy: min_prefix_len is set to 2 for this demo; use a larger value in practice;
trie_tree = Trie_v1(is_lower_case=False, min_prefix_len=2)
# ls_words = ["apple", "app_03", "app_01", "app2_02", "subject", "sub"]
ls_words = ["a", "aa", "aff", "afcd", "aad", "b", "c", "d", "e"]
#ls_prefix = ["ap", "app", "sub"]
# jy: Insert the candidate words into the trie;
for word in ls_words:
    trie_tree.insert(word)
# jy: Collect the words for every candidate prefix in the trie;
trie_tree.get_prefix_words(is_get_synonyms=True)
# jy: Print the words found for each prefix (key and set order may vary):
#     {'af': {'aff', 'afcd'}, 'aa': {'aad', 'aa'}}
print(trie_tree.prefix_dict)
# jy: Check whether any word starts with the prefix, and count the words that do;
print(trie_tree.startsWith("a"))  # (True, 5)
print(trie_tree.startsWith("af")) # (True, 2)
# jy: Check whether a word exists, and how many times it was inserted;
print(trie_tree.search("aff"))    # (True, 1)
# jy: Get all synonym groups without their prefixes: ({'afcd', 'aff'}, {'aad', 'aa'})
tuple_synonyms = tuple(trie_tree.prefix_dict.values())
print(tuple_synonyms)
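
As a sanity check on small inputs, the same prefix groups can be computed without a trie by collecting words into a dict keyed by prefix. This is a different technique from `Trie_v1` (it stores every prefix explicitly, so it is likely to use more memory on a large vocabulary); the function name `group_by_prefix` is an assumption introduced here.

```python
from collections import defaultdict


def group_by_prefix(words, min_prefix_len=2):
    """Map each prefix of length >= min_prefix_len to the set of words sharing it."""
    groups = defaultdict(set)
    for word in words:
        # Every leading slice of the word, from min_prefix_len up to the full word.
        for end in range(min_prefix_len, len(word) + 1):
            groups[word[:end]].add(word)
    # Keep only prefixes shared by at least two distinct words.
    return {prefix: ws for prefix, ws in groups.items() if len(ws) > 1}


ls_words = ["a", "aa", "aff", "afcd", "aad", "b", "c", "d", "e"]
print(group_by_prefix(ls_words, min_prefix_len=2))
```

For the word list used in this test, the result should contain the same two groups as `trie_tree.prefix_dict`: one under `'aa'` and one under `'af'`.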

2. Simulated real-world test

text = """
A firm nodule on the arm. Cutaneous myoepithelioma.
Defective serotonin transport mechanism in platelets from endogenously depressed patients The kinetics of serotonin (I) uptake in platelets from 10 depressed patients were studied with a highly reproducible method which detects specific changes in the I transport by platelets. The 6 patients with endogenous depression had a disturbed I uptake. A passive diffusion of I predominated over the active I transport. This disturbance was not found in the 4 nonendogenously depressed patients or in the 50 normal controls. Thus, platelets from endogenously depressed patients may have abnormal phys. membrane characteristics.
Interfacial Reactions of Tetraphenylporphyrin with Cobalt-Oxide Thin Films We have studied the adsorption and interfacial reactions of 2H-tetraphenylporphyrin (2HTPP) with cobalt-terminated Co3O4(111) and oxygen-terminated CoO(111) thin films using synchrotron-radiation XPS. Already at 275 K, we find evidence for the formation of a metalated species, most likely CoTPP, on both surfaces. The degree of self-metalation increases with temperature on both surfaces until 475 K, where the metalation is almost complete. At 575 K the porphyrin coverage decreases drastically on the reducible cobalt-terminated Co3O4(111) surface, while higher temperatures are needed on the non-reducible oxygen-terminated CoO(111). The low temperature self-metalation is similar to that observed on MgO(100) surfaces, but drastically different from that observed on TiO2(110), where no self-metalation is observed at room temperature
Immunodiagnosis of Schistosomiasis mansoni with APIA (alkaline phosphatase immunoassay). The previously shown antigenicity of Schistosoma mansoni (JL venezuelan strain) alkaline phosphatase (Mg2+, pH 9.5) allowed its use in an immunodiagnosis assay, that consisted in the immunoadsorption of the enzymatic activity. Protein-A coated polyvinyl plates were used as solid phase to capture IgG from sera of infected human patients. After buffered saline washings, the plates were incubated with an enzyme-rich fraction (a n-butanol aqueous extract of adult worm obtained from pairs). Immunoadsorbed alkaline phosphatase (AP) was revealed by adding rho-nitrophenyl phosphate. Anti-AP antibodies were detected in 93% of coproparasitologically proven S. mansoni-infected venezuelan patients but not in parasite-free control sera and sera from patients infected with parasitosis other than schistosomiasis. The APIA did not correlate with cure but the anti-AP antibody response was progressively reduced after treatment. The use of an AP substrate amplifying system allowed an improvement of the assay sensitivity without loss of specificity. The data suggest that the APIA could be used as a marker of infection by S. mansoni.
Incidence of congenital uterine anomalies in repeated reproductive wastage and prognosis for pregnancy after metroplasty. To discover the exact incidence of congenital uterine anomalies among infertile patients, hysterosalpingography was performed on 1,200 married women with a history of repeated reproductive wastage. Out of 1,200 hysterosalpingographies, 188 revealed congenital uterine anomaly (15.7%). The degree of uterine cavity deformity in the anomalies was evaluated during hysterosalpingography using the X/M ratio. This indicated that the incidence of repeated spontaneous abortion in cases with low-grade anomalies is as high as the incidence among cases with more severe anomalies. A significant improvement in maintaining pregnancy was observed after metroplasty; more than 84% of postoperative pregnancies were successfully maintained, whereas none of the 233 presurgical pregnancies had lasted full term. As a control group, 47 other women with anomalies were randomly chosen, and their subsequent pregnancies were monitored, without metroplasty. Of their pregnancies, 94.4% terminated spontaneously before 12 weeks of gestation.
A novel functional estrogen receptor on human sperm membrane interferes with progesterone effects We have identified an estrogen receptor of approx. 29 kDa apparent mol. weight in human sperm membranes by ligand and Western blot anal., resp., using peroxidase-conjugated E2 and an antibody directed against the ligand binding region of the genomic receptor (αH222). Such receptor is functional since 17β-estradiol (17βE2) induces a rapid and sustained increase of intracellular calcium concentrations ([Ca2+]i) which is completely eliminated by preincubation with αH222. 17βE2 effects on calcium are clearly mediated by a membrane receptor, as they are reproduced by the membrane-impermeable conjugate of the hormone BSA-E2. Dose-response curve for this effect is biphasic with EC50s in the nanomolar and micromolar range. In addition to calcium increase, 17βE2 stimulates tyrosine phosphorylation of several sperm proteins including the 29-kDa protein band. Preincubation of human sperm with 17βE2 inhibits calcium and acrosome reaction increases in response to progesterone. We conclude that estrogens may play a role in the modulation of nongenomic action of progesterone (P) in human sperm during the process of fertilization.
Ig V region gene expression in small lymphocytic lymphoma with little or no somatic hypermutation Using the polymerase chain reaction, specific Ig κ-L chain V region gene (Vκ gene) rearrangement was investigated in small lymphocytic non-Hodgkin's lymphomas (SL NHL) that express Ig bearing a major κ-L chain associated cross-reactive Id, designated 17.109. Previously, the 17.109-cross-reactive Id was identified in chronic lymphocytic leukemia as a serol. marker for expression of a highly conserved Vκ gene, designated Humkv325. Using sense-strand oligonucleotides specific for the 5'-end of this Vκ gene and antisense oligonucleotide specific for a Jκ region consensus sequence, Humkv325 was amplified specifically when juxtaposed with Jκ through Ig gene rearrangement. This made it possible to amplify rearranged Vκ genes from DNA isolated from minute amounts of lymphoma biopsy material for mol. analyses. 17.109-Reactive SL NHL, with or without associated chronic lymphocytic leukemia, rearrange, and presumably express, Humkv325 without substantial somatic diversification. This data suggests that malignant B cells in SL NHL, in contrast to nonHodgkin's lymphoma of follicular central cell origin, may express Ig variable region genes with little or no somatic hypermutation.
Extensive burns caused by the abusive use of photosensitizing agents. Psoralens are photosensitizing agents used in dermatology as reinforcements in psoralen ultraviolet A-range therapy. We report observations of 14 young women hospitalized for severe burns caused by abusive use of psoralens. The burns were of superficial and deep second-degree depth and covered more than 76% of the body surface on average. All patients needed fluid resuscitation. Hospital stay was 11 days on average. Healing was obtained without skin grafting in all cases. Among the six patients who responded to the mailed questionnaire, negative effects are now present in all patients as inflammatory peaks. Two patients have esthetic sequelae such as dyschromia and scars. The misuse of photosensitizing agents poses many problems. These accidents are very expensive. The largeness of the burned surface can involve a fatal prognosis. And finally, one can suspect that a much larger portion of the population regularly uses these products without any serious accident. In this case carcinogenesis can be expected.
H1 antihistamines, cytochrome P450, and multidrug resistance. Is there a link? A review with 8 references is given on the 2nd generation cytochrome P45O-dependent H1-antihistamines and their relation to multidrug resistance discussing the biotransformation by cytochrome P 450, drug-drug interactions leading to unexpected side effects and toxicities, and drug interactions through modulation of the drug efflux pump Pgp.
Six ways to increase cosmetic dentistry in your practice.
"""
trie_tree = Trie_v1(is_lower_case=False, min_prefix_len=5)
# jy: Remove English punctuation from the text;
text_rm_punct = remove_en_punct(text)
ls_word = [word for word in text_rm_punct.split() if word]
for word in ls_word:
    trie_tree.insert(word)
trie_tree.get_prefix_words(is_get_synonyms=True)
#import pdb; pdb.set_trace()
#print(len(trie_tree.prefix_dict))
tuple_synonyms = tuple(trie_tree.prefix_dict.values())
ls_synonyms = []
for synonym in tuple_synonyms:
    if synonym not in ls_synonyms:
        ls_synonyms.append(synonym)
#print(len(ls_synonyms))
for synonym in ls_synonyms:
    print(synonym)
# jy: Sample output (the order of the sets and of their elements may vary):
"""
{'disturbance', 'disturbed'}
{'hysterosalpingography', 'hysterosalpingographies'}
{'immunoassay)', 'immunoadsorption'}
{'specific', 'specificity', 'specifically'}
{'metalation', 'metalated'}
{'enzymatic', 'enzyme-rich'}
{'Schistosomiasis', 'Schistosoma'}
{'lymphoma', 'lymphomas'}
{'endogenous', 'endogenously'}
{'anomaly', 'anomalies'}
{'rearrangement', 'rearranged', 'rearrange'}
{'antibody', 'antibodies'}
{'controls', 'control'}
{'phosphorylation', 'phosphate', 'phosphatase'}
{'lymphoma', 'lymphomas', 'lymphocytic'}
{'increases', 'increase'}
{'immunoassay)', 'immunoadsorption', 'immunodiagnosis'}
{'spontaneously', 'spontaneous'}
{'pregnancies', 'pregnancy'}
{'detects', 'detected'}
{'suggests', 'suggest'}
{'temperatures', 'temperature'}
{'psoralens', 'psoralen'}
{'amplified', 'amplifying', 'amplify'}
{'maintained', 'maintaining'}
{'depressed', 'depression'}
{'expression', 'express'}
{'completely', 'complete'}
{'membrane', 'membrane-impermeable', 'membranes'}
{'infection', 'infected'}
{'reduced', 'reducible'}
{'accidents', 'accident'}
{'surfaces', 'surface'}
{'effects', 'effect'}
{'interferes', 'interfacial'}
{'reactions', 'reaction'}
{'reproductive', 'reproduced', 'reproducible'}
{'Immunodiagnosis', 'Immunoadsorbed'}
{'responded', 'response'}
{'phosphate', 'phosphatase'}
{'oligonucleotides', 'oligonucleotide'}
{'activity', 'active'}
{'sequence', 'sequelae'}
{'mansoni', 'mansoni-infected'}
{'parasite-free', 'parasitosis'}
{'observed', 'observations'}
{'amplifying', 'amplify'}
{'substantial', 'substrate'}
{'several', 'severe'}
{'estrogens', 'estrogen'}
{'largeness', 'larger'}
{'presurgical', 'presumably'}
{'protein', 'proteins'}
{'controls', 'control', 'contrast'}
{'interactions', 'interferes', 'interfacial'}
{'coverage', 'covered'}
{'where', 'whereas'}
{'plates', 'platelets'}
{'consensus', 'conserved'}
{'specific', 'specificity', 'specifically', 'species'}
"""
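
The sample output contains tokens such as `'immunoassay)'` because `remove_en_punct` only replaces `. , ; : ? !` and leaves brackets attached to the word. One possible refinement, sketched here under the assumption that inner hyphens should survive (as in `enzyme-rich`), is to strip punctuation only from both ends of each token. The helper name `clean_tokens` is hypothetical.

```python
import string


def clean_tokens(text):
    """Split on whitespace and strip ASCII punctuation from both ends of each token."""
    tokens = []
    for raw in text.split():
        token = raw.strip(string.punctuation)
        if token:  # drop tokens that consisted of punctuation only
            tokens.append(token)
    return tokens


sample = "Immunodiagnosis with APIA (alkaline phosphatase immunoassay). An enzyme-rich fraction."
print(clean_tokens(sample))
```

Unlike `remove_en_punct`, this does not split on periods inside a token, so `15.7%` becomes `15.7` rather than `15` and `7%`. Whether that is preferable depends on the corpus.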

3. Large text file test

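#    For large files, consider preprocessing the input first to drop words that
#    cannot qualify (for example, words shorter than a given length),
#    or parallelizing the core methods above (insert, get_prefix_words).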
trie_tree = Trie_v1(is_lower_case=False, min_prefix_len=8)
#trie_tree = Trie_v1(is_lower_case=True, min_prefix_len=8)
f_name = "100w-paper_tlt_abst_en_lineMax_250.txt"
with io.open(f_name, "r", encoding="utf8") as f_:
    for idx, line in enumerate(f_):
        #if idx > 100:
        #    break
        text_rm_punct = remove_en_punct(line.strip())
        ls_word = [word for word in text_rm_punct.split() if word]
        for word in ls_word:
            trie_tree.insert(word)
trie_tree.get_prefix_words(is_get_synonyms=True)
#import pdb; pdb.set_trace()
#print(len(trie_tree.prefix_dict))
tuple_synonyms = tuple(trie_tree.prefix_dict.values())
ls_synonyms = []
for synonym in tuple_synonyms:
    if synonym not in ls_synonyms:
        ls_synonyms.append(synonym)
#print(len(ls_synonyms))
for synonym in ls_synonyms:
    print(synonym)
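
The note at the start of this section suggests pre-filtering words that cannot qualify. A minimal sketch of such a pre-filter, together with a hash-based de-duplication that avoids the quadratic `synonym not in ls_synonyms` scan, might look as follows. The names `iter_candidate_words` and `unique_groups` are assumptions, and an in-memory `io.StringIO` stands in for the real file.

```python
import io
import re


def iter_candidate_words(lines, min_prefix_len):
    """Yield words long enough to contribute at least one candidate prefix."""
    for line in lines:
        for word in re.sub(r"[.,;:?!]", " ", line).split():
            if len(word) >= min_prefix_len:
                yield word


def unique_groups(groups):
    """Drop duplicate sets while keeping first-seen order."""
    seen = set()
    result = []
    for group in groups:
        key = frozenset(group)  # sets are unhashable; frozensets are hashable
        if key not in seen:
            seen.add(key)
            result.append(group)
    return result


fake_file = io.StringIO("A firm nodule on the arm.\nCutaneous myoepithelioma, nodules.\n")
words = list(iter_candidate_words(fake_file, min_prefix_len=6))
print(words)  # ['nodule', 'Cutaneous', 'myoepithelioma', 'nodules']
print(unique_groups([{"a", "b"}, {"b", "a"}, {"c"}]))
```

Skipping short words should not change `prefix_dict`, since a word shorter than `min_prefix_len` cannot start with any candidate prefix. It does change the counts that `search` and `startsWith` report for those short words.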