Wiki word vectors

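We are publishing pre-trained word vectors for 294 languages, trained on Wikipedia using fastText. These 300-dimensional vectors were obtained using the skip-gram model described in Bojanowski et al. (2016) with default parameters.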

Please note that a newer version of the multilingual word vectors is available at https://fasttext.cc/docs/en/crawl-vectors.html.

Models

The models can be downloaded from the list below:

Abkhazian: bin+text, text Acehnese: bin+text, text Adyghe: bin+text, text
Afar: bin+text, text Afrikaans: bin+text, text Akan: bin+text, text
Albanian: bin+text, text Alemannic: bin+text, text Amharic: bin+text, text
Anglo_Saxon: bin+text, text Arabic: bin+text, text Aragonese: bin+text, text
Aramaic: bin+text, text Armenian: bin+text, text Aromanian: bin+text, text
Assamese: bin+text, text Asturian: bin+text, text Avar: bin+text, text
Aymara: bin+text, text Azerbaijani: bin+text, text Bambara: bin+text, text
Banjar: bin+text, text Banyumasan: bin+text, text Bashkir: bin+text, text
Basque: bin+text, text Bavarian: bin+text, text Belarusian: bin+text, text
Bengali: bin+text, text Bihari: bin+text, text Bishnupriya Manipuri: bin+text, text
Bislama: bin+text, text Bosnian: bin+text, text Breton: bin+text, text
Buginese: bin+text, text Bulgarian: bin+text, text Burmese: bin+text, text
Buryat: bin+text, text Cantonese: bin+text, text Catalan: bin+text, text
Cebuano: bin+text, text Central Bicolano: bin+text, text Chamorro: bin+text, text
Chavacano: bin+text, text Chechen: bin+text, text Cherokee: bin+text, text
Cheyenne: bin+text, text Chichewa: bin+text, text Chinese: bin+text, text
Choctaw: bin+text, text Chuvash: bin+text, text Classical Chinese: bin+text, text
Cornish: bin+text, text Corsican: bin+text, text Cree: bin+text, text
Crimean Tatar: bin+text, text Croatian: bin+text, text Czech: bin+text, text
Danish: bin+text, text Divehi: bin+text, text Dutch: bin+text, text
Dutch Low Saxon: bin+text, text Dzongkha: bin+text, text Eastern Punjabi: bin+text, text
Egyptian Arabic: bin+text, text Emilian_Romagnol: bin+text, text English: bin+text, text
Erzya: bin+text, text Esperanto: bin+text, text Estonian: bin+text, text
Ewe: bin+text, text Extremaduran: bin+text, text Faroese: bin+text, text
Fiji Hindi: bin+text, text Fijian: bin+text, text Finnish: bin+text, text
Franco_Provençal: bin+text, text French: bin+text, text Friulian: bin+text, text
Fula: bin+text, text Gagauz: bin+text, text Galician: bin+text, text
Gan: bin+text, text Georgian: bin+text, text German: bin+text, text
Gilaki: bin+text, text Goan Konkani: bin+text, text Gothic: bin+text, text
Greek: bin+text, text Greenlandic: bin+text, text Guarani: bin+text, text
Gujarati: bin+text, text Haitian: bin+text, text Hakka: bin+text, text
Hausa: bin+text, text Hawaiian: bin+text, text Hebrew: bin+text, text
Herero: bin+text, text Hill Mari: bin+text, text Hindi: bin+text, text
Hiri Motu: bin+text, text Hungarian: bin+text, text Icelandic: bin+text, text
Ido: bin+text, text Igbo: bin+text, text Ilokano: bin+text, text
Indonesian: bin+text, text Interlingua: bin+text, text Interlingue: bin+text, text
Inuktitut: bin+text, text Inupiak: bin+text, text Irish: bin+text, text
Italian: bin+text, text Jamaican Patois: bin+text, text Japanese: bin+text, text
Javanese: bin+text, text Kabardian: bin+text, text Kabyle: bin+text, text
Kalmyk: bin+text, text Kannada: bin+text, text Kanuri: bin+text, text
Kapampangan: bin+text, text Karachay_Balkar: bin+text, text Karakalpak: bin+text, text
Kashmiri: bin+text, text Kashubian: bin+text, text Kazakh: bin+text, text
Khmer: bin+text, text Kikuyu: bin+text, text Kinyarwanda: bin+text, text
Kirghiz: bin+text, text Kirundi: bin+text, text Komi: bin+text, text
Komi_Permyak: bin+text, text Kongo: bin+text, text Korean: bin+text, text
Kuanyama: bin+text, text Kurdish (Kurmanji): bin+text, text Kurdish (Sorani): bin+text, text
Ladino: bin+text, text Lak: bin+text, text Lao: bin+text, text
Latgalian: bin+text, text Latin: bin+text, text Latvian: bin+text, text
Lezgian: bin+text, text Ligurian: bin+text, text Limburgish: bin+text, text
Lingala: bin+text, text Lithuanian: bin+text, text Livvi_Karelian: bin+text, text
Lojban: bin+text, text Lombard: bin+text, text Low Saxon: bin+text, text
Lower Sorbian: bin+text, text Luganda: bin+text, text Luxembourgish: bin+text, text
Macedonian: bin+text, text Maithili: bin+text, text Malagasy: bin+text, text
Malay: bin+text, text Malayalam: bin+text, text Maltese: bin+text, text
Manx: bin+text, text Maori: bin+text, text Marathi: bin+text, text
Marshallese: bin+text, text Mazandarani: bin+text, text Meadow Mari: bin+text, text
Min Dong: bin+text, text Min Nan: bin+text, text Minangkabau: bin+text, text
Mingrelian: bin+text, text Mirandese: bin+text, text Moksha: bin+text, text
Moldovan: bin+text, text Mongolian: bin+text, text Muscogee: bin+text, text
Nahuatl: bin+text, text Nauruan: bin+text, text Navajo: bin+text, text
Ndonga: bin+text, text Neapolitan: bin+text, text Nepali: bin+text, text
Newar: bin+text, text Norfolk: bin+text, text Norman: bin+text, text
North Frisian: bin+text, text Northern Luri: bin+text, text Northern Sami: bin+text, text
Northern Sotho: bin+text, text Norwegian (Bokmål): bin+text, text Norwegian (Nynorsk): bin+text, text
Novial: bin+text, text Nuosu: bin+text, text Occitan: bin+text, text
Old Church Slavonic: bin+text, text Oriya: bin+text, text Oromo: bin+text, text
Ossetian: bin+text, text Palatinate German: bin+text, text Pali: bin+text, text
Pangasinan: bin+text, text Papiamentu: bin+text, text Pashto: bin+text, text
Pennsylvania German: bin+text, text Persian: bin+text, text Picard: bin+text, text
Piedmontese: bin+text, text Polish: bin+text, text Pontic: bin+text, text
Portuguese: bin+text, text Quechua: bin+text, text Ripuarian: bin+text, text
Romani: bin+text, text Romanian: bin+text, text Romansh: bin+text, text
Russian: bin+text, text Rusyn: bin+text, text Sakha: bin+text, text
Samoan: bin+text, text Samogitian: bin+text, text Sango: bin+text, text
Sanskrit: bin+text, text Sardinian: bin+text, text Saterland Frisian: bin+text, text
Scots: bin+text, text Scottish Gaelic: bin+text, text Serbian: bin+text, text
Serbo_Croatian: bin+text, text Sesotho: bin+text, text Shona: bin+text, text
Sicilian: bin+text, text Silesian: bin+text, text Simple English: bin+text, text
Sindhi: bin+text, text Sinhalese: bin+text, text Slovak: bin+text, text
Slovenian: bin+text, text Somali: bin+text, text Southern Azerbaijani: bin+text, text
Spanish: bin+text, text Sranan: bin+text, text Sundanese: bin+text, text
Swahili: bin+text, text Swati: bin+text, text Swedish: bin+text, text
Tagalog: bin+text, text Tahitian: bin+text, text Tajik: bin+text, text
Tamil: bin+text, text Tarantino: bin+text, text Tatar: bin+text, text
Telugu: bin+text, text Tetum: bin+text, text Thai: bin+text, text
Tibetan: bin+text, text Tigrinya: bin+text, text Tok Pisin: bin+text, text
Tongan: bin+text, text Tsonga: bin+text, text Tswana: bin+text, text
Tulu: bin+text, text Tumbuka: bin+text, text Turkish: bin+text, text
Turkmen: bin+text, text Tuvan: bin+text, text Twi: bin+text, text
Udmurt: bin+text, text Ukrainian: bin+text, text Upper Sorbian: bin+text, text
Urdu: bin+text, text Uyghur: bin+text, text Uzbek: bin+text, text
Venda: bin+text, text Venetian: bin+text, text Vepsian: bin+text, text
Vietnamese: bin+text, text Volapük: bin+text, text Võro: bin+text, text
Walloon: bin+text, text Waray: bin+text, text Welsh: bin+text, text
West Flemish: bin+text, text West Frisian: bin+text, text Western Punjabi: bin+text, text
Wolof: bin+text, text Wu: bin+text, text Xhosa: bin+text, text
Yiddish: bin+text, text Yoruba: bin+text, text Zazaki: bin+text, text
Zeelandic: bin+text, text Zhuang: bin+text, text Zulu: bin+text, text

Format

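The word vectors come in both the binary and the text default formats of fastText. In the text format, each line contains a word followed by its vector. The values are separated by spaces. Words are ordered by descending frequency.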
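The text format described above can be read with a few lines of Python. The following is a minimal sketch, not an official loader: the function name parse_vec_lines and the sample data are made up for illustration, and the code assumes the file may start with a header line holding two integers (word count and dimension), which it skips if present.

```python
import io
from typing import Dict, Iterable, List, Tuple


def parse_vec_lines(lines: Iterable[str]) -> Tuple[List[str], Dict[str, List[float]]]:
    """Parse text-format vectors: each line is a word followed by its values."""
    words: List[str] = []                 # keeps file order (descending frequency)
    vectors: Dict[str, List[float]] = {}
    for i, raw in enumerate(lines):
        # Split on plain spaces only and drop empty pieces, so a trailing
        # space at the end of a line does not produce an empty value.
        parts = [p for p in raw.rstrip("\n").split(" ") if p]
        # Assumption: an optional first line "<count> <dim>" is a header.
        if i == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue
        if len(parts) < 2:
            continue                      # skip blank or malformed lines
        word = parts[0]
        vectors[word] = [float(x) for x in parts[1:]]
        words.append(word)
    return words, vectors


# Tiny made-up sample: 2 words, 3 dimensions (real files use 300).
sample = "2 3\nthe 0.1 0.2 0.3 \ncat -1.0 0.5 2.0 \n"
words, vectors = parse_vec_lines(io.StringIO(sample))
print(words)           # ['the', 'cat']
print(vectors["cat"])  # [-1.0, 0.5, 2.0]
```

To read a downloaded file, pass an open file object instead of the sample, for example open(path, encoding="utf-8"), where path is wherever you saved the text file. The full vocabulary for a large language can take several gigabytes of memory when held in plain Python lists, so stopping after the first N lines is a reasonable way to keep only the most frequent words.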

License

The word vectors are distributed under the Creative Commons Attribution-ShareAlike 3.0 license.

References

If you use these word vectors, please cite the following paper:

P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information

@article{bojanowski2016enriching,
  title={Enriching Word Vectors with Subword Information},
  author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.04606},
  year={2016}
}