Tried out the Chinese word vectors Tencent open-sourced last year, covering 8 million words

2019-11-28 18:51:24 +08:00
 leiuu

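I've been working on some word-embedding stuff lately and stumbled on the word vector model Tencent open-sourced last year:
https://mp.weixin.qq.com/s/b9NWR0F7GQLYtgGSL50gQw

The model covers 8 million Chinese words (though quite a few of the entries are erroneous words), but overall it is still pretty powerful.

I threw together a simple API, haha: https://zhuanlan.zhihu.com/p/94124468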
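
For anyone wondering how a "top similar words" lookup like the ones below might be computed, here is a minimal sketch. It assumes cosine similarity as the score, which is the usual choice for word vectors, though the post does not say what the API actually uses. The vectors in `TOY_VECTORS` are invented for illustration and are not taken from the Tencent model, and `top_similar` is a hypothetical helper name; only the output shape is copied from the responses in this thread.

```python
import json
import math

# Toy vectors made up for this example; NOT values from the Tencent model.
TOY_VECTORS = {
    "红烧肉": [0.9, 0.1, 0.0],
    "糖醋排骨": [0.8, 0.2, 0.1],
    "烤串": [0.5, 0.5, 0.1],
    "ojbk": [0.0, 0.1, 0.9],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def top_similar(word, vectors, topn=20):
    """Rank every other word by cosine similarity to `word` (hypothetical helper)."""
    query = vectors[word]
    scored = [
        [other, cosine(query, vec)]
        for other, vec in vectors.items()
        if other != word
    ]
    # Highest similarity first, then keep the top `topn` entries.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return {"top_similar_words": scored[:topn], "word": word}


if __name__ == "__main__":
    result = top_similar("红烧肉", TOY_VECTORS, topn=2)
    # ensure_ascii=False keeps the Chinese words readable in the output.
    print(json.dumps(result, ensure_ascii=False, indent=3))
```

A brute-force scan like this is fine for a toy dictionary; with 8 million entries you would more likely rely on a library that vectorizes or indexes the search.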

Some interesting tests:
1. Similar words for 红烧肉 (red-braised pork)
output:

{
   "top_similar_words":[
      [
         "糖醋排骨",
         0.8907967209815979
      ],
      [
         "红烧排骨",
         0.8726683259010315
      ],
      [
         "回锅肉",
         0.858664333820343
      ],
      [
         "红烧鱼",
         0.8542774319648743
      ],
      [
         "梅菜扣肉",
         0.8500987887382507
      ],
      [
         "糖醋小排",
         0.8475514650344849
      ],
      [
         "小炒肉",
         0.8435966968536377
      ],
      [
         "红烧五花肉",
         0.8424086570739746
      ],
      [
         "红烧肘子",
         0.8400496244430542
      ],
      [
         "糖醋里脊",
         0.8381932377815247
      ],
      [
         "红烧猪蹄",
         0.8374584913253784
      ],
      [
         "青椒炒肉",
         0.8344883918762207
      ],
      [
         "粉蒸肉",
         0.8337559700012207
      ],
      [
         "水煮肉片",
         0.8311598300933838
      ],
      [
         "青椒肉丝",
         0.8294434547424316
      ],
      [
         "鱼香茄子",
         0.8291393518447876
      ],
      [
         "烧茄子",
         0.8272593021392822
      ],
      [
         "梅干菜扣肉",
         0.8267726898193359
      ],
      [
         "土豆炖牛肉",
         0.8263725638389587
      ],
      [
         "红烧茄子",
         0.8244959115982056
      ]
   ],
   "word":"红烧肉"
}

2. Similar words for 因吹斯汀 (a phonetic rendering of "interesting")
output:

{
   "top_similar_words":[
      [
         "一颗赛艇",
         0.7618176937103271
      ],
      [
         "因吹斯听",
         0.7523878812789917
      ],
      [
         "城会玩",
         0.6856077909469604
      ],
      [
         "厉害了 word 哥",
         0.6615914702415466
      ],
      [
         "emmmmm",
         0.6590334177017212
      ],
      [
         "扎心了老铁",
         0.6527535915374756
      ],
      [
         "神吐槽",
         0.6382066011428833
      ],
      [
         "可以说是非常爆笑了",
         0.6365567445755005
      ],
      [
         "不明觉厉",
         0.6362186670303345
      ],
      [
         "段子哥",
         0.6293908357620239
      ],
      [
         "厉害了我的哥",
         0.6265187859535217
      ],
      [
         "脑洞大开",
         0.6255093216896057
      ],
      [
         "hhhhhh",
         0.6220428943634033
      ],
      [
         "233333",
         0.6189173460006714
      ],
      [
         "没想到你是这样的",
         0.6184067726135254
      ],
      [
         "屌炸天",
         0.6119771003723145
      ],
      [
         "interesting",
         0.6102393865585327
      ],
      [
         "emmmmmmm",
         0.6097372770309448
      ],
      [
         "开脑洞",
         0.6095746755599976
      ],
      [
         "猴赛雷",
         0.6095525026321411
      ]
   ],
   "word":"因吹斯汀"
}

3. Similar words for ojbk
output:

{
   "top_similar_words":[
      [
         "我觉得 ok",
         0.6393940448760986
      ],
      [
         "emmmmmmm",
         0.6306545734405518
      ],
      [
         "hhhh",
         0.6229800581932068
      ],
      [
         "hhhhh",
         0.6225401163101196
      ],
      [
         "不存在的",
         0.6077110767364502
      ],
      [
         "溜了溜了",
         0.603063702583313
      ],
      [
         "hhhhhhh",
         0.6008774638175964
      ],
      [
         "emmmm",
         0.6002634167671204
      ],
      [
         "emmm",
         0.5958442687988281
      ],
      [
         "emmmmm",
         0.592516303062439
      ],
      [
         "阿喵",
         0.5918310880661011
      ],
      [
         "哈哈哈",
         0.590988039970398
      ],
      [
         "略略略",
         0.590296745300293
      ],
      [
         "hhhhhh",
         0.5870903730392456
      ],
      [
         "微笑脸",
         0.5860881209373474
      ],
      [
         "tan90°",
         0.5825910568237305
      ],
      [
         "没毛病",
         0.5802331566810608
      ],
      [
         "233333",
         0.5794929265975952
      ],
      [
         "我觉得不行",
         0.5762011408805847
      ],
      [
         "就酱",
         0.5751103162765503
      ]
   ],
   "word":"ojbk"
}
4549 clicks
Node: Share and Discover
12 replies
nieyujiang
2019-11-28 18:52:46 +08:00
The similar words for 红烧肉 straight up made me hungry
leiuu
2019-11-28 18:53:28 +08:00
@nieyujiang Haha, if you don't know what to eat tonight, just let this model recommend something
nieyujiang
2019-11-28 18:59:11 +08:00
@leiuu #2 I'd put on three jin (1.5 kg) as soon as I finished eating  🤣
leiuu
2019-11-28 19:00:42 +08:00
@nieyujiang
There's more. Similar words for 烤串 (grilled skewers):
```json
{
"top_similar_words":[
[
"我觉得 ok",
0.6393940448760986
],
[
"emmmmmmm",
0.6306545734405518
],
[
"hhhh",
0.6229800581932068
],
[
"hhhhh",
0.6225401163101196
],
[
"不存在的",
0.6077110767364502
],
[
"溜了溜了",
0.603063702583313
],
[
"hhhhhhh",
0.6008774638175964
],
[
"emmmm",
0.6002634167671204
],
[
"emmm",
0.5958442687988281
],
[
"emmmmm",
0.592516303062439
],
[
"阿喵",
0.5918310880661011
],
[
"哈哈哈",
0.590988039970398
],
[
"略略略",
0.590296745300293
],
[
"hhhhhh",
0.5870903730392456
],
[
"微笑脸",
0.5860881209373474
],
[
"tan90°",
0.5825910568237305
],
[
"没毛病",
0.5802331566810608
],
[
"233333",
0.5794929265975952
],
[
"我觉得不行",
0.5762011408805847
],
[
"就酱",
0.5751103162765503
]
],
"word":"ojbk"
}
```
leiuu
2019-11-28 19:01:40 +08:00
@nieyujiang Oops, pasted the wrong one. Let me redo that.
{
"top_similar_words":[
[
"烤串儿",
0.927384614944458
],
[
"羊肉串",
0.894095778465271
],
[
"肉串",
0.8555537462234497
],
[
"烤腰子",
0.8516057729721069
],
[
"撸串",
0.8469321727752686
],
[
"涮串",
0.8465385437011719
],
[
"大肉串",
0.8420960903167725
],
[
"烤肉串",
0.838364839553833
],
[
"牛肉串",
0.8371975421905518
],
[
"烤海鲜",
0.8364357948303223
],
[
"烧烤摊",
0.8351374864578247
],
[
"炸串",
0.8339198231697083
],
[
"烧烤",
0.831093430519104
],
[
"烤羊肉串",
0.8277176022529602
],
[
"各种烤串",
0.8274507522583008
],
[
"烤鱿鱼",
0.8235615491867065
],
[
"烤羊腿",
0.8228681683540344
],
[
"烤猪蹄",
0.8225207328796387
],
[
"烤生蚝",
0.8220213055610657
],
[
"吃串",
0.820912778377533
]
],
"word":"烤串"
}
DEANHZED
2019-11-28 19:20:40 +08:00
emmmmm
devallin
2019-11-28 20:09:16 +08:00
Why was my first thought to use this for rewording a paper to lower its plagiarism-check similarity score?
leiuu
2019-11-28 22:14:46 +08:00
@DEANHZED emmmmmmmmmmmm

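@devallin There are probably other ways to do that. This model works well for computing similarity between words; it isn't easy to use directly for similarity between sentences.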
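
To illustrate why sentences need an extra step: one common baseline (not something I've described or tested with this model) is to average the word vectors of each sentence and compare the averages. The sketch below assumes the sentences are already segmented into tokens, which for Chinese would require a separate word segmenter, and it uses made-up toy vectors. `mean_vector` and `sentence_similarity` are hypothetical helper names.

```python
import math

# Toy vectors invented for this example; NOT values from the Tencent model.
TOY_VECTORS = {
    "红烧肉": [0.9, 0.1, 0.0],
    "好吃": [0.7, 0.3, 0.1],
    "烤串": [0.6, 0.4, 0.0],
    "论文": [0.0, 0.2, 0.9],
}


def mean_vector(tokens, vectors):
    """Average the vectors of all known tokens; None if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def sentence_similarity(tokens_a, tokens_b, vectors):
    """Cosine similarity of the two averaged sentence vectors (baseline only)."""
    vec_a = mean_vector(tokens_a, vectors)
    vec_b = mean_vector(tokens_b, vectors)
    if vec_a is None or vec_b is None:
        return None  # at least one sentence has no in-vocabulary token
    return cosine(vec_a, vec_b)


if __name__ == "__main__":
    print(sentence_similarity(["红烧肉", "好吃"], ["烤串", "好吃"], TOY_VECTORS))
    print(sentence_similarity(["红烧肉", "好吃"], ["论文"], TOY_VECTORS))
```

Averaging throws away word order, so this is only a rough baseline.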
elfive
2019-11-29 08:26:32 +08:00
These words were all damn well mined from WeChat and QQ chat messages, weren't they.
leiuu
2019-11-29 11:20:12 +08:00
@elfive
This is what the official description says.
Data collection.
Our training data contains large-scale text collected from news, webpages, and novels. Text data from diverse domains enables the coverage of various types of words and phrases. Moreover, the recently collected webpages and news data enable us to learn the semantic representations of fresh words.

Vocabulary building. To enrich our vocabulary, we involve phrases in Wikipedia and Baidu Baike. We also apply the phrase discovery approach in Corpus-based Semantic Class Mining: Distributional vs. Pattern-Based Approaches, which enhances the coverage of emerging phrases.

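Roughly, it says they used data from news, webpages, novels, Wikipedia, and Baidu Baike.
Chat data isn't mentioned, but news sites and webpages all carry user comments, so those may be one of the data sources too.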
aalikes95
2019-11-29 15:47:11 +08:00
Looks pretty good
leiuu
2019-11-29 15:55:09 +08:00
@aalikes95
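Overall it's pretty good. Search for a few words and many of them turn up pleasant surprises.
The bugs are fairly obvious too, though: quite a few erroneous words. It also can't be updated incrementally.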
