A general-purpose web page similarity checker built on the text similarity algorithms provided by the word segmenter

Posted: 2015-06-01   Author: yangshangchuan   Source: reposted
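Implementation code: a general-purpose web page similarity checker built on the text similarity algorithms provided by the word segmenter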
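The results below report ten scores for each pair of posts. As a rough guide to what some of them measure, here is a minimal Python sketch. It is not the word segmenter's own implementation: it assumes both pages have already been segmented into token lists, and every function name is hypothetical. The last three helpers encode the standard Dice/Jaccard identity, the usual Jaro-Winkler prefix boost (scaling factor 0.1, prefix capped at 4 characters), and an assumed mapping from a Hamming distance to a score; the 128-bit fingerprint width is a guess based on the reported values, not something the source states.

```python
from collections import Counter
from math import sqrt


def jaccard(tokens_a, tokens_b):
    """Jaccard similarity of two token sets: |A & B| / |A | B|."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0  # two empty documents are treated as identical
    return len(a & b) / len(a | b)


def sorensen_dice(tokens_a, tokens_b):
    """Sorensen-Dice coefficient of two token sets: 2|A & B| / (|A| + |B|)."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def cosine(tokens_a, tokens_b):
    """Cosine similarity of the two term-frequency vectors."""
    fa, fb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(fa[t] * fb[t] for t in fa.keys() & fb.keys())
    norm = sqrt(sum(v * v for v in fa.values())) * sqrt(sum(v * v for v in fb.values()))
    return dot / norm if norm else 0.0


def dice_from_jaccard(j):
    """Standard identity between the two set measures: D = 2J / (1 + J)."""
    return 2 * j / (1 + j)


def jaro_winkler_from_jaro(jaro, prefix_len=4, scale=0.1):
    """Jaro-Winkler boost: jaro + l * p * (1 - jaro), with l capped at 4."""
    return jaro + min(prefix_len, 4) * scale * (1 - jaro)


def simhash_score(hamming_distance, bits=128):
    """Assumed mapping from a fingerprint Hamming distance to a score in [0, 1]."""
    return 1 - hamming_distance / bits
```

The reported scores appear consistent with these relations. For post 1, Jaccard=0.859838 gives a Dice value of about 0.924638, JaroDistance=0.824469 with a 4-character common prefix gives about 0.894682, and SimHashPlusHammingDistance=0.976563 matches 1 - 3/128. The same prefix boost would also explain why every pair with JaroDistance=0.0 shows JaroWinklerDistance=0.4.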

Run results:

Number of posts checked: 128

1. Post checked: Word usage analysis of 192 software books (5): most complex vocabulary (level 99), similarity scores: Simple=0.968589 Cosine=0.955598 EditDistance=0.916884 EuclideanDistance=0.00825 ManhattanDistance=0.001209 Jaccard=0.859838 JaroDistance=0.824469 JaroWinklerDistance=0.894682 SørensenDiceCoefficient=0.924638 SimHashPlusHammingDistance=0.976563

Post URL 1: http://my.oschina.net/apdplat/blog/388816
Post URL 2: http://yangshangchuan.iteye.com/blog/2194214

2. Post checked: An analysis of APDPlat's system startup and shutdown process, similarity scores: Simple=0.837996 Cosine=0.711649 EditDistance=0.55001 EuclideanDistance=0.003669 ManhattanDistance=0.000992 Jaccard=0.549422 JaroDistance=0.0 JaroWinklerDistance=0.4 SørensenDiceCoefficient=0.709196 SimHashPlusHammingDistance=0.890625

Post URL 1: http://my.oschina.net/apdplat/blog/197067
Post URL 2: http://yangshangchuan.iteye.com/blog/2010808

3. Post checked: Big Data Series 8: Sqoop – data exchange between HADOOP and RDBMS, similarity scores: Simple=0.421336 Cosine=0.856138 EditDistance=0.235512 EuclideanDistance=0.004961 ManhattanDistance=0.00052 Jaccard=0.379686 JaroDistance=0.595918 JaroWinklerDistance=0.757551 SørensenDiceCoefficient=0.550394 SimHashPlusHammingDistance=0.960938

Post URL 1: http://my.oschina.net/apdplat/blog/396681
Post URL 2: http://yangshangchuan.iteye.com/blog/1950171

4. Post checked: A classic case of web robot detection and attack/defense (i.e. a classic case of crawlers versus anti-crawlers), similarity scores: Simple=0.900024 Cosine=0.984046 EditDistance=0.709709 EuclideanDistance=0.002209 ManhattanDistance=0.000849 Jaccard=0.63695 JaroDistance=0.769486 JaroWinklerDistance=0.861692 SørensenDiceCoefficient=0.778215 SimHashPlusHammingDistance=1.0

Post URL 1: http://my.oschina.net/apdplat/blog/399090
Post URL 2: http://yangshangchuan.iteye.com/blog/2201599

5. Post checked: Error running nutch under Cygwin: Failed to set permissions of path, similarity scores: Simple=0.59337 Cosine=0.845891 EditDistance=0.161141 EuclideanDistance=0.019639 ManhattanDistance=0.001314 Jaccard=0.350962 JaroDistance=0.0 JaroWinklerDistance=0.4 SørensenDiceCoefficient=0.519573 SimHashPlusHammingDistance=1.0

Post URL 1: http://my.oschina.net/apdplat/blog/396698
Post URL 2: http://yangshangchuan.iteye.com/blog/1839784

6. Post checked: nutch2.1+mysql errors and fixes, similarity scores: Simple=0.655839 Cosine=0.935856 EditDistance=0.306204 EuclideanDistance=0.019631 ManhattanDistance=0.001353 Jaccard=0.340532 JaroDistance=0.693991 JaroWinklerDistance=0.816395 SørensenDiceCoefficient=0.508055 SimHashPlusHammingDistance=1.0

Post URL 1: http://my.oschina.net/apdplat/blog/397144
Post URL 2: http://yangshangchuan.iteye.com/blog/1839782

7. Post checked: How to use Jasypt in your application to protect your database username and password, similarity scores: Simple=0.828777 Cosine=0.53085 EditDistance=0.520455 EuclideanDistance=0.002354 ManhattanDistance=0.000858 Jaccard=0.472622 JaroDistance=0.716332 JaroWinklerDistance=0.829799 SørensenDiceCoefficient=0.641879 SimHashPlusHammingDistance=0.859375

Post URL 1: http://my.oschina.net/apdplat/blog/405306
Post URL 2: http://yangshangchuan.iteye.com/blog/2205240

8. Post checked: cws_evaluation v1.1 released: a comparative evaluation of Chinese word segmenters, similarity scores: Simple=0.534674 Cosine=0.739811 EditDistance=0.126522 EuclideanDistance=0.019823 ManhattanDistance=0.001362 Jaccard=0.291045 JaroDistance=0.578053 JaroWinklerDistance=0.746832 SørensenDiceCoefficient=0.450867 SimHashPlusHammingDistance=0.9375

Post URL 1: http://my.oschina.net/apdplat/blog/413623
Post URL 2: http://yangshangchuan.iteye.com/blog/2210409

9. Post checked: Computing how ITEYE blog posts are indexed and ranked on Baidu, similarity scores: Simple=0.840311 Cosine=0.945342 EditDistance=0.5628 EuclideanDistance=0.019775 ManhattanDistance=0.001333 Jaccard=0.657439 JaroDistance=0.0 JaroWinklerDistance=0.4 SørensenDiceCoefficient=0.793319 SimHashPlusHammingDistance=0.984375

Post URL 1: http://my.oschina.net/apdplat/blog/395970
Post URL 2: http://yangshangchuan.iteye.com/blog/2199419

10. Post checked: Installing a multi-node fully distributed HADOOP cluster on Ubuntu, similarity scores: Simple=0.754929 Cosine=0.945111 EditDistance=0.462486 EuclideanDistance=0.01599 ManhattanDistance=0.001279 Jaccard=0.434655 JaroDistance=0.723925 JaroWinklerDistance=0.834355 SørensenDiceCoefficient=0.605937 SimHashPlusHammingDistance=1.0

Post URL 1: http://my.oschina.net/apdplat/blog/397146
Post URL 2: http://yangshangchuan.iteye.com/blog/1840481

11. Post checked: How to integrate the question answering system QuestionAnsweringSystem into your application?, similarity scores: Simple=0.704358 Cosine=0.705715 EditDistance=0.235162 EuclideanDistance=0.007757 ManhattanDistance=0.001188 Jaccard=0.427877 JaroDistance=0.0 JaroWinklerDistance=0.4 SørensenDiceCoefficient=0.599319 SimHashPlusHammingDistance=0.890625

Post URL 1: http://my.oschina.net/apdplat/blog/308397
Post URL 2: http://yangshangchuan.iteye.com/blog/2108250

12. Post checked: Design and implementation of database backup and restore in APDPlat, similarity scores: Simple=0.950623 Cosine=0.674034 EditDistance=0.668374 EuclideanDistance=0.000293 ManhattanDistance=0.00024 Jaccard=0.666019 JaroDistance=0.741278 JaroWinklerDistance=0.844767 SørensenDiceCoefficient=0.799534 SimHashPlusHammingDistance=0.96875

Post URL 1: http://my.oschina.net/apdplat/blog/196912
Post URL 2: http://yangshangchuan.iteye.com/blog/2010680

13. Post checked: Double your vocabulary in one month, similarity scores: Simple=0.801655 Cosine=0.184972 EditDistance=0.64471 EuclideanDistance=0.000351 ManhattanDistance=0.000117 Jaccard=0.742542 JaroDistance=0.721516 JaroWinklerDistance=0.83291 SørensenDiceCoefficient=0.852252 SimHashPlusHammingDistance=0.851563

Post URL 1: http://my.oschina.net/apdplat/blog/379303
Post URL 2: http://yangshangchuan.iteye.com/blog/2186301

14. Post checked: Configuring Cygwin for passwordless SSH login, similarity scores: Simple=0.536943 Cosine=0.695802 EditDistance=0.09919 EuclideanDistance=0.019992 ManhattanDistance=0.00137 Jaccard=0.321306 JaroDistance=0.612527 JaroWinklerDistance=0.767516 SørensenDiceCoefficient=0.486346 SimHashPlusHammingDistance=0.984375

Post URL 1: http://my.oschina.net/apdplat/blog/397057
Post URL 2: http://yangshangchuan.iteye.com/blog/1839812

15. Post checked: 1208 compound words, similarity scores: Simple=0.965804 Cosine=0.097513 EditDistance=0.813562 EuclideanDistance=0.000415 ManhattanDistance=0.000281 Jaccard=0.928756 JaroDistance=0.779024 JaroWinklerDistance=0.867414 SørensenDiceCoefficient=0.963062 SimHashPlusHammingDistance=0.875

Post URL 1: http://my.oschina.net/apdplat/blog/393724
Post URL 2: http://yangshangchuan.iteye.com/blog/2197556

16. Post checked: An in-depth analysis of the 30 most frequent words in the JDK class library source code, similarity scores: Simple=0.57495 Cosine=0.772278 EditDistance=0.092555 EuclideanDistance=0.020669 ManhattanDistance=0.001404 Jaccard=0.398703 JaroDistance=0.59485 JaroWinklerDistance=0.75691 SørensenDiceCoefficient=0.570104 SimHashPlusHammingDistance=0.945313

Post URL 1: http://my.oschina.net/apdplat/blog/390615
Post URL 2: http://yangshangchuan.iteye.com/blog/2194885

17. Post checked: Analyzing the role of 996 word roots in major exam syllabus vocabularies (5): summary and selections, similarity scores: Simple=0.864495 Cosine=0.923546 EditDistance=0.535333 EuclideanDistance=0.00042 ManhattanDistance=0.000059 Jaccard=0.775869 JaroDistance=0.683946 JaroWinklerDistance=0.810368 SørensenDiceCoefficient=0.873791 SimHashPlusHammingDistance=0.929688

Post URL 1: http://my.oschina.net/apdplat/blog/391865
Post URL 2: http://yangshangchuan.iteye.com/blog/2195991

18. Post checked: HBase on CAP, similarity scores: Simple=0.630322 Cosine=0.729626 EditDistance=0.255472 EuclideanDistance=0.020057 ManhattanDistance=0.001366 Jaccard=0.353968 JaroDistance=0.680505 JaroWinklerDistance=0.808303 SørensenDiceCoefficient=0.52286 SimHashPlusHammingDistance=1.0

Post URL 1: http://my.oschina.net/apdplat/blog/397628
Post URL 2: http://yangshangchuan.iteye.com/blog/2002544

19. Post checked: Grouping words by 76 fine-grained parts of speech (2), similarity scores: Simple=0.816729 Cosine=0.864275 EditDistance=0.726828 EuclideanDistance=0.008324 ManhattanDistance=0.00024 Jaccard=0.740036 JaroDistance=0.752048 JaroWinklerDistance=0.851229 SørensenDiceCoefficient=0.850599 SimHashPlusHammingDistance=0.914063

Post URL 1: http://my.oschina.net/apdplat/blog/393774
Post URL 2: http://yangshangchuan.iteye.com/blog/2197877

20. Post checked: 3760 questions for testing the intelligence of a question answering system, similarity scores: Simple=0.161384 Cosine=0.416116 EditDistance=0.029114 EuclideanDistance=0.00041 ManhattanDistance=0.000050 Jaccard=0.045144 JaroDistance=0.0 JaroWinklerDistance=0.4 SørensenDiceCoefficient=0.086388 SimHashPlusHammingDistance=0.867188

Post URL 1: http://my.oschina.net/apdplat/blog/401622
Post URL 2: http://yangshangchuan.iteye.com/blog/2202537

21. Post checked: The Future of Compass & ElasticSearch, similarity scores: Simple=0.875892 Cosine=0.956077 EditDistance=0.71091 EuclideanDistance=0.017992 ManhattanDistance=0.00133 Jaccard=0.57684 JaroDistance=0.767975 JaroWinklerDistance=0.860785 SørensenDiceCoefficient=0.73164 SimHashPlusHammingDistance=0.976563

Post URL 1: http://my.oschina.net/apdplat/blog/397148
Post URL 2: http://yangshangchuan.iteye.com/blog/2010721

22. Post checked: The index file structure of jsearch, similarity scores: Simple=0.674823 Cosine=0.929826 EditDistance=0.312766 EuclideanDistance=0.01487 ManhattanDistance=0.001276 Jaccard=0.381494 JaroDistance=0.691813 JaroWinklerDistance=0.815088 SørensenDiceCoefficient=0.552291 SimHashPlusHammingDistance=1.0

Post URL 1: http://my.oschina.net/apdplat/blog/416505
Post URL 2: http://yangshangchuan.iteye.com/blog/2212348

23. Post checked: The self-description and event notification mechanisms of the domain model in APDPlat, similarity scores: Simple=0.911947 Cosine=0.267994 EditDistance=0.585479 EuclideanDistance=0.000505 ManhattanDistance=0.000366 Jaccard=0.544767 JaroDistance=0.718252 JaroWinklerDistance=0.830951 SørensenDiceCoefficient=0.705306 SimHashPlusHammingDistance=0.867188

Post URL 1: http://my.oschina.net/apdplat/blog/196973
Post URL 2: http://yangshangchuan.iteye.com/blog/2010734

24. Post checked: Grouping words by 76 fine-grained parts of speech (1), similarity scores: Simple=0.759641 Cosine=0.931402 EditDistance=0.706716 EuclideanDistance=0.010161 ManhattanDistance=0.000222 Jaccard=0.666394 JaroDistance=0.745764 JaroWinklerDistance=0.847459 SørensenDiceCoefficient=0.799803 SimHashPlusHammingDistance=1.0

Post URL 1: http://my.oschina.net/apdplat/blog/393771
Post URL 2: http://yangshangchuan.iteye.com/blog/2197874

25. Post checked: How to fix a BUG?, similarity scores: Simple=0.57311 Cosine=0.797745 EditDistance=0.095066 EuclideanDistance=0.019418 ManhattanDistance=0.0013 Jaccard=0.406832 JaroDistance=0.571933 JaroWinklerDistance=0.74316 SørensenDiceCoefficient=0.578366 SimHashPlusHammingDistance=0.953125

Post URL 1: http://my.oschina.net/apdplat/blog/394216
Post URL 2: http://yangshangchuan.iteye.com/blog/1960489

26. Post checked: The Design of HDFS, similarity scores: Simple=0.744979 Cosine=0.839179 EditDistance=0.482531 EuclideanDistance=0.019056 ManhattanDistance=0.001328 Jaccard=0.442281 JaroDistance=0.714793 JaroWinklerDistance=0.828876 SørensenDiceCoefficient=0.613308 SimHashPlusHammingDistance=0.992188

Post URL 1: http://my.oschina.net/apdplat/blog/397149
Post URL 2: http://yangshangchuan.iteye.com/blog/2002898

27. Post checked: Off-site disaster recovery for backup files in APDPlat: FTP upload, similarity scores: Simple=0.899481 Cosine=0.27521 EditDistance=0.545071 EuclideanDistance=0.000563 ManhattanDistance=0.000396 Jaccard=0.546616 JaroDistance=0.707821 JaroWinklerDistance=0.824693 SørensenDiceCoefficient=0.706854 SimHashPlusHammingDistance=0.867188

Post URL 1: http://my.oschina.net/apdplat/blog/197005
Post URL 2: http://yangshangchuan.iteye.com/blog/2010750

28. Post checked: Big Data Series 7: Storm – stream computing, similarity scores: Simple=0.740234 Cosine=0.992857 EditDistance=0.451383 EuclideanDistance=0.017052 ManhattanDistance=0.001316 Jaccard=0.39939 JaroDistance=0.717337 JaroWinklerDistance=0.830402 SørensenDiceCoefficient=0.570806 SimHashPlusHammingDistance=1.0

Post URL 1: http://my.oschina.net/apdplat/blog/396589
Post URL 2: http://yangshangchuan.iteye.com/blog/1950165

29. Post checked: How does APDPlat automatically create the database and tables and initialize data?, similarity scores: Simple=0.883401 Cosine=0.472728 EditDistance=0.62593 EuclideanDistance=0.001341 ManhattanDistance=0.000659 Jaccard=0.59812 JaroDistance=0.738848 JaroWinklerDistance=0.843309 SørensenDiceCoefficient=0.748529 SimHashPlusHammingDistance=0.867188

Post URL 1: http://my.oschina.net/apdplat/blog/197703
Post URL 2: http://yangshangchuan.iteye.com/blog/2012220

30. Post checked: Analyzing words in major exam syllabus vocabularies that have a prefix, a suffix and a root (1), similarity scores: Simple=0.715624 Cosine=0.987889 EditDistance=0.162058 EuclideanDistance=0.000141 ManhattanDistance=0.000018 Jaccard=0.246482 JaroDistance=0.0 JaroWinklerDistance=0.4 SørensenDiceCoefficient=0.395484 SimHashPlusHammingDistance=0.992188

Post URL 1: http://my.oschina.net/apdplat/blog/392490
Post URL 2: http://yangshangchuan.iteye.com/blog/2198571

31. Post checked: How to develop Maven projects with Intellij IDEA?, similarity scores: Simple=0.54095 Cosine=0.750646 EditDistance=0.100101 EuclideanDistance=0.019815 ManhattanDistance=0.001348 Jaccard=0.337907 JaroDistance=0.592775 JaroWinklerDistance=0.755665 SørensenDiceCoefficient=0.505128 SimHashPlusHammingDistance=0.960938

Post URL 1: http://my.oschina.net/apdplat/blog/402634
Post URL 2: http://yangshangchuan.iteye.com/blog/2203614

32. Post checked: HtmlExtractor: a Java component for template-based precise extraction of structured information from web pages, similarity scores: Simple=0.813787 Cosine=0.656104 EditDistance=0.511139 EuclideanDistance=0.004777 ManhattanDistance=0.001002 Jaccard=0.462937 JaroDistance=0.730409 JaroWinklerDistance=0.838245 SørensenDiceCoefficient=0.632887 SimHashPlusHammingDistance=0.875

Post URL 1: http://my.oschina.net/apdplat/blog/308400
Post URL 2: http://yangshangchuan.iteye.com/blog/2110604

33. Post checked: Big Data Series 6: HBase – a distributed database built on Hadoop, similarity scores: Simple=0.799285 Cosine=0.960764 EditDistance=0.45006 EuclideanDistance=0.015144 ManhattanDistance=0.001131 Jaccard=0.46224 JaroDistance=0.729751 JaroWinklerDistance=0.83785 SørensenDiceCoefficient=0.632235 SimHashPlusHammingDistance=0.984375

Post URL 1: http://my.oschina.net/apdplat/blog/396587
Post URL 2: http://yangshangchuan.iteye.com/blog/1954018

34. Post checked: Dynamic index structures and index update mechanisms, similarity scores: Simple=0.710059 Cosine=0.867202 EditDistance=0.318482 EuclideanDistance=0.019811 ManhattanDistance=0.001361 Jaccard=0.393204 JaroDistance=0.695552 JaroWinklerDistance=0.817331 SørensenDiceCoefficient=0.56446 SimHashPlusHammingDistance=0.976563

Post URL 1: http://my.oschina.net/apdplat/blog/308393
Post URL 2: http://yangshangchuan.iteye.com/blog/2103647

35. Post checked: How APDPlat implements log internationalization, similarity scores: Simple=0.724513 Cosine=0.640231 EditDistance=0.246724 EuclideanDistance=0.006562 ManhattanDistance=0.001047 Jaccard=0.381955 JaroDistance=0.68584 JaroWinklerDistance=0.811504 SørensenDiceCoefficient=0.552775 SimHashPlusHammingDistance=0.867188

Post URL 1: http://my.oschina.net/apdplat/blog/196605
Post URL 2: http://yangshangchuan.iteye.com/blog/1974027

36. Post checked: 3054 phrases and idioms, similarity scores: Simple=0.982007 Cosine=0.999109 EditDistance=0.957185 EuclideanDistance=0.020149 ManhattanDistance=0.001377 Jaccard=0.923142 JaroDistance=0.832513 JaroWinklerDistance=0.899508 SørensenDiceCoefficient=0.960035 SimHashPlusHammingDistance=1.0

Post URL 1: http://my.oschina.net/apdplat/blog/393374
Post URL 2: http://yangshangchuan.iteye.com/blog/2197555

37. Post checked: The 9224 most important English words in more than 200 software books, similarity scores: Simple=0.965539 Cosine=0.998876 EditDistance=0.694819 EuclideanDistance=0.012909 ManhattanDistance=0.000311 Jaccard=0.88649 JaroDistance=0.836379 JaroWinklerDistance=0.901827 SørensenDiceCoefficient=0.93983 SimHashPlusHammingDistance=1.0

Post URL 1: http://my.oschina.net/apdplat/blog/391023
Post URL 2: http://yangshangchuan.iteye.com/blog/2195559

38. Post checked: Big Data Series 1: installing and configuring a pseudo-distributed Hadoop cluster on win7, similarity scores: Simple=0.781923 Cosine=0.947268 EditDistance=0.498998 EuclideanDistance=0.012338 ManhattanDistance=0.001206 Jaccard=0.449213 JaroDistance=0.724605 JaroWinklerDistance=0.834763 SørensenDiceCoefficient=0.619941 SimHashPlusHammingDistance=1.0

Post URL 1: http://my.oschina.net/apdplat/blog/396579
Post URL 2: http://yangshangchuan.iteye.com/blog/1953929

39. Post checked: Big Data Series 9: Mahout – machine learning, similarity scores: Simple=0.694397 Cosine=0.904952 EditDistance=0.325297 EuclideanDistance=0.018716 ManhattanDistance=0.001312 Jaccard=0.31453 JaroDistance=0.0 JaroWinklerDistance=0.4 SørensenDiceCoefficient=0.478544 SimHashPlusHammingDistance=1.0

Post URL 1: http://my.oschina.net/apdplat/blog/396682
Post URL 2: http://yangshangchuan.iteye.com/blog/1950172

40. Post checked: How much memory does a new Object take up?, similarity scores: Simple=0.773846 Cosine=0.82497 EditDistance=0.388785 EuclideanDistance=0.008579 ManhattanDistance=0.000847 Jaccard=0.580042 JaroDistance=0.717925 JaroWinklerDistance=0.830755 SørensenDiceCoefficient=0.734211 SimHashPlusHammingDistance=0.875

Post URL 1: http://my.oschina.net/apdplat/blog/208456
Post URL 2: http://yangshangchuan.iteye.com/blog/2021423

41. Post checked: Simulating a marathon with CountDownLatch, similarity scores: Simple=0.800658 Cosine=0.262993 EditDistance=0.405044 EuclideanDistance=0.001268 ManhattanDistance=0.000655 Jaccard=0.376563 JaroDistance=0.627187 JaroWinklerDistance=0.776312 SørensenDiceCoefficient=0.547106 SimHashPlusHammingDistance=0.867188

Post URL 1: http://my.oschina.net/apdplat/blog/385448
Post URL 2: http://yangshangchuan.iteye.com/blog/2198572

42. Post checked: How do the Hadoop distributed file system HDFS and the OpenStack object storage system Swift differ?, similarity scores: Simple=0.545455 Cosine=0.706179 EditDistance=0.103397 EuclideanDistance=0.019213 ManhattanDistance=0.001311 Jaccard=0.333887 JaroDistance=0.587214 JaroWinklerDistance=0.752328 SørensenDiceCoefficient=0.500623 SimHashPlusHammingDistance=0.992188

Post URL 1: http://my.oschina.net/apdplat/blog/396126
Post URL 2: http://yangshangchuan.iteye.com/blog/1969491

43. Post checked: Comparing and choosing Hadoop distributions, similarity scores: Simple=0.819282 Cosine=0.925438 EditDistance=0.530046 EuclideanDistance=0.017123 ManhattanDistance=0.001174 Jaccard=0.613161 JaroDistance=0.738283 JaroWinklerDistance=0.84297 SørensenDiceCoefficient=0.760198 SimHashPlusHammingDistance=0.992188

Post URL 1: http://my.oschina.net/apdplat/blog/397625
Post URL 2: http://yangshangchuan.iteye.com/blog/1972846

44. Post checked: Chinese word segmentation algorithms: dictionary-based forward maximum matching, similarity scores: Simple=0.949163 Cosine=0.822691 EditDistance=0.491317 EuclideanDistance=0.00024 ManhattanDistance=0.000176 Jaccard=0.615854 JaroDistance=0.689317 JaroWinklerDistance=0.81359 SørensenDiceCoefficient=0.762264 SimHashPlusHammingDistance=1.0

Post URL 1: http://my.oschina.net/apdplat/blog/209211
Post URL 2: http://yangshangchuan.iteye.com/blog/2031813

45. Post checked: Big Data Series 3: writing MapReduce in Python, similarity scores: Simple=0.712381 Cosine=0.998152 EditDistance=0.384444 EuclideanDistance=0.015459 ManhattanDistance=0.001297 Jaccard=0.30648 JaroDistance=0.0 JaroWinklerDistance=0.4 SørensenDiceCoefficient=0.469169 SimHashPlusHammingDistance=1.0

Post URL 1: http://my.oschina.net/apdplat/blog/396581
Post URL 2: http://yangshangchuan.iteye.com/blog/1950157

46. Post checked: Error running nutch: unzipBestEffort returned null, similarity scores: Simple=0.799439 Cosine=0.790457 EditDistance=0.496494 EuclideanDistance=0.007016 ManhattanDistance=0.001129 Jaccard=0.449635 JaroDistance=0.722079 JaroWinklerDistance=0.833247 SørensenDiceCoefficient=0.620342 SimHashPlusHammingDistance=0.867188

Post URL 1: http://my.oschina.net/apdplat/blog/207653
Post URL 2: http://yangshangchuan.iteye.com/blog/2030096

47. Post checked: 2000 high-frequency special terms in software development with selected example sentences (1), similarity scores: Simple=0.997733 Cosine=0.894828 EditDistance=0.969809 EuclideanDistance=0.00031 ManhattanDistance=0.000166 Jaccard=0.966481 JaroDistance=0.833402 JaroWinklerDistance=0.900041 SørensenDiceCoefficient=0.982955 SimHashPlusHammingDistance=0.945313

Post URL 1: http://my.oschina.net/apdplat/blog/389200
Post URL 2: http://yangshangchuan.iteye.com/blog/2195665

48. Post checked: 312 free high-speed HTTP proxy IPs (which can hide your real IP address), similarity scores: Simple=0.863259 Cosine=0.991079 EditDistance=0.721858 EuclideanDistance=0.019681 ManhattanDistance=0.00133 Jaccard=0.608095 JaroDistance=0.774801 JaroWinklerDistance=0.864881 SørensenDiceCoefficient=0.756292 SimHashPlusHammingDistance=1.0

Post URL 1: http://my.oschina.net/apdplat/blog/399179
Post URL 2: http://yangshangchuan.iteye.com/blog/2201661

49. Post checked: The 4646 most important English words in the JDK source code, similarity scores: Simple=0.983735 Cosine=0.998253 EditDistance=0.958238 EuclideanDistance=0.020571 ManhattanDistance=0.001416 Jaccard=0.947587 JaroDistance=0.831959 JaroWinklerDistance=0.899176 SørensenDiceCoefficient=0.973088 SimHashPlusHammingDistance=1.0

Post URL 1: http://my.oschina.net/apdplat/blog/390915
Post URL 2: http://yangshangchuan.iteye.com/blog/2195664

50. Post checked: Distributed in-memory file system: Tachyon, similarity scores: Simple=0.666036 Cosine=0.888752 EditDistance=0.258614 EuclideanDistance=0.020062 ManhattanDistance=0.001339 Jaccard=0.424242 JaroDistance=0.0 JaroWinklerDistance=0.4 SørensenDiceCoefficient=0.595745 SimHashPlusHammingDistance=0.992188

Post URL 1: http://my.oschina.net/apdplat/blog/377832
Post URL 2: http://yangshangchuan.iteye.com/blog/2199538

51. Post checked: Publishing blog posts through the CSDN API from JAVA, similarity scores: Simple=0.834603 Cosine=0.24022 EditDistance=0.370826 EuclideanDistance=0.000925 ManhattanDistance=0.000542 Jaccard=0.461646 JaroDistance=0.669012 JaroWinklerDistance=0.801407 SørensenDiceCoefficient=0.631679 SimHashPlusHammingDistance=0.875

Post URL 1: http://my.oschina.net/apdplat/blog/200145
Post URL 2: http://yangshangchuan.iteye.com/blog/2017751

52. Post checked: 1011 hyphenated compound words found in the JDK source code and more than 200 software books, similarity scores: Simple=0.940111 Cosine=0.990982 EditDistance=0.857646 EuclideanDistance=0.020575 ManhattanDistance=0.001418 Jaccard=0.810566 JaroDistance=0.808976 JaroWinklerDistance=0.885386 SørensenDiceCoefficient=0.895373 SimHashPlusHammingDistance=1.0

Post URL 1: http://my.oschina.net/apdplat/blog/394495
Post URL 2: http://yangshangchuan.iteye.com/blog/2199283

53. Post checked: Chinese word segmentation: 11946 groups of synonyms, similarity scores: Simple=0.653875 Cosine=0.36483 EditDistance=0.206922 EuclideanDistance=0.004696 ManhattanDistance=0.00106 Jaccard=0.482072 JaroDistance=0.0 JaroWinklerDistance=0.4 SørensenDiceCoefficient=0.650538 SimHashPlusHammingDistance=0.859375

Post URL 1: http://my.oschina.net/apdplat/blog/408779
Post URL 2: http://yangshangchuan.iteye.com/blog/2207725

54. Post checked: Analyzing standalone words in major exam syllabus vocabularies that have no root, prefix or suffix, similarity scores: Simple=0.898462 Cosine=0.7621 EditDistance=0.750277 EuclideanDistance=0.020082 ManhattanDistance=0.001323 Jaccard=0.80887 JaroDistance=0.781002 JaroWinklerDistance=0.868601 SørensenDiceCoefficient=0.894337 SimHashPlusHammingDistance=0.96875

Post URL 1: http://my.oschina.net/apdplat/blog/392483
Post URL 2: http://yangshangchuan.iteye.com/blog/2196691

55. Post checked: Adding the word segmenter to LUKE, similarity scores: Simple=0.588725 Cosine=0.826122 EditDistance=0.118762 EuclideanDistance=0.019581 ManhattanDistance=0.001323 Jaccard=0.32879 JaroDistance=0.625306 JaroWinklerDistance=0.775184 SørensenDiceCoefficient=0.494872 SimHashPlusHammingDistance=1.0

Post URL 1: http://my.oschina.net/apdplat/blog/397069
Post URL 2: http://yangshangchuan.iteye.com/blog/2200077

56. Post checked: Big Data Series 2: setting up a development environment to write HDFS and Map Reduce programs, similarity scores: Simple=0.647931 Cosine=0.883387 EditDistance=0.166276 EuclideanDistance=0.018676 ManhattanDistance=0.001247 Jaccard=0.404223 JaroDistance=0.690409 JaroWinklerDistance=0.814246 SørensenDiceCoefficient=0.575725 SimHashPlusHammingDistance=1.0

Post URL 1: http://my.oschina.net/apdplat/blog/396580
Post URL 2: http://yangshangchuan.iteye.com/blog/1950158

57. Post checked: Analyzing the role of 113 prefixes in major exam syllabus vocabularies (2): summary and selections, similarity scores: Simple=0.949636 Cosine=0.841608 EditDistance=0.712749 EuclideanDistance=0.00362 ManhattanDistance=0.000494 Jaccard=0.803801 JaroDistance=0.760753 JaroWinklerDistance=0.856452 SørensenDiceCoefficient=0.89123 SimHashPlusHammingDistance=0.945313

Post URL 1: http://my.oschina.net/apdplat/blog/392456
Post URL 2: http://yangshangchuan.iteye.com/blog/2195996

58. Post checked: ITEYE blog post plagiarism check, similarity scores: Simple=0.970753 Cosine=0.996009 EditDistance=0.815637 EuclideanDistance=0.006436 ManhattanDistance=0.000636 Jaccard=0.804027 JaroDistance=0.794149 JaroWinklerDistance=0.876489 SørensenDiceCoefficient=0.891369 SimHashPlusHammingDistance=1.0

Post URL 1: http://my.oschina.net/apdplat/blog/396411
Post URL 2: http://yangshangchuan.iteye.com/blog/2199536

59. Post checked: Evaluating the segmentation quality of the word, ansj, mmseg4j and ik-analyzer segmenters, similarity scores: Simple=0.407315 Cosine=0.150613 EditDistance=0.026527 EuclideanDistance=0.00010 ManhattanDistance=0.000053 Jaccard=0.18111 JaroDistance=0.0 JaroWinklerDistance=0.4 SørensenDiceCoefficient=0.306678 SimHashPlusHammingDistance=0.875

Post URL 1: http://my.oschina.net/apdplat/blog/228615
Post URL 2: http://yangshangchuan.iteye.com/blog/2056537

60. Post checked: Analyzing the role of 151 suffixes in major exam syllabus vocabularies (3): summary and selections, similarity scores: Simple=0.967367 Cosine=0.863802 EditDistance=0.742439 EuclideanDistance=0.002269 ManhattanDistance=0.000329 Jaccard=0.855789 JaroDistance=0.76582 JaroWinklerDistance=0.859492 SørensenDiceCoefficient=0.922291 SimHashPlusHammingDistance=0.90625

Post URL 1: http://my.oschina.net/apdplat/blog/392466
Post URL 2: http://yangshangchuan.iteye.com/blog/2196690

61. Post checked: Computing text similarity with the word segmenter, similarity scores: Simple=0.871744 Cosine=0.701305 EditDistance=0.716055 EuclideanDistance=0.002361 ManhattanDistance=0.000734 Jaccard=0.394945 JaroDistance=0.769578 JaroWinklerDistance=0.861747 SørensenDiceCoefficient=0.566251 SimHashPlusHammingDistance=0.890625

Post URL 1: http://my.oschina.net/apdplat/blog/417047
Post URL 2: http://yangshangchuan.iteye.com/blog/2212667

62. Post checked: Big Data Series 4: Hive – a data warehouse built on HADOOP, similarity scores: Simple=0.875955 Cosine=0.993649 EditDistance=0.706468 EuclideanDistance=0.006252 ManhattanDistance=0.001101 Jaccard=0.520858 JaroDistance=0.772404 JaroWinklerDistance=0.863442 SørensenDiceCoefficient=0.684953 SimHashPlusHammingDistance=1.0

Post URL 1: http://my.oschina.net/apdplat/blog/396582
Post URL 2: http://yangshangchuan.iteye.com/blog/1950178

63. Post checked: Finding related words with the word segmenter by computing word contexts, similarity scores: Simple=0.914777 Cosine=0.270823 EditDistance=0.7006 EuclideanDistance=0.000958 ManhattanDistance=0.000566 Jaccard=0.716478 JaroDistance=0.75262 JaroWinklerDistance=0.851572 SørensenDiceCoefficient=0.834824 SimHashPlusHammingDistance=0.898438

Post URL 1: http://my.oschina.net/apdplat/blog/417922
Post URL 2: http://yangshangchuan.iteye.com/blog/2213330

64. Post checked: Automatic real-time detection of resource file content changes in Java applications, similarity scores: Simple=0.944844 Cosine=0.101055 EditDistance=0.589899 EuclideanDistance=0.000212 ManhattanDistance=0.000183 Jaccard=0.54994 JaroDistance=0.708615 JaroWinklerDistance=0.825169 SørensenDiceCoefficient=0.709627 SimHashPlusHammingDistance=0.867188

Post URL 1: http://my.oschina.net/apdplat/blog/312609
Post URL 2: http://yangshangchuan.iteye.com/blog/2115461

65. Post checked: OSCHINA blog post plagiarism check, similarity scores: Simple=0.972508 Cosine=0.995548 EditDistance=0.813105 EuclideanDistance=0.005584 ManhattanDistance=0.000579 Jaccard=0.816191 JaroDistance=0.791755 JaroWinklerDistance=0.875053 SørensenDiceCoefficient=0.898794 SimHashPlusHammingDistance=1.0

Post URL 1: http://my.oschina.net/apdplat/blog/396414
Post URL 2: http://yangshangchuan.iteye.com/blog/2200451

66. Post checked: Chinese word segmentation algorithms: performance optimization and testing of the dictionary mechanism, similarity scores: Simple=0.781593 Cosine=0.460772 EditDistance=0.424893 EuclideanDistance=0.002293 ManhattanDistance=0.000831 Jaccard=0.475168 JaroDistance=0.697677 JaroWinklerDistance=0.818606 SørensenDiceCoefficient=0.644222 SimHashPlusHammingDistance=0.867188

Post URL 1: http://my.oschina.net/apdplat/blog/213968
Post URL 2: http://yangshangchuan.iteye.com/blog/2035007

67. Post checked: A method for keeping commercial ads and illegal or harmful spam off user-generated content sites, similarity scores: Simple=0.947755 Cosine=0.997995 EditDistance=0.849805 EuclideanDistance=0.017408 ManhattanDistance=0.001151 Jaccard=0.746667 JaroDistance=0.807946 JaroWinklerDistance=0.884768 SørensenDiceCoefficient=0.854962 SimHashPlusHammingDistance=1.0

Post URL 1: http://my.oschina.net/apdplat/blog/398338
Post URL 2: http://yangshangchuan.iteye.com/blog/2200810

68. Post checked: Adding a licence header uniformly to JAVA source files, similarity scores: Simple=0.876004 Cosine=0.228873 EditDistance=0.471621 EuclideanDistance=0.000488 ManhattanDistance=0.000354 Jaccard=0.460165 JaroDistance=0.0 JaroWinklerDistance=0.4 SørensenDiceCoefficient=0.630292 SimHashPlusHammingDistance=0.867188

Post URL 1: http://my.oschina.net/apdplat/blog/396415
Post URL 2: http://yangshangchuan.iteye.com/blog/1841150

69. Post checked: Chinese word segmentation algorithms: dictionary-based forward minimum matching, similarity scores: Simple=0.605042 Cosine=0.745649 EditDistance=0.076564 EuclideanDistance=0.014045 ManhattanDistance=0.001252 Jaccard=0.352431 JaroDistance=0.552118 JaroWinklerDistance=0.731271 SørensenDiceCoefficient=0.521181 SimHashPlusHammingDistance=0.90625

Post URL 1: http://my.oschina.net/apdplat/blog/217588
Post URL 2: http://yangshangchuan.iteye.com/blog/2040423

70. Post checked: The machine code generation mechanism in APDPlat, similarity scores: Simple=0.924309 Cosine=0.28588 EditDistance=0.549355 EuclideanDistance=0.000371 ManhattanDistance=0.000283 Jaccard=0.55618 JaroDistance=0.7063 JaroWinklerDistance=0.82378 SørensenDiceCoefficient=0.714801 SimHashPlusHammingDistance=0.867188

Post URL 1: http://my.oschina.net/apdplat/blog/197805
Post URL 2: http://yangshangchuan.iteye.com/blog/2012401

71. Post checked: Some thoughts on parsing configuration files, similarity scores: Simple=0.899769 Cosine=0.213149 EditDistance=0.567892 EuclideanDistance=0.000607 ManhattanDistance=0.000423 Jaccard=0.575942 JaroDistance=0.718013 JaroWinklerDistance=0.830808 SørensenDiceCoefficient=0.730918 SimHashPlusHammingDistance=0.851563

Post URL 1: http://my.oschina.net/apdplat/blog/404695
Post URL 2: http://yangshangchuan.iteye.com/blog/2204877

72. Post checked: struts2 versus spring mvc: which is better?, similarity scores: Simple=0.917758 Cosine=0.241255 EditDistance=0.592436 EuclideanDistance=0.000453 ManhattanDistance=0.000339 Jaccard=0.549367 JaroDistance=0.719762 JaroWinklerDistance=0.831857 SørensenDiceCoefficient=0.70915 SimHashPlusHammingDistance=0.867188

Post URL 1: http://my.oschina.net/apdplat/blog/403561
Post URL 2: http://yangshangchuan.iteye.com/blog/2203767

73. Post checked: Comparing Chinese word segmentation quality, similarity scores: Simple=0.726563 Cosine=0.289377 EditDistance=0.252894 EuclideanDistance=0.001605 ManhattanDistance=0.000703 Jaccard=0.443243 JaroDistance=0.0 JaroWinklerDistance=0.4 SørensenDiceCoefficient=0.614232 SimHashPlusHammingDistance=0.875

Post URL 1: http://my.oschina.net/apdplat/blog/228614
Post URL 2: http://yangshangchuan.iteye.com/blog/2043184

74. Post checked: How to use HtmlExtractor for template-based precise extraction of structured information from web pages?, similarity scores: Simple=0.905067 Cosine=0.390132 EditDistance=0.645882 EuclideanDistance=0.00082 ManhattanDistance=0.000511 Jaccard=0.692377 JaroDistance=0.741666 JaroWinklerDistance=0.845 SørensenDiceCoefficient=0.81823 SimHashPlusHammingDistance=0.867188

Post URL 1: http://my.oschina.net/apdplat/blog/402149
Post URL 2: http://yangshangchuan.iteye.com/blog/2202879

75. Post checked: Chinese word segmentation algorithms: dictionary-based reverse minimum matching, similarity scores: Simple=0.60509 Cosine=0.746722 EditDistance=0.078228 EuclideanDistance=0.014311 ManhattanDistance=0.001274 Jaccard=0.350263 JaroDistance=0.0 JaroWinklerDistance=0.4 SørensenDiceCoefficient=0.518807 SimHashPlusHammingDistance=0.90625

Post URL 1: http://my.oschina.net/apdplat/blog/217589
Post URL 2: http://yangshangchuan.iteye.com/blog/2040431

76. Post checked: Big Data Series 10: Spark – in-memory computing, similarity scores: Simple=0.652939 Cosine=0.910027 EditDistance=0.295395 EuclideanDistance=0.019689 ManhattanDistance=0.001348 Jaccard=0.351133 JaroDistance=0.686576 JaroWinklerDistance=0.811946 SørensenDiceCoefficient=0.51976 SimHashPlusHammingDistance=1.0

Post URL 1: http://my.oschina.net/apdplat/blog/396683
Post URL 2: http://yangshangchuan.iteye.com/blog/1950276

77. Post checked: A web crawler and search engine based on Nutch+Hadoop+Hbase+ElasticSearch, similarity scores: Simple=0.69378 Cosine=0.876685 EditDistance=0.325701 EuclideanDistance=0.018531 ManhattanDistance=0.001232 Jaccard=0.495425 JaroDistance=0.706159 JaroWinklerDistance=0.823696 SørensenDiceCoefficient=0.662587 SimHashPlusHammingDistance=0.960938

Post URL 1: http://my.oschina.net/apdplat/blog/308396
Post URL 2: http://yangshangchuan.iteye.com/blog/2103664

78. Post checked: The NWR model of Amazon Dynamo, similarity scores: Simple=0.700984 Cosine=0.917696 EditDistance=0.369508 EuclideanDistance=0.019238 ManhattanDistance=0.001302 Jaccard=0.489189 JaroDistance=0.694207 JaroWinklerDistance=0.816524 SørensenDiceCoefficient=0.656987 SimHashPlusHammingDistance=0.96875

Post URL 1: http://my.oschina.net/apdplat/blog/393783
Post URL 2: http://yangshangchuan.iteye.com/blog/2010574

79. Post checked: Collecting electronic newspapers, similarity scores: Simple=0.947947 Cosine=0.119871 EditDistance=0.568618 EuclideanDistance=0.00018 ManhattanDistance=0.000158 Jaccard=0.503193 JaroDistance=0.679952 JaroWinklerDistance=0.807971 SørensenDiceCoefficient=0.669499 SimHashPlusHammingDistance=0.867188

Post URL 1: http://my.oschina.net/apdplat/blog/397051
Post URL 2: http://yangshangchuan.iteye.com/blog/1996911

80. Post checked: An introduction to question answering systems, similarity scores: Simple=0.576392 Cosine=0.795634 EditDistance=0.099 EuclideanDistance=0.019823 ManhattanDistance=0.001393 Jaccard=0.370184 JaroDistance=0.615931 JaroWinklerDistance=0.769559 SørensenDiceCoefficient=0.540342 SimHashPlusHammingDistance=0.945313

Post URL 1: http://my.oschina.net/apdplat/blog/402411
Post URL 2: http://yangshangchuan.iteye.com/blog/2203238

81. Post checked: Configuring Nutch to imitate a browser to bypass anti-crawler restrictions, similarity scores: Simple=0.861469 Cosine=0.468709 EditDistance=0.567494 EuclideanDistance=0.001332 ManhattanDistance=0.000678 Jaccard=0.491826 JaroDistance=0.7279 JaroWinklerDistance=0.83674 SørensenDiceCoefficient=0.659361 SimHashPlusHammingDistance=0.867188

Post URL 1: http://my.oschina.net/apdplat/blog/208457
Post URL 2: http://yangshangchuan.iteye.com/blog/2030741

82. Post checked: Installing a single-node pseudo-distributed HADOOP cluster on Ubuntu, similarity scores: Simple=0.692959 Cosine=0.931583 EditDistance=0.354773 EuclideanDistance=0.019811 ManhattanDistance=0.001353 Jaccard=0.371336 JaroDistance=0.702344 JaroWinklerDistance=0.821406 SørensenDiceCoefficient=0.541568 SimHashPlusHammingDistance=0.984375

Post URL 1: http://my.oschina.net/apdplat/blog/397145
Post URL 2: http://yangshangchuan.iteye.com/blog/1839809

83. Post checked: Shard and replica mechanisms in search engines, similarity scores: Simple=0.53135 Cosine=0.713158 EditDistance=0.104676 EuclideanDistance=0.019034 ManhattanDistance=0.001294 Jaccard=0.338843 JaroDistance=0.580059 JaroWinklerDistance=0.748035 SørensenDiceCoefficient=0.506173 SimHashPlusHammingDistance=0.882813

Post URL 1: http://my.oschina.net/apdplat/blog/308395
Post URL 2: http://yangshangchuan.iteye.com/blog/2103650

84. Post checked: Big Data Series 12: Hadoop2 – an all-new Hadoop, similarity scores: Simple=0.723373 Cosine=0.954384 EditDistance=0.332196 EuclideanDistance=0.012481 ManhattanDistance=0.000993 Jaccard=0.373219 JaroDistance=0.706972 JaroWinklerDistance=0.824183 SørensenDiceCoefficient=0.543568 SimHashPlusHammingDistance=0.914063

Post URL 1: http://my.oschina.net/apdplat/blog/396685
Post URL 2: http://yangshangchuan.iteye.com/blog/1967994

85. Post checked: How to Support Both Git@OSC and Github in an Open-Source Project, similarity scores: Simple=0.610243 Cosine=0.802775 EditDistance=0.134549 EuclideanDistance=0.017438 ManhattanDistance=0.001279 Jaccard=0.363636 JaroDistance=0.629487 JaroWinklerDistance=0.777692 SørensenDiceCoefficient=0.533333 SimHashPlusHammingDistance=0.976563

Post URL 1: http://my.oschina.net/apdplat/blog/415849
Post URL 2: http://yangshangchuan.iteye.com/blog/2211952

86. Post checked: Installing a Single-Node Pseudo-Distributed HADOOP Cluster on Windows, similarity scores: Simple=0.684061 Cosine=0.923693 EditDistance=0.323039 EuclideanDistance=0.019109 ManhattanDistance=0.001337 Jaccard=0.380573 JaroDistance=0.0 JaroWinklerDistance=0.4 SørensenDiceCoefficient=0.551326 SimHashPlusHammingDistance=1.0

Post URL 1: http://my.oschina.net/apdplat/blog/397147
Post URL 2: http://yangshangchuan.iteye.com/blog/1839814

87. Post checked: Implementing AtomicFloat, Which the JDK Does Not Provide, similarity scores: Simple=0.825848 Cosine=0.382893 EditDistance=0.512846 EuclideanDistance=0.00173 ManhattanDistance=0.00077 Jaccard=0.519634 JaroDistance=0.712833 JaroWinklerDistance=0.8277 SørensenDiceCoefficient=0.683893 SimHashPlusHammingDistance=0.867188

Post URL 1: http://my.oschina.net/apdplat/blog/418019
Post URL 2: http://yangshangchuan.iteye.com/blog/2213383

88. Post checked: The Java Open-Source Project cws_evaluation: Evaluating the Segmentation Quality of Chinese Word Segmenters, similarity scores: Simple=0.847565 Cosine=0.946887 EditDistance=0.485049 EuclideanDistance=0.005634 ManhattanDistance=0.000907 Jaccard=0.529736 JaroDistance=0.0 JaroWinklerDistance=0.4 SørensenDiceCoefficient=0.692585 SimHashPlusHammingDistance=0.992188

Post URL 1: http://my.oschina.net/apdplat/blog/308391
Post URL 2: http://yangshangchuan.iteye.com/blog/2059040

89. Post checked: Chinese Word Segmentation Algorithms: Dictionary-Based Reverse Maximum Matching, similarity scores: Simple=0.926767 Cosine=0.415351 EditDistance=0.689871 EuclideanDistance=0.000695 ManhattanDistance=0.000452 Jaccard=0.660784 JaroDistance=0.754275 JaroWinklerDistance=0.852565 SørensenDiceCoefficient=0.79575 SimHashPlusHammingDistance=0.929688

Post URL 1: http://my.oschina.net/apdplat/blog/210427
Post URL 2: http://yangshangchuan.iteye.com/blog/2033843

90. Post checked: How to Use Multiple Cores to Speed Up Word Segmentation, similarity scores: Simple=0.811988 Cosine=0.341844 EditDistance=0.468702 EuclideanDistance=0.002047 ManhattanDistance=0.000827 Jaccard=0.520694 JaroDistance=0.709353 JaroWinklerDistance=0.825612 SørensenDiceCoefficient=0.684811 SimHashPlusHammingDistance=0.867188

Post URL 1: http://my.oschina.net/apdplat/blog/414076
Post URL 2: http://yangshangchuan.iteye.com/blog/2210633

91. Post checked: QuestionAnsweringSystem v1.1 Released, a Human-Machine Question Answering System, similarity scores: Simple=0.72864 Cosine=0.528121 EditDistance=0.272386 EuclideanDistance=0.00509 ManhattanDistance=0.001081 Jaccard=0.496423 JaroDistance=0.0 JaroWinklerDistance=0.4 SørensenDiceCoefficient=0.66348 SimHashPlusHammingDistance=0.875

Post URL 1: http://my.oschina.net/apdplat/blog/308392
Post URL 2: http://yangshangchuan.iteye.com/blog/2101533

92. Post checked: Usage and Segmentation Quality Comparison of 11 Major Java Open-Source Chinese Word Segmenters, similarity scores: Simple=0.937513 Cosine=0.270575 EditDistance=0.70589 EuclideanDistance=0.00046 ManhattanDistance=0.000332 Jaccard=0.569048 JaroDistance=0.755221 JaroWinklerDistance=0.853133 SørensenDiceCoefficient=0.725341 SimHashPlusHammingDistance=0.867188

Post URL 1: http://my.oschina.net/apdplat/blog/412921
Post URL 2: http://yangshangchuan.iteye.com/blog/2209899

93. Post checked: Chinese Word Segmentation: 9271 Groups of Antonyms, similarity scores: Simple=0.574117 Cosine=0.65666 EditDistance=0.130794 EuclideanDistance=0.019911 ManhattanDistance=0.00137 Jaccard=0.414925 JaroDistance=0.513196 JaroWinklerDistance=0.707918 SørensenDiceCoefficient=0.586498 SimHashPlusHammingDistance=1.0

Post URL 1: http://my.oschina.net/apdplat/blog/411301
Post URL 2: http://yangshangchuan.iteye.com/blog/2208869

94. Post checked: Building Your Own Personalized Search Engine with Java8, similarity scores: Simple=0.586612 Cosine=0.780165 EditDistance=0.049751 EuclideanDistance=0.016146 ManhattanDistance=0.001099 Jaccard=0.351852 JaroDistance=0.640493 JaroWinklerDistance=0.784296 SørensenDiceCoefficient=0.520548 SimHashPlusHammingDistance=0.9375

Post URL 1: http://my.oschina.net/apdplat/blog/396193
Post URL 2: http://yangshangchuan.iteye.com/blog/2199420

95. Post checked: A General-Purpose Web Page Similarity Detection Algorithm, similarity scores: Simple=0.982942 Cosine=0.98389 EditDistance=0.743325 EuclideanDistance=0.001949 ManhattanDistance=0.000255 Jaccard=0.777508 JaroDistance=0.761679 JaroWinklerDistance=0.857007 SørensenDiceCoefficient=0.874829 SimHashPlusHammingDistance=1.0

Post URL 1: http://my.oschina.net/apdplat/blog/398361
Post URL 2: http://yangshangchuan.iteye.com/blog/2200816

96. Post checked: Scraping People Information from Hexun (和讯网) with JSoup+CSSPath, similarity scores: Simple=0.900441 Cosine=0.141305 EditDistance=0.460352 EuclideanDistance=0.000324 ManhattanDistance=0.000262 Jaccard=0.412202 JaroDistance=0.0 JaroWinklerDistance=0.4 SørensenDiceCoefficient=0.583772 SimHashPlusHammingDistance=0.859375

Post URL 1: http://my.oschina.net/apdplat/blog/397143
Post URL 2: http://yangshangchuan.iteye.com/blog/1966497

97. Post checked: APDPlat Extended Search: Integrating ElasticSearch, similarity scores: Simple=0.924891 Cosine=0.240756 EditDistance=0.583444 EuclideanDistance=0.000404 ManhattanDistance=0.000312 Jaccard=0.593968 JaroDistance=0.71564 JaroWinklerDistance=0.829384 SørensenDiceCoefficient=0.745269 SimHashPlusHammingDistance=0.867188

Post URL 1: http://my.oschina.net/apdplat/blog/197012
Post URL 2: http://yangshangchuan.iteye.com/blog/2010755

98. Post checked: Running nutch Reports: 0 records selected for fetching, exiting, similarity scores: Simple=0.857119 Cosine=0.776194 EditDistance=0.606574 EuclideanDistance=0.00331 ManhattanDistance=0.000936 Jaccard=0.518834 JaroDistance=0.750933 JaroWinklerDistance=0.85056 SørensenDiceCoefficient=0.6832 SimHashPlusHammingDistance=0.875

Post URL 1: http://my.oschina.net/apdplat/blog/396699
Post URL 2: http://yangshangchuan.iteye.com/blog/2033009

99. Post checked: 36 Original English-Language Java E-Books, similarity scores: Simple=0.439281 Cosine=0.599467 EditDistance=0.089839 EuclideanDistance=0.019907 ManhattanDistance=0.001395 Jaccard=0.2939 JaroDistance=0.0 JaroWinklerDistance=0.4 SørensenDiceCoefficient=0.454286 SimHashPlusHammingDistance=0.953125

Post URL 1: http://my.oschina.net/apdplat/blog/401287
Post URL 2: http://yangshangchuan.iteye.com/blog/2202225

100. Post checked: The Oscars of the Software Industry: The JOLT Awards, Best Books, similarity scores: Simple=0.930154 Cosine=0.972151 EditDistance=0.827542 EuclideanDistance=0.017854 ManhattanDistance=0.001274 Jaccard=0.72969 JaroDistance=0.801542 JaroWinklerDistance=0.880925 SørensenDiceCoefficient=0.843723 SimHashPlusHammingDistance=0.96875

Post URL 1: http://my.oschina.net/apdplat/blog/395681
Post URL 2: http://yangshangchuan.iteye.com/blog/1837328

101. Post checked: The History of Nutch's Development, similarity scores: Simple=0.600951 Cosine=0.796158 EditDistance=0.178988 EuclideanDistance=0.01937 ManhattanDistance=0.00133 Jaccard=0.34647 JaroDistance=0.0 JaroWinklerDistance=0.4 SørensenDiceCoefficient=0.514634 SimHashPlusHammingDistance=0.984375

Post URL 1: http://my.oschina.net/apdplat/blog/397151
Post URL 2: http://yangshangchuan.iteye.com/blog/1949212

102. Post checked: Chinese Word Segmentation Algorithms: Dictionary-Based Full Segmentation, similarity scores: Simple=0.990681 Cosine=0.336911 EditDistance=0.744182 EuclideanDistance=0.000048 ManhattanDistance=0.000046 Jaccard=0.838876 JaroDistance=0.751056 JaroWinklerDistance=0.850634 SørensenDiceCoefficient=0.912379 SimHashPlusHammingDistance=0.890625

Post URL 1: http://my.oschina.net/apdplat/blog/412785
Post URL 2: http://yangshangchuan.iteye.com/blog/2209761

103. Post checked: Null References in Java: Beyond What You Imagine, similarity scores: Simple=0.557861 Cosine=0.669701 EditDistance=0.11941 EuclideanDistance=0.019681 ManhattanDistance=0.001348 Jaccard=0.330619 JaroDistance=0.643192 JaroWinklerDistance=0.785915 SørensenDiceCoefficient=0.49694 SimHashPlusHammingDistance=1.0

Post URL 1: http://my.oschina.net/apdplat/blog/217587
Post URL 2: http://yangshangchuan.iteye.com/blog/2038163

104. Post checked: Using the word Segmenter to Compute Word Frequency Statistics for Text, similarity scores: Simple=0.67997 Cosine=0.909786 EditDistance=0.281984 EuclideanDistance=0.01535 ManhattanDistance=0.001309 Jaccard=0.376689 JaroDistance=0.0 JaroWinklerDistance=0.4 SørensenDiceCoefficient=0.547239 SimHashPlusHammingDistance=0.976563

Post URL 1: http://my.oschina.net/apdplat/blog/417641
Post URL 2: http://yangshangchuan.iteye.com/blog/2213194

105. Post checked: Calling Google Search from Java, similarity scores: Simple=0.842954 Cosine=0.254785 EditDistance=0.342058 EuclideanDistance=0.000884 ManhattanDistance=0.000486 Jaccard=0.431907 JaroDistance=0.670313 JaroWinklerDistance=0.802188 SørensenDiceCoefficient=0.603261 SimHashPlusHammingDistance=0.867188

Post URL 1: http://my.oschina.net/apdplat/blog/397127
Post URL 2: http://yangshangchuan.iteye.com/blog/1961059

106. Post checked: A Summary of English Word Suffix Rules, similarity scores: Simple=0.977161 Cosine=0.962232 EditDistance=0.873122 EuclideanDistance=0.00725 ManhattanDistance=0.000652 Jaccard=0.90356 JaroDistance=0.803434 JaroWinklerDistance=0.882061 SørensenDiceCoefficient=0.949337 SimHashPlusHammingDistance=0.90625

Post URL 1: http://my.oschina.net/apdplat/blog/379330
Post URL 2: http://yangshangchuan.iteye.com/blog/2186326

107. Post checked: Big Data Series 11: Gora – Big Data Persistence, similarity scores: Simple=0.937385 Cosine=0.999418 EditDistance=0.832276 EuclideanDistance=0.001872 ManhattanDistance=0.000787 Jaccard=0.566059 JaroDistance=0.804615 JaroWinklerDistance=0.882769 SørensenDiceCoefficient=0.722909 SimHashPlusHammingDistance=1.0

Post URL 1: http://my.oschina.net/apdplat/blog/396684
Post URL 2: http://yangshangchuan.iteye.com/blog/1953733

108. Post checked: Calling Baidu Search from Java, similarity scores: Simple=0.886541 Cosine=0.221597 EditDistance=0.273351 EuclideanDistance=0.000495 ManhattanDistance=0.000301 Jaccard=0.463905 JaroDistance=0.637882 JaroWinklerDistance=0.782729 SørensenDiceCoefficient=0.633791 SimHashPlusHammingDistance=0.867188

Post URL 1: http://my.oschina.net/apdplat/blog/397129
Post URL 2: http://yangshangchuan.iteye.com/blog/1961058

109. Post checked: Newly Built with Java8: supertool for Learning English, similarity scores: Simple=0.61907 Cosine=0.758435 EditDistance=0.075541 EuclideanDistance=0.019815 ManhattanDistance=0.001312 Jaccard=0.413534 JaroDistance=0.0 JaroWinklerDistance=0.4 SørensenDiceCoefficient=0.585106 SimHashPlusHammingDistance=0.945313

Post URL 1: http://my.oschina.net/apdplat/blog/393187
Post URL 2: http://yangshangchuan.iteye.com/blog/2196853

110. Post checked: Distributed Search Algorithms, similarity scores: Simple=0.568918 Cosine=0.796011 EditDistance=0.129047 EuclideanDistance=0.019407 ManhattanDistance=0.001355 Jaccard=0.379479 JaroDistance=0.0 JaroWinklerDistance=0.4 SørensenDiceCoefficient=0.550177 SimHashPlusHammingDistance=0.976563

Post URL 1: http://my.oschina.net/apdplat/blog/396196
Post URL 2: http://yangshangchuan.iteye.com/blog/1965212

111. Post checked: A Powerful Browser-Simulation Tool - HtmlUnit, similarity scores: Simple=0.853159 Cosine=0.37586 EditDistance=0.396646 EuclideanDistance=0.00164 ManhattanDistance=0.000617 Jaccard=0.462577 JaroDistance=0.687916 JaroWinklerDistance=0.81275 SørensenDiceCoefficient=0.63255 SimHashPlusHammingDistance=0.867188

Post URL 1: http://my.oschina.net/apdplat/blog/217586
Post URL 2: http://yangshangchuan.iteye.com/blog/2036809

112. Post checked: User Password Security Policy in APDPlat, similarity scores: Simple=0.919992 Cosine=0.633571 EditDistance=0.7097 EuclideanDistance=0.001001 ManhattanDistance=0.00058 Jaccard=0.640674 JaroDistance=0.762376 JaroWinklerDistance=0.857426 SørensenDiceCoefficient=0.780989 SimHashPlusHammingDistance=0.921875

Post URL 1: http://my.oschina.net/apdplat/blog/207124
Post URL 2: http://yangshangchuan.iteye.com/blog/2029367

113. Post checked: Software Entropy: Starting Over from Scratch in Software Development Is a Process of Ever-Increasing Software Entropy, similarity scores: Simple=0.663658 Cosine=0.86042 EditDistance=0.225667 EuclideanDistance=0.019304 ManhattanDistance=0.001312 Jaccard=0.495302 JaroDistance=0.680842 JaroWinklerDistance=0.808505 SørensenDiceCoefficient=0.662478 SimHashPlusHammingDistance=0.96875

Post URL 1: http://my.oschina.net/apdplat/blog/311291
Post URL 2: http://yangshangchuan.iteye.com/blog/2113923

114. Post checked: Challenges Facing Web Crawlers: Link Construction, similarity scores: Simple=0.892412 Cosine=0.875158 EditDistance=0.725053 EuclideanDistance=0.005005 ManhattanDistance=0.001028 Jaccard=0.63859 JaroDistance=0.778366 JaroWinklerDistance=0.86702 SørensenDiceCoefficient=0.779438 SimHashPlusHammingDistance=0.945313

Post URL 1: http://my.oschina.net/apdplat/blog/208716
Post URL 2: http://yangshangchuan.iteye.com/blog/2031642

115. Post checked: APDPlat Extended Search: Integrating Solr, similarity scores: Simple=0.930587 Cosine=0.265953 EditDistance=0.635602 EuclideanDistance=0.000416 ManhattanDistance=0.00032 Jaccard=0.590962 JaroDistance=0.732765 JaroWinklerDistance=0.839659 SørensenDiceCoefficient=0.742899 SimHashPlusHammingDistance=0.851563

Post URL 1: http://my.oschina.net/apdplat/blog/197020
Post URL 2: http://yangshangchuan.iteye.com/blog/2010760

116. Post checked: Crawling Sites That Require Login with Nutch, similarity scores: Simple=0.677419 Cosine=0.789777 EditDistance=0.227579 EuclideanDistance=0.015439 ManhattanDistance=0.001155 Jaccard=0.394469 JaroDistance=0.684062 JaroWinklerDistance=0.810437 SørensenDiceCoefficient=0.565762 SimHashPlusHammingDistance=0.960938

Post URL 1: http://my.oschina.net/apdplat/blog/208723
Post URL 2: http://yangshangchuan.iteye.com/blog/2031742

117. Post checked: A Study of Transformation Patterns Among English Words with Similar Sound and Form, similarity scores: Simple=0.870524 Cosine=0.057626 EditDistance=0.234843 EuclideanDistance=0.000133 ManhattanDistance=0.000114 Jaccard=0.916158 JaroDistance=0.775934 JaroWinklerDistance=0.86556 SørensenDiceCoefficient=0.956245 SimHashPlusHammingDistance=0.851563

Post URL 1: http://my.oschina.net/apdplat/blog/378569
Post URL 2: http://yangshangchuan.iteye.com/blog/2186300

118. Post checked: A Person-Name Recognition Method Based on Part-of-Speech Sequences, similarity scores: Simple=0.54963 Cosine=0.728393 EditDistance=0.085926 EuclideanDistance=0.015127 ManhattanDistance=0.001295 Jaccard=0.300725 JaroDistance=0.538423 JaroWinklerDistance=0.723054 SørensenDiceCoefficient=0.462396 SimHashPlusHammingDistance=0.914063

Post URL 1: http://my.oschina.net/apdplat/blog/411032
Post URL 2: http://yangshangchuan.iteye.com/blog/2208691

119. Post checked: How to Use Eclipse to Develop a Java8 Maven Project on Github?, similarity scores: Simple=0.585388 Cosine=0.727306 EditDistance=0.093607 EuclideanDistance=0.018296 ManhattanDistance=0.001282 Jaccard=0.36246 JaroDistance=0.623102 JaroWinklerDistance=0.773861 SørensenDiceCoefficient=0.532067 SimHashPlusHammingDistance=0.9375

Post URL 1: http://my.oschina.net/apdplat/blog/403394
Post URL 2: http://yangshangchuan.iteye.com/blog/2203623

120. Post checked: A Summary of English Word Prefix Rules, similarity scores: Simple=0.972856 Cosine=0.961455 EditDistance=0.821063 EuclideanDistance=0.005796 ManhattanDistance=0.000576 Jaccard=0.873817 JaroDistance=0.787185 JaroWinklerDistance=0.872311 SørensenDiceCoefficient=0.93266 SimHashPlusHammingDistance=0.953125

Post URL 1: http://my.oschina.net/apdplat/blog/378753
Post URL 2: http://yangshangchuan.iteye.com/blog/2186327

121. Post checked: Design and Implementation of Business Logs and Monitoring Logs in APDPlat, similarity scores: Simple=0.914701 Cosine=0.688401 EditDistance=0.683705 EuclideanDistance=0.001488 ManhattanDistance=0.000661 Jaccard=0.631373 JaroDistance=0.762758 JaroWinklerDistance=0.857655 SørensenDiceCoefficient=0.774038 SimHashPlusHammingDistance=0.898438

Post URL 1: http://my.oschina.net/apdplat/blog/196604
Post URL 2: http://yangshangchuan.iteye.com/blog/2010571

122. Post checked: HtmlExtractor 1.1 Released, a Web Page Information Extraction Component, similarity scores: Simple=0.497303 Cosine=0.693784 EditDistance=0.104639 EuclideanDistance=0.019003 ManhattanDistance=0.001333 Jaccard=0.317857 JaroDistance=0.562032 JaroWinklerDistance=0.737219 SørensenDiceCoefficient=0.482385 SimHashPlusHammingDistance=0.929688

Post URL 1: http://my.oschina.net/apdplat/blog/402138
Post URL 2: http://yangshangchuan.iteye.com/blog/2202864

123. Post checked: Big Data Series 5: Pig – A Big Data Analysis Platform, similarity scores: Simple=0.744606 Cosine=0.946269 EditDistance=0.428863 EuclideanDistance=0.016885 ManhattanDistance=0.001314 Jaccard=0.416914 JaroDistance=0.717753 JaroWinklerDistance=0.830652 SørensenDiceCoefficient=0.588482 SimHashPlusHammingDistance=1.0

Post URL 1: http://my.oschina.net/apdplat/blog/396584
Post URL 2: http://yangshangchuan.iteye.com/blog/1950274

124. Post checked: SOLR4.2+NUTCH1.6, similarity scores: Simple=0.575516 Cosine=0.851802 EditDistance=0.116792 EuclideanDistance=0.019144 ManhattanDistance=0.001344 Jaccard=0.326316 JaroDistance=0.624108 JaroWinklerDistance=0.774465 SørensenDiceCoefficient=0.492063 SimHashPlusHammingDistance=0.992188

Post URL 1: http://my.oschina.net/apdplat/blog/397150
Post URL 2: http://yangshangchuan.iteye.com/blog/2200131

125. Post checked: Some Thoughts on the Abstract Storage Layer of Nutch2.1, similarity scores: Simple=0.61795 Cosine=0.829765 EditDistance=0.188967 EuclideanDistance=0.014661 ManhattanDistance=0.001238 Jaccard=0.41847 JaroDistance=0.648482 JaroWinklerDistance=0.789089 SørensenDiceCoefficient=0.590031 SimHashPlusHammingDistance=0.96875

Post URL 1: http://my.oschina.net/apdplat/blog/396129
Post URL 2: http://yangshangchuan.iteye.com/blog/1835074

126. Post checked: A Chinese Word Segmentation Method That Uses an ngram Model to Resolve Ambiguity, similarity scores: Simple=0.679279 Cosine=0.934887 EditDistance=0.306223 EuclideanDistance=0.01882 ManhattanDistance=0.001318 Jaccard=0.423369 JaroDistance=0.695029 JaroWinklerDistance=0.817017 SørensenDiceCoefficient=0.594883 SimHashPlusHammingDistance=0.992188

Post URL 1: http://my.oschina.net/apdplat/blog/411112
Post URL 2: http://yangshangchuan.iteye.com/blog/2208737

127. Post checked: word Segmenter v1.2 Released, a Distributed Chinese Word Segmentation Component in Java, similarity scores: Simple=0.542051 Cosine=0.736881 EditDistance=0.115897 EuclideanDistance=0.019516 ManhattanDistance=0.001346 Jaccard=0.324373 JaroDistance=0.597623 JaroWinklerDistance=0.758574 SørensenDiceCoefficient=0.489851 SimHashPlusHammingDistance=1.0

Post URL 1: http://my.oschina.net/apdplat/blog/402144
Post URL 2: http://yangshangchuan.iteye.com/blog/2202878

128. Post checked: Too Many Technical Frameworks, a Dizzying Number: How Do You Find Your Own Direction Among So Many Choices?, similarity scores: Simple=0.655518 Cosine=0.839188 EditDistance=0.171823 EuclideanDistance=0.020387 ManhattanDistance=0.001366 Jaccard=0.465857 JaroDistance=0.0 JaroWinklerDistance=0.4 SørensenDiceCoefficient=0.635611 SimHashPlusHammingDistance=0.96875

Post URL 1: http://my.oschina.net/apdplat/blog/393810
Post URL 2: http://yangshangchuan.iteye.com/blog/2197217
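
In the results listed above, the Jaccard and SørensenDiceCoefficient scores appear consistent with the identity D = 2J / (1 + J), which holds whenever both measures are computed over the same pair of sets. Entry 85, for example, reports Jaccard=0.363636 and SørensenDiceCoefficient=0.533333, and 2 × 0.363636 / 1.363636 ≈ 0.533333. The following is a minimal Java sketch of the two set-based measures, assuming each page has already been reduced to a set of words. The class `SetSimilarity`, its method names, and the sample word sets are hypothetical illustrations, not the word segmenter's own API or implementation.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration only; not part of the word segmenter library.
final class SetSimilarity {

    // Jaccard similarity: |A ∩ B| / |A ∪ B|. Two empty sets are treated as identical (1.0).
    public static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    // Sørensen–Dice coefficient: 2 * |A ∩ B| / (|A| + |B|). Two empty sets give 1.0.
    public static double dice(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        return 2.0 * intersection.size() / (a.size() + b.size());
    }

    // Converts a Jaccard score into the Dice score for the same pair of sets.
    public static double diceFromJaccard(double jaccard) {
        return 2.0 * jaccard / (1.0 + jaccard);
    }

    public static void main(String[] args) {
        // Made-up word sets standing in for two segmented pages.
        Set<String> page1 = new HashSet<>(Arrays.asList("word", "segmentation", "similarity", "detection"));
        Set<String> page2 = new HashSet<>(Arrays.asList("word", "segmentation", "web", "page"));

        double j = jaccard(page1, page2); // 2 shared words out of 6 distinct words
        double d = dice(page1, page2);    // 2 * 2 shared words over 4 + 4 words
        System.out.printf("Jaccard=%.6f Dice=%.6f viaIdentity=%.6f%n", j, d, diceFromJaccard(j));

        // Jaccard value reported for entry 85 in the list above.
        System.out.printf("Entry 85: Dice from Jaccard=%.6f%n", diceFromJaccard(0.363636));
    }
}
```

Because Dice is a monotonic function of Jaccard, the two columns always rank page pairs in the same order, so in practice one of them is enough when choosing a duplicate-detection threshold.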
