Python Study Notes | 2021-09-10 Data Visualization: Word Clouds

Contents

1. In-class exercise

Steps:

Problem encountered:

Cause:

Solution:

Result:

Full code:

2. Extension exercise

Additional step:

Result:

Full code:


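1. In-class exercise

Segment the text of the 14th Five-Year Plan and draw its word cloud.

Steps:

  1. Read the text and segment it with jieba.cut(txt, cut_all=False), which returns a generator of words (jieba.lcut returns a list)
  2. Iterate over the words, count each word's frequency in a dictionary, and filter out words that carry no meaning
  3. Append each key-value pair to a list, sort it in descending order with list.sort() and list.reverse(), and rebuild the dictionary
  4. Set the color map, mask shape, font, and other parameters, then call generate_from_frequencies() to build the word cloud from the frequencies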
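
Steps 2 and 3 can also be sketched more compactly with collections.Counter from the standard library. This is a minimal sketch rather than the code used in this post: the function name top_frequencies, its parameters, and the sample tokens are made up for illustration, and the token list stands in for the output of jieba.cut.

```python
from collections import Counter


def top_frequencies(tokens, stopwords, min_count=1, min_len=2):
    """Count tokens, drop unwanted ones, and return a dict ordered by descending count."""
    counts = Counter(tokens)
    kept = {
        word: n
        for word, n in counts.items()
        if n >= min_count and len(word) >= min_len and word not in stopwords
    }
    # sorted(..., reverse=True) replaces the sort() + reverse() pair;
    # a dict preserves insertion order, so the result stays sorted.
    return dict(sorted(kept.items(), key=lambda item: item[1], reverse=True))


# Hypothetical tokens standing in for the result of jieba.cut(...)
tokens = ["发展", "建设", "发展", "的", "创新", "发展", "建设", "我们"]
freq = top_frequencies(tokens, stopwords={"我们"})
print(freq)
```

The resulting dictionary can be passed to generate_from_frequencies() in the same way as the hand-sorted one in step 4.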

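Problem encountered:

Opening a UTF-8 encoded .txt file: textFile = open("text.txt", "r").read()

raises: UnicodeDecodeError: 'gbk' codec can't decode byte 0xad in position 2: illegal multibyte sequence


Cause:

The encoding parameter of open() defaults to None, in which case Python falls back to the platform's locale encoding. On this Windows machine that is GBK, as the 'gbk' codec named in the error message shows. The file is UTF-8, so decoding its bytes as GBK fails. Passing encoding="utf-8" explicitly makes open() decode the file correctly.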

[Image 1 from the original post]

Solution:

textFile = open("text.txt", "r", encoding="utf-8").read()
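
To see the failure and the fix in isolation, the sketch below writes a small UTF-8 file and reads it back twice. It passes encoding="gbk" explicitly to imitate the default on a Chinese-locale Windows machine, so the error should be reproducible on other platforms too. The sample text and variable names are made up for illustration.

```python
import locale
import os
import tempfile

# The encoding open() falls back to when encoding=None (platform dependent).
default_encoding = locale.getpreferredencoding(False)
print("default encoding:", default_encoding)

# Write a small UTF-8 file to a temporary location (hypothetical sample text).
sample = "十四五"
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
) as tmp:
    tmp.write(sample)
    path = tmp.name

# Reading the UTF-8 bytes as GBK fails, as in the error reported above.
try:
    with open(path, "r", encoding="gbk") as f:
        f.read()
    gbk_failed = False
except UnicodeDecodeError as exc:
    gbk_failed = True
    print("gbk read failed:", exc)

# Reading with the encoding the file was actually written in works.
with open(path, "r", encoding="utf-8") as f:
    utf8_text = f.read()
print("utf-8 read:", utf8_text)

os.remove(path)
```

Passing encoding explicitly on every open() call keeps the script independent of the machine's locale settings.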

Result:

[Image 2 from the original post: the generated word cloud]

Full code:

from wordcloud import WordCloud as wc
import matplotlib.pyplot as plt
import matplotlib.colors as colors
from imageio import imread
import jieba
with open('China145.txt','r',encoding='utf-8') as f:
    rword=f.read()
seg_list=jieba.cut(rword,cut_all=False)

tf={}
for seg in seg_list:
    if seg in tf:
        tf[seg]+=1
    else:
        tf[seg]=1    # count word frequencies
word=list(tf.keys())
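with open('stopword.txt','r',encoding='utf-8') as sw:
    stopword=set(sw.read().split())    # assumes stop words are separated by whitespace/newlines; a set gives exact matches
for seg in word:
    if tf[seg]<5 or len(seg)<2 or seg in stopword or "一" in seg:
        tf.pop(seg)    # drop rare words, single characters, stop words, and words containing "一"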
        
word, num, data = list(tf.keys()), list(tf.values()),[]
for i in range(len(tf)):
    data.append((num[i],word[i]))    # store each (count, word) pair in data
data.sort()    # ascending order
data.reverse()    # reverse to get the desired descending order
tf_sorted={}
for i in range(len(data)):
    tf_sorted[data[i][1]]=data[i][0]    # rebuild the dictionary

font=r'C:\Users\ZZX\AppData\Local\Microsoft\Windows\Fonts\STZHONGS.TTF'
mask = imread("heart.png")
colormaps = colors.ListedColormap(['#FF0000','#FF7F50','#FFE4C4'])
mywc=wc(font_path=font,width=1600,height=900,max_words=300,background_color='white',colormap=colormaps,mask=mask).generate_from_frequencies(tf_sorted)
plt.axis('off')    # hide the axes
plt.imshow(mywc)    # draw the image onto the axes; nothing is displayed until plt.show()
plt.show()
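
One detail in the filtering step is worth checking: the in operator behaves differently on a string and on a set. Against the raw contents of a stop-word file it is a substring test, so a word is dropped whenever it appears inside any longer stop word; against a set it is an exact match. The sketch below uses a made-up two-line stop-word list to show the difference.

```python
# Hypothetical contents of a stop-word file, one word per line.
raw_stopwords = "实际上\n因此\n"

as_string = raw_stopwords             # "in" performs a substring search
as_set = set(raw_stopwords.split())   # "in" performs an exact-match lookup

word = "实际"   # not in the list itself, but a substring of "实际上"
print(word in as_string)   # True: would be filtered out by accident
print(word in as_set)      # False: kept, as intended
```

A set also makes each lookup constant time on average, which helps when the text is as long as a novel.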

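2. Extension exercise

Segment the text of Journey to the West and draw its word cloud as an at-a-glance overview of the book.

Additional step:

Color the word cloud using the colors of the image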

from wordcloud import WordCloud,ImageColorGenerator

image_colors=ImageColorGenerator(mask)
plt.imshow(mywc.recolor(color_func=image_colors))    # mywc is the WordCloud instance built in the full code
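
As the wordcloud documentation describes it, ImageColorGenerator colors each word with the mean color of the image region that the word covers. The pure-Python sketch below illustrates only that averaging idea on a tiny nested-list image. The function mean_color and the sample pixels are hypothetical and are not part of wordcloud, whose own implementation works on NumPy arrays.

```python
def mean_color(pixels, top, left, height, width):
    """Return the average RGB color of a rectangle as an 'rgb(r, g, b)' string."""
    totals = [0, 0, 0]
    count = 0
    for row in pixels[top:top + height]:
        for r, g, b in row[left:left + width]:
            totals[0] += r
            totals[1] += g
            totals[2] += b
            count += 1
    if count == 0:
        raise ValueError("rectangle does not overlap the image")
    # Integer division keeps each channel in the 0-255 range.
    return "rgb({}, {}, {})".format(*(t // count for t in totals))


# Hypothetical 2x2 image: left column red, right column blue.
image = [
    [(255, 0, 0), (0, 0, 255)],
    [(255, 0, 0), (0, 0, 255)],
]
print(mean_color(image, 0, 0, 2, 1))   # rgb(255, 0, 0)
print(mean_color(image, 0, 0, 2, 2))   # rgb(127, 0, 127)
```

A word placed over the red half would come out red, while a word spanning both halves would take the blended color.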

Result:

[Image 3 from the original post: the generated word cloud]

Full code:

from wordcloud import WordCloud as wc
import matplotlib.pyplot as plt
from wordcloud import ImageColorGenerator
from PIL import Image
import numpy as np
import jieba
with open('Journey to the West.txt','r',encoding='utf-8') as f:
    rword=f.read()
seg_list=jieba.cut(rword,cut_all=False)

tf={}
for seg in seg_list:
    if seg in tf:
        tf[seg]+=1
    else:
        tf[seg]=1    
word=list(tf.keys())
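with open('stopword.txt','r',encoding='utf-8') as sw:
    stopword=set(sw.read().split())    # assumes stop words are separated by whitespace/newlines; a set gives exact matches
for seg in word:
    if tf[seg]<5 or len(seg)<2 or seg in stopword or "一" in seg:
        tf.pop(seg)    # drop rare words, single characters, stop words, and words containing "一"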
        
word, num, data = list(tf.keys()), list(tf.values()),[]
for i in range(len(tf)):
    data.append((num[i],word[i]))    
data.sort()    
data.reverse()   
tf_sorted={}
for i in range(len(data)):  
    tf_sorted[data[i][1]]=data[i][0]   

font=r'C:\Users\ZZX\AppData\Local\Microsoft\Windows\Fonts\STZHONGS.TTF'
mask=np.array(Image.open("photo.png"))
mywc=wc(width=600,height=600,max_words=300,font_path=font,background_color='white',mask=mask).generate_from_frequencies(tf_sorted)
image_colors=ImageColorGenerator(mask)
plt.imshow(mywc.recolor(color_func=image_colors))
plt.axis('off')  
plt.show()
