Day01 - Data Analysis in Practice - Counting Papers (DataWhale)

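I. Counting papers

Goal: count the number of papers in each area of computer science over the whole of 2019.

Steps

1. Select the records whose update_date falls in 2019

2. Select the records whose categories belong to computer science

3. Count them

1. Reading the raw data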

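# Imports
import seaborn as sns  # plotting
from bs4 import BeautifulSoup  # HTML parsing for the scraped taxonomy page
import re  # regular expressions for string pattern matching
import requests  # HTTP requests, used to fetch the taxonomy page
import json  # parsing JSON records
import pandas as pd  # data processing and analysis
import matplotlib.pyplot as plt  # plotting
data = []  # one dict per paper
# A with statement closes the file handle automatically, even if an exception is raised while reading
with open("arxiv-metadata-oai-snapshot.json", 'r') as f:
    for line in f:
        data.append(json.loads(line))

# To explore the data quickly, read only the first 100 lines instead of the whole file:
#     for idx, line in enumerate(f):
#         if idx >= 100:
#             break
#         data.append(json.loads(line))
data = pd.DataFrame(data)  # convert the list to a DataFrame for analysis
data.shape  # (number of rows, number of columns)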
(1796911, 14)
data.head(2)
id submitter authors title comments journal-ref doi report-no categories license abstract versions update_date authors_parsed
0 0704.0001 Pavel Nadolsky C. Bal\'azs, E. L. Berger, P. M. Nadolsky, C.-... Calculation of prompt diphoton production cros... 37 pages, 15 figures; published version Phys.Rev.D76:013009,2007 10.1103/PhysRevD.76.013009 ANL-HEP-PR-07-12 hep-ph None A fully differential calculation in perturba... [{'version': 'v1', 'created': 'Mon, 2 Apr 2007... 2008-11-26 [[Balázs, C., ], [Berger, E. L., ], [Nadolsky,...
1 0704.0002 Louis Theran Ileana Streinu and Louis Theran Sparsity-certifying Graph Decompositions To appear in Graphs and Combinatorics None None None math.CO cs.CG http://arxiv.org/licenses/nonexclusive-distrib... We describe a new algorithm, the $(k,\ell)$-... [{'version': 'v1', 'created': 'Sat, 31 Mar 200... 2008-12-13 [[Streinu, Ileana, ], [Theran, Louis, ]]
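
Loading every field of all 1.8 million records into a list can take a lot of memory. The following is a minimal sketch of a lighter reader, assuming the file holds one JSON object per line as in the loop above. The helper name read_records and its keep and limit parameters are hypothetical and are not part of the original notebook.

```python
import io
import json
from itertools import islice


def read_records(lines, keep=("id", "categories", "update_date"), limit=None):
    """Parse JSON-lines input, keeping only the keys listed in `keep`.

    `lines` is any iterable of strings (an open file works);
    `limit` stops after that many input lines, None reads everything.
    """
    records = []
    for line in islice(lines, limit):
        line = line.strip()
        if not line:  # skip blank lines
            continue
        obj = json.loads(line)
        records.append({key: obj.get(key) for key in keep})
    return records


# Toy input standing in for the real snapshot file
sample = io.StringIO(
    '{"id": "0704.0001", "categories": "hep-ph", "update_date": "2008-11-26", "title": "A"}\n'
    '{"id": "0704.0002", "categories": "math.CO cs.CG", "update_date": "2008-12-13", "title": "B"}\n'
)
demo = read_records(sample, limit=1)
print(demo)
```

With the real file, the call would be read_records(f) inside the with block, and the resulting list can be passed to pd.DataFrame as before.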

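Fields in the dataset:

  • id: arXiv ID, which can be used to access the paper;
  • submitter: the person who submitted the paper;
  • authors: the paper's authors;
  • title: the paper's title;
  • comments: additional information such as page and figure counts;
  • journal-ref: the journal in which the paper was published;
  • doi: Digital Object Identifier, https://www.doi.org;
  • report-no: report number;
  • categories: the categories (tags) assigned to the paper in the arXiv system;
  • license: the license of the paper;
  • abstract: the paper's abstract;
  • versions: the version history of the paper;
  • update_date: the date the record was last updated;
  • authors_parsed: the authors' names in parsed form.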

2. Data preprocessing

First look at the categories column to get a basic picture of the dataset.

data['categories'].describe()
count      1796911
unique       62055
top       astro-ph
freq         86914
Name: categories, dtype: object

- count: number of non-null elements;

- unique: number of distinct values;

- top: the most frequent value;

- freq: how many times the most frequent value occurs.
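
To make the four numbers concrete, here is a small sketch that reproduces them by hand with collections.Counter. The list cats is made-up toy data, not taken from the dataset.

```python
from collections import Counter

# Toy stand-in for the categories column
cats = ["hep-ph", "math.CO cs.CG", "hep-ph", "math.CO"]

counts = Counter(cats)
top, freq = counts.most_common(1)[0]  # most frequent value and its frequency

summary = {
    "count": len(cats),     # number of elements
    "unique": len(counts),  # number of distinct values
    "top": top,
    "freq": freq,
}
print(summary)
```

Note that "math.CO cs.CG" counts as its own value here, which is why unique is so large (62055) on the real column: every distinct combination of codes is a separate string.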

data['categories'].head(4)
0            hep-ph
1     math.CO cs.CG
2    physics.gen-ph
3           math.CO
Name: categories, dtype: object

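Next, inspect how categories is organized and map it onto arXiv's official taxonomy, so that the computer science records are easy to find.

Scrape the category data from the official site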

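# Fetch the HTML of the taxonomy page
website_html = requests.get('https://arxiv.org/category_taxonomy').text
# Parse it with the lxml parser (fast; requires the lxml package)
soup = BeautifulSoup(website_html, 'lxml')
# Locate the element that contains the taxonomy list
root = soup.find('div', {'id': 'category_taxonomy_list'})
# Collect all heading and paragraph tags in document order
tags = root.find_all(["h2", "h3", "h4", "p"], recursive=True)
# Initialize the str and list variables
level_1_name = ""
level_2_name = ""
level_2_code = ""
level_3_name = ""
level_3_code = ""
level_1_names = []
level_2_codes = []
level_2_names = []
level_3_codes = []
level_3_names = []
level_3_notes = []
# Walk the tags: h2 = group, h3 = archive, h4 = category, p = category description
for t in tags:
    if t.name == "h2":
        level_1_name = t.text
        # Reset the archive fields; a group without h3 archives keeps the group name here
        level_2_code = t.text
        level_2_name = t.text
    elif t.name == "h3":
        raw = t.text
        # Heading looks like "Name (code)": group 2 is the code, group 1 is the name
        level_2_code = re.sub(r"(.*)\((.*)\)", r"\2", raw)
        level_2_name = re.sub(r"(.*)\((.*)\)", r"\1", raw)
    elif t.name == "h4":
        raw = t.text
        # Heading looks like "code (Name)": group 1 is the code, group 2 is the name
        level_3_code = re.sub(r"(.*) \((.*)\)", r"\1", raw)
        level_3_name = re.sub(r"(.*) \((.*)\)", r"\2", raw)
    elif t.name == "p":
        # A paragraph closes one category entry: store the current state of every level
        notes = t.text
        level_1_names.append(level_1_name)
        level_2_names.append(level_2_name)
        level_2_codes.append(level_2_code)
        level_3_names.append(level_3_name)
        level_3_codes.append(level_3_code)
        level_3_notes.append(notes)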
# Build a DataFrame from the collected lists
df_taxonomy = pd.DataFrame({
    'group_name': level_1_names,
    'archive_name': level_2_names,
    'archive_id': level_2_codes,
    'category_name': level_3_names,
    'categories': level_3_codes,
    'category_description': level_3_notes
})

# Note: groupby returns a new GroupBy object and does not reorder df_taxonomy;
# its result is not used here, so the table below is shown in page order
df_taxonomy.groupby(["group_name", "archive_name"])
df_taxonomy
group_name archive_name archive_id category_name categories category_description
0 Computer Science Computer Science Computer Science Artificial Intelligence cs.AI Covers all areas of AI except Vision, Robotics...
1 Computer Science Computer Science Computer Science Hardware Architecture cs.AR Covers systems organization and hardware archi...
2 Computer Science Computer Science Computer Science Computational Complexity cs.CC Covers models of computation, complexity class...
3 Computer Science Computer Science Computer Science Computational Engineering, Finance, and Science cs.CE Covers applications of computer science to the...
4 Computer Science Computer Science Computer Science Computational Geometry cs.CG Roughly includes material in ACM Subject Class...
... ... ... ... ... ... ...
150 Statistics Statistics Statistics Computation stat.CO Algorithms, Simulation, Visualization
151 Statistics Statistics Statistics Methodology stat.ME Design, Surveys, Model Selection, Multiple Tes...
152 Statistics Statistics Statistics Machine Learning stat.ML Covers machine learning papers (supervised, un...
153 Statistics Statistics Statistics Other Statistics stat.OT Work in statistics that does not fit into the ...
154 Statistics Statistics Statistics Statistics Theory stat.TH stat.TH is an alias for math.ST. Asymptotics, ...

155 rows × 6 columns
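
The h4 branch of the loop above relies on re.sub to split a heading into a code and a name. re.sub returns its input unchanged when the pattern does not match, so a heading in an unexpected format would silently end up as both code and name. A minimal sketch of a stricter variant follows; the helper split_heading is hypothetical, and the sample string only assumes the "code (Name)" layout that the resulting table suggests.

```python
import re

# Same pattern as in the h4 branch: "<code> (<name>)"
HEADING = re.compile(r"(.*) \((.*)\)")


def split_heading(text):
    """Return (code, name) for 'code (Name)'; return (text, '') if the format differs."""
    text = text.strip()
    match = HEADING.fullmatch(text)
    if match is None:
        return text, ""
    return match.group(1), match.group(2)


code, name = split_heading("cs.AI (Artificial Intelligence)")
print(code, "|", name)
```

Returning an empty name for a non-matching heading makes format changes on the page visible in the resulting table instead of hiding them.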

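Looking at the data above, a single categories value can hold several categories separated by spaces, so we need to check how many distinct categories exist in total.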

# A set removes duplicates automatically
unique_categories = {c for cats in data['categories'] for c in cats.split(' ')}
len(unique_categories)
176
unique_categories
{'acc-phys',
 'adap-org',
 'alg-geom',
 'ao-sci',
 'astro-ph',
 'astro-ph.CO',
 'astro-ph.EP',
 'astro-ph.GA',
 'astro-ph.HE',
 'astro-ph.IM',
 'astro-ph.SR',
 'atom-ph',
 'bayes-an',
 'chao-dyn',
 'chem-ph',
 'cmp-lg',
 'comp-gas',
 'cond-mat',
 'cond-mat.dis-nn',
 'cond-mat.mes-hall',
 'cond-mat.mtrl-sci',
 'cond-mat.other',
 'cond-mat.quant-gas',
 'cond-mat.soft',
 'cond-mat.stat-mech',
 'cond-mat.str-el',
 'cond-mat.supr-con',
 'cs.AI',
 'cs.AR',
 'cs.CC',
 'cs.CE',
 'cs.CG',
 'cs.CL',
 'cs.CR',
 'cs.CV',
 'cs.CY',
 'cs.DB',
 'cs.DC',
 'cs.DL',
 'cs.DM',
 'cs.DS',
 'cs.ET',
 'cs.FL',
 'cs.GL',
 'cs.GR',
 'cs.GT',
 'cs.HC',
 'cs.IR',
 'cs.IT',
 'cs.LG',
 'cs.LO',
 'cs.MA',
 'cs.MM',
 'cs.MS',
 'cs.NA',
 'cs.NE',
 'cs.NI',
 'cs.OH',
 'cs.OS',
 'cs.PF',
 'cs.PL',
 'cs.RO',
 'cs.SC',
 'cs.SD',
 'cs.SE',
 'cs.SI',
 'cs.SY',
 'dg-ga',
 'econ.EM',
 'econ.GN',
 'econ.TH',
 'eess.AS',
 'eess.IV',
 'eess.SP',
 'eess.SY',
 'funct-an',
 'gr-qc',
 'hep-ex',
 'hep-lat',
 'hep-ph',
 'hep-th',
 'math-ph',
 'math.AC',
 'math.AG',
 'math.AP',
 'math.AT',
 'math.CA',
 'math.CO',
 'math.CT',
 'math.CV',
 'math.DG',
 'math.DS',
 'math.FA',
 'math.GM',
 'math.GN',
 'math.GR',
 'math.GT',
 'math.HO',
 'math.IT',
 'math.KT',
 'math.LO',
 'math.MG',
 'math.MP',
 'math.NA',
 'math.NT',
 'math.OA',
 'math.OC',
 'math.PR',
 'math.QA',
 'math.RA',
 'math.RT',
 'math.SG',
 'math.SP',
 'math.ST',
 'mtrl-th',
 'nlin.AO',
 'nlin.CD',
 'nlin.CG',
 'nlin.PS',
 'nlin.SI',
 'nucl-ex',
 'nucl-th',
 'patt-sol',
 'physics.acc-ph',
 'physics.ao-ph',
 'physics.app-ph',
 'physics.atm-clus',
 'physics.atom-ph',
 'physics.bio-ph',
 'physics.chem-ph',
 'physics.class-ph',
 'physics.comp-ph',
 'physics.data-an',
 'physics.ed-ph',
 'physics.flu-dyn',
 'physics.gen-ph',
 'physics.geo-ph',
 'physics.hist-ph',
 'physics.ins-det',
 'physics.med-ph',
 'physics.optics',
 'physics.plasm-ph',
 'physics.pop-ph',
 'physics.soc-ph',
 'physics.space-ph',
 'plasm-ph',
 'q-alg',
 'q-bio',
 'q-bio.BM',
 'q-bio.CB',
 'q-bio.GN',
 'q-bio.MN',
 'q-bio.NC',
 'q-bio.OT',
 'q-bio.PE',
 'q-bio.QM',
 'q-bio.SC',
 'q-bio.TO',
 'q-fin.CP',
 'q-fin.EC',
 'q-fin.GN',
 'q-fin.MF',
 'q-fin.PM',
 'q-fin.PR',
 'q-fin.RM',
 'q-fin.ST',
 'q-fin.TR',
 'quant-ph',
 'solv-int',
 'stat.AP',
 'stat.CO',
 'stat.ME',
 'stat.ML',
 'stat.OT',
 'stat.TH',
 'supr-con'}

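The official site lists 155 categories and our dataset contains 176. The gap is not a problem here: a code that is missing from the taxonomy finds no match in the left merge below, gets an empty group_name, and drops out of the per-group counts.

Now extract the data we want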
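
To see exactly which codes account for the gap, a set difference is enough. The sketch below uses small made-up sets; in the notebook the two sides would be unique_categories and set(df_taxonomy['categories']).

```python
# Toy stand-ins for the dataset codes and the scraped taxonomy codes
dataset_codes = {"cs.AI", "cs.CG", "math.CO", "acc-phys", "cmp-lg"}
taxonomy_codes = {"cs.AI", "cs.CG", "math.CO", "stat.ML"}

only_in_dataset = sorted(dataset_codes - taxonomy_codes)   # codes the taxonomy does not list
only_in_taxonomy = sorted(taxonomy_codes - dataset_codes)  # codes that never occur in the data

print(only_in_dataset)
print(only_in_taxonomy)
```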

# Extract the year from update_date
data['year'] = pd.to_datetime(data['update_date']).dt.year
# Keep only the 2019 records
data = data[data['year'] == 2019]
# Keep id and categories
data_d = pd.DataFrame(data[['id', 'categories']])
data_d = data_d.reset_index(drop=True)
data_d.shape
(170618, 2)
# Left-join data_d with df_taxonomy on 'categories' to find each paper's group,
# then drop duplicate ('id', 'group_name') pairs.
# Note: the join key is the full categories string, so only records with a single category code find a match
data_f = data_d.merge(df_taxonomy, on='categories', how='left').drop_duplicates(['id', 'group_name'])
# Count ids per group_name
data_f = pd.DataFrame(data_f.groupby('group_name')['id'].count())
data_f = data_f.rename(columns={'id': 'count'})
# Sort by count: the explode tuple in the pie chart below is positional and assumes this order
data_f = data_f.sort_values(by='count', ascending=False)
data_f = data_f.reset_index()
data_f
                                   group_name  count
0                                     Physics  38379
1                                 Mathematics  24495
2                            Computer Science  18087
3                                  Statistics   1802
4  Electrical Engineering and Systems Science   1371
5                        Quantitative Biology    886
6                        Quantitative Finance    352
7                                   Economics    173
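
The merge above joins on the full categories string, so a record such as "math.CO cs.CG" has no counterpart in df_taxonomy and is left out of the counts. One possible alternative, which is not the original notebook's method, is to split each record into one row per code before merging. The sketch below runs on made-up toy frames; only the column names follow the notebook.

```python
import pandas as pd

# Toy stand-ins for data_d and df_taxonomy
papers = pd.DataFrame({
    "id": ["a1", "a2", "a3"],
    "categories": ["hep-ph", "math.CO cs.CG", "cs.AI cs.LG"],
})
taxonomy = pd.DataFrame({
    "categories": ["hep-ph", "math.CO", "cs.CG", "cs.AI", "cs.LG"],
    "group_name": ["Physics", "Mathematics", "Computer Science",
                   "Computer Science", "Computer Science"],
})

# One row per (paper, category code)
exploded = papers.assign(categories=papers["categories"].str.split(" ")).explode("categories")
merged = exploded.merge(taxonomy, on="categories", how="left")

# Count each paper once per group it belongs to
per_group = (
    merged.drop_duplicates(["id", "group_name"])
          .groupby("group_name")["id"].count()
          .sort_values(ascending=False)
)
print(per_group)
```

With this rule a cross-listed paper is counted once in every group it touches, so the group totals can add up to more than the number of papers. Counting only the first listed code would be another reasonable choice, depending on the question being asked.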

3. Data visualization

First plot each group's share of the 2019 papers, then look at the computer science subcategories.

fig = plt.figure(figsize=(10, 12))
# One offset per row of data_f, in the same sorted order; the small slices are pulled outward
explode = (0, 0, 0, 0.2, 0.3, 0.3, 0.2, 0.1)
plt.pie(data_f['count'], labels=data_f['group_name'], autopct='%1.2f%%', startangle=160, explode=explode)
plt.tight_layout()
plt.show()

[Figure 1: pie chart of the 2019 paper counts by group_name]
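
The explode tuple in the plotting code is written by hand and needs exactly one entry per slice. A minimal sketch of deriving it from the counts instead is shown below. The helper explode_small_slices and its threshold and offset values are arbitrary assumptions, and a uniform offset will not reproduce the hand-tuned values used above.

```python
# Counts from the data_f table, in sorted order
counts = [38379, 24495, 18087, 1802, 1371, 886, 352, 173]


def explode_small_slices(values, threshold=0.05, offset=0.2):
    """Return one offset per value: `offset` for slices below `threshold` of the total, else 0."""
    total = sum(values)
    return tuple(offset if value / total < threshold else 0.0 for value in values)


explode = explode_small_slices(counts)
print(explode)
```

The result can be passed directly as the explode argument of plt.pie, and it stays the right length if the number of groups changes.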

# Same merge as above, restricted to the Computer Science group
comp = data_d.merge(df_taxonomy, on='categories', how='left').query("group_name == 'Computer Science'").drop_duplicates(['id', 'group_name'])
# Count ids per category_name, largest first
comp = pd.DataFrame(comp.groupby(['category_name'])['id'].count()).rename(columns={'id': 'count'})
comp = comp.sort_values(by='count', ascending=False).reset_index()
comp.head(5)
                             category_name  count
0  Computer Vision and Pattern Recognition   5559
1                 Computation and Language   2153
2                Cryptography and Security   1067
3                                 Robotics    917
4     Networking and Internet Architecture    864

The result shows that Computer Vision and Pattern Recognition is the largest CS subcategory by paper count. Note that only the 2019 data is counted here.
