# Deep Learning: Natural Language Processing with BERT

Natural Language Processing (NLP) covers natural language understanding and natural language generation. Applications of natural language understanding include semantic analysis, automated customer service, speech recognition, machine translation, and more.

The transformer deep network architecture plays a pivotal role in NLP. It was first proposed in 2017 by a Google research team in the paper "Attention Is All You Need" and brought major progress to the field. BERT is a transformer-based language model, as is GPT-3. BERT has many variant architectures, such as RoBERTa, ALBERT, SpanBERT, DistilBERT, SesameBERT, SemBERT, SciBERT, BioBERT, MobileBERT, TinyBERT, and CamemBERT, all of which are built on the transformer.

## How Transformers (BERT) Perform on NLP Tasks

### NLP Task: Judging Whether a Sentence Is Positive or Negative

# Copyright shichaog@126.com. All rights reserved.
from transformers import pipeline
import textwrap
wrapper = textwrap.TextWrapper(width=80, break_long_words=False, break_on_hyphens=False)

#Classifying whole sentences
sentence = 'Both of these choices are good if you’re just starting to work with deep learning frameworks. Mathematicians and experienced researchers will find PyTorch more to their liking. Keras is better suited for developers who want a plug-and-play framework that lets them build, train, and evaluate their models quickly. Keras also offers more deployment options and easier model export.'
classifier = pipeline('text-classification', model='distilbert-base-uncased-finetuned-sst-2-english')
c = classifier(sentence)
print('\nSentence:')
print(wrapper.fill(sentence))
print(f"\nThis sentence is classified with a {c[0]['label']} sentiment")


This sentence is classified with a POSITIVE sentiment


### Classifying Words Within a Sentence (Named Entity Recognition)

sentence = "Both platforms enjoy sufficient levels of popularity that they offer plenty of learning resources. Keras has excellent access to reusable code and tutorials, while PyTorch has outstanding community support and active development."
ner = pipeline('token-classification', model='dbmdz/bert-large-cased-finetuned-conll03-english', grouped_entities=True)
ners = ner(sentence)
print('\nSentence:')
print(wrapper.fill(sentence))
print('\n')
for n in ners:
    print(f"{n['word']} -> {n['entity_group']}")


Keras -> ORG
PyTorch -> ORG


### Question Answering

# Question answering
context = '''
TensorFlow is a free and open-source software library for machine learning and artificial intelligence. It can be used across a range of tasks but has a particular focus on training and inference of deep neural networks.
TensorFlow was developed by the Google Brain team for internal Google use in research and production. The initial version was released under the Apache License 2.0 in 2015. Google released the updated version of TensorFlow, named TensorFlow 2.0, in September 2019.
TensorFlow can be used in a wide variety of programming languages, most notably Python, as well as Javascript, C++, and Java. This flexibility lends itself to a range of applications in many different sectors. '''

question = 'When was TensorFlow initially released?'

# The original does not name the checkpoint used for this pipeline; the
# SQuAD-finetuned DistilBERT below is a common choice for question answering.
qa = pipeline('question-answering', model='distilbert-base-cased-distilled-squad')

print('Text:')
print(wrapper.fill(context))
print('\nQuestion:')
print(question)

a = qa(context=context, question=question)
print('\nAnswer:')
print(a['answer'])


Text:
TensorFlow is a free and open-source software library for machine learning and
artificial intelligence. It can be used across a range of tasks but has a
particular focus on training and inference of deep neural networks. TensorFlow
was developed by the Google Brain team for internal Google use in research and
production. The initial version was released under the Apache License 2.0 in
2015. Google released the updated version of TensorFlow, named TensorFlow 2.0,
in September 2019. TensorFlow can be used in a wide variety of programming
languages, most notably Python, as well as Javascript, C++, and Java. This
flexibility lends itself to a range of applications in many different sectors.

Question:
When was TensorFlow initially released?

Answer:
2015


### Summarization

#Text summarization
review = '''
While both Tensorflow and PyTorch are open-source, they have been created by two different wizards. Tensorflow is based on Theano and has been developed by Google, whereas PyTorch is based on Torch and has been developed by Facebook.
The most important difference between the two is the way these frameworks define the computational graphs. While Tensorflow creates a static graph, PyTorch believes in a dynamic graph. So what does this mean? In Tensorflow, you first have to define the entire computation graph of the model and then run your ML model. But in PyTorch, you can define/manipulate your graph on-the-go. This is particularly helpful while using variable length inputs in RNNs.
Tensorflow has a more steep learning curve than PyTorch. PyTorch is more pythonic and building ML models feels more intuitive. On the other hand, for using Tensorflow, you will have to learn a bit more about it’s working (sessions, placeholders etc.) and so it becomes a bit more difficult to learn Tensorflow than PyTorch.
Tensorflow has a much bigger community behind it than PyTorch. This means that it becomes easier to find resources to learn Tensorflow and also, to find solutions to your problems. Also, many tutorials and MOOCs cover Tensorflow instead of using PyTorch. This is because PyTorch is a relatively new framework as compared to Tensorflow. So, in terms of resources, you will find much more content about Tensorflow than PyTorch.
This comparison would be incomplete without mentioning TensorBoard. TensorBoard is a brilliant tool that enables visualizing your ML models directly in your browser. PyTorch doesn’t have such a tool, although you can always use tools like Matplotlib. Although, there are integrations out there that let you use Tensorboard with PyTorch. But it’s not supported natively.'''

print('\nOriginal text:\n')
print(wrapper.fill(review))
summarize = pipeline('summarization', model='sshleifer/distilbart-cnn-12-6')
summarized_text = summarize(review)[0]['summary_text']
print('\nSummarized text:')
print(wrapper.fill(summarized_text))


Summarized text:
While Tensorflow creates a static graph, PyTorch believes in a dynamic graph .
This is particularly helpful while using variable length inputs in RNNs .
TensorBoard is a brilliant tool that enables visualizing your ML models directly
in your browser . Pytorch is more pythonic and building ML models feels more
intuitive .


### Filling in a Missing Word

# Fill in the blanks; the model choice below is an assumption (not shown in the original)
unmasker = pipeline('fill-mask', model='bert-base-cased')
sentence = 'It is the national [MASK] of China'
for m in unmasker(sentence):
    print(m['sequence'])


It is the national anthem of China
It is the national treasure of China
It is the national motto of China
It is the national pride of China
It is the national capital of China


### Translation

#Translation
english = '''I like artificial intelligence very much!'''

translator = pipeline('translation_en_to_de', model='t5-base')
german = translator(english)
print('\nEnglish:')
print(english)
print('\nGerman:')
print(german[0]['translation_text'])


English:
I like artificial intelligence very much!
German:
Ich mag künstliche Intelligenz sehr!


### The BERT Model

BERT is short for Bidirectional Encoder Representations from Transformers. It was pretrained on English Wikipedia (about 2.5 billion words) and a corpus of published books (about 800 million words). BERT checkpoints come in two families, cased and uncased. To measure the size of the cased BERT model, you can count the parameters of the officially provided checkpoint:

import torch
from transformers import AutoModel

def get_model_size(checkpoint='bert-base-cased'):
    '''
    checkpoint - name of an NLP model checkpoint (its configuration and weights)
    returns the number of parameters of the model
    '''
    model = AutoModel.from_pretrained(checkpoint)
    return sum(torch.numel(param) for param in model.parameters())

checkpoint = 'bert-base-cased'
print(f"The number of parameters for {checkpoint} is : {get_model_size(checkpoint)}")


The number of parameters for bert-base-cased is : 108310272


### The Transformer Architecture and BERT

The transformer architecture was introduced in the paper "Attention Is All You Need" (see the architecture diagram in that paper); its core building block is self-attention.

Self-attention is written as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{n}}\right)V$$

where $Q$ is the query, $K$ is the key, $V$ is the value, $n$ is the dimension of the key vectors, and $QK^{T}$ measures the distance (similarity) between word vectors. The formula looks abstract, so the simple example below walks through the computation step by step; a runnable NumPy sketch of the same computation follows the worked example.

1. Prepare the inputs
2. Initialize the weights
3. Derive the keys, queries, and values
4. Compute the attention scores for input 1
5. Apply softmax
6. Multiply the scores by the values
7. Sum the weighted values from step 6 to get output 1
8. Repeat steps 4-7 for inputs 2 and 3

• Step 1: Prepare the inputs
As an example, we use three 4-dimensional vectors as the inputs:
Input 1: [0, 1, 0, 1]
Input 2: [3, 0, 4, 1]
Input 3: [1, 2, 1, 1]


• Step 2: Initialize the weights
Each input vector is projected into three representations: a key, a query, and a value. If each representation is chosen to be 3-dimensional, then each of the three weight matrices has shape 4×3.

//key
[
[1, 0, 2],
[1, 0, 0],
[0, 1, 2],
[0, 1, 1]
]
//query
[
[1, 1, 0],
[1, 2, 0],
[1, 0, 1],
[0, 0, 1]
]
//value
[
[0, 1, 0],
[1, 3, 0],
[1, 0, 2],
[1, 2, 0]
]

• Step 3: Derive the keys, queries, and values
The key for input 1 is computed as follows:
               [1, 0, 2]
[0, 1, 0, 1] x [1, 0, 0] = [1, 1, 1]
               [0, 1, 2]
               [0, 1, 1]


//key2
               [1, 0, 2]
[3, 0, 4, 1] x [1, 0, 0] = [3, 5, 15]
               [0, 1, 2]
               [0, 1, 1]
//key3
               [1, 0, 2]
[1, 2, 1, 1] x [1, 0, 0] = [3, 2, 5]
               [0, 1, 2]
               [0, 1, 1]


All of the keys can be computed at once in matrix form:

               [1, 0, 2]
[0, 1, 0, 1]   [1, 0, 0]   [1, 1, 1]
[3, 0, 4, 1] x [0, 1, 2] = [3, 5, 15]
[1, 2, 1, 1]   [0, 1, 1]   [3, 2, 5]


//value
               [0, 1, 0]
[0, 1, 0, 1]   [1, 3, 0]   [2, 5, 0]
[3, 0, 4, 1] x [1, 0, 2] = [5, 5, 8]
[1, 2, 1, 1]   [1, 2, 0]   [4, 9, 2]
//query
               [1, 1, 0]
[0, 1, 0, 1]   [1, 2, 0]   [1, 2, 1]
[3, 0, 4, 1] x [1, 0, 1] = [7, 3, 5]
[1, 2, 1, 1]   [0, 0, 1]   [4, 5, 2]


• Step 4: Compute the attention scores for input 1
To obtain the scores, take the dot product of input 1's query with all of the keys (besides the plain dot product, scaled dot product, additive, or concatenative scoring functions can also be used). This gives three attention scores:
            [1,  3,  3]
[1, 2, 1] x [1,  5,  2] = [4, 28, 12]
            [1, 15,  5]


• Step 5: Apply softmax
Apply the softmax to the attention scores; the result is rounded to keep the rest of the computation simple.
softmax([4, 28, 12]) ≈ [0.0, 1.0, 0.0]

• Step 6: Multiply the scores by the values
Multiply each softmaxed attention score by its corresponding value vector; this gives three weighted value vectors.
1: 0.0 * [2, 5, 0] = [0.0, 0.0, 0.0]
2: 1.0 * [5, 5, 8] = [5.0, 5.0, 8.0]
3: 0.0 * [4, 9, 2] = [0.0, 0.0, 0.0]

• Step 7: Sum the weighted values to obtain output 1
  [0.0, 0.0, 0.0]
+ [5.0, 5.0, 8.0]
+ [0.0, 0.0, 0.0]
-----------------
= [5.0, 5.0, 8.0]


• Step 8: Repeat steps 4-7 for inputs 2 and 3
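
The eight steps above can be reproduced in a few lines of NumPy. The following is a minimal sketch of the same unscaled dot-product self-attention using the example's fixed weight matrices; a real transformer learns these matrices and divides the scores by $\sqrt{n}$ before the softmax.

```python
# Minimal NumPy sketch of the worked self-attention example above.
# The weight matrices are the fixed example values, not learned parameters,
# and the scores are not scaled by sqrt(n), matching the hand computation.
import numpy as np

x = np.array([[0, 1, 0, 1],      # input 1
              [3, 0, 4, 1],      # input 2
              [1, 2, 1, 1]],     # input 3
             dtype=float)

w_key   = np.array([[1, 0, 2], [1, 0, 0], [0, 1, 2], [0, 1, 1]], dtype=float)
w_query = np.array([[1, 1, 0], [1, 2, 0], [1, 0, 1], [0, 0, 1]], dtype=float)
w_value = np.array([[0, 1, 0], [1, 3, 0], [1, 0, 2], [1, 2, 0]], dtype=float)

keys    = x @ w_key      # [[1, 1, 1], [3, 5, 15], [3, 2, 5]]
queries = x @ w_query    # [[1, 2, 1], [7, 3, 5], [4, 5, 2]]
values  = x @ w_value    # [[2, 5, 0], [5, 5, 8], [4, 9, 2]]

scores  = queries @ keys.T                                              # row 0: [4, 28, 12]
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)   # softmax
outputs = weights @ values
print(outputs.round(3))
```

The first row of `outputs` reproduces the [5.0, 5.0, 8.0] obtained in step 7.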

## Text Classification with BERT

The IMDB movie-review dataset is used for this task; its structure is shown below.
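
A minimal loading sketch, assuming the Hugging Face `datasets` library is installed; printing the resulting `DatasetDict` gives the structure that follows.

```python
# Load the IMDB movie-review dataset; it ships with train, test, and
# unsupervised splits of 25k, 25k, and 50k reviews respectively.
from datasets import load_dataset

imdb = load_dataset('imdb')
print(imdb)
```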

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})
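
From here a sentiment classifier can be fine-tuned on the reviews. The sketch below is illustrative only: the checkpoint, subset sizes, and hyperparameters are assumptions rather than values taken from the text above.

```python
# Illustrative fine-tuning sketch; checkpoint, subset sizes, and
# hyperparameters are assumptions, not values from the original text.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

checkpoint = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

def tokenize(batch):
    # Fixed-length padding keeps the default data collator simple.
    return tokenizer(batch['text'], truncation=True, padding='max_length', max_length=256)

imdb = load_dataset('imdb')
train_ds = imdb['train'].shuffle(seed=42).select(range(2000)).map(tokenize, batched=True)
eval_ds = imdb['test'].shuffle(seed=42).select(range(500)).map(tokenize, batched=True)

args = TrainingArguments(output_dir='bert-imdb-sentiment',
                         num_train_epochs=1,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()
```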