Web Scraping Concepts, Tools, and HTTP (Baidu Translate)

Python programming quick-start (ongoing series…)

What is a web scraper?

A scraper simulates a client (such as a browser): it sends requests, receives responses, and extracts data according to rules.

Where does the scraped data go?

Presentation: displayed on a web page or in an app
Analysis: mining the data for patterns

Required software and environment

python3
PyCharm
Chrome browser

The browser's request

URL
Type a URL into the browser and press Enter to send the request.

URL = protocol + domain name + resource path + parameters
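This decomposition can be checked with Python's standard library: urllib.parse splits a URL into exactly these pieces (the example URL here is just an illustration):

```python
from urllib.parse import urlsplit

# scheme (protocol) + netloc (domain) + path (resource) + query (parameters)
parts = urlsplit("http://fanyi.baidu.com/v2transapi?from=zh&to=en")
print(parts.scheme)  # http
print(parts.netloc)  # fanyi.baidu.com
print(parts.path)    # /v2transapi
print(parts.query)   # from=zh&to=en
```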
What the browser requests at a URL

The response for the current URL + JS + CSS + images ==> the content shown under Elements
What a scraper requests at a URL

Only the response for the current URL itself
The Elements panel and the response a scraper receives can differ; a scraper must extract data from the response that actually corresponds to the current URL

Where to find the response for the current URL

- In the Network tab, find the URL and click Response
- Right-click the page and choose View Page Source

HTTP vs. HTTPS, and requests

Differences between HTTP and HTTPS

HTTP: HyperText Transfer Protocol
Plaintext transmission
Fast, but insecure

HTTPS = HTTP + SSL (Secure Sockets Layer)
Data is encrypted before transmission and decrypted on receipt
Slower, but secure

Differences between GET and POST

GET has no request body; POST does. GET puts its data in the URL.
POST requests are commonly used for registration and for transferring large payloads.
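"GET puts its data in the URL" can be seen with urllib.parse.urlencode, which builds exactly the query string a GET request appends after `?` (the URL and parameter names below are made up for illustration):

```python
from urllib.parse import urlencode

params = {"query": "hello", "from": "zh", "to": "en"}

# a GET request carries the data in the URL itself...
get_url = "http://example.com/translate?" + urlencode(params)
print(get_url)  # http://example.com/translate?query=hello&from=zh&to=en

# ...while a POST request sends the same key=value pairs in the request body,
# leaving the URL bare: http://example.com/translate
```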

Request and response

HTTP request
Request line
Request headers
User-Agent
Cookie
Request body
POST requests have one

HTTP response
Response headers
Set-Cookie: the server uses Set-Cookie to keep track of the client
Response body
the response for the URL
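How Set-Cookie works can be sketched with the stdlib http.cookies module, which parses a Set-Cookie header value the same way a browser would (the header value below is invented):

```python
from http.cookies import SimpleCookie

# parse a hypothetical Set-Cookie header sent by a server
cookie = SimpleCookie()
cookie.load("session_id=abc123; Path=/; HttpOnly")

# the browser stores this pair and sends it back on later requests,
# which is how the server recognizes the same client
print(cookie["session_id"].value)    # abc123
print(cookie["session_id"]["path"])  # /
```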

Learning the requests module

Before use:
pip install requests

Sending GET/POST requests:
res = requests.get(url)
res = requests.post(url, data=data)  # data is a dict

GET request to Baidu

import requests

url = "http://www.baidu.com"

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"}
response = requests.get(url, headers=headers)
# print(response)

# get the page's HTML as a string
# response.encoding = "utf-8"
#
# print(response.text)


print(response.content.decode())

POST request to Baidu Translate

# coding=utf-8
import requests
url = "http://fanyi.baidu.com/v2transapi"

query_string = {
    "query": "你好,世界",
    "from": "zh",
    "to": "en"}

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36",}

response = requests.post(url, data=query_string, headers=headers)
print(response)

print(response.content.decode())
print(type(response.content.decode()))

Back in 2018 this did not yet fail with {'errno': 997, 'errmsg': '未知错误', 'query': '你好,世界', 'from': 'zh', 'to': 'en', 'error': 997} ('errmsg' means "unknown error")

Unfortunately, each different input requires a different sign value:

# coding=utf-8
import requests

import json


url = "https://fanyi.baidu.com/v2transapi"

query_str = input("Enter the Chinese text to translate: ")

query_string = {
    "from": "zh",
    "to": "en",
    "query": query_str,
    "transtype": "translang",
    "simple_means_flag": "3",
    "sign": "232427.485594",
    "token": "2dad051a370c1e7db4bb12e060918365",
    "domain": "common"
}

headers={
     "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
        "Cookie": "BIDUPSID=E768CB6962B70440474D229F084AF889; PSTM=1631781236; BAIDUID=E768CB6962B704405F86432AD6EEB9D4:FG=1; __yjs_duid=1_6a4da6895d9b673fe4daa9ff5aeb27c01631796974448; BDUSS=ZMbkZjdVRyS281TjI4aHlQUXlEVWxxSXY3VUJvTnQwOU5tVXVTZE1YSjk1SXhoSVFBQUFBJCQAAAAAAAAAAAEAAABGE~tlx6u-~b2txM8AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAH1XZWF9V2Vhan; BDUSS_BFESS=ZMbkZjdVRyS281TjI4aHlQUXlEVWxxSXY3VUJvTnQwOU5tVXVTZE1YSjk1SXhoSVFBQUFBJCQAAAAAAAAAAAEAAABGE~tlx6u-~b2txM8AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAH1XZWF9V2Vhan; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; REALTIME_TRANS_SWITCH=1; FANYI_WORD_SWITCH=1; HISTORY_SWITCH=1; SOUND_SPD_SWITCH=1; SOUND_PREFER_SWITCH=1; Hm_lvt_64ecd82404c51e03dc91cb9e8c025574=1637228707,1637228950; BDSFRCVID=1YIOJeC629w-1dnHrnycJFXobGt9N53TH6f3l897y4qOzVL8n3q5EG0P_U8g0Ku-Vi_gogKK0eOTHkCF_2uxOjjg8UtVJeC6EG0Ptf8g0f5; H_BDCLCKID_SF=JJAtoKLyJKvbfP0kK-nKKPtJbeT22jnuMm39aJ5nJDoNsRjeQ4T1LtugjP6lXPRaamC8Bp3OQpP-HJ7vhpj55R0fbh3mbJ3Htn4DKl0MLp7lbb0xyn_V-qRbLfnMBMPj5mOnanvn3fAKftnOM46JehL3346-35543bRTLnLy5KJtMDFRjjtBjTOLjGRt2tcyatj2WnTJ25r8e5rnhPF3hf_dXP6-35KH0KKDa56JWpjV8nr_Ql7o-4uu54vNBq37JD6y0n3GaCQ-hPnx2-6NXfDyD4oxJpO3BRbMopvaBlnzVbRvbURvD-Lg3-7W3M5dtjTO2bc_5KnlfMQ_bf--QfbQ0hOhqP-j5JIEoD85JD-5hD0r5nJbq4LOKxQfb-RhbPo2WDvqKxJcOR5Jj65C5nFU0aj7XJcZteQlhn7ta4jpojCC3MA--t4L3pJUBMQxLNrabnDKabnYsq0x0McWe-bQypoat6_qaCOMahkM5l7xObvJQlPK5JkgMx6MqpQJQeQ-5KQN3KJmfbL9bT3tjjTXeaDjq6-JJJ3fL-08-b7tf56ghnJ_q4C85aRa2xn9WDTmWlnybUjbbKQS5tRf-fIzWMna2xnitIv9-pnsanu2bhRGXq5f5UCABnjZKxtq3mkjbPbDfn02OP5P3foSK44syP4jKMRnWnciKfA-b4ncjRcTehoM3xI8LNj405OTbIFO0KJzJCFahCP4jjt5ePFyMqO-etjK2CntsJOOaCkaoqTOy4oTj6j-5gFDKP67-Do83hcIyn7DsP-wX6_53MvBbGbQqTFfMRnD2-cCBIn_hPDGQft20h4AeMtjBbLLQD5d0b7jWhk2eq72y-RTQlRX5q79atTMfNTJ-qcH0KQpsIJM5-DWbT8EjH62btt_tJ4eoM5; Hm_lpvt_64ecd82404c51e03dc91cb9e8c025574=1637311796; 
ab_sr=1.0.1_MGFjYTg4ZGIyZGQ4ZWFhMjI2YjVlZWYzZDBlMzA0MmVkZDQ1Yjk0NDNmN2NkZGNhNzk3YjQ3MTk1MTk5OTZhMDY1ZDk4NmYxZjUyZTZmOGI5NjY1YjE5MGM2M2RhNGU3ODY3MDUyYmQyMTNkMWNhYWRiYTVhMjZjNmNiMDkxYWJjZGE4NDQ0NmQ3MzAzM2YyZjJiMzc1YmI5OWFiNDRmMDBjOTViNzkyZGQwYTE3ODRjMDJkNzA5MTcwYzUyNzNj; delPer=0; PSINO=3; BAIDUID_BFESS=E768CB6962B704405F86432AD6EEB9D4:FG=1; BDSFRCVID_BFESS=1YIOJeC629w-1dnHrnycJFXobGt9N53TH6f3l897y4qOzVL8n3q5EG0P_U8g0Ku-Vi_gogKK0eOTHkCF_2uxOjjg8UtVJeC6EG0Ptf8g0f5; H_BDCLCKID_SF_BFESS=JJAtoKLyJKvbfP0kK-nKKPtJbeT22jnuMm39aJ5nJDoNsRjeQ4T1LtugjP6lXPRaamC8Bp3OQpP-HJ7vhpj55R0fbh3mbJ3Htn4DKl0MLp7lbb0xyn_V-qRbLfnMBMPj5mOnanvn3fAKftnOM46JehL3346-35543bRTLnLy5KJtMDFRjjtBjTOLjGRt2tcyatj2WnTJ25r8e5rnhPF3hf_dXP6-35KH0KKDa56JWpjV8nr_Ql7o-4uu54vNBq37JD6y0n3GaCQ-hPnx2-6NXfDyD4oxJpO3BRbMopvaBlnzVbRvbURvD-Lg3-7W3M5dtjTO2bc_5KnlfMQ_bf--QfbQ0hOhqP-j5JIEoD85JD-5hD0r5nJbq4LOKxQfb-RhbPo2WDvqKxJcOR5Jj65C5nFU0aj7XJcZteQlhn7ta4jpojCC3MA--t4L3pJUBMQxLNrabnDKabnYsq0x0McWe-bQypoat6_qaCOMahkM5l7xObvJQlPK5JkgMx6MqpQJQeQ-5KQN3KJmfbL9bT3tjjTXeaDjq6-JJJ3fL-08-b7tf56ghnJ_q4C85aRa2xn9WDTmWlnybUjbbKQS5tRf-fIzWMna2xnitIv9-pnsanu2bhRGXq5f5UCABnjZKxtq3mkjbPbDfn02OP5P3foSK44syP4jKMRnWnciKfA-b4ncjRcTehoM3xI8LNj405OTbIFO0KJzJCFahCP4jjt5ePFyMqO-etjK2CntsJOOaCkaoqTOy4oTj6j-5gFDKP67-Do83hcIyn7DsP-wX6_53MvBbGbQqTFfMRnD2-cCBIn_hPDGQft20h4AeMtjBbLLQD5d0b7jWhk2eq72y-RTQlRX5q79atTMfNTJ-qcH0KQpsIJM5-DWbT8EjH62btt_tJ4eoM5; H_PS_PSSID=34444_35105_31253_35049_35096_34584_34504_35246_34606_34815_26350_35114; BA_HECTOR=04218424002ga02hra1gpepio0q"}

response = requests.post(url, data=query_string, headers=headers)
# print(response)
#
# print(response.content.decode())
# print(type(response.content.decode()))


html_str = response.content.decode()  # a JSON string

dict_ret = json.loads(html_str)  # convert the JSON string to a Python object

ret = dict_ret["trans_result"]['data'][0]['dst']

print("Translation result:", ret)

Response methods

res.text
may come back garbled; set
res.encoding = 'utf-8'

The correct ways to read page content
1. res.content.decode()  # decode raw bytes to str
2. res.content.decode('gbk')
3. res.text
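The difference between these comes down to how raw bytes are decoded. A stdlib-only illustration (no network needed; the strings stand in for response.content):

```python
# response.content is raw bytes; .decode() assumes UTF-8 by default
raw = "百度翻译".encode("utf-8")
print(raw.decode())  # 百度翻译  (same as res.content.decode())

# pages served as GBK must be decoded explicitly, or you get mojibake
raw_gbk = "百度翻译".encode("gbk")
print(raw_gbk.decode("gbk"))  # 百度翻译
print(raw_gbk.decode("utf-8", errors="replace"))  # garbled replacement characters
```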

Using the timeout parameter
res = requests.get(url, headers=headers, timeout=3)  # raises an error if no response within 3 seconds

Learning the retrying module

pip install retrying
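retrying provides a @retry decorator that re-runs a failing function (e.g. @retry(stop_max_attempt_number=3)). A minimal stdlib sketch of the same idea, assuming we want up to 3 attempts:

```python
import functools

def retry(max_attempts=3):
    """Re-run the wrapped function until it succeeds or attempts run out."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # out of attempts: propagate the last error
        return wrapper
    return decorator

calls = []

@retry(max_attempts=3)
def flaky_fetch():
    # simulates a request that times out twice before succeeding
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("simulated timeout")
    return "ok"

print(flaky_fetch())  # ok (succeeds on the 3rd attempt)
```

The real retrying module adds options such as wait intervals between attempts; this sketch only shows the retry-until-success core.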

