Crawler Modules: Requests - 白红宇


Published: 2019-06-14


requests

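Requests is the only Non-GMO HTTP library for Python, safe for human consumption.

The Python standard library provides modules such as urllib, urllib2 and httplib for making HTTP requests (in Python 3 these became urllib.request and http.client), but their API is thoroughly broken. It was built for a different time and a different web, and it takes an enormous amount of work, even method overrides, to accomplish the simplest tasks.

Requests is an HTTP library written in Python and released under the Apache2 License. It is built on urllib3, which in turn wraps the standard library's HTTP modules, and it exposes a much higher-level API that makes network requests far more pleasant for Pythonistas. With Requests you can easily perform nearly any HTTP operation a browser can.

First, let's look at how GET and POST requests differ.

GET requests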

# 1. Without parameters
import requests

ret = requests.get('https://github.com/timeline.json')
print(ret.url)
print(ret.text)

# 2. With parameters
import requests

payload = {'key1': 'value1', 'key2': 'value2'}
ret = requests.get("http://httpbin.org/get", params=payload)
print(ret.url)
print(ret.text)
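To see what params does without touching the network, a request can be prepared but not sent. The sketch below is an illustration rather than part of the original examples; it relies only on requests.Request and its prepare() method, and the variable name prepared is chosen here.

```python
import requests

payload = {'key1': 'value1', 'key2': 'value2'}

# Build the request object and prepare it; nothing is sent over the network.
prepared = requests.Request('GET', 'http://httpbin.org/get', params=payload).prepare()

# The dict has been URL-encoded into the query string of the final URL.
print(prepared.method)
print(prepared.url)
```

This is the same URL that ret.url reports after a real requests.get call with the same params.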

POST requests

# 1. Basic POST example
import requests

payload = {'key1': 'value1', 'key2': 'value2'}
ret = requests.post("http://httpbin.org/post", data=payload)
print(ret.text)

# 2. Sending request headers and data
import requests
import json

url = 'https://api.github.com/some/endpoint'
payload = {'some': 'data'}
headers = {'content-type': 'application/json'}

ret = requests.post(url, data=json.dumps(payload), headers=headers)
print(ret.text)
print(ret.cookies)
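The two POST examples differ in how the body is encoded. As a rough offline illustration (the variable names are chosen here, and nothing is actually sent), the prepared requests below show the body and Content-Type produced by a dict passed as data, by a pre-serialized JSON string with an explicit header, and by the json parameter, which is assumed to be available in the installed Requests version.

```python
import json

import requests

payload = {'key1': 'value1', 'key2': 'value2'}
url = 'http://httpbin.org/post'

# data=dict -> form-encoded body, as in the basic POST example.
form_req = requests.Request('POST', url, data=payload).prepare()

# data=json.dumps(...) plus an explicit header -> JSON body, as in example 2.
manual_json_req = requests.Request(
    'POST', url,
    data=json.dumps(payload),
    headers={'content-type': 'application/json'},
).prepare()

# json=dict -> Requests serializes the dict and sets the header itself.
auto_json_req = requests.Request('POST', url, json=payload).prepare()

for req in (form_req, manual_json_req, auto_json_req):
    print(req.headers['Content-Type'], req.body)
```

Passing json=payload is usually the shorter way to get what the second example builds by hand.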

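The other HTTP methods work the same way:

r = requests.put("http://httpbin.org/put")
r = requests.delete("http://httpbin.org/delete")
r = requests.head("http://httpbin.org/get")
r = requests.options("http://httpbin.org/get")
# r is the Response object; the data we want can be read from it

Response content

Requests automatically decodes content from the server. Most unicode charsets are decoded seamlessly.

When you make a request, Requests makes an educated guess about the encoding of the response based on the HTTP headers. That guessed encoding is what Requests uses when you access r.text. You can find out which encoding Requests is using, and change it, with the r.encoding attribute:

>>> r.encoding
'utf-8'
>>> r.encoding = 'ISO-8859-1'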

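If you change the encoding, Requests uses the new value of r.encoding whenever you access r.text. You may want to do this when special logic is needed to work out the encoding of the content. For example, HTML and XML can declare their encoding in the body itself. In that case, use r.content to find the encoding, then set r.encoding accordingly, so that r.text is decoded with the correct encoding.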
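One way to "find the encoding from r.content" is to look for a charset declaration in the raw bytes. The following is a minimal sketch using only the standard library; sniff_charset and the sample page are hypothetical names and data introduced here, and real pages may need a more robust detector.

```python
import re
from typing import Optional

# Matches declarations such as <meta charset="gbk"> or
# <meta http-equiv="Content-Type" content="text/html; charset=gbk">.
_META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def sniff_charset(content: bytes) -> Optional[str]:
    """Return the charset declared in the first 2 KB of an HTML body, if any."""
    match = _META_CHARSET.search(content[:2048])
    return match.group(1).decode('ascii').lower() if match else None


# Stand-in for r.content: a GBK-encoded page that declares its own charset.
sample_content = (
    '<html><head><meta charset="GBK"></head><body>中文</body></html>'
).encode('gbk')

declared = sniff_charset(sample_content)
print(declared)
print(sample_content.decode(declared))

# With a real response r, the equivalent step would be:
#     r.encoding = sniff_charset(r.content) or r.encoding
#     text = r.text
```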

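If you want the raw response content from the server, pass stream=True and read r.raw:

>>> r = requests.get('https://github.com/timeline.json', stream=True)
>>> r.raw
<urllib3.response.HTTPResponse object at 0x...>
>>> r.raw.read(10)
b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03'

Once the response is streamed like this, it should not be saved the way an ordinary response is. Write it to disk chunk by chunk instead (filename and chunk_size are values you supply):

with open(filename, 'wb') as fd:
    for chunk in r.iter_content(chunk_size):
        fd.write(chunk)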
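The chunked-write pattern can be wrapped in a small helper. This is a sketch under stated assumptions: save_stream and FakeResponse are names invented here, and FakeResponse only imitates the iter_content() method so the helper can be exercised without a network connection.

```python
import os
import tempfile


def save_stream(response, path, chunk_size=8192):
    """Write a streamed response to path chunk by chunk; return bytes written."""
    written = 0
    with open(path, 'wb') as fd:
        for chunk in response.iter_content(chunk_size):
            if chunk:  # skip empty chunks
                fd.write(chunk)
                written += len(chunk)
    return written


class FakeResponse:
    """Hypothetical stand-in exposing only iter_content(), for offline use."""

    def __init__(self, body):
        self._body = body

    def iter_content(self, chunk_size):
        for start in range(0, len(self._body), chunk_size):
            yield self._body[start:start + chunk_size]


body = b'x' * 20000
target = os.path.join(tempfile.mkdtemp(), 'download.bin')
print(save_stream(FakeResponse(body), target, chunk_size=4096))
```

With a real download, the call would look like save_stream(requests.get(url, stream=True), 'file.bin').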

Response status codes

We can check the response status code:

>>> r = requests.get('http://httpbin.org/get')
>>> r.status_code
200

For easy reference, Requests also ships with a built-in status code lookup object:

>>> r.status_code == requests.codes.ok
True

If we made a bad request (a 4XX client error or a 5XX server error response), we can raise an exception with Response.raise_for_status():

>>> bad_r = requests.get('http://httpbin.org/status/404')
>>> bad_r.status_code
404
>>> bad_r.raise_for_status()
Traceback (most recent call last):
  File "requests/models.py", line 832, in raise_for_status
    raise http_error
requests.exceptions.HTTPError: 404 Client Error
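In a script, raise_for_status() is usually wrapped in a try/except. The sketch below shows one way to do that; classify and fake_response are helper names made up for this illustration, and fake_response builds a bare requests.Response with only status_code set so that the example runs offline.

```python
import requests


def classify(response):
    """Return 'ok' for a successful response, else the HTTPError message."""
    try:
        response.raise_for_status()
    except requests.exceptions.HTTPError as exc:
        return str(exc)
    return 'ok'


def fake_response(status_code):
    # Offline stand-in: a bare Response with only status_code filled in.
    resp = requests.Response()
    resp.status_code = status_code
    return resp


print(classify(fake_response(requests.codes.ok)))
print(classify(fake_response(requests.codes.not_found)))
```

For a 2XX response raise_for_status() returns None and raises nothing, so calling it unconditionally after every request is safe.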

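Redirection and history

By default, Requests automatically handles all redirects, except for HEAD requests.

You can use the history attribute of the response object to track redirects.

Response.history is a list of the Response objects that were created in order to complete the request. The list is sorted from the oldest to the most recent response.

For example, GitHub redirects all HTTP requests to HTTPS:

>>> r = requests.get('http://github.com')
>>> r.url
'https://github.com/'
>>> r.status_code
200
>>> r.history
[<Response [301]>]

If you are using GET, OPTIONS, POST, PUT, PATCH or DELETE, you can disable redirect handling with the allow_redirects parameter:

>>> r = requests.get('http://github.com', allow_redirects=False)
>>> r.status_code
301
>>> r.history
[]

If you are using HEAD, you can enable redirection as well:

>>> r = requests.head('http://github.com', allow_redirects=True)
>>> r.url
'https://github.com/'
>>> r.history
[<Response [301]>]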
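The same behaviour can be reproduced without depending on GitHub. The sketch below starts a throwaway local server (RedirectHandler and the /old and /new paths are invented for this example) that answers /old with a 301, then compares a request that follows the redirect with one that does not.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class RedirectHandler(BaseHTTPRequestHandler):
    """Hypothetical local server: /old answers 301 -> /new, anything else 200."""

    def do_GET(self):
        if self.path == '/old':
            self.send_response(301)
            self.send_header('Location', '/new')
            self.send_header('Content-Length', '0')
            self.end_headers()
        else:
            body = b'done'
            self.send_response(200)
            self.send_header('Content-Length', str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):  # keep the example output quiet
        pass


server = HTTPServer(('127.0.0.1', 0), RedirectHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = 'http://127.0.0.1:%d' % server.server_port

session = requests.Session()
session.trust_env = False  # ignore proxy settings from the environment

followed = session.get(base + '/old')
not_followed = session.get(base + '/old', allow_redirects=False)

server.shutdown()
server.server_close()

print(followed.url, followed.status_code, followed.history)
print(not_followed.status_code, not_followed.history)
```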

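Timeouts

You can tell Requests to stop waiting for a response after a given number of seconds with the timeout parameter:

requests.get('http://github.com', timeout=0.001)

One point deserves particular attention:

timeout is not a limit on the total time taken to download the response. A single value bounds how long Requests waits to establish the connection and then how long it waits for the server to send data. An exception is raised if the server has not responded within timeout seconds (more precisely, if no bytes have been received on the underlying socket for timeout seconds).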
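The difference between the connection phase and the wait for data can be seen by passing timeout as a (connect, read) tuple. The following sketch assumes a local socket that accepts TCP connections at the OS level but never answers; the socket setup and the outcome variable are illustration-only names.

```python
import socket

import requests

# Hypothetical silent server: it listens, so the OS completes the TCP
# handshake, but the program never accepts the connection or replies.
silent = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
silent.bind(('127.0.0.1', 0))
silent.listen(1)
url = 'http://127.0.0.1:%d/' % silent.getsockname()[1]

session = requests.Session()
session.trust_env = False  # ignore proxy settings from the environment

try:
    # (connect timeout, read timeout), both in seconds
    session.get(url, timeout=(2, 0.3))
    outcome = 'response received'
except requests.exceptions.ConnectTimeout:
    outcome = 'connect timeout'
except requests.exceptions.ReadTimeout:
    outcome = 'read timeout'
finally:
    silent.close()

print(outcome)
```

Both exception classes derive from requests.exceptions.Timeout, so a single except clause on that class is enough when the distinction does not matter.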

Below are some simple login experiments against several well-known sites, all done with Requests.

Automatic login examples

We will use the BeautifulSoup module here, which has to be installed separately.

# __author__: Administrator
# date: 2017/2/10
import re
import json
import base64

import rsa
import requests


def js_encrypt(text):
    # RSA-encrypt the text with the site's public key, then base64-encode it
    b64der = 'MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQCp0wHYbg/NOPO3nzMD3dndwS0MccuMeXCHgVlGOoYyFwLdS24Im2e7YyhB0wrUsyYf0/nhzCzBK8ZC9eCWqd0aHbdgOQT6CuFQBMjbyGYvlVYU2ZP7kG9Ft6YV6oc9ambuO7nPZh+bvXH0zDKfi02prknrScAKC0XhadTHT3Al0QIDAQAB'
    der = base64.standard_b64decode(b64der)
    pk = rsa.PublicKey.load_pkcs1_openssl_der(der)
    v1 = rsa.encrypt(bytes(text, 'utf8'), pk)
    value = base64.encodebytes(v1).replace(b'\n', b'')
    value = value.decode('utf8')
    return value


session = requests.Session()

# Fetch the sign-in page and pull the verification token out of its script
i1 = session.get('https://passport.cnblogs.com/user/signin')
rep = re.compile("'VerificationToken': '(.*)'")
v = re.search(rep, i1.text)
verification_token = v.group(1)

form_data = {
    'input1': js_encrypt('username'),
    'input2': js_encrypt('password'),
    'remember': False
}

# Post the encrypted credentials as JSON together with the token header
i2 = session.post(url='https://passport.cnblogs.com/user/signin',
                  data=json.dumps(form_data),
                  headers={
                      'Content-Type': 'application/json; charset=UTF-8',
                      'X-Requested-With': 'XMLHttpRequest',
                      'VerificationToken': verification_token}
                  )

# The session keeps the login cookies, so this page is now accessible
i3 = session.get(url='https://i.cnblogs.com/EditDiary.aspx')
print(i3.text)

博客园 (cnblogs)

# __author__: Administrator
# date: 2017/2/10
import requests
from bs4 import BeautifulSoup
from bs4.element import Tag

# Visit the login URL to obtain the authenticity_token
response = requests.get(url='https://github.com/login')
soup = BeautifulSoup(response.text, features='lxml')
tag = soup.find(name='input', attrs={'name': 'authenticity_token'})  # the hidden input tag
authenticity_token = tag.get('value')  # the authenticity_token value
r1 = response.cookies.get_dict()  # cookies from the first response

# Submit the credentials together with the authenticity_token
form_data = {
    'commit': 'Sign in',
    'utf8': '✓',
    'authenticity_token': authenticity_token,
    'login': 'username',
    'password': 'password'
}
response2 = requests.post(
    url='https://github.com/session',
    data=form_data,
    cookies=r1
)
r2 = response2.cookies.get_dict()  # cookies from the second response
r1.update(r2)

# Request a page that requires login, carrying the merged cookies
response3 = requests.get(
    url='https://github.com/settings/repositories',
    cookies=r1
)
soup3 = BeautifulSoup(response3.text, features='lxml')
tag = soup3.find(name='div', class_='listgroup')
print(tag)
for children in tag.children:
    if isinstance(children, Tag):
        project_tag = children.find(name='a', class_='mr-1')
        size_tag = children.find(name='small')
        temp = "Project: %s (%s); path: %s" % (project_tag.string, size_tag.string, project_tag.get('href'))
        print(temp)

GitHub

# __author__: Administrator
# date: 2017/2/9
import requests

# Visit any page first to obtain a cookie
response1 = requests.get(
    url="http://dig.chouti.com/help/service",
)
c1 = response1.cookies.get_dict()

# Log in carrying that cookie; the backend authorizes the gpsd value in it
form_data = {
    'phone': '86xxxxxxxxxxx',  # country code 86 followed by the phone number
    'password': 'password',
    'oneMonth': 1
}
response2 = requests.post(
    url='http://dig.chouti.com/login',
    data=form_data,
    cookies=c1
)
gpsd = c1['gpsd']

# Upvote (only the already-authorized gpsd cookie is needed)
response3 = requests.post(
    url='http://dig.chouti.com/link/vote?linksId=10256811',
    cookies={
        'gpsd': gpsd
    }
)
print(response3.text)

抽屉 (Chouti): log in and upvote

# __author__: Administrator
# date: 2017/2/10
import hashlib
import re

import requests


def md5_pwd(pwd):
    # MD5 is a hash, not encryption; the site expects the hex digest
    m = hashlib.md5()
    m.update(pwd.encode('utf8'))
    new_pwd = m.hexdigest()
    return new_pwd


# r1 = requests.get(url='https://mp.weixin.qq.com/')
# c1 = r1.cookies.get_dict()
# print(c1)

form_data = {
    'username': 'username',
    'pwd': md5_pwd('password'),
    'imgcode': '',
    'f': 'json'
}
# print(form_data['pwd'])
r2 = requests.post(
    url='https://mp.weixin.qq.com/cgi-bin/bizlogin',
    params={'action': 'startlogin'},
    data=form_data,
    # cookies=c1,
    headers={
        'Referer': 'https://mp.weixin.qq.com/',
        # 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'
    }
)
c2 = r2.cookies.get_dict()
resp_text = r2.text
print(resp_text)

# Extract the numeric token from the response text
token = re.findall(r".*token=(\d+)", resp_text)[0]
print(token)
# print(c2)

res_user_list = requests.get(
    url="https://mp.weixin.qq.com/cgi-bin/user_tag",
    params={"action": "get_all_data", "lang": "zh_CN", "token": token},
    cookies=c2,
    headers={'Referer': 'https://mp.weixin.qq.com/cgi-bin/login?lang=zh_CN'}
)
user_info = res_user_list.text
# print(user_info)

微信公众号 (WeChat Official Accounts)
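The two helper steps in this example, hashing the password and pulling the token out of the response, can be checked in isolation. The sketch below is an offline illustration: extract_token is a name introduced here, the sample response body is made up, and unlike the findall(...)[0] call above it returns None instead of raising IndexError when no token is present.

```python
import hashlib
import re


def md5_pwd(pwd):
    """Hex MD5 digest of the password, as the login form expects."""
    return hashlib.md5(pwd.encode('utf8')).hexdigest()


def extract_token(resp_text):
    """Return the numeric token in the response text, or None if absent."""
    match = re.search(r'token=(\d+)', resp_text)
    return match.group(1) if match else None


# Made-up response body, shaped like a JSON login reply with a redirect URL.
sample = (
    '{"base_resp": {"ret": 0}, '
    '"redirect_url": "/cgi-bin/home?t=home/index&lang=zh_CN&token=123456789"}'
)

print(md5_pwd('abc'))
print(extract_token(sample))
print(extract_token('{"base_resp": {"ret": -1}}'))
```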

# __author__: Administrator
# date: 2017/2/10
import time

import requests
from bs4 import BeautifulSoup

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'

# Visit the login page
session = requests.session()
r1 = session.get(
    url='https://www.zhihu.com/',
    headers={
        'Referer': 'https://www.zhihu.com/',
        'User-Agent': USER_AGENT
    }
)
soup1 = BeautifulSoup(r1.text, features='lxml')
tag = soup1.find(name='input', attrs={'name': '_xsrf'})  # find the input tag that holds the xsrf value
xsrf = tag.get('value')
print(xsrf)

# Download the captcha image
r = time.time()
r2 = session.get(
    url='https://www.zhihu.com/captcha.gif',
    params={'r': r, 'type': 'login', 'lang': 'cn'},  # r is the current timestamp
    headers={'User-Agent': USER_AGENT}
)
with open('zhihu.gif', 'wb') as f:
    f.write(r2.content)  # r2.content is raw bytes, written to a local file

# Log in with the captcha and the form data
zhihu_gif = input('Enter the captcha shown in zhihu.gif >>>>> ')
form_data = {
    '_xsrf': xsrf,
    'password': 'password',
    'captcha': zhihu_gif,
    'email': 'username',
}
print(form_data['captcha'])
r3 = session.post(
    url='https://www.zhihu.com/login/email',
    data=form_data,
    headers={'User-Agent': USER_AGENT}
)
# print(r3.json())

r4 = session.get(
    url='https://www.zhihu.com/settings/profile',
    headers={'User-Agent': USER_AGENT}
)
soup4 = BeautifulSoup(r4.text, features='lxml')  # HTML of the page after login
print(soup4.text)
tag4 = soup4.find_all(name='span')
print(tag4)
# nick_name = tag4.find('span', class_='name').string
# print(nick_name)

知乎 (Zhihu)