Python Web Crawler 6 – Web Page Encoding

August 21, 2016, by 白42


While scraping a web page, I ran into the following error:

Traceback (most recent call last):
  File "D:/pythonDevelop/spider/pic_grab.py", line 14, in <module>
    print(get("http://pp.163.com/longer-yowoo/pp/10069141.html"))
  File "D:/pythonDevelop/spider/pic_grab.py", line 8, in get
    content = response.read().decode("utf8")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc7 in position 69: invalid continuation byte

The scraping code and the target URL are as follows:

#!python
# encoding: utf-8

from urllib.request import urlopen

def get(url):
    response = urlopen(url)
    content = response.read().decode("utf8")  # assumes UTF-8; this line raises the error
    response.close()
    return content


if __name__ == '__main__':
    print(get("http://pp.163.com/longer-yowoo/pp/10069141.html"))

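The error message shows that the page is not encoded in utf-8. So how do you find out which encoding a page actually uses? There are several ways:

Look for the charset declaration in the page source;
Use Firebug. Reload the page, capture the page load in Firebug's network panel, open the response headers of the target page, and find Content-Type; the charset value there is the page encoding;
Use the "View Page Info" item in Firefox's context menu: right-click a blank area of the page -> context menu -> View Page Info, then in the window that opens choose General -> Text Encoding to see the page's encoding.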
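The first two checks can also be done in code rather than by hand. Below is a minimal sketch, not part of the original script: the helper names `charset_from_header` and `charset_from_html` are hypothetical, and the regular expression is a rough heuristic for common `<meta>` declarations, not a full HTML parser.

```python
import re
from email.message import Message

# Rough pattern for <meta charset="..."> and
# <meta http-equiv="Content-Type" content="text/html; charset=...">
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def charset_from_header(content_type):
    """Return the charset named in a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lowercased, None if no charset is given


def charset_from_html(raw):
    """Return the charset declared in a <meta> tag of the raw bytes, or None."""
    # The declaration normally sits near the top of the document
    match = META_CHARSET.search(raw[:2048])
    return match.group(1).decode("ascii").lower() if match else None
```

With a response from `urlopen`, the header value is available as `response.headers.get("Content-Type")`, and the raw bytes come from `response.read()`. If both helpers return None, the encoding still has to be found some other way, for example with the browser checks above.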

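The page's encoding turned out to be gbk. After changing the decode call to use gbk instead of utf8, the script ran without errors.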
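One possible way to write the corrected function is sketched below. It is an illustration rather than the author's exact fix: the `decode_body` helper and the `fallback` parameter are hypothetical names, and the sketch assumes that gbk is a sensible default when the server does not declare a charset in its Content-Type header.

```python
from urllib.request import urlopen


def decode_body(raw, charset=None, fallback="gbk"):
    """Decode raw page bytes with the declared charset, else the fallback."""
    return raw.decode(charset or fallback)


def get(url, fallback="gbk"):
    """Fetch url and return its body as text."""
    with urlopen(url) as response:
        # Charset from the Content-Type response header, or None if absent
        charset = response.headers.get_content_charset()
        return decode_body(response.read(), charset, fallback)


if __name__ == '__main__':
    print(get("http://pp.163.com/longer-yowoo/pp/10069141.html"))
```

Decoding stays strict here, so a wrong guess still raises UnicodeDecodeError instead of silently producing garbled text.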
