Python Web Crawler 5 – Image Scraping

This section looks at how to scrape the images on a web page. The target URL is http://pp.163.com/longer-yowoo/pp/10069141.html, which hosts a set of pictures I like very much.

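To scrape the images, the first step is to find their URLs. BeautifulSoup is used again here. How to use it was covered in an earlier section, "Using BeautifulSoup to Parse Web Pages", so it is not repeated. Here is the code: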

#!python
# encoding: utf-8
from urllib.request import urlopen

from bs4 import BeautifulSoup


def get(url):
    # Fetch the page and decode the response bytes as GBK text.
    with urlopen(url) as response:
        html = response.read().decode("gbk")
    return html


def detect(html):
    # Select every <img> tag that has a data-lazyload-src attribute.
    soup = BeautifulSoup(html, "html.parser")
    images = soup.select("img[data-lazyload-src]")
    return images


def main():
    html = get("http://pp.163.com/longer-yowoo/pp/10069141.html")
    links = detect(html)
    for link in links:
        print(link.attrs['data-lazyload-src'])


if __name__ == '__main__':
    main()

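In the code above, soup.select("img[data-lazyload-src]") finds every img tag that has a data-lazyload-src attribute. Once the tags are collected, the code reads the data-lazyload-src attribute of each one and prints it. There are six in total.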
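To see what that attribute filter matches without touching the network, here is a minimal sketch. It deliberately swaps BeautifulSoup for the standard library's html.parser so it runs with no third-party install. The class name LazyImageCollector and the sample markup are assumptions made up for illustration; they are not taken from the target page.

```python
from html.parser import HTMLParser


class LazyImageCollector(HTMLParser):
    """Collects the data-lazyload-src value of every <img> tag that has one."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "img":
            attributes = dict(attrs)
            if "data-lazyload-src" in attributes:
                self.urls.append(attributes["data-lazyload-src"])


# Hypothetical markup, not copied from the real page.
SAMPLE_HTML = """
<div>
  <img src="placeholder.gif" data-lazyload-src="http://example.com/a.jpg">
  <img src="logo.png">
  <img data-lazyload-src="http://example.com/b.jpg">
</div>
"""

collector = LazyImageCollector()
collector.feed(SAMPLE_HTML)
# The middle <img> has no data-lazyload-src attribute, so it is skipped.
print(collector.urls)
```

The img tag that only has a plain src attribute is ignored, which is the same behavior the selector img[data-lazyload-src] gives in BeautifulSoup.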

Next comes fetching the images themselves. First, look at a line from the earlier code:

html = response.read().decode("gbk")

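This line fetches the page content and converts it to a string. Here, response is the HTTP response, read returns the raw bytes of the response body, and decode turns those bytes into a string. A string is ultimately a sequence of bytes, and so is an image. That makes fetching an image straightforward: get the image's bytes over HTTP, then save those bytes to disk. Here is the implementation: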

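def download(url, pic_path):
    # Read the raw image bytes; there is no decode step because this is not text.
    with urlopen(url) as response:
        img_bytes = response.read()
    # "wb" opens the file for writing in binary mode.
    with open(pic_path, "wb") as f:
        f.write(img_bytes)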

Note the mode argument "wb" passed to open: w means the file is opened for writing, and b means binary mode.
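The difference between text and image bytes, and the reason for the b in "wb", can be checked offline. The following is a rough sketch under a few assumptions: the sample string, the file name sample.bin, and the use of the PNG file signature as stand-in image data are all chosen for illustration and do not come from the crawler itself.

```python
import os
import tempfile

# Text needs an explicit encoding to move between str and bytes.
text = "图片"
raw = text.encode("gbk")      # str -> bytes
restored = raw.decode("gbk")  # bytes -> str

# Image data stays as bytes. These are the first eight bytes of a PNG file;
# they are not valid UTF-8, so trying to decode them as text fails.
png_signature = b"\x89PNG\r\n\x1a\n"
try:
    png_signature.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.bin")  # hypothetical file name

    # Binary mode accepts bytes and stores them unchanged.
    with open(path, "wb") as f:
        written = f.write(png_signature)
    with open(path, "rb") as f:
        read_back = f.read()

    # Text mode ("w" without "b") expects str and rejects bytes.
    try:
        with open(path, "w") as f:
            f.write(png_signature)
        text_mode_accepts_bytes = True
    except TypeError:
        text_mode_accepts_bytes = False

print(restored == text, decodes_as_utf8, read_back == png_signature,
      text_mode_accepts_bytes)
```

Binary mode hands the bytes to the file untouched, which is what an image needs. Text mode is meant for str objects and would apply an encoding, so it is the wrong choice here.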

Now for the complete program:

#!python
# encoding: utf-8
import os
from urllib.request import urlopen

from bs4 import BeautifulSoup


def get(url):
    # Fetch the page and decode the response bytes as GBK text.
    with urlopen(url) as response:
        html = response.read().decode("gbk")
    return html


def detect(html):
    # Select every <img> tag that has a data-lazyload-src attribute.
    soup = BeautifulSoup(html, "html.parser")
    images = soup.select("img[data-lazyload-src]")
    return images


def download(url, pic_path):
    # Read the raw image bytes and write them to disk in binary mode.
    with urlopen(url) as response:
        img_bytes = response.read()
    with open(pic_path, "wb") as f:
        f.write(img_bytes)


def main():
    html = get("http://pp.163.com/longer-yowoo/pp/10069141.html")
    images = detect(html)
    # Note: "/pics" is an absolute path at the filesystem root.
    pic_folder = "/pics"
    # exist_ok avoids an error when the folder already exists.
    os.makedirs(pic_folder, exist_ok=True)
    for i, image in enumerate(images):
        url = image.attrs['data-lazyload-src']
        # Files are simply named by index: 0.jpg, 1.jpg, ...
        download(url, pic_folder + "/" + str(i) + ".jpg")


if __name__ == '__main__':
    main()

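The code above can still be improved: the name and extension of each downloaded file would ideally be derived from the download URL. Here I took a shortcut and assigned arbitrary file names, and the extension was already known in advance.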
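One possible way to derive the file name from the URL is sketched below. This is only an illustration of the improvement mentioned above, not part of the original program: the helper name filename_from_url, the fallback name image.jpg, and the example URLs are all assumptions. It relies only on urllib.parse and posixpath from the standard library.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(url, default="image.jpg"):
    """Return the last path segment of url, or default if there is none."""
    # urlsplit separates the path from the query string and fragment.
    path = urlsplit(url).path
    # Decode percent-escapes first, then keep only the final path segment.
    name = posixpath.basename(unquote(path))
    return name or default


# Hypothetical URLs for illustration only.
name = filename_from_url("http://example.com/photos/abc123.jpg?size=large")
stem, extension = posixpath.splitext(name)
print(name, stem, extension)

# A URL without a file name in its path falls back to the default.
print(filename_from_url("http://example.com/"))
```

In main, the call could then become download(url, pic_folder + "/" + filename_from_url(url)). This sketch does not check for duplicate names or unusual characters, so a real crawler would probably want extra checks before using the result as a path.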
