Python Web Crawler 1 – Simple HTTP Requests

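I will need to write a web crawler in the near future, so I am going to record here what I learn while implementing one.

"Crawler" is only a name. Put simply, it all starts with an HTTP request.

In Python, HTTP requests rely mainly on the urllib.request module. For example, to send an HTTP GET request: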

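from urllib import request

url = 'http://www.zhyea.com/2016/07/17/memory-analyzer-all.html'

# urlopen() sends a GET request when no data is passed
response = request.urlopen(url)
# read() returns the response body as raw bytes
content = response.read()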

It really is that simple.
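In practice a request can fail or hang, so a slightly more defensive version of the same GET call may be useful. The following is a minimal sketch, not part of the original program: the function name fetch, the 10-second timeout, and the fallback to UTF-8 when the server does not declare a charset are all assumptions made for illustration.

```python
from urllib import error, request


def fetch(url, timeout=10, default_charset="utf-8"):
    """Return the decoded body of url, or "" if the request fails.

    The name, the timeout and the charset fallback are illustrative
    assumptions, not something the original article prescribes.
    """
    try:
        with request.urlopen(url, timeout=timeout) as response:
            raw = response.read()
            # Use the charset from the Content-Type header when present
            charset = response.headers.get_content_charset() or default_charset
    except error.URLError as exc:
        # HTTPError is a subclass of URLError, so it is caught here too
        print(f"request failed: {exc}")
        return ""
    return raw.decode(charset, errors="replace")
```

Calling fetch(url) with the URL from the example above would return the page source as a str instead of bytes, which is what the regular-expression step expects.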

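If you print content on the command line, what you usually see is the page's source code. To pull out the information you need, you have to match and filter it. For example, regular expressions can be used to get the title and body elements: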

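import re

def get_target(pattern, content):
    # Return the first match of pattern in content, or "" if there is none
    m = re.search(pattern, content)
    target = ""
    if m:
        target = m.group(0)
    return target

# content from read() is bytes; decode it before matching str patterns
html = content.decode("utf8")
title = get_target(r"<title>.*</title>", html)
body = get_target(r"<body[\w\W]*</body>", html)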
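The patterns above return the whole element, tags included, and the title pattern relies on the element sitting on a single line because `.` does not match a newline by default. If only the inner content is wanted, one possible variation is a non-greedy pattern with a capture group. This is a sketch under assumptions: extract_inner and SAMPLE_HTML are hypothetical names, and the sample markup is made up so the example runs without a network connection.

```python
import re

# Made-up sample page so the example runs offline
SAMPLE_HTML = """<html>
<head><title>Memory Analyzer</title></head>
<body class="post">
<p>Hello, crawler.</p>
</body>
</html>"""


def extract_inner(tag, html):
    """Return the text between <tag ...> and </tag>, or "" if not found."""
    # (.*?) is non-greedy; DOTALL lets "." span line breaks
    pattern = rf"<{tag}\b[^>]*>(.*?)</{tag}>"
    m = re.search(pattern, html, re.DOTALL | re.IGNORECASE)
    return m.group(1).strip() if m else ""


print(extract_inner("title", SAMPLE_HTML))  # Memory Analyzer
print(extract_inner("body", SAMPLE_HTML))   # <p>Hello, crawler.</p>
```

Regular expressions remain fragile on real-world HTML (nested tags, comments, attributes containing ">"), so this is best kept to simple, predictable pages.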

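For some collection programs this is already enough. If what we want is the content of the page rather than its HTML, we need a tool more powerful than regular expressions. The next part will cover that with a worked example.

The complete program: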
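To give a rough idea of what "content rather than HTML" means, here is a small sketch that uses html.parser from the standard library to collect the visible text of a page. It is only an illustration and not necessarily the tool used in the next part; the names TextExtractor and html_to_text are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep non-empty text that is not inside script or style
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html):
    """Return the text nodes of html joined by newlines."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Unlike a regular expression, the parser tracks where elements open and close, so nested or multi-line markup does not need special handling.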

#!/usr/bin/env python3
# encoding: utf-8

import re
from urllib import request
from urllib import parse


def get(url):
    """Send a GET request and return the body decoded as UTF-8."""
    with request.urlopen(url) as response:
        return response.read().decode("utf8")


def post(url, **paras):
    """Send a POST request with form-encoded parameters."""
    param = parse.urlencode(paras).encode("utf8")
    # Passing data to Request turns the request into a POST
    req = request.Request(url, param)
    with request.urlopen(req) as response:
        return response.read().decode("utf8")


def get_target(pattern, content):
    """Return the first match of pattern in content, or "" if none."""
    m = re.search(pattern, content)
    target = ""
    if m:
        target = m.group(0)
    return target


def main():
    url = 'http://www.zhyea.com/2016/07/17/memory-analyzer-all.html'
    content = get(url)
    title = get_target(r"<title>.*</title>", content)
    body = get_target(r"<body[\w\W]*</body>", content)
    print(title)
    print(body)


if __name__ == "__main__":
    main()
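The post helper in the program is never called from main, so it may help to see what it actually builds before anything is sent. The sketch below only constructs the request and sends nothing; build_post_request, the example.com URL and the q and page parameters are hypothetical names chosen for illustration.

```python
from urllib import parse, request


def build_post_request(url, **params):
    """Build, but do not send, a form-encoded POST request."""
    body = parse.urlencode(params).encode("utf8")
    # Supplying data makes urllib use POST instead of GET
    return request.Request(url, data=body)


req = build_post_request("http://example.com/search", q="python crawler", page=1)
print(req.get_method())  # POST
print(req.data)          # b'q=python+crawler&page=1'
```

Passing the resulting object to request.urlopen would send it, which is what post does in the program above.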
