Note: the code below is based on httpclient 4.5.2.
Fetching a web page with a GET request using Java's HttpClient is fairly easy:
    public static String get(String url) {
        String result = "";
        // A new client is created on every call and closed again when the call ends.
        try (CloseableHttpClient httpclient = HttpClients.createDefault();
             CloseableHttpResponse response = httpclient.execute(new HttpGet(url));
             // The body is decoded with the platform default charset.
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(response.getEntity().getContent()))) {
            StringBuilder sb = new StringBuilder();
            String NL = System.getProperty("line.separator");
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append(NL);
            }
            result = sb.toString();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return result;
    }
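The method above also works when GET requests are issued from multiple threads. However, it does so by creating a new HttpClient instance on every call to get. Each instance is used once and then discarded, which means every call has to establish a new connection. That is clearly not an optimal implementation.
HttpClient provides its own support for multi-threaded requests; see the "Pooling connection manager" section of the official documentation. This support is built on an internal connection pool, and the key class is PoolingHttpClientConnectionManager, which manages that pool. It offers two key methods: setMaxTotal and setDefaultMaxPerRoute. setMaxTotal sets the maximum number of connections in the whole pool, and setDefaultMaxPerRoute sets the default maximum number of connections per route. There is also setMaxPerRoute, which sets the maximum number of connections for one specific site, like this: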
    // cm is the PoolingHttpClientConnectionManager instance
    HttpHost host = new HttpHost("localhost", 80);
    cm.setMaxPerRoute(new HttpRoute(host), 50);
Following the documentation, we can adjust our GET implementation slightly:
    package com.zhyea.robin;

    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;

    public class HttpUtil {

        // One client, backed by one connection pool, shared by every caller.
        private static final CloseableHttpClient httpClient;

        static {
            PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
            cm.setMaxTotal(200);          // upper bound for the whole pool
            cm.setDefaultMaxPerRoute(20); // default upper bound per route
            httpClient = HttpClients.custom().setConnectionManager(cm).build();
        }

        public static String get(String url) {
            String result = "";
            // Reading the body to the end lets the connection go back to the pool;
            // closing the response cleans up if anything fails midway.
            try (CloseableHttpResponse response = httpClient.execute(new HttpGet(url));
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(response.getEntity().getContent()))) {
                StringBuilder sb = new StringBuilder();
                String NL = System.getProperty("line.separator");
                String line;
                while ((line = in.readLine()) != null) {
                    sb.append(line).append(NL);
                }
                result = sb.toString();
            } catch (IOException e) {
                e.printStackTrace();
            }
            return result;
        }

        public static void main(String[] args) {
            System.out.println(get("https://www.baidu.com/"));
        }
    }
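The pooled client only pays off when get is actually called from several threads at once. Below is a minimal sketch of one way to do that with a fixed-size thread pool from the JDK. ConcurrentFetch and fetchAll are hypothetical names introduced for illustration, and the fetcher parameter is a stand-in for HttpUtil.get so that the sketch stays self-contained:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper, not part of HttpClient.
final class ConcurrentFetch {

    /**
     * Fetches every URL on a fixed-size thread pool and returns the bodies
     * in the same order as the input list.
     *
     * @param fetcher stand-in for a method such as HttpUtil.get (assumed wiring)
     */
    static List<String> fetchAll(List<String> urls, int threads,
                                 Function<String, String> fetcher) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per URL; the tasks run concurrently.
            List<Future<String>> futures = new ArrayList<>();
            for (String url : urls) {
                Callable<String> task = () -> fetcher.apply(url);
                futures.add(pool.submit(task));
            }
            // Collect results in submission order.
            List<String> bodies = new ArrayList<>();
            for (Future<String> future : futures) {
                bodies.add(future.get()); // blocks until that request has finished
            }
            return bodies;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while fetching", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a fetch task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

With HttpUtil on the classpath, a call might look like ConcurrentFetch.fetchAll(urls, 10, HttpUtil::get). By default, a thread that asks for a connection once the per-route limit is reached waits until one is free, so the thread count and setDefaultMaxPerRoute are worth choosing together.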
That is about it. Personally, though, I prefer HttpClient's fluent API. The GET request we just implemented can be written as simply as this:
    package com.zhyea.robin;

    import org.apache.http.client.fluent.Request;

    import java.io.IOException;

    public class HttpUtil {

        public static String get(String url) {
            String result = "";
            try {
                result = Request.Get(url)
                        .connectTimeout(1000)
                        .socketTimeout(1000)
                        .execute().returnContent().asString();
            } catch (IOException e) {
                e.printStackTrace();
            }
            return result;
        }

        public static void main(String[] args) {
            System.out.println(get("https://www.baidu.com/"));
        }
    }
All we need to do is replace the earlier httpclient dependency with the fluent-hc dependency:
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>fluent-hc</artifactId>
        <version>4.5.2</version>
    </dependency>
Moreover, the fluent implementation uses PoolingHttpClientConnectionManager out of the box. It sets maxTotal and defaultMaxPerRoute to 200 and 100 respectively:
    CONNMGR = new PoolingHttpClientConnectionManager(sfr);
    CONNMGR.setDefaultMaxPerRoute(100);
    CONNMGR.setMaxTotal(200);
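The only annoyance is that Executor offers no method for adjusting these two values. Still, the defaults are more than enough for most purposes, and if they really are not, you can consider writing your own Executor and then running the GET request through Executor directly: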
    Executor.newInstance().execute(Request.Get(url))
            .returnContent().asString();
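One alternative that may avoid rewriting Executor: fluent-hc 4.x also has an Executor.newInstance(HttpClient) overload, so a client built on your own connection pool can be handed to the fluent API. The following is only a sketch under that assumption. PooledFluent and its method names are hypothetical, and the pool sizes passed in are whatever suits your workload:

```java
import org.apache.http.client.fluent.Executor;
import org.apache.http.client.fluent.Request;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

import java.io.IOException;

// Hypothetical helper: wires a custom-sized pool into the fluent Executor.
final class PooledFluent {

    /** Builds a connection manager with caller-chosen limits. */
    static PoolingHttpClientConnectionManager newPool(int maxTotal, int maxPerRoute) {
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        cm.setMaxTotal(maxTotal);
        cm.setDefaultMaxPerRoute(maxPerRoute);
        return cm;
    }

    /** Wraps a client that uses the given pool in a fluent Executor. */
    static Executor newExecutor(PoolingHttpClientConnectionManager cm) {
        CloseableHttpClient client = HttpClients.custom()
                .setConnectionManager(cm)
                .build();
        return Executor.newInstance(client);
    }

    /** Runs a GET through the given Executor and returns the body as a string. */
    static String get(Executor executor, String url) throws IOException {
        return executor.execute(Request.Get(url)).returnContent().asString();
    }
}
```

The Executor would typically be created once and shared, for example PooledFluent.newExecutor(PooledFluent.newPool(400, 50)), and then reused for every PooledFluent.get call so that all requests draw on the same pool.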
That's it!
####
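The default HttpRequestRetryHandler is used here. By default it does not retry on exceptions such as connection timeouts or timeouts while leasing a connection from the pool, so configuring a retry handler would make this more robust.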
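A rough sketch of what such a retry decision could look like follows. It contains only the decision logic, written against JDK exception types so that it stands alone. RetryPolicy and shouldRetry are hypothetical names, and the choice to retry timeouts and refused connections is an assumption that suits idempotent GET requests rather than a general recommendation. It relies on HttpClient's ConnectTimeoutException and ConnectionPoolTimeoutException being subclasses of java.io.InterruptedIOException, as java.net.SocketTimeoutException is:

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.net.ConnectException;
import java.net.UnknownHostException;
import javax.net.ssl.SSLException;

// Hypothetical retry policy; only the decision logic, no HttpClient types.
final class RetryPolicy {

    private final int maxRetries;

    RetryPolicy(int maxRetries) {
        this.maxRetries = maxRetries;
    }

    /**
     * @param executionCount number of attempts that have failed so far
     *                       (1 after the first failure)
     */
    boolean shouldRetry(IOException e, int executionCount) {
        if (executionCount > maxRetries) {
            return false; // retry budget used up
        }
        if (e instanceof UnknownHostException || e instanceof SSLException) {
            return false; // retrying is unlikely to help
        }
        if (e instanceof InterruptedIOException) {
            return true;  // connect, socket, and pool-lease timeouts
        }
        // Connection refused: assumed to be worth another attempt for GET.
        return e instanceof ConnectException;
    }
}
```

Because HttpRequestRetryHandler has the single method retryRequest(IOException, int, HttpContext), the policy could be plugged in when building the client, along the lines of HttpClients.custom().setRetryHandler((exception, executionCount, context) -> policy.shouldRetry(exception, executionCount)).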