webmagic-0.4.0

@code4craft code4craft released this 06 Nov 23:54
· 948 commits to develop since this release

Improve performance of Downloader.

  • Update HttpClient to 4.3.1 and rewrite the code of HttpClientDownloader #32.
  • Use gzip by default to reduce the transport cost #31.
  • Enable HTTP Keep-Alive and connection persistence, and fix the incorrect usage of PoolConnectionManager #30.

The performance of Downloader improved by 90% in my test. Test code: Kr36NewsModel.java.
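
To give a rough feel for why gzip reduces the transport cost, here is a minimal sketch that uses only java.util.zip from the JDK. It is not webmagic's downloader code. The class name GzipSketch, its helper methods, and the generated sample page are all hypothetical and exist only for illustration; real savings depend on how repetitive the downloaded pages are.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration class, not part of webmagic.
class GzipSketch {

    // Compress bytes with gzip, as a server does when it answers
    // a request carrying "Accept-Encoding: gzip".
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The gzip stream is closed here, so the trailer has been written.
        return buffer.toByteArray();
    }

    // Decompress gzip bytes, as the client does after the download.
    static byte[] gunzip(byte[] packed) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(packed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Build a repetitive HTML page, similar in shape to a news listing.
    static byte[] samplePage() {
        StringBuilder html = new StringBuilder("<html><body><ul>");
        for (int i = 0; i < 500; i++) {
            html.append("<li class=\"news-item\"><a href=\"/p/")
                .append(i)
                .append("\">Article ")
                .append(i)
                .append("</a></li>");
        }
        html.append("</ul></body></html>");
        return html.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = samplePage();
        byte[] packed = gzip(raw);
        System.out.println("raw bytes:     " + raw.length);
        System.out.println("gzipped bytes: " + packed.length);
        System.out.println("round trip ok: " + Arrays.equals(raw, gunzip(packed)));
    }
}
```

Running main prints the raw and compressed sizes of the sample page, which shows how much smaller a markup-heavy response can be on the wire.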

Add a synchronous API for small tasks #28.

        OOSpider ooSpider = OOSpider.create(Site.me().setSleepTime(100), BaiduBaike.class);
        BaiduBaike baike = ooSpider.<BaiduBaike>get("http://baike.baidu.com/search/word?word=httpclient&pic=1&sug=1&enc=utf8");
        System.out.println(baike);

More configuration options for Site

  • HTTP proxy support via Site.setHttpProxy #22.
  • More HTTP header customization via Site.addHeader #27.
  • Allow disabling gzip with Site.setUseGzip(false).
  • Move Site.addStartUrl to Spider.addUrl, because a start URL is more a property of the Spider than of the Site.

Code refactor in Spider

  • Refactor the multi-threaded part of Spider and fix some concurrency problems.
  • Import the Google Guava API for simpler code.
  • Allow adding a request with more information via Spider.addRequest() instead of addUrl #29.
  • Allow downloading only the start URLs, without spawning the URLs extracted from them, via Spider.setSpawnUrl(false).