The SeimiCrawler Framework

I was originally using jsoup, but crawling with it was neither fast nor particularly convenient. After looking into a few crawler frameworks, I decided to use SeimiCrawler to collect the data.

Development environment: IntelliJ IDEA + MyBatis + SeimiCrawler

I won't walk through the environment setup in detail; anyone who has done Java development will know it, so straight to the configuration files. Note: the names of all SeimiCrawler-related configuration files must start with seimi.

Global configuration: seimi.xml. A minimal version just registers a property placeholder so that every properties file on the classpath (including seimi.properties below) gets loaded:

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans
       http://www.springframework.org/schema/beans/spring-beans.xsd">
    <!-- assumption: the bean bodies were not preserved in the original post;
         this placeholder is the standard way to load the jdbc.* values used below -->
    <bean class="org.springframework.beans.factory.config.PropertyPlaceholderConfigurer">
        <property name="locations">
            <list>
                <value>classpath:**/*.properties</value>
            </list>
        </property>
    </bean>
</beans>

Global MyBatis configuration: mybatis-config.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE configuration
        PUBLIC "-//mybatis.org//DTD Config 3.0//EN"
        "http://mybatis.org/dtd/mybatis-3-config.dtd">
<configuration>
    <!-- settings/typeAliases go here; the body of this file was not preserved in the original -->
</configuration>

SeimiCrawler data-layer configuration, seimi-mybatis.xml:

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:context="http://www.springframework.org/schema/context"
       xsi:schemaLocation="http://www.springframework.org/schema/beans
       http://www.springframework.org/schema/beans/spring-beans.xsd
       http://www.springframework.org/schema/context
       http://www.springframework.org/schema/context/spring-context.xsd">

    <!-- assumption: the bean bodies were not preserved in the original post; a typical
         wiring is a DataSource fed by the jdbc.* properties, a SqlSessionFactoryBean
         pointing at mybatis-config.xml, and a mapper scanner over the DAO package -->
    <bean id="dataSource" class="org.springframework.jdbc.datasource.DriverManagerDataSource">
        <property name="driverClassName" value="${jdbc.driver}"/>
        <property name="url" value="${jdbc.url}"/>
        <property name="username" value="${jdbc.username}"/>
        <property name="password" value="${jdbc.password}"/>
    </bean>

    <bean id="sqlSessionFactory" class="org.mybatis.spring.SqlSessionFactoryBean">
        <property name="configLocation" value="classpath:mybatis-config.xml"/>
        <property name="dataSource" ref="dataSource"/>
    </bean>

    <!-- adjust basePackage to wherever your DAO interfaces live -->
    <bean class="org.mybatis.spring.mapper.MapperScannerConfigurer">
        <property name="basePackage" value="xxx.dao"/>
    </bean>
</beans>

Database connection configuration, seimi.properties:

jdbc.driver=com.mysql.jdbc.Driver
jdbc.url=jdbc:mysql://localhost:3360/xiaohuo?useUnicode=true&characterEncoding=utf8&useSSL=false
jdbc.username=root
jdbc.password=123456
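
If you want to check these values before wiring up MyBatis, a throwaway plain-JDBC probe is enough. JdbcSmokeTest is a hypothetical helper, not part of the project; it reuses the URL and credentials above verbatim:

import java.sql.Connection;
import java.sql.DriverManager;

// hypothetical one-off check that the jdbc.* settings above actually connect
public class JdbcSmokeTest {
    public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver"); // same driver class as jdbc.driver
        try (Connection c = DriverManager.getConnection(
                "jdbc:mysql://localhost:3360/xiaohuo?useUnicode=true&characterEncoding=utf8&useSSL=false",
                "root", "123456")) {
            System.out.println("connected: " + !c.isClosed());
        }
    }
}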

Logging configuration, log4j.properties:

log4j.rootLogger=info, console, log, error

### Console ###
log4j.appender.console = org.apache.log4j.ConsoleAppender
log4j.appender.console.Target = System.out
log4j.appender.console.layout = org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern = %d %p[%C:%L]- %m%n

### Log ###
log4j.appender.log = org.apache.log4j.DailyRollingFileAppender
log4j.appender.log.File = ${catalina.base}/logs/debug.log
log4j.appender.log.Append = true
log4j.appender.log.Threshold = DEBUG
log4j.appender.log.DatePattern='.'yyyy-MM-dd
log4j.appender.log.layout = org.apache.log4j.PatternLayout
log4j.appender.log.layout.ConversionPattern = %d %p[%c:%L] - %m%n

### Error ###
log4j.appender.error = org.apache.log4j.DailyRollingFileAppender
log4j.appender.error.File = ${catalina.base}/logs/error.log
log4j.appender.error.Append = true
log4j.appender.error.Threshold = ERROR
log4j.appender.error.DatePattern='.'yyyy-MM-dd
log4j.appender.error.layout = org.apache.log4j.PatternLayout
log4j.appender.error.layout.ConversionPattern = %d %p[%c:%L] - %m%n

### Print SQL (the com.ibatis entries are legacy iBatis logger names; the java.sql.* loggers cover MyBatis) ###
log4j.logger.com.ibatis=DEBUG
log4j.logger.com.ibatis.common.jdbc.SimpleDataSource=DEBUG
log4j.logger.com.ibatis.common.jdbc.ScriptRunner=DEBUG
log4j.logger.com.ibatis.sqlmap.engine.impl.SqlMapClientDelegate=DEBUG
log4j.logger.java.sql.Connection=DEBUG
log4j.logger.java.sql.Statement=DEBUG
log4j.logger.java.sql.PreparedStatement=DEBUG

That's it for the basic configuration; what's left is implementing the crawling logic itself.

SeimiCrawler is built on Spring and pairs it with XPath, which makes HTML very convenient to pick apart. Every concrete crawler class has to live in a package named xxx.crawlers: SeimiCrawler scans that package automatically, and a crawler placed anywhere else simply won't be found, so it won't start. Each crawler extends BaseSeimiCrawler and overrides startUrls() and start(Response response), plus whatever callback methods it registers for follow-up requests.
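
Stripped of everything else, the contract looks roughly like this (a minimal sketch; the package and class names are placeholders, and the import paths follow SeimiCrawler's cn.wanghaomiao layout):

package xxx.crawlers; // must sit in a *.crawlers package or Seimi's scan will miss it

import cn.wanghaomiao.seimi.annotation.Crawler;
import cn.wanghaomiao.seimi.def.BaseSeimiCrawler;
import cn.wanghaomiao.seimi.struct.Response;

@Crawler(name = "minimal")
public class MinimalCrawler extends BaseSeimiCrawler {

    @Override
    public String[] startUrls() {
        // seed URLs; start(...) receives one Response per seed
        return new String[]{"https://example.com/"};
    }

    @Override
    public void start(Response response) {
        // resolve data with XPath straight off the response document
        logger.info("title = {}", response.document().sel("//title/text()"));
    }
}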

Below I use crawling proxy IPs as a worked example and put a thin second layer over the framework.

The crawler base class, BaseCrawler:

import cn.wanghaomiao.seimi.def.BaseSeimiCrawler;
import cn.wanghaomiao.seimi.struct.Request;
import cn.wanghaomiao.seimi.struct.Response;
import org.seimicrawler.xpath.JXDocument; // JsoupXpath 2.x; on SeimiCrawler 1.x this is cn.wanghaomiao.xpath.model.JXDocument

import java.util.Map;

public abstract class BaseCrawler extends BaseSeimiCrawler {

    /** URL prefix of the paginated listing to collect. */
    protected abstract String getUrlPrefix();

    /** URL suffix of the paginated listing to collect. */
    protected abstract String getUrlSuffix();

    /** Read the total page count out of the first response. */
    protected abstract int getMaxPage(JXDocument document);

    /** Parse one listing page; registered as the request callback. */
    public abstract void operation(Response response);

    /** Extra request headers; subclasses may override, null means none. */
    protected Map<String, String> setHeader() {
        return null;
    }

    @Override
    public void start(Response response) {
        try {
            JXDocument document = response.document();
            int max = getMaxPage(document);
            for (int i = 1; i <= max; i++) {
                logger.info("queuing page {}", i);
                // callback by method name: "operation" is resolved on the crawler instance
                push(Request.build(getUrlPrefix() + i + getUrlSuffix(), "operation")
                        .setHeader(setHeader()));
            }
        } catch (Exception e) {
            logger.error(e.getMessage(), e);
        }
    }
}
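
This is a small template-method setup: BaseCrawler owns the pagination loop and the request fan-out, while each concrete crawler only has to say where the pages live (prefix and suffix), how many there are, and how to parse one page. Supporting a second source site then means one new subclass rather than another copy of the loop.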

The concrete implementation, SeCrawler:

import cn.wanghaomiao.seimi.annotation.Crawler;
import cn.wanghaomiao.seimi.struct.Response;
import org.seimicrawler.xpath.JXDocument;
import org.springframework.beans.factory.annotation.Autowired;

import java.util.List;

@Crawler(name = "seCrawler")
public class SeCrawler extends BaseCrawler {

    @Autowired
    private ProxyIpStoreDao dao;

    @Override
    public String[] startUrls() {
        return new String[]{"https://ip.seofangfa.com/"};
    }

    @Override
    protected String getUrlPrefix() {
        return "https://ip.seofangfa.com/proxy/";
    }

    @Override
    protected String getUrlSuffix() {
        return ".html";
    }

    @Override
    protected int getMaxPage(JXDocument document) {
        try {
            // the last pager link holds the total page count
            List<Object> pages = document.sel("//div[@class='page_nav']/ul/li/a/text()");
            return Integer.parseInt((String) pages.get(pages.size() - 1));
        } catch (Exception e) {
            logger.error(e.getMessage(), e);
        }
        return 0;
    }

    @Override
    public void operation(Response response) {
        try {
            JXDocument document = response.document();
            // one XPath per table column
            List<Object> ips = document.sel("//table[@class='table']/tbody/tr/td[1]/text()");
            List<Object> ports = document.sel("//table[@class='table']/tbody/tr/td[2]/text()");
            List<Object> speeds = document.sel("//table[@class='table']/tbody/tr/td[3]/text()");
            List<Object> addrs = document.sel("//table[@class='table']/tbody/tr/td[4]/text()");
            List<Object> times = document.sel("//table[@class='table']/tbody/tr/td[5]/text()");
            for (int i = 0; i < ips.size(); i++) {
                ProxyIp proxyIp = new ProxyIp(); // fresh entity per row
                proxyIp.setIp((String) ips.get(i));
                proxyIp.setPort((String) ports.get(i));
                proxyIp.setSpeed((String) speeds.get(i));
                proxyIp.setAddr((String) addrs.get(i));
                proxyIp.setTime((String) times.get(i));
                dao.insert(proxyIp);
                logger.info("inserted proxy IP: {}", proxyIp);
            }
        } catch (Exception e) {
            logger.error(e.getMessage(), e); // don't swallow parse/DB failures silently
        }
    }
}
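
The ProxyIp entity and ProxyIpStoreDao are not shown in the original; a sketch that matches the setters and the insert call above could look like the following. The annotation-style mapper and the proxy_ip table and column names are my assumptions; with the XML-based mybatis-config.xml you would declare an equivalent <insert> statement in a mapper file instead:

// ProxyIp.java: plain holder for the five scraped columns
public class ProxyIp {
    private String ip;
    private String port;
    private String speed;
    private String addr;
    private String time;

    public String getIp() { return ip; }
    public void setIp(String ip) { this.ip = ip; }
    public String getPort() { return port; }
    public void setPort(String port) { this.port = port; }
    public String getSpeed() { return speed; }
    public void setSpeed(String speed) { this.speed = speed; }
    public String getAddr() { return addr; }
    public void setAddr(String addr) { this.addr = addr; }
    public String getTime() { return time; }
    public void setTime(String time) { this.time = time; }

    @Override
    public String toString() {
        return ip + ":" + port + " speed=" + speed + " addr=" + addr + " time=" + time;
    }
}

// ProxyIpStoreDao.java: hypothetical MyBatis mapper; table/column names are assumptions
public interface ProxyIpStoreDao {
    @org.apache.ibatis.annotations.Insert(
        "insert into proxy_ip (ip, port, speed, addr, time) " +
        "values (#{ip}, #{port}, #{speed}, #{addr}, #{time})")
    int insert(ProxyIp proxyIp);
}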

Starting the crawler:

public static void main(String... args) {
    Seimi seimi = new Seimi();
    seimi.goRun("seCrawler");
}
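
goRun boots Seimi's internal Spring context, scans the *.crawlers packages, and launches the crawler registered under the given name, so the string "seCrawler" here must match the @Crawler annotation on SeCrawler.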

And that's all it takes to build a crawler with SeimiCrawler.

Source: blog.csdn.net/z2464342708m/article/details/80689030
