
Parsing Web Pages and Files with Jsoup

Posted: 2011-04-04   Author: bibiQ   Source: reposted
Jsoup website: http://jsoup.org/
All usage is documented in the API docs: http://jsoup.org/apidocs/
For the structure of HTML, see the wiki: http://en.wikipedia.org/wiki/HTML_element
----------------------Connecting with Jsoup---------------------
Connecting to a URL:
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupTest {

	public static void main(String[] args) {
		Document doc = null;
		// Page to fetch.
		String url = "http://slashdot.org/";
		try {
			// Send browser-like request headers, then fetch and parse the page.
			doc = Jsoup
					.connect(url)
					.header("User-Agent",
							"Mozilla/5.0 (Windows; U; Windows NT 5.2) Gecko/2008070208 Firefox/3.0.1")
					.header("Accept", "text/html,application/xhtml+xml")
					.header("Accept-Language", "zh-cn,zh;q=0.5")
					.header("Accept-Charset", "GB2312,utf-8;q=0.7,*;q=0.7")
					.get();
			// Print the visible text of the <body> element.
			Element body = doc.body();
			System.out.println(body.text());

		} catch (IOException e) {
			e.printStackTrace();
		}

	}
}

For convenience, you can simply use:
doc = Jsoup.connect(url).get();
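
Between the fully spelled-out headers and the bare one-liner, jsoup's Connection also offers convenience setters such as userAgent and timeout. The following is a minimal sketch, not part of the original post: the helper name prepare and the 10-second timeout are illustrative assumptions. The helper only configures the request; calling get() on the result is what actually fetches the page.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class ConnectionSetup {

    // Hypothetical helper: configures a request but does not send it.
    static Connection prepare(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (Windows; U; Windows NT 5.2) Gecko/2008070208 Firefox/3.0.1")
                .header("Accept-Language", "zh-cn,zh;q=0.5")
                .timeout(10000); // milliseconds; assumed value
    }

    public static void main(String[] args) throws java.io.IOException {
        // get() sends the request and parses the response into a Document.
        System.out.println(prepare("http://slashdot.org/").get().title());
    }
}
```

Keeping the configuration in one place means every fetch in a crawler can share the same headers and timeout.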

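To load a local .htm file instead, replace the doc part as follows:
// Requires import java.io.File; here url holds the path of the local file.
// baseUrl is used to resolve relative links; an empty string leaves them unresolved.
String baseUrl = "";
File input = new File(url);
Document doc = Jsoup.parse(input, "UTF-8", baseUrl);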
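
The effect of the base URL argument is easiest to see on a relative link. The sketch below is an illustration rather than code from the original post: it parses an in-memory string instead of a file, and the class name, method name, and the example.com base URL are all assumptions. With an empty base URL, absUrl cannot resolve the relative src and returns an empty string; with a real base URL it returns the absolute address.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BaseUrlDemo {

    // Returns the absolute URL of the first <img src>, or "" if it cannot be resolved.
    static String resolveFirstImage(String html, String baseUrl) {
        Document doc = Jsoup.parse(html, baseUrl);
        Element img = doc.select("img[src]").first();
        return img == null ? "" : img.absUrl("src");
    }

    public static void main(String[] args) {
        String html = "<div><img src=\"./102708474_files/pixel.gif\"></div>";
        // Empty base URL: the relative path stays unresolved.
        System.out.println("[" + resolveFirstImage(html, "") + "]");
        // Hypothetical base URL: the path is resolved against it.
        System.out.println(resolveFirstImage(html, "http://example.com/jobs/"));
    }
}
```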


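----------------------Parsing with Jsoup---------------------
Parsing is done with Jsoup's selector syntax (the select method).
For example, after saving a web page as an .htm file (see the attachment), here is one fragment of that page: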
<div id="bodycol"><div id="jobheadertop">&nbsp;</div><div id="jobheader"><img border="0" src="./102708474_files/pixel.gif" alt="DiSalvo LLC" id="companyLogo" class="logo" onerror="removeLogo()"><p id="companyNameHeader" style="display: block; ">DiSalvo LLC recruiting</p>
          <div id="subicons"><img src="./102708474_files/pixel(1).gif" height="1" width="1" alt="" style="margin:0px"></div><div style="clear:both;height:1px">&nbsp;</div><div id="jobheaderbottom">&nbsp;</div></div><div id="jobwrappertop2">&nbsp;</div><div id="jobwrapper">
          <div id="jobsummary">
            <div id="jobsummary_content">
              <h2>Job Summary</h2>
              <dl>
                <dt>Company</dt>
                <dd><span class="wrappable">DiSalvo LLC recruiting</span></dd>
                <dt>Location</dt>
                <dd><span class="wrappable">Tigard, OR 97223</span></dd>
                <dt>Industries</dt>
                <dd><span class="wrappable">All</span></dd>
                <dt>Job Type</dt>
                <dd class="multipledd"><span class="wrappable">Full Time</span></dd><dd class="multipleddlast"><span class="wrappable"> Employee</span></dd>
                <dt>Years of Experience</dt>
                <dd><span class="wrappable">2+​ to 5 Years</span></dd>
                <dt>Career Level</dt>
                <dd><span class="wrappable">Experienced (Non-Manager)</span></dd>
                <dt>Salary</dt>
                <dd><span class="wrappable">$47,000.​00 - $49,000.​00  /​year<br>$7k per year expense acct, medical, dental, 401K, uncapped commissions</span></dd>
              </dl>
            </div>
          </div>
          <div id="jobcopy">
            <h1>Sales Representative</h1>
            <h2>About the Job</h2>
            <div id="jobBodyContent">


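To extract the <div id="jobsummary"> fragment, use:
// Requires import org.jsoup.select.Elements;
// element is an already-parsed Element, for example the Document itself.
private String selectorJobSum = "div#jobsummary";
Elements elements = element.select(selectorJobSum);
// No match: the page has no job summary block.
if(elements.isEmpty()){
	return null;
}
Element section = elements.first();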


To extract the value under the Salary label, i.e.:
<dt>Salary</dt>
<dd><span class="wrappable">$47,000.​00 - $49,000.​00  /​year<br>$7k per year expense acct, medical, dental, 401K, uncapped commissions</span></dd>

you can use:
// "+ dd" selects the <dd> that immediately follows the <dt> whose text contains "Salary".
private String selectorSalary = "dt:contains(Salary) + dd";
Elements salaries = section.select(selectorSalary);

This gives the result:

<dd>
<span class="wrappable">$47,000. 00 - $49,000. 00 / year<br />$7k per year expense acct, medical, dental, 401K, uncapped commissions</span>
</dd>
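
The two selectors above can be combined into one small lookup. This is a sketch under stated assumptions, not the original author's code: the class name SalaryDemo and the helper valueFor are made up for illustration, and the HTML is a shortened version of the fragment shown earlier. It returns the plain text of the <dd> rather than its markup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SalaryDemo {

    // Returns the text of the <dd> right after the <dt> containing label,
    // or null when the summary block or the label is missing.
    static String valueFor(String html, String label) {
        Document doc = Jsoup.parse(html);
        Element section = doc.select("div#jobsummary").first();
        if (section == null) {
            return null;
        }
        Element dd = section.select("dt:contains(" + label + ") + dd").first();
        return dd == null ? null : dd.text();
    }

    public static void main(String[] args) {
        // Shortened sample based on the fragment above.
        String html = "<div id=\"jobsummary\"><dl>"
                + "<dt>Location</dt><dd><span class=\"wrappable\">Tigard, OR 97223</span></dd>"
                + "<dt>Salary</dt><dd><span class=\"wrappable\">$47,000.00 - $49,000.00 /year"
                + "<br>$7k per year expense acct</span></dd>"
                + "</dl></div>";
        System.out.println(valueFor(html, "Salary"));
        System.out.println(valueFor(html, "Location"));
    }
}
```

Note that :contains matches any <dt> whose text includes the label, so a short label can match more than one entry; only the first match is used here.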

In the following example, to use the Job Information heading as the key for extracting the Company and other entries beneath it,
</div><br class="brclear"><div id="CJT_leftpanel">
                <h2>Job Information</h2>
                <div id="CJT_leftHolder">
                  <div>
                    <ul>
                      <li><strong>Company:</strong><br>Mount Sinai Medical Center</li>
                      <li><strong>Location:</strong><br>New York, NY</li>
                      <li><strong> Industries:</strong><br>Healthcare Services</li>
                      <li><strong>Job Status/Type:</strong><br>Full Time, Employee</li>
                      <li><strong> Occupation:</strong><br>Administrative Support<br />Secretary/Executive Assistant</li>
                      <li><strong>Category:</strong><br>Administrative/Clerical</li>
                      <li><strong>Years of Experience:</strong><br>2+ to 5 Years</li>
                      <li><strong>Education Level:</strong><br>High School or equivalent</li>
                      <li><strong>Career Level:</strong><br>Experienced (Non-Manager)</li>
                      <li><strong>Job Reference Code:</strong><br>11-1345374</li>
                      <div id="CJT_bottom"><a href="https://mountsinai.igreentree.com/CSS_External/CSSPage_Referred.asp?Req=11-1345374" mns_rt="Apply"><img src="http://media.newjobs.com/mm/xmsinaix/images/apply_left.jpg" alt="Apply Now"></a></div>
                    </ul>
                  </div>
                  <div><img src="http://media.newjobs.com/mm/xmsinaix/images/left_bottom.jpg"></div>
                  <h2>Contact Us</h2>
                  <ul>
                    <li><strong> Company Name:</strong><br>Mount Sinai Medical Center</li>
                  </ul>
                  <div><img src="http://media.newjobs.com/mm/xmsinaix/images/left_bottom.jpg"></div>
                </div>
              </div>

use:
selector = "h2:contains(Job Information) + div";
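
Once that div is selected, the individual entries still have to be pulled out of the list. One possible way to do it is sketched below; the class name JobInfoDemo, the helper jobInfo, and the choice of a label-to-value map are assumptions for illustration, not something the original post specifies. It takes the first <ul> inside the selected div, so the later Contact Us list is ignored, and reads each label from <strong> and each value from the <li>'s own text.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class JobInfoDemo {

    // Maps each label (without the trailing colon) to its value, in page order.
    static Map<String, String> jobInfo(String html) {
        Map<String, String> info = new LinkedHashMap<>();
        Element holder = Jsoup.parse(html)
                .select("h2:contains(Job Information) + div").first();
        if (holder == null) {
            return info;
        }
        // First <ul> in the holder is the Job Information list.
        Element list = holder.select("ul").first();
        if (list == null) {
            return info;
        }
        for (Element li : list.select("li")) {
            Element strong = li.select("strong").first();
            if (strong == null) {
                continue;
            }
            String label = strong.text().replaceAll(":$", "").trim();
            // ownText() skips the <strong> child and keeps the text after <br>.
            info.put(label, li.ownText().trim());
        }
        return info;
    }

    public static void main(String[] args) {
        // Shortened sample based on the fragment above.
        String html = "<div id=\"CJT_leftpanel\"><h2>Job Information</h2>"
                + "<div id=\"CJT_leftHolder\"><div><ul>"
                + "<li><strong>Company:</strong><br>Mount Sinai Medical Center</li>"
                + "<li><strong>Location:</strong><br>New York, NY</li>"
                + "</ul></div><h2>Contact Us</h2><ul>"
                + "<li><strong> Company Name:</strong><br>Mount Sinai Medical Center</li>"
                + "</ul></div></div>";
        System.out.println(jobInfo(html));
    }
}
```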


To be continued.
