Tuesday, March 6, 2012

Crawling Yahoo! Answers Data with Web-Harvest


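This post uses crawling Yahoo! Answers with Web-Harvest as an example to point out what to watch for when using the tool. The best usage documentation, of course, is the user manual on the official website.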

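Web-Harvest comes in three versions; the one used here is the source package. The key to a successful crawl is the config file. The source package includes a Java class, Test.java, whose source code is as follows: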

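import java.io.IOException;

import org.webharvest.definition.ScraperConfiguration;
import org.webharvest.runtime.Scraper;

public class Test {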

    public static void main(String[] args) throws IOException {

        ScraperConfiguration config = new ScraperConfiguration("e:/temp/yahooanswer/auto racing.xml"); //line a
        Scraper scraper = new Scraper(config, "e:/temp/wikianswer"); //line b

        scraper.setDebug(true);

        long startTime = System.currentTimeMillis();
        scraper.execute();
        System.out.println("time elapsed: " + (System.currentTimeMillis() - startTime));
    }

}

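The .xml file on line a is the scraping configuration, and line b gives the directory where the scraped data is stored (the scraper's working directory, against which relative paths in the configuration are resolved). The configuration fetches the first 5 pages, 20 entries per page, of the resolved questions in the Yahoo! Answers category sports/auto racing and writes them to a file in the XML format defined by the configuration.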

Now let's look at auto racing.xml in detail. The XML file is as follows:

<?xml version="1.0" encoding="utf-8"?>
<config charset="utf-8">
    <include path="functions.xml"/>
    <var-def name="home">http://answers.yahoo.com</var-def>
    <!-- Define the variable QALinks; its value is the return value of the function download-multipage-list. -->
    <var-def name="QALinks">
        <call name="download-multipage-list">
            <call-param name="pageUrl">http://answers.yahoo.com/dir/index;_ylt=AnRU11UwwAiICNV69Xv._0HzDH1G;_ylv=3?sid=396545601&amp;link=resolved#yan-questions</call-param>
            <call-param name="nextXPath">//li[@rel="next"]/@href</call-param>
            <call-param name="itemXPath">//ul[@class="questions"]//h3//a/@href</call-param>
            <call-param name="maxloops">5</call-param>
        </call>
    </var-def>

    <!-- Follow each collected link and download the corresponding question page -->
    <var-def name="questions">
        <loop item="item" index="i">
            <list><var name="QALinks"/></list>
            <body>
                <html-to-xml>
                    <http url="${sys.fullUrl(home, item)}"/>
                </html-to-xml>
                <script><![CDATA[print("item"+i+":"+item);]]></script>
            </body>
        </loop>
    </var-def>

    <!-- Iterate over all downloaded question pages and extract the desired data -->
    <file action="write" path="sports/auto racing.xml" charset="utf-8">
        <![CDATA[ <questionanswers> ]]>
        <loop item="item" index="i">
            <list><var name="questions"/></list>
            <body>
                <template>
                     <![CDATA[<number>]]><var name="i"/><![CDATA[</number>]]>
                </template>
                <xquery>
                    <xq-param name="item" type="node()"><var name="item"/></xq-param>
                    <xq-expression><![CDATA[
                        declare variable $item as node() external;
                        let $subject := data($item//h1[@class='subject'])
                        return
                            <questionanswer>
                                <subject>{normalize-space($subject)}</subject>
                                {
                                    for $x at $count in data($item//div[@class="content"])
                                    return
                                        if ($count eq 1)
                                        then <questioncontent>{$x}</questioncontent>
                                        else <answer>{$x}</answer>
                                }
                            </questionanswer>
                    ]]></xq-expression>
                </xquery>
            </body>
        </loop>
        <![CDATA[ </questionanswers> ]]>
    </file>
</config>
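
To make the extraction step easier to follow, here is a minimal Java sketch of what the XQuery expression above does: it takes the text of h1[@class='subject'] as the subject, then treats the first div[@class='content'] as the question body and every later one as an answer. This is only an illustration under stated assumptions. It uses the JDK's javax.xml.xpath instead of Web-Harvest's XQuery processor, and the class name QuestionAnswerSplit and the sample page markup are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration class; not part of Web-Harvest.
public class QuestionAnswerSplit {

    /** Returns the subject line, then the question body, then one line per answer. */
    static List<String> extract(String pageXml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();

            // Plays the role of normalize-space(data($item//h1[@class='subject'])).
            String subject = xpath.evaluate("normalize-space(//h1[@class='subject'])",
                    new InputSource(new StringReader(pageXml)));

            // Plays the role of data($item//div[@class="content"]).
            NodeList contents = (NodeList) xpath.evaluate("//div[@class='content']",
                    new InputSource(new StringReader(pageXml)), XPathConstants.NODESET);

            List<String> result = new ArrayList<>();
            result.add("subject: " + subject);
            for (int i = 0; i < contents.getLength(); i++) {
                String text = contents.item(i).getTextContent();
                // XQuery's "$count eq 1" is 1-based; the Java index is 0-based.
                result.add((i == 0 ? "questioncontent: " : "answer: ") + text);
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample input: a tiny, already well-formed page.
        String page = "<html><body>"
                + "<h1 class='subject'>  Who won   the race? </h1>"
                + "<div class='content'>I missed it.</div>"
                + "<div class='content'>Car 7.</div>"
                + "<div class='content'>Definitely car 7.</div>"
                + "</body></html>";
        extract(page).forEach(System.out::println);
    }
}
```

For the sample page, main prints the normalized subject followed by one questioncontent line and two answer lines, which mirrors the element order that the config writes into the output file.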

Source code of functions.xml:

<?xml version="1.0" encoding="UTF-8"?>
<config>
    <!--
        Download multi-page list of items.
      
        @param pageUrl - URL of starting page
        @param itemXPath - XPath expression to obtain single item in the list
        @param nextXPath - XPath expression to URL for the next page
        @param maxloops - maximum number of pages downloaded
      
        @return list of all downloaded items
     -->
    <function name="download-multipage-list">
        <return>
            <while condition="${pageUrl.toString().length() != 0}" maxloops="${maxloops}" index="i">
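                <empty> <!-- Everything inside <empty> is executed, but its result is not returned. -->
                    <!-- Define the variable content: the page downloaded from pageUrl, converted to XML. -->
                    <var-def name="content">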
                        <html-to-xml>
                            <http url="${pageUrl}"/>
                        </html-to-xml>
                    </var-def>
                    <script><![CDATA[ // Debug print inside <script>; the output appears on the Java console.
                        print("pageUrl:"+pageUrl);
                    ]]></script>

                    <!-- Define the variable nextLinkUrl: the value that nextXPath selects from content. -->
                    <var-def name="nextLinkUrl">
                        <xpath expression="${nextXPath}">
                            <var name="content"/>
                        </xpath>
                    </var-def>

                    <!-- Redefine pageUrl as the next-page link, resolved against the current pageUrl. -->
                    <var-def name="pageUrl">
                        <template>${sys.fullUrl(pageUrl.toString(), nextLinkUrl.toString())}</template>
                    </var-def>

                </empty>
  
                <!-- The value to return: the items that itemXPath selects from content. -->
                <xpath expression="${itemXPath}">
                    <var name="content"/>
                </xpath>
            </while>
        </return>
    </function>
</config>

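functions.xml defines one function with four input parameters and one return value. pageUrl is the URL at which crawling starts. nextXPath is the XPath expression that extracts the next page's URL from the current page, that is, the href of the page's "next" link. The function contains a while loop, and maxloops caps the number of iterations while the loop condition still holds. itemXPath is the XPath expression that, on each iteration, selects the items to return from the downloaded content; in this example it selects the href of each question listed on the page. The function finally returns the list of everything selected by itemXPath across all visited pages.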
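
The control flow of this function can be sketched in plain Java. The sketch below is an approximation under stated assumptions, not Web-Harvest's implementation: the in-memory PAGES map is a hypothetical stand-in for the http and html-to-xml steps, the example.com URLs and the simplified XPath expressions are made up, java.net.URI.resolve is used as a rough analogue of sys.fullUrl, and the loop is assumed to stop as soon as no next link is found.

```java
import java.io.StringReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration class; not part of Web-Harvest.
public class PaginationSketch {

    /** Assumed sample data: URL -> already cleaned-up page content. */
    static final Map<String, String> PAGES = Map.of(
        "http://example.com/list?page=1",
        "<div><h3><a href='/q/1'>Q1</a></h3><h3><a href='/q/2'>Q2</a></h3>"
            + "<a rel='next' href='/list?page=2'>next</a></div>",
        "http://example.com/list?page=2",
        "<div><h3><a href='/q/3'>Q3</a></h3>"
            + "<a rel='next' href='/list?page=3'>next</a></div>",
        "http://example.com/list?page=3",
        "<div><h3><a href='/q/4'>Q4</a></h3></div>");

    /** Evaluates an XPath expression and returns the string value of each selected node. */
    static List<String> select(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(expression,
                    new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getNodeValue());
            }
            return values;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    /** Collects items page by page until there is no next link or maxLoops is reached. */
    static List<String> crawl(String pageUrl, String itemXPath, String nextXPath, int maxLoops) {
        List<String> items = new ArrayList<>();
        for (int i = 0; i < maxLoops && !pageUrl.isEmpty(); i++) {
            String content = PAGES.get(pageUrl);          // stands in for the download step
            if (content == null) {
                break;                                    // unknown page: stop early
            }
            items.addAll(select(content, itemXPath));     // the values the function returns
            List<String> next = select(content, nextXPath);
            // Resolve the (possibly relative) next link against the current page URL.
            pageUrl = next.isEmpty() ? "" : URI.create(pageUrl).resolve(next.get(0)).toString();
        }
        return items;
    }

    public static void main(String[] args) {
        System.out.println(crawl("http://example.com/list?page=1",
                "//h3/a/@href", "//a[@rel='next']/@href", 5));
    }
}
```

With maxLoops set to 5 the sketch visits all three sample pages and stops because the last page has no next link; with maxLoops set to 2 it stops after the second page, which is the role maxloops plays in the config above.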
