Four Weapons for Extracting Word and PDF Content with Java

chris
Graduate of the School of Information, Renmin University of China
　　
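When working with documents in Java, many people run into the same problem: how do you get at the content of Word, Excel, PDF, and similar files? I looked into it, and this article summarizes several ways to extract text from Word and PDF files.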
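Using jacob
jacob is a bridge: a piece of middleware that connects Java to COM or Win32 functions. jacob cannot extract Word, Excel, or similar files by itself; it needs a DLL for that. You don't have to write the DLL yourself, though, because the author of jacob already provides one.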
　　
Download the jacob jar and DLL files: ?id=
　　
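After downloading jacob and putting the files in place (the DLL on the PATH, the jar on the classpath), you can write your own extraction program. Here is a simple example: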
　　
　　
    import java.io.File;
    import com.jacob.com.*;
    import com.jacob.activeX.*;
    /**
     * Title: pdf extraction
     * Description: email:
     * Copyright: Matrix Copyright (c)
     * Company:
     * @author chris
     * @version who use this example pls remain the declare
     */
    public class FileExtracter {
        public static void main(String[] args) {
            // Start Word through COM.
            ActiveXComponent component = new ActiveXComponent("Word.Application");
            String inFile = "c:\\test.doc";
            // Assumed file name: the source shows only c:\\ for this path.
            String tpFile = "c:\\temp.htm";
            String otFile = "c:\\temp.xml";
            boolean flag = false;
            try {
                // Keep the Word window hidden.
                component.setProperty("Visible", new Variant(false));
                Dispatch wordacc = component.getProperty("Documents").toDispatch();
                // Open(FileName, ConfirmConversions = false, ReadOnly = true)
                Dispatch wordfile = Dispatch.invoke(wordacc, "Open", Dispatch.Method,
                        new Object[]{inFile, new Variant(false), new Variant(true)},
                        new int[1]).toDispatch();
                // Assumed format: the SaveAs constant is missing in the source; 8 is wdFormatHTML.
                Dispatch.invoke(wordfile, "SaveAs", Dispatch.Method,
                        new Object[]{tpFile, new Variant(8)}, new int[1]);
                // Close the document without saving changes.
                Variant f = new Variant(false);
                Dispatch.call(wordfile, "Close", f);
                flag = true;
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                // Always shut Word down, even if the conversion failed.
                component.invoke("Quit", new Variant[] {});
            }
        }
    }
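A common stumbling block with the setup step described above is that the DLL is not actually on the PATH, in which case jacob usually fails at startup with an `UnsatisfiedLinkError`. The following is a minimal sketch, not part of jacob, for checking that before running the extractor. The class name `PathLookup`, the method `findOnPath`, and the default file name `jacob.dll` are assumptions made for illustration; the real DLL name depends on the jacob build you downloaded.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class PathLookup {
    /** Returns each file named fileName found in the directories listed in pathValue. */
    static List<File> findOnPath(String fileName, String pathValue) {
        List<File> hits = new ArrayList<>();
        if (pathValue == null || pathValue.isEmpty()) {
            return hits;
        }
        // File.pathSeparator is ";" on Windows and ":" on Unix-like systems.
        for (String dir : pathValue.split(Pattern.quote(File.pathSeparator))) {
            if (dir.isEmpty()) {
                continue;
            }
            File candidate = new File(dir, fileName);
            if (candidate.isFile()) {
                hits.add(candidate);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // "jacob.dll" is an assumed default; pass the real file name as the first argument.
        String dllName = args.length > 0 ? args[0] : "jacob.dll";
        List<File> hits = findOnPath(dllName, System.getenv("PATH"));
        if (hits.isEmpty()) {
            System.out.println(dllName + " was not found in any PATH directory");
        } else {
            for (File hit : hits) {
                System.out.println("found " + hit.getAbsolutePath());
            }
        }
    }
}
```

If the lookup comes back empty, copy the DLL into one of the listed PATH directories (or add its directory to the PATH) and try again.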
　　
　　
　　
　　
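Using Apache POI to extract Word and Excel
POI is an Apache project, but even with POI you may find the work tedious. Not to worry: a simpler interface wrapped around it is available.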
　　
Download the wrapped POI package: ?id=
　　
After downloading it, put it on your classpath. Here is an example of how to use it:
　　
　　
    import java.io.*;
    import org.textmining.text.extraction.WordExtractor;
    /**
     * Title: word extraction
     * Description: email:
     * Company:
     * @author chris
     * @version who use this example pls remain the declare
     */
    public class PdfExtractor {
        public PdfExtractor() {
        }

        public static void main(String args[]) throws Exception {
            FileInputStream in = new FileInputStream("c:\\a.doc");
            try {
                // The wrapper reads the whole Word document and returns its text.
                WordExtractor extractor = new WordExtractor();
                String str = extractor.extractText(in);
                System.out.println("the result length is " + str.length());
                System.out.println("the result is " + str);
            } finally {
                in.close();
            }
        }
    }
　　
　　
　　
　　
Using pdfbox to extract PDF files
pdfbox's support for Chinese is still poor, however. First download pdfbox: ?id=
　　
Here is an example of how to extract a PDF file with pdfbox:
　　
　　
    import org.pdfbox.pdmodel.PDDocument;
    import org.pdfbox.pdfparser.PDFParser;
    import java.io.*;
    import org.pdfbox.util.PDFTextStripper;
    /**
     * Title: pdf extraction
     * Description: email:
     * Company:
     * @author chris
     * @version who use this example pls remain the declare
     */
    public class PdfExtracter {

        public PdfExtracter() {
        }

        public String GetTextFromPdf(String filename) throws Exception {
            PDDocument pdfdocument = null;
            FileInputStream is = new FileInputStream(filename);
            try {
                // Parse the file into a document model.
                PDFParser parser = new PDFParser(is);
                parser.parse();
                pdfdocument = parser.getPDDocument();
                // Strip the text into an in-memory buffer.
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                OutputStreamWriter writer = new OutputStreamWriter(out);
                PDFTextStripper stripper = new PDFTextStripper();
                stripper.writeText(pdfdocument.getDocument(), writer);
                writer.close();
                byte[] contents = out.toByteArray();

                String ts = new String(contents);
                System.out.println("the string length is " + contents.length + "\n");
                return ts;
            } finally {
                // Release the document and the input stream in every case.
                if (pdfdocument != null) {
                    pdfdocument.close();
                }
                is.close();
            }
        }

        public static void main(String args[]) {
            PdfExtracter pf = new PdfExtracter();
            try {
                String ts = pf.GetTextFromPdf("c:\\a.pdf");
                System.out.println(ts);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
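One detail of the example above is worth a closer look: the extracted text is written into a byte array and then turned back into a string with `new String(contents)`, so it depends on the platform's default encoding at both ends. The following is a minimal sketch of the alternative, collecting the characters directly in a `StringWriter`. The class `WriterCapture` and the interface `TextSource` are hypothetical stand-ins for any extractor that writes to a `java.io.Writer`; they are not part of pdfbox.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.StringWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class WriterCapture {
    /** Hypothetical stand-in for an extractor that writes its text to a Writer. */
    interface TextSource {
        void writeTo(Writer out) throws IOException;
    }

    /** Collects the text as characters; no byte encoding is involved. */
    static String viaStringWriter(TextSource source) throws IOException {
        StringWriter out = new StringWriter();
        source.writeTo(out);
        return out.toString();
    }

    /** Encodes the text to bytes and decodes it again, both with the given charset. */
    static String viaBytes(TextSource source, Charset charset) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(bytes, charset)) {
            source.writeTo(out);
        }
        return new String(bytes.toByteArray(), charset);
    }

    public static void main(String[] args) throws IOException {
        // "PDF " followed by two Chinese characters, written as Unicode escapes.
        String sample = "PDF \u4e2d\u6587";
        TextSource source = out -> out.write(sample);
        System.out.println(viaStringWriter(source).equals(sample));                    // true
        System.out.println(viaBytes(source, StandardCharsets.UTF_8).equals(sample));    // true
        System.out.println(viaBytes(source, StandardCharsets.US_ASCII).equals(sample)); // false
    }
}
```

The last line shows what happens when the charset in the middle cannot represent Chinese characters: they are lost in the round trip. This does not address whatever limits pdfbox itself has with Chinese fonts; it only removes one avoidable source of garbling.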
　　
　　
　　
　　
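Extracting PDF files with Chinese support: xpdf
xpdf is an open-source project. We can call its native program from Java to extract text from Chinese PDF files.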
　　
Download the xpdf package: ?id=
　　
You also need to download the Chinese support patch: ?id=
　　
Once the Chinese patch is in place as described in its readme, you can start writing the Java program that calls the native program.
　　
Here is an example of how to call it:
　　
　　
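    import java.io.*;
    /**
     * Title: pdf extraction
     * Description: email:
     * Company:
     * @author chris
     * @version who use this example pls remain the declare
     */
    public class PdfWin {
        public PdfWin() {
        }

        public static void main(String args[]) throws Exception {
            String PATH_TO_XPDF = "C:\\Program Files\\xpdf\\pdftotext.exe";
            String filename = "c:\\a.pdf";
            // -enc sets the output encoding, -q suppresses messages; the final "-" sends
            // the text to stdout (it is not visible in the garbled source but is needed
            // for the output to arrive on the input stream read below).
            String[] cmd = new String[] { PATH_TO_XPDF, "-enc", "UTF-8", "-q", filename, "-" };
            Process p = Runtime.getRuntime().exec(cmd);
            BufferedInputStream bis = new BufferedInputStream(p.getInputStream());
            InputStreamReader reader = new InputStreamReader(bis, "UTF-8");
            StringWriter out = new StringWriter();
            // Buffer size chosen here; the original value is missing in the source.
            char[] buf = new char[8192];
            int len;
            // Accumulate every chunk; read() returns -1 at end of stream.
            while ((len = reader.read(buf)) != -1) {
                out.write(buf, 0, len);
                System.out.println("the length is " + len);
            }
            reader.close();
            p.waitFor();
            // Build the result from everything read, not just the last buffer.
            String ts = out.toString();
            System.out.println("the str is " + ts);
        }
    }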
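The loop above reads the child process's output in chunks. A minimal sketch of that step as a reusable helper follows. The names `ProcessOutput`, `readAll`, and `run` are assumptions made for illustration. The sketch also merges stderr into stdout with `ProcessBuilder.redirectErrorStream(true)` so that a child writing many error messages cannot stall on a full pipe; that is a design choice of this sketch, not something the original program does.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

final class ProcessOutput {
    /** Reads the stream to its end and decodes it with the given charset. */
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[4096];
            int len;
            // read() returns -1 at end of stream; append only the chars actually read.
            while ((len = reader.read(buf)) != -1) {
                text.append(buf, 0, len);
            }
        }
        return text.toString();
    }

    /** Runs the command, captures its combined output, and fails on a non-zero exit code. */
    static String run(List<String> command, Charset charset)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(command).redirectErrorStream(true).start();
        String output = readAll(p.getInputStream(), charset);
        int exit = p.waitFor();
        if (exit != 0) {
            throw new IOException("command exited with code " + exit);
        }
        return output;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java ProcessOutput <program> [arguments...]");
            return;
        }
        // The full command line, e.g. the pdftotext path followed by its arguments.
        System.out.println(run(Arrays.asList(args), StandardCharsets.UTF_8));
    }
}
```

Decoding through a single `InputStreamReader` matters for Chinese text: a multi-byte UTF-8 character that happens to straddle two reads is still decoded correctly, which would not be the case if each chunk of bytes were converted to a string separately.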
　　
　　
　　
　　
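About the author
chris graduated from the School of Information, Renmin University of China, and now develops financial analysis software in Hong Kong. The author is also active in the JXTA P2P open-source development community and is keen on network security, AI, search engine technology, and Java-based game engine technology.
If anyone has a better approach, please let the author know.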

From:http://tw.wingwit.com/Article/program/Java/JSP/201311/19681.html