java - Convert doc to pdf using Apache POI

- August 15, 2010

I am trying to convert a doc to PDF using Apache POI, but the resulting PDF document contains only text. It has none of the formatting, such as images, tables, alignment, etc.

How can I convert a doc to PDF and keep all the formatting, such as tables, images, and alignment?

Here is my code:

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;

import com.lowagie.text.Document;
import com.lowagie.text.DocumentException;
import com.lowagie.text.Paragraph;
import com.lowagie.text.pdf.PdfWriter;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.hwpf.usermodel.Range;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

public class demo {
    public static void main(String[] args) {

        POIFSFileSystem fs = null;
        Document document = new Document();

        try {
            System.out.println("starting test");
            fs = new POIFSFileSystem(new FileInputStream("resume.doc"));

            HWPFDocument doc = new HWPFDocument(fs);
            WordExtractor we = new WordExtractor(doc);

            OutputStream file = new FileOutputStream(new File("test.pdf"));
            PdfWriter writer = PdfWriter.getInstance(document, file);

            Range range = doc.getRange();
            document.open();
            writer.setPageEmpty(true);
            document.newPage();
            writer.setPageEmpty(true);

            // only the plain text of each paragraph is extracted here
            String[] paragraphs = we.getParagraphText();
            for (int i = 0; i < paragraphs.length; i++) {
                org.apache.poi.hwpf.usermodel.Paragraph pr = range.getParagraph(i);
                // strip the trailing line break (\cM is Ctrl-M, a carriage return)
                paragraphs[i] = paragraphs[i].replaceAll("\\cM?\r?\n", "");
                System.out.println("length:" + paragraphs[i].length());
                System.out.println("paragraph" + i + ": " + paragraphs[i].toString());
                // add the paragraph to the document
                document.add(new Paragraph(paragraphs[i]));
            }

            System.out.println("document testing completed");
        } catch (Exception e) {
            System.out.println("exception during test");
            e.printStackTrace();
        } finally {
            // close the document
            document.close();
        }
    }
}

The task at hand is converting a doc to PDF while keeping all the formatting, such as tables, images, and alignment.

Creating your own converter class

There are a number of WordToXXXConverter classes in Apache POI, namely WordToFoConverter, WordToHtmlConverter, and WordToTextConverter. The last one is too lossy to serve as a good example for your requirements, but the former two are adequate.

All these converter classes are derived from the common base class AbstractWordConverter, which provides the basic framework for Word conversion classes. Furthermore, all these classes make use of a matching *DocumentFacade class which wraps the creation of the concrete target (or intermediate) format: FoDocumentFacade, HtmlDocumentFacade, or TextDocumentFacade.

To implement your task of converting a doc to PDF with all the formatting (tables, images, alignment), you should therefore also derive your own converter class from AbstractWordConverter and, when implementing the abstract methods, let yourself be inspired by the three concrete implementation classes. Just like in the other converter classes, concentrating the PDF-library-specific code in a PdfDocumentFacade class seems like a good idea.
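The split between converter and facade might look roughly like the following sketch. Everything in it is hypothetical: PdfDocumentFacade, RecordingPdfFacade, and FacadeBasedConverter are illustrative names, not POI or iText classes. The stand-in facade only records calls where a real one would talk to the PDF library, and a real converter would extend AbstractWordConverter instead of being driven by hand.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical facade: the only type that would touch the PDF library. */
interface PdfDocumentFacade {
    void addParagraph(String text, boolean centered);
    void beginTable(int columns);
    void addCell(String text);
    void endTable();
}

/** Stand-in that records calls; a real facade would call the PDF library here. */
class RecordingPdfFacade implements PdfDocumentFacade {
    final List<String> calls = new ArrayList<>();

    @Override public void addParagraph(String text, boolean centered) {
        calls.add("paragraph(" + (centered ? "center" : "left") + "): " + text);
    }
    @Override public void beginTable(int columns) { calls.add("table(" + columns + ")"); }
    @Override public void addCell(String text) { calls.add("cell: " + text); }
    @Override public void endTable() { calls.add("end table"); }
}

/** Hypothetical converter: decides what to emit, never how it is drawn. */
class FacadeBasedConverter {
    private final PdfDocumentFacade facade;

    FacadeBasedConverter(PdfDocumentFacade facade) {
        this.facade = facade;
    }

    /** Forwards one paragraph together with its alignment. */
    void paragraph(String text, boolean centered) {
        facade.addParagraph(text, centered);
    }

    /** Forwards a table row by row; the column count is taken from the first row. */
    void table(String[][] rows) {
        if (rows.length == 0) {
            return;
        }
        facade.beginTable(rows[0].length);
        for (String[] row : rows) {
            for (String cell : row) {
                facade.addCell(cell);
            }
        }
        facade.endTable();
    }
}
```

With this shape, switching PDF libraries would only mean writing another PdfDocumentFacade implementation; the converter logic stays untouched.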

If you want to start simple and add more complex details later, you might start with the WordToTextConverter implementation code and, as soon as that works at least on a proof-of-concept level, extend the functionality to cover more and more of the formatting information.

Unfortunately the converter framework is quite DOM element centric: the AbstractWordConverter callbacks expect and forward DOM elements as indicators of the current target document context. At first glance, though, it does not seem to make use of that context actually being a DOM element, so you might get away with copying that base class and replacing all those DOM element parameters with a more apropos type or, better, a generic class parameter.
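To illustrate what replacing the DOM element parameter with a generic class parameter could mean, here is a minimal sketch. It is not a copy of AbstractWordConverter: GenericConverter, PlainTextConverter, and their methods are made-up names, and the driver loop takes plain lists where the real base class would walk the Word document.

```java
import java.util.Arrays;
import java.util.List;

/**
 * Hypothetical base class: the target context is a type parameter C
 * instead of an org.w3c.dom.Element.
 */
abstract class GenericConverter<C> {
    /** Opens a nested context (here: a table cell) inside the parent context. */
    protected abstract C openCell(C parent);

    /** Emits one paragraph into the given context. */
    protected abstract void processParagraph(C context, String text);

    /** Drives the callbacks for one table row; each inner list is one cell. */
    public final void processRow(C root, List<List<String>> cells) {
        for (List<String> cell : cells) {
            C cellContext = openCell(root);
            for (String paragraph : cell) {
                processParagraph(cellContext, paragraph);
            }
        }
    }
}

/** Example binding of C: plain text collected in a StringBuilder. */
class PlainTextConverter extends GenericConverter<StringBuilder> {
    @Override protected StringBuilder openCell(StringBuilder parent) {
        // text output has no nesting, so mark the cell and reuse the parent
        return parent.append("[cell]\n");
    }

    @Override protected void processParagraph(StringBuilder context, String text) {
        context.append(text).append('\n');
    }

    /** Converts a two-cell row and returns the collected text. */
    static String demo() {
        StringBuilder out = new StringBuilder();
        new PlainTextConverter().processRow(out, Arrays.asList(
                Arrays.asList("Name", "Jane"),
                Arrays.asList("Role")));
        return out.toString();
    }
}
```

A PDF variant would bind C to whatever the PDF library uses as a container (a table cell or a column, say), so the callbacks get a typed context without any DOM detour.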

Using existing Word-to-XXX converters in combination with existing XXX-to-PDF converters

If that seems too complex or time consuming for your resources, you might try a different approach: use the output of one of the existing converters mentioned above as input for a conversion to PDF.

Using existing conversion classes will most likely produce results earlier, but multi-step conversions tend to be more lossy than single-step ones. The decision is up to you.

In the code you posted in your question, you used iText classes. iText does support conversion from HTML to PDF, with some limitations, using the XMLWorker provided in the iText XML Worker sub-project. In ancient iText versions there used to be the now deprecated HTMLWorker class. Thus, using WordToHtmlConverter in combination with iText XMLWorker may be an option for you.
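One practical detail of that route: WordToHtmlConverter builds its result as an org.w3c.dom.Document, so the markup has to be serialized before an HTML-to-PDF step can read it. The sketch below shows only that serialization step, using JDK classes. The small hand-built DOM stands in for the converter's output, DomToHtml is a made-up helper name, and choosing the "xml" output method (to get well-formed markup) is an assumption you should check against the tool you feed it to.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical helper: turns a DOM document into markup text. */
class DomToHtml {
    /** Builds a tiny DOM tree standing in for the converter's output. */
    static Document sampleDocument() {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element html = dom.createElement("html");
            Element body = dom.createElement("body");
            Element p = dom.createElement("p");
            p.setTextContent("Hello");
            body.appendChild(p);
            html.appendChild(body);
            dom.appendChild(html);
            return dom;
        } catch (Exception e) {
            throw new IllegalStateException("could not build sample DOM", e);
        }
    }

    /** Serializes the DOM with an identity transform, without XML declaration. */
    static String serialize(Document dom) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.METHOD, "xml");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            transformer.setOutputProperty(OutputKeys.INDENT, "no");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(dom), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not serialize DOM", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serialize(sampleDocument()));
    }
}
```

In a real pipeline you would pass the document obtained from the converter to serialize(...) instead of sampleDocument(), and hand the resulting string or stream to the HTML-to-PDF step.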

Alternatively, Apache also provides XSL-FO processing to PDF (Apache FOP). Applying this to the output of WordToFoConverter may also be an option.
