We have been using FOP for HTML-to-PDF generation, and I recently noticed that when Unicode characters are included, the text in the PDF output cannot be parsed back correctly.
How we use it (FOP 2.4):
- We have an HTML string.
- We parse it through CSSToXSLFO and hand the result to FOP:

    final InputSource source = new InputSource(new ByteArrayInputStream(htmlString.getBytes()));

    FopFactoryBuilder builder = new FopFactoryBuilder(
            URI.create(resourceLoader.getResource(resourceBasePath).getURI().toString()),
            new ClasspathResolverURIAdapter());
    builder.setConfiguration(configuration);
    FopFactory factory = builder.build();

    userAgent = factory.newFOUserAgent();
    userAgent.setAuthor("Indeed");
    userAgent.setCreator("Indeed");
    userAgent.setTitle("Indeed");
    userAgent.setKeywords("Indeed");
    fop = factory.newFop(MimeConstants.MIME_PDF, userAgent, outputStream);

    // Set up CSSToXSLFO to transform the XHTML output into XSL-FO
    final URL baseUrl = resourceLoader.getResource(resourceBasePath).getURL();
    Loggers.debug(LOGGER, "Parsing HTML response using base URL '%s'", baseUrl);
    final XMLReader xmlParser = Util.getParser(null, isValidatingParser);
    final ProtectEventHandlerFilter eventHandlerFilter =
            new ProtectEventHandlerFilter(true, true, xmlParser);
    final XMLReader filter = new CSSToXSLFOFilter(
            baseUrl, null, Collections.EMPTY_MAP, eventHandlerFilter, cssToXslFoDebugEnabled);
    filter.setEntityResolver(classPathEntityResolver);
    filter.setContentHandler(fop.getDefaultHandler());
    filter.parse(source);
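One variable worth ruling out in the setup above: `htmlString.getBytes()` encodes with the platform default charset, which is not guaranteed to be UTF-8 on every JVM. Below is a minimal sketch of building the `InputSource` with an explicit encoding. The class and method names (`Utf8InputSources`, `fromHtml`) are hypothetical, and it assumes the HTML does not declare a different encoding itself. Whether this has any effect on the copy/paste behaviour is not established here; it only removes the default charset as a factor.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of FOP or CSSToXSLFO.
final class Utf8InputSources {

    /** Wraps the HTML string as UTF-8 bytes and tells the parser so explicitly. */
    static InputSource fromHtml(String htmlString) {
        // Explicit charset instead of the platform default used by getBytes().
        byte[] utf8 = htmlString.getBytes(StandardCharsets.UTF_8);
        InputSource source = new InputSource(new ByteArrayInputStream(utf8));
        // Assumption: the document itself does not declare a conflicting encoding.
        source.setEncoding("UTF-8");
        return source;
    }
}
```

With this, `final InputSource source = Utf8InputSources.fromHtml(htmlString);` would replace the first line of the snippet above.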
This produces a PDF in which every character is displayed correctly, i.e. it looks right to a human reader.
We also need to read the PDF programmatically. We are testing this by selecting the text in Adobe Reader, then copying and pasting it. The pasted text is the same as what extraction tools such as pdftotext and PDFBox produce.
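Since the glyphs on screen look right, comparing source and extracted text by eye is unreliable. A small sketch of comparing them by code point instead: it reports which code points are missing from, or unexpectedly present in, the extracted text. The class name `CodePointDiff` and the sample strings are made up for illustration; the extracted string would come from a copy/paste or from a tool such as pdftotext.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for comparing source text with text extracted from the PDF.
final class CodePointDiff {

    /** Counts how often each code point occurs in the text. */
    static Map<Integer, Integer> histogram(String text) {
        Map<Integer, Integer> counts = new TreeMap<>();
        text.codePoints().forEach(cp -> counts.merge(cp, 1, Integer::sum));
        return counts;
    }

    /**
     * Per code point: occurrences in extracted minus occurrences in source.
     * Positive means it showed up unexpectedly, negative means it went missing.
     * Code points with equal counts are left out.
     */
    static Map<Integer, Integer> diff(String source, String extracted) {
        Map<Integer, Integer> delta = new TreeMap<>();
        histogram(extracted).forEach((cp, n) -> delta.merge(cp, n, Integer::sum));
        histogram(source).forEach((cp, n) -> delta.merge(cp, -n, Integer::sum));
        delta.values().removeIf(v -> v == 0);
        return delta;
    }

    public static void main(String[] args) {
        String source = "jack \uD83D\uDE02";   // made-up sample: "jack " + U+1F602
        String extracted = "jack ?";           // made-up sample of a bad extraction
        diff(source, extracted)
                .forEach((cp, n) -> System.out.printf("U+%04X %+d%n", cp, n));
    }
}
```

This ignores ordering and position, so it only detects substituted, dropped, or extra characters. Whitespace differences introduced by the extraction tool would also show up in the result.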
However, when the text contains many Unicode characters, three things happen when we copy:
1) Some Unicode characters are copied as other, unrelated characters.
e.g. source: 😂😂😂😂😂 🃋🃋🃋🃋🃋 🃋🃋🃋 jack 𝍐𝍐𝍐 3 chars 𝄞𝄞 2 music majhog : 🀤 🀤 🀤
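2) Some characters show up in the copied text in places where the PDF does not contain them.
e.g. In the example above, the next page of the PDF does not contain 🃋. However, when copying, "🃋🃋🃋" showed up somewhere in the text of that next page.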
3) Some fonts produce corrupted PDF output. We were trying out mathematical bold letters, e.g. "𝐏𝐫𝐨𝐟𝐢𝐥𝐞".
This was fixable by embedding the Symbola font with embedding-mode="full", which produces a correct-looking PDF. However, copying "𝐏𝐫𝐨𝐟𝐢𝐥𝐞" gave "퐏퐫퐨퐟퐢퐥퐞". Comparing the two, the capital P is U+1D40F and the Korean character copied in its place is U+D40F. The leading 1 is missing, i.e. only the low 16 bits of the code point survive.
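The arithmetic behind that observation can be checked in a few lines. This is only a sketch of the symptom: it shows that masking U+1D40F to 16 bits yields exactly U+D40F. It does not claim that this is what FOP does internally. The class name `CodePointTruncation` is hypothetical.

```java
// Hypothetical demo class; illustrates the symptom, not FOP's internals.
final class CodePointTruncation {

    /** Keeps only the low 16 bits of a code point, as a 16-bit-only mapping would. */
    static int truncateTo16Bits(int codePoint) {
        return codePoint & 0xFFFF;
    }

    /** Formats every code point of the text as U+XXXX, separated by spaces. */
    static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        int boldP = 0x1D40F; // MATHEMATICAL BOLD CAPITAL P
        String source = new String(Character.toChars(boldP));
        String copied = new String(Character.toChars(truncateTo16Bits(boldP)));
        // prints "U+1D40F -> U+D40F"
        System.out.println(describe(source) + " -> " + describe(copied));

        // In UTF-16 the source character is a surrogate pair, not a single unit.
        char[] utf16 = Character.toChars(boldP);
        // prints "UTF-16: D835 DC0F"
        System.out.printf("UTF-16: %04X %04X%n", (int) utf16[0], (int) utf16[1]);
    }
}
```

Every character outside the Basic Multilingual Plane (anything above U+FFFF, which includes all the emoji, playing card, and mahjong symbols in the example text) would be affected by this kind of truncation.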
This has been frustrating, and I have searched everywhere. It seems to be related to how FOP handles the ToUnicode CMap from a font file. As a cross-check, I produced the PDF with WeasyPrint (a Python library), and there all characters copy correctly.
Are we using FOP incorrectly? Are there tweaks we can make to fix it?