Double byte Unicode Char incorrect when parsed


Fung Cheung
Hello,

We have been using FOP for HTML-to-PDF generation, and I recently noticed that when Unicode characters are included, the PDF output has text-extraction issues.

How we use it (FOP 2.4): we have an HTML string and run it through CSSToXSLFOFilter into FOP:

final InputSource source =
        new InputSource(new ByteArrayInputStream(htmlString.getBytes(StandardCharsets.UTF_8)));

FopFactoryBuilder builder = new FopFactoryBuilder(
        URI.create(resourceLoader.getResource(resourceBasePath).getURI().toString()),
        new ClasspathResolverURIAdapter());
builder.setConfiguration(configuration);

FopFactory factory = builder.build();
userAgent = factory.newFOUserAgent();
userAgent.setAuthor("Indeed");
userAgent.setCreator("Indeed");
userAgent.setTitle("Indeed");
userAgent.setKeywords("Indeed");

fop = factory.newFop(MimeConstants.MIME_PDF, userAgent, outputStream);

// Set up CSSToXSLFOFilter to transform the XHTML into XSL-FO
final URL baseUrl = resourceLoader.getResource(resourceBasePath).getURL();
Loggers.debug(LOGGER, "Parsing HTML response using base URL '%s'", baseUrl);
final XMLReader xmlParser = Util.getParser(null, isValidatingParser);
final ProtectEventHandlerFilter eventHandlerFilter =
        new ProtectEventHandlerFilter(true, true, xmlParser);

final XMLReader filter = new CSSToXSLFOFilter(
        baseUrl,
        null,
        Collections.emptyMap(),
        eventHandlerFilter,
        cssToXslFoDebugEnabled);

filter.setEntityResolver(classPathEntityResolver);
filter.setContentHandler(fop.getDefaultHandler());
filter.parse(source);

This produces a PDF in which all the characters display correctly; that is, it looks right to a human.
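One side note, an assumption on our part and not necessarily the cause here: calling htmlString.getBytes() without an argument uses the platform default charset, which can itself mangle non-ASCII input before FOP ever sees it. A minimal sketch of why being explicit matters:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharsetDemo {
    public static void main(String[] args) {
        // U+1F602 FACE WITH TEARS OF JOY occupies 4 bytes in UTF-8
        byte[] utf8 = "😂".getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length);           // 4
        System.out.println(Arrays.toString(utf8)); // [-16, -97, -104, -126]

        // Decoding those same bytes with a single-byte charset produces mojibake,
        // the classic symptom of a default-charset round trip gone wrong
        String mojibake = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(mojibake.length());     // 4 garbage chars instead of 1 emoji
    }
}
```

Passing StandardCharsets.UTF_8 to getBytes() (and declaring the same encoding in the XML/HTML prolog) rules this failure mode out.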

We also have a use case of reading the PDF programmatically. We are testing this by selecting the text in Adobe Reader, copying and pasting it; the result matches what extraction tools like pdftotext and PDFBox produce.

However, when there are many Unicode characters, three things happen when we copy:
1) Some Unicode characters are copied as other, seemingly random characters.
e.g. source: 😂😂😂😂😂 🃋🃋🃋🃋🃋 🃋🃋🃋 jack 𝍐𝍐𝍐 3 chars 𝄞𝄞 2 music majhog : 🀀 🀀 🀀
copy output: 😂😂😂😂😂 33333 333🃋 jack🃋 555   chars🃋 𝍐 2 music🃋 majhog 8 3 3 3

2) Characters show up in the wrong location.
e.g. from the example above, the next page of the PDF does not display 🃋 at all, yet the copied text contains "🃋🃋🃋" somewhere on that page.

3) Some fonts produce corrupted PDF output. We were trying Mathematical Alphanumeric Symbols, e.g. "𝐏𝐫𝐨𝐟𝐒𝐥𝐞".
This was fixable by using the Symbola font with embedding-mode="full", which yields a correct-looking PDF. However, copying "𝐏𝐫𝐨𝐟𝐒𝐥𝐞" gives "퐏퐫퐨퐟퐒퐥퐞". Comparing them, the bold capital P is U+1D40F, and the corresponding Korean character is U+D40F: the leading 1 is missing, i.e. the code point has been truncated to 16 bits.
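That "missing leading 1" is exactly what a supplementary-plane code point looks like after being truncated to 16 bits. A quick illustration of the arithmetic (this only demonstrates the symptom, not FOP's internals):

```java
public class TruncationDemo {
    public static void main(String[] args) {
        int bold = 0x1D40F; // MATHEMATICAL BOLD CAPITAL P, "𝐏"

        // Above U+FFFF, UTF-16 stores the code point as a surrogate pair
        char[] utf16 = Character.toChars(bold);
        System.out.printf("%X %X%n", (int) utf16[0], (int) utf16[1]); // D835 DC0F

        // Keeping only the low 16 bits of the code point yields the
        // Hangul syllable the copied text contains
        char truncated = (char) (bold & 0xFFFF);
        System.out.printf("%X%n", (int) truncated); // D40F, "퐏"
    }
}
```

So whatever writes the ToUnicode CMap appears to be emitting a single 16-bit value where a surrogate pair (or a full 32-bit code point) is needed.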

This has been frustrating and I have googled everywhere. It seems to be related to how FOP builds the ToUnicode CMap from the font file. I confirmed it is FOP-specific by producing the same PDF with WeasyPrint (a Python library), where all characters copy correctly.

Are we using FOP incorrectly? Are there tweaks we can make to fix this?

Thanks so much!