Unicode and Java versions | Madalin's Blog

While enhancing CATS I recently added a feature to send requests that include single and multi code point emojis. This is a single code point emoji: 🥶, which can be represented in Java as the "\uD83E\uDD76" string (a UTF-16 surrogate pair). The test case is simple: inject emojis within strings and expect that the REST endpoint will sanitize the input and remove them entirely (I appreciate this might not be a valid case for all APIs, but that is not the focus of this article).
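
To make the "single code point, two chars" point concrete, here is a minimal sketch. The class name EmojiCodePoints and the helper fromCodePoint are made up for illustration; only standard java.lang calls are used.

```java
public class EmojiCodePoints {

    // Hypothetical helper: builds the String for a code point.
    // Code points above U+FFFF become a surrogate pair (two chars).
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String... args) {
        String coldFace = fromCodePoint(0x1F976);

        // Same two UTF-16 code units as the escaped literal from the text
        System.out.println(coldFace.equals("\uD83E\uDD76")); // true
        // length() counts UTF-16 code units, not code points
        System.out.println(coldFace.length()); // 2
        System.out.println(coldFace.codePointCount(0, coldFace.length())); // 1
    }
}
```

This part does not depend on the JRE's Unicode version: the surrogate pair arithmetic is the same on every Java release.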

I usually recommend that any REST endpoint should sanitize input before validating it and remove special characters. A typical regex for this would be [\p{C}\p{Z}\p{So}]+ (although you should enhance it to allow spaces between words), which means:

\p{C} - matches Unicode invisible control chars (\u000D - carriage return for example); this category also covers format, surrogate, private-use and unassigned code points
\p{Z} - matches Unicode whitespace and invisible separators (\u2028 - line separator for example)
\p{So} - matches various symbols that are not math symbols, currency signs, or combining characters; this also includes emojis
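
One possible way to "allow spaces between words" while still using all three categories is Java's character class intersection (&&). This is only a sketch: the class name InputSanitizer, the constant UNWANTED and the method sanitize are hypothetical, and keeping only the plain space (U+0020) is an assumption about what a given API wants.

```java
import java.util.regex.Pattern;

public class InputSanitizer {

    // Matches \p{C}, \p{Z} and \p{So} characters, except the plain space (U+0020).
    // "&&[^ ]" intersects the union of the three categories with "anything but a space".
    private static final Pattern UNWANTED =
            Pattern.compile("[\\p{C}\\p{Z}\\p{So}&&[^ ]]+");

    // Hypothetical helper: strips every run of unwanted characters.
    static String sanitize(String input) {
        return UNWANTED.matcher(input).replaceAll("");
    }

    public static void main(String... args) {
        // \u2028 = line separator (Z), \u2603 = snowman (So), \r = carriage return (C)
        String input = "hello\u2028 wor\u2603ld\r";
        System.out.println("[" + sanitize(input) + "]"); // [hello world]
    }
}
```

The example deliberately uses the snowman (U+2603), a symbol that has been in Unicode for decades, so the result does not depend on which Unicode version the JRE supports.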

I have a test service I use for testing new CATS fuzzers. The idea was to simply use the String’s replaceAll() method to remove all these characters from the String.

So let’s take the following simple code which aims to sanitize a given input:

    public static void main(String... args) {
        String input = "this is a great \uD83E\uDD76 article";
        String output = input.replaceAll("[\\p{C}\\p{So}]+", "");

        System.out.println("input = " + input);
        System.out.println("output = " + output);
    }

When I run this with Java 11, I get the following output:

input = this is a great 🥶 article
output = this is a great  article

This works as expected: the 🥶 emoji was removed from the String.

Even though I have CATS compiled to Java 8, I mainly use JDK 11+ for development. At some point I had CATS running in a CD pipeline with JRE 8. The emoji test cases generated by the CATS Fuzzers started to fail there, even though they were passing on my local box (and on other CD pipelines). I went through the log files: the request payloads were initially constructed and displayed fine, with the emoji properly printed, but after running some pattern matching on the string the result was printed as sometext?andanother. The ? is where the emoji was supposed to be. Further investigation led to the conclusion that the mishandling of the emoji was caused by the JRE version (which might be obvious for 99.999% of Java devs out there). This is actually expected, as Java 8 supports Unicode 6.2, while 🥶 is part of Unicode 11.

Going back to the previous example, if I run it with Java 8, I get the following output:

input = this is a great 🥶 article
output = this is a great ? article

Conclusions:

Even though a Java version can receive, write/store and forward the latest Unicode characters, any attempt to manipulate them might result in weird ? symbols if the character was added in a Unicode version newer than the one your JRE supports
Regardless of how you compile the code, it's the JRE that decides how Unicode chars are handled, i.e. a Java program compiled for Java 8 can behave differently on JRE 8 vs JRE 14
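
To check what a particular JRE knows about a code point, a small diagnostic along these lines can be run on each runtime. It is a sketch only: the class name UnicodeSupportCheck and the methods isKnownToRuntime and describe are made up for illustration. It reports how the running JRE classifies the code point and does not by itself explain the ? output.

```java
public class UnicodeSupportCheck {

    // True when the running JRE has this code point assigned in its Unicode tables.
    static boolean isKnownToRuntime(int codePoint) {
        return Character.isDefined(codePoint);
    }

    // Hypothetical helper: describes the general category the running JRE assigns.
    static String describe(int codePoint) {
        int type = Character.getType(codePoint);
        String category;
        if (type == Character.OTHER_SYMBOL) {
            category = "So (other symbol)";
        } else if (type == Character.UNASSIGNED) {
            category = "Cn (unassigned)";
        } else {
            category = "general category constant " + type;
        }
        return String.format("Java %s: U+%X -> %s",
                System.getProperty("java.version"), codePoint, category);
    }

    public static void main(String... args) {
        // The result for U+1F976 depends on the Unicode version of the running JRE,
        // so no expected output is shown here.
        System.out.println(describe(0x1F976));
        System.out.println("defined: " + isKnownToRuntime(0x1F976));
    }
}
```

Running the same class file on two different JREs and comparing the lines is a quick way to confirm that the runtime, not the compiler target, drives the behaviour.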