Detect the charset in Java strings

Blog

13/12/17

Lluís Turró Cutiller

java

Before starting, I would like to mention Apache Tika and juniversalchardet. Tika is a full-featured file type detection library and, because it has so many features, it pulls in a large number of dependencies. I haven't tried juniversalchardet because it does not detect ISO-8859-1, which is the reason I needed charset detection in the first place.

Since neither suited my problem well, I decided to detect charsets myself and, once the results were in production, share the approach with everyone else. I hope you like it.

Why charset detection?

Anyone developing web applications that take data input and rely on third-party frameworks using a charset other than UTF-8 might have run into the need to auto-detect the charset. Guessing the source of the input in utility classes, or passing the charset along from method to method, doesn't seem to be the right way and isn't always possible.

Changing the string charset

We'll need a convert method in order to change the string charset. The simplest way would be to use the methods that String already provides. Something like:

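import java.nio.charset.Charset;

// Encodes value with fromEncoding and decodes the resulting bytes with toEncoding.
// Static so charset() can call it; Charset.forName avoids the checked UnsupportedEncodingException.
public static String convert(String value, String fromEncoding, String toEncoding) {
  return new String(value.getBytes(Charset.forName(fromEncoding)), Charset.forName(toEncoding));
}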
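
As an illustration, here is a minimal sketch of convert() used when the source encoding is known. The class name ConvertDemo and the sample string are assumptions made for the example, and the sketch relies on Charset.forName so that no checked exception has to be declared. It repairs a string whose UTF-8 bytes were wrongly decoded as ISO-8859-1.

```java
import java.nio.charset.Charset;

// Hypothetical demo class, not part of the original article.
class ConvertDemo {

  // Encodes value with fromEncoding and decodes the resulting bytes with toEncoding.
  static String convert(String value, String fromEncoding, String toEncoding) {
    return new String(value.getBytes(Charset.forName(fromEncoding)), Charset.forName(toEncoding));
  }

  public static void main(String[] args) {
    // The word "cafe" with an accented e, whose two UTF-8 bytes (C3 A9)
    // were wrongly decoded as two ISO-8859-1 characters.
    String garbled = "caf\u00C3\u00A9";
    String repaired = convert(garbled, "ISO-8859-1", "UTF-8");
    System.out.println(repaired.equals("caf\u00E9")); // prints "true"
  }
}
```

The conversion only repairs the string because the wrong decoding step (ISO-8859-1) is known in advance.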

The problem remains, though: the fromEncoding parameter isn't always known.

Charset guessing

Guessing? Well, let's be clear: we are guessing, and we are also relying on some premises that might not hold. For instance, we probe using UTF-8 against a set of expected charsets. The good thing is that we know the elements at play and can change them at will.

The approach is very simple: if I convert the string from the expected charset to UTF-8 and then back from UTF-8 to the expected charset, shouldn't the resulting string be exactly the same as the original one?
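
To make the round trip concrete, here is a minimal sketch that traces it for a single candidate charset. The class RoundTripDemo, the helper roundTrip() and the sample strings are assumptions made for this illustration only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names below are illustrative only.
class RoundTripDemo {

  // Candidate -> UTF-8 -> candidate; returns the string after both conversions.
  static String roundTrip(String value, Charset candidate) {
    // Encode with the candidate charset and read those bytes as UTF-8.
    String asUtf8 = new String(value.getBytes(candidate), StandardCharsets.UTF_8);
    // Encode the intermediate string as UTF-8 and read those bytes with the candidate.
    return new String(asUtf8.getBytes(StandardCharsets.UTF_8), candidate);
  }

  public static void main(String[] args) {
    Charset latin1 = StandardCharsets.ISO_8859_1;
    String garbled = "caf\u00C3\u00A9"; // UTF-8 bytes C3 A9 decoded as ISO-8859-1
    String clean = "caf\u00E9";         // correctly decoded accented e
    System.out.println(garbled.equals(roundTrip(garbled, latin1))); // prints "true"
    System.out.println(clean.equals(roundTrip(clean, latin1)));     // prints "false"
  }
}
```

The second string fails the check because its ISO-8859-1 byte E9 is not valid UTF-8 on its own, so the decoder substitutes the replacement character U+FFFD and the original can no longer be recovered.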

Let's put this to work:

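import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public static String charset(String value, String[] charsets) {
  String probe = StandardCharsets.UTF_8.name();
  for (String c : charsets) {
    // Charset.forName never returns null (it throws for unknown names), so check support instead.
    if (Charset.isSupported(c)) {
      // The candidate matches when the round trip through UTF-8 leaves the string unchanged.
      if (value.equals(convert(convert(value, c, probe), probe, c))) {
        return c;
      }
    }
  }
  // No candidate matched: fall back to UTF-8.
  return probe;
}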

A possible call to the charset() method would be:

String detectedCharset = charset(value, new String[] { "ISO-8859-1", "UTF-8" });
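
For reference, the two methods can be combined into one small runnable sketch. The class name DetectDemo and the sample inputs are assumptions made for the example; convert() uses Charset.forName to avoid checked exceptions, and unsupported candidate names are simply skipped.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class combining convert() and charset().
class DetectDemo {

  static String convert(String value, String fromEncoding, String toEncoding) {
    return new String(value.getBytes(Charset.forName(fromEncoding)), Charset.forName(toEncoding));
  }

  // Returns the first candidate for which value survives the round trip through UTF-8.
  static String charset(String value, String[] charsets) {
    String probe = StandardCharsets.UTF_8.name();
    for (String c : charsets) {
      if (Charset.isSupported(c)
          && value.equals(convert(convert(value, c, probe), probe, c))) {
        return c;
      }
    }
    return probe; // fallback when no candidate matches
  }

  public static void main(String[] args) {
    String[] expected = { "ISO-8859-1", "UTF-8" };
    // UTF-8 bytes C3 A9 that were decoded as ISO-8859-1.
    System.out.println(charset("caf\u00C3\u00A9", expected)); // prints "ISO-8859-1"
    // Correctly decoded accented e.
    System.out.println(charset("caf\u00E9", expected));       // prints "UTF-8"
    // Pure ASCII survives every round trip, so the first candidate wins.
    System.out.println(charset("abc", expected));             // prints "ISO-8859-1"
  }
}
```

Note that the order of the candidates matters: a plain ASCII string passes the check for every candidate, so the first one in the array is returned.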

As I said, the approach relies on the premise that UTF-8 will behave well in all transformations and that there is a small set of expected charsets. I haven't tried probing the whole of Charset.availableCharsets(). If you do and find a better way, please let me know.
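
For anyone who wants to try, probing every installed charset could be sketched as follows. This is an untested idea rather than something running in production: the class ProbeAllDemo and the method candidates() are hypothetical names, decode-only charsets are filtered out with canEncode(), and any charset that fails at runtime is skipped as a defensive assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of the original article.
class ProbeAllDemo {

  // Returns the names of all installed charsets for which value
  // survives the round trip through UTF-8.
  static List<String> candidates(String value) {
    List<String> matches = new ArrayList<>();
    for (Charset cs : Charset.availableCharsets().values()) {
      if (!cs.canEncode()) {
        continue; // decode-only charsets cannot produce bytes
      }
      try {
        String asUtf8 = new String(value.getBytes(cs), StandardCharsets.UTF_8);
        String back = new String(asUtf8.getBytes(StandardCharsets.UTF_8), cs);
        if (value.equals(back)) {
          matches.add(cs.name());
        }
      } catch (RuntimeException e) {
        // Defensive assumption: skip any charset that cannot complete the round trip.
      }
    }
    return matches;
  }

  public static void main(String[] args) {
    // UTF-8 bytes C3 A9 that were decoded as ISO-8859-1.
    System.out.println(candidates("caf\u00C3\u00A9"));
  }
}
```

The result is a list rather than a single name because several charsets usually pass the check for the same string (at least ISO-8859-1 and UTF-8 for the sample above), so a short, ordered list of expected charsets is still needed to pick one.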
