In today’s globalized world, applications must support multiple languages and scripts — from English and Hindi to Chinese, Arabic, or emojis. Java’s string handling is built on Unicode, ensuring consistent behavior across platforms and locales.
But developers still encounter encoding issues when reading/writing files, converting between bytes and characters, or dealing with legacy systems. This tutorial explains how Unicode and encoding work in Java, with practical tips for building robust, international-ready applications.
🧠 What is Unicode?
Unicode is a universal character encoding standard that assigns a unique number (code point) to every character in every language and symbol set.
Examples:
- 'A' → U+0041
- '你' → U+4F60
- '😊' → U+1F60A
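The mapping from character to code point can be checked directly in Java. The following is a minimal sketch; the class name CodePointDemo and the helper describe are made up for illustration, while codePointAt is the standard String method.

```java
final class CodePointDemo {
    // Formats the first code point of s in the usual U+XXXX notation.
    static String describe(String s) {
        return String.format("U+%04X", s.codePointAt(0));
    }

    public static void main(String[] args) {
        for (String s : new String[] {"A", "你", "😊"}) {
            System.out.println(s + " -> " + describe(s));
        }
    }
}
```

Using codePointAt rather than charAt matters for the last example, because charAt would return only the first half of the emoji.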
🔤 Java's Internal String Representation
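Java's String API represents text as a sequence of UTF-16 code units (char values). Since Java 9 the JVM may store strings that contain only Latin-1 characters in one byte per character, but the API still behaves as UTF-16.
- Characters from U+0000 to U+FFFF are stored in a single char (16-bit).
- Supplementary characters (above U+FFFF) are represented using surrogate pairs.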
String emoji = "😊";
System.out.println(emoji.length()); // 2 (due to surrogate pair)
🔧 Converting Between Strings and Bytes
Java uses the Charset class for encoding/decoding.
String → Byte Array
byte[] utf8Bytes = "नमस्ते".getBytes(StandardCharsets.UTF_8);
Byte Array → String
String decoded = new String(utf8Bytes, StandardCharsets.UTF_8);
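Put together, the two conversions form a round trip. This sketch assumes UTF-8 on both sides; the class RoundTripDemo and its encode and decode helpers are illustrative names, not part of any library.

```java
import java.nio.charset.StandardCharsets;

final class RoundTripDemo {
    // Encodes text to UTF-8 bytes.
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Decodes UTF-8 bytes back into a String.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "नमस्ते";
        byte[] bytes = encode(original);
        // The byte count is larger than the char count because each
        // Devanagari code point needs several bytes in UTF-8.
        System.out.println(original.length() + " chars, " + bytes.length + " bytes");
        System.out.println("round trip ok: " + decode(bytes).equals(original));
    }
}
```

The round trip is lossless only because the same charset is used in both directions.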
🛠️ Common Encodings in Java
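| Charset | Description |
|---|---|
| UTF-8 | Most popular; variable-length (1 to 4 bytes per code point) |
| UTF-16 | Java's String representation; 2 or 4 bytes per code point |
| ISO-8859-1 | Western European (Latin-1); 1 byte per character |
| US-ASCII | Basic Latin (0–127) |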
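To make the table concrete, the sketch below encodes the same text with each charset and compares the sizes. CharsetSizeDemo and encodedLength are hypothetical names. UTF_16BE is used instead of UTF_16 because Java's UTF_16 encoder prepends a two-byte byte order mark, which would distort the comparison.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetSizeDemo {
    // Returns how many bytes the text occupies in the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String text = "\u00E9\u4F60"; // "é" followed by "你"
        Charset[] charsets = {
            StandardCharsets.UTF_8,
            StandardCharsets.UTF_16BE,
            StandardCharsets.ISO_8859_1,
            StandardCharsets.US_ASCII
        };
        for (Charset cs : charsets) {
            System.out.println(cs + ": " + encodedLength(text, cs) + " bytes");
        }
    }
}
```

Note that getBytes does not fail on characters a charset cannot represent: it silently substitutes the charset's replacement byte (a question mark for ISO-8859-1 and US-ASCII), so a small byte count can also mean data was lost.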
📤 Writing Unicode-Aware Files
Files.write(Paths.get("out.txt"), "こんにちは".getBytes(StandardCharsets.UTF_8));
Ensure your text editors, databases, and file systems use the same encoding.
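For more than a single string, a writer with an explicit charset is often more convenient than building byte arrays by hand. The following is a sketch under that assumption; WriteUtf8Demo, writeLines, and roundTrip are illustrative names, and the temporary file exists only so the example can run anywhere.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

final class WriteUtf8Demo {
    // Writes each entry as one UTF-8 line, replacing any existing file.
    static void writeLines(Path target, List<String> lines) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            for (String line : lines) {
                writer.write(line);
                writer.newLine();
            }
        }
    }

    // Writes to a temporary file and reads it back with the same charset.
    static List<String> roundTrip(List<String> lines) {
        try {
            Path tmp = Files.createTempFile("unicode-demo", ".txt");
            try {
                writeLines(tmp, lines);
                return Files.readAllLines(tmp, StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(List.of("こんにちは", "😊")));
    }
}
```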
📥 Reading Unicode Files Safely
List<String> lines = Files.readAllLines(Paths.get("data.txt"), StandardCharsets.UTF_8);
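readAllLines loads the whole file into memory. For larger files, a streaming read with the same explicit charset may be preferable. The sketch below uses Files.lines; ReadUtf8Demo, countCodePoints, and countSample are hypothetical names, and countSample only exists to make the example self-contained.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

final class ReadUtf8Demo {
    // Streams the file line by line and sums the code points per line
    // (line terminators are not counted).
    static long countCodePoints(Path file) throws IOException {
        try (Stream<String> lines = Files.lines(file, StandardCharsets.UTF_8)) {
            return lines.mapToLong(line -> line.codePointCount(0, line.length())).sum();
        }
    }

    // Writes sample text to a temporary file as UTF-8, then counts it.
    static long countSample(String text) {
        try {
            Path tmp = Files.createTempFile("unicode-read", ".txt");
            try {
                Files.write(tmp, text.getBytes(StandardCharsets.UTF_8));
                return countCodePoints(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countSample("a😊\nb"));
    }
}
```

The stream is opened in a try-with-resources block so the underlying file handle is closed even if processing fails.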
🔎 Understanding Surrogate Pairs
Characters above U+FFFF (like most emojis) are represented using two char values in UTF-16.
String smile = "😊";
System.out.println(smile.length()); // 2
System.out.println(smile.codePointCount(0, smile.length())); // 1
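The two halves of a pair can be inspected individually. This is a small sketch; SurrogateDemo and codeUnits are made-up names, while Character.isSurrogatePair and Character.toCodePoint are standard methods.

```java
final class SurrogateDemo {
    // Renders every UTF-16 code unit of s as four hex digits.
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String smile = "😊";
        char high = smile.charAt(0);
        char low = smile.charAt(1);
        System.out.println(codeUnits(smile));                      // D83D DE0A
        System.out.println(Character.isSurrogatePair(high, low));  // true
        // Recombine the pair into the original code point.
        System.out.println(Integer.toHexString(Character.toCodePoint(high, low))); // 1f60a
    }
}
```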
🧪 Working with Code Points
Use these methods to handle supplementary characters correctly:
String str = "𝒜𝒷𝒸"; // fancy script
str.codePoints().forEach(cp -> System.out.println(Character.toChars(cp)));
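A common place where this matters is shortening text. A plain substring can cut a surrogate pair in half, so the sketch below counts in code points instead. CodePointOps and truncate are illustrative names, and the method assumes maxCodePoints is not negative.

```java
final class CodePointOps {
    // Keeps at most maxCodePoints code points and never splits a surrogate pair.
    static String truncate(String s, int maxCodePoints) {
        if (s.codePointCount(0, s.length()) <= maxCodePoints) {
            return s;
        }
        // offsetByCodePoints converts a code point count into a char index.
        return s.substring(0, s.offsetByCodePoints(0, maxCodePoints));
    }

    public static void main(String[] args) {
        String text = "a😊b";
        System.out.println(truncate(text, 2));             // a😊
        System.out.println(text.substring(0, 2).length()); // 2, but ends in half a pair
    }
}
```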
📉 Common Encoding Pitfalls
- Mismatched encodings: Writing in UTF-8 but reading in ISO-8859-1.
- Truncated characters: When multibyte sequences are cut mid-way.
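- Unescaped Unicode: Java source supports \uXXXX escapes, but the compiler translates them before it parses the code, so an escape such as \u000A inside a string literal or comment turns into a real line break. A supplementary character needs two escapes, one for each half of its surrogate pair.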
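The first two pitfalls are easy to reproduce. The following sketch is only a demonstration; PitfallDemo, misread, and decodeTruncated are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class PitfallDemo {
    // Encodes as UTF-8 but decodes as ISO-8859-1, producing mojibake.
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Decodes only the first byteCount bytes of the UTF-8 encoding.
    static String decodeTruncated(String text, int byteCount) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(Arrays.copyOf(utf8, byteCount), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String e = "\u00E9"; // "é", two bytes in UTF-8
        System.out.println(misread(e));            // Ã©
        System.out.println(decodeTruncated(e, 1)); // U+FFFD, the replacement character
    }
}
```

In both cases no exception is thrown, which is why these bugs tend to surface late, as corrupted text rather than as errors.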
🔄 Refactoring Example: Avoid Platform Defaults
❌ Risky
byte[] bytes = str.getBytes(); // uses platform default charset
✅ Safer
byte[] bytes = str.getBytes(StandardCharsets.UTF_8);
📌 What's New in Java for Unicode/Encoding?
Java 7+
- StandardCharsets added for encoding safety.
- Files.readAllLines() and write() methods with charset support.
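Java 11+
- New String methods isBlank() and strip() use the Unicode-aware definition of whitespace (unlike trim(), which only removes characters up to U+0020), and lines() splits text on line terminators.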
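The difference between the older trim() and the Java 11 methods shows up with non-ASCII whitespace. This sketch assumes Java 11 or later and uses U+2003 (EM SPACE) as the example; StripDemo and its helpers are illustrative names.

```java
final class StripDemo {
    // Length after trim(), which only removes characters up to U+0020.
    static int lengthAfterTrim(String s) {
        return s.trim().length();
    }

    // Length after strip(), which removes all Unicode whitespace.
    static int lengthAfterStrip(String s) {
        return s.strip().length();
    }

    public static void main(String[] args) {
        String padded = "\u2003hello\u2003"; // EM SPACE on both sides
        System.out.println(lengthAfterTrim(padded));  // 7
        System.out.println(lengthAfterStrip(padded)); // 5
        System.out.println("\u2003".isBlank());       // true
    }
}
```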
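Java 18+
- UTF-8 is the default charset for the standard Java APIs on every platform (JEP 400), unless it is overridden with the file.encoding system property.
Java 21 (Preview)
- String Templates were a preview feature for interpolated strings. They did not change how Unicode is handled, and the feature was withdrawn in JDK 23.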
✅ Best Practices
- Always specify a charset explicitly when reading/writing strings or files.
- Use StandardCharsets.UTF_8; it's safe and widely compatible.
- Prefer codePointCount() and codePoints() when working with emojis or non-BMP characters.
- Avoid getBytes() without arguments.
- Validate user input encoding early.
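Validating input early can be sketched with a strict decoder. Unlike new String(bytes, charset), which silently substitutes replacement characters, a CharsetDecoder configured with CodingErrorAction.REPORT throws on invalid input. The names StrictDecoder, decodeUtf8Strict, and isValidUtf8 are assumptions made for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictDecoder {
    // Decodes UTF-8 and throws instead of inserting replacement characters.
    static String decodeUtf8Strict(byte[] input) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(input)).toString();
    }

    // Convenience check for callers that only need a yes/no answer.
    static boolean isValidUtf8(byte[] input) {
        try {
            decodeUtf8Strict(input);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] good = "héllo".getBytes(StandardCharsets.UTF_8);
        byte[] bad = {(byte) 0xC3, (byte) 0x28}; // lead byte without a continuation byte
        System.out.println(isValidUtf8(good)); // true
        System.out.println(isValidUtf8(bad));  // false
    }
}
```

A new decoder is created per call because CharsetDecoder instances are stateful and not safe to share between threads.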
🔚 Conclusion and Key Takeaways
- Java is Unicode-native, using UTF-16 for internal string representation.
- Understand the distinction between characters, code points, and bytes.
- Use standard APIs to control encoding and prevent data loss or corruption.
- Internationalization starts with proper string and encoding handling.
❓ FAQ
1. What encoding does Java use internally for strings?
UTF-16.
2. What's the difference between a character and a code point?
In Java, a char is a single UTF-16 code unit. A code point is the number Unicode assigns to a character, and it occupies one or two char values in a String.
3. Why does emoji.length() return 2?
Because 😊 lies outside the BMP, UTF-16 stores it as a surrogate pair, and length() counts char values rather than code points.
4. How to safely convert strings to bytes?
Use getBytes(StandardCharsets.UTF_8).
5. What happens if I read UTF-8 as ISO-8859-1?
The result will be garbled text (mojibake).
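6. Is new String(bytes) safe?
Only if the bytes happen to be in the default charset. Prefer new String(bytes, StandardCharsets.UTF_8), which names the encoding explicitly.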
7. How to count actual characters in a string?
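Use codePointCount() instead of length(). Note that a single user-perceived character, such as a flag or a family emoji, can still consist of several code points.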
8. How to detect encoding of a file?
Java doesn’t detect encoding automatically. Use external libraries or metadata.
9. Are emojis supported in Java?
Yes. Most emojis are supplementary characters, which a String stores as surrogate pairs.
10. Is UTF-8 always the best choice?
For most modern systems, yes. It is ASCII-compatible, compact for Latin-script text, and widely supported.
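Regarding question 8, one piece of metadata that can be checked without external libraries is a byte order mark (BOM) at the start of the data. The sketch below is a heuristic only: BomSniffer and fromBom are hypothetical names, many UTF-8 files carry no BOM at all, and the bytes FF FE can also begin a UTF-32LE BOM, which this example does not distinguish.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

final class BomSniffer {
    // Returns the charset implied by a leading BOM, or empty if none is found.
    static Optional<Charset> fromBom(byte[] head) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return Optional.of(StandardCharsets.UTF_8);
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return Optional.of(StandardCharsets.UTF_16BE);
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return Optional.of(StandardCharsets.UTF_16LE);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(fromBom(withBom));                  // Optional[UTF-8]
        System.out.println(fromBom(new byte[] {'h', 'i'}));    // Optional.empty
    }
}
```

An empty result means "unknown", not "not Unicode", so callers still need a documented default or other metadata to fall back on.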