How do you split text into individual characters correctly?

Introduction

When you try to split the emoji "👨‍👩‍👧‍👦" into individual characters using JavaScript's standard string methods, you get a broken result. Instead of one family emoji, you get fragments: unpaired halves of emoji, or separate person emojis and invisible joiner characters, depending on the method you use. The same problem occurs with accented letters like "é", flag emojis like "🇺🇸", and many other text elements that appear as single characters on screen.

This happens because JavaScript's built-in string splitting treats strings as sequences of UTF-16 code units rather than user-perceived characters. A single visible character can consist of multiple code units joined together. When you split by code units, you break these characters apart.

JavaScript provides the Intl.Segmenter API to handle this correctly. This lesson explains what user-perceived characters are, why standard string methods fail to split them properly, and how to use Intl.Segmenter to split text into actual characters.

What user-perceived characters are

A user-perceived character is what a person recognizes as a single character when reading text. These are called grapheme clusters in Unicode terminology. Most of the time, a grapheme cluster matches what you see as one character on screen.

The letter "a" is a grapheme cluster consisting of one Unicode code point. The emoji "😀" is also a single code point (U+1F600), but it lies outside the Basic Multilingual Plane, so JavaScript stores it as two UTF-16 code units. The family emoji "👨‍👩‍👧‍👦" is a grapheme cluster consisting of seven code points: four person emojis joined by three invisible zero-width joiner characters.

When you count characters in text, you want to count grapheme clusters, not code points or code units. When you split text into characters, you want to split at grapheme cluster boundaries, not at arbitrary positions within a cluster.

JavaScript strings are sequences of UTF-16 code units. Each code unit represents either a complete code point or part of a code point. A grapheme cluster can span multiple code points, and each code point can span multiple code units. This creates a mismatch between how JavaScript stores text and how users perceive text.
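To make the three levels concrete, the following sketch measures the same string in code units, code points, and grapheme clusters. It assumes a runtime that ships Intl.Segmenter, and it writes the family emoji with escape sequences so that the invisible joiners are visible in the source.

```javascript
// Family emoji: man, woman, girl, boy joined by zero-width joiners (U+200D).
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";

const codeUnits = family.length;       // UTF-16 code units
const codePoints = [...family].length; // Unicode code points

const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const graphemes = [...segmenter.segment(family)].length; // grapheme clusters

console.log({ codeUnits, codePoints, graphemes });
// Output: { codeUnits: 11, codePoints: 7, graphemes: 1 }
```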

Why the split method fails with complex characters

The split('') method divides a string at every code unit boundary. This works correctly for simple ASCII characters where each character is one code unit. It fails for characters that span multiple code units.

const simple = "hello";
console.log(simple.split(''));
// Output: ["h", "e", "l", "l", "o"]

Simple ASCII text splits correctly because each letter is one code unit. However, emoji and other complex characters break apart.

const emoji = "😀";
console.log(emoji.split(''));
// Output: ["\ud83d", "\ude00"]

The smiling face emoji consists of two code units. The split('') method breaks it into two separate pieces that are not valid characters on their own. When displayed, these pieces appear as replacement characters or nothing at all.

Flag emojis use regional indicator symbols that combine to form flags. Each flag requires two code points.

const flag = "🇺🇸";
console.log(flag.split(''));
// Output: ["\ud83c", "\uddfa", "\ud83c", "\uddf8"]

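The US flag emoji splits into four code units: two surrogate pairs, each encoding one regional indicator symbol. None of the four pieces is a valid character on its own, and even an intact regional indicator does not display as a flag without its partner. You need both indicators together to form the flag.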

Family emojis use zero-width joiner characters to combine multiple person emojis into one composite character.

const family = "👨‍👩‍👧‍👦";
console.log(family.split(''));
// Output: ["\ud83d", "\udc68", "\u200d", "\ud83d", "\udc69", "\u200d", "\ud83d", "\udc67", "\u200d", "\ud83d", "\udc66"]

The family emoji splits into eleven code units: four surrogate pairs, each broken in half, and three zero-width joiners. The original composite character is destroyed, and none of the person emojis survives intact.
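Iterating by code point instead of by code unit is a common partial fix, so it is worth seeing where it stops helping. The sketch below spreads the family emoji into an array, which walks the string one code point at a time. Surrogate pairs stay whole, but the cluster is still pulled apart into people and joiners.

```javascript
// Family emoji written with escapes: four people joined by U+200D.
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";

// Spreading a string iterates by code point, not by code unit.
const byCodePoint = [...family];

console.log(byCodePoint.length);
// Output: 7

// The first element is a complete man emoji, not half of a surrogate pair.
console.log(byCodePoint[0] === "\u{1F468}");
// Output: true

// The three joiners are now separate array elements.
console.log(byCodePoint.filter(ch => ch === "\u200D").length);
// Output: 3
```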

Accented letters can be represented two ways in Unicode. Some accented letters are single code points, while others combine a base letter with a combining diacritical mark.

const combined = "e\u0301"; // é written as e + combining acute accent
console.log(combined.split(''));
// Output: ["e", "\u0301"]

When the letter é is represented as two code points (base letter plus combining accent), splitting breaks it into separate pieces. The accent mark appears alone, which is not what users expect when splitting text into characters.
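The two encodings of é are easy to confuse because they render identically. As a small illustration, the sketch below lists the code points of each form. The toCodePoints helper is a hypothetical name introduced only for this example.

```javascript
// Hypothetical helper: list a string's code points in U+XXXX notation.
function toCodePoints(text) {
  return [...text].map(ch =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
  );
}

const precomposed = "\u00E9"; // é as one code point
const decomposed = "e\u0301"; // e followed by a combining acute accent

console.log(toCodePoints(precomposed));
// Output: ["U+00E9"]
console.log(toCodePoints(decomposed));
// Output: ["U+0065", "U+0301"]

// The strings look the same on screen but are not equal.
console.log(precomposed === decomposed);
// Output: false
```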

Using Intl.Segmenter to split text correctly

The Intl.Segmenter constructor creates a segmenter that divides text according to locale-specific rules. Pass a locale identifier as the first argument and an options object specifying the granularity as the second argument.

const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });

The grapheme granularity tells the segmenter to split text at grapheme cluster boundaries. This respects the structure of user-perceived characters and does not break them apart.

Call the segment() method with a string to get an iterable Segments object. Each segment it yields includes the text and position information.

const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const text = "hello";
const segments = segmenter.segment(text);

for (const segment of segments) {
  console.log(segment.segment);
}
// Output:
// "h"
// "e"
// "l"
// "l"
// "o"

Each segment object contains a segment property with the character text and an index property with its position. You can iterate directly over the segments to access each character.
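As a quick check of what each segment object carries, the sketch below destructures the properties while iterating. Besides segment and index, each object also has an input property that refers back to the original string.

```javascript
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const text = "hi";

for (const { segment, index, input } of segmenter.segment(text)) {
  // input is the full string that was passed to segment()
  console.log(segment, index, input === text);
}
// Output:
// "h" 0 true
// "i" 1 true
```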

To get an array of characters, spread the segments into an array and map to the segment text.

const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const text = "hello";
const characters = [...segmenter.segment(text)].map(s => s.segment);

console.log(characters);
// Output: ["h", "e", "l", "l", "o"]

This pattern converts the iterable of segments to an array of segment objects, then extracts just the text from each segment. The result is an array of strings, one for each grapheme cluster.
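An equivalent way to write the same conversion is Array.from with a mapping function, which maps each segment as it is read instead of building an intermediate array of segment objects. The toCharacters name below is a hypothetical helper used only for illustration.

```javascript
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });

// Hypothetical helper: Array.from applies the mapper while it iterates.
function toCharacters(text) {
  return Array.from(segmenter.segment(text), s => s.segment);
}

console.log(toCharacters("hello"));
// Output: ["h", "e", "l", "l", "o"]
```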

Splitting emoji into characters correctly

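The Intl.Segmenter API keeps emoji intact, including composite emoji that use multiple code points. Support for very recently added emoji sequences depends on the Unicode version built into the JavaScript engine.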

const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });

const emoji = "😀";
const characters = [...segmenter.segment(emoji)].map(s => s.segment);
console.log(characters);
// Output: ["😀"]

The emoji stays intact as one grapheme cluster. The segmenter recognizes that both code units belong to the same character and does not split them.

Flag emojis remain as single characters instead of breaking into regional indicators.

const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });

const flag = "🇺🇸";
const characters = [...segmenter.segment(flag)].map(s => s.segment);
console.log(characters);
// Output: ["🇺🇸"]

The two regional indicator symbols form one grapheme cluster representing the US flag. The segmenter keeps them together as one character.

Family emojis and other composite emoji stay as single characters.

const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });

const family = "👨‍👩‍👧‍👦";
const characters = [...segmenter.segment(family)].map(s => s.segment);
console.log(characters);
// Output: ["👨‍👩‍👧‍👦"]

All the person emojis and zero-width joiners form one grapheme cluster. The segmenter treats the entire family emoji as one character, preserving its appearance and meaning.
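The same behavior extends to other kinds of emoji sequences. The sketch below checks three of them: a skin tone modifier, a keycap sequence, and two flags placed side by side. The strings are written with escapes so the individual code points are visible, and the count helper is a name introduced only for this example.

```javascript
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const count = text => [...segmenter.segment(text)].length;

// Waving hand followed by a medium skin tone modifier.
const wavingWithSkinTone = "\u{1F44B}\u{1F3FD}";
// Digit one, variation selector, combining enclosing keycap.
const keycapOne = "1\uFE0F\u20E3";
// US flag followed directly by the Canadian flag (four regional indicators).
const twoFlags = "\u{1F1FA}\u{1F1F8}\u{1F1E8}\u{1F1E6}";

console.log(count(wavingWithSkinTone));
// Output: 1
console.log(count(keycapOne));
// Output: 1
console.log(count(twoFlags));
// Output: 2
```

The last case shows that regional indicators are grouped in pairs, so adjacent flags do not merge into one long cluster.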

Splitting text with accented letters

The Intl.Segmenter API correctly handles accented letters regardless of how they are encoded in Unicode.

const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });

const precomposed = "caf\u00e9"; // precomposed é (one code point)
const characters = [...segmenter.segment(precomposed)].map(s => s.segment);
console.log(characters);
// Output: ["c", "a", "f", "é"]

When the accented letter é is encoded as a single code point, the segmenter treats it as one character. This matches user expectations for how to split the word.

When the same letter is encoded as a base letter plus combining diacritical mark, the segmenter still treats it as one character.

const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });

const decomposed = "cafe\u0301"; // e + combining acute accent (two code points)
const characters = [...segmenter.segment(decomposed)].map(s => s.segment);
console.log(characters);
// Output: ["c", "a", "f", "é"]

The segmenter recognizes that the base letter and combining mark form a single grapheme cluster. The result looks identical to the precomposed version, even though the underlying encoding is different.
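Because the two encodings produce segments that look the same but hold different code units, comparing them with === can give surprising results. The sketch below shows one way to handle this, assuming that NFC normalization with the standard String.prototype.normalize method is acceptable for your data.

```javascript
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const split = text => Array.from(segmenter.segment(text), s => s.segment);

const fromPrecomposed = split("caf\u00E9"); // é as one code point
const fromDecomposed = split("cafe\u0301"); // e + combining acute accent

// Same number of characters, but the last segment differs in length.
console.log(fromPrecomposed[3].length, fromDecomposed[3].length);
// Output: 1 2

console.log(fromPrecomposed[3] === fromDecomposed[3]);
// Output: false

// Normalizing both sides to NFC makes the comparison succeed.
console.log(fromPrecomposed[3].normalize('NFC') === fromDecomposed[3].normalize('NFC'));
// Output: true
```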

This behavior is important for text processing in languages that use diacritics. Users expect accented letters to be treated as complete characters, not as separate base letters and marks.

Counting characters correctly

One common use case for splitting text is counting how many characters it contains. The split('') method gives incorrect counts for text with complex characters.

const text = "👨‍👩‍👧‍👦";
console.log(text.split('').length);
// Output: 11

The family emoji appears as one character but counts as eleven when split by code units: it has seven code points, and each of the four person emojis needs two code units. This does not match user expectations.

Using Intl.Segmenter gives accurate character counts.

const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const text = "👨‍👩‍👧‍👦";
const count = [...segmenter.segment(text)].length;
console.log(count);
// Output: 1

The segmenter recognizes the family emoji as one grapheme cluster, so the count is one. This matches what users see on screen.

You can create a helper function to count grapheme clusters in any string.

function countCharacters(text) {
  const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
  return [...segmenter.segment(text)].length;
}

console.log(countCharacters("hello"));
// Output: 5

console.log(countCharacters("café"));
// Output: 4

console.log(countCharacters("👨‍👩‍👧‍👦"));
// Output: 1

console.log(countCharacters("🇺🇸"));
// Output: 1

This function works correctly for ASCII text, accented letters, emoji, and other Unicode characters. The count matches the number of user-perceived characters as defined by the grapheme cluster rules that the JavaScript engine implements.
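If you only need the count, you can also skip building the array. The following variant is a sketch that shares one segmenter across calls and counts segments in a loop. The countGraphemes and graphemeSegmenter names are hypothetical.

```javascript
// Created once and shared by every call.
const graphemeSegmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });

function countGraphemes(text) {
  let count = 0;
  // The loop variable is unused; only the number of iterations matters.
  for (const _segment of graphemeSegmenter.segment(text)) {
    count += 1;
  }
  return count;
}

console.log(countGraphemes("hello"));
// Output: 5
console.log(countGraphemes("cafe\u0301")); // decomposed é
// Output: 4
console.log(countGraphemes(""));
// Output: 0
```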

Getting the character at a specific position

When you need to access a character at a specific position, you can convert the text to an array of grapheme clusters first.

const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const text = "Hello 👋";
const characters = [...segmenter.segment(text)].map(s => s.segment);

console.log(characters[6]);
// Output: "👋"

The waving hand emoji is at position 6 when counting grapheme clusters. Indexing the string directly with text[6] returns only "\ud83d", the first half of the emoji's surrogate pair, because the emoji spans two code units.

This approach is useful when implementing character-level operations like character picking, character highlighting, or character-by-character animations.
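When you need just one character, a loop that stops at the requested position avoids converting the whole string. The graphemeAt helper below is a hypothetical sketch of that idea; it returns undefined when the position is out of range.

```javascript
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });

// Hypothetical helper: return the grapheme cluster at a zero-based position.
function graphemeAt(text, position) {
  let current = 0;
  for (const { segment } of segmenter.segment(text)) {
    if (current === position) return segment;
    current += 1;
  }
  return undefined; // position is past the last grapheme cluster
}

const greeting = "Hello \u{1F44B}"; // "Hello 👋"

console.log(graphemeAt(greeting, 6));
// Output: "👋"
console.log(graphemeAt(greeting, 7));
// Output: undefined

// For comparison, plain indexing returns a lone surrogate.
console.log(greeting[6] === "\uD83D");
// Output: true
```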

Reversing text correctly

Reversing a string by reversing its array of code units produces incorrect results for complex characters.

const text = "Hello 👋";
console.log(text.split('').reverse().join(''));
// Output: "�� olleH"

The emoji breaks because its code units are reversed separately. The resulting string contains invalid character sequences.

Using Intl.Segmenter to reverse text preserves character integrity.

const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const text = "Hello 👋";
const characters = [...segmenter.segment(text)].map(s => s.segment);
const reversed = characters.reverse().join('');
console.log(reversed);
// Output: "👋 olleH"

Each grapheme cluster stays intact during the reversal. The emoji remains valid because its code units are not separated.
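Emoji are not the only case where reversal goes wrong. The sketch below applies both approaches to a word with a combining accent. The reverseGraphemes name is a hypothetical helper, and the results are compared against escaped strings so the position of the accent is unambiguous.

```javascript
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });

function reverseGraphemes(text) {
  return Array.from(segmenter.segment(text), s => s.segment).reverse().join('');
}

const word = "cafe\u0301"; // "café" with e + combining acute accent

// Reversing code units detaches the accent and moves it to the front.
console.log(word.split('').reverse().join('') === "\u0301efac");
// Output: true

// Reversing grapheme clusters keeps the accent on the e.
console.log(reverseGraphemes(word) === "e\u0301fac");
// Output: true
```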

Understanding the locale parameter

The Intl.Segmenter constructor accepts a locale parameter, but for grapheme segmentation, the locale has minimal impact. Grapheme cluster boundaries follow Unicode rules that are mostly language-independent.

const segmenterEn = new Intl.Segmenter('en', { granularity: 'grapheme' });
const segmenterJa = new Intl.Segmenter('ja', { granularity: 'grapheme' });

const text = "Hello 👋 こんにちは";

const charactersEn = [...segmenterEn.segment(text)].map(s => s.segment);
const charactersJa = [...segmenterJa.segment(text)].map(s => s.segment);

console.log(charactersEn);
console.log(charactersJa);
// Both outputs are identical

Different locale identifiers produce the same grapheme segmentation results. The Unicode standard defines grapheme cluster boundaries in a way that works across languages.

However, specifying a locale is still good practice for consistency with other Intl APIs, and because the Unicode rules allow implementations to tailor grapheme boundaries for specific locales.
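To see which settings a segmenter actually ended up with, you can call its resolvedOptions() method. The sketch below also shows two defaults: passing undefined as the locale falls back to the runtime's default locale, and omitting the options object gives grapheme granularity.

```javascript
const explicit = new Intl.Segmenter('en', { granularity: 'grapheme' });
console.log(explicit.resolvedOptions().granularity);
// Output: "grapheme"

// undefined as the locale means "use the runtime's default locale".
const implicit = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
console.log(typeof implicit.resolvedOptions().locale);
// Output: "string"

// 'grapheme' is the default granularity when no options are given.
const defaults = new Intl.Segmenter('en');
console.log(defaults.resolvedOptions().granularity);
// Output: "grapheme"
```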

Reusing segmenters for performance

Creating a new Intl.Segmenter instance involves loading locale data and initializing internal structures. When you need to segment multiple strings with the same settings, create the segmenter once and reuse it.

const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });

const texts = [
  "Hello 👋",
  "Café ☕",
  "World 🌍",
  "Family 👨‍👩‍👧‍👦"
];

texts.forEach(text => {
  const characters = [...segmenter.segment(text)].map(s => s.segment);
  console.log(characters);
});
// Output:
// ["H", "e", "l", "l", "o", " ", "👋"]
// ["C", "a", "f", "é", " ", "☕"]
// ["W", "o", "r", "l", "d", " ", "🌍"]
// ["F", "a", "m", "i", "l", "y", " ", "👨‍👩‍👧‍👦"]

This approach is more efficient than creating a new segmenter for each string. The difference can become noticeable when processing large amounts of text.
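In a larger codebase, one way to apply this advice is a small cache keyed by locale, so that helper functions never construct a segmenter more than once. This is a sketch under that assumption; segmenterCache and getGraphemeSegmenter are hypothetical names.

```javascript
// Hypothetical cache: one grapheme segmenter per locale string.
const segmenterCache = new Map();

function getGraphemeSegmenter(locale = 'en') {
  let segmenter = segmenterCache.get(locale);
  if (!segmenter) {
    segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
    segmenterCache.set(locale, segmenter);
  }
  return segmenter;
}

// Repeated requests for the same locale return the same instance.
console.log(getGraphemeSegmenter('en') === getGraphemeSegmenter('en'));
// Output: true
console.log(getGraphemeSegmenter('en') === getGraphemeSegmenter('ja'));
// Output: false
```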

Combining grapheme segmentation with other operations

You can combine grapheme segmentation with other string operations to build more complex text processing functions.

const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });

function truncateByCharacters(text, maxLength) {
  const characters = [...segmenter.segment(text)].map(s => s.segment);

  if (characters.length <= maxLength) {
    return text;
  }

  return characters.slice(0, maxLength).join('') + '...';
}

console.log(truncateByCharacters("Hello 👋 World", 7));
// Output: "Hello 👋..."

console.log(truncateByCharacters("Family 👨‍👩‍👧‍👦 Photo", 8));
// Output: "Family 👨‍👩‍👧‍👦..."

This truncation function counts grapheme clusters rather than code units. It preserves emoji and other complex characters when truncating, so the output never contains broken characters.
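For long inputs, a variant that stops as soon as the limit is reached may be preferable. The sketch below uses each segment's index to cut the original string at a grapheme boundary instead of joining an array. The truncateLazily name is hypothetical, and the early exit assumes the engine computes segments on demand while iterating.

```javascript
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });

function truncateLazily(text, maxLength) {
  let count = 0;
  for (const { index } of segmenter.segment(text)) {
    if (count === maxLength) {
      // index is the code unit offset where the first dropped character starts.
      return text.slice(0, index) + '...';
    }
    count += 1;
  }
  return text; // the text has maxLength characters or fewer
}

console.log(truncateLazily("Hello \u{1F44B} World", 7));
// Output: "Hello 👋..."
console.log(truncateLazily("Hello", 7));
// Output: "Hello"
```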

Working with string positions

The segment objects returned by Intl.Segmenter include an index property that indicates the position in the original string. This position is measured in code units, not grapheme clusters.

const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const text = "Hello 👋";

for (const segment of segmenter.segment(text)) {
  console.log(`Character "${segment.segment}" starts at position ${segment.index}`);
}
// Output:
// Character "H" starts at position 0
// Character "e" starts at position 1
// Character "l" starts at position 2
// Character "l" starts at position 3
// Character "o" starts at position 4
// Character " " starts at position 5
// Character "👋" starts at position 6

The waving hand emoji starts at code unit position 6 and occupies positions 6 and 7 in the underlying string, so the next character would start at position 8. This information is useful when you need to map between grapheme positions and string positions for operations like substring extraction.
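The mapping also works in the other direction. The object returned by segment() has a containing() method that takes a code unit index and returns the segment covering it, which is handy for things like snapping a cursor position to a character boundary. A minimal sketch:

```javascript
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const text = "Hello \u{1F44B}!"; // "Hello 👋!"
const segments = segmenter.segment(text);

// Index 7 is the second code unit of the emoji.
const hit = segments.containing(7);
console.log(hit.index, hit.segment === "\u{1F44B}");
// Output: 6 true

// A segment ends at its index plus its length in code units.
const end = hit.index + hit.segment.length;
console.log(end, text.slice(end));
// Output: 8 "!"

// An index outside the string yields undefined.
console.log(segments.containing(text.length));
// Output: undefined
```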

Handling empty strings and edge cases

The Intl.Segmenter API handles empty strings and other edge cases correctly.

const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });

const empty = "";
const characters = [...segmenter.segment(empty)].map(s => s.segment);
console.log(characters);
// Output: []

An empty string produces an empty array of segments. No special handling is required.

Whitespace characters are generally treated as separate grapheme clusters.

const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });

const whitespace = "a b\tc\nd";
const characters = [...segmenter.segment(whitespace)].map(s => s.segment);
console.log(characters);
// Output: ["a", " ", "b", "\t", "c", "\n", "d"]

Spaces, tabs, and newlines each form their own grapheme clusters. This matches user expectations for character-level text processing.
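One edge case worth checking is a Windows-style line ending. The Unicode grapheme rules keep a carriage return followed by a line feed together, so the pair comes back as a single segment, as this sketch shows.

```javascript
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });

const lines = "a\r\nb";
const characters = Array.from(segmenter.segment(lines), s => s.segment);

console.log(characters);
// Output: ["a", "\r\n", "b"]

// Four code units, three grapheme clusters.
console.log(lines.length, characters.length);
// Output: 4 3
```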