Intl.Segmenter API

How to count characters, split words, and segment sentences correctly in JavaScript

Introduction

JavaScript's string.length property counts code units, not user-perceived characters. When users type emoji, accented characters, or text in complex scripts, string.length returns the wrong count. The split() method fails for languages that do not use spaces between words. Regular expression word boundaries do not work for Chinese, Japanese, or Thai text.

The Intl.Segmenter API solves these problems. It segments text according to Unicode standards, respecting the linguistic rules of each language. You can count graphemes (user-perceived characters), split text into words regardless of the language, or break text into sentences.

This article explains why basic string operations fail for international text, what grapheme clusters and linguistic boundaries are, and how to use Intl.Segmenter to handle text correctly for all users.

Why string.length fails for character counting

JavaScript strings use UTF-16 encoding. Each element in a JavaScript string is a 16-bit code unit, not a complete character. The string.length property counts these code units.

For basic ASCII characters, one code unit equals one character. The string "hello" has a length of 5, which matches user expectations.

For many other characters, this breaks down. Consider these examples:

"😀".length; // 2, not 1
"👨‍👩‍👧‍👦".length; // 11, not 1
"किंतु".length; // 5, not 2
"🇺🇸".length; // 4, not 1

Users see one emoji, one family emoji, two Hindi syllables, or one flag. JavaScript counts the underlying code units.

This matters when you build character counters for text inputs, validate length limits, or truncate text for display. The count JavaScript reports does not match what users see.

What grapheme clusters are

A grapheme cluster is what users perceive as a single character. It might consist of:

  • A single code point like "a"
  • A base character plus combining marks like "é" (e + combining acute accent)
  • Multiple code points joined together like "👨‍👩‍👧‍👦" (man + woman + girl + boy joined with zero-width joiners)
  • Emoji with skin tone modifiers like "👋🏽" (waving hand + medium skin tone)
  • Regional indicator sequences for flags like "🇺🇸" (regional indicator U + regional indicator S)

The Unicode Standard defines extended grapheme clusters in UAX 29. These rules determine where users expect boundaries between characters. When a user presses backspace, they expect to delete one grapheme cluster. When a cursor moves, it should move by grapheme clusters.

JavaScript's string.length does not count grapheme clusters. The Intl.Segmenter API does.
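
Editing operations built on code units get this wrong. As a sketch (using the grapheme granularity of Intl.Segmenter, with a helper name of my own choosing), deleting the last user-perceived character looks like this:

```javascript
// Sketch: remove the last user-perceived character from a string.
// A naive text.slice(0, -1) strips a single code unit, which can leave
// half a surrogate pair or a partial emoji sequence behind.
function deleteLastGrapheme(text) {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  const segments = Array.from(segmenter.segment(text));
  if (segments.length === 0) return text;
  const last = segments[segments.length - 1];
  return text.slice(0, last.index);
}

deleteLastGrapheme("hi👋🏽"); // "hi" — the whole emoji is removed, not half of it
```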

Counting grapheme clusters with Intl.Segmenter

Create a segmenter with grapheme granularity to count user-perceived characters:

const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const text = "Hello 👋🏽";
const segments = segmenter.segment(text);
const graphemes = Array.from(segments);

console.log(graphemes.length); // 7
console.log(text.length); // 10

The user sees seven characters: five letters, one space, and one emoji. The grapheme segmenter returns seven segments. JavaScript's string.length returns ten because the emoji uses four code units.

Each segment object contains:

  • segment: the grapheme cluster as a string
  • index: the position in the original string where this segment starts
  • input: a reference to the original string being segmented
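
To see how the index property counts code units rather than graphemes, log the segments of a string that contains an emoji (a short sketch):

```javascript
const demoSegmenter = new Intl.Segmenter("en", { granularity: "grapheme" });

for (const { segment, index } of demoSegmenter.segment("a👍b")) {
  console.log(segment, index);
}
// "a" 0
// "👍" 1
// "b" 3 — the emoji occupies code units 1 and 2, so "b" starts at index 3
```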

You can iterate over segments with for...of:

const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const text = "café";

for (const { segment } of segmenter.segment(text)) {
  console.log(segment);
}
// Logs: "c", "a", "f", "é"

Building a character counter that works internationally

Use grapheme segmentation to build accurate character counters:

function getGraphemeCount(text) {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  return Array.from(segmenter.segment(text)).length;
}

// Test with various inputs
getGraphemeCount("hello"); // 5
getGraphemeCount("hello 😀"); // 7
getGraphemeCount("👨‍👩‍👧‍👦"); // 1
getGraphemeCount("किंतु"); // 2
getGraphemeCount("🇺🇸"); // 1

This function returns counts that match user perception. A user who types a family emoji sees one character, and the counter shows one character.

For text input validation, use grapheme counts instead of string.length:

function validateInput(text, maxGraphemes) {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  const count = Array.from(segmenter.segment(text)).length;
  return count <= maxGraphemes;
}

Truncating text safely with grapheme segmentation

When truncating text for display, you must not cut through a grapheme cluster. Cutting at an arbitrary code unit index can split emoji or combining character sequences, producing invalid or broken output.

Use grapheme segmentation to find safe truncation points:

function truncateText(text, maxGraphemes) {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  const segments = Array.from(segmenter.segment(text));

  if (segments.length <= maxGraphemes) {
    return text;
  }

  const truncated = segments
    .slice(0, maxGraphemes)
    .map(s => s.segment)
    .join("");

  return truncated + "…";
}

truncateText("Hello 👨‍👩‍👧‍👦 world", 7); // "Hello 👨‍👩‍👧‍👦…"
truncateText("Hello world", 7); // "Hello w…"

This preserves complete grapheme clusters and produces valid Unicode output.

Why split() and regex fail for word segmentation

The common approach to splitting text into words uses split() with a space or whitespace pattern:

const text = "Hello world";
const words = text.split(" "); // ["Hello", "world"]

This works for English and other languages that separate words with spaces. It fails completely for languages that do not use spaces between words.

Chinese, Japanese, and Thai text does not include spaces between words. Splitting on spaces returns the entire string as one element:

const text = "你好世界"; // "Hello world" in Chinese
const words = text.split(" "); // ["你好世界"]

A reader sees two distinct words, 你好 and 世界, but split() returns one element.

Regular expression word boundaries (\b) also fail for these languages. The \b assertion is defined in terms of \w, which matches only ASCII letters, digits, and underscore, so it finds no word boundaries in Chinese, Japanese, or Thai text.
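
A quick sketch shows both behaviors side by side — because \w matches no CJK characters, a \b-based pattern finds nothing at all in Chinese text:

```javascript
const englishWords = "Hello world".match(/\b\w+\b/g);
console.log(englishWords); // ["Hello", "world"]

const chineseWords = "你好世界".match(/\b\w+\b/g);
console.log(chineseWords); // null — \w does not match CJK characters
```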

How word segmentation works across languages

The Intl.Segmenter API uses the Unicode word boundary rules defined in UAX 29, supplemented in practice by dictionary-based segmentation for scripts that do not separate words with spaces, such as Chinese, Japanese, and Thai.

Create a segmenter with word granularity:

const segmenter = new Intl.Segmenter("zh", { granularity: "word" });
const text = "你好世界";
const segments = Array.from(segmenter.segment(text));

segments.forEach(({ segment, isWordLike }) => {
  console.log(segment, isWordLike);
});
// "你好" true
// "世界" true

The segmenter correctly identifies word boundaries based on the locale and script. The isWordLike property indicates whether the segment is a word (letters, numbers, ideographs) or non-word content (spaces, punctuation).

For English text, the segmenter returns both words and spaces:

const segmenter = new Intl.Segmenter("en", { granularity: "word" });
const text = "Hello world!";
const segments = Array.from(segmenter.segment(text));

segments.forEach(({ segment, isWordLike }) => {
  console.log(segment, isWordLike);
});
// "Hello" true
// " " false
// "world" true
// "!" false

Use the isWordLike property to filter word segments from punctuation and whitespace:

function getWords(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const segments = segmenter.segment(text);
  return Array.from(segments)
    .filter(s => s.isWordLike)
    .map(s => s.segment);
}

getWords("Hello, world!", "en"); // ["Hello", "world"]
getWords("你好世界", "zh"); // ["你好", "世界"]
getWords("สวัสดีครับ", "th"); // ["สวัสดี", "ครับ"] (Thai)

This function works for any language, handling both space-separated and non-space-separated scripts.

Counting words accurately

Build a word counter that works internationally:

function countWords(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const segments = segmenter.segment(text);
  return Array.from(segments).filter(s => s.isWordLike).length;
}

countWords("Hello world", "en"); // 2
countWords("你好世界", "zh"); // 2
countWords("Bonjour le monde", "fr"); // 3

This produces accurate word counts for content in any language.

Finding which word contains a cursor position

The containing() method finds the segment that includes a specific index in the string. This is useful for determining which word the cursor is in or which segment contains a click position.

const segmenter = new Intl.Segmenter("en", { granularity: "word" });
const text = "Hello world";
const segments = segmenter.segment(text);

const segment = segments.containing(7); // Index 7 is in "world"
console.log(segment);
// { segment: "world", index: 6, isWordLike: true }

If the index is within whitespace or punctuation, containing() returns that segment:

const segment = segments.containing(5); // Index 5 is the space
console.log(segment);
// { segment: " ", index: 5, isWordLike: false }

Use this for text editing features, search highlighting, or contextual actions based on cursor position.
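
For example, a double-click word-selection feature can map a click's string index to the word's start and end offsets. This is a sketch; the helper name is my own:

```javascript
// Sketch: find the word (and its offsets) at a given index in a string.
function wordRangeAt(text, index, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const seg = segmenter.segment(text).containing(index);
  if (seg === undefined) return null; // index is outside the string
  return {
    word: seg.segment,
    start: seg.index,
    end: seg.index + seg.segment.length, // code unit offsets, ready for slice()
  };
}

wordRangeAt("Hello world", 7); // { word: "world", start: 6, end: 11 }
```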

Segmenting sentences for text processing

Sentence segmentation splits text at sentence boundaries. This is useful for summarization, text-to-speech processing, or navigating long documents.

Basic approaches like splitting on periods fail because periods appear in abbreviations, numbers, and other contexts that are not sentence boundaries:

const text = "Dr. Smith bought 100.5 shares. He sold them later.";
text.split(". "); // ["Dr", "Smith bought 100.5 shares", "He sold them later."] — wrongly splits after "Dr."

The Intl.Segmenter API understands sentence boundary rules:

const segmenter = new Intl.Segmenter("en", { granularity: "sentence" });
const text = "She bought 100.5 shares. He sold them later.";
const segments = Array.from(segmenter.segment(text));

segments.forEach(({ segment }) => {
  console.log(segment);
});
// "She bought 100.5 shares. "
// "He sold them later."

The segmenter correctly treats "100.5" as a decimal number, not a sentence boundary. Abbreviations are less predictable: the base UAX 29 rules break after an abbreviation like "Dr." when an uppercase word follows, and ECMA-402 leaves abbreviation suppression to the implementation, so test abbreviation-heavy text in the engines you target.

For multilingual text, sentence boundaries vary by locale. The API handles these differences:

const segmenterEn = new Intl.Segmenter("en", { granularity: "sentence" });
const segmenterJa = new Intl.Segmenter("ja", { granularity: "sentence" });

const textEn = "Hello. How are you?";
const textJa = "こんにちは。お元気ですか。"; // Uses Japanese full stop

Array.from(segmenterEn.segment(textEn)).length; // 2
Array.from(segmenterJa.segment(textJa)).length; // 2

When to use each granularity

Choose the granularity based on what you need to count or split:

  • Grapheme: Use for character counting, text truncation, cursor positioning, or any operation where you need to match user perception of characters.

  • Word: Use for word counting, search and highlighting, text analysis, or any operation that needs linguistic word boundaries across languages.

  • Sentence: Use for text-to-speech segmentation, summarization, document navigation, or any operation that processes text sentence by sentence.

Do not use grapheme segmentation when you need word boundaries, and do not use word segmentation when you need character counts. Each granularity serves a distinct purpose.

Creating and reusing segmenters

Constructing a segmenter has a one-time setup cost, so create segmenters once and reuse them when processing many strings:

const graphemeSegmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const wordSegmenter = new Intl.Segmenter("en", { granularity: "word" });

// Reuse these segmenters for multiple strings
function processTexts(texts) {
  return texts.map(text => ({
    text,
    graphemes: Array.from(graphemeSegmenter.segment(text)).length,
    words: Array.from(wordSegmenter.segment(text)).filter(s => s.isWordLike).length
  }));
}

The segmenter caches locale data, so reusing the same instance avoids repeated initialization.
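
When an application segments text in many locales, one option is a small cache keyed by locale and granularity — a sketch, with names of my own choosing:

```javascript
// Sketch: lazily create and reuse one segmenter per locale/granularity pair.
const segmenterCache = new Map();

function getSegmenter(locale, granularity) {
  const key = `${locale}:${granularity}`;
  let segmenter = segmenterCache.get(key);
  if (segmenter === undefined) {
    segmenter = new Intl.Segmenter(locale, { granularity });
    segmenterCache.set(key, segmenter);
  }
  return segmenter;
}

getSegmenter("en", "word") === getSegmenter("en", "word"); // true — same instance
```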

Checking browser support

The Intl.Segmenter API reached Baseline status in April 2024. It works in current versions of Chrome, Firefox, Safari, and Edge. Older browsers do not support it.

Check for support before using:

if (typeof Intl.Segmenter !== "undefined") {
  // Use Intl.Segmenter
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  // ...
} else {
  // Fallback for older browsers
  const count = text.length; // Not accurate, but available
}

For production applications targeting older browsers, consider using a polyfill or providing degraded functionality.
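
One possible degraded fallback counts code points with Array.from, which at least keeps surrogate pairs together even though it still over-counts joined emoji sequences (a sketch; the helper name is my own):

```javascript
function countCharacters(text) {
  if (typeof Intl !== "undefined" && typeof Intl.Segmenter !== "undefined") {
    const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
    return Array.from(segmenter.segment(text)).length;
  }
  // Fallback: code points, not graphemes. "😀" counts as 1,
  // but "👨‍👩‍👧‍👦" would count as 7 (four people plus three joiners).
  return Array.from(text).length;
}
```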

Common mistakes to avoid

Do not use string.length for displaying character counts to users. It produces incorrect results for emoji, combining characters, and complex scripts.

Do not split on spaces or use regex word boundaries for multilingual word segmentation. These approaches only work for a subset of languages.

Do not assume word or sentence boundaries are the same across languages. Use locale-aware segmentation.

Do not forget to check the isWordLike property when counting words. Including punctuation and whitespace produces inflated counts.

Do not cut strings at arbitrary indices when truncating. Always cut at grapheme cluster boundaries to avoid producing invalid Unicode sequences.

When not to use Intl.Segmenter

For simple ASCII-only operations where you know the text contains only basic Latin characters, basic string methods are faster and sufficient.

When you need the byte length of a string for network operations or storage, use TextEncoder:

const byteLength = new TextEncoder().encode(text).length;

When you need the actual code unit count for low-level string manipulation, string.length is correct. This is rare in application code.

For most text processing that involves user-facing content, especially in international applications, use Intl.Segmenter.

How Intl.Segmenter relates to other internationalization APIs

The Intl.Segmenter API is part of the ECMAScript Internationalization API. Other APIs in this family include:

  • Intl.DateTimeFormat: Format dates and times according to locale
  • Intl.NumberFormat: Format numbers, currencies, and units according to locale
  • Intl.Collator: Sort and compare strings according to locale
  • Intl.PluralRules: Determine plural forms for numbers in different languages

Together, these APIs provide the tools needed to build applications that work correctly for users worldwide. Use Intl.Segmenter for text segmentation, and use the other Intl APIs for formatting and comparison.

Practical example: building a text statistics component

Combine grapheme and word segmentation to build a text statistics component:

function getTextStatistics(text, locale) {
  const graphemeSegmenter = new Intl.Segmenter(locale, {
    granularity: "grapheme"
  });
  const wordSegmenter = new Intl.Segmenter(locale, {
    granularity: "word"
  });
  const sentenceSegmenter = new Intl.Segmenter(locale, {
    granularity: "sentence"
  });

  const graphemes = Array.from(graphemeSegmenter.segment(text));
  const words = Array.from(wordSegmenter.segment(text))
    .filter(s => s.isWordLike);
  const sentences = Array.from(sentenceSegmenter.segment(text));

  return {
    characters: graphemes.length,
    words: words.length,
    sentences: sentences.length,
    averageWordLength: words.length > 0
      ? graphemes.length / words.length
      : 0
  };
}

// Works for any language
getTextStatistics("Hello world! How are you?", "en");
// { characters: 25, words: 5, sentences: 2, averageWordLength: 5 }

getTextStatistics("你好世界!你好吗?", "zh");
// { characters: 9, words: 5, sentences: 2, averageWordLength: 1.8 }

This function produces meaningful statistics for text in any language, using the correct segmentation rules for each locale.