How do I split text into words in JavaScript?

Use Intl.Segmenter to extract words from text in any language, including languages without spaces between words.

Introduction

When you need to extract words from text, the common approach is to split on spaces using split(" "). This works for English, but fails completely for languages that do not use spaces between words. Chinese, Japanese, Thai, and other languages write text continuously without word separators, yet users perceive distinct words in that text.

The Intl.Segmenter API solves this problem. It identifies word boundaries according to Unicode standards and linguistic rules for each language. You can extract words from text regardless of whether the language uses spaces, and the segmenter handles the complexity of determining where words begin and end.

This article explains why basic string splitting fails for international text, how word boundaries work across different writing systems, and how to use Intl.Segmenter to split text into words correctly for all languages.

Why splitting on spaces fails

The split() method breaks a string at each occurrence of a separator. For English text, splitting on spaces extracts words.

const text = "Hello world";
const words = text.split(" ");
console.log(words);
// ["Hello", "world"]

This approach assumes words are separated by spaces. Many languages do not follow this pattern.

Chinese text does not include spaces between words.

const text = "你好世界";
const words = text.split(" ");
console.log(words);
// ["你好世界"]

The user sees two distinct words, but split() returns the entire string as a single element because there are no spaces to split on.

Japanese text mixes multiple scripts and does not use spaces between words.

const text = "今日は良い天気です";
const words = text.split(" ");
console.log(words);
// ["今日は良い天気です"]

This sentence contains multiple words, but splitting on spaces produces one element.

Thai text also writes words continuously without spaces.

const text = "สวัสดีครับ";
const words = text.split(" ");
console.log(words);
// ["สวัสดีครับ"]

The text contains two words, but split() returns one element.

For these languages, you need a different approach to identify word boundaries.

Why regular expressions fail for word boundaries

Regular expression word boundaries use the \b pattern to match positions between word and non-word characters. This works for English.

const text = "Hello world!";
const words = text.match(/\b\w+\b/g);
console.log(words);
// ["Hello", "world"]

This pattern fails for languages without spaces, but not in the way you might expect. The \w class matches only ASCII letters, digits, and underscore, so it does not match Chinese, Japanese, or Thai characters at all.

const text = "你好世界";
const words = text.match(/\b\w+\b/g);
console.log(words);
// null

The regex finds no matches because \w does not cover these scripts. A Unicode-aware pattern such as /\p{L}+/gu matches the characters but returns the entire string as a single run, because the regex engine has no concept of Chinese word boundaries.

Even for English, regex patterns can produce incorrect results with punctuation, contractions, or special characters. Regular expressions are not designed to handle linguistic word segmentation across all writing systems.
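
For instance, because \w does not match the apostrophe, the same \b\w+\b pattern splits even a simple English contraction:

```javascript
// \w matches only [A-Za-z0-9_], so the apostrophe in "don't"
// breaks the match into two pieces.
const words = "don't".match(/\b\w+\b/g);
console.log(words);
// ["don", "t"]
```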

What word boundaries are across languages

A word boundary is a position in text where one word ends and another begins. Different writing systems use different conventions for word boundaries.

Space-separated languages like English, Spanish, French, and German use spaces to mark word boundaries. The word "hello" is separated from "world" by a space.

Scriptio continua languages like Chinese, Japanese, and Thai do not use spaces between words. Word boundaries exist based on semantic and morphological rules, but these boundaries are not marked visually in the text. A Chinese reader recognizes where one word ends and another begins through familiarity with the language, not through visual separators.

Some languages use mixed conventions. Japanese combines kanji, hiragana, and katakana characters, and word boundaries occur at transitions between character types or based on grammatical structure.

The Unicode Standard defines word boundary rules in UAX 29. These rules specify how to identify word boundaries for all scripts. The rules consider character properties, script types, and linguistic patterns to determine where words begin and end.

Using Intl.Segmenter to split text into words

The Intl.Segmenter constructor creates a segmenter object that splits text according to Unicode rules. You specify a locale and a granularity.

const segmenter = new Intl.Segmenter("en", { granularity: "word" });
const text = "Hello world!";
const segments = segmenter.segment(text);

The first argument is the locale identifier. The second argument is an options object where granularity: "word" tells the segmenter to split at word boundaries.

The segment() method returns an iterable object containing segments. You can iterate over segments using for...of.

const segmenter = new Intl.Segmenter("en", { granularity: "word" });
const text = "Hello world!";

for (const segment of segmenter.segment(text)) {
  console.log(segment);
}
// { segment: "Hello", index: 0, input: "Hello world!", isWordLike: true }
// { segment: " ", index: 5, input: "Hello world!", isWordLike: false }
// { segment: "world", index: 6, input: "Hello world!", isWordLike: true }
// { segment: "!", index: 11, input: "Hello world!", isWordLike: false }

Each segment object contains properties:

  • segment: the text of this segment
  • index: the position in the original string where this segment starts
  • input: the original string being segmented
  • isWordLike: true when the segment contains word content, false for spaces, punctuation, and other separators
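
The segments partition the entire input: every character belongs to exactly one segment, so concatenating the segments in order reproduces the original string. A quick check:

```javascript
// Word segmentation is lossless: joining all segments, word-like
// or not, yields the original input.
const segmenter = new Intl.Segmenter("en", { granularity: "word" });
const text = "Hello world!";
const parts = Array.from(segmenter.segment(text), s => s.segment);
console.log(parts.join("") === text);
// true
```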

Understanding the isWordLike property

When you segment text by words, the segmenter returns both word segments and non-word segments. Words include letters, numbers, and ideographic characters. Non-word segments include spaces, punctuation, and other separators.

The isWordLike property indicates whether a segment is a word. This property is true for segments that contain word characters, and false for segments that contain only spaces, punctuation, or other non-word characters.

const segmenter = new Intl.Segmenter("en", { granularity: "word" });
const text = "Hello, world!";

for (const { segment, isWordLike } of segmenter.segment(text)) {
  console.log(segment, isWordLike);
}
// "Hello" true
// "," false
// " " false
// "world" true
// "!" false

Use the isWordLike property to filter word segments from punctuation and whitespace. This gives you just the words without separators.

const segmenter = new Intl.Segmenter("en", { granularity: "word" });
const text = "Hello, world!";
const segments = segmenter.segment(text);
const words = Array.from(segments)
  .filter(s => s.isWordLike)
  .map(s => s.segment);

console.log(words);
// ["Hello", "world"]

This pattern works for any language, including those without spaces.

Extracting words from text without spaces

The segmenter correctly identifies word boundaries in languages that do not use spaces. For Chinese text, the segmenter splits at word boundaries based on Unicode rules and linguistic patterns.

const segmenter = new Intl.Segmenter("zh", { granularity: "word" });
const text = "你好世界";

for (const { segment, isWordLike } of segmenter.segment(text)) {
  console.log(segment, isWordLike);
}
// "你好" true
// "世界" true

The segmenter identifies two words in this text. There are no spaces, but the segmenter understands Chinese word boundaries and splits the text appropriately.

For Japanese text, the segmenter handles the complexity of mixed scripts and identifies word boundaries.

const segmenter = new Intl.Segmenter("ja", { granularity: "word" });
const text = "今日は良い天気です";

for (const { segment, isWordLike } of segmenter.segment(text)) {
  console.log(segment, isWordLike);
}
// "今日" true
// "は" true
// "良い" true
// "天気" true
// "です" true

The segmenter splits this sentence into five word segments. It recognizes that particles like "は" are separate words and that multi-character words like "天気" form single units.

For Thai text, the segmenter identifies word boundaries without spaces.

const segmenter = new Intl.Segmenter("th", { granularity: "word" });
const text = "สวัสดีครับ";

for (const { segment, isWordLike } of segmenter.segment(text)) {
  console.log(segment, isWordLike);
}
// "สวัสดี" true
// "ครับ" true

The segmenter correctly identifies two words in this greeting.

Building a word extraction function

Create a function that extracts words from text in any language.

function getWords(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const segments = segmenter.segment(text);
  return Array.from(segments)
    .filter(s => s.isWordLike)
    .map(s => s.segment);
}

This function works for space-separated and non-space-separated languages.

getWords("Hello, world!", "en");
// ["Hello", "world"]

getWords("你好世界", "zh");
// ["你好", "世界"]

getWords("今日は良い天気です", "ja");
// ["今日", "は", "良い", "天気", "です"]

getWords("Bonjour le monde!", "fr");
// ["Bonjour", "le", "monde"]

getWords("สวัสดีครับ", "th");
// ["สวัสดี", "ครับ"]

The function returns an array of words regardless of the language or writing system.

Counting words accurately across languages

Build a word counter that works for all languages by counting word-like segments.

function countWords(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const segments = segmenter.segment(text);
  return Array.from(segments).filter(s => s.isWordLike).length;
}

This function produces accurate word counts for text in any language.

countWords("Hello world", "en");
// 2

countWords("你好世界", "zh");
// 2

countWords("今日は良い天気です", "ja");
// 5

countWords("Bonjour le monde", "fr");
// 3

countWords("สวัสดีครับ", "th");
// 2

The counts match user perception of word boundaries in each language.

Finding which word contains a position

The containing() method finds the segment that includes a specific index in the string. This is useful for determining which word the cursor is in or which word was clicked.

const segmenter = new Intl.Segmenter("en", { granularity: "word" });
const text = "Hello world";
const segments = segmenter.segment(text);

const segment = segments.containing(7);
console.log(segment);
// { segment: "world", index: 6, input: "Hello world", isWordLike: true }

Index 7 falls within the word "world", which starts at index 6. The method returns the segment object for that word.

If the index falls within whitespace or punctuation, the method returns that segment with isWordLike: false.

const segment = segments.containing(5);
console.log(segment);
// { segment: " ", index: 5, input: "Hello world", isWordLike: false }

Use this for text editor features like double-click word selection, contextual menus based on cursor position, or highlighting the current word.
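
As a sketch of double-click word selection, a helper can return the word and its bounds at a given position. The name wordAt and its return shape are illustrative, not part of the Intl API.

```javascript
// Return the word containing `index`, with its start and end offsets,
// or null when the position falls on whitespace or punctuation.
function wordAt(text, index, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const seg = segmenter.segment(text).containing(index);
  if (!seg || !seg.isWordLike) return null;
  return { word: seg.segment, start: seg.index, end: seg.index + seg.segment.length };
}

console.log(wordAt("Hello world", 7));
// { word: "world", start: 6, end: 11 }
```

The start and end offsets can be passed directly to selection APIs that take character ranges.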

Handling punctuation and contractions

The segmenter treats punctuation as separate segments, but it keeps English contractions together: under the UAX 29 rules, an apostrophe between two letters does not break a word.

const segmenter = new Intl.Segmenter("en", { granularity: "word" });
const text = "I can't do it.";

for (const { segment, isWordLike } of segmenter.segment(text)) {
  console.log(segment, isWordLike);
}
// "I" true
// " " false
// "can't" true
// " " false
// "do" true
// " " false
// "it" true
// "." false

The contraction "can't" remains a single word segment because the apostrophe sits between two letters. An apostrophe at the edge of a word, such as a trailing possessive, is emitted as a separate non-word segment.

Because contractions stay intact, counting word-like segments produces word counts that match how readers perceive the text.

How locale affects word segmentation

The locale you pass to the segmenter affects how word boundaries are determined. Different locales may have different rules for the same text.

For languages with well-defined word boundary rules, the locale ensures the correct rules are applied.

const segmenterEn = new Intl.Segmenter("en", { granularity: "word" });
const segmenterZh = new Intl.Segmenter("zh", { granularity: "word" });

const text = "你好世界";

const wordsEn = Array.from(segmenterEn.segment(text))
  .filter(s => s.isWordLike)
  .map(s => s.segment);

const wordsZh = Array.from(segmenterZh.segment(text))
  .filter(s => s.isWordLike)
  .map(s => s.segment);

console.log(wordsEn);
// ["你好世界"]

console.log(wordsZh);
// ["你好", "世界"]

The English locale does not recognize Chinese word boundaries and treats the entire string as one word. The Chinese locale applies Chinese word boundary rules and correctly identifies two words.

Always use the appropriate locale for the language of the text being segmented.

Creating reusable segmenters for performance

Constructing a segmenter loads locale data, so creating a new instance for every call adds avoidable overhead. Reuse segmenters across multiple strings for better performance.

const enSegmenter = new Intl.Segmenter("en", { granularity: "word" });
const zhSegmenter = new Intl.Segmenter("zh", { granularity: "word" });
const jaSegmenter = new Intl.Segmenter("ja", { granularity: "word" });

function getWords(text, locale) {
  const segmenter = locale === "zh" ? zhSegmenter
    : locale === "ja" ? jaSegmenter
    : enSegmenter;

  return Array.from(segmenter.segment(text))
    .filter(s => s.isWordLike)
    .map(s => s.segment);
}

This approach creates segmenters once and reuses them for all calls to getWords(). The segmenter caches locale data, so reusing instances avoids repeated initialization.
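
One way to extend this to arbitrary locales is a small cache keyed by locale. The cache shape here is an implementation sketch, not part of the Intl API.

```javascript
// Lazily create one word segmenter per locale and reuse it
// for every subsequent call with that locale.
const segmenterCache = new Map();

function getSegmenter(locale) {
  let segmenter = segmenterCache.get(locale);
  if (!segmenter) {
    segmenter = new Intl.Segmenter(locale, { granularity: "word" });
    segmenterCache.set(locale, segmenter);
  }
  return segmenter;
}

function getWords(text, locale) {
  return Array.from(getSegmenter(locale).segment(text))
    .filter(s => s.isWordLike)
    .map(s => s.segment);
}
```

Unlike a hard-coded set of segmenters, this handles any locale without editing the function.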

Practical example: building a word frequency analyzer

Combine word segmentation with counting to analyze word frequency in text.

function getWordFrequency(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const segments = segmenter.segment(text);
  const words = Array.from(segments)
    .filter(s => s.isWordLike)
    .map(s => s.segment.toLowerCase());

  const frequency = {};
  for (const word of words) {
    frequency[word] = (frequency[word] || 0) + 1;
  }

  return frequency;
}

const text = "Hello world! Hello everyone in this world.";
const frequency = getWordFrequency(text, "en");
console.log(frequency);
// { hello: 2, world: 2, everyone: 1, in: 1, this: 1 }

This function splits text into words, normalizes to lowercase, and counts occurrences. It works for any language.

const textZh = "你好世界!你好大家!";
const frequencyZh = getWordFrequency(textZh, "zh");
console.log(frequencyZh);
// { "你好": 2, "世界": 1, "大家": 1 }

The same logic handles Chinese text without modification.
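
To turn a frequency table into a ranking, a small helper can sort the entries by count. The name topWords is illustrative, assuming a { word: count } object like the ones returned above.

```javascript
// Return the n most frequent words from a { word: count } table.
// Array.prototype.sort is stable, so ties keep their original order.
function topWords(frequency, n) {
  return Object.entries(frequency)
    .sort((a, b) => b[1] - a[1])
    .slice(0, n)
    .map(([word]) => word);
}

console.log(topWords({ hello: 2, world: 2, everyone: 1, in: 1, this: 1 }, 2));
// ["hello", "world"]
```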

Checking browser support

The Intl.Segmenter API reached Baseline status in April 2024. It works in current versions of Chrome, Firefox, Safari, and Edge. Older browsers do not support it.

Check for support before using the API.

if (typeof Intl.Segmenter !== "undefined") {
  const segmenter = new Intl.Segmenter("en", { granularity: "word" });
  // Use segmenter
} else {
  // Fallback for older browsers
}

For production applications targeting older browsers, provide a fallback implementation. A simple fallback splits on whitespace, which works for space-separated languages but returns text without spaces as a single word.

function getWords(text, locale) {
  if (typeof Intl.Segmenter !== "undefined") {
    const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
    return Array.from(segmenter.segment(text))
      .filter(s => s.isWordLike)
      .map(s => s.segment);
  }

  // Fallback: only works for space-separated languages
  return text.split(/\s+/).filter(word => word.length > 0);
}

This ensures your code runs in older browsers, though with reduced functionality for non-space-separated languages.

Common mistakes to avoid

Do not split on spaces or regex patterns for multilingual text. These approaches only work for a subset of languages and fail for Chinese, Japanese, Thai, and other languages without spaces.

Do not forget to filter by isWordLike when extracting words. Without this filter, you get spaces, punctuation, and other non-word segments in your results.

Do not use the wrong locale when segmenting text. The locale determines which word boundary rules apply. Using an English locale for Chinese text produces incorrect results.

Do not assume all languages define words the same way. Word boundaries vary by writing system and linguistic convention. Use locale-aware segmentation to handle these differences.

Do not count words using split(" ").length for international text. This only works for space-separated languages and produces wrong counts for others.

When to use word segmentation

Use word segmentation when you need to:

  • Count words in user-generated content across multiple languages
  • Implement search and highlight features that work with any writing system
  • Build text analysis tools that process international text
  • Create word-based navigation or editing features in text editors
  • Extract keywords or terms from multilingual documents
  • Validate word count limits in forms that accept any language
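
For the last case, a validator can stop iterating as soon as the limit is exceeded instead of materializing every segment. The function name and signature are illustrative.

```javascript
// Check a word-count limit without building an intermediate array;
// iteration stops as soon as the limit is exceeded.
function isWithinWordLimit(text, locale, maxWords) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  let count = 0;
  for (const s of segmenter.segment(text)) {
    if (s.isWordLike && ++count > maxWords) return false;
  }
  return true;
}

console.log(isWithinWordLimit("Hello, world!", "en", 2));
// true
```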

Do not use word segmentation when you only need character counts. Use grapheme segmentation for character-level operations.

Do not use word segmentation for sentence splitting. Use sentence granularity for that purpose.
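
Both alternatives use the same constructor with a different granularity option; a brief sketch:

```javascript
// Grapheme granularity counts user-perceived characters: the thumbs-up
// emoji plus its skin-tone modifier is a single grapheme cluster.
const graphemes = new Intl.Segmenter("en", { granularity: "grapheme" });
console.log(Array.from(graphemes.segment("👍🏽ok")).length);
// 3

// Sentence granularity splits at sentence boundaries.
const sentences = new Intl.Segmenter("en", { granularity: "sentence" });
const parts = Array.from(sentences.segment("Hello world. How are you?"), s => s.segment);
console.log(parts.length);
// 2
```

Note that the isWordLike property is only present when the granularity is "word"; grapheme and sentence segments omit it.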

How word segmentation fits into internationalization

The Intl.Segmenter API is part of the ECMAScript Internationalization API. Other APIs in this family handle different aspects of internationalization:

  • Intl.DateTimeFormat: Format dates and times according to locale
  • Intl.NumberFormat: Format numbers, currencies, and units according to locale
  • Intl.Collator: Sort and compare strings according to locale
  • Intl.PluralRules: Determine plural forms for numbers in different languages

Together, these APIs provide the tools needed to build applications that work correctly for users worldwide. Use Intl.Segmenter with word granularity when you need to identify word boundaries, and use the other Intl APIs for formatting and comparison.