How do I split text into words in JavaScript?

Use Intl.Segmenter to extract words from text in any language, including languages without spaces between words.

Introduction

When you need to extract words from text, the common approach is to split on spaces using split(" "). This works for English, but fails completely for languages that do not use spaces between words. Chinese, Japanese, Thai, and other languages write text continuously without word separators, yet users perceive distinct words in that text.

The Intl.Segmenter API solves this problem. It identifies word boundaries according to Unicode standards and linguistic rules for each language. You can extract words from text regardless of whether the language uses spaces, and the segmenter handles the complexity of determining where words begin and end.

This article explains why basic string splitting fails for international text, how word boundaries work across different writing systems, and how to use Intl.Segmenter to split text into words correctly for all languages.

Why splitting on spaces fails

The split() method breaks a string at each occurrence of a separator. For English text, splitting on spaces extracts words.

const text = "Hello world";
const words = text.split(" ");
console.log(words);
// ["Hello", "world"]

This approach assumes words are separated by spaces. Many languages do not follow this pattern.

Chinese text does not include spaces between words.

const text = "你好世界";
const words = text.split(" ");
console.log(words);
// ["你好世界"]

The user sees two distinct words, but split() returns the entire string as a single element because there are no spaces to split on.

Japanese text mixes multiple scripts and does not use spaces between words.

const text = "今日は良い天気です";
const words = text.split(" ");
console.log(words);
// ["今日は良い天気です"]

This sentence contains multiple words, but splitting on spaces produces one element.

Thai text also writes words continuously without spaces.

const text = "สวัสดีครับ";
const words = text.split(" ");
console.log(words);
// ["สวัสดีครับ"]

The text contains two words, but split() returns one element.

For these languages, you need a different approach to identify word boundaries.

Why regular expressions fail for word boundaries

Regular expression word boundaries use the \b pattern to match positions between word and non-word characters. This works for English.

const text = "Hello world!";
const words = text.match(/\b\w+\b/g);
console.log(words);
// ["Hello", "world"]

This pattern fails for languages without spaces, but not in the way you might expect. The \w class matches only ASCII letters, digits, and underscore, so it does not match Chinese, Japanese, or Thai characters at all.

const text = "你好世界";
const words = text.match(/\b\w+\b/g);
console.log(words);
// null

The regex finds no matches because \w does not cover these scripts. A Unicode-aware pattern such as /\p{L}+/gu matches the characters but returns the entire string as a single run, because the regex engine has no concept of Chinese word boundaries.

Even for English, regex patterns can produce incorrect results with punctuation, contractions, or special characters. Regular expressions are not designed to handle linguistic word segmentation across all writing systems.
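
For instance, because \w does not match the apostrophe, the same \b\w+\b pattern splits even a simple English contraction:

```javascript
// \w matches only [A-Za-z0-9_], so the apostrophe in "don't"
// breaks the match into two pieces.
const words = "don't".match(/\b\w+\b/g);
console.log(words);
// ["don", "t"]
```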

What word boundaries are across languages

A word boundary is a position in text where one word ends and another begins. Different writing systems use different conventions for word boundaries.

Space-separated languages like English, Spanish, French, and German use spaces to mark word boundaries. The word "hello" is separated from "world" by a space.

Scriptio continua languages like Chinese, Japanese, and Thai do not use spaces between words. Word boundaries exist based on semantic and morphological rules, but these boundaries are not marked visually in the text. A Chinese reader recognizes where one word ends and another begins through familiarity with the language, not through visual separators.

Some languages use mixed conventions. Japanese combines kanji, hiragana, and katakana characters, and word boundaries occur at transitions between character types or based on grammatical structure.

The Unicode Standard defines word boundary rules in UAX 29. These rules specify how to identify word boundaries for all scripts. The rules consider character properties, script types, and linguistic patterns to determine where words begin and end.

Using Intl.Segmenter to split text into words

The Intl.Segmenter constructor creates a segmenter object that splits text according to Unicode rules. You specify a locale and a granularity.

const segmenter = new Intl.Segmenter("en", { granularity: "word" });
const text = "Hello world!";
const segments = segmenter.segment(text);

The first argument is the locale identifier. The second argument is an options object where granularity: "word" tells the segmenter to split at word boundaries.

The segment() method returns an iterable object containing segments. You can iterate over segments using for...of.

const segmenter = new Intl.Segmenter("en", { granularity: "word" });
const text = "Hello world!";

for (const segment of segmenter.segment(text)) {
  console.log(segment);
}
// { segment: "Hello", index: 0, input: "Hello world!", isWordLike: true }
// { segment: " ", index: 5, input: "Hello world!", isWordLike: false }
// { segment: "world", index: 6, input: "Hello world!", isWordLike: true }
// { segment: "!", index: 11, input: "Hello world!", isWordLike: false }

Each segment object contains properties:

  • segment: the text of this segment
  • index: the position in the original string where this segment starts
  • input: the original string being segmented
  • isWordLike: true when the segment contains word content, false for spaces, punctuation, and other separators
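
The segments partition the entire input: every character belongs to exactly one segment, so concatenating the segments in order reproduces the original string. A quick check:

```javascript
// Word segmentation is lossless: joining all segments, word-like
// or not, yields the original input.
const segmenter = new Intl.Segmenter("en", { granularity: "word" });
const text = "Hello world!";
const parts = Array.from(segmenter.segment(text), s => s.segment);
console.log(parts.join("") === text);
// true
```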

Understanding the isWordLike property

When you segment text by words, the segmenter returns both word segments and non-word segments. Words include letters, numbers, and ideographic characters. Non-word segments include spaces, punctuation, and other separators.

The isWordLike property indicates whether a segment is a word. This property is true for segments that contain word characters, and false for segments that contain only spaces, punctuation, or other non-word characters.

const segmenter = new Intl.Segmenter("en", { granularity: "word" });
const text = "Hello, world!";

for (const { segment, isWordLike } of segmenter.segment(text)) {
  console.log(segment, isWordLike);
}
// "Hello" true
// "," false
// " " false
// "world" true
// "!" false

Use the isWordLike property to filter word segments from punctuation and whitespace. This gives you just the words without separators.

const segmenter = new Intl.Segmenter("en", { granularity: "word" });
const text = "Hello, world!";
const segments = segmenter.segment(text);
const words = Array.from(segments)
  .filter(s => s.isWordLike)
  .map(s => s.segment);

console.log(words);
// ["Hello", "world"]

This pattern works for any language, including those without spaces.

Extracting words from text without spaces

The segmenter correctly identifies word boundaries in languages that do not use spaces. For Chinese text, the segmenter splits at word boundaries based on Unicode rules and linguistic patterns.

const segmenter = new Intl.Segmenter("zh", { granularity: "word" });
const text = "你好世界";

for (const { segment, isWordLike } of segmenter.segment(text)) {
  console.log(segment, isWordLike);
}
// "你好" true
// "世界" true

The segmenter identifies two words in this text. There are no spaces, but the segmenter understands Chinese word boundaries and splits the text appropriately.

For Japanese text, the segmenter handles the complexity of mixed scripts and identifies word boundaries.

const segmenter = new Intl.Segmenter("ja", { granularity: "word" });
const text = "今日は良い天気です";

for (const { segment, isWordLike } of segmenter.segment(text)) {
  console.log(segment, isWordLike);
}
// "今日" true
// "は" true
// "良い" true
// "天気" true
// "です" true

The segmenter splits this sentence into five word segments. It recognizes that particles like "は" are separate words and that multi-character words like "天気" form single units.

For Thai text, the segmenter identifies word boundaries without spaces.

const segmenter = new Intl.Segmenter("th", { granularity: "word" });
const text = "สวัสดีครับ";

for (const { segment, isWordLike } of segmenter.segment(text)) {
  console.log(segment, isWordLike);
}
// "สวัสดี" true
// "ครับ" true

The segmenter correctly identifies two words in this greeting.

Building a word extraction function

Create a function that extracts words from text in any language.

function getWords(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const segments = segmenter.segment(text);
  return Array.from(segments)
    .filter(s => s.isWordLike)
    .map(s => s.segment);
}

This function works for space-separated and non-space-separated languages.

getWords("Hello, world!", "en");
// ["Hello", "world"]

getWords("你好世界", "zh");
// ["你好", "世界"]

getWords("今日は良い天気です", "ja");
// ["今日", "は", "良い", "天気", "です"]

getWords("Bonjour le monde!", "fr");
// ["Bonjour", "le", "monde"]

getWords("สวัสดีครับ", "th");
// ["สวัสดี", "ครับ"]

The function returns an array of words regardless of the language or writing system.

Counting words accurately across languages

Build a word counter that works for all languages by counting word-like segments.

function countWords(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const segments = segmenter.segment(text);
  return Array.from(segments).filter(s => s.isWordLike).length;
}

This function produces accurate word counts for text in any language.

countWords("Hello world", "en");
// 2

countWords("你好世界", "zh");
// 2

countWords("今日は良い天気です", "ja");
// 5

countWords("Bonjour le monde", "fr");
// 3

countWords("สวัสดีครับ", "th");
// 2

The counts match user perception of word boundaries in each language.

Finding which word contains a position

The containing() method finds the segment that includes a specific index in the string. This is useful for determining which word the cursor is in or which word was clicked.

const segmenter = new Intl.Segmenter("en", { granularity: "word" });
const text = "Hello world";
const segments = segmenter.segment(text);

const segment = segments.containing(7);
console.log(segment);
// { segment: "world", index: 6, input: "Hello world", isWordLike: true }

Index 7 falls within the word "world", which starts at index 6. The method returns the segment object for that word.

If the index falls within whitespace or punctuation, the method returns that segment with isWordLike: false.

const segment = segments.containing(5);
console.log(segment);
// { segment: " ", index: 5, input: "Hello world", isWordLike: false }

Use this for text editor features like double-click word selection, contextual menus based on cursor position, or highlighting the current word.
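
As a sketch of double-click word selection, a helper can return the word and its bounds at a given position. The name wordAt and its return shape are illustrative, not part of the Intl API.

```javascript
// Return the word containing `index`, with its start and end offsets,
// or null when the position falls on whitespace or punctuation.
function wordAt(text, index, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const seg = segmenter.segment(text).containing(index);
  if (!seg || !seg.isWordLike) return null;
  return { word: seg.segment, start: seg.index, end: seg.index + seg.segment.length };
}

console.log(wordAt("Hello world", 7));
// { word: "world", start: 6, end: 11 }
```

The start and end offsets can be passed directly to selection APIs that take character ranges.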

Handling punctuation and contractions

The segmenter treats punctuation as separate segments, but it keeps English contractions together: under the UAX 29 rules, an apostrophe between two letters does not break a word.

const segmenter = new Intl.Segmenter("en", { granularity: "word" });
const text = "I can't do it.";

for (const { segment, isWordLike } of segmenter.segment(text)) {
  console.log(segment, isWordLike);
}
// "I" true
// " " false
// "can't" true
// " " false
// "do" true
// " " false
// "it" true
// "." false

The contraction "can't" remains a single word segment because the apostrophe sits between two letters. An apostrophe at the edge of a word, such as a trailing possessive, is emitted as a separate non-word segment.

Because contractions stay intact, counting word-like segments produces word counts that match how readers perceive the text.

How locale affects word segmentation

The locale you pass to the segmenter affects how word boundaries are determined. Different locales may have different rules for the same text.

For languages with well-defined word boundary rules, the locale ensures the correct rules are applied.

const segmenterEn = new Intl.Segmenter("en", { granularity: "word" });
const segmenterZh = new Intl.Segmenter("zh", { granularity: "word" });

const text = "你好世界";

const wordsEn = Array.from(segmenterEn.segment(text))
  .filter(s => s.isWordLike)
  .map(s => s.segment);

const wordsZh = Array.from(segmenterZh.segment(text))
  .filter(s => s.isWordLike)
  .map(s => s.segment);

console.log(wordsEn);
// ["你好世界"]

console.log(wordsZh);
// ["你好", "世界"]

The English locale does not recognize Chinese word boundaries and treats the entire string as one word. The Chinese locale applies Chinese word boundary rules and correctly identifies two words.

Always use the appropriate locale for the language of the text being segmented.

Creating reusable segmenters for performance

Constructing a segmenter loads locale data, so creating a new instance for every call adds avoidable overhead. Reuse segmenters across multiple strings for better performance.

const enSegmenter = new Intl.Segmenter("en", { granularity: "word" });
const zhSegmenter = new Intl.Segmenter("zh", { granularity: "word" });
const jaSegmenter = new Intl.Segmenter("ja", { granularity: "word" });

function getWords(text, locale) {
  const segmenter = locale === "zh" ? zhSegmenter
    : locale === "ja" ? jaSegmenter
    : enSegmenter;

  return Array.from(segmenter.segment(text))
    .filter(s => s.isWordLike)
    .map(s => s.segment);
}

This approach creates segmenters once and reuses them for all calls to getWords(). The segmenter caches locale data, so reusing instances avoids repeated initialization.
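
One way to extend this to arbitrary locales is a small cache keyed by locale. The cache shape here is an implementation sketch, not part of the Intl API.

```javascript
// Lazily create one word segmenter per locale and reuse it
// for every subsequent call with that locale.
const segmenterCache = new Map();

function getSegmenter(locale) {
  let segmenter = segmenterCache.get(locale);
  if (!segmenter) {
    segmenter = new Intl.Segmenter(locale, { granularity: "word" });
    segmenterCache.set(locale, segmenter);
  }
  return segmenter;
}

function getWords(text, locale) {
  return Array.from(getSegmenter(locale).segment(text))
    .filter(s => s.isWordLike)
    .map(s => s.segment);
}
```

Unlike a hard-coded set of segmenters, this handles any locale without editing the function.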

Practical example: building a word frequency analyzer

Combine word segmentation with counting to analyze word frequency in text.

function getWordFrequency(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const segments = segmenter.segment(text);
  const words = Array.from(segments)
    .filter(s => s.isWordLike)
    .map(s => s.segment.toLowerCase());

  const frequency = {};
  for (const word of words) {
    frequency[word] = (frequency[word] || 0) + 1;
  }

  return frequency;
}

const text = "Hello world! Hello everyone in this world.";
const frequency = getWordFrequency(text, "en");
console.log(frequency);
// { hello: 2, world: 2, everyone: 1, in: 1, this: 1 }

This function splits text into words, normalizes to lowercase, and counts occurrences. It works for any language.

const textZh = "你好世界!你好大家!";
const frequencyZh = getWordFrequency(textZh, "zh");
console.log(frequencyZh);
// { "你好": 2, "世界": 1, "大家": 1 }

The same logic handles Chinese text without modification.
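
To turn a frequency table into a ranking, a small helper can sort the entries by count. The name topWords is illustrative, assuming a { word: count } object like the ones returned above.

```javascript
// Return the n most frequent words from a { word: count } table.
// Array.prototype.sort is stable, so ties keep their original order.
function topWords(frequency, n) {
  return Object.entries(frequency)
    .sort((a, b) => b[1] - a[1])
    .slice(0, n)
    .map(([word]) => word);
}

console.log(topWords({ hello: 2, world: 2, everyone: 1, in: 1, this: 1 }, 2));
// ["hello", "world"]
```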

Checking browser support

The Intl.Segmenter API reached Baseline status in April 2024. It works in current versions of Chrome, Firefox, Safari, and Edge. Older browsers do not support it.

Check for support before using the API.

if (typeof Intl.Segmenter !== "undefined") {
  const segmenter = new Intl.Segmenter("en", { granularity: "word" });
  // Use segmenter
} else {
  // Fallback for older browsers
}

For production applications targeting older browsers, provide a fallback implementation. A simple fallback splits on whitespace, which works for space-separated languages but returns text without spaces as a single word.

function getWords(text, locale) {
  if (typeof Intl.Segmenter !== "undefined") {
    const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
    return Array.from(segmenter.segment(text))
      .filter(s => s.isWordLike)
      .map(s => s.segment);
  }

  // Fallback: only works for space-separated languages
  return text.split(/\s+/).filter(word => word.length > 0);
}

This ensures your code runs in older browsers, though with reduced functionality for non-space-separated languages.

Common mistakes to avoid

Do not split on spaces or regex patterns for multilingual text. These approaches only work for a subset of languages and fail for Chinese, Japanese, Thai, and other languages without spaces.

Do not forget to filter by isWordLike when extracting words. Without this filter, you get spaces, punctuation, and other non-word segments in your results.

Do not use the wrong locale when segmenting text. The locale determines which word boundary rules apply. Using an English locale for Chinese text produces incorrect results.

Do not assume all languages define words the same way. Word boundaries vary by writing system and linguistic convention. Use locale-aware segmentation to handle these differences.

Do not count words using split(" ").length for international text. This only works for space-separated languages and produces wrong counts for others.

When to use word segmentation

Use word segmentation when you need to:

  • Count words in user-generated content across multiple languages
  • Implement search and highlight features that work with any writing system
  • Build text analysis tools that process international text
  • Create word-based navigation or editing features in text editors
  • Extract keywords or terms from multilingual documents
  • Validate word count limits in forms that accept any language
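
For the last case, a validator can stop iterating as soon as the limit is exceeded instead of materializing every segment. The function name and signature are illustrative.

```javascript
// Check a word-count limit without building an intermediate array;
// iteration stops as soon as the limit is exceeded.
function isWithinWordLimit(text, locale, maxWords) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  let count = 0;
  for (const s of segmenter.segment(text)) {
    if (s.isWordLike && ++count > maxWords) return false;
  }
  return true;
}

console.log(isWithinWordLimit("Hello, world!", "en", 2));
// true
```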

Do not use word segmentation when you only need character counts. Use grapheme segmentation for character-level operations.

Do not use word segmentation for sentence splitting. Use sentence granularity for that purpose.
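
Both alternatives use the same constructor with a different granularity option; a brief sketch:

```javascript
// Grapheme granularity counts user-perceived characters: the thumbs-up
// emoji plus its skin-tone modifier is a single grapheme cluster.
const graphemes = new Intl.Segmenter("en", { granularity: "grapheme" });
console.log(Array.from(graphemes.segment("👍🏽ok")).length);
// 3

// Sentence granularity splits at sentence boundaries.
const sentences = new Intl.Segmenter("en", { granularity: "sentence" });
const parts = Array.from(sentences.segment("Hello world. How are you?"), s => s.segment);
console.log(parts.length);
// 2
```

Note that the isWordLike property is only present when the granularity is "word"; grapheme and sentence segments omit it.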

How word segmentation fits into internationalization

The Intl.Segmenter API is part of the ECMAScript Internationalization API. Other APIs in this family handle different aspects of internationalization:

  • Intl.DateTimeFormat: Format dates and times according to locale
  • Intl.NumberFormat: Format numbers, currencies, and units according to locale
  • Intl.Collator: Sort and compare strings according to locale
  • Intl.PluralRules: Determine plural forms for numbers in different languages

Together, these APIs provide the tools needed to build applications that work correctly for users worldwide. Use Intl.Segmenter with word granularity when you need to identify word boundaries, and use the other Intl APIs for formatting and comparison.