How do I split text into sentences?

Use Intl.Segmenter to split text into sentences with locale-aware boundary detection that handles punctuation, whitespace, and language-specific rules.

Introduction

When you process text for translation, analysis, or display, you often need to split it into individual sentences. A naive approach using regular expressions fails because sentence boundaries are more complex than periods followed by spaces. Sentences can end with question marks, exclamation points, or ellipses. Periods appear in abbreviations like "Dr." or "Inc." without ending sentences. Different languages use different punctuation marks as sentence terminators.

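The Intl.Segmenter API addresses this problem by providing locale-aware sentence boundary detection. Engines typically implement it with the Unicode text segmentation rules (UAX #29), tailored per locale where data is available, so it handles cases such as decimal numbers, lowercase text after a period, and non-Latin terminators. How well it copes with abbreviations like "Dr." depends on the engine's segmentation data.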

The problem with splitting on periods

You can try to split text into sentences by splitting on periods followed by spaces.

const text = "Hello world. How are you? I am fine.";
const sentences = text.split(". ");
console.log(sentences);
// ["Hello world", "How are you? I am fine."]

This approach has multiple problems. First, it does not handle question marks or exclamation points. Second, it breaks on abbreviations that contain periods. Third, it removes the period from each sentence except the last one. Fourth, when several spaces follow a period, the extra spaces end up at the start of the next sentence.

const text = "Dr. Smith works at Acme Inc. He starts at 9 a.m.";
const sentences = text.split(". ");
console.log(sentences);
// ["Dr", "Smith works at Acme Inc", "He starts at 9 a.m."]

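The text splits incorrectly after "Dr." because that period belongs to an abbreviation and does not end a sentence. The split after "Inc." is right only by coincidence: a new sentence happens to follow it. You need a smarter approach that understands sentence boundary rules.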

Using a more complex regular expression

You can improve the regex to handle more cases.

const text = "Hello world. How are you? I am fine!";
const sentences = text.split(/[.?!]\s+/);
console.log(sentences);
// ["Hello world", "How are you", "I am fine!"]

This splits on periods, question marks, and exclamation points followed by whitespace. It handles more cases but still fails with abbreviations, and it removes the punctuation from every sentence except the last. If the text ends with whitespace after the final terminator, the result also contains a trailing empty string.

const text = "Dr. Smith works at Acme Inc. He starts at 9 a.m.";
const sentences = text.split(/[.?!]\s+/);
console.log(sentences);
// ["Dr", "Smith works at Acme Inc", "He starts at 9 a.m."]

The regex approach cannot reliably distinguish between periods that end sentences and periods that appear in abbreviations. Building a comprehensive regex that handles all edge cases becomes impractical. You need a solution that understands linguistic rules.

Using Intl.Segmenter for sentence splitting

The Intl.Segmenter constructor creates a segmenter that splits text based on locale-specific rules. You specify a locale and set the granularity option to "sentence".

const segmenter = new Intl.Segmenter("en", { granularity: "sentence" });
const text = "Hello world. How are you? I am fine!";
const segments = segmenter.segment(text);

for (const segment of segments) {
  console.log(segment.segment);
}
// "Hello world. "
// "How are you? "
// "I am fine!"

The segment() method returns an iterable that yields segment objects. Each segment object has a segment property containing the text of that segment. The segmenter preserves the punctuation and whitespace at the end of each sentence.

You can convert the segments into an array using Array.from().

const segmenter = new Intl.Segmenter("en", { granularity: "sentence" });
const text = "Hello world. How are you? I am fine!";
const segments = segmenter.segment(text);
const sentences = Array.from(segments, s => s.segment);
console.log(sentences);
// ["Hello world. ", "How are you? ", "I am fine!"]

This creates an array where each element is a sentence with its original punctuation and spacing.
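Because the object returned by segment() is iterable, spread syntax is an alternative to Array.from(). A minimal sketch, with illustrative variable names:

```javascript
const spreadSegmenter = new Intl.Segmenter("en", { granularity: "sentence" });
const spreadText = "Hello world. How are you? I am fine!";

// Spread the iterable into an array of segment objects, then keep the text.
const spreadSentences = [...spreadSegmenter.segment(spreadText)].map(s => s.segment);
console.log(spreadSentences.length);
// 3
```

Array.from() with a mapping function avoids the intermediate array of segment objects, so it is usually the tidier choice when you only need the text.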

How Intl.Segmenter handles abbreviations

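Abbreviations are the hardest case, and the result is implementation-dependent. ECMAScript leaves the exact boundaries to the engine. The default Unicode rules (UAX #29) do not break when a lowercase letter follows the period, but they cannot tell that "Dr." before a capitalized name is an abbreviation.

const segmenter = new Intl.Segmenter("en", { granularity: "sentence" });
const text = "Dr. Smith works at Acme Inc. He starts at 9 a.m.";
const segments = segmenter.segment(text);
const sentences = Array.from(segments, s => s.segment);
console.log(sentences);
// An engine that recognizes "Dr." as an abbreviation can return:
// ["Dr. Smith works at Acme Inc. ", "He starts at 9 a.m."]
// An engine that applies only the default Unicode rules returns:
// ["Dr. ", "Smith works at Acme Inc. ", "He starts at 9 a.m."]

The periods inside "a.m." do not trigger a break, and the boundary after "Inc." is found because a new sentence follows it. Whether "Dr." stays attached to "Smith" depends on the runtime, so test abbreviation-heavy text in the environments you target. Even so, Intl.Segmenter keeps punctuation and whitespace intact and needs no hand-written pattern, which makes it a better starting point than regex approaches.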
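If a runtime does split after titles such as "Dr.", one possible workaround is to re-join those segments afterwards. The sketch below is only an illustration under that assumption: the function name splitSentencesKeepingTitles and the short TITLE_ABBREVIATIONS list are hypothetical, and a real list would need to be tuned to your text.

```javascript
// Hypothetical list of title abbreviations that should not end a sentence.
const TITLE_ABBREVIATIONS = new Set(["Dr.", "Mr.", "Mrs.", "Ms.", "Prof."]);

// Hypothetical helper: segments the text, then merges any segment that
// ends in a listed title with the segment that follows it.
function splitSentencesKeepingTitles(text, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "sentence" });
  const merged = [];
  let pending = "";

  for (const { segment } of segmenter.segment(text)) {
    pending += segment;
    // Look at the last whitespace-delimited word collected so far.
    const lastWord = pending.trimEnd().split(/\s+/).pop();
    if (!TITLE_ABBREVIATIONS.has(lastWord)) {
      merged.push(pending);
      pending = "";
    }
  }

  // Keep any leftover text, for example input that ends with "Dr.".
  if (pending) merged.push(pending);
  return merged;
}

const titleText = "Dr. Smith works at Acme Inc. He starts at 9 a.m.";
console.log(splitSentencesKeepingTitles(titleText));
```

The list deliberately contains only titles. Abbreviations such as "Inc." often do end a sentence, so merging on them would join sentences that should stay separate.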

Trimming whitespace from sentences

The segmenter includes trailing whitespace in each sentence. You can trim this whitespace if needed.

const segmenter = new Intl.Segmenter("en", { granularity: "sentence" });
const text = "Hello world. How are you? I am fine!";
const segments = segmenter.segment(text);
const sentences = Array.from(segments, s => s.segment.trim());
console.log(sentences);
// ["Hello world.", "How are you?", "I am fine!"]

The trim() method removes leading and trailing whitespace from each sentence. This is useful when you need clean sentence boundaries without extra spacing.
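Text with blank lines can produce segments that contain only whitespace, which become empty strings after trimming. The sketch below, using an illustrative input, trims each segment and then drops the empty ones.

```javascript
const trimSegmenter = new Intl.Segmenter("en", { granularity: "sentence" });
const multilineText = "First line.\n\n\nSecond line.";

// Trim every segment, then discard segments that were whitespace only.
const cleanSentences = Array.from(
  trimSegmenter.segment(multilineText),
  s => s.segment.trim()
).filter(s => s.length > 0);

console.log(cleanSentences);
// ["First line.", "Second line."]
```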

Getting segment metadata

Each segment object includes metadata about the segment position in the original text.

const segmenter = new Intl.Segmenter("en", { granularity: "sentence" });
const text = "Hello world. How are you?";
const segments = segmenter.segment(text);

for (const segment of segments) {
  console.log({
    text: segment.segment,
    index: segment.index,
    input: segment.input
  });
}
// { text: "Hello world. ", index: 0, input: "Hello world. How are you?" }
// { text: "How are you?", index: 13, input: "Hello world. How are you?" }

The index property indicates where the segment starts in the original text. The input property contains the full original text. This metadata is useful when you need to track sentence positions or reconstruct the original text.
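The two uses mentioned above can be sketched as follows. Joining the segments restores the input, and the containing() method of the returned Segments object looks up the sentence that covers a given character offset. The offset 15 is just an example value.

```javascript
const metaSegmenter = new Intl.Segmenter("en", { granularity: "sentence" });
const metaText = "Hello world. How are you?";
const metaSegments = metaSegmenter.segment(metaText);

// Segments cover the input without gaps, so joining them restores it.
const rebuilt = Array.from(metaSegments, s => s.segment).join("");
console.log(rebuilt === metaText);
// true

// containing() returns the segment that includes the given code unit index.
const hit = metaSegments.containing(15);
console.log({ index: hit.index, text: hit.segment });
// { index: 13, text: "How are you?" }
```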

Splitting sentences in different languages

Different languages have different sentence boundary rules. The segmenter adapts its behavior based on the specified locale.

In Japanese, sentences can end with a full-width period called a kuten.

const segmenter = new Intl.Segmenter("ja", { granularity: "sentence" });
const text = "私は猫です。名前はまだない。";
const segments = segmenter.segment(text);
const sentences = Array.from(segments, s => s.segment);
console.log(sentences);
// ["私は猫です。", "名前はまだない。"]

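The text splits at each Japanese sentence terminator. The ideographic full stop (。) is a sentence terminator in the default Unicode rules, so most engines find these boundaries under any locale. Passing "ja" lets the engine apply whatever Japanese-specific tailoring it has.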

In Hindi, sentences end with a vertical stroke (।) called the purna viram, which Unicode encodes as the Devanagari danda.

const segmenter = new Intl.Segmenter("hi", { granularity: "sentence" });
const text = "यह एक वाक्य है। यह दूसरा वाक्य है।";
const segments = segmenter.segment(text);
const sentences = Array.from(segments, s => s.segment);
console.log(sentences);
// ["यह एक वाक्य है। ", "यह दूसरा वाक्य है।"]

The segmenter recognizes the Devanagari danda (।) as a sentence boundary. Handling terminators beyond the Latin period is critical for internationalized text processing.

Using the correct locale for multilingual text

When you process text that contains multiple languages, choose the locale that matches the primary language of the text. The segmenter uses the specified locale to determine which boundary rules to apply.

const englishText = "Hello world. How are you?";
const japaneseText = "私は猫です。名前はまだない。";

const englishSegmenter = new Intl.Segmenter("en", { granularity: "sentence" });
const japaneseSegmenter = new Intl.Segmenter("ja", { granularity: "sentence" });

const englishSentences = Array.from(
  englishSegmenter.segment(englishText),
  s => s.segment
);

const japaneseSentences = Array.from(
  japaneseSegmenter.segment(japaneseText),
  s => s.segment
);

console.log(englishSentences);
// ["Hello world. ", "How are you?"]

console.log(japaneseSentences);
// ["私は猫です。", "名前はまだない。"]

Creating separate segmenters for each language lets the engine apply the boundary rules it has for that language. If you process text where the language is unknown, you can fall back to a default locale such as "en", though this may reduce accuracy for text in other languages.
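To see which locales a runtime actually supports, you can use the static Intl.Segmenter.supportedLocalesOf() method and the resolvedOptions() instance method. The sketch below uses an illustrative list of requested locales; the output depends on the runtime, so none is shown.

```javascript
// Ask the runtime which of the requested locales it supports for segmentation.
const requestedLocales = ["ja", "hi", "en"];
const supportedLocales = Intl.Segmenter.supportedLocalesOf(requestedLocales);
console.log(supportedLocales);

// Passing a list lets the runtime pick the first locale it supports.
const negotiatedSegmenter = new Intl.Segmenter(requestedLocales, {
  granularity: "sentence"
});

// resolvedOptions() reports the locale and granularity actually in use.
const resolved = negotiatedSegmenter.resolvedOptions();
console.log(resolved.locale, resolved.granularity);
```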

Handling text with no sentence boundaries

When text contains no sentence terminators, the segmenter returns the entire text as a single segment.

const segmenter = new Intl.Segmenter("en", { granularity: "sentence" });
const text = "Hello world";
const segments = segmenter.segment(text);
const sentences = Array.from(segments, s => s.segment);
console.log(sentences);
// ["Hello world"]

This behavior is correct because the text does not contain any sentence boundaries. The segmenter does not artificially split text that forms a single sentence.

Handling empty strings

The segmenter handles an empty string by returning a Segments object that yields no segments.

const segmenter = new Intl.Segmenter("en", { granularity: "sentence" });
const text = "";
const segments = segmenter.segment(text);
const sentences = Array.from(segments, s => s.segment);
console.log(sentences);
// []

This produces an empty array, which is the expected result for empty input.

Reusing segmenters for better performance

Creating a segmenter has some overhead. When you need to segment multiple texts with the same locale and options, create the segmenter once and reuse it.

const segmenter = new Intl.Segmenter("en", { granularity: "sentence" });

const texts = [
  "First text. With two sentences.",
  "Second text. With three sentences. And more.",
  "Third text."
];

texts.forEach(text => {
  const sentences = Array.from(segmenter.segment(text), s => s.segment);
  console.log(sentences);
});
// ["First text. ", "With two sentences."]
// ["Second text. ", "With three sentences. ", "And more."]
// ["Third text."]

Reusing the segmenter is more efficient than creating a new one for each text.
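When an application handles several locales, one way to reuse segmenters is to cache one per locale. This is a minimal sketch; the segmenterCache map and the getSentenceSegmenter helper are hypothetical names, not part of the API.

```javascript
// Hypothetical cache: one sentence segmenter per locale, created on first use.
const segmenterCache = new Map();

function getSentenceSegmenter(locale = "en") {
  let cached = segmenterCache.get(locale);
  if (!cached) {
    cached = new Intl.Segmenter(locale, { granularity: "sentence" });
    segmenterCache.set(locale, cached);
  }
  return cached;
}

const firstLookup = getSentenceSegmenter("en");
const secondLookup = getSentenceSegmenter("en");
console.log(firstLookup === secondLookup);
// true
```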

Building a sentence counting function

You can use the segmenter to count sentences in text.

function countSentences(text, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "sentence" });
  const segments = segmenter.segment(text);
  return Array.from(segments).length;
}

console.log(countSentences("Hello world. How are you?"));
// 2

console.log(countSentences("Dr. Smith works at Acme Inc. He starts at 9 a.m."));
// 2, or 3 on engines that split after "Dr."

console.log(countSentences("Single sentence"));
// 1

console.log(countSentences("私は猫です。名前はまだない。", "ja"));
// 2

This function creates a segmenter, splits the text, and returns the number of segments. It handles language-specific terminators correctly, although counts for abbreviation-heavy text can vary by engine.
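For long texts, you can count segments without building an intermediate array by iterating directly. A minimal sketch, with the hypothetical name countSentencesLazily:

```javascript
// Hypothetical variant that counts while iterating instead of collecting.
function countSentencesLazily(text, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "sentence" });
  let count = 0;
  for (const _segment of segmenter.segment(text)) {
    count += 1;
  }
  return count;
}

console.log(countSentencesLazily("Hello world. How are you?"));
// 2

console.log(countSentencesLazily(""));
// 0
```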

Building a sentence extraction function

You can create a function that extracts a specific sentence from text by index.

function getSentence(text, index, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "sentence" });
  const segments = Array.from(segmenter.segment(text), s => s.segment);
  return segments[index] || null;
}

const text = "First sentence. Second sentence. Third sentence.";

console.log(getSentence(text, 0));
// "First sentence. "

console.log(getSentence(text, 1));
// "Second sentence. "

console.log(getSentence(text, 2));
// "Third sentence."

console.log(getSentence(text, 3));
// null

This function returns the sentence at the specified index, or null if the index is out of bounds.
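When you only need an early sentence from a long text, iterating lets you stop as soon as it is found. The following sketch uses the hypothetical name getSentenceLazily and follows the same null convention as the function above.

```javascript
// Hypothetical variant that stops iterating once the wanted sentence is reached.
function getSentenceLazily(text, index, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "sentence" });
  let position = 0;
  for (const { segment } of segmenter.segment(text)) {
    if (position === index) return segment;
    position += 1;
  }
  // The index was negative or past the last sentence.
  return null;
}

const sampleText = "First sentence. Second sentence. Third sentence.";

console.log(getSentenceLazily(sampleText, 1));
// "Second sentence. "

console.log(getSentenceLazily(sampleText, 3));
// null
```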

Checking browser and runtime support

The Intl.Segmenter API is available in modern browsers and Node.js. It became part of the web platform baseline in April 2024 and is supported in all major browser engines.

You can check if the API is available before using it.

if (typeof Intl.Segmenter !== "undefined") {
  const segmenter = new Intl.Segmenter("en", { granularity: "sentence" });
  const text = "Hello world. How are you?";
  const sentences = Array.from(segmenter.segment(text), s => s.segment);
  console.log(sentences);
} else {
  console.log("Intl.Segmenter is not supported");
}

For environments without support, you need to provide a fallback. A simple fallback uses a basic regex split, though this loses the accuracy of locale-aware segmentation.

function splitSentences(text, locale = "en") {
  if (typeof Intl.Segmenter !== "undefined") {
    const segmenter = new Intl.Segmenter(locale, { granularity: "sentence" });
    return Array.from(segmenter.segment(text), s => s.segment);
  }

  // Fallback for older environments
  return text.split(/[.!?]\s+/).filter(s => s.length > 0);
}

console.log(splitSentences("Hello world. How are you?"));
// ["Hello world. ", "How are you?"]

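This function uses Intl.Segmenter when available and falls back to regex splitting in older environments. The fallback has no language-specific boundary detection and splits after every abbreviation. It also strips the terminating punctuation and whitespace from every sentence except the last, while the Intl.Segmenter branch keeps them, so callers should not rely on the exact shape of the returned strings.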
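If the two branches should return similarly shaped strings, the fallback can match sentences instead of splitting on terminators. The sketch below is one possible pattern, not a complete solution: the name splitSentencesWithRegex is hypothetical, the pattern still breaks after abbreviations such as "Dr.", and it ignores non-Latin terminators.

```javascript
// Hypothetical fallback: keeps terminators and trailing whitespace on each piece.
function splitSentencesWithRegex(text) {
  // First alternative: text, one or more terminators, optional whitespace.
  // Second alternative: trailing text that has no terminator.
  return text.match(/[^.!?]+[.!?]+\s*|[^.!?]+$/g) || [];
}

console.log(splitSentencesWithRegex("Hello world. How are you?"));
// ["Hello world. ", "How are you?"]

console.log(splitSentencesWithRegex("No terminator here"));
// ["No terminator here"]
```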