Compare strings ignoring accent marks

Learn how to compare strings while ignoring diacritical marks using JavaScript normalization and Intl.Collator

Introduction

When building applications that work with multiple languages, you often need to compare strings that contain accent marks. A user searching for "cafe" should find results for "café". A username check for "Jose" should match "José". Standard string comparison treats these as different strings, but your application logic needs to treat them as equal.

JavaScript provides two approaches to solve this problem. You can normalize strings and remove accent marks, or use the built-in collation API to compare strings with specific sensitivity rules.

What are accent marks

Accent marks are symbols placed above, below, or through letters to modify their pronunciation or meaning. These marks are called diacritics. Common examples include the acute accent in "é", the tilde in "ñ", and the umlaut in "ü".

In Unicode, these characters can be represented in two ways. A single code point can represent the complete character, or multiple code points can combine a base letter with a separate accent mark. The letter "é" can be stored as U+00E9 or as "e" (U+0065) plus a combining acute accent (U+0301).
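
The two representations render identically but are different strings as far as JavaScript is concerned. A minimal sketch of the difference, with illustrative variable names:

```javascript
// Two encodings of the same visible character "é".
const precomposed = '\u00E9'; // single code point U+00E9
const decomposed = 'e\u0301'; // "e" (U+0065) + combining acute accent (U+0301)

console.log(precomposed === decomposed); // false
console.log(precomposed.length, decomposed.length); // 1 2

// Normalizing both to the same form (NFC here) makes them comparable.
const sameAfterNfc =
  precomposed.normalize('NFC') === decomposed.normalize('NFC');
console.log(sameAfterNfc); // true
```

This is why any accent-aware or accent-insensitive comparison should start from a known normalization form.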

When to ignore accent marks in comparisons

Search functionality is the most common use case for accent-insensitive comparison. Users typing queries without accent marks expect to find content that contains accented characters. A search for "Muller" should find "Müller".

User input validation requires this capability when checking if usernames, email addresses, or other identifiers already exist. You want to prevent duplicate accounts for "maria" and "maría".

Case-insensitive comparisons often need to ignore accents at the same time. When checking if two strings match regardless of capitalization, you typically want to ignore accent differences as well.

Remove accent marks using normalization

The first approach converts strings to a normalized form where base letters and accent marks are separated, then removes the accent marks.

Unicode normalization converts strings into a standard form. The NFD (Canonical Decomposition) form separates combined characters into their base letters and combining marks. The string "café" becomes "cafe" followed by a combining acute accent character.
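
To see what NFD does, it can help to print the code points before and after normalization. The `codePoints` helper below is a hypothetical name used only for this illustration:

```javascript
// List the code points of a string as U+XXXX labels.
function codePoints(str) {
  return Array.from(str, (ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
  );
}

const word = 'caf\u00E9'; // "café" with a precomposed é
const before = codePoints(word);
const after = codePoints(word.normalize('NFD'));

console.log(before); // [ 'U+0063', 'U+0061', 'U+0066', 'U+00E9' ]
console.log(after);  // [ 'U+0063', 'U+0061', 'U+0066', 'U+0065', 'U+0301' ]
```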

After normalization, you can remove the combining marks using a regular expression. The Unicode range U+0300 to U+036F is the Combining Diacritical Marks block, which covers the accents used in most Latin-script languages.

function removeAccents(str) {
  return str.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

const text1 = 'café';
const text2 = 'cafe';

const normalized1 = removeAccents(text1);
const normalized2 = removeAccents(text2);

console.log(normalized1 === normalized2); // true
console.log(normalized1); // "cafe"

This method gives you strings without accent marks that you can compare using standard equality operators.
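
The range U+0300 to U+036F does not include every combining mark in Unicode. If your input may contain marks from other blocks, one possible variation is the Unicode property escape `\p{M}`, which matches any character in the Mark category and requires the `u` flag. The function names below are hypothetical, and this is a sketch rather than a recommendation: `\p{M}` also matches marks that are an essential part of some non-Latin scripts, so it may be too aggressive for multilingual text.

```javascript
// Range-based version: strips only U+0300..U+036F.
function removeAccentsRange(str) {
  return str.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

// Property-based version: strips every character in the Mark category.
function removeAllMarks(str) {
  return str.normalize('NFD').replace(/\p{M}/gu, '');
}

// "é" plus an "x" carrying U+20D7 (combining right arrow above),
// which lives outside the U+0300..U+036F block.
const sample = '\u00E9 x\u20D7';

console.log(removeAccentsRange(sample) === 'e x\u20D7'); // true (arrow kept)
console.log(removeAllMarks(sample) === 'e x');           // true (arrow removed)
```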

You can combine this with lowercase conversion for case-insensitive, accent-insensitive comparisons.

function normalizeForComparison(str) {
  return str.normalize('NFD').replace(/[\u0300-\u036f]/g, '').toLowerCase();
}

const search = 'muller';
const name = 'Müller';

console.log(normalizeForComparison(search) === normalizeForComparison(name)); // true

This approach works well when you need to store or index the normalized version of strings for efficient searching.
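
As a rough sketch of that idea, you can build a lookup table keyed by the normalized form once, then answer each query with a single lookup. The `customers` data, `customerIndex`, and `findCustomer` are hypothetical names for illustration:

```javascript
function normalizeForComparison(str) {
  return str.normalize('NFD').replace(/[\u0300-\u036f]/g, '').toLowerCase();
}

// Hypothetical sample data.
const customers = ['Müller', 'José', 'Zoë'];

// Build the index once: normalized key -> original spelling.
const customerIndex = new Map(
  customers.map((name) => [normalizeForComparison(name), name])
);

// Each lookup normalizes only the query, not the whole data set.
function findCustomer(query) {
  return customerIndex.get(normalizeForComparison(query));
}

console.log(findCustomer('MULLER')); // "Müller"
console.log(findCustomer('jose'));   // "José"
console.log(findCustomer('anna'));   // undefined
```

A Map keeps only one entry per normalized key, so names that collide after normalization would overwrite each other; storing an array per key is one way around that.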

Compare strings using Intl.Collator

The second approach uses the Intl.Collator API, which provides locale-aware string comparison with configurable sensitivity levels.

The Intl.Collator object compares strings according to language-specific rules. The sensitivity option controls which differences matter when comparing strings.

The "base" sensitivity level ignores both accent marks and case differences. Strings that differ only in accents or capitalization are considered equal.

const collator = new Intl.Collator('en', { sensitivity: 'base' });

console.log(collator.compare('café', 'cafe')); // 0 (equal)
console.log(collator.compare('Café', 'cafe')); // 0 (equal)
console.log(collator.compare('café', 'caff')); // negative number (first comes before second)

The compare method returns 0 when strings are equal, a negative number when the first string comes before the second, and a positive number when the first string comes after the second.

You can use this for equality checks or for sorting arrays.

const collator = new Intl.Collator('en', { sensitivity: 'base' });

function areEqualIgnoringAccents(str1, str2) {
  return collator.compare(str1, str2) === 0;
}

console.log(areEqualIgnoringAccents('José', 'Jose')); // true
console.log(areEqualIgnoringAccents('naïve', 'naive')); // true

For sorting, the compare method is bound to its collator, so you can pass it directly to Array.prototype.sort.

const names = ['Müller', 'Martinez', 'Muller', 'Márquez'];
const collator = new Intl.Collator('en', { sensitivity: 'base' });

names.sort(collator.compare);
console.log(names); // Groups variants together
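
With "base" sensitivity, "Müller" and "Muller" compare as equal, so their relative order is whatever it was in the input (Array.prototype.sort is stable in current engines). If you want a deterministic order for such ties, one option is to sort with the default sensitivity instead. A sketch comparing the two, with illustrative variable names:

```javascript
const unsorted = ['Müller', 'Martinez', 'Muller', 'Márquez'];

const baseCollator = new Intl.Collator('en', { sensitivity: 'base' });
const defaultCollator = new Intl.Collator('en'); // default sensitivity

// Copy before sorting so the input array stays unchanged.
const byBase = [...unsorted].sort(baseCollator.compare);
const byDefault = [...unsorted].sort(defaultCollator.compare);

console.log(byBase);    // [ 'Márquez', 'Martinez', 'Müller', 'Muller' ]
console.log(byDefault); // [ 'Márquez', 'Martinez', 'Muller', 'Müller' ]
```

In the first result the two spellings keep their input order; in the second, the unaccented spelling sorts first regardless of input order.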

The Intl.Collator API provides other sensitivity levels for different use cases.

The "accent" level ignores case but respects accent differences. "Café" equals "café" but not "cafe".

const accentCollator = new Intl.Collator('en', { sensitivity: 'accent' });
console.log(accentCollator.compare('Café', 'café')); // 0 (equal)
console.log(accentCollator.compare('café', 'cafe')); // positive number (not equal)

The "case" level ignores accents but respects case differences. "café" equals "cafe" but not "Café".

const caseCollator = new Intl.Collator('en', { sensitivity: 'case' });
console.log(caseCollator.compare('café', 'cafe')); // 0 (equal)
console.log(caseCollator.compare('café', 'Café')); // negative number (not equal)

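The "variant" level respects all differences. This is the default when the collator is used for sorting (usage: "sort"); for usage: "search", the default depends on the locale.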

const variantCollator = new Intl.Collator('en', { sensitivity: 'variant' });
console.log(variantCollator.compare('café', 'cafe')); // positive number (not equal)
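
The four levels can be summarized side by side. The sketch below checks one accent-only pair and one case-only pair at each level; `describeSensitivity` is a hypothetical helper name:

```javascript
// For a given sensitivity, report whether an accent-only difference
// and a case-only difference are treated as equal.
function describeSensitivity(sensitivity) {
  const c = new Intl.Collator('en', { sensitivity });
  return {
    ignoresAccents: c.compare('a', 'á') === 0,
    ignoresCase: c.compare('a', 'A') === 0,
  };
}

for (const level of ['base', 'accent', 'case', 'variant']) {
  console.log(level, describeSensitivity(level));
}
// base    { ignoresAccents: true,  ignoresCase: true }
// accent  { ignoresAccents: false, ignoresCase: true }
// case    { ignoresAccents: true,  ignoresCase: false }
// variant { ignoresAccents: false, ignoresCase: false }
```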

Choose between normalization and collation

Both methods produce correct results for accent-insensitive comparison, but they have different characteristics.

The normalization method creates new strings without accent marks. Use this approach when you need to store or index the normalized versions. Search engines and databases often store normalized text for efficient lookup.

The Intl.Collator method compares strings without modifying them. Use this approach when you need to compare strings directly, such as checking for duplicates or sorting lists. The collator respects language-specific sorting rules that simple string comparison cannot handle.

Performance considerations vary by use case. Creating a collator object once and reusing it is efficient for multiple comparisons. Normalizing strings is efficient when you normalize once and compare many times.

The normalization method removes accent information permanently. The collation method preserves the original strings while comparing them according to rules you specify.

A common use case is filtering an array of items based on user input, ignoring accent differences. The following example keeps the products whose names start with the query.

const products = [
  { name: 'Café Latte', price: 4.50 },
  { name: 'Crème Brûlée', price: 6.00 },
  { name: 'Croissant', price: 3.00 },
  { name: 'Café Mocha', price: 5.00 }
];

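// Create the collator once and reuse it for every comparison.
const searchCollator = new Intl.Collator('en', { sensitivity: 'base' });

function searchProducts(query) {
  // Compare the start of each product name with the query.
  return products.filter(product => {
    return searchCollator.compare(product.name.slice(0, query.length), query) === 0;
  });
}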

console.log(searchProducts('cafe'));
// Returns both Café Latte and Café Mocha
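
Slicing by `query.length` assumes that the name and the query use the same number of code units per character. That breaks when one side stores "é" as two code points. One way to reduce the risk, sketched here under the assumption that your text has precomposed forms available, is to normalize both strings to NFC before slicing. The `startsWithIgnoringAccents` name is hypothetical:

```javascript
const prefixCollator = new Intl.Collator('en', { sensitivity: 'base' });

// Prefix test that normalizes both sides to NFC before slicing.
function startsWithIgnoringAccents(text, prefix) {
  const t = text.normalize('NFC');
  const p = prefix.normalize('NFC');
  return prefixCollator.compare(t.slice(0, p.length), p) === 0;
}

// "Café Latte" stored with a combining accent (decomposed form).
const decomposedName = 'Cafe\u0301 Latte';
const typedQuery = 'caf\u00E9 l'; // "café l" with a precomposed é

// Without normalization the slice is one character short.
const naive =
  prefixCollator.compare(decomposedName.slice(0, typedQuery.length), typedQuery) === 0;

console.log(naive); // false
console.log(startsWithIgnoringAccents(decomposedName, typedQuery)); // true
console.log(startsWithIgnoringAccents(decomposedName, 'latte'));    // false
```

NFC does not guarantee one code unit per visible character in every script, so treat this as a mitigation for common Latin text rather than a general solution.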

Intl.Collator only compares whole strings and has no "contains" operation, so for substring matching the normalization approach works better.

function removeAccents(str) {
  return str.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

function searchProducts(query) {
  const normalizedQuery = removeAccents(query.toLowerCase());

  return products.filter(product => {
    const normalizedName = removeAccents(product.name.toLowerCase());
    return normalizedName.includes(normalizedQuery);
  });
}

console.log(searchProducts('creme'));
// Returns Crème Brûlée

This approach checks if the normalized product name contains the normalized search query as a substring.

Handle text input matching

When validating user input against existing data, you need accent-insensitive comparison to prevent confusion and duplicates.

const existingUsernames = ['José', 'María', 'François'];

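// Create the collator once and reuse it for every check.
const usernameCollator = new Intl.Collator('en', { sensitivity: 'base' });

function isUsernameTaken(username) {
  // Taken if any existing name differs only in accents or case.
  return existingUsernames.some(existing =>
    usernameCollator.compare(existing, username) === 0
  );
}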

console.log(isUsernameTaken('jose')); // true
console.log(isUsernameTaken('Maria')); // true
console.log(isUsernameTaken('francois')); // true
console.log(isUsernameTaken('pierre')); // false

This prevents users from creating accounts with names that differ only in accents or capitalization from existing accounts.

Browser and environment support

The String.prototype.normalize method is supported in all modern browsers and Node.js environments. Internet Explorer does not support this method.

The Intl.Collator API is supported in all modern browsers and Node.js versions. Internet Explorer 11 includes partial support.

Both approaches work reliably in current JavaScript environments. If you need to support older browsers, you need polyfills or alternative implementations.
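
If you cannot be sure which features the runtime provides, a simple feature check before choosing a strategy is one option. This is a minimal sketch; the `detectComparisonSupport` name is hypothetical:

```javascript
// Report which of the two approaches the current runtime supports.
function detectComparisonSupport() {
  const hasNormalize = typeof String.prototype.normalize === 'function';
  const hasCollator =
    typeof Intl === 'object' &&
    Intl !== null &&
    typeof Intl.Collator === 'function';
  return { hasNormalize, hasCollator };
}

const support = detectComparisonSupport();
console.log(support);
```

In a current browser or Node.js version both flags are expected to be true.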

Limitations of accent removal

Some languages use diacritics to create distinct letters, not just accent variations. In Turkish, "i" and "ı" are different letters. In German, "ö" is a distinct vowel, not an accented "o".

Removing accents changes the meaning in these cases. Consider whether accent-insensitive comparison is appropriate for your use case and target languages.
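
The normalization approach shows two kinds of surprises here. Some letters lose a mark that carries meaning, while others have no decomposition at all and pass through unchanged. A short sketch using the `removeAccents` function from earlier, with sample words chosen only for illustration:

```javascript
function removeAccents(str) {
  return str.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

// Turkish dotted capital İ (U+0130) decomposes to I + combining dot above,
// so stripping marks turns it into a plain "I".
const istanbul = removeAccents('\u0130stanbul'); // "İstanbul"

// ø (U+00F8) and Ł (U+0141) have no canonical decomposition,
// so they are left as they are.
const soren = removeAccents('S\u00F8ren');        // "Søren"
const lodz = removeAccents('\u0141\u00F3d\u017A'); // "Łódź"

console.log(istanbul); // "Istanbul"
console.log(soren);    // "Søren"
console.log(lodz);     // "Łodz"
```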

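The collation approach can handle these cases better because it follows locale-specific rules, but the result depends on the locale you pass to the Intl.Collator constructor. At "base" sensitivity, for example, a Swedish collator treats "ö" and "o" as different letters, while a German collator treats them as equal.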

const turkishCollator = new Intl.Collator('tr', { sensitivity: 'base' });
const germanCollator = new Intl.Collator('de', { sensitivity: 'base' });
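
The effect of the locale is easiest to see by running the same comparison under two locales. The sketch below assumes the runtime ships collation data for German and Swedish, which it checks with `Intl.Collator.supportedLocalesOf`; the `equalAtBase` helper is a hypothetical name:

```javascript
// Compare two strings at "base" sensitivity under a given locale.
function equalAtBase(locale, a, b) {
  return new Intl.Collator(locale, { sensitivity: 'base' }).compare(a, b) === 0;
}

// Without locale data the runtime silently falls back to a default locale,
// so only trust the results when both locales are supported.
const supportedLocales = Intl.Collator.supportedLocalesOf(['de', 'sv']);
const hasBothLocales = supportedLocales.length === 2;

if (hasBothLocales) {
  console.log(equalAtBase('de', 'ö', 'o')); // true: German treats ö as a variant of o
  console.log(equalAtBase('sv', 'ö', 'o')); // false: Swedish treats ö as a separate letter
}
```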

Always consider the languages your application supports when choosing a comparison strategy.