How to do transliteration in PostgreSQL

When working on one small project (but interesting, nevertheless), we needed to solve an interesting request. The customer was a Russian company and there was a requirement for fast database search by city name, so that the search returned the correct results when entering the name in both cyrillic or latin alphabets. E.g. search for ‘Mосква’ must return the same results as search for ‘Moskva’. In other words, we have to search in both, the cyrillic names and the names transliterated from cyrillic.

The easiest solution would be to add a database column with transliterated latin names of cities, and when querying, search in both columns - cyrillic and latin. A more optimal solution would be to query only one column - the transliterated latin name. The client’s input would be first programmatically transliterated into latin and then the DB would be queried. But how to do the transliteration? PostgreSQL does not provide native transliteration, but nothing prevents us from writing our own transliteration function. The rules of transliteration from cyrillic to latin are quite simple, it is basically just string substitution:

CREATE OR REPLACE FUNCTION cyrillic_transliterate(p_string text)
RETURNS character varying AS $BODY$
SELECT replace( replace( replace( replace(
replace( replace( replace( replace( translate(lower($1),
'абвгдеёзийклмнопрстуфхцэы', 'abvgdeezijklmnoprstufхcey'), 'ж', 'zh'),
'ч', 'ch'), 'ш', 'sh'), 'щ', 'shh'), 'ъ', ''), 'ю', 'yu'), 'я', 'ya'), 'ь', '');
$BODY$
LANGUAGE SQL IMMUTABLE COST 100;
ALTER FUNCTION cyrillic_transliterate(p_string text) OWNER TO blog;

The usage of the function is simple. In our case (since it is a static database codelist), we have created the following materialized view:

CREATE MATERIALIZED VIEW MV_CITYREGION AS
  SELECT
    city.id                                  AS cityId,
    city.name                                AS cityName,
    city.population                          AS population,
    region.id                                AS regionId,
    region.name                              AS regionName,
    lower(cyrillic_transliterate(city.name)) AS cityTransliterated,
    city.name || ' (' || region.name || ')'  AS cityRegion
  FROM city
    INNER JOIN region ON city.regionid = region.id
  ORDER BY city.name, region.name, city.id;

Now we can query the database:

select
  cr.cityName,
  cr.cityTransliterated,
  cr.cityregion
from MV_CITYREGION cr 
where
  cityTransliterated = 'moskva'
order by
  cr.population desc,
  cr.cityname asc;

 cityname | citytransliterated |         cityregion         
----------+--------------------+----------------------------
 Москва   | moskva             | Москва (Москва)
 Москва   | moskva             | Москва (Кировская область)
 Москва   | moskva             | Москва (Псковская область)
 Москва   | moskva             | Москва (Тверская область)
(4 rows)

Another option would be, for example, to create expression index using our cyrillic transliteration function.