Text::Unaccent - Remove accents from a string


NAME

Text::Unaccent - Remove accents from a string


SYNOPSIS

  use Text::Unaccent;
  $unaccented = unac_string($charset, $string);
  $unaccented = unac_string_utf16($string);
  $version = unac_version();
  unac_debug($level);


DESCRIPTION

Text::Unaccent is a module that removes accents from a string. unac_string converts the input string from the specified charset to UTF-16 and calls unac_string_utf16 to compute the unaccented equivalent. The conversion from and to UTF-16 is done with iconv(1).


METHODS

$unaccented = unac_string($charset, $string)
Return the unaccented equivalent of the string $string. The character set of $string is specified by the $charset argument, and the returned string is encoded in the same character set. Valid values for the $charset argument are the character sets known to iconv(1). Under GNU/Linux, run iconv -l for a complete list.

[ Added for Windows users (jl_morel@bribes.org)

Here is the list of the names of the supported encodings for this binary distro. The names are printed in upper case, separated by whitespace, and alias names of an encoding are listed on the same line as the encoding itself.

  ANSI_X3.4-1968 ANSI_X3.4-1986 ASCII CP367 IBM367 ISO-IR-6 ISO646-US ISO_646.IRV:1991 US US-ASCII CSASCII
  UTF-8
  ISO-10646-UCS-2 UCS-2 CSUNICODE
  UCS-2BE UNICODE-1-1 UNICODEBIG CSUNICODE11
  UCS-2LE UNICODELITTLE
  ISO-10646-UCS-4 UCS-4 CSUCS4
  UCS-4BE
  UCS-4LE
  UTF-16
  UTF-16BE
  UTF-16LE
  UTF-32
  UTF-32BE
  UTF-32LE
  UNICODE-1-1-UTF-7 UTF-7 CSUNICODE11UTF7
  UCS-2-INTERNAL
  UCS-2-SWAPPED
  UCS-4-INTERNAL
  UCS-4-SWAPPED
  C99
  JAVA
  CP819 IBM819 ISO-8859-1 ISO-IR-100 ISO8859-1 ISO_8859-1 ISO_8859-1:1987 L1 LATIN1 CSISOLATIN1
  ISO-8859-2 ISO-IR-101 ISO8859-2 ISO_8859-2 ISO_8859-2:1987 L2 LATIN2 CSISOLATIN2
  ISO-8859-3 ISO-IR-109 ISO8859-3 ISO_8859-3 ISO_8859-3:1988 L3 LATIN3 CSISOLATIN3
  ISO-8859-4 ISO-IR-110 ISO8859-4 ISO_8859-4 ISO_8859-4:1988 L4 LATIN4 CSISOLATIN4
  CYRILLIC ISO-8859-5 ISO-IR-144 ISO8859-5 ISO_8859-5 ISO_8859-5:1988 CSISOLATINCYRILLIC
  ARABIC ASMO-708 ECMA-114 ISO-8859-6 ISO-IR-127 ISO8859-6 ISO_8859-6 ISO_8859-6:1987 CSISOLATINARABIC
  ECMA-118 ELOT_928 GREEK GREEK8 ISO-8859-7 ISO-IR-126 ISO8859-7 ISO_8859-7 ISO_8859-7:1987 CSISOLATINGREEK
  HEBREW ISO-8859-8 ISO-IR-138 ISO8859-8 ISO_8859-8 ISO_8859-8:1988 CSISOLATINHEBREW
  ISO-8859-9 ISO-IR-148 ISO8859-9 ISO_8859-9 ISO_8859-9:1989 L5 LATIN5 CSISOLATIN5
  ISO-8859-10 ISO-IR-157 ISO8859-10 ISO_8859-10 ISO_8859-10:1992 L6 LATIN6 CSISOLATIN6
  ISO-8859-13 ISO-IR-179 ISO8859-13 ISO_8859-13 L7 LATIN7
  ISO-8859-14 ISO-CELTIC ISO-IR-199 ISO8859-14 ISO_8859-14 ISO_8859-14:1998 L8 LATIN8
  ISO-8859-15 ISO-IR-203 ISO8859-15 ISO_8859-15 ISO_8859-15:1998 LATIN-9
  ISO-8859-16 ISO-IR-226 ISO8859-16 ISO_8859-16 ISO_8859-16:2001 L10 LATIN10
  KOI8-R CSKOI8R
  KOI8-U
  KOI8-RU
  CP1250 MS-EE WINDOWS-1250
  CP1251 MS-CYRL WINDOWS-1251
  CP1252 MS-ANSI WINDOWS-1252
  CP1253 MS-GREEK WINDOWS-1253
  CP1254 MS-TURK WINDOWS-1254
  CP1255 MS-HEBR WINDOWS-1255
  CP1256 MS-ARAB WINDOWS-1256
  CP1257 WINBALTRIM WINDOWS-1257
  CP1258 WINDOWS-1258
  850 CP850 IBM850 CSPC850MULTILINGUAL
  862 CP862 IBM862 CSPC862LATINHEBREW
  866 CP866 IBM866 CSIBM866
  MAC MACINTOSH MACROMAN CSMACINTOSH
  MACCENTRALEUROPE
  MACICELAND
  MACCROATIAN
  MACROMANIA
  MACCYRILLIC
  MACUKRAINE
  MACGREEK
  MACTURKISH
  MACHEBREW
  MACARABIC
  MACTHAI
  HP-ROMAN8 R8 ROMAN8 CSHPROMAN8
  NEXTSTEP
  ARMSCII-8
  GEORGIAN-ACADEMY
  GEORGIAN-PS
  KOI8-T
  MULELAO-1
  CP1133 IBM-CP1133
  ISO-IR-166 TIS-620 TIS620 TIS620-0 TIS620.2529-1 TIS620.2533-0 TIS620.2533-1
  CP874 WINDOWS-874
  VISCII VISCII1.1-1 CSVISCII
  TCVN TCVN-5712 TCVN5712-1 TCVN5712-1:1993
  ISO-IR-14 ISO646-JP JIS_C6220-1969-RO JP CSISO14JISC6220RO
  JISX0201-1976 JIS_X0201 X0201 CSHALFWIDTHKATAKANA
  ISO-IR-87 JIS0208 JIS_C6226-1983 JIS_X0208 JIS_X0208-1983 JIS_X0208-1990 X0208 CSISO87JISX0208
  ISO-IR-159 JIS_X0212 JIS_X0212-1990 JIS_X0212.1990-0 X0212 CSISO159JISX02121990
  CN GB_1988-80 ISO-IR-57 ISO646-CN CSISO57GB1988
  CHINESE GB_2312-80 ISO-IR-58 CSISO58GB231280
  CN-GB-ISOIR165 ISO-IR-165
  ISO-IR-149 KOREAN KSC_5601 KS_C_5601-1987 KS_C_5601-1989 CSKSC56011987
  EUC-JP EUCJP EXTENDED_UNIX_CODE_PACKED_FORMAT_FOR_JAPANESE CSEUCPKDFMTJAPANESE
  MS_KANJI SHIFT-JIS SHIFT_JIS SJIS CSSHIFTJIS
  CP932
  ISO-2022-JP CSISO2022JP
  ISO-2022-JP-1
  ISO-2022-JP-2 CSISO2022JP2
  CN-GB EUC-CN EUCCN GB2312 CSGB2312
  CP936 GBK MS936 WINDOWS-936
  GB18030
  ISO-2022-CN CSISO2022CN
  ISO-2022-CN-EXT
  HZ HZ-GB-2312
  EUC-TW EUCTW CSEUCTW
  BIG-5 BIG-FIVE BIG5 BIGFIVE CN-BIG5 CSBIG5
  CP950
  BIG5-HKSCS BIG5HKSCS
  EUC-KR EUCKR CSEUCKR
  CP949 UHC
  CP1361 JOHAB
  ISO-2022-KR CSISO2022KR
  CP856
  CP922
  CP943
  CP1046
  CP1124
  CP1129
  CP1161 IBM-1161 IBM1161 CSIBM1161
  CP1162 IBM-1162 IBM1162 CSIBM1162
  CP1163 IBM-1163 IBM1163 CSIBM1163
  DEC-KANJI
  DEC-HANYU
  437 CP437 IBM437 CSPC8CODEPAGE437
  CP737
  CP775 IBM775 CSPC775BALTIC
  852 CP852 IBM852 CSPCP852
  CP853
  855 CP855 IBM855 CSIBM855
  857 CP857 IBM857 CSIBM857
  CP858
  860 CP860 IBM860 CSIBM860
  861 CP-IS CP861 IBM861 CSIBM861
  863 CP863 IBM863 CSIBM863
  CP864 IBM864 CSIBM864
  865 CP865 IBM865 CSIBM865
  869 CP-GR CP869 IBM869 CSIBM869
  CP1125
  EUC-JISX0213
  SHIFT_JISX0213
  ISO-2022-JP-3
  ISO-IR-230 TDS565
  RISCOS-LATIN1
]
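
As an illustration of unac_string, here is a minimal sketch. It assumes the charset names ISO-8859-1 and UTF-8 are available in the local iconv (both appear in the list above), and it passes the input as raw bytes in the named charset. The variable names are only illustrative.

```perl
use strict;
use warnings;
use Text::Unaccent;

# The French word "ete" with acute accents, written as raw bytes
# in two different character sets.
my $latin1_bytes = "\xe9t\xe9";          # ISO-8859-1 encoding
my $utf8_bytes   = "\xc3\xa9t\xc3\xa9";  # UTF-8 encoding

# The first argument names the charset of the input; the result
# comes back in that same charset.
my $from_latin1 = unac_string('ISO-8859-1', $latin1_bytes);
my $from_utf8   = unac_string('UTF-8',      $utf8_bytes);

print "$from_latin1\n";
print "$from_utf8\n";
```

Both calls should yield the same unaccented ASCII text, since the accented letters are replaced by their base letters regardless of the input charset.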

$unaccented = unac_string_utf16($string)
Return the unaccented equivalent of the string $string. The character set of $string must be UTF-16.
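
The following sketch shows one way to call unac_string_utf16 from a Perl character string. It uses the core Encode module to produce and read back the UTF-16 bytes. Using big-endian UTF-16 without a byte order mark (UTF-16BE) is an assumption made here, not something this document states; check it against your build of the unac library.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use Text::Unaccent;

# A Perl character string containing accented letters.
my $text = "\x{e9}t\x{e9}";

# Assumption: the library expects big-endian UTF-16 with no BOM.
my $utf16_in  = encode('UTF-16BE', $text);
my $utf16_out = unac_string_utf16($utf16_in);

# Decode the result back into a Perl character string.
my $plain = decode('UTF-16BE', $utf16_out);
print "$plain\n";
```

When the input is already in another charset, unac_string is usually simpler because it performs the UTF-16 round trip itself.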

$version = unac_version()
Return the version of the unac library used by this perl module.

unac_debug($level)
Set the debug level. Messages are printed on stderr. Possible debug levels are:

$Text::Unaccent::DEBUG_NONE
Silent.

$Text::Unaccent::DEBUG_LOW
Human-readable messages.

$Text::Unaccent::DEBUG_HIGH
Detailed and very verbose information.
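
A short sketch of how the debug levels might be used around a single call follows. It relies only on the functions and package variables documented above; the sample input and the choice to restore DEBUG_NONE afterwards are illustrative.

```perl
use strict;
use warnings;
use Text::Unaccent;

# Report which unac library this build of the module uses.
my $version = unac_version();
print "unac library version: $version\n";

# Turn on human-readable messages (printed on stderr) for one call.
unac_debug($Text::Unaccent::DEBUG_LOW);
my $unaccented = unac_string('ISO-8859-1', "\xe9t\xe9");

# Go back to silent operation.
unac_debug($Text::Unaccent::DEBUG_NONE);

print "$unaccented\n";
```

Because the messages go to stderr, they can be separated from normal output by redirecting stderr to a file when running the script.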


AUTHOR

Loic Dachary (loic@senga.org) http://www.senga.org/unac/


SEE ALSO

iconv(1), unac(3).
