That isUTF8 function is a killer...
Wouldn't something like
if ( preg_match( '~(\x00[\x80-\xff]|[\x00-\x07][\x00-\xff])~', $string ) ) { /* is utf */ }
be a lot more efficient? It doesn't take all the ranges into account, but it has to be a better method and a simple start, since it quits on the first successful match. Think of encoding and decoding a 1 MB string: not good. I'm having to work with 20+ MB XML files.
utf8_encode
(PHP 4, PHP 5)
utf8_encode — Encodes an ISO-8859-1 string to UTF-8
Description
string utf8_encode ( string $data )
This function encodes the string data to UTF-8, and returns the encoded version. UTF-8 is a standard mechanism used by Unicode for encoding wide character values into a byte stream. UTF-8 is transparent to plain ASCII characters, is self-synchronized (meaning it is possible for a program to figure out where in the bytestream characters start) and can be used with normal string comparison functions for sorting and such. PHP encodes UTF-8 characters in up to four bytes, like this:
| bytes | bits | representation |
|---|---|---|
| 1 | 7 | 0bbbbbbb |
| 2 | 11 | 110bbbbb 10bbbbbb |
| 3 | 16 | 1110bbbb 10bbbbbb 10bbbbbb |
| 4 | 21 | 11110bbb 10bbbbbb 10bbbbbb 10bbbbbb |
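As a rough illustration of the table above, the following sketch packs a Unicode code point into the one- to four-byte layouts shown. The function name `codepoint_to_utf8_sketch` is invented for this example and is not part of PHP; to keep it short, surrogate code points are not rejected.

```php
<?php
// Hypothetical helper (not a PHP built-in): encode one code point as UTF-8,
// following the bit layouts in the table above.
function codepoint_to_utf8_sketch($cp)
{
    if ($cp < 0 || $cp > 0x10FFFF) {
        return false;                        // outside the Unicode range
    }
    if ($cp < 0x80) {                        // 1 byte:  0bbbbbbb
        return chr($cp);
    }
    if ($cp < 0x800) {                       // 2 bytes: 110bbbbb 10bbbbbb
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {                     // 3 bytes: 1110bbbb 10bbbbbb 10bbbbbb
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes: 11110bbb 10bbbbbb 10bbbbbb 10bbbbbb
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(codepoint_to_utf8_sketch(0xE9)), "\n";   // c3a9   (e with acute accent)
echo bin2hex(codepoint_to_utf8_sketch(0x20AC)), "\n"; // e282ac (euro sign)
```

For the code points U+0000 to U+00FF this should produce the same bytes that utf8_encode() returns for the corresponding ISO-8859-1 byte.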
Parameters
- data: An ISO-8859-1 string.
Return Values
Returns the UTF-8 translation of data.
utf8_encode
www.qaiser.net
17-Apr-2008 03:56
renardo13 at free dot fr
01-Apr-2008 01:56
another nice way to implement an isUTF8 function ...
<?php
function isUTF8($string)
{
return (utf8_encode(utf8_decode($string)) == $string);
}
?>
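One caveat with the round trip above: it only holds when every character fits into ISO-8859-1, so a valid UTF-8 string containing, say, a euro sign comes back as "?" and fails the comparison. A possible alternative, sketched here under the assumption that PCRE was built with UTF-8 support (the usual case), is to let preg_match() validate the subject via the /u modifier. The function name is made up for the example.

```php
<?php
// Hypothetical name; relies on PCRE rejecting malformed UTF-8 subjects
// when the /u modifier is set.
function is_utf8_pcre_sketch($string)
{
    // The empty pattern matches any subject, so the result only reflects
    // whether PCRE accepted the subject as well-formed UTF-8.
    return preg_match('//u', $string) === 1;
}

var_dump(is_utf8_pcre_sketch("Gr\xC3\xB6\xC3\x9Fe")); // bool(true):  UTF-8 bytes
var_dump(is_utf8_pcre_sketch("Gr\xF6\xDFe"));         // bool(false): ISO-8859-1 bytes
```

This makes a single pass over the string and does not build two converted copies, which may matter for the multi-megabyte inputs mentioned in other notes.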
tacchete at gmail dot com
13-Dec-2007 12:35
Known problem with the Byte Order Mark (BOM) and header() in the pages of a site.
For example, it shows up when sending headers, or when producing dynamic output in an encoding other than UTF-8 by means of XSLT (<xsl:output encoding="windows-1251"/>).
To strip every BOM from the text of a page:
1. remove the BOM from the main file;
2. register an output-buffer callback function:
<?php
header('Content-Type: text/html; charset=utf-8');
ob_start('ob');
function ob($buffer)
{
return str_replace("\xef\xbb\xbf", '', $buffer);
}
?>
it will strip the BOM from the output of the included files;
3. stop worrying about BOMs in included files;
4. enjoy.
ethan dot nelson at ltd dot org
07-Nov-2007 01:41
This does the same thing as some of the posts below (minus the keys), but I thought I'd share anyway because it is slightly more elegant. It's also a good example of using references, so that it could be used as a callback function.
function utf_prepare(&$array) {
foreach ($array as $key => &$value) {
if (is_array($value)) {
utf_prepare($value); // plain function, so no $this-> here
} else {
$value = utf8_encode($value);
}
}
unset($value); // break the reference left over from the loop
}
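The same in-place recursion can probably also be left to array_walk_recursive(), which visits every non-array leaf by reference. The sketch below is only an illustration: the function name and the idea of passing the converter in as a callable are assumptions, not something from the note above.

```php
<?php
// Hypothetical helper: apply $convert to every string leaf of a nested array.
// Keys are left untouched, as in the note above.
function convert_strings_recursive_sketch(array &$data, callable $convert)
{
    array_walk_recursive($data, function (&$value) use ($convert) {
        if (is_string($value)) {   // leave ints, floats, null, etc. alone
            $value = $convert($value);
        }
    });
}

// Usage with a stand-in converter; pass 'utf8_encode' (or an iconv wrapper)
// to get the behaviour described above.
$row = array('name' => 'abc', 'tags' => array('x', 7));
convert_strings_recursive_sketch($row, 'strtoupper');
print_r($row);
```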
luka8088 at gmail dot com
22-Jun-2007 03:19
simple HTML to UTF-8 conversion:
function html_to_utf8 ($data)
{
return preg_replace("/\\&\\#([0-9]{3,10})\\;/e", '_html_to_utf8("\\1")', $data);
}
function _html_to_utf8 ($data)
{
if ($data > 127)
{
$i = 5;
while (($i--) > 0)
{
if ($data != ($a = $data % ($p = pow(64, $i))))
{
$ret = chr(base_convert(str_pad(str_repeat(1, $i + 1), 8, "0"), 2, 10) + (($data - $a) / $p));
for ($i; $i > 0; $i--)
$ret .= chr(128 + ((($data % pow(64, $i)) - ($data % ($p = pow(64, $i - 1)))) / $p));
break;
}
}
}
else
$ret = "&#$data;";
return $ret;
}
Example:
echo html_to_utf8("a b &#269; &#263; &#382; &#12371; &#12395; &#12385; &#12431; ()[]{}!#$?* < >");
Output:
a b č ć ž こ に ち わ ()[]{}!#$?* < >
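The /e modifier used above was deprecated in PHP 5.5 and removed in PHP 7, so on current versions the same idea needs a callback. A minimal sketch, assuming it is acceptable to delegate the byte packing to html_entity_decode(); the function name is invented for the example, and like the original it handles decimal entities only and leaves values up to 127 alone.

```php
<?php
// Hypothetical replacement for the /e-based version above.
function html_to_utf8_sketch($data)
{
    return preg_replace_callback('/&#([0-9]{3,10});/', function ($m) {
        if ((int) $m[1] <= 127) {
            return $m[0];            // keep ASCII-range entities as they are
        }
        // html_entity_decode() turns a numeric entity into its UTF-8 bytes;
        // entities it cannot decode are returned unchanged.
        return html_entity_decode($m[0], ENT_QUOTES | ENT_HTML401, 'UTF-8');
    }, $data);
}

echo bin2hex(html_to_utf8_sketch('&#269;')), "\n"; // c48d
```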
hillar dot petersen at gmail dot com
30-May-2007 06:59
In addition to my previous post. If your values are already in utf-8 maybe you want to utf8_encode array keys only. This will do it:
<?php
/**
* (Recursively) utf8_encode all array keys.
*
* @param array $array
* @return array with utf8_encoded keys
*/
function utf8_encode_array_keys($array)
{
$array_type = array_type($array);
if ($array_type == "map")
{
$result_array = array();
foreach($array as $key => $value)
{
if (is_array($value))
{
// recursion
$result_array[utf8_encode($key)] = utf8_encode_array_keys($value);
}
else
{
// value is not an array, no recursion
$result_array[utf8_encode($key)] = $value;
}
}
return $result_array;
}
else if ($array_type == "vector")
{
// do not encode anything, just follow the value if it is an array
$result_array = array();
foreach ($array as $key => $value)
{
if (is_array($value))
{
// recursion
$result_array[$key] = utf8_encode_array_keys($value);
}
else
{
// value is not an array, no recursion
$result_array[$key] = $value;
}
}
return $result_array;
}
return false; // argument is not an array, return false
}
?>
Also note that both this operation (with keys only) and the operation with both keys and values can be reversed by replacing "encode" by "decode".
hillar dot petersen at gmail dot com
29-May-2007 03:06
If you are interested in recursively converting ISO-8859-1-encoded arrays into UTF-8, then this is one way to do it. Could use a small refactor though. (I used it to prepare some ISO-8859-1 arrays for json_encode. Note that for this to work your values and for associative arrays also your keys must be ISO-8859-1-encoded.)
<?php
/**
* (Recursively) utf8_encode each value in an array.
*
* @param array $array
* @return array utf8_encoded
*/
function utf8_encode_array($array)
{
if (is_array($array))
{
$result_array = array();
foreach($array as $key => $value)
{
if (array_type($array) == "map")
{
// encode both key and value
if (is_array($value))
{
// recursion
$result_array[utf8_encode($key)] = utf8_encode_array($value);
}
else
{
// no recursion
if (is_string($value))
{
$result_array[utf8_encode($key)] = utf8_encode($value);
}
else
{
// do not re-encode non-strings, just copy data
$result_array[utf8_encode($key)] = $value;
}
}
}
else if (array_type($array) == "vector")
{
// encode value only
if (is_array($value))
{
// recursion
$result_array[$key] = utf8_encode_array($value);
}
else
{
// no recursion
if (is_string($value))
{
$result_array[$key] = utf8_encode($value);
}
else
{
// do not re-encode non-strings, just copy data
$result_array[$key] = $value;
}
}
}
}
return $result_array;
}
return false; // argument is not an array, return false
}
/**
* Determines array type ("vector" or "map"). Returns false if not an array at all.
* (I hope a native function will be introduced in some future release of PHP, because
* this check is inefficient and quite costly in worst case scenario.)
*
* @param array $array The array to analyze
* @return string array type ("vector" or "map") or false if not an array
*/
function array_type($array)
{
if (is_array($array))
{
$next = 0;
$return_value = "vector"; // we have a vector until proved otherwise
foreach ($array as $key => $value)
{
if ($key !== $next) // strict: a loose comparison can treat a string key such as "a" as equal to 0
{
$return_value = "map"; // we have a map
break;
}
$next++;
}
return $return_value;
}
return false; // not array
}
?>
nikooo adog bk adot ru - Nickolaz
03-May-2007 03:02
You can use this simple code to convert win-1251 into Unicode.
function rus2uni($str,$isTo = true)
{
$arr = array("\xb8"=>'&#x451;',"\xa8"=>'&#x401;'); // ё and Ё in windows-1251
for($i=192;$i<256;$i++)
$arr[chr($i)] = '&#x4'.dechex($i-176).';';
$str =preg_replace(array('@([а-я]) @i','@ ([а-я])@i'),array('$1 ',' $1'),$str);
return strtr($str,$isTo?$arr:array_flip($arr));
}
That is useful for xml_parser (to parse windows-1251 files like utf-8).
18-Apr-2007 05:06
I just reread what I wrote; sorry for the typos, it was a long day.
Here's the rewritten code:
xml_tpl.php
<?php
header("Content-Type: text/html;charset=ISO-8859-1");
print "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n";
$names=array('jack','bob','vanessa','catherine','valerie');
?>
<parent>
<?php foreach($names as $name) {?>
<child name="<?php print $name?>" />
<?php } ?>
</parent>
<?php
function create_xml(){
ob_start();
include "xml_tpl.php";
$trapped_content=ob_get_contents();
ob_end_clean();
$file_path= "./somefile.xml";
$file_handle=fopen($file_path,'w');
fwrite($file_handle,utf8_encode($trapped_content));
fclose($file_handle);
}
?>
penda ekoka
17-Apr-2007 07:15
Creating UTF-8 XML files:
This is something that has wasted a lot of my time; I hope it will spare you the headaches.
My method consists of creating an XML template that looks like this (the template is probably optional; I'm sure you can use good ol' print or echo statements):
xml_tpl.php
<?php
header("Content-Type: text/html;charset=ISO-8859-1");
print "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n";
$names=array('jack','bob','vanessa','catherine','valerie');
?>
<parent>
<?php foreach($names as $name) {?>
<child name="<?php print $name?>" />
<?php } ?>
</parent>
?>
From a function or a method, I include the previous template and trap the output in an output buffer. The buffered content is then written to a file:
<?php
function create_xml(){
ob_start();
include "xml_php.php";
$trapped_content=ob_get_contents();
ob_end_clean();
$file_path= "./somefile.xml";
$file_handle=fopen($somefile,'w');
fwrite($file_handle,utf8_encode($trapped_content));
}
?>
Some side notes:
- Note that the utf8_encode() call goes inside the fwrite() call.
- When troubleshooting, make sure to transfer text files (XML included) and scripts in ASCII mode when using FTP. For some unknown reason my FTP client did not have XML set as an ASCII transfer candidate and was automatically transferring the files in binary. That little "feature" ended up costing me hours of frustration, as the encoding information would just "vanish" during transfer, and I kept scratching my head as to why manually created UTF-8 files were not behaving as they should.
28-Mar-2007 11:07
<?php
function unicon($str, $to_uni = true) {
$cp = Array (
"А" => "&#1040;", "а" => "&#1072;",
"Б" => "&#1041;", "б" => "&#1073;",
"В" => "&#1042;", "в" => "&#1074;",
"Г" => "&#1043;", "г" => "&#1075;",
"Д" => "&#1044;", "д" => "&#1076;",
"Е" => "&#1045;", "е" => "&#1077;",
"Ё" => "&#1025;", "ё" => "&#1105;",
"Ж" => "&#1046;", "ж" => "&#1078;",
"З" => "&#1047;", "з" => "&#1079;",
"И" => "&#1048;", "и" => "&#1080;",
"Й" => "&#1049;", "й" => "&#1081;",
"К" => "&#1050;", "к" => "&#1082;",
"Л" => "&#1051;", "л" => "&#1083;",
"М" => "&#1052;", "м" => "&#1084;",
"Н" => "&#1053;", "н" => "&#1085;",
"О" => "&#1054;", "о" => "&#1086;",
"П" => "&#1055;", "п" => "&#1087;",
"Р" => "&#1056;", "р" => "&#1088;",
"С" => "&#1057;", "с" => "&#1089;",
"Т" => "&#1058;", "т" => "&#1090;",
"У" => "&#1059;", "у" => "&#1091;",
"Ф" => "&#1060;", "ф" => "&#1092;",
"Х" => "&#1061;", "х" => "&#1093;",
"Ц" => "&#1062;", "ц" => "&#1094;",
"Ч" => "&#1063;", "ч" => "&#1095;",
"Ш" => "&#1064;", "ш" => "&#1096;",
"Щ" => "&#1065;", "щ" => "&#1097;",
"Ъ" => "&#1066;", "ъ" => "&#1098;",
"Ы" => "&#1067;", "ы" => "&#1099;",
"Ь" => "&#1068;", "ь" => "&#1100;",
"Э" => "&#1069;", "э" => "&#1101;",
"Ю" => "&#1070;", "ю" => "&#1102;",
"Я" => "&#1071;", "я" => "&#1103;"
);
if ($to_uni) {
$str = strtr($str, $cp);
} else {
// swap keys and values in one step instead of searching the table per entry
$str = strtr($str, array_flip($cp));
}
return $str;
}
?>
emze at donazga dot net
17-Dec-2006 05:42
/*
Every function posted so far is either incomplete or resource-hungry. Here are two: integer to UTF-8 sequence (i3u) and UTF-8 sequence to integer (u3i). Below them is a code snippet that checks their behaviour at the range boundaries.
Someday they might be hardcoded into PHP...
*/
function i3u($i) { // converts an integer code point (UCS-2 or UCS-4 value) to its UTF-8 sequence
$i=(int)$i; // integer?
if ($i<0) return false; // positive?
if ($i<=0x7f) return chr($i); // range 0
if (($i & 0x7fffffff) <> $i) return '?'; // 31 bit?
if ($i<=0x7ff) return chr(0xc0 | ($i >> 6)) . chr(0x80 | ($i & 0x3f));
if ($i<=0xffff) return chr(0xe0 | ($i >> 12)) . chr(0x80 | ($i >> 6) & 0x3f)
. chr(0x80 | $i & 0x3f);
if ($i<=0x1fffff) return chr(0xf0 | ($i >> 18)) . chr(0x80 | ($i >> 12) & 0x3f)
. chr(0x80 | ($i >> 6) & 0x3f) . chr(0x80 | $i & 0x3f);
if ($i<=0x3ffffff) return chr(0xf8 | ($i >> 24)) . chr(0x80 | ($i >> 18) & 0x3f)
. chr(0x80 | ($i >> 12) & 0x3f) . chr(0x80 | ($i >> 6) & 0x3f) . chr(0x80 | $i & 0x3f);
return chr(0xfc | ($i >> 30)) . chr(0x80 | ($i >> 24) & 0x3f) . chr(0x80 | ($i >> 18) & 0x3f)
. chr(0x80 | ($i >> 12) & 0x3f) . chr(0x80 | ($i >> 6) & 0x3f) . chr(0x80 | $i & 0x3f);
}
function u3i($s,$strict=1) { // returns integer on valid UTF-8 seq, NULL on empty, else FALSE
// NOT strict: takes only DATA bits, present or not; strict: length and bits checking
if ($s=='') return NULL;
$l=strlen($s); $o=ord($s[0]);
if ($o <= 0x7f && $l==1) return $o;
if ($l>6 && $strict) return false;
if ($strict) for ($i=1;$i<$l;$i++) if (ord($s[$i]) > 0xbf || ord($s[$i])< 0x80) return false;
if ($o < 0xc2) return false; // no-go even if strict=0
$s=str_pad($s,6,"\x80"); // missing trailing bytes count as zero data bits when not strict
// each branch compares the length (==); the original assigned to $l (=) by mistake
if ($o <= 0xdf) return ($l==2 || !$strict) ? (($o & 0x1f) << 6 | (ord($s[1]) & 0x3f)) : false;
if ($o <= 0xef) return ($l==3 || !$strict) ? (($o & 0x0f) << 12 | (ord($s[1]) & 0x3f) << 6
| (ord($s[2]) & 0x3f)) : false;
if ($o <= 0xf7) return ($l==4 || !$strict) ? (($o & 0x07) << 18 | (ord($s[1]) & 0x3f) << 12
| (ord($s[2]) & 0x3f) << 6 | (ord($s[3]) & 0x3f)) : false;
if ($o <= 0xfb) return ($l==5 || !$strict) ? (($o & 0x03) << 24 | (ord($s[1]) & 0x3f) << 18
| (ord($s[2]) & 0x3f) << 12 | (ord($s[3]) & 0x3f) << 6 | (ord($s[4]) & 0x3f)) : false;
if ($o <= 0xfd) return ($l==6 || !$strict) ? (($o & 0x01) << 30 | (ord($s[1]) & 0x3f) << 24
| (ord($s[2]) & 0x3f) << 18 | (ord($s[3]) & 0x3f) << 12
| (ord($s[4]) & 0x3f) << 6 | (ord($s[5]) & 0x3f)) : false;
return false;
}
// boundary behavior checking
$do=array(0x7f,0x7ff,0xffff,0x1fffff,0x3ffffff,0x7fffffff);
foreach ($do as $ii) for ($i=$ii;$i<=$ii+1; $i++) {
$o=i3u($i);
for ($j=0;$j<strlen($o);$j++) print "O[$j]=" . sprintf('%08b',ord($o[$j])) . ", ";
print "c=$i, o=[$o].\n";
print "Back: [$o] => [" . u3i($o) . "]\n";
}
sadikkeskin at hotmail dot com
21-Nov-2006 10:49
I wrote a function to convert UTF-8 to ISO-8859-9 (Turkish). It is very useful for Ajax responses, and you can apply the same approach to other languages.
<?php
function str_encode ($string,$to="iso-8859-9",$from="utf8") {
if($to=="iso-8859-9" && $from=="utf8"){
$str_array = array(
chr(196).chr(177) => chr(253),
chr(196).chr(176) => chr(221),
chr(195).chr(182) => chr(246),
chr(195).chr(150) => chr(214),
chr(195).chr(167) => chr(231),
chr(195).chr(135) => chr(199),
chr(197).chr(159) => chr(254),
chr(197).chr(158) => chr(222),
chr(196).chr(159) => chr(240),
chr(196).chr(158) => chr(208),
chr(195).chr(188) => chr(252),
chr(195).chr(156) => chr(220)
);
return str_replace(array_keys($str_array), array_values($str_array), $string);
}
return $string;
}
?>
genert at adsuk dot com
01-Oct-2006 06:23
If you encoded data with the utf8_encode function and would like to decode it in JavaScript, use the library found here: http://www.webtoolkit.info/. There is an encoder too.
27-Sep-2006 09:30
In reply to Cundle:
Note: The BOM is completely unnecessary in UTF-8. UTF-8 is interpreted the same way regardless of endianness, e.g. Λ (U+039B, GREEK CAPITAL LETTER LAMDA) is represented as the octets 0xCE, 0x9B, always in that order.
Extra note: UTF-16 and UCS-2 are different. The same letter would be encoded as 0x03 0x9B on big-endian (e.g. Motorola) architecture, but 0x9B 0x03 on little-endian (e.g Intel) architecture.
But in any case, there's nothing wrong with putting a BOM at the beginning of a UTF-8 encoded file. It is just treated as U+FEFF Zero Width No-Break Space.
James Cundle
18-Jul-2006 03:33
I had some difficulty finding a way to easily write UTF-8 files with the byte order mark included. This is the simple solution I have come up with:
<?php
function writeUTF8File($filename,$content) {
$dhandle=fopen($filename,"w");
# Now UTF-8 - Add byte order mark
fwrite($dhandle, pack("CCC",0xef,0xbb,0xbf));
fwrite($dhandle,$content);
fclose($dhandle);
}
?>
When you read the file back in using fopen, the BOM will also be there. To remove it, I also wrote the following function:
<?php
function removeBOM($str=""){
if(substr($str, 0,3) == pack("CCC",0xef,0xbb,0xbf)) {
$str=substr($str, 3);
}
return $str;
}
?>
rocketman
16-Mar-2006 12:46
If you are looking for a way to replace special characters with the hex values of their UTF-8 bytes as numeric entities (e.g. for Webservice-Security/WSS4J compliance), you might use this:
$txt = "Größe"; // assumed to be ISO-8859-1 encoded
$utf8 ='';
$max = strlen($txt);
for ($i = 0; $i < $max; $i++) {
if ($txt[$i] == "&"){
$neu = "&#x26;";
}
elseif ((ord($txt[$i]) < 32) or (ord($txt[$i]) > 127)){
$neu = urlencode(utf8_encode($txt[$i]));
$neu = preg_replace('#\%(..)\%(..)\%(..)#','&#x\1;&#x\2;&#x\3;',$neu);
$neu = preg_replace('#\%(..)\%(..)#','&#x\1;&#x\2;',$neu);
$neu = preg_replace('#\%(..)#','&#x\1;',$neu);
}
else {
$neu = $txt[$i];
}
$utf8 .= $neu;
} // for $i
$textnew = $utf8;
In this example $textnew will be "Gr&#xC3;&#xB6;&#xC3;&#x9F;e".
mailing at jcn50 dot com
21-Jan-2006 06:40
I recommend using this alternative for every language:
$new=mb_convert_encoding($s,"UTF-8","auto");
Don't forget to set all your pages to "utf-8" encoding, otherwise just use HTML entities.
jcn50.
migueldiaz at gennio dot com
14-Dec-2005 05:23
Here's my function to find out whether a string is encoded in UTF-8.
If we UTF-8-encode a string or text file that is already encoded in UTF-8, we can expect to find the character 'ƒ' (ALT+159, byte 0x83) in the final string.
<?php
function isUTF8($string)
{
$string_utf8 = utf8_encode($string);
if( strpos($string_utf8,"\x83",0) !== false ) // "\x83" is 'ƒ' (ALT+159) in Windows-1252
return true; // the original string was utf8
else
return false; // otherwise
}
?>
regards
Miguel Díaz
04-Nov-2005 10:34
// Reads a file story.txt ascii (as typed on keyboard)
// converts it to Georgian character using utf8 encoding
// if I am correct(?) just as it should be when typed on Georgian computer
// it outputs it as an html file
//
// http://www.comweb.nl/keys_to_georgian.html
// http://www.comweb.nl/keys_to_georgian.php
// http://www.comweb.nl/story.txt
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<HTML>
<HEAD>
<TITLE>keys to unicode code</TITLE>
// this meta tag is needed
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" >
// note: the Sylfaen font seems to be installed by default on Windows XP
// It supports Georgian
<style TYPE="text/css">
<!--
body {font-family:sylfaen; }
-->
</style>
</HEAD>
<BODY>
<?php
$eng=array(97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,
112,113,114,115,116,117,118,119,120,121,122,87,82,84,83,
67,74,90);
$geo=array(4304,4305,4330,4307,4308,4324,4306,4336,4312,4335,4313,
4314,4315,4316,4317,4318,4325,4320,4321,4322,4323,4309,
4332,4334,4327,4310,4333,4326,4311,4328,4329,4319,4331,
91,93,59,39,44,46,96);
$fc=file("story.txt");
foreach($fc as $line)
{
$spacestart=1;
for ($i=0; $i<strlen($line); $i+=1)
{
$character=ord(substr($line,$i,1));
$found=0;
for ($k=0; $k<count($eng); $k+=1)
{
if ($eng[$k]==$character)
{
print code2utf( $geo[$k] );
$found=1;
}
}
if ($found==0)
{
if ($character==126 || $character==32 || $character==10 || $character==9)
{
if ($character==9) { print ' '; }
if ($character==10) { print "<BR>\n"; }
if ($character==32)
{
if ($spacestart==1) {print '&nbsp;'; } else { print " "; }
}
if ($character==126){ print "~"; }
} else
{
print substr($line,$i,1);
}
}
if ($character!=32) { $spacestart=0; }
}
}
/**
* Converts a Unicode code point number into the corresponding UTF-8 character.
* Function taken from: http://sk2.php.net/manual/en/function.utf8-encode.php#49336
*
* @param int $num
* @return utf8char
*/
function code2utf($num)
{
if($num<128)return chr($num);
if($num<2048)return chr(($num>>6)+192).chr(($num&63)+128);
if($num<65536)return chr(($num>>12)+224).chr((($num>>6)&63)+128).chr(($num&63)+128);
if($num<2097152)return chr(($num>>18)+240).chr((($num>>12)&63)+128).chr((($num>>6)&63)+128) .chr(($num&63)+128);
return '';
}
?>
</BODY>
</HTML>
Janci
04-Nov-2005 12:00
I was searching for a function similar to JavaScript's unescape(). In most cases it is OK to use the urldecode() function, but not if you've got non-ASCII characters in the strings: escape() converts them into %uXXXX sequences that urldecode() cannot handle.
I googled the net and found a function which actually converts these sequences into HTML entities (&#xxx;) that your browser can show correctly. If you're OK with that, the function can be found here: http://pure-essence.net/stuff/code/utf8RawUrlDecode.phps
But it was not OK with me, because I needed a string in my own charset to make some comparisons and other stuff. So I modified the above function and, in conjunction with the code2utf() function mentioned in another note here, managed to achieve my goal:
<?php
/**
* Converts a JavaScript-escaped string back into a string in the specified charset (default is UTF-8).
* Modified function from http://pure-essence.net/stuff/code/utf8RawUrlDecode.phps
*
* @param string $source escaped with JavaScript's escape() function
* @param string $iconv_to destination character set, used as the second parameter of iconv(). Default is UTF-8.
* @return string
*/
function unescape($source, $iconv_to = 'UTF-8') {
$decodedStr = '';
$pos = 0;
$len = strlen ($source);
while ($pos < $len) {
$charAt = substr ($source, $pos, 1);
if ($charAt == '%') {
$pos++;
$charAt = substr ($source, $pos, 1);
if ($charAt == 'u') {
// we got a unicode character
$pos++;
$unicodeHexVal = substr ($source, $pos, 4);
$unicode = hexdec ($unicodeHexVal);
$decodedStr .= code2utf($unicode);
$pos += 4;
}
else {
// we have an escaped ascii character
$hexVal = substr ($source, $pos, 2);
$decodedStr .= chr (hexdec ($hexVal));
$pos += 2;
}
}
else {
$decodedStr .= $charAt;
$pos++;
}
}
if ($iconv_to != "UTF-8") {
$decodedStr = iconv("UTF-8", $iconv_to, $decodedStr);
}
return $decodedStr;
}
/**
* Converts a Unicode code point number into the corresponding UTF-8 character.
* Function taken from: http://sk2.php.net/manual/en/function.utf8-encode.php#49336
*
* @param int $num
* @return utf8char
*/
function code2utf($num){
if($num<128)return chr($num);
if($num<2048)return chr(($num>>6)+192).chr(($num&63)+128);
if($num<65536)return chr(($num>>12)+224).chr((($num>>6)&63)+128).chr(($num&63)+128);
if($num<2097152)return chr(($num>>18)+240).chr((($num>>12)&63)+128).chr((($num>>6)&63)+128) .chr(($num&63)+128);
return '';
}
?>
aktionimskript at gmx dot net
01-Sep-2005 04:52
If you want to pass variables as parameters to a Flash file, I prefer converting the string with utf8_encode() [or preg_replace, or iconv] and then encoding it with urlencode():
<?php
$yourstring="yourstring";
$str_utf8=utf8_encode($yourstring);
$str_encoded=urlencode($str_utf8);
echo "<script language='javascript'>";
echo "parameterForFlash='".$str_encoded."';";
echo "</script>";
?>
Now you can use the variable (parameterForFlash) in the JavaScript (plugin detection) that writes the Flash object/embed.
suttichai at ceforce dot com
28-May-2005 08:26
I use this function to convert Thai text (ISO-8859-11) to UTF-8. In my case it works properly. Please try it if you have problems converting ISO-8859-11 to UTF-8.
function iso8859_11toUTF8($string) {
if ( ! preg_match('/[\xa1-\xff]/', $string) ) // ereg() no longer exists in PHP 7+
return $string;
$iso8859_11 = array(
"\xa1" => "\xe0\xb8\x81",
"\xa2" => "\xe0\xb8\x82",
"\xa3" => "\xe0\xb8\x83",
"\xa4" => "\xe0\xb8\x84",
"\xa5" => "\xe0\xb8\x85",
"\xa6" => "\xe0\xb8\x86",
"\xa7" => "\xe0\xb8\x87",
"\xa8" => "\xe0\xb8\x88",
"\xa9" => "\xe0\xb8\x89",
"\xaa" => "\xe0\xb8\x8a",
"\xab" => "\xe0\xb8\x8b",
"\xac" => "\xe0\xb8\x8c",
"\xad" => "\xe0\xb8\x8d",
"\xae" => "\xe0\xb8\x8e",
"\xaf" => "\xe0\xb8\x8f",
"\xb0" => "\xe0\xb8\x90",
"\xb1" => "\xe0\xb8\x91",
"\xb2" => "\xe0\xb8\x92",
"\xb3" => "\xe0\xb8\x93",
"\xb4" => "\xe0\xb8\x94",
"\xb5" => "\xe0\xb8\x95",
"\xb6" => "\xe0\xb8\x96",
"\xb7" => "\xe0\xb8\x97",
"\xb8" => "\xe0\xb8\x98",
"\xb9" => "\xe0\xb8\x99",
"\xba" => "\xe0\xb8\x9a",
"\xbb" => "\xe0\xb8\x9b",
"\xbc" => "\xe0\xb8\x9c",
"\xbd" => "\xe0\xb8\x9d",
"\xbe" => "\xe0\xb8\x9e",
"\xbf" => "\xe0\xb8\x9f",
"\xc0" => "\xe0\xb8\xa0",
"\xc1" => "\xe0\xb8\xa1",
"\xc2" => "\xe0\xb8\xa2",
"\xc3" => "\xe0\xb8\xa3",
"\xc4" => "\xe0\xb8\xa4",
"\xc5" => "\xe0\xb8\xa5",
"\xc6" => "\xe0\xb8\xa6",
"\xc7" => "\xe0\xb8\xa7",
"\xc8" => "\xe0\xb8\xa8",
"\xc9" => "\xe0\xb8\xa9",
"\xca" => "\xe0\xb8\xaa",
"\xcb" => "\xe0\xb8\xab",
"\xcc" => "\xe0\xb8\xac",
"\xcd" => "\xe0\xb8\xad",
"\xce" => "\xe0\xb8\xae",
"\xcf" => "\xe0\xb8\xaf",
"\xd0" => "\xe0\xb8\xb0",
"\xd1" => "\xe0\xb8\xb1",
"\xd2" => "\xe0\xb8\xb2",
"\xd3" => "\xe0\xb8\xb3",
"\xd4" => "\xe0\xb8\xb4",
"\xd5" => "\xe0\xb8\xb5",
"\xd6" => "\xe0\xb8\xb6",
"\xd7" => "\xe0\xb8\xb7",
"\xd8" => "\xe0\xb8\xb8",
"\xd9" => "\xe0\xb8\xb9",
"\xda" => "\xe0\xb8\xba",
"\xdf" => "\xe0\xb8\xbf",
"\xe0" => "\xe0\xb9\x80",
"\xe1" => "\xe0\xb9\x81",
"\xe2" => "\xe0\xb9\x82",
"\xe3" => "\xe0\xb9\x83",
"\xe4" => "\xe0\xb9\x84",
"\xe5" => "\xe0\xb9\x85",
"\xe6" => "\xe0\xb9\x86",
"\xe7" => "\xe0\xb9\x87",
"\xe8" => "\xe0\xb9\x88",
"\xe9" => "\xe0\xb9\x89",
"\xea" => "\xe0\xb9\x8a",
"\xeb" => "\xe0\xb9\x8b",
"\xec" => "\xe0\xb9\x8c",
"\xed" => "\xe0\xb9\x8d",
"\xee" => "\xe0\xb9\x8e",
"\xef" => "\xe0\xb9\x8f",
"\xf0" => "\xe0\xb9\x90",
"\xf1" => "\xe0\xb9\x91",
"\xf2" => "\xe0\xb9\x92",
"\xf3" => "\xe0\xb9\x93",
"\xf4" => "\xe0\xb9\x94",
"\xf5" => "\xe0\xb9\x95",
"\xf6" => "\xe0\xb9\x96",
"\xf7" => "\xe0\xb9\x97",
"\xf8" => "\xe0\xb9\x98",
"\xf9" => "\xe0\xb9\x99",
"\xfa" => "\xe0\xb9\x9a",
"\xfb" => "\xe0\xb9\x9b"
);
$string=strtr($string,$iso8859_11);
return $string;
}
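Because the Thai block of ISO-8859-11 lines up with U+0E01 to U+0E5B at a fixed offset, the table above could in principle be computed instead of typed out. The following is only a sketch of that idea: the function name is invented here, and it covers the same byte ranges as the table (0xA1-0xDA and 0xDF-0xFB), passing everything else through unchanged.

```php
<?php
// Hypothetical computed variant of the lookup table above.
function iso8859_11_to_utf8_sketch($string)
{
    $out = '';
    $len = strlen($string);
    for ($i = 0; $i < $len; $i++) {
        $b = ord($string[$i]);
        if (($b >= 0xA1 && $b <= 0xDA) || ($b >= 0xDF && $b <= 0xFB)) {
            $cp = 0x0E00 + ($b - 0xA0);   // byte 0xA1 maps to U+0E01, and so on
            // every Thai code point needs the three-byte UTF-8 form
            $out .= chr(0xE0 | ($cp >> 12))
                  . chr(0x80 | (($cp >> 6) & 0x3F))
                  . chr(0x80 | ($cp & 0x3F));
        } else {
            $out .= $string[$i];          // ASCII and unmapped bytes stay as they are
        }
    }
    return $out;
}

echo bin2hex(iso8859_11_to_utf8_sketch("\xa1")), "\n"; // e0b881
```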
Suttichai Mesaard-www.ceforce.com
bisqwit at iki dot fi
20-May-2005 09:15
For reference, it may be insightful to point out that:
utf8_encode($s)
is actually identical to:
recode_string('latin1..utf8', $s)
and:
iconv('iso-8859-1', 'utf-8', $s)
That is, utf8_encode is a specialized case of character set conversions.
If the string to be converted to UTF-8 is in something other than ISO-8859-1 (such as ISO-8859-2 (Polish/Croatian)), you should use recode_string() or iconv() rather than trying to devise complex str_replace statements.
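To make the difference concrete, here is a small sketch that assumes the iconv extension is available (it is enabled by default in most builds). The wrapper name `to_utf8_sketch` is made up for this example; the sample bytes spell the Polish word "łódź" in ISO-8859-2.

```php
<?php
// Hypothetical thin wrapper around iconv(); $from is the source charset name.
function to_utf8_sketch($s, $from)
{
    return iconv($from, 'UTF-8', $s);
}

$bytes = "\xB3\xF3d\xBC";   // "łódź" in ISO-8859-2

// Declaring the real source charset gives the intended letters...
echo bin2hex(to_utf8_sketch($bytes, 'ISO-8859-2')), "\n"; // c582c3b364c5ba

// ...while treating the same bytes as ISO-8859-1 (which is what utf8_encode()
// assumes) yields different characters: superscript three, o acute, d, one quarter.
echo bin2hex(to_utf8_sketch($bytes, 'ISO-8859-1')), "\n"; // c2b3c3b364c2bc
```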
JF Sebastian
09-Apr-2005 11:54
The following Perl regular expression tests if a string is well-formed Unicode UTF-8 (Broken up after each | since long lines are not permitted here. Please join as a single line, no spaces, before use.):
^([\x00-\x7f]|
[\xc2-\xdf][\x80-\xbf]|
\xe0[\xa0-\xbf][\x80-\xbf]|
[\xe1-\xec][\x80-\xbf]{2}|
\xed[\x80-\x9f][\x80-\xbf]|
[\xee-\xef][\x80-\xbf]{2}|
\xf0[\x90-\xbf][\x80-\xbf]{2}|
[\xf1-\xf3][\x80-\xbf]{3}|
\xf4[\x80-\x8f][\x80-\xbf]{2})*$
NOTE: This strictly follows the Unicode standard 4.0, as described in chapter 3.9, table 3-6, "Well-formed UTF-8 byte sequences" ( http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf#G31703 ).
ISO-10646, a super-set of Unicode, uses UTF-8 (there called "UCS", see http://www.unicode.org/faq/utf_bom.html#1 ) in a relaxed variant that supports a 31-bit space encoded into up to six bytes instead of Unicode's 21 bits in up to four bytes. To check for ISO-10646 UTF-8, use the following Perl regular expression (again, broken up, see above):
^([\x00-\x7f]|
[\xc0-\xdf][\x80-\xbf]|
[\xe0-\xef][\x80-\xbf]{2}|
[\xf0-\xf7][\x80-\xbf]{3}|
[\xf8-\xfb][\x80-\xbf]{4}|
[\xfc-\xfd][\x80-\xbf]{5})*$
The following function may be used with above expressions for a quick UTF-8 test, e.g. to distinguish ISO-8859-1-data from UTF-8-data if submitted from a <form accept-charset="utf-8,iso-8859-1" method=..>.
function is_utf8($string) {
return (preg_match('/[insert regular expression here]/', $string) === 1);
}
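For convenience, here is what the first (strict Unicode 4.0) expression might look like once joined and dropped into that function. This is a sketch rather than a drop-in: the function name is changed to mark it as an example, the fourth-to-last alternative is written with the `\xf0` lead byte, `\A`/`\z` replace `^`/`$`, and the non-capturing group with a possessive quantifier is my own addition to limit backtracking.

```php
<?php
// Hypothetical assembled version of the strict pattern above.
function is_utf8_regex_sketch($string)
{
    $pattern = '/\A(?:'
        . '[\x00-\x7f]'
        . '|[\xc2-\xdf][\x80-\xbf]'
        . '|\xe0[\xa0-\xbf][\x80-\xbf]'
        . '|[\xe1-\xec][\x80-\xbf]{2}'
        . '|\xed[\x80-\x9f][\x80-\xbf]'
        . '|[\xee-\xef][\x80-\xbf]{2}'
        . '|\xf0[\x90-\xbf][\x80-\xbf]{2}'
        . '|[\xf1-\xf3][\x80-\xbf]{3}'
        . '|\xf4[\x80-\x8f][\x80-\xbf]{2}'
        . ')*+\z/';
    // preg_match() returns 1 on a match, 0 on no match, false on error
    return preg_match($pattern, $string) === 1;
}

var_dump(is_utf8_regex_sketch("\xC3\xA9")); // bool(true)
var_dump(is_utf8_regex_sketch("\xC0\x80")); // bool(false): overlong encoding
```

On very long inputs a pattern like this may still run into PCRE limits, in which case preg_match() returns false and the function reports the string as invalid.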
http://iubito.free.fr
10-Mar-2005 07:57
Here's a function I made to know if one string or textfile is already encoded in UTF8 :
<?php
/**
* Returns <kbd>true</kbd> if the string or array of string is encoded in UTF8.
*
* Example of use. If you want to know if a file is saved in UTF8 format :
* <code> $array = file('one file.txt');
* $isUTF8 = isUTF8($array);
* if (!$isUTF8) --> we need to apply utf8_encode() to be in UTF8
* else --> we are in UTF8 :)
* </code>
* @param mixed A string, or an array from a file() function.
* @return boolean
*/
function isUTF8($string)
{
if (is_array($string))
{
$enc = implode('', $string);
return substr($enc, 0, 3) === "\xef\xbb\xbf"; // true only if the data starts with the UTF-8 BOM
}
else
{
return (utf8_encode(utf8_decode($string)) == $string);
}
}
?>
Denis G.
24-Feb-2005 01:32
Snippet to convert ISO-8859-1 encoded text to UTF-8:
$text= preg_replace ('/([\x80-\xff])/se', "pack (\"C*\", (ord ($1) >> 6) | 0xc0, (ord ($1) & 0x3f) | 0x80)", $text);
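Since the /e modifier was removed in PHP 7, the one-liner above no longer runs on current versions. A possible callback-based equivalent is sketched below; the function name is invented for the example, and the input is assumed to be ISO-8859-1, so every byte from 0x80 to 0xFF becomes a two-byte sequence.

```php
<?php
// Hypothetical callback version of the snippet above.
function latin1_to_utf8_sketch($text)
{
    return preg_replace_callback('/[\x80-\xff]/', function ($m) {
        $byte = ord($m[0]);
        // 110000bb 10bbbbbb: the top two bits go into the lead byte,
        // the low six bits into the continuation byte
        return chr(0xC0 | ($byte >> 6)) . chr(0x80 | ($byte & 0x3F));
    }, $text);
}

echo bin2hex(latin1_to_utf8_sketch("caf\xE9")), "\n"; // 636166c3a9
```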
anonymous at anonymous dot com
24-Jan-2005 10:49
There are a few bugs in the example code below: the limits for the 2-byte and 3-byte ranges should be 2048 and 65536, not 1024 and 32768. Corrected:
function code2utf($num){
if($num<128)return chr($num);
if($num<2048)return chr(($num>>6)+192).chr(($num&63)+128);
if($num<65536)return chr(($num>>12)+224).chr((($num>>6)&63)+128).chr(($num&63)+128);
if($num<2097152)return chr(($num>>18)+240).chr((($num>>12)&63)+128).chr((($num>>6)&63)+128) .chr(($num&63)+128);
return '';
}
schofei at yahoo dot de
11-Jan-2005 11:23
regarding the above code2utf function...
> romans at void dot lv
> 02-Oct-2002 09:59
> Here is optimized function which converts
> binary UTF symbol code into unicoded string....
Thanks for providing your nice conversion code; however, due to some missing parentheses, 4-byte UTF-8 characters are not converted properly.
Here is a corrected version of the code2utf function:
function code2utf($num){
if($num<128)return chr($num);
if($num<1024)return chr(($num>>6)+192).chr(($num&63)+128);
if($num<32768)return chr(($num>>12)+224).chr((($num>>6)&63)+128).chr(($num&63)+128);
if($num<2097152)return chr(($num>>18)+240).chr((($num>>12)&63)+128).chr((($num>>6)&63)+128) .chr(($num&63)+128);
return '';
}
regards
Scho Fei
hrpeters (at) gmx (dot) net
14-Dec-2004 06:46
// Validate Unicode UTF-8 Version 4
// This function takes as reference the table 3.6 found at http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf
// It also flags overlong sequences as errors
function is_validUTF8($str)
{
// values of -1 represent disallowed values for the first byte in current UTF-8
static $trailing_bytes = array (
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,
-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,
-1,-1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
);
$ups = unpack('C*', $str);
if (!($aCnt = count($ups))) return true; // Empty string *is* valid UTF-8
for ($i = 1; $i <= $aCnt;)
{
if (!($tbytes = $trailing_bytes[($b1 = $ups[$i++])])) continue;
if ($tbytes == -1) return false;
$first = true;
while ($tbytes > 0 && $i <= $aCnt)
{
$cbyte = $ups[$i++];
if (($cbyte & 0xC0) != 0x80) return false;
if ($first)
{
switch ($b1)
{
case 0xE0:
if ($cbyte < 0xA0) return false;
break;
case 0xED:
if ($cbyte > 0x9F) return false;
break;
case 0xF0:
if ($cbyte < 0x90) return false;
break;
case 0xF4:
if ($cbyte > 0x8F) return false;
break;
default:
break;
}
$first = false;
}
$tbytes--;
}
if ($tbytes) return false; // incomplete sequence at EOS
}
return true;
}
Mark AT modernbill DOT com
09-Nov-2004 07:56
If you haven't guessed already: If the UTF-8 character has no representation in the ISO-8859-1 codepage, a ? will be returned. You might want to wrap a function around this to make sure you aren't saving a bunch of ???? into your database.
Aidan Kehoe <php-manual at parhasard dot net>
30-Aug-2004 03:05
Here's some code that addresses the issue that Steven describes in the previous comment;
<?php
/* This structure encodes the difference between ISO-8859-1 and Windows-1252,
as a map from the UTF-8 encoding of some ISO-8859-1 control characters to
the UTF-8 encoding of the non-control characters that Windows-1252 places
at the equivalent code points. */
$cp1252_map = array(
"\xc2\x80" => "\xe2\x82\xac", /* EURO SIGN */
"\xc2\x82" => "\xe2\x80\x9a", /* SINGLE LOW-9 QUOTATION MARK */
"\xc2\x83" => "\xc6\x92", /* LATIN SMALL LETTER F WITH HOOK */
"\xc2\x84" => "\xe2\x80\x9e", /* DOUBLE LOW-9 QUOTATION MARK */
"\xc2\x85" => "\xe2\x80\xa6", /* HORIZONTAL ELLIPSIS */
"\xc2\x86" => "\xe2\x80\xa0", /* DAGGER */
"\xc2\x87" => "\xe2\x80\xa1", /* DOUBLE DAGGER */
"\xc2\x88" => "\xcb\x86", /* MODIFIER LETTER CIRCUMFLEX ACCENT */
"\xc2\x89" => "\xe2\x80\xb0", /* PER MILLE SIGN */
"\xc2\x8a" => "\xc5\xa0", /* LATIN CAPITAL LETTER S WITH CARON */
"\xc2\x8b" => "\xe2\x80\xb9", /* SINGLE LEFT-POINTING ANGLE QUOTATION */
"\xc2\x8c" => "\xc5\x92", /* LATIN CAPITAL LIGATURE OE */
"\xc2\x8e" => "\xc5\xbd", /* LATIN CAPITAL LETTER Z WITH CARON */
"\xc2\x91" => "\xe2\x80\x98", /* LEFT SINGLE QUOTATION MARK */
"\xc2\x92" => "\xe2\x80\x99", /* RIGHT SINGLE QUOTATION MARK */
"\xc2\x93" => "\xe2\x80\x9c", /* LEFT DOUBLE QUOTATION MARK */
"\xc2\x94" => "\xe2\x80\x9d", /* RIGHT DOUBLE QUOTATION MARK */
"\xc2\x95" => "\xe2\x80\xa2", /* BULLET */
"\xc2\x96" => "\xe2\x80\x93", /* EN DASH */
"\xc2\x97" => "\xe2\x80\x94", /* EM DASH */
"\xc2\x98" => "\xcb\x9c", /* SMALL TILDE */
"\xc2\x99" => "\xe2\x84\xa2", /* TRADE MARK SIGN */
"\xc2\x9a" => "\xc5\xa1", /* LATIN SMALL LETTER S WITH CARON */
"\xc2\x9b" => "\xe2\x80\xba", /* SINGLE RIGHT-POINTING ANGLE QUOTATION*/
"\xc2\x9c" => "\xc5\x93", /* LATIN SMALL LIGATURE OE */
"\xc2\x9e" => "\xc5\xbe", /* LATIN SMALL LETTER Z WITH CARON */
"\xc2\x9f" => "\xc5\xb8" /* LATIN CAPITAL LETTER Y WITH DIAERESIS*/
);
function cp1252_to_utf8($str) {
global $cp1252_map;
return strtr(utf8_encode($str), $cp1252_map);
}
?>
