
Let's first define some terms to make it easier to understand the following sections. See also the introductory links in the Further Reading section below.
A character is the smallest component of written language that has a semantic value. Examples of characters are letters, ideographs (e.g. Chinese characters), punctuation marks, digits etc.
A character set is a group of characters without associated numerical values. An example of a character set is the Latin alphabet or the Cyrillic alphabet.
Coded character sets are character sets in which each character is associated with a scalar value: a code point. For example, in ASCII the uppercase letter "A" has the value 65. Examples of coded character sets are ASCII and Unicode. A coded character set is meant to be encoded, i.e. converted into a digital representation so that the characters can be serialized in files, databases, or strings. This is done through a character encoding scheme, or encoding. The encoding maps each character value to a given sequence of bytes.
In many cases, the encoding is just a direct projection of the scalar values, and there is no real distinction between the coded character set and its serialized representation. For example, in ISO 8859-1 (Latin 1), the character "A" (code point 65) is encoded as the byte 0x41 (i.e. 65). In other cases, the encoding method is more complex. For example, in UTF-8, an encoding of Unicode, the character "á" (code point 225) is encoded as two bytes: 0xC3 and 0xA1.
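If you want to see those byte sequences for yourself, here's a minimal sketch (it assumes the PHP source file itself is saved as UTF-8):
// Inspect the raw bytes of a character (assumes this file is saved as UTF-8)
echo bin2hex('A');   // "41"   - one byte, the same in ASCII, ISO 8859-1 and UTF-8
echo bin2hex('á');   // "c3a1" - two bytes in UTF-8 (0xC3 0xA1)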
For Unicode (also called the Universal Character Set or UCS), a coded character set developed by the Unicode consortium, there are several possible encodings: UTF-8, UTF-16, and UTF-32. Of these, UTF-8 is the most relevant for a web application.
UTF-8 is a multibyte 8-bit encoding in which each Unicode scalar value is mapped to a sequence of one to four bytes. One of the main advantages of UTF-8 is its compatibility with ASCII: if no extended characters are present, there is no difference between a document encoded in ASCII and one encoded in UTF-8.
One thing to take into consideration when using UTF-8 with PHP is that characters are represented with a varying number of bytes. Some PHP functions do not take this into account and will not work as expected (more on this below).
This page is going to assume you've done a little reading and absorbed some paranoia about the issue of character sets and character encoding in web applications. If you haven't, consider this quote from Joel Spolsky's well-known essay on Unicode and character sets:
"When I discovered that the popular web development tool PHP has almost complete ignorance of character encoding issues, blithely using 8 bits for characters, making it darn near impossible to develop good international web applications, I thought, enough is enough."
"Darn near impossible" is perhaps too extreme but, certainly in PHP, if you simply "accept the defaults" you will probably end up with all kinds of strange characters and question marks the moment anyone outside the US or Western Europe submits some content to your site.
This page won't rehash existing discussions; suffice to say you should be thinking in terms of Unicode, the grand unified solution to all character issues and, in particular, UTF-8, a specific encoding of Unicode and the best solution for PHP applications.
Just so you don't get the idea that only "serious programmers" can understand the problem, and as a taster for the type of problems you can have, right now (i.e. they may fix it later) on IBM's new blog, here's what you see if you right-click > Page Info in Firefox.
Firefox says it regards the character encoding as being ISO-8859-1. That's actually coming from an HTTP header - if you click on the "Headers" tab you'll see:
Content-Type: text/html; charset=ISO-8859-1
Meanwhile, amongst the HTML meta tags (scroll down past the whitespace) you'll find:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Now that's not a train smash (yet) but it should raise the flag that something isn't quite right. The meta tag will be ignored by browsers, so the content will be regarded as being encoded as ISO-8859-1, thanks to the HTTP header.
This raises the question: how is the content on the blog actually encoded? If that Content-Type: text/html; charset=ISO-8859-1 header is also turning up in the form that writers on the blog use to submit content, it probably means the stored content has been encoded as ISO-8859-1. If that's the case, the real problem will raise its head in the blog's feed,
which currently does not specify the charset with an HTTP header - just that it's XML:
Content-Type: text/xml
...but does declare UTF-8 as the encoding in the XML declaration:
<?xml version="1.0" encoding="UTF-8"?>
Anyone subscribed to this feed is going to see some weird characters appearing, should the blog contain anything but pure ASCII characters, because there's a very good chance the content is actually stored as ISO-8859-1 - the guess here being that the "back end" content admin page (containing a form for adding content) is also telling the browser it's ISO-8859-1.
Hopefully, by the time you've read this document, you'll understand what exactly is going wrong here and why.
The basic problem PHP has with character encoding is that it has a very simple idea of what a character is: one character equals one byte. To be more precise, most of PHP's string functions make this assumption (see the footnotes
for further details), but to be able to support a wide range of characters (or all characters, ever, as Unicode does), you need more than one byte to represent a character.
An example in code: Sam Ruby recommends using the string Iñtërnâtiônàlizætiøn for testing. Counted with your eye, you can see it contains 20 characters:
Iñtërnâtiônàlizætiøn
But counted with PHP's strlen function...
strlen('Iñtërnâtiônàlizætiøn');
...PHP will report 27. That's because the string, encoded as UTF-8, contains multibyte characters which PHP's strlen function counts as being multiple characters.
Life gets even more interesting if you try something like this:
header('Content-Type: text/html; charset=ISO-8859-1');
$str = 'Iñtërnâtiônàlizætiøn';
$out = '';
$pos = '';
for ($i = 0, $j = 1; $i < strlen($str); $i++, $j++) {
    $out .= $str[$i];
    if ( $j == 10 ) $j = 0;
    $pos .= $j;
}
echo $out."\n".$pos;
With the page declared as ISO-8859-1, the browser renders the raw bytes, so the output looks something like:
IÃ±tÃ«rnÃ¢tiÃ´nÃ lizÃ¦tiÃ¸n
123456789012345678901234567
Which gives you an idea of what PHP's string-related functionality actually "sees" when working with this string.
The bottom line is that all those string functions you've happily littered all over your code, plus a bunch of other stuff (regular expressions, for example), are now in doubt. Is there a character set issue lurking in there, ready to spray strange characters all over your content? The good news is it's really not a big jump to being able to support any and all characters, so long as you make use of UTF-8.
One important point (and more good news) which may not be obvious is that PHP doesn't attempt to convert or massage the contents of strings. Even though its string capabilities don't "understand" anything other than 1 character = 1 byte, PHP won't "mess" with the encoding, leaving it "as is". That means:
$some_utf8 = $_POST['comment'];
echo 'Foo '.$some_utf8.' bar'; # note this is VERY bad security - XSS!
$utf8_words = array('Iñtërnâtiônàlizætiøn', 'foo', 'Iñtërnâtiônàlizætiøn');
$utf8_words = implode(' ', $utf8_words);
$utf8_string = 'Iñtërnâtiônàlizætiøn';
print_r(explode('i', $utf8_string));
None of the above will "damage" or alter the character encoding. PHP just passes the strings through blindly.
$utf8_string = 'Iñtërnâtiônàlizætiøn';
print_r(explode('à', $utf8_string));
Although we're passing the multibyte string à as the separator to explode, because well-formed UTF-8 has the property that every byte sequence is unique, there's no chance the à will be mistaken for part of another character, so we can safely explode the string using it.
What may also be a little confusing is that PHP scripts themselves can contain more or less any sort of encoding - the PHP parser is generally fine with this, although you need to be careful when it comes to the byte order mark
(BOM) - see the notes on editors below.
Yes, there are PHP extensions to help with character encoding issues, but (if you use a shared host, you've probably already got that sinking feeling) they're not enabled by default in PHP 4. Two are worth knowing about:
iconv: The
iconv extension became a default part of PHP 5, but it doesn't offer you a magic wand that will make all problems go away. It probably has most value either when migrating old content to UTF-8 or when interfacing with systems that can't deliver you US-ASCII, ISO-8859-1 or UTF-8 - such as an RSS feed your PHP script reads which is encoded with some other character set.
mbstring: The
mbstring extension is potentially a magic wand, as it provides a mechanism to override a large number of PHP's string functions. The bad news is it's not available by default in PHP. Third-hand reports say it used to be pretty unstable but in the last year or so it has stabilized (more detail appreciated).
It may be that you can take advantage of these extensions in your own environment, but if you're writing software for other people to install for themselves, that makes them bad news.
Then again, you could just wait for PHP 6, when all our problems magically vanish.
Specifically, PHP 6 should have native understanding of Unicode and default to UTF-8 for output, as well as a bunch of other stuff, building on the ICU library.
So what do you do when the tool you have for the job (PHP) doesn't provide the facilities you need? You make it someone else's problem. In fact, something else's problem - the web browser's. Firefox and IE (and no doubt Konqueror/Safari as well, though I can't speak first hand) have excellent support for many different character sets, the most important being UTF-8. All you have to do is tell them "everything is UTF-8" and your problem goes away (well, almost).
What makes UTF-8 special is, first, that it's an encoding of Unicode and, second, that it's backwards compatible with ASCII. From one description of UTF-8:
"Character codes less than 128 (effectively, the ASCII repertoire) are presented "as such", using one octet for each code (character). All other codes are presented, according to a relatively complicated method, so that one code (character) is presented as a sequence of two to six octets, each of which is in the range 128 - 255. This means that in a sequence of octets, octets in the range 0 - 127 ("bytes with most significant bit set to 0") directly represent ASCII characters, whereas octets in the range 128 - 255 ("bytes with most significant bit set to 1") are to be interpreted as really encoded presentations of characters."
There are some important practical consequences of this:
if you have some text encoded as ASCII only, you can immediately declare it as UTF-8 without needing to convert it
there's zero likelihood that, when doing things like searching a UTF-8 string with PHP's string functions, anything that's not an ASCII character could be mistaken for an ASCII character. So strpos($utf8_string, '<') won't mistake the bytes of a multibyte character for a '<' character (see the sketch below).
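To see this in practice, here's a minimal sketch (the sample string and the '<' needle are purely illustrative):
// Searching for an ASCII needle in a UTF-8 haystack with the byte-based
// strpos() is safe: no byte of a multibyte UTF-8 sequence falls in the ASCII range.
$utf8_string = 'Iñtërnâtiônàlizætiøn <b>bold</b>';
var_dump(strpos($utf8_string, '<')); // int(28) - a byte offset, not the character offset (21)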
If you're not sure which characters ASCII covers, have a look at an ASCII table.
A further special feature of UTF-8 is that, in well-formed UTF-8, no character can be mistaken for another. Put another way, if you have a character that takes four bytes to represent and chop off the last two bytes of that sequence, the remainder cannot be mistaken for another character. Each sequence of bytes starts with an identifier byte whose value only ever appears in identifier bytes. It's easiest to see by examining the UTF-8 byte layout.
Note the "well formed" above. You might also encounter badly formed UTF-8 and it may be important, in some instances, to check for this - a strict, fast check is the best way. More on validation below.
If you're starting development on a new application / site and currently have no stored content (that might be encoded in something other than ASCII or UTF-8), using UTF-8 is just a matter of informing browsers correctly. OK, there's more to it than that, depending on what you're actually going to do with the data you get from a browser (e.g. parsing it), but the first step is letting browsers know.
That can be done by sending the following HTTP header:
// Setting the Content-Type header with charset
header('Content-Type: text/html; charset=utf-8');
Note: the charset value should be case insensitive - browsers shouldn't care.
An alternative (which, for overkill, you might want to use as well) is the HTML meta tag equivalent of the Content-Type HTTP header:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
Note: you should place this tag as early as possible in the <head> section of the document, in particular before the <title> tag - if not, the browser may decide the charset for you (I have not confirmed this point but have seen it discussed - anyone have a relevant link?).
Otherwise, it's worth examining the section on specifying character encodings
in the HTML 4.01 spec.
One further place (for the sake of overkill, if you're using the above HTTP header and meta tag) where it's a good idea to specify a charset is in forms, with the accept-charset attribute, e.g.:
<form accept-charset="utf-8">
This attribute plays a similar role to the HTTP Accept-Charset
header: it tells the user agent which character encodings the server processing the form will accept. Technically accept-charset is information about the server, but in practice it guides the browser when it encodes the submitted form data.
A relevant note from Microsoft's documentation on accept-charset:
If this attribute is not specified, the form will be submitted in the character encoding specified for the document. If the form includes characters outside the character set specified for the document, Microsoft Internet Explorer will attempt to determine an appropriate character set. If an appropriate character set cannot be determined, then the characters outside of the character set will be encoded as an HTML numeric character reference. (...) If the user enters characters that are not in the character set of the document containing the form, the UTF-8 character set will be used. UTF-8 is the preferred format for multilingual text.
Also an interesting comment (the first in the list) via Sam Ruby's blog:
The best way to control how the web browser will send back data is to use the accept-charset attribute on the form element.
Without that attribute, all kinds of weird things can happen (eg. if the user forces the browser to use a non-default character encoding to display the page, the form might get submitted in that encoding).
Sam also makes the point that if you don't declare the encoding, you lose the knowledge of what you're actually being sent, leaving you in a position where you have to "guess" what the incoming charset is.
Now that you've instructed browsers that you're only using UTF-8, the next issue is making sure that the string operations you perform in your code behave correctly when given multibyte UTF-8 characters. This doesn't mean you need to throw out all use of PHP's native string functions, regular expressions and so on, but you do need to consider where PHP needs to understand what a multibyte character is.
In general, if we call the needle the string you are, in some way, searching for, and the haystack the string you are searching in, you need to worry when the needle could contain multibyte (non-ASCII) characters. This is expanded on in the footnotes,
but here are a couple of examples.
Let's say you want to validate someone's first name, checking that it contains no more than 10 characters (characters, not bytes). Normally you might write something like:
if ( strlen($firstname) > 10 ) {
    die($firstname . ' is too long');
}
Now take a Russian name like Aleksandra. Written with the Latin alphabet it is exactly ten characters long. But written with the Cyrillic alphabet it looks like Александра (still ten characters), and PHP's strlen
function sees it as containing 20 bytes - so the above length check will wrongly reject it.
To handle this correctly, we have to turn to PHP's PCRE
extension and the /u pattern modifier. The check becomes:
if ( !preg_match('/^\w{1,10}$/u', $firstname) ) {
    die($firstname . ' is too long');
}
The \w metacharacter will match word characters (letters, numbers and underscores). With the /u pattern modifier, the PCRE extension's notion of what counts as a letter is extended to UTF-8 characters.
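If the mbstring extension happens to be available, an alternative is to count characters directly; a minimal sketch, assuming the input is UTF-8:
// Character-based length check (requires ext/mbstring)
if ( mb_strlen($firstname, 'UTF-8') > 10 ) {
    die($firstname . ' is too long');
}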
The normal way to convert between upper and lower case is to use PHP's strtolower
and strtoupper. The problem is that they rely on the current locale setting of the server, as mentioned in the manual:
"alphabetic" is determined by the current locale. This means that in, e.g., the default "C" locale, characters such as umlaut-a (ä) will not be converted.
The concept of a character's "case" only exists in some alphabets, such as Latin, Greek, Cyrillic, Armenian and archaic Georgian - it does not exist in the Chinese writing system, for example.
This means there is a finite number of characters that you might need to convert from upper to lowercase and vice versa. A lookup table of upper / lower case character mappings can be found in the source of PHP's mbstring extension.
It's possible to implement this in PHP, as done in DokuWiki
- have a look at its utf8_strtolower and utf8_strtoupper functions (note also that they attempt to use the mbstring extension first, as implementing these in pure PHP is slow); a sketch of the general approach follows.
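As a rough illustration (a hypothetical helper, not DokuWiki's actual code): prefer mbstring when it's present and fall back to a case-mapping table otherwise.
// Sketch of a UTF-8 aware strtolower; the mapping table here is deliberately
// tiny - a real implementation needs the full table. Assumes this source
// file is saved as UTF-8.
function my_utf8_strtolower($string) {
    if (function_exists('mb_strtolower')) {
        return mb_strtolower($string, 'UTF-8');
    }
    static $map = array('À' => 'à', 'Ñ' => 'ñ', 'Ë' => 'ë'); // ...and so on
    return strtr($string, $map);
}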
One issue with UTF-8 is that it's difficult to predict how many bytes a string will need. Characters can be represented by up to four bytes, so a field in a database would probably need to have its size increased by a factor of four - that may even force you to use BLOB fields where previously you'd used a VARCHAR.
If you have existing content, it will need to be converted to UTF-8.
Note: it's strongly recommended to migrate all content to UTF-8 offline rather than attempting to manage multiple character sets at "runtime", converting between the two. Aside from the fact that PHP has poor support for this unless you have mbstring or iconv
available, it's also a recipe for disaster if your application loses track of which encoding is being used where.
If you're running an English-only site and you're 99% sure that the content contains only ASCII characters, you might just bite the bullet, redeclare the site as UTF-8 and manually edit any problem characters as you find them.
If your site contains a lot of non-English content, there's a good chance you'll need to convert the content to UTF-8.
First you need to find out which character set you're currently using (Firefox: right-click > Page Info is a good way to find out). If you've never thought about it, there's a fairly good chance it's encoded as ISO-8859-1.
See also Sam Ruby's notes on the subject.
Many HTML editors add a content-type meta tag that explicitly declares ISO-8859-1, so people are often using it without even being aware.
Generally it's easy to convert using PHP's utf8_encode
function (which explicitly converts ISO-8859-1 to UTF-8 - nothing else!).
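A minimal sketch of converting a single value during an offline migration ($latin1 is just an illustrative variable; the assumption is that the stored data really is ISO-8859-1 or cp1252):
// utf8_encode() only knows ISO-8859-1 -> UTF-8
$utf8 = utf8_encode($latin1);
// iconv does the same job and also handles other source encodings,
// e.g. Windows cp1252, which is often mislabelled as ISO-8859-1
$utf8 = iconv('CP1252', 'UTF-8', $latin1);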
You may also want to convert numeric HTML character references back into literal characters, as these will only be understood by a browser (TODO: more detail needed here).
Also be aware of the Windows cp1252 code page, which is similar to ISO-8859-1 but not 100% the same. See the user comments on the relevant manual
pages (need to add examples / problems where this is an issue).
If you have a legacy application in ISO-8859-1, and don't want to or can't convert it to UTF-8, but still need to handle UTF-8, there are hacks for mixing the two. In general, though, mixing charsets isn't advised, because of the danger of getting things mixed up.
(TODO) Use iconv - see Useful Tools as well
So long as you declare your pages / forms correctly as UTF-8, web browsers (modern ones at least) should present no real issue (although you should still check input for well-formedness).
The problem is when you interface with applications / data sources other than a web browser (e.g. an RSS feed).
When building interfaces on your site for other applications to access, such as an RSS feed (i.e. when you're outputting something for non-browser consumption), you need to make sure you're declaring the character set, e.g.:
header('Content-Type: text/xml; charset=utf-8');
echo '<?xml version="1.0" encoding="utf-8"?>';
You should also avoid passing strings through htmlentities - few applications other than browsers understand HTML entities.
Otherwise, if you're generating XML, you should read up on how character encodings are declared and handled in XML.
When handling input from sources other than a browser (e.g. parsing an RSS feed or accepting data posted by another application), there are two main concerns:
Determining the input character encoding
Converting it to UTF-8 (if necessary)
In addition, different effort is required depending on whether your PHP app is acting as a client or a server for the input.
Specific to XML, you also need to be careful with the SAX parser.
Hopefully the source of the input has declared its character encoding. If you're reading a remote RSS feed, your HTTP client needs to examine the Content-Type header delivered by the remote server. You might also (if it's XML) want to examine the opening XML declaration and hunt for the encoding attribute. If there is no charset declaration anywhere, the safest thing to do is not to use the content at all - it is possible to detect character sets, but no one has done a good job of this in PHP; if desperate, you might consider passing it to a Python script via a shell command and using an encoding-detection library there. Once you know what the content is declared as, you should be OK to pass the string and the declaration to iconv
- it will fail if the declaration is lying or the character set is not supported.
If you're insisting on doing it in PHP, you're probably best off using an existing feed-parsing library
which has some support for this problem. More generally, it's probably better to use a dedicated feed parser
(there's a Python project which is light years ahead and has automatic encoding detection built in) running as a cron job.
If you're accepting data posted directly by another application or similar, you need to check what the client declares the input as - the apache_request_headers() function and the $_SERVER superglobal will allow you to examine the incoming Content-Type header, as sketched below.
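A minimal sketch of pulling the declared charset out of an incoming Content-Type header (the regular expression and variable names are just illustrative):
// e.g. "text/xml; charset=utf-8"
$type = isset($_SERVER['CONTENT_TYPE']) ? $_SERVER['CONTENT_TYPE'] : '';
if ( preg_match('/charset=([^;\s]+)/i', $type, $m) ) {
    $charset = strtoupper(trim($m[1], " \t\"'"));
} else {
    $charset = null; // nothing declared - treat the input with suspicion
}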
Otherwise, the rest of the discussion for RSS applies.
About the most rigorous but still performant way to test whether a string is well-formed UTF-8 in PHP is to use a function that converts from UTF-8 to Unicode code points
and returns FALSE if the UTF-8 is not well formed.
Another approach is to use a regular expression that matches well-formed UTF-8,
with preg_match. I have found this to be a lot slower than the above, noticeably so if you have a large document. Note also that, the way it's defined, it regards most non-printable ASCII characters as invalid, which may or may not be what you want.
A third and very fast way to do a quick check is to use preg_match
with the /u modifier. If preg_match (or a similar PCRE function) is given badly formed UTF-8 when the /u modifier is used, it simply fails quietly. That means you can use a function like:
function utf8_compliant($str) {
    if ( strlen($str) == 0 ) {
        return TRUE;
    }
    // If even just the first character can be matched, when the /u
    // modifier is used, then it's valid UTF-8. If the UTF-8 is somehow
    // invalid, nothing at all will match, even if the string contains
    // some valid sequences
    return (preg_match('/^.{1}/us', $str, $ar) == 1);
}
Some words of warning.
UTF-8 allows for five- and six-byte sequences and PCRE uses that as its definition of UTF-8, but 5/6-byte sequences are not supported by Unicode. That means the above test might pass such a sequence even though it does not actually represent any Unicode character - see elsewhere
for more detail, plus some well-formed vs. badly-formed strings to test with.
Whether the utf8_compliant() function shown above works properly seems to depend on exactly which version of PHP (or perhaps PCRE) you have, so it cannot be trusted for input validation. It is a bad choice for checking that your string is correctly formed UTF-8 in order to prevent an invalid multibyte sequence from being used in an
escaping-based SQL injection attack. In particular, for the invalid UTF-8 sequence chr(0xbf).chr(0x27) (which a naive escaping function will helpfully turn into an unescaped single quote), utf8_compliant(chr(0xbf).chr(0x27)) == true. Ouch. Fast yes, accurate no. (The test fails with PHP 4.4.2/PHP 5.1.2 and PCRE 6.4; presumably the original author's successful test was with unknown versions.)
Interesting - the behaviour described by the above comment suggests PCRE is failing to understand UTF-8 completely. There is a compile option to enable / disable UTF-8 support, but if it's disabled, using the /u modifier should produce a warning (something like "Not compiled with UTF-8 support").
A fourth approach can be found elsewhere
- I have yet to test / benchmark it.
A fifth approach (and perhaps the wisest) is to use iconv to convert from UTF-8 to UTF-8, as sketched below.
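A minimal sketch of that idea (the function name is hypothetical): iconv() strips or rejects invalid sequences, so a UTF-8 to UTF-8 round trip that comes back unchanged is a reasonable well-formedness check.
// Returns TRUE if $str survives a UTF-8 -> UTF-8 round trip unchanged
function utf8_is_well_formed($str) {
    // //IGNORE drops invalid sequences instead of aborting with a notice
    return @iconv('UTF-8', 'UTF-8//IGNORE', $str) === $str;
}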
(TODO) Common code situations...
When you put UTF-8 encoded strings in an XML document, you should remember that not all valid UTF-8 characters are accepted in an XML document.
So you should strip away the unwanted characters, otherwise you'll get a fatal XML parsing error:
function utf8_for_xml($string)
{
    return preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', ' ', $string);
}
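A usage sketch (the variable names are illustrative), combining it with escaping for markup characters:
// Strip XML-illegal characters, then escape markup characters
$xml = '<comment>' . utf8_for_xml(htmlspecialchars($comment, ENT_COMPAT, 'UTF-8')) . '</comment>';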
PHP provides two functions, htmlspecialchars
and htmlentities, which tend to get liberally scattered around code, for translating certain characters to HTML entities. It also provides html_entity_decode
to go in the opposite direction, from entity to character. Behind these functions are translation tables
for different character sets, about which some information can be obtained using get_html_translation_table
(but see the notes on it below!).
In general, given that you're switching to UTF-8, you no longer need to use HTML entities other than the "special five" which could cause parser problems, because all other characters can be represented directly in UTF-8. The "special five", which could trip up an HTML / XML parser, are:
& (ampersand) entity: &amp;
" (double quote) entity: &quot;
' (single quote) entity: &#039;
< (less than) entity: &lt;
> (greater than) entity: &gt;
PHP does support more entities than these and understands their corresponding character representations in a number of common character sets (but not all character sets). In particular, the translation from entity to UTF-8 characters seems to have been broken up until PHP 5.
In general, the safe rule is: don't output anything but the "special five" entities (and don't use anything but those five "internally" within your application). Entities will then only be an issue if you're consuming data from an external source which uses them.
htmlspecialchars provides a tool to help with generating HTML and XML markup, making sure that characters like <, which could be mistaken for part of the markup, are converted to an entity like &lt; for display, as described in the section above.
Unlike many of PHP's string functions, htmlspecialchars
has some awareness of character encodings and, by default, assumes the text it is given to escape is encoded as ISO-8859-1. Technically you can probably get away with passing it UTF-8 encoded text without any problems, because there shouldn't be anything it could mistake for the characters it is trying to match, but it's probably smarter to tell it the character set, using its third argument, e.g.:
$html = htmlspecialchars($utf8_string, ENT_COMPAT, 'UTF-8');
To reverse htmlspecialchars, you are better off rolling your own function - see the notes under html_entity_decode below.
htmlentities allows translation of a further range of characters (in addition to the markup characters translated by htmlspecialchars) into their equivalent HTML entities. The original reason
for having htmlentities
was to allow browsers with, say, only support for ASCII encoding to be able to display further, useful characters.
You can get an idea of the characters that PHP's htmlentities
function would translate like this (note that get_html_translation_table
cannot be told what charset to use - see below):
echo '<pre>';
print_r(array_map('htmlspecialchars', get_html_translation_table(HTML_ENTITIES)));
echo '</pre>';
With modern web browsers and widespread support for UTF-8, you don't need htmlentities,
because all of these characters can be represented directly in UTF-8. More importantly, in general only browsers understand HTML's named entities - a normal text editor, for example, is unaware of HTML entities. Depending on what you're doing, using htmlentities
may reduce the ability of other systems to "consume" your content.
Also (not confirmed but it sounds reasonable - from an anonymous comment), named character entities (things like &raquo; or &mdash;) do not work when a document is served as application/xhtml+xml (unless you define them). You can still get away with the numeric form though.
The html_entity_decode function is intended to convert HTML entities back into "normal" characters. Depending on the character set you tell it to use, it looks up each HTML entity it finds in some text and returns the corresponding character from a translation table. The character set you specify as the function's third argument defines both the character set of the text you give it
to parse and the character set into which the entities are decoded.
Support for UTF-8 seems to have been broken for this function until PHP 5.
Generally speaking, you're probably better off avoiding it unless you're forced to consume some external data source which contains entities other than the special five (above). That also means you may be better off rolling your own function to reverse htmlspecialchars, because html_entity_decode
will translate more than just the "special five".
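A minimal sketch of such a hand-rolled reversal (the function name is hypothetical); &amp; is translated last so that a sequence like &amp;lt; isn't decoded twice:
// Decode only the "special five", leaving all other entities alone
function unhtmlspecialchars($string) {
    return str_replace(
        array('&quot;', '&#039;', '&lt;', '&gt;', '&amp;'),
        array('"', "'", '<', '>', '&'),
        $string
    );
}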
get_html_translation_table returns an array with characters as keys and their corresponding HTML entities as values. It looks like this function will always provide the characters encoded as ISO-8859-1, judging from the internal
function it relies on.
Perhaps a future PHP version will see get_html_translation_table
gain a third argument to switch the charset.
If you do need to get into translating to and from anything but the "special five" entities, you should get familiar with what the relevant functions really do internally, by looking at the PHP source.
As has already been discussed, because PHP's basic string functions regard 1 byte as 1 character, using a function like strlen
on a multibyte (e.g. UTF-8) string will tell you the number of bytes in the string, not the number of characters.
To count the number of characters, there's a nice hack via the utf8_decode
function (mentioned in a comment by "chernyshevsky at hotmail dot com" on the strlen manual page):
function utf8_strlen($string) {
    return strlen(utf8_decode($string));
}
The utf8_encode and utf8_decode functions are really only for translating between ISO-8859-1 and UTF-8 (the function names are a little misleading), but when going from UTF-8 to ISO-8859-1, any UTF-8 character that utf8_decode
doesn't know how to handle is replaced by a single one-byte "?" character. In effect, all multibyte characters are "crunched" into single-byte characters. From there, strlen
tells the "truth" about the number of characters in the string.
If you want to see the internal implementation, have a look at the PHP source for utf8_decode - it seems to do a safe job of parsing UTF-8.
(TODO) More to come on stuff like substr
(TODO) Examples of the /u pattern modifier, highlighting the \w metacharacter
(TODO) Issues like spoofing / phishing etc.
(TODO) UTF-7 risks
This is an attempt at describing the switch to UTF-8 in DokuWiki, from the memory of someone who was indirectly involved. Right now it's a short overview.
DokuWiki is a PHP wiki which stores all wiki pages in files. Originally it defaulted to ISO-8859-1 while supporting other character sets depending on which language you specified in the DokuWiki configuration.
The problem with this approach is that a given wiki installation could only support a single character set (effectively meaning a small group of languages). It also introduced a whole bunch of headaches, like the behaviour of locale-aware
string functions in conjunction with the server's locale settings, and perhaps the need for character set detection and iconv for any wiki content coming from sources other than a web browser.
The decision was taken to move DokuWiki to "all UTF-8" - all wiki pages would be encoded as UTF-8. This pushed 90% of the problems onto the browser (modern browsers have generally excellent support for UTF-8) and allowed a single wiki to support many different character sets, and thereby languages.
The remaining 10% included:
the need to migrate existing wikis and their content to UTF-8; given that 99% of users were using ISO-8859-1, a conversion tool
was written to help them migrate (migration of content had to be performed all in one go by end users)
the need to implement some UTF-8 aware string functions,
such as utf8_strtolower(), for users without the mbstring extension installed
the need for further functions (like utf8_strip_specials) to help with converting UTF-8 to ASCII for wiki page names
checking whether input is valid UTF-8 (utf8_check)
The short summary of this section is: for PHP versions before 6, you need mbstring and iconv available.
mbstring: Provides multibyte-aware implementations of some of the most common PHP string functions, plus multibyte regular expressions and related functionality. These are either accessible via their own namespace (i.e. functions beginning mb_*) or can be used to "overload" the normal PHP implementations, giving you half a chance (expect to have additional work to do) of having an application support a different character set to the one it was designed for.
The mbstring extension supports many different character sets, most importantly UTF-8. It also allows conversion between character sets and implements some level of encoding detection (no idea how effective this is, though).
The mbstring extension is not part of the default PHP distribution - if you need it and are using a web hosting service, make sure your provider has compiled it into PHP. Common Linux distributions (like Debian) package PHP with mbstring.
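A small sketch of the kind of thing mbstring gives you (assuming the extension is compiled in and the literals below are saved as UTF-8):
mb_internal_encoding('UTF-8');
echo mb_strlen('Iñtërnâtiônàlizætiøn');          // 20 - characters, not bytes
echo mb_substr('Iñtërnâtiônàlizætiøn', 0, 4);    // "Iñtë" - no half characters
echo mb_detect_encoding('Iñtërnâtiônàlizætiøn', array('UTF-8', 'ISO-8859-1')); // "UTF-8"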
iconv: The main purpose of the iconv extension is converting between different character sets. Generally it is best applied to input sources other than web browsers (e.g. when you're aggregating RSS feeds encoded in different character sets) and is probably the most effective tool PHP has for character set conversion.
From PHP 5, the iconv extension also comes with implementations of some common string functions but, from crude benchmarks, it is much slower than mbstring or plain byte-based PHP code, at least when working with UTF-8. This seems to be because iconv carefully checks for badly formed UTF-8.
Also from PHP 5, iconv became a default part of the PHP distribution. For PHP 4 and earlier, make sure your host has installed it.
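For illustration, a couple of the iconv facilities mentioned above (a sketch; the string functions are available from PHP 5):
echo iconv_strlen('Iñtërnâtiônàlizætiøn', 'UTF-8'); // 20 characters
// ...and conversion, the extension's main job:
$latin1 = iconv('UTF-8', 'ISO-8859-1//TRANSLIT', 'Iñtërnâtiônàlizætiøn');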
recode: Essentially does the same thing as iconv, converting strings to other character sets. The general feeling is that you're better off using iconv - recode doesn't get much use, is not available on Windows and causes issues with some more popular extensions.
utf8_encode / utf8_decode: The names of these two functions are slightly misleading - they are specifically for converting between ISO-8859-1 and UTF-8, nothing more, nothing less.
They are packaged with PHP's XML extension
and could be regarded as a "legacy" from the days when 99% of web pages were encoded as ISO-8859-1.
They can be useful in some instances though; for example, utf8_decode() has the effect of "squashing" multibyte UTF-8 sequences into a single byte (whether it "recognizes" a target ISO-8859-1 character or not) and is very fast. That means you can implement a UTF-8 aware strlen():
function utf8_strlen($str) {
    return strlen(utf8_decode($str));
}
PHP 6 will be using IBM's ICU library
to provide native support for character sets. This is, in general, very good news and brings PHP on a par with Java in this area.
There's some information available already
- at this time it's not entirely clear exactly how it will end up looking; if you need advance warning, keep an eye on the i18n and internals mailing lists.
(TODO) List of key things to think about
- the MUST READ
- nice gentle introduction to the topic
- notes on UTF-8, in particular the pros and cons.
- Iñtërnâtiônàlizætiøn - check the Further Reading as well
- although Python specific, still some useful insight
- Python specific but useful general info
- good background info. Also available with $ perldoc perluniintro if you're on *nix
- not web specific but some useful discussion here anyway.
- Various problems related to form data and character issues, browser bugs etc.
(pdf) - Andrei Zmievski goes into detail on how it's going to look in PHP 6
(application/pdf) - talk by D. Rethans on i18n and l10n issues and PHP
(application/pdf) - good article on migrating to UTF-8
(application/pdf)
(application/pdf)
- great exploratory presentation, focusing on solving the problems in PHP / MySQL using UTF-8 as the solution.
- replacing various HTML entities with UTF-8 equivalents.
- everyone lives happily ever after
- outlines the strategy for Joomla.
- a talk covering Unicode, with a little comparison with Python, Perl and Java, then an outline of plans to bring Unicode to PHP at some future date, based on ICU.
- re-implements in PHP parts of the iconv, mbstring and intl extensions and adds grapheme cluster aware versions of most string functions (Apache-2.0 + GPL-2.0).
- re-implements most PHP string functions to be UTF-8 aware.
library (GPL)
contain useful functionality for UTF-8 (GPL)
- among other things, shows how to parse UTF-8 correctly, detecting invalid UTF-8 encoding
- PHP library (written in PHP) for converting character sets - supports quite a few of them. Also bundled with DokuWiki's UTF-8 conversion helper.
- supports conversions between EUC-JP, Shift_JIS, ISO-2022-JP(JIS) and UTF-8
When editing content outside of a (decent) browser, make sure to use an editor with UTF-8 support (i.e. not notepad!)
- simple text editor, useful for creating and viewing text encoded in different encodings.
- excellent Open Source cross-platform editor: make sure you set the properties value code.page=65001 to make it use UTF-8
- Java-based editor with UTF-8 support
EMACS - see its documentation for UTF-8 support
- a GTK2-based editor for GNU/Linux
- a very good Notepad replacement for Windows
- some online tools to help you learn about Unicode
- main page
- PHP functions and UTF-8
- UTF-8 and MySQL
Note: you should be using a text editor capable of encoding PHP source files as UTF-8 - see the list of editors above.
we're talking loose definitions here, for humans to grasp - PHP's internal string representations are ultimately "zeros and ones"
there are exceptions to this, of course. PHP's string functions are "generally safe", depending on what you're doing. You need to be careful with strtolower
and strtoupper, for example, which are "locale aware" and could mistake UTF-8 characters for those in the current locale. Also, the
\w metacharacter in the PCRE
regular expression extension is locale dependent unless the /u modifier is used.
the modern browsers all do a good job with UTF-8 and support many other character sets as well - they can be more or less trusted to get it right
note: some input is needed here on how it does this - assume that it regards any multibyte character it is not aware of as being a letter character - that probably means \w will match a symbol like a chess queen character
the htmlspecialchars function outputs the numeric entity &#039; instead of &apos;, apparently because IE seems to have trouble with the latter
not confirmed!!
probably a very oversimplified or even wrong explanation, so be warned