UTF8 only!

Nao

UTF8 only!
« on May 6th, 2011, 05:10 PM »
Feature: UTF8 only!
Developer: Arantor & Nao
Target: modders, translators, admins
Status: 99% (believed to be complete.)
Comment:

SMF was celebrated for supporting all possible charsets in its codebase. This was back in 2003. Work on SMF2 was started in 2005. Six years later, UTF8 is supported everywhere, and text string size is not really a problem anymore.
We decided to drop support for all other charsets and force the use of UTF8 everywhere. Ultimately, this will be a blessing to modders, who always had to make sure their string manipulation routines were using the correct charset path.
Also, the $smcFunc functions were slow, so we worked on their performance when moving them over to the westr object. Generally speaking, everything should be made easier with this move. It's time to say goodbye to ISO-8859-1.

Re: UTF8 only!
« Reply #1, on May 6th, 2011, 08:42 PM »
I use ISO-8859-7, that's the Greek language. When it's time for conversion, is everything going to be converted to UTF-8 by Wedge, or will you leave that to SMF's "convert to utf8" function?

Re: UTF8 only!
« Reply #2, on May 6th, 2011, 08:48 PM »
The wedge importer will convert everything to proper UTF-8  :eheh:
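
For readers wondering what such a conversion amounts to, here is a minimal sketch in Python (used only for brevity; Wedge is PHP). The helper name `to_utf8` is hypothetical and this is not the importer's actual code. It only illustrates re-encoding text stored as ISO-8859-7 into UTF-8, assuming the source charset is known.

```python
def to_utf8(raw: bytes, source_charset: str) -> bytes:
    """Re-encode bytes stored in a legacy charset as UTF-8.

    Hypothetical helper for illustration; assumes source_charset is correct.
    """
    return raw.decode(source_charset).encode("utf-8")


# A Greek word as it would sit in an ISO-8859-7 database: one byte per letter.
greek_legacy = "ΕΛΛΑΔΑ".encode("iso-8859-7")

# The same word after conversion: Greek letters take two bytes each in UTF-8.
greek_utf8 = to_utf8(greek_legacy, "iso-8859-7")

print(len(greek_legacy), len(greek_utf8))
print(greek_utf8.decode("utf-8"))
```

The text is unchanged after the round trip; only its byte representation differs.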

Re: UTF8 only!
« Reply #3, on May 6th, 2011, 09:29 PM »
Yep, hopefully it will handle everything, including Aeva Media fields. (Which, amusingly, were always in ISO-8859 because at the time I didn't get a thing about UTF... It was my kryptonite... Now it's my slave!) (Well, maybe not my slave, but at least it respects me.)

Re: UTF8 only!
« Reply #4, on May 7th, 2011, 09:13 AM »
Quote from Nao/Gilles on May 6th, 2011, 09:29 PM
Yep, hopefully it will handle everything, including Aeva Media fields. (Which, amusingly, were always in ISO-8859 because at the time I didn't get a thing about UTF... It was my kryptonite... Now it's my slave!) (Well, maybe not my slave, but at least it respects me.)
Never mind, I've improved the UTF-8 encoding class I was working on for the importer, and I'm very optimistic it will be able to fix any type of charset-related issue. Even a mixed or double-encoded string (containing UTF-8 and ISO characters) will be detected and properly converted :)
Code: [Select]
$string = 'äöüßäöþ';
would be converted to:
Code: [Select]
äöüßäöß
As long as there is a readable character in the database it's fixable  :cool:
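
One possible way to approach this kind of repair is sketched below in Python (Wedge is PHP; this is not the importer's class, and every name here is hypothetical). It assumes the stray bytes are ISO-8859-1. `fix_mixed` keeps valid UTF-8 sequences and reinterprets any invalid byte as ISO-8859-1, while `undo_double_encoding` reverses one round of "UTF-8 bytes misread as ISO-8859-1 and saved again as UTF-8".

```python
import codecs


def _latin1_fallback(err):
    # Called for each byte run that is not valid UTF-8:
    # reinterpret those bytes as ISO-8859-1 and resume after them.
    bad = err.object[err.start:err.end]
    return bad.decode("iso-8859-1"), err.end


codecs.register_error("latin1_fallback", _latin1_fallback)


def fix_mixed(raw: bytes) -> str:
    """Decode bytes that mix valid UTF-8 with stray ISO-8859-1 bytes."""
    return raw.decode("utf-8", errors="latin1_fallback")


def undo_double_encoding(text: str) -> str:
    """Reverse one round of double encoding, if the text looks double-encoded."""
    try:
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not double-encoded, or not recoverable this way


# Four characters stored as UTF-8 followed by two stored as ISO-8859-1.
mixed = "äöüß".encode("utf-8") + "äö".encode("iso-8859-1")
print(fix_mixed(mixed))

# UTF-8 bytes that were wrongly read as ISO-8859-1 at some point.
double = "äöüß".encode("utf-8").decode("iso-8859-1")
print(undo_double_encoding(double))
```

A heuristic like this cannot be perfect: two ISO-8859-1 bytes that happen to form a valid UTF-8 sequence are indistinguishable from a real UTF-8 character, so a real converter would need extra checks.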

Re: UTF8 only!
« Reply #5, on May 7th, 2011, 09:19 AM »
That's great!

Re: UTF8 only!
« Reply #6, on May 7th, 2011, 09:43 AM »
That sounds awesome (more than what I had in mind)!

Re: UTF8 only!
« Reply #7, on May 7th, 2011, 10:05 AM »
Can't wait to be able to test it on Wedge.org :)

(Pete, how 'bout finishing that agreement btw? :P)

Re: UTF8 only!
« Reply #8, on May 7th, 2011, 10:13 AM »
Yeah, I'm running with a posh keyboard and the iPad for the next week (now up at the parents-in-law for the week), so I can definitely get that typed up.

Re: UTF8 only!
« Reply #9, on May 7th, 2011, 10:17 PM »
Those UTF-8 problems are more than annoying. On that board I'm a member of, I cannot use the German language, as one is the ISO form and the other is the UTF-8 form, and somehow it doesn't really work right, as it breaks mods if you use one or the other. Really annoying.

Re: UTF8 only!
« Reply #10, on May 7th, 2011, 10:43 PM »
That's why I was pretty determined up front to go UTF8 only.

Re: UTF8 only!
« Reply #11, on May 7th, 2011, 11:14 PM » Last edited on May 7th, 2011, 11:20 PM by Makar
Quote from Nao/Gilles on May 6th, 2011, 05:10 PM
Feature: UTF8 only! Ultimately, this will be a blessing to modders who always had to make sure their string manipulation routines were using the correct charset path.
yes yes yes!
I do not like this double standard  [1]

I am only for utf-8
 1. (for example, when a novice admin sets up a forum encoded in cp1251 and then installs mods on it in UTF encoding, or vice versa)

Re: UTF8 only!
« Reply #12, on May 7th, 2011, 11:51 PM »
UTF8's only issues are: (1) slightly lower performance, (2) uses more space.
(1) isn't much of an issue because we only use UTF8-aware functions when necessary. There is no need to call a UTF8-aware strlen when we want to know the actual size in bytes rather than characters. Things like that...
(2) is more of a problem but really... It takes about 5% more space in a French language forum. It's not that much... And it's even less in an English-only forum (actually, 0%). Although if you have a Japanese board, it will use much more space than Shift-JIS or whatever. But that's the price to pay. If you have a Japanese board set to ISO-8859-1, it will use up way more space because all chars are converted to HTML entities.
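
Both points can be checked with a small Python sketch (Python only for brevity; Wedge is PHP). The helper `stored_size` and the sample strings are hypothetical. It assumes that characters missing from a charset are stored as numeric HTML entities, as the post describes for ISO-8859-1 boards. The 5% figure above depends on the actual text, so this only shows the mechanism.

```python
def stored_size(text: str, charset: str) -> int:
    """Bytes needed to store text in charset.

    Characters the charset cannot represent are written as numeric
    HTML entities (e.g. &#26085;), which is an assumption for illustration.
    """
    return len(text.encode(charset, errors="xmlcharrefreplace"))


french = "Ça coûte très cher"
japanese = "日本語"

# Character count vs. byte count: a plain length check is enough
# when only the size in bytes matters.
print(len(french), stored_size(french, "utf-8"))

# English text costs nothing extra; each accented French letter costs one byte.
print(stored_size("Hello", "iso-8859-1"), stored_size("Hello", "utf-8"))
print(stored_size(french, "iso-8859-1"), stored_size(french, "utf-8"))

# Japanese: Shift-JIS is the smallest, UTF-8 is larger, entities are far larger.
print(stored_size(japanese, "shift_jis"))
print(stored_size(japanese, "utf-8"))
print(stored_size(japanese, "iso-8859-1"))
```

In this sample the three accented letters add three bytes under UTF-8, while the three Japanese characters take 6 bytes in Shift-JIS, 9 in UTF-8 and 24 as entities.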

Re: UTF8 only!
« Reply #13, on May 9th, 2011, 12:13 AM »
Does that mean I can use the Ñ in this forum?

Re: UTF8 only!
« Reply #14, on May 9th, 2011, 12:30 AM »
You just did :P

Re: UTF8 only!
« Reply #15, on May 9th, 2011, 12:43 AM »
Añyway, Ñ is part of ISO-8859-15 which is the default here, so it won't be converted to an entity at all. But if you want to use 尚, you can. It just won't be encoded in UTF, it'll be an entity. (Look at the source code.)
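
The difference between the two characters can be reproduced with a short Python sketch (the function name `encode_for_board` is hypothetical, and this only mimics the behaviour described above rather than the forum's actual code): a character that exists in the board charset is stored as a single byte, while one that does not falls back to a numeric HTML entity.

```python
def encode_for_board(text: str, charset: str = "iso-8859-15") -> bytes:
    """Encode text for a board using charset, falling back to numeric
    HTML entities for characters the charset lacks (illustrative assumption)."""
    return text.encode(charset, errors="xmlcharrefreplace")


print(encode_for_board("Ñ"))            # b'\xd1'
print(encode_for_board("尚"))            # b'&#23578;'
print(encode_for_board("尚", "utf-8"))   # b'\xe5\xb0\x9a'
```

On a UTF-8 board the same character is stored directly as three bytes, with no entity involved.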