Topic started by: Nao on May 6th, 2011, 05:10 PM

Title: UTF8 only!
Post by: Nao on May 6th, 2011, 05:10 PM

Feature: UTF8 only!
Developer: Arantor & Nao
Target: modders, translators, admins
Status: 99% (believed to be complete.)
Comment:

SMF was celebrated for supporting all possible charsets in its codebase. This was back in 2003. Work on SMF2 was started in 2005. Six years later, UTF8 is supported everywhere, and text string size is not really a problem anymore.
We decided to drop support for all other charsets and force the use of UTF8 everywhere. Ultimately, this will be a blessing to modders, who always had to make sure their string manipulation routines were using the correct charset path.
Also, the $smcFunc functions were slow, so we worked on their performance when moving them over to the westr object. Generally speaking, everything should be made easier with this move. It's time to say goodbye to ISO-8859-1.

Title: Re: UTF8 only!
Post by: abraamz on May 6th, 2011, 08:42 PM

I use ISO-8859-7, that's the Greek language. When it's time for conversion, is everything going to be converted to UTF8 by Wedge, or will you leave that to SMF's "convert to UTF8" function?

Title: Re: UTF8 only!
Post by: TE on May 6th, 2011, 08:48 PM

The wedge importer will convert everything to proper UTF-8 :eheh:

Title: Re: UTF8 only!
Post by: Nao on May 6th, 2011, 09:29 PM

Yep, hopefully it will handle everything, including Aeva Media fields. (Which, amusingly, were always in ISO-8859 because at the time I didn't get a thing about UTF... It was my kryptonite... Now it's my slave!) (Well, maybe not my slave, but at least it respects me.)

Title: Re: UTF8 only!
Post by: TE on May 7th, 2011, 09:13 AM

Quote from Nao/Gilles on May 6th, 2011, 09:29 PM

Yep, hopefully it will handle everything, including Aeva Media fields. (Which, amusingly, were always in ISO-8859 because at the time I didn't get a thing about UTF... It was my kryptonite... Now it's my slave!) (Well, maybe not my slave, but at least it respects me.)

Never mind, I've improved the utf8 encoding class I was working on for the importer and I'm very optimistic it will be able to fix any type of issues related to the charset. Even a mixed or double encoded string (containing utf-8 and ISO characters) will be detected and properly converted :)

Code:

$string = 'äöüßÃ¤Ã¶ÃŸ';

would be converted to:

Code:

äöüßäöß

As long as there is a readable character in the database it's fixable :cool:
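
For anyone curious how that kind of repair can work in principle, here is a minimal sketch in Python rather than the importer's PHP. It is not TE's class: the function name `fix_double_encoding` and the choice of Windows-1252 as the legacy charset are assumptions made for illustration. The idea is to re-encode short runs of characters in the legacy charset and try to decode the resulting bytes as UTF-8; if that yields exactly one character, the run was very likely double encoded.

```python
def fix_double_encoding(text: str, legacy: str = "cp1252") -> str:
    """Repair UTF-8 sequences that were mis-decoded as a legacy charset.

    Hypothetical helper: 'legacy' is assumed to be the charset the bytes
    were wrongly interpreted in (Windows-1252 by default).
    """
    result = []
    i = 0
    while i < len(text):
        repaired = None
        # A mis-decoded UTF-8 sequence shows up as 2 to 4 legacy characters.
        for width in (4, 3, 2):
            chunk = text[i:i + width]
            if len(chunk) < width:
                continue
            try:
                candidate = chunk.encode(legacy).decode("utf-8")
            except (UnicodeEncodeError, UnicodeDecodeError):
                continue  # not a valid UTF-8 byte sequence, try a shorter run
            if len(candidate) == 1:
                repaired = candidate
                break
        if repaired is None:
            result.append(text[i])  # already fine, keep as-is
            i += 1
        else:
            result.append(repaired)
            i += width
    return "".join(result)


# Build a mixed string: clean characters followed by double-encoded ones.
mixed = "äöüß" + "äöß".encode("utf-8").decode("cp1252")
print(fix_double_encoding(mixed))  # prints äöüßäöß
```

This sketch has limits: a legitimate run such as "Ã¤" that the author really meant to type would also be "repaired", and sequences involving the few byte values Windows-1252 leaves undefined are left untouched.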

Title: Re: UTF8 only!
Post by: live627 on May 7th, 2011, 09:19 AM

That's great!

Title: Re: UTF8 only!
Post by: Arantor on May 7th, 2011, 09:43 AM

That sounds awesome (more than what I had in mind)!

Title: Re: UTF8 only!
Post by: Nao on May 7th, 2011, 10:05 AM

Can't wait to be able to test it on Wedge.org :)

(Pete, how 'bout finishing that agreement btw? :P)

Title: Re: UTF8 only!
Post by: Arantor on May 7th, 2011, 10:13 AM

Yeah, I'm running with posh keyboard with iPad for the next week (now up at parents-in-law for the week) so I can definitely get that typed up.

Title: Re: UTF8 only!
Post by: Artur on May 7th, 2011, 10:17 PM

Those UTF-8 problems are more than annoying. On a board I'm a member of, I cannot use the German language, as one version is the ISO form and the other is the UTF-8 form, and somehow it doesn't really work right, as it breaks mods if you use one or the other. Really annoying.

Title: Re: UTF8 only!
Post by: Arantor on May 7th, 2011, 10:43 PM

That's why I was pretty determined up front to go UTF8 only.

Title: Re: UTF8 only!
Post by: and on May 7th, 2011, 11:14 PM

Quote from Nao/Gilles on May 6th, 2011, 05:10 PM

Feature: UTF8 only!
Ultimately, this will be a blessing to modders, who always had to make sure their string manipulation routines were using the correct charset path.

yes yes yes!
I do not like this double standard [1]

I am only for utf-8

1.	(For example, when a novice admin sets the forum encoding to cp1251 and then installs mods that use UTF encoding, or vice versa.)

Title: Re: UTF8 only!
Post by: Nao on May 7th, 2011, 11:51 PM

UTF8's only issues are: (1) slightly lower performance, (2) uses more space.
(1) isn't much of an issue because we only use UTF8-aware functions when necessary. There is no need to call a UTF8-aware strlen when we want to know the actual size in bytes rather than characters. Things like that...
(2) is more of a problem but really... It takes about 5% more space in a French language forum. It's not that much... And it's even less in an English-only forum (actually, 0%). Although if you have a Japanese board, it will use much more space than Shift-JIS or whatever. But that's the price to pay. If you have a Japanese board set to ISO-8859-1, it will use up way more space because all chars are converted to HTML entities.
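
To make both points concrete, here is a small Python sketch (not Wedge code; the helper name `stored_size` is made up for illustration). It compares character count with byte count, and shows how much space the same text takes in different charsets. Python's `xmlcharrefreplace` error handler stands in for the "convert to HTML entities" behavior, which is an assumption about the exact entity format.

```python
def stored_size(text: str, charset: str) -> int:
    """Bytes needed to store text in charset, with unsupported
    characters replaced by numeric HTML entities (&#NNNN;)."""
    return len(text.encode(charset, errors="xmlcharrefreplace"))


french = "été"
japanese = "日本語"

# Characters vs. bytes: only the first needs a UTF-8-aware function.
print(len(french))                       # 3 characters
print(stored_size(french, "utf-8"))      # 5 bytes
print(stored_size(french, "latin-1"))    # 3 bytes

# Pure ASCII costs nothing extra in UTF-8.
print(stored_size("Hello", "utf-8"))     # 5 bytes

# Japanese: UTF-8 is bigger than Shift-JIS, but entities are far worse.
print(stored_size(japanese, "shift_jis"))  # 6 bytes
print(stored_size(japanese, "utf-8"))      # 9 bytes
print(stored_size(japanese, "latin-1"))    # 24 bytes, all entities
```

The actual overhead on a real forum depends on how much of the text is non-ASCII, so these three-character strings only show the per-character cost.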

Title: Re: UTF8 only!
Post by: DoctorMalboro on May 9th, 2011, 12:13 AM

Does that mean I can use the Ñ in this forum?

Title: Re: UTF8 only!
Post by: Arantor on May 9th, 2011, 12:30 AM

You just did :P

Title: Re: UTF8 only!
Post by: Nao on May 9th, 2011, 12:43 AM

Añyway, Ñ is part of ISO-8859-15 which is the default here, so it won't be converted to an entity at all. But if you want to use 尚, you can. It just won't be encoded in UTF, it'll be an entity. (Look at the source code.)
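
A quick way to see that behavior, sketched in Python (the function name `encode_for_charset` is hypothetical, and the numeric entity format is an assumption about what the forum emits): characters the charset knows are stored as a single byte, anything else falls back to an entity.

```python
def encode_for_charset(text: str, charset: str = "iso-8859-15") -> bytes:
    """Encode text, replacing characters the charset lacks with
    numeric HTML entities."""
    return text.encode(charset, errors="xmlcharrefreplace")


print(encode_for_charset("Ñ"))            # b'\xd1'
print(encode_for_charset("尚"))           # b'&#23578;'
print(encode_for_charset("尚", "utf-8"))  # b'\xe5\xb0\x9a'
```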

Title: Re: UTF8 only!
Post by: Arantor on May 9th, 2011, 12:47 AM

I think it's even part of 8859-1 actually, which does define the very most common accented characters.

But in Wedge it can be 'just used' without any problems.

Title: Re: UTF8 only!
Post by: Nao on May 9th, 2011, 12:59 AM

-15 doesn't have a lot of differences with -1, it just adds a few special chars including €. Yeah, I think it's in 8859-1 then.
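
The difference between the two charsets is easy to check. Here is a small Python sketch (the helper name `charset_differences` is made up for illustration) that lists the byte values where ISO-8859-1 and ISO-8859-15 decode to different characters:

```python
def charset_differences(a: str = "iso-8859-1", b: str = "iso-8859-15") -> dict:
    """Map each high byte value to its (char in a, char in b) pair
    wherever the two single-byte charsets disagree."""
    diffs = {}
    for byte in range(0xA0, 0x100):
        raw = bytes([byte])
        char_a, char_b = raw.decode(a), raw.decode(b)
        if char_a != char_b:
            diffs[byte] = (char_a, char_b)
    return diffs


for byte, (old, new) in charset_differences().items():
    print(f"0x{byte:02X}: {old} -> {new}")

# Ñ is stored as the same byte in both charsets.
print("Ñ".encode("iso-8859-1") == "Ñ".encode("iso-8859-15"))  # True
```

Only eight positions differ. The best known is 0xA4, which is ¤ in ISO-8859-1 and € in ISO-8859-15.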

Title: Re: UTF8 only!
Post by: DoctorMalboro on May 9th, 2011, 01:15 AM

I hated the fact that many forums didn't support it...

Title: Re: UTF8 only!
Post by: Nao on May 9th, 2011, 01:19 AM

But SMF always did, AFAIK? (It's just a matter of adding accept-charset="ISO-8859-1" into your textareas... Browsers do the rest of the job!)

Title: Re: UTF8 only!
Post by: DoctorMalboro on May 9th, 2011, 01:22 AM

Yeah, hopefully... Anyways, the bastards in the Royal Academy are so fu*** up that the Ñ will die in a few years...

/mefacepalm against the new "language management"...

Title: Re: UTF8 only!
Post by: Stef on June 22nd, 2011, 07:33 PM

Wedge is more Simple ...... forum

I like that.

Stef.

Title: Re: UTF8 only!
Post by: Norodo on July 19th, 2011, 07:20 AM

Is the backup problem with UTF fixed? (Where if you used the backup feature in SMF, and restored the DB, the charset was fubar.)

Title: Re: UTF8 only!
Post by: Nao on July 19th, 2011, 09:47 AM

I never tried doing that.
Don't believe it's been tackled either way.

Title: Re: UTF8 only!
Post by: Arantor on July 19th, 2011, 11:09 AM

There is a part of me that just wants to ditch it because it's not suited to large dumps and has all kinds of odd failure conditions, and part of me wants to find a better way.

Title: Re: UTF8 only!
Post by: Dragooon on July 19th, 2011, 12:30 PM

Perhaps integrating the backup feature with something like bigdump?

Title: Re: UTF8 only!
Post by: Arantor on July 19th, 2011, 12:57 PM

Here's where it gets complicated. Hosts that use cPanel etc. already have a backup facility that isn't tied to PHP memory limits or Apache timeouts etc. Meanwhile, folks on unmanaged setups (typically VPS or meatier) will be running their own backup scripts anyway (or *should* be).

It's only then for the hosts that are that bad who don't provide backups AND don't provide access to anything else, in which case you're still screwed anyway since the majority of those hosts don't allow access to SMF's backup service either.

Title: Re: UTF8 only!
Post by: Norodo on July 19th, 2011, 01:57 PM

I would rather you deleted the whole backup function then. I was almost burned on this one, lucky for me my previous experiences had taught me to have two backups from different sources... (one from SMF and one directly from MySQL.)

I restored the database and bam, thousands of "foreign" letters, all gone. (Well, not gone, just mangled.) I promptly changed to the direct from MySQL backup, and stuff was fine.

So yeah. It sort of needs to be addressed. I'd say it's a pretty big bug.

Title: Re: UTF8 only!
Post by: Antes on July 19th, 2011, 08:38 PM

Awesome :)