Wedge
Features: Forward thinking => Topic started by: Nao on May 6th, 2011, 05:10 PM
-
Feature: UTF8 only!
Developer: Arantor & Nao
Target: modders, translators, admins
Status: 99% (believed to be complete.)
Comment:
SMF was celebrated for supporting all possible charsets in its codebase. This was back in 2003. Work on SMF2 was started in 2005. Six years later, UTF8 is supported everywhere, and text string size is not really a problem anymore.
We decided to drop support for all other charsets and force the use of UTF8 everywhere. Ultimately, this will be a blessing to modders, who always had to make sure their string manipulation routines were using the correct charset path.
Also, $smcFunc functions were slow; we worked on their performance when moving them over to the westr object. Generally speaking, everything should be made easier with this move. It's time to say goodbye to ISO-8859-1.
-
I use ISO-8859-7, that's the Greek language. When it's time for conversion, is everything going to be converted to UTF-8 by Wedge, or will you leave that to SMF's "convert to utf8" function?
-
The Wedge importer will convert everything to proper UTF-8 :eheh:
-
Yep, hopefully it will handle everything, including Aeva Media fields. (Which, amusingly, were always in ISO-8859 because at the time I didn't get a thing about UTF... It was my kryptonite... Now it's my slave!) (Well, maybe not my slave, but at least it respects me.)
-
Never mind, I've improved the UTF-8 encoding class I was working on for the importer, and I'm very optimistic it will be able to fix any type of charset-related issue. Even a mixed or double-encoded string (containing UTF-8 and ISO characters) will be detected and properly converted :)
As long as there is a readable character in the database it's fixable :cool:
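
For anyone wondering what "double encoded" means in practice, here is a minimal Python sketch. It is not the importer's code (that isn't shown in this thread), and the function name `fix_double_encoding` is made up for illustration. It assumes the most common kind of damage: UTF-8 bytes that were misread as ISO-8859-1 and then saved again as UTF-8. A string that mixes both encodings would need a finer-grained, piece-by-piece approach than this.

```python
def fix_double_encoding(text: str) -> str:
    """Undo one round of UTF-8 bytes having been misread as ISO-8859-1.

    If the text cannot be reinterpreted that way, it is returned unchanged.
    """
    try:
        # Map each character back to the single byte it was read from,
        # then decode those bytes as the UTF-8 they originally were.
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text holds characters outside ISO-8859-1 (so it was
        # probably fine already), or the bytes are not valid UTF-8.
        return text


# Simulate the damage, then repair it.
mangled = "café".encode("utf-8").decode("iso-8859-1")
print(mangled)
print(fix_double_encoding(mangled))
print(fix_double_encoding("café"))  # already correct, left alone
```

A check like this can still misfire on text that legitimately contains sequences such as "Ã©", so a real converter would presumably want more context before rewriting anything.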
-
That's great!
-
That sounds awesome (more than what I had in mind)!
-
Can't wait to be able to test it on Wedge.org :)
(Pete, how 'bout finishing that agreement btw? :P)
-
Yeah, I'm running with a posh keyboard and the iPad for the next week (now up at the parents-in-law for the week), so I can definitely get that typed up.
-
Those UTF-8 problems are more than annoying. On a board I'm a member of, I cannot use the German language, as one is the ISO form and the other is the UTF-8 form, and somehow it doesn't really work right: it breaks mods whichever one you use. Really annoying.
-
That's why I was pretty determined up front to go UTF8 only.
-
Quote: "Feature: UTF8 only! Ultimately, this will be a blessing to modders who always had to make sure their string manipulation routines were using the correct charset path."
Yes yes yes!
I do not like this double standard.
I am only for UTF-8.
-
UTF8's only issues are: (1) slightly lower performance, (2) higher space usage.
(1) isn't much of an issue because we only use UTF8-aware functions when necessary. There is no need to call a UTF8-aware strlen when we want to know the actual size in bytes rather than characters. Things like that...
(2) is more of a problem but really... It takes about 5% more space in a French language forum. It's not that much... And it's even less in an English-only forum (actually, 0%). Although if you have a Japanese board, it will use much more space than Shift-JIS or whatever. But that's the price to pay. If you have a Japanese board set to ISO-8859-1, it will use up way more space because all chars are converted to HTML entities.
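
Both points are easy to check for yourself. Here is a rough Python sketch; the sample strings and the helper names `utf8_overhead` and `entity_size` are made up for illustration, and the exact percentages depend entirely on how many non-ASCII characters your own posts contain.

```python
def utf8_overhead(text: str, legacy_charset: str) -> float:
    """How much bigger text is in UTF-8 than in legacy_charset (0.5 = 50%)."""
    legacy_size = len(text.encode(legacy_charset))
    utf8_size = len(text.encode("utf-8"))
    return utf8_size / legacy_size - 1


def entity_size(text: str, charset: str = "iso-8859-1") -> int:
    """Bytes needed when unsupported characters become numeric HTML entities."""
    return len(text.encode(charset, errors="xmlcharrefreplace"))


french = "Bientôt l'été"
japanese = "日本語"

# (1) Characters vs. bytes: only use the character count when you need it.
print(len(french))                  # 13 characters
print(len(french.encode("utf-8")))  # 16 bytes

# (2) Space: nothing extra for plain ASCII, more for Japanese than Shift-JIS,
# but far less than storing the same text as entities in ISO-8859-1.
print(utf8_overhead("Hello world", "iso-8859-1"))  # 0.0
print(utf8_overhead(japanese, "shift_jis"))        # 0.5
print(len(japanese.encode("utf-8")))               # 9 bytes
print(entity_size(japanese))                       # 24 bytes
```

The short French sample is deliberately heavy on accents, so its overhead is higher than what a whole forum would show.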
-
Does that mean I can use the Ñ in this forum?
-
You just did :P
-
Añyway, Ñ is part of ISO-8859-15 which is the default here, so it won't be converted to an entity at all. But if you want to use 尚, you can. It just won't be encoded in UTF, it'll be an entity. (Look at the source code.)
-
I think it's even part of 8859-1 actually, which defines the most common accented characters.
But in Wedge it can be 'just used' without any problems.
-
-15 doesn't differ much from -1, it just swaps a few special chars (¤ becomes €, for instance). So yeah, I think Ñ is in 8859-1 then.
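
This is easy to verify with a few lines of Python. It is only a sketch using Python's built-in codecs, which has nothing to do with how SMF or Wedge handle it internally; the variable names are arbitrary.

```python
enye = "Ñ"
shang = "尚"

# Ñ is a single byte, and the same byte, in both charsets.
print(enye.encode("iso-8859-1"))   # b'\xd1'
print(enye.encode("iso-8859-15"))  # b'\xd1'

# 尚 exists in neither charset, so it falls back to a numeric HTML entity.
print(shang.encode("iso-8859-15", errors="xmlcharrefreplace"))  # b'&#23578;'

# One of the few positions where the two charsets differ: byte 0xA4.
print(b"\xa4".decode("iso-8859-1"))   # ¤
print(b"\xa4".decode("iso-8859-15"))  # €
```

In UTF-8 both characters are stored directly (2 bytes for Ñ, 3 for 尚), with no entity involved.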
-
I hated the fact that many forums didn't support it...
-
But SMF always did, AFAIK? (It's just a matter of adding accept-charset="ISO-8859-1" to your forms... Browsers do the rest of the job!)
-
Yeah, hopefully... Anyways, the bastards in the Royal Academy are so fu*** up that the Ñ will die in a few years...
/mefacepalm against the new "language management"...
-
Wedge is a simpler forum...
I like that.
Stef.
-
Is the backup problem with UTF fixed? (Where if you used the backup feature in SMF, and restored the DB, the charset was fubar.)
-
I never tried doing that.
Don't believe it's been tackled either way.
-
There is a part of me that just wants to ditch it because it's not suited to large dumps and has all kinds of odd failure conditions, and part of me wants to find a better way.
-
Perhaps integrating the backup feature with something like bigdump?
-
Here's where it gets complicated. Hosts that use cPanel etc. already have a backup facility that isn't tied to PHP memory limits, Apache timeouts and so on. Meanwhile, folks on unmanaged setups (typically VPS or meatier) will be running their own backup scripts anyway (or *should* be).
So it only matters for hosts that are bad enough to provide neither backups NOR access to anything else, in which case you're still screwed anyway, since the majority of those hosts don't allow access to SMF's backup service either.
-
I would rather you deleted the whole backup function then. I was almost burned on this one. Lucky for me, my previous experiences had taught me to have two backups from different sources... (one from SMF and one directly from MySQL.)
I restored the database and bam, thousands of "foreign" letters, all gone. (Well, not gone, just mangled.) I promptly switched to the direct-from-MySQL backup, and stuff was fine.
So yeah. It sort of needs to be addressed. I'd say it's a pretty big bug.
-
Awesome :)