Wedge

Nao · « on February 24th, 2014, 03:57 PM »

I could do with some ideas here.

Okay, so a couple of weeks ago, I decided to shorten the {db_prefix}members table, by moving some of its 'optional' fields into the data field (which is a do-it-all column that holds a serialized array.)

Among the fields I moved, 'message_labels' had for my account a value of "À répondre", which is a label I made which means "pending reply". As you can see, there are accents in that string...

Now, imagine this. I'm trying to keep it simple:
$data = array('member_labels' => 'À répondre');
$data_field = serialize($data);
Then insert $data_field as the 'data' field in the members table.

All right..?
Now, do the reverse. unserialize($row['data']), basically.
Unserialize error. Ouch.
Why so?
Because, apparently, the string was turned into a different format between the moment it was serialized, and the moment I attempted to unserialize. This only happens for data fields that contain accents or other weird characters, so English users probably never saw this happen. A serialized string first holds the size of a string, then its contents. If the string is modified in the meantime, then the size won't match, and this will trigger an unserialize error.

Now, I attempted to 'correct' this by turning my serialized strings into JSON strings instead. But by default, PHP decodes these into stdClass objects, which sucks, because I either have to specifically recast them as arrays, or add a parameter in json_decode.
Okay, I can live with that... But other tables in Wedge also hold a 'data' field, which in turn is also a serialized string. Does it mean I should use json_encode instead everywhere...?

That was my first solution. The advantage of JSON is that it makes shorter strings, but they're also slightly slower to decode. Not that much, mind you...

Okay, next solution?
Call westr::utf8_to_entity() on every array I'm going to serialize. Unfortunately, it's also a slow function, and to be sure, we need to call it on every single data sub-field, which could eventually waste a lot of time. This is the currently (half-) implemented solution.

Try to find a solution to that storage problem. I didn't try this first, because honestly I suck at handling UTF in databases. But logic dictates that if you store a UTF8 string and retrieve it, as long as the database is UTF8, it should return the same UTF8 string. Unfortunately, it doesn't seem to do that. Is this a bug in the Wedge codebase? SMF codebase? Or a problem with my database?

Honestly, I'm looking into the last solution because it'd be the easiest to implement in the end, but I'm all ears when it comes to other alternatives, or ANYTHING that could help with this situation, really. Did any of you ever get into trouble with something of this kind..?
Yes, I did google this, and found stuff, but nothing helpful, unfortunately. People seem to assume that you shouldn't serialize a string for transmission through a database system, because "you're never sure what you're going to get back"... Heck, YES I know what I should be getting back. I should be getting the same string...! Some people suggest simply rewriting the serialized string's sizes through a preg_replace_callback, but this is too CPU-intensive to my taste, and I'd like to find a logical solution, instead of a dirty hack.

Feel free to chip in, if only to say which solution you'd prefer. Thanks!

Pandos · « Reply #1, on February 24th, 2014, 04:29 PM »

Perhaps try to check if it is serialized and then if validated de-serialize it:

Code: [Select]

if( $unserialized = @unserialize( $row['data'] ) )

Nao · « Reply #3, on February 24th, 2014, 05:45 PM »

Yes, I'll forget about it... :P

base64_encode is faster than entity conversion, but takes more space in the database, especially as it's not often needed.
As far as I know, BLOB columns are not suitable for text operations, such as SELECT data FROM members WHERE data LIKE '%my_variable%'... Not that it matters much, but... It makes it hard for me to adopt, I guess.

Pandos · « Reply #4, on February 24th, 2014, 05:57 PM »

Yes, you are right.
Then it's better to use VARCHAR instead of BLOB.

And who cares about space in DB? It'll be cached anyway :)

Nao · « Reply #5, on February 24th, 2014, 06:58 PM »

It's a text field, not a varchar.

Anyone got an opinion in this?

Pandos · « Reply #6, on February 24th, 2014, 07:50 PM »

VARCHAR is stored inline and is usually faster than text :)

Nao · « Reply #7, on February 24th, 2014, 07:55 PM »

Yes, but the data field is not always loaded. It can take a lot of space, so if you do a table scan on the members table, and data is inline, it'll be less efficient.

Pandos · « Reply #9, on February 24th, 2014, 08:29 PM »

It's because MySQL's UTF-8 before 5.5 is "fake UTF-8"? It only supports 3 bytes per character max., so there is no way to directly store unicode characters plane in MySQL. :geek:
So we should lift up to 5.5?

This will bring the advantage of using "real UTF-8 support" (so you should not bother with serialize errors) and TEXT are also stored inline. :)

And from my personal point of view:
MySQL is almost at the level where InnoDB can perform fulltext search and MyISAM will be almost completely outdated and Wedge is the next generation forum software. So why .... ?

Nao · « Reply #10, on February 25th, 2014, 12:18 AM »

Quote from Pandos on February 24th, 2014, 08:29 PM

It's because MySQL's UTF-8 before 5.5 is "fake UTF-8"? It only supports 3 bytes per character max., so there is no way to directly store unicode characters plane in MySQL. :geek:
So we should lift up to 5.5?

No...
My test strings just have regular accents that are encoded to 2 bytes in UTF8. 4-byte chars in UTF8 are, AFAIK, exclusively for CJK languages. (Especially Chinese Traditional & Simplified -- probably represents about 80.000 characters.)

Quote from Pandos on February 24th, 2014, 08:29 PM

This will bring the advantage of using "real UTF-8 support" (so you should not bother with serialize errors) and TEXT are also stored inline. :)

My local install has MySQL 5.5, and (still) the same problem.

Quote from Pandos on February 24th, 2014, 08:29 PM

And from my personal point of view:
MySQL is almost at the level where InnoDB can perform fulltext search and MyISAM will be almost completely outdated and Wedge is the next generation forum software. So why .... ?

MyISAM is still the de-facto solution for fast reads. I could be convinced to go InnoDB for a few tables, and memory for the session table (if not already done), but not much else...

Okay, so I've done another test by turning my TEXT field into a BLOB and then into a VARCHAR(65535)... And, it didn't change anything. I'm still getting the same contents. Perhaps the problem is happening at store time, rather than retrieve time, I don't know.

I'm still at a loss about what I should do. Remain with all serialize() calls (because there are TONS of theme in the Wedge codebase, and they all work, just look at the number of serialized items stored in $settings!), or convert everything to json strings (shorter, a bit slower to load)..?

Posted: February 25th, 2014, 12:09 AM

Crappy crap...
Did another run of my original code, the one that caused the bug...
Guess what. It worked just fine this time. >_<
So... I've got to suppose, it's not related to the serializing process at all. Uh... Then why did it fail for Wedge.org..?
Oh, maybe it's working on my local install, but not on the Wedge.org MySQL database... I don't know. Maybe it's not set up as UTF, or something.

Wedge

Home

Login

Register

Nao

Stuck on a programming model...

Pandos

Re: Stuck on a programming model...

Re: Stuck on a programming model...

Nao

Re: Stuck on a programming model...

Pandos

Re: Stuck on a programming model...

Nao

Re: Stuck on a programming model...

Pandos

Re: Stuck on a programming model...

Nao

Re: Stuck on a programming model...

Re: Stuck on a programming model...

Pandos

Re: Stuck on a programming model...

Nao

Re: Stuck on a programming model...