Wedge

Public area => The Pub => Features => Topic started by: Nao on February 24th, 2014, 03:57 PM

Title: Stuck on a programming model...
Post by: Nao on February 24th, 2014, 03:57 PM
I could do with some ideas here.

Okay, so a couple of weeks ago, I decided to shorten the {db_prefix}members table, by moving some of its 'optional' fields into the data field (which is a do-it-all column that holds a serialized array.)

Among the fields I moved, 'message_labels' had for my account a value of "À répondre", which is a label I made which means "pending reply". As you can see, there are accents in that string...

Now, imagine this. I'm trying to keep it simple:
$data = array('member_labels' => 'À répondre');
$data_field = serialize($data);
Then insert $data_field as the 'data' field in the members table.

All right..?
Now, do the reverse. unserialize($row['data']), basically.
Unserialize error. Ouch.
Why so?
Because, apparently, the string was turned into a different format between the moment it was serialized, and the moment I attempted to unserialize. This only happens for data fields that contain accents or other weird characters, so English users probably never saw this happen. A serialized string first holds the size of a string, then its contents. If the string is modified in the meantime, then the size won't match, and this will trigger an unserialize error.
Honestly, I'm looking into the last solution because it'd be the easiest to implement in the end, but I'm all ears when it comes to other alternatives, or ANYTHING that could help with this situation, really. Did any of you ever get into trouble with something of this kind..?
Yes, I did google this, and found stuff, but nothing helpful, unfortunately. People seem to assume that you shouldn't serialize a string for transmission through a database system, because "you're never sure what you're going to get back"... Heck, YES I know what I should be getting back. I should be getting the same string...! Some people suggest simply rewriting the serialized string's sizes through a preg_replace_callback, but this is too CPU-intensive to my taste, and I'd like to find a logical solution, instead of a dirty hack.

Feel free to chip in, if only to say which solution you'd prefer. Thanks!
Title: Re: Stuck on a programming model...
Post by: Pandos on February 24th, 2014, 04:29 PM

Perhaps try to check if it is serialized and then if validated de-serialize it:
Code: [Select]
if( $unserialized = @unserialize( $row['data'] ) )

Title: Re: Stuck on a programming model...
Post by: Pandos on February 24th, 2014, 04:48 PM
Forget about the first post.
Anyway. If you serialize data into MySQL it's better to do it the save way because MySQL is very picky when it comes to special chars and serialisation gets corrupted.
Use base64_encode and decode. And please make sure the field is a BLOB column.
Title: Re: Stuck on a programming model...
Post by: Nao on February 24th, 2014, 05:45 PM
Yes, I'll forget about it... :P

base64_encode is faster than entity conversion, but takes more space in the database, especially as it's not often needed.
As far as I know, BLOB columns are not suitable for text operations, such as SELECT data FROM members WHERE data LIKE '%my_variable%'... Not that it matters much, but... It makes it hard for me to adopt, I guess.
Title: Re: Stuck on a programming model...
Post by: Pandos on February 24th, 2014, 05:57 PM
Yes, you are right.
Then it's better to use VARCHAR instead of BLOB.


And who cares about space in DB? It'll be cached anyway :)
Title: Re: Stuck on a programming model...
Post by: Nao on February 24th, 2014, 06:58 PM
It's a text field, not a varchar.

Anyone got an opinion in this?
Title: Re: Stuck on a programming model...
Post by: Pandos on February 24th, 2014, 07:50 PM
VARCHAR is stored inline and is usually faster than text :)
Title: Re: Stuck on a programming model...
Post by: Nao on February 24th, 2014, 07:55 PM
Yes, but the data field is not always loaded. It can take a lot of space, so if you do a table scan on the members table, and data is inline, it'll be less efficient.
Title: Re: Stuck on a programming model...
Post by: Nao on February 24th, 2014, 08:21 PM
Okay, so... It would appear that unserialize(utf8_decode($user_settings['data'])) will actually fix the charset, and properly decode the string. *THEN*, I have to call utf8_encode on the array to make it work at showtime. Seriously, it's overkill. But it fixes the unserializing, so I've added it (temporarily at least) as a 'fallback' if the function fails.

I'm frankly a bit upset that I have to do this. I have no idea why it causes issues, when other fields don't seem to do the same... :-/
Title: Re: Stuck on a programming model...
Post by: Pandos on February 24th, 2014, 08:29 PM
It's because MySQL's UTF-8 before 5.5 is "fake UTF-8"? It only supports 3 bytes per character max., so there is no way to directly store unicode characters plane in MySQL.  :geek:
So we should lift up to 5.5?

This will bring the advantage of using "real UTF-8 support" (so you should not bother with serialize errors) and TEXT are also stored inline. :)


And from my personal point of view:
MySQL is almost at the level where InnoDB can perform fulltext search and MyISAM will be almost completely outdated and Wedge is the next generation forum software. So why .... ?
Title: Re: Stuck on a programming model...
Post by: Nao on February 25th, 2014, 12:18 AM
Quote from Pandos on February 24th, 2014, 08:29 PM
It's because MySQL's UTF-8 before 5.5 is "fake UTF-8"? It only supports 3 bytes per character max., so there is no way to directly store unicode characters plane in MySQL.  :geek:
So we should lift up to 5.5?
No...
My test strings just have regular accents that are encoded to 2 bytes in UTF8. 4-byte chars in UTF8 are, AFAIK, exclusively for CJK languages. (Especially Chinese Traditional & Simplified -- probably represents about 80.000 characters.)
Quote from Pandos on February 24th, 2014, 08:29 PM
This will bring the advantage of using "real UTF-8 support" (so you should not bother with serialize errors) and TEXT are also stored inline. :)
My local install has MySQL 5.5, and (still) the same problem.
Quote from Pandos on February 24th, 2014, 08:29 PM
And from my personal point of view:
MySQL is almost at the level where InnoDB can perform fulltext search and MyISAM will be almost completely outdated and Wedge is the next generation forum software. So why .... ?
MyISAM is still the de-facto solution for fast reads. I could be convinced to go InnoDB for a few tables, and memory for the session table (if not already done), but not much else...

Okay, so I've done another test by turning my TEXT field into a BLOB and then into a VARCHAR(65535)... And, it didn't change anything. I'm still getting the same contents. Perhaps the problem is happening at store time, rather than retrieve time, I don't know.

I'm still at a loss about what I should do. Remain with all serialize() calls (because there are TONS of theme in the Wedge codebase, and they all work, just look at the number of serialized items stored in $settings!), or convert everything to json strings (shorter, a bit slower to load)..?
Posted: February 25th, 2014, 12:09 AM

Crappy crap...
Did another run of my original code, the one that caused the bug...
Guess what. It worked just fine this time. >_<
So... I've got to suppose, it's not related to the serializing process at all. Uh... Then why did it fail for Wedge.org..?
Oh, maybe it's working on my local install, but not on the Wedge.org MySQL database... I don't know. Maybe it's not set up as UTF, or something.