I could do with some ideas here.
Okay, so a couple of weeks ago, I decided to slim down the {db_prefix}members table by moving some of its 'optional' fields into the data field (a do-it-all column that holds a serialized array).
Among the fields I moved, 'message_labels' had, for my account, a value of "À répondre", a label I made that means "pending reply". As you can see, there are accents in that string...
Now, imagine this. I'm trying to keep it simple:
$data = array('message_labels' => 'À répondre');
$data_field = serialize($data);
Then insert $data_field as the 'data' field in the members table.
All right..?
Now, do the reverse. unserialize($row['data']), basically.
Unserialize error. Ouch.
Why so?
Because, apparently, the string was altered (presumably re-encoded) somewhere between the moment it was serialized and the moment I attempted to unserialize it. This only happens for data fields that contain accents or other non-ASCII characters, so English users probably never saw it happen. A serialized string stores the length of each string in bytes, followed by its contents. If the contents are modified in the meantime, the length no longer matches, and this triggers an unserialize error.
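To make the size mismatch concrete, here is a minimal sketch. It assumes the string comes back converted from UTF-8 to Latin-1, which is only one possible kind of alteration, and it simulates that conversion with str_replace() so no database is needed.

```php
<?php
// 'À répondre' written as explicit UTF-8 bytes, so the result does not
// depend on the encoding of this source file.
$label = "\xC3\x80 r\xC3\xA9pondre";

$data_field = serialize(array('message_labels' => $label));
// serialize() records byte lengths: the label is 10 characters but 12 bytes.
echo $data_field, "\n";

// Assumption: the round trip turned UTF-8 into Latin-1 (simulated here).
$altered = str_replace(
    array("\xC3\x80", "\xC3\xA9"), // UTF-8 bytes for À and é
    array("\xC0", "\xE9"),         // the same letters in Latin-1
    $data_field
);

// The prefix still says s:12, but only 10 bytes follow, so parsing fails.
// @ hides the notice/warning that unserialize() raises on bad input.
$result = @unserialize($altered);
var_dump($result); // bool(false)
```

The untouched $data_field still unserializes fine, so the failure comes from the bytes changing, not from serialize() itself.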
- Now, I attempted to 'correct' this by turning my serialized strings into JSON strings instead. But by default, PHP decodes JSON objects into stdClass objects, which sucks, because I either have to explicitly cast them back to arrays, or pass true as the second parameter to json_decode.
Okay, I can live with that... But other tables in Wedge also hold a 'data' field, which is also a serialized string. Does that mean I should use json_encode everywhere instead...?
That was my first solution. The advantage of JSON is that it makes shorter strings, but it's also slightly slower to decode. Not that much, mind you...
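A rough sketch of the JSON route, using the same sample label. One side effect worth noting: with default flags, json_encode() escapes non-ASCII characters as \uXXXX, so the stored string is plain ASCII and there should be nothing left for a charset conversion to alter.

```php
<?php
$label = "\xC3\x80 r\xC3\xA9pondre"; // 'À répondre' as UTF-8 bytes
$data = array('message_labels' => $label);

// With default flags, non-ASCII characters are escaped as \uXXXX.
$json = json_encode($data);
echo $json, "\n";

// Default decoding gives a stdClass object...
$as_object = json_decode($json);
// ...while passing true as the second argument gives an associative array.
$as_array = json_decode($json, true);

var_dump($as_object instanceof stdClass); // bool(true)
var_dump($as_array === $data);            // bool(true)
```

Passing true at every call site is the price; the decoded array is byte-for-byte the one that was encoded.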
- Next solution: call westr::utf8_to_entity() on every array I'm going to serialize. Unfortunately, it's also a slow function, and to be safe, we need to call it on every single data sub-field, which could potentially waste a lot of time. This is the currently (half-)implemented solution.
- Try to fix the storage problem itself. I didn't try this first because, honestly, I suck at handling UTF-8 in databases. But logic dictates that if you store a UTF-8 string and retrieve it, as long as the database is UTF-8, it should return the same UTF-8 string. Unfortunately, it doesn't seem to do that. Is this a bug in the Wedge codebase? The SMF codebase? Or a problem with my database?
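To narrow down where the string changes, one option is to compare the raw bytes just before the insert with the bytes that come back from the select. The sketch below is only a diagnostic: describe_bytes() is a hypothetical helper (not part of Wedge or SMF), and $read_back is a simulated stand-in for what would really be $row['data'].

```php
<?php
// Hypothetical helper: summarise a string's raw bytes.
function describe_bytes($string)
{
    return array(
        'bytes' => strlen($string),                       // byte count, as serialize() sees it
        'valid_utf8' => preg_match('//u', $string) === 1, // the /u modifier rejects invalid UTF-8
        'hex' => bin2hex($string),                        // exact bytes, for side-by-side comparison
    );
}

$written = "\xC3\x80 r\xC3\xA9pondre"; // 'À répondre' as UTF-8 bytes

// Assumption: the value came back as Latin-1. In real use this would be
// the string read from the database, not a hard-coded value.
$read_back = "\xC0 r\xE9pondre";

print_r(describe_bytes($written));
print_r(describe_bytes($read_back));
```

If the two hex dumps differ, the string is being changed on the way in or out of the database rather than by serialize(), which would point at the charset settings of the connection or the column.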
Yes, I did google this, and found stuff, but nothing helpful, unfortunately. People seem to assume that you shouldn't pass a serialized string through a database system, because "you're never sure what you're going to get back"... Heck, YES, I know what I should be getting back. I should be getting the same string...! Some people suggest simply rewriting the serialized string's lengths through a preg_replace_callback, but this is too CPU-intensive for my taste, and I'd like to find a logical solution instead of a dirty hack.
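For reference, the length-rewriting hack looks roughly like this. It is a sketch of the general idea rather than any particular posted version; fix_serialized_lengths() is a hypothetical name, and the input simulates a label that came back as Latin-1.

```php
<?php
// Hypothetical helper: recompute every s:N:"..." byte length in a serialized string.
// Fragile by design: it misfires if a stored string itself contains '";'.
function fix_serialized_lengths($serialized)
{
    return preg_replace_callback(
        '/s:\d+:"(.*?)";/s',
        function ($m) {
            // Rebuild the token with the actual byte length of its contents.
            return 's:' . strlen($m[1]) . ':"' . $m[1] . '";';
        },
        $serialized
    );
}

// Serialized array whose label is now 10 Latin-1 bytes behind a stale s:12 prefix.
$altered = "a:1:{s:14:\"message_labels\";s:12:\"\xC0 r\xE9pondre\";}";

$repaired = unserialize(fix_serialized_lengths($altered));
var_dump($repaired);
```

Note that this only repairs the lengths: the label in $repaired is still in whatever encoding it came back in, so the underlying problem stays unsolved.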
Feel free to chip in, if only to say which solution you'd prefer. Thanks!



