@Nao, can you explain what the problem with 4-byte UTF8 is?
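Well, MySQL added support for 4-byte UTF8 (the utf8mb4 charset), which is used mostly by new smileys (emoji). If you try to insert such a character into a regular utf8 database, it'll typically end up as a ? (question mark), or the text may get cut off at that character, depending on the connection charset and SQL mode. Which isn't great. Of course, it only affects those smileys.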
There are two ways to fix that:
- Either catch all 4-byte characters at MySQL insert time, and turn them into HTML entities (which is what Elk does, so that's emanuelle's recommended solution),
- Or ensure that the database uses utf8mb4 from the start, if supported. That way, the character will only take 4 bytes in the database, rather than the 9 or so bytes of an entity like &#128512;... Which, okay, is barely twice that number.
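For illustration, here's a minimal sketch of the first option, written in Python rather than PHP just to keep it short. The function name is made up, and this isn't necessarily how Elk does it. It only assumes that "4-byte characters" means code points above U+FFFF.

```python
import re

# Code points above U+FFFF are exactly the ones that need 4 bytes in UTF-8.
FOUR_BYTE_CHARS = re.compile("[\U00010000-\U0010FFFF]")


def encode_4byte_as_entities(text: str) -> str:
    """Replace each 4-byte character with a decimal HTML entity (hypothetical helper)."""
    return FOUR_BYTE_CHARS.sub(lambda m: "&#%d;" % ord(m.group()), text)
```

Calling `encode_4byte_as_entities("hi \U0001F600")` gives `hi &#128512;`. The smiley goes from 4 bytes to 9, and everything below U+10000 (accents, CJK, and so on) is left untouched.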
I'm not a big fan of parsing all messages for character recognition. I don't know if a str_split() followed by an ord() call on every array item would be really fast. (Probably faster than a regexp, but still...)
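To make the byte-by-byte idea concrete, here's a rough sketch in Python rather than PHP (the function name is made up). It relies on one property of UTF-8: the first byte of a 4-byte sequence is 0xF0 or higher, while every byte of a 1- to 3-byte sequence is below that. So there's nothing to decode, you just look at byte values.

```python
def has_4byte_utf8(raw: bytes) -> bool:
    """Return True if a UTF-8 byte string contains a 4-byte sequence (hypothetical helper)."""
    # Lead bytes of 4-byte sequences start at 0xF0; ASCII, continuation bytes
    # and the lead bytes of 2- and 3-byte sequences are all lower.
    return any(b >= 0xF0 for b in raw)
```

This only tells you whether a message needs the entity treatment at all, so most messages could skip the replacement step. Whether the PHP equivalent actually beats a regexp is something I'd have to measure.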