Wedge

Public area => The Pub => Features => Topic started by: Nao on July 1st, 2012, 01:53 AM

Title: Media boards: fishing for opinions
Post by: Nao on July 1st, 2012, 01:53 AM
So... For the last few weeks I've been considering multiple ways of integrating media items smoothly into Wedge. (Via the upcoming floating board feature.)

After all this time, there's one technique that sticks, and I'll explain it to you. The reason is that I'm not sure whether it's for the best, because it has a few disadvantages.

Simply said, a topic (wedge_topics table) would now have an associated id_media field. This field would represent a single entry in the media table.
With that said:

- an album is a custom board
- an album comment thread is a topic in this board where id_media = 0 (no associated image)
- an album item is actually a topic, which has a custom id_media that tells Wedge what thumbnail it should show. The item description could very simply be the first post in the topic.
- an album item comment thread is, obviously, the list of messages in that topic.

One of the advantages of this system is that it allows you to associate a media item with a topic even when the board isn't an album board.
For instance, we could have a blog where the posts are listed along with a thumbnail of their associated media item. Meaning, for each blog post the first media item you upload for it will become the default id_media for the topic. I could also ensure that you can select another item for use as main item. (Really, you could even use a single id_media for multiple blog posts -- e.g. like a category image.)

Now for the drawbacks...
- when you're writing a post, even a reply to some other post, if you upload a media item to attach in it, it creates a new topic internally. Which makes a lot of sense -- but then you get to the point where creating topics as 'supporting players' of a simple message sounds weird, even uncomfortable. (Then again, it's only internal stuff. I mean, users of Facebook don't really mind URLs with long series of digits... Also, we could always show ?topic=base64encoded_id in the URL, ahah...)
- it's not gonna be fun writing a converter for all of that... (Plus, TE is kinda MIA these days, so I fear it will be up to me to do it :P)
- the 'view unseen' permission is nulled by this, because it's all about new topics by then. Maybe I could add a filter for board types in the 'view unread posts' page. i.e. 'Mark (select box) as read' where the select box has Forums, Media, Blogs...
- I had another drawback in mind, which actually brought me to write this post, but I forgot about it for now. I'll make sure to bump this topic if I remember it.

Also, I've noticed that in AeMe, id_media is set to a size of int(11), while topics are set to mediumint(8). If a topic is now automatically attached to a media item, then it makes sense to set topics to int(11) as well, or set media size to mediumint(8) instead... No?

Which also reminds me that the SQL file has a lot of entries that don't use UNSIGNED where they could. But I don't know what's best -- fixing them as I find them, or leaving them as is, in the event that some would actually want to force a negative value...? (see privacy levels for instance, where a value of -3 means 'everyone' so that positive values can be assigned to membergroups.)

Please discuss all of this :)

(And Pete -- I understand you're on your way to KY, so you're dismissed, I was myself totally unable to participate in complex discussions this week with all the jetlag :P But if there aren't any solutions found by the time you come back, please feel free to share your opinion!)
Title: Re: Media boards: fishing for opinions
Post by: Arantor on July 1st, 2012, 02:39 AM
I'm already here in KY and have been to bed at least once (and now I'm actually on the right time, really, I was going to bed/getting up in American time for a while!)


OK, let me tackle the limits stuff first, since that's the most pressing matter from my perspective. mediumint(8) allows for something like 8.3 million topics, even if it's signed (16,777,215 if unsigned), now I can't imagine a forum having that many topics + attachments. I've never seen any forum except Gaia Online get to the realm of millions of topics, even if they get to millions of posts.

But also consider how many media items you'd have to have to get to that stage, and more importantly how big they'd be. Even at 1KB each, you'd be in to the GBs of storage to cope with it.

On a wider scale, though, you also have to consider that if you have to actually consider many millions (more than a mediumint can hold), you might want to be considering the next field up from an int, which is a bigint, and that causes *many* more problems. Going over the 32 bit threshold is a potential problem on many levels, more than you really want to be diagnosing.

But realistically, a forum that is even close to approaching that is going to require specialist attention long before they get that big, and the single biggest SMF forum I know has 19.7m posts in 520k topics (warriorcatsrpg.com), or, alternatively I looked at Nairaland which used to run SMF, which currently has 913k topics - but even so we're still talking way below the 8m/16m border. So potentially it's a non issue.


As far as the context goes, everything else you've pointed out is fine as far as I'm concerned.

Lastly, re negative ids, there really aren't that many instances of it but we should certainly look through it - ideally every time we find a case where it happens, we should add a comment to the SQL to indicate that. Once we know which columns have to remain signed, we can make everything else unsigned ;)
Title: Re: Media boards: fishing for opinions
Post by: Nao on July 1st, 2012, 06:22 AM
Quote from Arantor on July 1st, 2012, 02:39 AM
I'm already here in KY
I assumed it was likely since you mentioned the hot weather, and last time I checked it's not so hot here in Paris so I doubt it's any better in the UK :P
Quote
and have been to bed at least once (and now I'm actually on the right time, really, I was going to bed/getting up in American time for a while!)
Ah, you're lucky... Somehow. I myself immediately caught the 'right' timezone when going to sleep because it's easy for me to extend a night's work on code, ahah. But when I came back to Paris, there went the problem -- unable to catch the French timezone again.
Quote
OK, let me tackle the limits stuff first, since that's the most pressing matter from my perspective. mediumint(8) allows for something like 8.3 million topics, even if it's signed (16,777,215 if unsigned), now I can't imagine a forum having that many topics + attachments.
I can...
Quote
But also consider how many media items you'd have to have to get to that stage, and more importantly how big they'd be. Even at 1KB each, you'd be in to the GBs of storage to cope with it.
There's a strong possibility that it wouldn't cope easily, indeed. Eheh.
Then again, in the latest .Net I saw an ad about a £49 monthly server delivered with 2x3TB hard drives in RAID... Now, just launch a successful project where people can post their pics and remove them at will. (IDs will stay unique so any pic removed won't free up its database ID space.) Well, I can very well see it working.
Let's say... 20k users.
8MB topics & media items & whatever floating types / 20k = 400 entries per user in average.
Sure, it'll take some time to reach that... But it's doable.
Hence my suggestion to go for int(10) (like id_msg) or int(11) (like id_media), although it's basically the same (int11 is for signed stuff I guess, since it supports the minus sign.)
Quote
On a wider scale, though, you also have to consider that if you have to actually consider many millions (more than a mediumint can hold), you might want to be considering the next field up from an int, which is a bigint, and that causes *many* more problems. Going over the 32 bit threshold is a potential problem on many levels, more than you really want to be diagnosing.
Hmm, 2 billion entries (signed int) is enough for me... I'll leave the rest up to Facebook ;)
Quote
But realistically, a forum that is even close to approaching that is going to require specialist attention long before they get that big, and the single biggest SMF forum I know has 19.7m posts in 520k topics (warriorcatsrpg.com),
Oh my...
Quote
or, alternatively I looked at Nairaland which used to run SMF, which currently has 913k topics
Had a look at that... It's ...quite ugly, to be polite.
Why do 'big' forums always have to be the ones that make you wanna facepalm...?
Quote
As far as the context goes, everything else you've pointed out is fine as far as I'm concerned.
I'm thinking we could even consider dropping the album table... I'll have to look.
Quote
Lastly, re negative ids, there really aren't that many instances of it but we should certainly look through it - ideally every time we find a case where it happens, we should add a comment to the SQL to indicate that. Once we know which columns have to remain signed, we can make everything else unsigned ;)
Well, at the very least, auto increment fields are unsigned for sure...!
Title: Re: Media boards: fishing for opinions
Post by: godboko71 on July 5th, 2012, 01:55 PM
note t self reply when at keyboard.
Title: Re: Media boards: fishing for opinions
Post by: Nao on July 5th, 2012, 03:01 PM
Wanting opinions for everything, myself :)
Title: Re: Media boards: fishing for opinions
Post by: Arantor on July 5th, 2012, 03:16 PM
I'm honestly thinking we don't need to change the topic id column at this time, just because I don't know any forum that is going to get anywhere near it - and again, if they are even remotely close to it, they're going to run into other problems long before that which require us taking a look and helping optimise it anyway, and we can deal with scaling it closer to time.
Title: Re: Media boards: fishing for opinions
Post by: Nao on July 5th, 2012, 06:41 PM
Quote from Arantor on July 5th, 2012, 03:16 PM
I'm honestly thinking we don't need to change the topic id column at this time, just because I don't know any forum that is going to get anywhere near it - and again, if they are even remotely close to it, they're going to run into other problems long before that which require us taking a look and helping optimise it anyway, and we can deal with scaling it closer to time.
I'm not sure.
But I'll just trust your instincts on this.
So... mediumint(8) unsigned? Is that okay for topic IDs and media item IDs? Even if it's on 3 bytes and stuff? (I'm sure MySQL doesn't have performance issues related to byte alignment, but I want to be sure...)
Title: Re: Media boards: fishing for opinions
Post by: Arantor on July 6th, 2012, 12:02 AM
mediumint unsigned is fine for topic and media item ids as far as I'm concerned.

The number of bytes is not a huge issue; MySQL quite happily uses 1, 2, 3 and 4 byte fields with impunity. Since MySQL does even have bitwise fields that actually work at the bit level, and it packs rows itself, it's not a really huge issue.

Of a more practical issue is the fact that all the indexes which use that field (of which there are more than one AFAIR) will all remain the same size rather than growing.
Title: Re: Media boards: fishing for opinions
Post by: Nao on July 6th, 2012, 01:08 PM
Quote from Arantor on July 6th, 2012, 12:02 AM
mediumint unsigned is fine for topic and media item ids as far as I'm concerned.
IIRC topics are originally signed mediumint so we get twice the original amount and no extra space taken so that's cool.
I guess that if someone's forum becomes huge, they'll have all the time they need to simply change the field sizes in the database because, thankfully, that can be done even through phpMyAdmin in a few steps... We can even release a script that will change everything for them. I guess I'm a bit too stressed and I tend to see limits everywhere when really I don't expect Wedge to host forums that have more than a few million posts... (or media items.)
Quote
The number of bytes is not a huge issue; MySQL quite happily uses 1, 2, 3 and 4 byte fields with impunity. Since MySQL does even have bitwise fields that actually work at the bit level, and it packs rows itself, it's not a really huge issue.
Oh, so a 3-byte field will be packed, like, compressed..?
Quote
Of a more practical issue is the fact that all the indexes which use that field (of which there are more than one AFAIR) will all remain the same size rather than growing.
I don't see what that means?
Title: Re: Media boards: fishing for opinions
Post by: Arantor on July 6th, 2012, 04:13 PM
Quote
I guess that if someone's forum becomes huge, they'll have all the time they need to simply change the field sizes in the database because, thankfully, that can be done even through phpMyAdmin in a few steps..
Sadly it's not quite that simple. There are two kinds of schema changes, some are essentially free because they just update the table definition, e.g. changing a mediumint(5) to a mediumint(8) is free because it's just updating the .frm file (for MyISAM, though it's still true for InnoDB too)

The other kind is where it is not free, because it requires rewriting all the data in every row to align it to the new size. Changing the size or type of columns (mediumint to int, or the other way for example, or changing the type from numeric to textual etc) is very expensive on large tables as a result - and naturally it's usually fully table locking in the process. (Certain InnoDB operations can be done without it being a full table lock but generally assume it is)

Also note that adding a column to an existing table has exactly the same problem. For most tables it's not preposterously expensive in most cases, because in most cases people with 100k+ posts don't generally go modifying their DB too much unless absolutely necessary because once they get to that size they're already aware of the limitations.
Quote
Oh, so a 3-byte field will be packed, like, compressed..?
No, it's not compressed, but it is byte-packed, meaning that fields that don't need full bytes don't generally consume full bytes, but they pack all the fields in to use the least number of whole bytes.

There's not really a great amount of compression you can do on three bytes.
Quote
I don't see what that means?
id_topics appear in the indexes in multiple places. Take wedge_messages, of the litany of indexes that table has, four of them have id_topic as part of those indexes, and as a result of that, expanding the id_topic field to 4 bytes means we don't lose an extra byte per row, we lose that multiplied by 5 (1 for the row, 4 for indexes), so we gain 5 bytes per message, just in that one case, and in the case of indexes, we have a vested interest in keeping them lean.
Title: Re: Media boards: fishing for opinions
Post by: Nao on July 15th, 2012, 12:07 AM
I missed your reply, Pete...

Note: I've added a POLL for the media structure. Please answer if you can! The solution I described in my first post is the first one in the list. The last one is the equivalent to Aeva Media's album topic system (which AFAIK was never used by anyone but me -- in FoxProg, as a way to automatically fill my blog with new album uploads.)
Quote from Arantor on July 6th, 2012, 04:13 PM
The other kind is where it is not free, because it requires rewriting all the data in every row to align it to the new size. Changing the size or type of columns (mediumint to int, or the other way for example, or changing the type from numeric to textual etc) is very expensive on large tables as a result - and naturally it's usually fully table locking in the process. (Certain InnoDB operations can be done without it being a full table lock but generally assume it is)
Uh... Well. I can only imagine that if you put your forum into maintenance mode, it won't be too much of an issue...?
Quote
Also note that adding a column to an existing table has exactly the same problem. For most tables it's not preposterously expensive in most cases, because in most cases people with 100k+ posts don't generally go modifying their DB too much unless absolutely necessary because once they get to that size they're already aware of the limitations.
Well, I reached 220K on my old forum without knowing a damn thing about MySQL...
Quote
No, it's not compressed, but it is byte-packed, meaning that fields that don't need full bytes don't generally consume full bytes, but they pack all the fields in to use the least number of whole bytes.
Oh... Good, then. As such, I don't really see any reason to prevent the field from being set to a larger size?
Quote
id_topics appear in the indexes in multiple places. Take wedge_messages, of the litany of indexes that table has, four of them have id_topic as part of those indexes, and as a result of that, expanding the id_topic field to 4 bytes means we don't lose an extra byte per row, we lose that multiplied by 5 (1 for the row, 4 for indexes), so we gain 5 bytes per message, just in that one case, and in the case of indexes, we have a vested interest in keeping them lean.
Now, that's a real problem... :-/
Title: Re: Media boards: fishing for opinions
Post by: live627 on July 15th, 2012, 12:24 AM
How would the first option slow things? If the board and topic types are index, it couldn't be a big deal...?
Title: Re: Media boards: fishing for opinions
Post by: Nao on July 15th, 2012, 12:25 AM
Well, going for album 'boards' means that we're going to get more 'custom boards' than just by offering blog boards.
And more custom boards, means that, as per Pete's remark, {query_see_boards} is going to require a subselect rather than get away with a IN (...), so it's bound to be a bit slower on all board-related queries.
No?

Of course I also saw another post on another website that swears a subselect is faster because it can make use of an index, I'm not even sure about that, and anyway it can hardly be faster than a pure string representation of the board list...?

I don't know.
Title: Re: Media boards: fishing for opinions
Post by: live627 on July 15th, 2012, 12:43 AM
Subselect? Why would IN not suffice? Am I missing something fundamental?
Title: Re: Media boards: fishing for opinions
Post by: Arantor on July 15th, 2012, 01:19 AM
OK, here's the deal. If you expand the type of boards to include albums, you drastically expand the scope of that IN() clause.

Remember how MySQL optimises those: it's basically pre-processed into WHERE var = x OR var = y OR var = z OR var = a. There's no faster way of optimising it, and if you expand it too far it gets ugly quickly.

Using a subselect here is a bit of a misnomer vs an IN() because both potentially generate a decent amount of clauses to match against, and I'm not sure how beneficial it'll be, which is why benchmarking is our friend.
Title: Re: Media boards: fishing for opinions
Post by: Nao on July 15th, 2012, 07:41 AM
And benchmarking is something I'm not sure about doing for now... (Well, maybe we can use John's latest plugin for that :P)

I like the idea of a subselect, though, ideally, because it's cleaner and at least it gives MySQL an opportunity to optimize by itself. Only problem is, MySQL is usually not too good at that task... :^^;:

Only one vote in the poll? :sob:
Posted: July 15th, 2012, 07:30 AM

Pete, that reminds me of this post I saw this week...
http://developwithstyle.com/articles/2012/06/22/speeding-up-sql-queries-containing-text-fields/
Do you think that's true? Because if it is, doing MySQL is like fucking voodoo to me :P
Title: Re: Media boards: fishing for opinions
Post by: live627 on July 15th, 2012, 07:51 AM
Quote
And benchmarking is something I'm not sure about doing for now... (Well, maybe we can use John's latest plugin for that :P)
How would that work...? That plugin doesn't benchmark anything...
Quote
Only one vote in the poll? :sob:
I haven't decided whether to vote or A or B. For B, how would viewing albums work?
Title: Re: Media boards: fishing for opinions
Post by: Arantor on July 15th, 2012, 03:48 PM
Quote
How would that work...? That plugin doesn't benchmark anything...
No, but you get a lot of data to work with to be able to understand how running a larger forum is likely to handle the extra load.
Quote
I haven't decided whether to vote or A or B. For B, how would viewing albums work?
That's a very good question. What it means to me is that MessageIndex.php needs an overhaul, possibly loadBoard() needs to identify the type of board involved and when we would otherwise divert to MessageIndex, we figure out which file needs to be loaded for that type of board.
Quote
Pete, that reminds me of this post I saw this week...
I don't think it's true, I know it is, but I also know the author's an idiot in the first place - SELECT * and SELECT field, table.* levy their own issues, and I suspect something interesting would happen if the author didn't just blindly select * from anywhere.

Anyway, yes, the point is correct: if you do a query where you're sorting by any column and there's a TEXT or larger field in there, it will generate a disk-based temporary table (this is what a filesort is). Getting around it using a subselect is an interesting way of doing it, but essentially yes it's correct, it fools the query parser as to how the table is generated and it means it isn't implicitly a filesort. At that point you just have to hope MySQL can get enough memory to do it in memory otherwise you end up with a filesort anyway.

It's also why Display, for example, gets the message ids first, then fetches messages separately.
Title: Re: Media boards: fishing for opinions
Post by: Nao on July 15th, 2012, 07:25 PM
Quote from Arantor on July 15th, 2012, 03:48 PM
No, but you get a lot of data to work with to be able to understand how running a larger forum is likely to handle the extra load.
Yep.
Quote
That's a very good question. What it means to me is that MessageIndex.php needs an overhaul, possibly loadBoard() needs to identify the type of board involved and when we would otherwise divert to MessageIndex, we figure out which file needs to be loaded for that type of board.
First, I think that we should build a mini-plugin system for floating boards.
MessageIndex.php would load the main stuff, then would determine the board type and call MessageIndex-Blog.php or whatever. (Or just the corresponding template.)
Same for Display -- Display-Blog.template.php, etc.
This would probably help performance if implementing tons of board types. Also helps with plugins enabling their own board type...?

Now, as for your comment, I'd just like to say that nothing prevents MessageIndex from redirecting to the current code in the /media/ folder. Well, what I really mean is, solution (A) would change the album URLs to use something in the style of board URLs, while solution (B) would keep using ?action=media;sa=album;in=123 where the albums table stores id_album = 123 and id_topic = 345, and id_topic is a custom topic created in a custom (unique) board. The main point is that the code changes are limited, the main drawback is that you can't set up board permissions on the album, rather you have to use the current album permission system -- of course the album permission system is currently better because it offers to ban/allow users on a case by case basis, while boards, and will force you to create a custom group just to allow or deny someone.
I'm not sure what the best would be... Hence my poll.

Heck, the C and D solutions are also complex in that they effectively remove 'albums' from possible 'floating boards', instead having a single hidden board hold all of the albums as topics.

(Of course, I could also have added another suggestion where we have a single hidden board to represent the gallery, and then one topic per item -- with the item/album/category relationship being maintained in the media tables. That would make sense, too...

So, to keep it short, solution A makes the most sense if we want to go fully 'floating boards'. Other solutions might be better for performance. I don't know.
Quote
Anyway, yes, the point is correct: if you do a query where you're sorting by any column and there's a TEXT or larger field in there, it will generate a disk-based temporary table (this is what a filesort is). Getting around it using a subselect is an interesting way of doing it, but essentially yes it's correct, it fools the query parser as to how the table is generated and it means it isn't implicitly a filesort. At that point you just have to hope MySQL can get enough memory to do it in memory otherwise you end up with a filesort anyway.
That's... Very funny.
Well, thankfully, MySQL queries can be optimized later on, they don't have to be static :)