Wedge

Public area => The Pub => Features => Topic started by: Nao on July 2nd, 2012, 02:56 PM

Title: Database field formats
Post by: Nao on July 2nd, 2012, 02:56 PM
As discussed earlier, I'm spending some time with the database and trying to harmonize field sizes.

id_album is currenty an int(10), while id_board is smallint(5).
In the future, albums will internally be treated as boards, so I could either reduce the id_album to a smallint (maximum value: unsigned 64K), or increase the size of id_board to medium(8), i.e. in the millions, and reduce the size of id_album accordingly.

While technically it's true that having millions of boards will probably crash Wedge before long, I'm not sure that the 64K limit is good enough. If you have a popular website where thousands of users create media galleries every day, you could easily break the limit in a few months time. More realistically, any image-oriented forum with a decent community could potentially break the limit in a few years. After that, we can either release a patch to increase the id_board size for them -- or simply account for it now.

Only problem is, mediumint(8) is stored on 3 bytes so I'm not sure it's as good in terms of performance as smallint. What do you think...? (The worst possible offender is the messages table, which has an id_board field.)

And finally -- id_group should be increased as well. Currently it's set to mediumint(5) which doesn't make sense (the 5-char limit will likely be ignored anyway if there are millions of groups...?), so I started by setting it to mediumint(8) for clarity, and then changed it to int(10) mainly for one reason: ideally, ANY user can create ANY number of groups to put ANY number of members into it. We should (?) probably put a limit to those, but for now I'm upgrading to int. Anyone knows whether this will have an influence over perfs...?
Title: Re: Database field formats
Post by: Arantor on July 2nd, 2012, 04:21 PM
First of all, do not be too concerned about the number in brackets. It is primarily a display setting from the console tool, for setting how many columns should be used for displaying the field. If the field requires 5 columns but can handle 8, it'll store all 8 but it will attempt to use only 5 for displaying.

The deal, really, is the column type itself, and I'll actually grab the code from a helper function I put in the packages DB stuff:

Code: [Select]
case 'tinyint':
$type_size = $unsigned ? 3 : 4; // 0 to 255 (3) vs -128 (4) to 127
break;
case 'smallint':
$type_size = $unsigned ? 5 : 6; // 0 to 65535 (5) vs -32768 (6) to 32767
break;
case 'mediumint':
$type_size = 8; // 0 to 16777215 (8) vs -8388608 (8) to 8388607
break;
case 'int':
$type_size = $unsigned ? 10 : 11; // 0 to 4294967296 (10) vs -2147483648 (11) to 2147483647
break;
case 'bigint':
$type_size = 20; // 0 to 18446744073709551616 (20) vs -9223372036854775808 (20) to 9223372036854775807
break;

So, there's your ranges for each of the columns, including the typical number involved for column sizes (just because the create/alter table statements *need* a size)


There are problems with using even more than a couple of thousand boards, let alone tens of thousands, and they all revolve around how board access is handled - throwing even a few hundred values into a WHERE ... IN (...) clause is not pretty, doing it for thousands of boards even less so. I've been thinking about this a bit, and have come to the conclusion that maybe we need to rethink how board access is physically implemented - perhaps we need to rethink {query_see_board} as a subselect.

Thing is, this limit is still present whether albums are boards or not, assuming query_see_board is still used to figure out access to them. It's just less noticeable.

Realistically, any forum that's big enough to be hitting 64K boards is also likely to have other problems and need support from us before it gets there anyway, in which case we deal with it before they get there.

Expanding id_group is no issue in itself but we need to rethink additional_groups in that situation. Unless I've been asleep for a long time and it's changed and I wasn't looking, it's done as a textual column in the members table, and there were issues in the past with it being truncated. I can't remember if we expanded that or not, but if not, we need to at least do that.

Lastly, performance. It's always a tough call but generally speaking, using the smallest size possible for your columns is considered to be the best approach. It keeps the table smaller, more importantly it keeps indexes smaller. And when you get up to huge sizes, the bottleneck becomes I/O not CPU anyway, so physically less data being thrown around = faster movement.
Title: Re: Database field formats
Post by: Pandos on July 2nd, 2012, 04:40 PM

I would consider using MEDIUMINT UNSIGNED for all fields.
MEDIUMINT gives you enough space up to 16M rows (unsigned).
There shouldn't be a performance difference, the only advantage you get is a slightly smaller table size (25%)  by using MEDIUMINT instead of INT.

The number in parentheses (int10) in a type declaration is the
display width, which is unrelated to the range of values that can be stored in a data type.

Title: Re: Database field formats
Post by: Arantor on July 2nd, 2012, 04:44 PM
Quote
I would consider using MEDIUMINT UNSIGNED for all fields.
Except for where we have to have signed fields. Which is more than you'd think. And there are quite a few places where INT is still required too, like timestamps.
Quote
There shouldn't be a performance difference, the only advantage you get is a slightly smaller table size (25%)  by using MEDIUMINT instead of INT.
Except for the fact you don't just get a smaller table, you get smaller indexes too. And that does make a difference for querying times.
Quote
The number in parentheses (int10) in a type declaration is the
display width, which is unrelated to the range of values that can be stored in a data type.
Yes, I already mentioned this.
Title: Re: Database field formats
Post by: Pandos on July 2nd, 2012, 05:09 PM
Quote from Arantor on July 2nd, 2012, 04:44 PM
Quote
I would consider using MEDIUMINT UNSIGNED for all fields.
Quote
Except for where we have to have signed fields. Which is more than you'd think. And there are quite a few places where INT is still required too, like timestamps.
Yes. But the description "all fields" where ment for  id_board and id_album.


Perhaps we should merge id_board and id_album together to id_item or something else?
This should be done by adding another field to  the table (e.g. is_board) to determine  that id_item is a board or an album,
Only a suggestion.
Quote
There shouldn't be a performance difference, the only advantage you get is a slightly smaller table size (25%)  by using MEDIUMINT instead of INT.
Quote
Except for the fact you don't just get a smaller table, you get smaller indexes too. And that does make a difference for querying times.
Yes. You're right.