Wedge
Public area => The Pub => Off-topic => Topic started by: Dragooon on June 4th, 2012, 11:01 PM
-
I'm fairly dense when it comes to MySQL optimization mostly because I never had to do it. But now it seems to be required as I got a DB that's bigger than anything I've managed (which hasn't been randomly populated with a bunch of rand() calls). So, basically I got a table called bills which has only one primary key and index as id_bill
I run this query
SELECT <columns>
FROM bhd_bills AS b
INNER JOIN bhd_accounts AS a ON (a.id_account = b.id_account)
INNER JOIN bhd_users AS u ON (u.id_user = b.id_raised)
LEFT JOIN bhd_accounts AS ua ON (ua.id_user = u.id_user)
WHERE cancelled = 0
AND 1=1
ORDER BY b.id_bill DESC
LIMIT 0, 20 And it causes the entire table to be copied into tmp, causing timeouts as the table contains roughly ~40k rows and takes a lot of time to process, whereas dividing it into 2 queries
SELECT b.id_bill
FROM {prefix}bills AS b
WHERE cancelled = 0
AND 1=1
ORDER BY b.id_bill DESC
LIMIT 0, 20;
SELECT <columns>
FROM bhd_bills AS b
INNER JOIN bhd_accounts AS a ON (a.id_account = b.id_account)
INNER JOIN bhd_users AS u ON (u.id_user = b.id_raised)
LEFT JOIN bhd_accounts AS ua ON (ua.id_user = u.id_user)
WHERE id_bill IN (<result from above>) It comes insanely fast. Can anyone explain why?
-
Could you stick EXPLAIN in front of the first query and let me know what the output of EXPLAIN <query> says?
-
Not sure what to make of this
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: b
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 41438
Extra: Using where; Using temporary; Using filesort
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: u
type: ALL
possible_keys: PRIMARY
key: NULL
key_len: NULL
ref: NULL
rows: 5
Extra: Using where; Using join buffer
*************************** 3. row ***************************
id: 1
select_type: SIMPLE
table: ua
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 4235
Extra:
*************************** 4. row ***************************
id: 1
select_type: SIMPLE
table: a
type: eq_ref
possible_keys: PRIMARY
key: PRIMARY
key_len: 4
ref: bhd.b.id_account
rows: 1
Extra:
4 rows in set (0.00 sec)
-
EXPLAIN is one of those things that make even less sense than a complex regex to me :P
-
OK, so let's step through it.
The first row indicates that the first clause found 41,438 rows. It matched those rows with the WHERE, no index was used for it, it did the filtering work in a temporary table, and that temporary table was dumped into a file to sort through it.
The second row indicates it's working on the u table, and that it was using a WHERE clause again, but this time the join buffer was involved - it's using the primary key of u table to filter/find rows. That's good, and it turned up 5 relevant rows quickly.
The third row indicates that it's working on the ua table, but that it just worked on the entire table through a table scan, and apparently returned the entire table as results.
The fourth row indicates it's working on the a table, of any possible indexes it could use, it was using the primary key's index, referring back to the other table, and returned a single row.
The primary reason for the performance drain is the filesort. When you break it down into two queries, you're doing ordering still, but this time you're ordering only integers - the result can be done totally in memory, doubly so because it's a fixed-width row to be working on, and the ordering therein is actually able to be done as you go along.
@Nao, it sort of is a bit weird but I might have to lend you my copy of High Performance MySQL ;)
-
I see that makes more sense now, but I still don't get one thing. I'm using the same clauses and sorting in both cases, but why is it doing filesort in first case but not in second?
-
Because you're not doing the same thing in both cases.
In the original query, you're pulling down a lot of columns, of which at least one is going to be a text column. Sorting on that always pushes it to disk, for the *entire* data set.
In the two-query set, you're pulling down a single numeric column, sorting that, and instead of applying successive filters to the data and ordering it all as you go - on disk, rather than in RAM - you're getting to cut most of the data you don't need out first, so you end up doing a lot less work on the data you do actually want.
-
Because you're not doing the same thing in both cases.
In the original query, you're pulling down a lot of columns, of which at least one is going to be a text column. Sorting on that always pushes it to disk, for the *entire* data set.
In the two-query set, you're pulling down a single numeric column, sorting that, and instead of applying successive filters to the data and ordering it all as you go - on disk, rather than in RAM - you're getting to cut most of the data you don't need out first, so you end up doing a lot less work on the data you do actually want.
Ohh, so SELECT is what actually causes the problems here? I see, that makes quite more sense. Thanks :)
-
It's not so much the select, it's what you're getting and how you're getting it and when you do what.
-
It's not so much the select, it's what you're getting and how you're getting it and when you do what.
Ah...I got it. Thanks :)