Print Page - BBC whitespace trimming

Public area => Bug reports => The Pub => Archived fixes => Topic started by: Nao on May 26th, 2012, 10:46 AM

Title: BBC whitespace trimming
Post by: Nao on May 26th, 2012, 10:46 AM

So, some time ago I changed quote tags to use whitespace trimming, so that I could precisely style the space between quotes and surrounding text.

Turned out that it didn't work very well.
The first problem was with margins. I used margin collapsing to allow for an opener quote not to take any useless space at the top of a post, but if you quoted that, it would add the extra space inside the new quote. I fixed it this morning by adding an extra div directly above the blockquote tag. I fought long and hard against that (even trying for pseudo-elements to fill in the gap...), until I realized I'd provisioned for an empty <footer> tag in the markup. So I simply removed the footer tag (which can very well be 'emulated' through pseudo-elements, like .bbc_quote:after, well, maybe not, but it can always be re-inserted by a plugin through a SQL query), and replaced them with the div tags. 'div' is shorter than 'footer', and it's acceptable. (Heck, I'm even considering turning <div class="bbc_quote"> into <section class="bbc_quote"> because of the header, but whatever...)

So, problem solved.
Now, there's an extra problem. The trimming code.
Let's say you have some text, then a few empty lines, then a quote, then a few empty lines, then some text.
Wedge (and SMF, I guess) will only remove the TRAILING empty lines. Any empty lines BEFORE the quote tag are left as is.
I was wondering why, and it turned out that... Well, it seems that these are never even looked against...?!
So, quotes are set to trim 'outside', to me that means "before and after the tag"...
But it seems that the two 'outside' trimming code blocks are both about 'anything coming after the tag'.
And if anything -- the regexps both start with '^' and have no closer '$', which doesn mean "look for whitespace at the beginning of the string...", when we should have at some point a test against whitespace soming before the tag, i.e. a test against substr($message, 0, $pos) (with $pos representing the start of a tag, and a regexp ending with $), instead of substr($message, $pos), with $pos representing the end of the tag and the regexp starting with ^, like it is now...

Do you think it's an oversight? In SMF, 'outside' is only used on 'td' and 'li'. I can understand that one would want to only test for whitespace once because if anything, there's little chance of having any text between two td or li tags, so if you test for trailing whitespace, you don't need to test for leading whitespace on the next tag because it's likely to have been removed already... Yet, it isn't working that way in my opinion.

Maybe I should add support for a new trimming codeword that would cover both directions... I don't know.

Anyone here knows about everything I've been talking about...?

Title: Re: BBC whitespace trimming
Post by: Nao on May 26th, 2012, 07:40 PM

Oh well, I knew this wouldn't generate any comments... ::)

Anyway, I'm currently working on this. Trying to optimize parse_bbc() as well -- the folly!!
Heck, I've even created a static array, $strlower, which holds a lowercase version of the ASCII table, just to optimize ONE call to strtolower() that gets executed on each new tag found... It's probably a bit TOO silly -- although this call is now a whopping 3 times faster... Uh.

I'm also considering something else that could be a much better improvement.
Say we do, at SMF import time, a conversion of all tags to lowercase...
Now, in preparsecode(), we convert all tags to lowercase just the same.

See what I mean...? parse_bbc() suddenly no longer has to care for strtolower or stripos. And believe me, stripos is used quite a lot of time, and I know from experience that it's nearly as slow as doing a strpos(strtolower())...

What do you think?

Posted: May 26th, 2012, 07:33 PM

Also... Maybe we could store strlen($tag) inside the database...? It would save us a call to strlen() for each and every possible tag being tested against... (Heck, it would probably even be faster to just calculate and store the strlen in the bbcode cache at the beginning of the function... Because, let's say you have dozens of "tt" tags in your post: for each of them, it'll be calculating strlen('table'), strlen('td'), strlen('time'), strlen('tr') and strlen('tt')... Funny hey?)

Title: Re: BBC whitespace trimming
Post by: Arantor on May 26th, 2012, 07:41 PM

I see what you mean, but I'm bothered by the fact we'd have to convert only known tags - any unknown tags would have to be left alone in case the user intentionally did that.

E.g. a post with things for sale:

Quote

Item 1 [SOLD]

You'd want that left alone.

(I had the tab open from this morning, then had to go do shopping and stuff)

I only know vaguely about the way it works or is supposed to work, it's not something I spent long looking at :(

I do agree with storing strlen, though, that would likely be advantageous.

Title: Re: BBC whitespace trimming
Post by: Nao on May 26th, 2012, 07:57 PM

Quote from Arantor on May 26th, 2012, 07:41 PM

I see what you mean, but I'm bothered by the fact we'd have to convert only known tags - any unknown tags would have to be left alone in case the user intentionally did that.

I guess so. But then again -- I suppose we can live with it...?!
I mean, do you know of many users who manually type their tags, and when they do, they're not geeks and thus don't follow the common use of lowercase for tags...?

Quote

You'd want that left alone.

Yep... (Didn't think of that, really.)

(I had the tab open from this morning, then had to go do shopping and stuff)

I only know vaguely about the way it works or is supposed to work, it's not something I spent long looking at :(

Quote

I do agree with storing strlen, though, that would likely be advantageous.

I did some tests... That piece of code is never called much, in the end. Or maybe that's because there are only 56 tags and only a handful of them are spread over the same repeated starting letter (T, mostly, as I mentioned, or L, etc.)... I actually did the internal database conversion on my local install, added the numbers through a query (update bbcode set len=length(tag), was easy enough...), and tested: most of the time the strlen() version is faster, ahah... Well, it mostly means that it's not where the code is slowed down.

Heck, actually I made some benchmarking over the entire parse_bbc() code, and I nearly always get very high performance (e.g. a few milliseconds per post, even with multiple tags in them)...
I don't really know what's slowing it down, then.
Maybe it's a waste of time...

Title: Re: BBC whitespace trimming
Post by: Arantor on May 26th, 2012, 08:11 PM

Quote

I mean, do you know of many users who manually type their tags, and when they do, they're not geeks and thus don't follow the common use of lowercase for tags...?

Don't forget places like PhotoBucket that provide 'embed code', only for it to be in uppercase.

The reason, ultimately, that parse_bbc is slow is not because of lacking micro-optimisations but because of its overall methodology, where it steps through the content tag by tag. Posts that aren't tag heavy will be quick, posts that have lots and lots of tags will be much higher proportionately.

Title: Re: BBC whitespace trimming
Post by: Nao on May 26th, 2012, 09:30 PM

Do you have a sample post available that is known to be a horrible speed offender? That'd be great...

Title: Re: BBC whitespace trimming
Post by: Arantor on May 26th, 2012, 09:40 PM

Not offhand, but any post with a lot of tags should be slow, especially if they're heavily nested. Large lists for example.

Title: Re: BBC whitespace trimming
Post by: Nao on May 26th, 2012, 09:44 PM

I did a few tests earlier and the only offenders were the Aeva and media tags actually...

Posted: May 26th, 2012, 09:43 PM

Shall I keep the strlen db cache then btw?

Title: Re: BBC whitespace trimming
Post by: Arantor on May 26th, 2012, 09:54 PM

Aeva and media have specific reasons why they make a significant difference. For general parsing, it's just a case of having very long posts with lots and lots of tags.

If the strlen changes make it faster, I'm all for keeping it. Just nudge me to make sure the bbcode support in the plugin manager does it too.

Title: Re: BBC whitespace trimming
Post by: Nao on May 26th, 2012, 10:35 PM

So... I made this post with several list items and a six-cell table, an image and a raw URL, and got it parsed in less than 2 milliseconds... It's pretty much weird that there are performance issues at all?!

Also, a bug! Another...
When you choose to add a link, it shows the img tag and the URL in between. Once you submit, it gets submitted as this:

Code: [Select]

[img]http://[url]http://wedge.org/welog.png[/url][/img]

Ahhhhh... Again.

(http://[url]http://wedge.org/welog.png[/url])

Title: Re: BBC whitespace trimming
Post by: Arantor on May 26th, 2012, 10:56 PM

Quote

So... I made this post with several list items and a six-cell table, an image and a raw URL, and got it parsed in less than 2 milliseconds... It's pretty much weird that there are performance issues at all?!

In the scheme of things, however, it soon adds up (and that's still a 'simple' post really), it can easily account for 50% of a single page's execution. Push it through a profiler.

Also, odd :/

Title: Re: BBC whitespace trimming
Post by: Nao on May 26th, 2012, 11:58 PM

The img problem was due to Aeva, once again...
I added img to the exception list.
I'm guessing this will require a rewrite, still. I don't like the idea of this happening all too often....

I'm actually wondering whether I couldn't simply move all embedding code to parse_bbc(). Eh, why not... Footnotes are already there, after all. Why not the rest. Unless of course I forgot about a good reason to have it outside.

Re: performance, I made more tests, and I still can't get a full page to use more than 10% of the overall resources... (With a full page of regular posts + a few more complex ones.)
I really don't know what's hapenning. Maybe it's not in parse_bbc() itself...?!

Title: Re: BBC whitespace trimming
Post by: Arantor on May 27th, 2012, 12:11 AM

Hrm, interesting. I know that there were already slight changes made (e.g. about how the tag list was built)

I'm just surprised that you're not seeing as much load.