Naming Boards/blogs/sites etc.

Arantor

  • As powerful as possible, as complex as necessary.
  • Posts: 14,278
Re: Naming Boards/blogs/sites etc.
« Reply #45, on April 13th, 2011, 07:27 PM »
It sort of is, but it's also a drastic oversimplification.

Let me go back to YaBBSE / SMF 1.0 for a moment. parse_bbc was a very different beast then, and it was based on regular expressions, which under some conditions could be an order of magnitude faster than the current incarnation.

At the same time, it was also possible to brutally hurt the server with specially crafted posts that forced the regular expression to be evaluated in the slowest possible way, which under the right conditions could easily be an order of magnitude slower than the current parse_bbc processing.
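
To make the "specially crafted post" problem concrete, here's a rough sketch in Python. The pattern is the textbook nested-quantifier example, not the old parse_bbc expression (which isn't quoted in this thread), and the function name is made up for illustration. The input is a run of 'a' characters ending in a 'b', so the match has to fail, and a backtracking engine tries a rapidly growing number of ways to split the run before giving up.

```python
import re
import time

# Textbook nested-quantifier pattern. This is an assumption for illustration,
# NOT the regex that the old parse_bbc used.
PATTERN = re.compile(r"(a+)+$")


def time_failed_match(n):
    """Return (matched, seconds) for n 'a' characters followed by one 'b'.

    The trailing 'b' guarantees failure, so a backtracking engine has to
    exhaust every way of splitting the run of 'a's between the two '+'.
    """
    text = "a" * n + "b"
    start = time.perf_counter()
    matched = PATTERN.match(text) is not None
    return matched, time.perf_counter() - start


if __name__ == "__main__":
    # Keep n small: the work grows very quickly with each extra character.
    for n in (10, 14, 18):
        matched, seconds = time_failed_match(n)
        print(n, matched, f"{seconds:.4f}s")
```

Timings will vary by machine and engine, so none are quoted here; the point is only that a short input can cost far more than its length suggests.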

That's why Unknown gutted it and rewrote it for 1.1. (Maybe it was for 1.0 replacing YaBBSE's. I can't remember, but certainly at some point prior to 1.1 final it was moved from regexp based to the current incarnation.)

parse_bbc now is slow. But you can't tie it in knots and generate pathological-case processing; it won't run into log time most of the time, and even with the worst case it won't typically be worse than linear time.

So, with that in mind, consider then that you can throw anything at parse_bbc and it should work as intended.

Doing something with it structurally is something I've been pondering for a while, and I keep coming back to this sticking point of its style of regurgitation - it doesn't ever take the content and filter crap out, it starts from sanitised content and specifically enables it. There's no good way I can see of approaching that without compromising security somewhere, and so it becomes a judgement call as to how far is 'far enough'.
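
As a rough illustration of that "start from sanitised content and specifically enable things" style, here's a minimal Python sketch. The tag table and the function name are hypothetical, and it deliberately ignores pairing, nesting and parameters, all of which the real parse_bbc has to handle.

```python
import html

# Hypothetical whitelist: BBC tag name -> (opening HTML, closing HTML).
ALLOWED_TAGS = {
    "b": ("<strong>", "</strong>"),
    "i": ("<em>", "</em>"),
    "u": ("<u>", "</u>"),
}


def render_post(raw):
    """Escape everything first, then switch on only the whitelisted tags."""
    # Step 1: neutralise all markup, so nothing the user typed is live HTML.
    safe = html.escape(raw, quote=True)
    # Step 2: enable exactly the tags we know about; anything else stays inert.
    for name, (open_html, close_html) in ALLOWED_TAGS.items():
        safe = safe.replace(f"[{name}]", open_html)
        safe = safe.replace(f"[/{name}]", close_html)
    return safe
```

Anything not in the table, whether it's raw HTML or an unknown tag, simply comes out as escaped text. That's the property that is hard to keep if the approach is flipped around to filtering bad content out.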

AngelinaBelle

  • Still thinking...
  • Posts: 92
I'm an SMF doc writer.

Arantor

  • As powerful as possible, as complex as necessary.
  • Posts: 14,278
Re: Naming Boards/blogs/sites etc.
« Reply #47, on April 13th, 2011, 07:34 PM »
Unfortunately, yes, for now.

I do have some ideas on it, though - one of Karl's ideas was to be able to provide a 'scope' for the purposes of processing, and that could be used to indicate the type of parsing to do, which might be useful to implement sometime...

AngelinaBelle

  • Still thinking...
  • Posts: 92
Re: Naming Boards/blogs/sites etc.
« Reply #48, on April 13th, 2011, 07:40 PM »
Thanks for trying all these ideas out, Nao and Arantor. Maybe the entire SMF community can profit from what you learn.

Nao

  • Dadman with a boy
  • Posts: 16,079
Re: Naming Boards/blogs/sites etc.
« Reply #49, on April 14th, 2011, 10:13 AM »
Re: parse_bbc performance.

- Maybe we should use the opportunity of a full import to do some sanitizing work on posts that weren't sanitized in the first place (e.g. posted through a non-compliant mod, things like that.) That way, we could remove some of the legacy code from parse_bbc. Hopefully.

- For those who aren't in the private area and can't follow my little adventures... One of the features I added was a complex block replacement system that supports optional params. I'd crafted this beautiful regular expression that worked great. Then it crashed a few pages, so I rewrote it for performance (with inspiration from my 'goddess of all regex'[1] and that really odd atomic grouping thing). It became extremely responsive and I couldn't get it to crash anywhere. Was satisfied with it. Only, I wasn't sure it wouldn't crash somewhere else (although my experience with the 'goddess' tends to confirm that it wouldn't...), and I was curious about something: the performance of a complex but very optimized regex against the use of pure non-regex PHP. So I started working on a small function that would do the same work as the regex.

Quite surprisingly, it only took me 20 minutes to get it working, it only took a few lines of code, and it was fucking fast. The final version is about 10% faster than the regex version, and the more blocks you replace, the faster it gets -- once you reach 300 blocks, it becomes a strain on Wedge in the regex version while the pure PHP version still keeps flying (about 5 times faster.)

This is only meant to explain that I used to think regexes were the ultimate solution for complex performance-aware searches/replacements. I'm not so sure about that anymore. Actually, I'm starting to wonder if I'm not going to turn this pure PHP version into a regular multi-purpose function with associated callbacks, that will happily handle any kind of tag, be it HTML, BBC or Wedge block tags (<we:block>). Maybe then, we can start doing tests in parse_bbc and see if it can be done faster.
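
For anyone wondering what a regex-free, callback-driven replacer could look like, here's a small sketch. It's written in Python rather than PHP, it isn't the Wedge function described above (that code isn't shown in this thread), and the name replace_tag is made up. It handles flat, non-nested tags only and has no support for optional params.

```python
def replace_tag(text, open_tag, close_tag, callback):
    """Replace every open_tag...close_tag block with callback(inner text).

    Plain string searching only, no regular expressions. The scan only ever
    moves forward, so the work stays roughly proportional to the text length.
    """
    out = []
    pos = 0
    while True:
        start = text.find(open_tag, pos)
        if start == -1:
            break
        body_start = start + len(open_tag)
        end = text.find(close_tag, body_start)
        if end == -1:
            break  # unclosed tag: leave the remainder untouched
        out.append(text[pos:start])                   # text before the tag
        out.append(callback(text[body_start:end]))    # transformed contents
        pos = end + len(close_tag)
    out.append(text[pos:])                            # whatever is left over
    return "".join(out)
```

The same function can be pointed at different tag families just by changing the delimiters, e.g. `replace_tag(post, "[b]", "[/b]", lambda s: "<strong>" + s + "</strong>")` or `replace_tag(page, "<we:block>", "</we:block>", str.upper)`.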

One of the things we should sanitize in posts is tag case. I mean, I don't see why we should accept [Html] tags when it should be 'html', in lowercase. Case insensitivity is one of the things that regular expressions are great at, as Pete can confirm. My pure PHP version is case sensitive. Had I added support for case insensitivity (and I did at some point), it would have been twice as slow. Which is okay when you have 300 blocks to deal with (it's still 2.5x faster than the regex version), but it becomes 80% slower than the regex version when using only a dozen blocks.
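
The idea of paying the case-insensitivity cost once, at import or save time, so that the display-time parser can stay case sensitive, could be sketched like this. It's Python, the tag list is hypothetical, and using a regex here is only acceptable because it's a one-off pass rather than something run on every page view.

```python
import re

# Hypothetical set of known tag names; a real board would use its own list.
KNOWN_TAGS = {"b", "i", "u", "html", "quote", "url"}

# "[" or "[/" followed by letters, and then "]", "=" or a space.
TAG_NAME = re.compile(r"\[(/?)([A-Za-z]+)(?=[\]= ])")


def normalise_tag_case(text):
    """Lowercase the names of known tags; leave everything else alone."""
    def fix(match):
        name = match.group(2).lower()
        if name in KNOWN_TAGS:
            return "[" + match.group(1) + name
        return match.group(0)  # not a known tag, e.g. "[Note]" in plain prose
    return TAG_NAME.sub(fix, text)
```

Only the tag name is touched, so parameters such as the address in a [URL=...] tag keep their original case.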

So... These are a few of the solutions I could offer to try.
 1. This is a special regular expression I wrote for Aeva Lite back in 2008. It fixed a crash bug that had plagued AEVAC for months, and that pretty much drove Karl Benson to give up on the mod. As far as I can remember, getting that regex to work was a defining moment for me -- that's when I accepted the idea that I had become capable of handling SMF mods.