Print Page - Fixing mismatched BBCode

Public area => The Pub => Features => Topic started by: Nao on December 4th, 2011, 11:59 AM

Title: Fixing mismatched BBCode
Post by: Nao on December 4th, 2011, 11:59 AM

The code in Class-Editor that straightens out missing code tags is quite odd to look at... Some variables are initialized and never actually used, etc. I looked into it and decided to rewrite it. As a result, I did pretty much the same thing -- in only three lines of code instead of 10.

Now, the thing is that the I'm using a $num_tags variable instead, which is incremented or decremented based on whether it meets an opening or closing tag. Although technically it works (if you have two opened code tags, it will addd two closers), in reality it doesn't work as expected. Two examples:

- if you first use a closer, then an opener, $num_tags will be set to zero and nothing will be fixed. The code needs to be smarter than that.
- if you use two openers and a closer, Wedge will add an extra closer... Except that code tags within other code tags are treated as code (not tags), so they don't need a closer. It's always been a mess in SMF because if you add two closer tags here, the second one is shown as plain text, instead of the first one. Here, I'd tend to say this is a SMF bug that needs to be fixed in Wedge... What do you think?

Title: Re: Fixing mismatched BBCode
Post by: Arantor on December 4th, 2011, 12:57 PM

What needs to happen is that it needs to understand nesting hierarchy properly, not just for code tags (which will solve other issues of stupidity)

As an exercise, modify the post in the DB or hand it directly to parse_bbc to see what that does. That actually keeps track of hierarchy to a point.

Code tags inside code tags won't (and can't, by definition) ever nest properly. The innermost closer will always be hit first, ending the tag, so what comes after it will always be wrong as it will be the remainder of the code tag outside the original code tag. And it cannot and should not be assumed that a code tag nested inside a code tag will have an inner closer either.

Title: Re: Fixing mismatched BBCode
Post by: Nao on December 4th, 2011, 11:13 PM

Yeah, it's all a bit complicated... If in SMF you type /code then code, it will simply add the closing code at the end, and leave the first /code as is.

Here's my original rewrite:

Code: [Select]

		if (preg_match_all('~\[(/)?code[^]]*\]~i', $message, $matches))
			foreach ($matches[1] as $tag)
				$num_tags += (empty($tag) ? 1 : -1);

		// If we have open tags, close them.
		while ($num_tags > 0 && $num_tags--)
			$message .= '[/ code]';

		// Open any ones that need to be open, only if we've never had a tag.
		while ($num_tags < 0 && $num_tags++)
			$message = '[code]' . $message;

I think it's quite elegant, but it has the problems I mentioned above.
Now... Why exactly is it so important to have matching tags? AFAIK, because the parser doesn't do recursive regex magic, it doesn't really care whether there are mismatched tags. In fact, it's even worthless to add a code tag before the post is there's an extra /code overall: because the /code is probably a mistake on the user's part, and I'm sure they'd rather see an untransformed post, rather than everything turned to a large code block for no reason whatsoever (in their opinion)...

So, technically, there is not even a strong reason to have the last 3 lines in the code above. It won't change a thing...
Heck, could probably even get rid of it all... :-/

Title: Re: Fixing mismatched BBCode
Post by: Arantor on December 4th, 2011, 11:17 PM

Quote

Why exactly is it so important to have matching tags?

Generally, it's not a huge deal because the parser will close any tags at the end (so that the tag endings will be output correctly) but for code tags, processing is quite a bit different in the bowels of the bbc parser, because the content is actually split by code tag, and the loop can safely deal with extra /code closers but it'll get very upset if there's more openers than closers.

Title: Re: Fixing mismatched BBCode
Post by: Nao on December 4th, 2011, 11:44 PM

So... How about we forget about closers, and only deal with fixing openers?

Title: Re: Fixing mismatched BBCode
Post by: Arantor on December 4th, 2011, 11:53 PM

The parser logic is smart enough to cope with closers (even respecting block level vs non block level tags to a point) - except for code closers. What do you have in mind as far as fixing openers goes?

Title: Re: Fixing mismatched BBCode
Post by: Nao on December 5th, 2011, 12:38 AM

So, do you mean that the code I posted above would be 'perfect' for Wedge?

Title: Re: Fixing mismatched BBCode
Post by: Arantor on December 5th, 2011, 12:54 AM

Where does that code run, exactly?

(Haven't looked at the commit yet)

Title: Re: Fixing mismatched BBCode
Post by: Nao on December 5th, 2011, 07:21 AM

Search for $in_tag in class-editor?

Title: Re: Fixing mismatched BBCode
Post by: Arantor on December 5th, 2011, 09:45 AM

Hmm, that's just before the splitter code runs, interesting.

I don't think you can cause any kind of vulnerability that easily with it but you can cause all kinds of mess with it if not careful, because the splitter explicitly relies on having matched tags when it does the split.

Title: Re: Fixing mismatched BBCode
Post by: Nao on December 5th, 2011, 10:27 AM

I'm actually considering rewriting the splitter to use the strpos (with offset) technique I used a few months ago... It's probably going to be faster, and more solid.

Title: Re: Fixing mismatched BBCode
Post by: Arantor on December 5th, 2011, 10:31 AM

*nods* Sounds like a plan to me :)

Though at some point the preparser needs to be gutted to actually examine the post with respect to the tag's definitions, rather than relying on its own defined rules (so that table->tr->td isn't defined through some esoteric regexp but because that's how the bbc rules define it - which means it becomes possible to add attributes, or even, dare I suggest a th tag)

Title: Re: Fixing mismatched BBCode
Post by: Nao on December 5th, 2011, 11:48 AM

But that would mean a fuller rewrite of preparsecode, which I'm not ready/willing to do... ;)

So, I'm looking at the code tag stuff and all I see is that it's mainly used to deal with special tags only when they're outside code tags. It'd be more solid to actually search for [ code] from the current offset, make a string out of it, deal with it, then search for the next [ /code] (if not found, add [ /code] at the end of the post), set it as the new offset, and loop.

PS: long PMs eh...! Will answer later today.

Title: Re: Fixing mismatched BBCode
Post by: Arantor on December 5th, 2011, 12:23 PM

Quote

But that would mean a fuller rewrite of preparsecode, which I'm not ready/willing to do...

I know, and same here. But that is the eventual goal.

Quote

So, I'm looking at the code tag stuff and all I see is that it's mainly used to deal with special tags only when they're outside code tags.

Well, yes and no. The main things it's used for are to make *sure* that url and img tags are legal. It does also serve to carry out certain other exception handling that won't be applicable inside code tags.

Basically, it's the bit that needs overhauling anyway, so provided that the bits between tags remain accessible for manipulating (bearing in mind that they could be rewritten in the process) I'm cool with overhauling it with strpos.

Title: Re: Fixing mismatched BBCode
Post by: Nao on December 5th, 2011, 07:21 PM

My new rewrite is a bit messy and longer. What do you think about this: I'm parsing the message to remove any misplaced code tags. Eg two opening tags? The second one is removed. Same for closing tags. Etc. Not in my pc so I won't elaborate. This of course makes it impossible to type in a code tag inside another but who cares.

Title: Re: Fixing mismatched BBCode
Post by: Arantor on December 5th, 2011, 07:37 PM

I can see the logic of that but I can also see the desire to use code tags inside a code tag... Though that's not practical or really feasible since proper full nesting can't be done anyway. Sounds like a plan to me really.

Title: Re: Fixing mismatched BBCode
Post by: Nao on December 5th, 2011, 08:12 PM

I'm guessing that if you're gonna use the code tag inside another, you'll want to add a matching closer, and since this one'll break the whole block, you'll add a space somewhere to break the closer instead... So might as well wanna break the opener as well with a space. Anyway...

Here's the code, for those interested. It seems to work, at least on my test cases. Maybe you can figure out a faster/shorter way to do it ;)

Code: [Select]

// Find all code blocks, and ensure they follow the [ code][ /code] syntax closely.
if (preg_match_all('~\[(/?)code[^]]*]~i', $message, $matches))
{
	$pos = 0;
	$tags = array();
	foreach ($matches[0] as $id => $tag)
	{
		$tag_pos = strpos($message, $tag, $pos);
		$length = strlen($tag);
		$tags[] = array($tag_pos, $length, $matches[1][$id] === '/');
		$pos = $tag_pos + $length;
	}
	$was_closed = true;
	$offset = 0;
	foreach ($tags as $tag)
	{
		if ($was_closed === $tag[2]) // consecutive closing tags, or closing tag at the beginning?
		{
			$message = substr($message, 0, $tag[0] + $offset) . substr($message, $tag[0] + $tag[1] + $offset);
			$offset -= $tag[1];
		}
		$was_closed = $tag[2];
	}
	if (!$tag[2]) // no final closing tag?
		$message .= '[ /code]';
}

(Of course, the irony is that I had to add a space to the tags... :whistle:)

Title: Re: Fixing mismatched BBCode
Post by: Arantor on December 5th, 2011, 08:36 PM

And you'd have to otherwise, no?

Looks good to me at a first glance.

Title: Re: Fixing mismatched BBCode
Post by: Nao on December 6th, 2011, 08:24 AM

AND.... Interestingly, the same code can be reused for proper quote de-nesting :)

Title: Re: Fixing mismatched BBCode
Post by: Nao on December 6th, 2011, 12:21 PM

Quote from Nao on December 6th, 2011, 08:24 AM

AND.... Interestingly, the same code can be reused for proper quote de-nesting :)

Oh, not that much actually... Because we'd need to remove the contents as well... And improperly nested quotes would then be a nightmare.

Which makes me think... I've often seen cases of people (including me, TBH) surrounding quotes with two closers instead of an opener and a closer. I'm trying to figure out a way to fix these automatically, but honestly it's hard to tell... :-/

Title: Re: Fixing mismatched BBCode
Post by: PantsManUK on December 6th, 2011, 12:49 PM

Quote from Nao on December 6th, 2011, 12:21 PM

Which makes me think... I've often seen cases of people (including me, TBH) surrounding quotes with two closers instead of an opener and a closer. I'm trying to figure out a way to fix these automatically, but honestly it's hard to tell... :-/

Top of my head, thinking out loud... Count openers and closers, if they match Success!, if they don't, Fail!. Haven't thought beyond that point... leave it with me... :eheh:

UPDATE: If they don't match, check that the closer count if an odd number, find the first closer *after* the last valid closer and make it an opener, rinse, repeat...

Title: Re: Fixing mismatched BBCode
Post by: Nao on December 6th, 2011, 07:11 PM

Actually, that's a very good suggestion you know...?
I probably wouldn't 'look' for the exact position of the offender, if only because that would mean writing a UI for it (:P) when I can just use SMF's post error box, but it's definitely something of a good idea: we could go through the list of bbcodes that expect a closer tag, and make sure they're there. If not, post an error message asking the user to fix it...
I guess for long posts it's going to be a pain to fix, but OTOH it's only going to be better because if a quote tag is broken and you don't fix it, you're upsetting all of your users. Better one person than everyone...! ;)

:edit: Go through all bbcodes except for those of the 'closed' type... Simple!

Title: Re: Fixing mismatched BBCode
Post by: Nao on December 6th, 2011, 10:33 PM

Woohoo, this is messy...

Okay, I've finished implementing the checks. It works flawlessly :) Just needs to be optimized... Like, adding a cache for tags.
Unfortunately, Wedge/SMF isn't cooperating. In the sense that it doesn't allow posts to show more than one error of the same type... I'd *love* to know the logic behind that!

Here's what it does:
- Post2 generates an error. Later on, it will fill in $context['post_error']['messages'] with the errors, and $context['post_error'] with the error types.
- Then it calls Post() in Post to regenerate the original form.
- Post(), in turn, EMPTIES $context['post_error']['messages'][1] and attempts to regenerate it based on the error types...
Which means that not only do we lose any additional errors of the same type (it only generates only error message for each type), but we also lose any customizations done in Post2... That's ridiculous.

Now, logic dictates that I 'simply' get rid of the current Post() code and instead use the same code as in Post2. But the code is really shitty in Post()... I mean, there's only one possible error set BEFORE we re-parse 'post_error'. The only other reason to re-parse it, is because Wedge has detected we're previewing or editing a post, i.e. maybe we're coming from Post2... Look at the way the 'no_name' error is handled. In Post2, it's a string added to the list of errors. Then it's handled within the error parser. Then we go back to Post(), where this isn't done until we reach the error parser. At this point, the loop is actually doing the guest tests AGAIN, and adding 'no_name' directly to the list of parsed errors (instead of errors to parse), and... Oh well, it's hard to explain.
Suffice to say, it's really, really really fucked up. I'm so annoying, I'm actually going to let it go for tonight. And I really have no idea how to best handle all of this...
Pete? Any suggestions?

1.	Actually, it starts by emptying $context['post_errors'] (note the plural) which is never used, meaning it's actually a typo...

Title: Re: Fixing mismatched BBCode
Post by: Arantor on December 6th, 2011, 10:38 PM

Well, the logic behind it is that you should only ever be receiving one instance of any given type of error. I can imagine any circumstance where - normally - you'd need multiple instances of the same message. Wouldn't you just format an otherwise generic message to include all the problem cases?

Remember that the post code - in that context - can be called from both Post2 and directly, and quite possibly other places that don't occur to me right now.

I'm not arguing that it's fucked up, but I'm not sure it's as fucked up as it might be.

Title: Re: Fixing mismatched BBCode
Post by: Nao on December 6th, 2011, 10:48 PM

Look at Post2 -- if errors are found, we're ALWAYS redirected to Post().
The messages array is filled in Post2, and ALWAYS reset in Post().

Conclusion: the messages array is filled for no reason in Post2() when it should be done in Post().

That's why I'm upset -- coz I'm gonna have to move my code to Post() tomorrow... -_-

Title: Re: Fixing mismatched BBCode
Post by: Arantor on December 6th, 2011, 10:55 PM

Oh, I see what you're getting at, misread it at first.

What I was saying is that IIRC there are circumstances where we end up at the post page, not at a creation step, and without having gone to Post2.

Also, Post2 is not the only place content updates are done (quick edit for example)

Title: Re: Fixing mismatched BBCode
Post by: Nao on December 6th, 2011, 10:58 PM

Quote from Arantor on December 6th, 2011, 10:55 PM

Oh, I see what you're getting at, misread it at first.

Mostly because my post was confusing, though...

Quote

Also, Post2 is not the only place content updates are done (quick edit for example)

When it comes to post errors, it shouldn't set them, anyway.

Title: Re: Fixing mismatched BBCode
Post by: Arantor on December 6th, 2011, 11:19 PM

Well, either it should set them and Post should use them, or only Post should be setting them.

Most of the point of Post2 doing anything like that is simply so that there will be something to force it back to Post, but if you're going to rip it out, make sure that there is some way not only to feed back to Post when there's a problem but that it can be extended by plugins too.

Title: Re: Fixing mismatched BBCode
Post by: Nao on December 7th, 2011, 07:46 PM

Quote from Nao on December 5th, 2011, 08:12 PM

Here's the code, for those interested. It seems to work, at least on my test cases. Maybe you can figure out a faster/shorter way to do it ;)

So... Now that Post2 does some proper checking on tags to find mismatched occurrences (and it works very well), do you think there's a point in keeping my improved mismatched code tag fixer in?

PRO: well, it'll make things a bit faster...?
CON: if adding code tags inside anything that does NOT go through Post2 (e.g. an AeMe comment, etc.), it won't work...

Obviously, for anything that does NOT do proper error checking on post contents, it's going to be harder to manage... It's not like we can easily plug these situations into Post2...

Title: Re: Fixing mismatched BBCode
Post by: Arantor on December 7th, 2011, 07:54 PM

Shouldn't things like that be in preparsecode anyway?

Title: Re: Fixing mismatched BBCode
Post by: Nao on December 7th, 2011, 09:03 PM

Preparsecode can't refuse a post and send back a list of errors afaik...

Title: Re: Fixing mismatched BBCode
Post by: Arantor on December 7th, 2011, 10:23 PM

No, it can't. But it can set a global variable to be acted upon.

In any case there's more stuff that doesn't go through Post2 than just Aeva comments. Quick modify for one.

Title: Re: Fixing mismatched BBCode
Post by: Nao on December 7th, 2011, 10:38 PM

Maybe we could have the test in an external function. If called from post2, return an error. Otherwise try to fix It by adding as many missing tags as required. Yay?

Title: Re: Fixing mismatched BBCode
Post by: Arantor on December 7th, 2011, 10:59 PM

Works for me :)

Purely academically, is the goal for fixing mismatched tags so that it can be removed from the bbc parser? (If so, we probably *should* also route imported posts through this code too)

Title: Re: Fixing mismatched BBCode
Post by: Nao on December 8th, 2011, 12:14 AM

Okay, I'll look into doing that tomorrow...

Still, complex errors won't be magically fixed, I'm afraid... Unless, unless I do a stricter check.
So while typing this post, it came to me that I could use a stack of tags and stack/unstack data and... Well, have a look at this code:

Code: [Select]

[/quote][quote author=Nao link=msg=1 date=1309111289]Lorem ipsum?[/quote]What is that?
I don't speak rubbish!
[nb]I'm wondering if Rory does, though? He had time to learn...[ /code]
[ code][ nb][ /code][ /nb]

And... Here's the error message I'm getting following my latest rewrite. Which is actually SHORTER than the last one ;) It's not perfect but I'm working on it eheh.

Pretty cool uh?

PS: and yes, it works with tag nesting too, since it's a stack... i.e. if I have properly nested 'b' tags inside the quote, everything's fine.

PPS: the main issue with fixing tags in the middle of a message is that I would then have to find the exact position of the tag... I guess it's doable, though, but I'll have to go through a series of strpos etc or something to fill in the list first, so it'll definitely make the code bigger.

Title: Re: Fixing mismatched BBCode
Post by: Arantor on December 8th, 2011, 12:39 AM

Oh, I think that's pretty awesome :)

Title: Re: Fixing mismatched BBCode
Post by: billy2 on December 8th, 2011, 08:48 AM

:wow:
Awesome indeed.
:)

Typo - 'show' should be 'shown'

Title: Re: Fixing mismatched BBCode
Post by: Nao on December 8th, 2011, 09:29 AM

Thanks, I didn't notice that one bit ;)

Okay, I'm in the process of moving the code to the wedit object (where it should have been from the beginning), and trying to fix the code automatically... So, considering the fact that the most important content (topic posts) is okay because we clearly specify the errors, fixing posts automatically shouldn't be too much of a hassle but obviously the code to actually fix them is going to be more complex...

So, let's say I have this:

Code: [Select]

[quote]post1[/quote]comment[/quote]post2[/quote]

I think it's safe to say that the poster added a / by mistake, and it should be removed, but I hardly see how Wedge is going to be able to spot it automatically without going into a great deal of large-scale testing. So I suppose we could do it this way...

Code: [Select]

[quote]post1[/quote][quote]comment[/quote][quote]post2[/quote]

i.e, the extra /quote knows there is no previous matching quote tag, so it simply looks for the *last* tag it found (or MAYBE the last closer tag of its kind, i.e quote?), and it adds an opener tag right after it. Then we mark the tag as fixed (i.e. we remove 'quote' from the stack of opened tags). So, we are in a situation where 'comment' is suddenly stuck into a quote tag. Continue as usual. Then we spot the other closer quote, and we do the same, i.e. add a closer tag after the last closer, so in this case the 'post2' bit is fixed.

The obvious problem with this solution is that our comment is now in a quote. But because we'll have three separate quotes in a row, it'll be *relatively* obvious (not captain-obvious obvious, but still), that the middle one is a reply to the previous quote. What do you think..?

I was thinking of other solutions, like checking whether a tag is something like '[/quote author=Nao]', which in this case would mean "it's an opening quote where the / was added by mistake" but I don't really think it's a realistic case.

Now for another test case...

Code: [Select]

[quote][b] [s] post1[/quote]

Using the pseudo-code from above, this would be really messy -- an opening quote would be added just before the closer, and then the s and b tags would remain opened until the end of the post, where they would then be closed forcibly.
So in this case, I think it's best that when we look through the latest opened tag in the stack, and it's not our closer, we simply add the related closer automatically and then keep going through the stack in reverse order, closing tags as required, until we find ours (or not). This is actually pretty much what my code is doing right now, as opposed to the pseudo-code in the first example.

Now, if we mix our two examples...

Code: [Select]

[quote]post1[/quote][b] [s] comment[/quote]post2[/quote]

The closer quote will trigger a search for the last closed tag, which in this case is another closer quote, BUT between them it will find two unclosed tags... Which gets confusing, so it's best to first close any opened tags, and THEN go through the search for a place to add a new quote opener.

It's all very 'amusing' because I have to maintain a parallel stack of tag positions and the code has currently jumped from 15 lines to 50, which caused me to write this post in the hope that it'll allow me to sort things out... :lol:

Anyway, opinions welcome...
Hey, perhaps someone has heard of some BSD/MIT code available online that precisely does just that -- a flawless fix of all BBC or HTML tags left opened or closed... :P

Title: Re: Fixing mismatched BBCode
Post by: Arantor on December 8th, 2011, 09:41 AM

Quote

Using the pseudo-code from above, this would be really messy -- an opening quote would be added just before the closer, and then the s and b tags would remain opened until the end of the post, where they would then be closed forcibly.

They shouldn't be. The bbc parser should actually close both the s and b tags, honouring proper hierarchy, when it gets to the end of quote.

Quote

~~post1~~

Like so, in fact. (And if you check the source, you'll see it's unmodified.) It works because there's the end of a block level tag with unresolved non block tags inside it. The exact behaviour is incredibly complicated, and is no doubt one of the reasons why the bbc parser is so big and scary - but it's also resilient. This is why I asked if the idea behind this was partly to reduce its complexity or not, because it actually does a lot of silent fixing that most people don't even realise.

The problem as you've discovered with writing such a solution is that unless you can get inside the user's brain and figure out what they meant, rather than what they typed, you have no hope of getting it consistently right. This is, incidentally, probably the one good example of where WYSIWYG actually works, because it allows people to see directly what they're using which typically means fewer mashed up tags.

Title: Re: Fixing mismatched BBCode
Post by: Nao on December 8th, 2011, 10:00 AM

Quote from Arantor on December 8th, 2011, 09:41 AM

They shouldn't be. The bbc parser should actually close both the s and b tags, honouring proper hierarchy, when it gets to the end of quote.
Quote
~~post1~~
Like so, in fact.

Allow me to quote myself:

Quote

it's best to first close any opened tags, and THEN go through the search for a place to add a new quote opener.

So, yes, I totally agree... It's just that my original pseudo-code doesn't account for that, and I knew I was in for a complicated day if I didn't start by clearly writing a few broken posts and figuring out what would be the most reliable way to fix them.

Quote

(And if you check the source, you'll see it's unmodified.) It works because there's the end of a block level tag with unresolved non block tags inside it. The exact behaviour is incredibly complicated, and is no doubt one of the reasons why the bbc parser is so big and scary - but it's also resilient.

But *once* it is redone through ::fixNesting, with my super-solid code etc, won't all of the remaining code become unnecessary all of a sudden...? :P

Quote

This is why I asked if the idea behind this was partly to reduce its complexity or not, because it actually does a lot of silent fixing that most people don't even realise.

The only fixing I've seen preparsecode do is add code tags at the beginning or end of a post which sucks a bit...
::parse_bbc does some fixing on its own, IIRC, but if it does, it's in the wrong place. It should be done at save time, obviously. (AND, we should remove any fixer code from ::parse_bbc to force modders to go through ::preparsecode. Believe me, I didn't even know this function existed before Shitiz used it in SMG, and I didn't have the *reflex* to use it systematically until, err.... Now?)

Quote

The problem as you've discovered with writing such a solution is that unless you can get inside the user's brain and figure out what they meant, rather than what they typed, you have no hope of getting it consistently right.

But I'm not writing an AI... Otherwise we won't release before 2015...
Things like, "Okay, this is a closing tag BUT it's at the beginning of a line *and* is immediately followed by content, so MAYBE it's an opener, let's try to turn it into an opener and see if it suddenly validates"... These things are doable, but they take time to implement.

(Well, that particular solution would still be a very good one to the first test case I posted.)

Title: Re: Fixing mismatched BBCode
Post by: Arantor on December 8th, 2011, 10:18 AM

Quote

But *once* it is redone through ::fixNesting, with my super-solid code etc, won't all of the remaining code become unnecessary all of a sudden...?

Does fixNesting deal with block tag nesting mismatches or just nesting mismatches? Also, once nesting and mismatches is fixed, we will also need to look at dependencies and must-contain/must-not-contain rules too, which are also specifiable in the bbc parser...

Quote

The only fixing I've seen preparsecode do is add code tags at the beginning or end of a post which sucks a bit...
::parse_bbc does some fixing on its own, IIRC, but if it does, it's in the wrong place. It should be done at save time, obviously. (AND, we should remove any fixer code from ::parse_bbc to force modders to go through ::preparsecode. Believe me, I didn't even know this function existed before Shitiz used it in SMG, and I didn't have the *reflex* to use it systematically until, err.... Now?)

preparsecode does a lot of things, actually. I think I listed what it did in a previous post, but as well as rearranging tags (albeit naively), it also fixes html bbc for non admin users, cleans up nobbc so they're safely unparsed later, attempts to validate the contents of url and img tags (which, to my mind, made the 1.1.11 update pointless, but that's just me, unless there's a deeper issue that needs resolution)

Part of the reason parse_bbc has it and not preparsecode is that posts added to the DB through other sources that won't have come through preparsecode originally, and that's not just for modders (for example, this will include the importer unless we push every post through some kind of fixer during the import)

Still, considering that the html bbcode will do bad things if inserted manually and not through preparsecode, that nobbc may or may not work properly and other stuff, I'm inclined to think that it's OK to remove some of this stuff from parse-bbc, and move it to the preparser, provided that the preparser is able to do *everything* that parse_bbc does, which as I've alluded to, is more than just ensuring tag nesting is sane, it also has rules on what tags can contain other tags (url can't contain another url for example), on what tags must be where (list requires 1+ li, li must be inside a list, and the entire table->tr->td, both parent/children get evaluated)

I always wanted to move that to the preparser anyway to remove this dependency on strange, naive regexps that didn't allow for customising the table tag or adding th tags, without rewriting all that stuff as well.

Title: Re: Fixing mismatched BBCode
Post by: Nao on December 8th, 2011, 06:53 PM

Quote from Arantor on December 8th, 2011, 10:18 AM

Does fixNesting deal with block tag nesting mismatches or just nesting mismatches?

What do you mean?
It supposedly fixes any mistmatched tags and that's all... It adds missing openers when it finds orphan closers, and it adds at the end closers to match orphan openers. And believe me, it's already hard enough to manage as it is... I've been on it all day, and it's still pissing me off right now. Granted, it's quite a complex string I'm working with (basically -- if it works with it, it'll work with everything), but right now my code is headache-inducing, and it starts failing after a few fixes... (hint: try not to insert data into an array you're *looping through at the moment*... I should probably reset the loop every time instead of trying to account for all changes...)

Quote

Also, once nesting and mismatches is fixed, we will also need to look at dependencies and must-contain/must-not-contain rules too, which are also specifiable in the bbc parser...

Well, I'm mostly looking at removing the 'light' fixers in preparsecode and parse_bbc, like the code tag fixer. And maybe the quote fixer... I think there's one around...

Quote

preparsecode does a lot of things, actually. I think I listed what it did in a previous post,

Yes, you did and I do remember it :P

Quote

Part of the reason parse_bbc has it and not preparsecode is that posts added to the DB through other sources that won't have come through preparsecode originally, and that's not just for modders (for example, this will include the importer unless we push every post through some kind of fixer during the import)

Well... I suppose we SHOULD call, maybe not preparsecode, but at least any functions that are called in Wedge and not in SMF, e.g. fixNesting.

Quote

I always wanted to move that to the preparser anyway to remove this dependency on strange, naive regexps that didn't allow for customising the table tag or adding th tags, without rewriting all that stuff as well.

Plus, anything to make parse_bbc faster... ;)

Title: Re: Fixing mismatched BBCode
Post by: Arantor on December 8th, 2011, 09:00 PM

What I mean is, the code that fixes the above example triggers not because of the mismatch of b/s/quote but because b and s are described as non block tags and quote is a block tag. When the quote tag ends, it looks for any non block tags that are currently open, in nesting order, and closes them. The exact rules are much more complex than I've indicated but it's what forces cases like lists to not be able to contain quotes and for it to safely end the list before the quote.

I also seem to recall that the x bbcode list item is also handled in a similar way, seeing how it is not handled by preparsecode but solely within the BBC parser, and in a way that's fragile. (Frankly I don't have a problem removing the one character list builder shortcuts because I don't know anyone that uses them and whenever I've seen them used, they always break unexpectedly)

The light fixers aren't major pieces of effort in parse_bbc but anything that goes does save effort on page loads generally too.

Title: Re: Fixing mismatched BBCode
Post by: Nao on December 8th, 2011, 10:22 PM

Yeah... Didn't think about this all...

Well, it's all pretty fucked up really. A waste of my time... -_-
Basically, it works 100%, until you meet some of the more complicated stuff I have in my test case. And I don't see how to fix it without, tadaaam... Another full rewrite!
If I do this... I'll probably have to, hmm... Do it recursively... Oh, I don't like that... :-/

Y'know, like, "if a tag is opened and is not self-closed, re-run the function on the string AFTER that tag, asking for it to return after it meets the closer tag..." And if it meets a closer that's not the one we're expecting, we'll just add our closer there (manually), and return from the function.

What annoys me the most is that every time, I get this very simple idea that ends up being flawed in one aspect or another... I just don't want to spend another day on that.

PS: any special tags that aren't in the list of double tags, like x or * or whatever, are not important here because Wedge will simply ignore them anyway.

PPS: my test case is as such. It chokes on the first [/nb] (which doesn't have a matching opener at this point because we already rewrote the first opener to add a closer to it.)

Code: [Select]

[/quote][quote author=Nao link=msg=1][b]
Lorem ipsum?
[/b][/quote]What is that?
I don't speak rubbish![nb]I'm wondering if Rory, though? He had time to learn...[ /code][ code][ nb][ /code][ /nb][quote]post1[/quote][b] [s] comment[/quote]post2[/quote]

Title: Re: Fixing mismatched BBCode
Post by: Arantor on December 8th, 2011, 11:04 PM

It's a complicated task at the very best of times :( And it shows in browsers too, when they have malformed tags to deal with, some ignore them totally, some make assumptions with interestingly unpredictable results. I remember having a conversation with a fellow geek back in 2000 which illustrates this perfectly: he was building a site with a big complex data-heavy table in it, and it worked perfectly in IE but broke horribly in Netscape. As I discovered... he wasn't putting any of the closing tags because 'it doesn't need them'. Well, it obviously does!

It's also a rabbit hole of a problem, in that no matter how clever you get, you can pretty much always find another example that will break it. The issue is where the line gets drawn.

This is also why I think we will need to adopt the logic used in the parser to unwind and reprocess the tag nesting, simply because it's a lot more than just having balanced tags. It's a pain to get right but the result will be worth it in the end.

FWIW, I've been trying to play with this today, not getting very far with it, just because I'm still trying to get my head around how the post parser really works.

What do you reckon about the special list of tags like x or *? Do we really need them? I actually think they cause more trouble than they're worth - and they're a perfect poster child of why this whole issue is a problem, since no-one seems to know how to actually safely end such a list.

Title: Re: Fixing mismatched BBCode
Post by: Nao on December 8th, 2011, 11:19 PM

Quote from Arantor on December 8th, 2011, 11:04 PM

It's a complicated task at the very best of times :( And it shows in browsers too, when they have malformed tags to deal with, some ignore them totally, some make assumptions with interestingly unpredictable results. I remember having a conversation with a fellow geek back in 2000 which illustrates this perfectly: he was building a site with a big complex data-heavy table in it, and it worked perfectly in IE but broke horribly in Netscape. As I discovered... he wasn't putting any of the closing tags because 'it doesn't need them'. Well, it obviously does!

With most of the browsers now using a unified parsing model (the HTML5 parser), it's no longer an issue. Unfortunately that pseudo-code is too complex for a 'simple' bbc fixer, so it's not an option to me...

Quote

It's also a rabbit hole of a problem, in that no matter how clever you get, you can pretty much always find another example that will break it. The issue is where the line gets drawn.

The problem with my current code (not the committed code, which is bug-free AFAIK but doesn't do a fantastic job at figuring out where to add tags), is that it doesn't *work* in a certain situation, and I can't think of a way to make it work correctly because my code assumes things that, in that situation, aren't always valid....

Quote

This is also why I think we will need to adopt the logic used in the parser to unwind and reprocess the tag nesting, simply because it's a lot more than just having balanced tags. It's a pain to get right but the result will be worth it in the end.

What do you mean unwind and reprocess?

Quote

What do you reckon about the special list of tags like x or *? Do we really need them? I actually think they cause more trouble than they're worth - and they're a perfect poster child of why this whole issue is a problem, since no-one seems to know how to actually safely end such a list.

Yeah... It'd be a candidate for deletion, I suppose.

SPIP actually supports turning opening dashes automatically into bullet points. At least it did back in 2003... Ah, good old times.

Title: Re: Fixing mismatched BBCode
Post by: Arantor on December 8th, 2011, 11:31 PM

Quote

With most of the browsers now using a unified parsing model (the HTML5 parser), it's no longer an issue.

They still deal with mismatched tags and even malformed tags differently.

Quote

What do you mean unwind and reprocess?

What the parser ultimately does is step through the post and figures out the tags in play, and when it hits a closer (especially of a block level) or certain combinations of block level opener, it reviews all the tags that are open and closes some or maybe all of them.

The way it does it, ultimately, is not much different to constructing a DOM and traversing the node tree, albeit it does it in a linear rather than strict hierarchical fashion.

At each point, not only is the list of open tags maintained, plus block level evaluation, but the potential tags that can be contained inside each other, plus dependencies, are all reviewed too.

There's no way to evaluate a post and find all the instances of url bbc and immediately say beyond doubt that they will always be converted to links, because they won't even if they are otherwise legal. As I said before, there are some rules that even parsed tags cannot contain other tags (url can't contain email or url amongst other things), as well as being able to go up and down the tree in a *very* limited fashion to resolve the table-tr-td chain, which is actually defined in the bbc as well as the rules in preparsecode...

In fact, there is a very old bug on SMF's bug tracker where a table tag is malformed but because it fits the preparser's naive rules and even though there are rules defined for it, the rules aren't actually complete. IIRC, the preparser would actually break the nesting anyway under those circumstances, and that if the preparser were excluded, the main parser's handling would work properly and it would just safely fail to render the table.[1]

This is where it gets interesting. The parser will fix really obvious cases, where tags should be able to be safely closed (non block tags not being terminated when ending a block tag) but it chokes on bigger cases where nesting is broken. But if the nesting rules can be properly enforced at parser level without being broken by the preparser, tags will be safely unrendered until they can be fixed.

Now, if we were to move the full logic from the parser to the preparser, we'd be able to trap improperly nested tags too, and report on them properly.

Quote

Yeah... It'd be a candidate for deletion, I suppose.

SPIP actually supports turning opening dashes automatically into bullet points. At least it did back in 2003... Ah, good old times.

Here's the problem: what ends the list item? What ends the list?

This is one place where I'm actually slightly envious of wiki markup because it actually does it sanely. It has no assumptions about hierarchy, one line is one list item, and the first blank line after the fact is the end of the list. If only it were that simple with the one character shortcuts.

Mind you, if they worked more consistently, it might encourage more use of them.

1.	The alternative is extra closers being injected, causing layout malfunction, seeing how this started out in SMF 1.1.x with its tabular layout.

Title: Re: Fixing mismatched BBCode
Post by: Nao on December 9th, 2011, 12:12 AM

Oh... Not a good sign... I'm already lost in my recursive code... :-/

PHP has a PECL library for bbcode parsing, but it only returns html, not fixed bbc.
There's also a PEAR library written in PHP, it seems to be nicely written albeit a bit large (30KB for main source + more for specific tags like url...), and doesn't provide a fixer by default -- but it does seem to fix tags on its side.
I've also found "NBBC" on sourceforge, it has an extensive test suite and documentation. Unfortunately it doesn't provide a 'check' mode either, only converts directly to html but at least this one is BSD.
It doesn't generate opening tags automatically when finding an orphan closed tag, though... So apart from its alleged speed, it's not that interesting. Plus, it's huge.

Quote from Arantor on December 8th, 2011, 11:31 PM

Quote
With most of the browsers now using a unified parsing model (the HTML5 parser), it's no longer an issue.
They still deal with mismatched tags and even malformed tags differently.

I think the whole point of the html5 tokenizer is to provide for a common ground to handle errors...?

Quote

What the parser ultimately does is step through the post and figures out the tags in play, and when it hits a closer (especially of a block level) or certain combinations of block level opener, it reviews all the tags that are open and closes some or maybe all of them.

Yes, that's pretty much what fixNesting does...

Quote

At each point, not only is the list of open tags maintained, plus block level evaluation, but the potential tags that can be contained inside each other, plus dependencies, are all reviewed too.

NBBC does that. (Wedge, too, obviously. But not fixNesting.)

Quote

Now, if we were to move the full logic from the parser to the preparser, we'd be able to trap improperly nested tags too, and report on them properly.

The reporting is already being done... If it were JUST about that, I wouldn't have worked on the alternative 'fixer' today.
Unfortunately I can't really feel satisfied with just the report code because it's not going to be used in quick edit etc.
(Well... Unless we add an errorbox for quick edit, of course. Which would not be such a bad idea...)

Quote

Here's the problem: what ends the list item? What ends the list?

The end of the line? :P

Quote

This is one place where I'm actually slightly envious of wiki markup because it actually does it sanely. It has no assumptions about hierarchy, one line is one list item, and the first blank line after the fact is the end of the list. If only it were that simple with the one character shortcuts.

We could always draw our inspiration from the best elements in wiki code...

Anyway. Time for bed.
I would have posted my 'bad work in progress' from today, but the source code is pretty fucked up (commented out code, echos and print_rs everywhere...), and I don't want to ruin my reputation :P

Title: Re: Fixing mismatched BBCode
Post by: Arantor on December 9th, 2011, 12:20 AM

Quote

Yes, that's pretty much what fixNesting does...

Yes, but from what I understood of it, it was mostly being done on mismatched tags only, it wasn't fixing or reporting nesting cases that are syntactically correct but semantically invalid, such as a table without a tr in it, would it flag that up as an error? An li not inside a list? These are the case I'm thinking of, not just mismatched tags. Properly handling that would solve a number of problems.

Quote

(Well... Unless we add an errorbox for quick edit, of course. Which would not be such a bad idea...)

It always amazed me that even in the event of a failure, there wasn't proper handling of errors going back to the user.

Quote

The end of the line?

That's the theory but it seems to be broken. Even time I've used the shortcodes, the list items terminate correctly but the list itself doesn't, so I never use them.

Quote

We could always draw our inspiration from the best elements in wiki code...

It makes me wonder at times whether a full bbcode solution really is the best one. It certainly is for some cases (e.g. non-standard/site specific cases) and the quote tag is certainly better handled than the native HTML equivalent (and wiki markup's idea of quoting is... laughable), but it certainly does make for debate about whether bbcode is an overcomplication in some ways.

Quote

There's also a PEAR library written in PHP, it seems to be nicely written albeit a bit large (30KB for main source + more for specific tags like url...), and doesn't provide a fixer by default -- but it does seem to fix tags on its side.

I'm always slightly wary of PEAR code. I've had a lot of bad experiences. But I can't say I'd be entirely surprised at its size, depending on how thoroughly it's doing the job.

Title: Re: Fixing mismatched BBCode
Post by: Nao on December 9th, 2011, 12:25 AM

- only handles mismatched tags, but it's still a lot. I never really planned to rewrite the entire parser... then again, nothing prevents us from writing similar code for other uses.
- nothing prevents us from doing that either... (?)
- well, i dunno then.
- i certainly like the idea of a very fast parser, but we'd have to determine if it features code that can fix bbc without turning it to html. I doubt it has.
- PEAR isn't important here -- just requires a few rewrites to give up on the dependency. then again -- NBBC is > 120KB (60KB after some sort of minification), so that pear library isn't that horrible to begin with.

Title: Re: Fixing mismatched BBCode
Post by: Arantor on December 9th, 2011, 01:13 AM

Well, going to a bigger level, consider WYSIWYG. WYSIWYG in most systems removes the need for parsing entirely because you validate it on save and you're done, you just output raw content later. Notice also that in most cases with WYSIWYG, you also don't have to contend with tag mismatches much, though the sanitisation layer should still fix any issues just in case.

This is why I'm wondering whether bbcode is entirely the right tool for this job. For simple formatting, sure, but we can use basic HTML for that too. For more complex formatting, it's a trade off between the two (tables in particular aren't a lot different between basic HTML and bbcode) while things like lists can be done much more nicely in wiki markup than in bbcode.

It does imply that we might look beyond using bbcode anyway and would trim the fat by having it done in other places and other ways. But I do like the consistency of everything being bbcode, to a point.

Hmm. It's complicated.

Title: Re: Fixing mismatched BBCode
Post by: Nao on December 9th, 2011, 09:59 AM

- wysiwyg is still hell... Saving as raw data is worse. Especially when you start changing the output for a bbcode...
- the main issue with bbc vs html, I think, is that most forum users are used to bbcode. So, basically, even if you enabled basic html tags by default, we would still have to support bbcode for those who don't know about html etc... (at best, we could turn it into pseudo-html at parse time;)
- wiki markup isn't very popular, I reckon. If popularity wasn't an issue, I'd have switched to SPIP code long ago :P
- overall... yeah, bbcode is about consistency, mainly. Although I suppose nothing prevents us from adding alternative pseudo-code for people to choose from. But it always makes things more complicated.

I've given up on the recursive code. Anything that ruins my sleep is not something I should keep really. It's a little alarm clock in my head. So I'm back to my 'original' code and instead of going through the list of recorded tags, I'm going back through the stack... It's probably not going to give fantastic results - for instance, I was storing the tag type until now. If you were trying to find the best place for an opener for an orphan '/i', Wedge would go through previous tags, spot a closing quote and stop immediately because it can't enclose quotes inside italics. Things like that... That was pretty cool, but it doesn't work for that friggin' closer nb I mentioned before, and one day hunting for a bug is enough. I'll just make it pretty simple. I HOPE. No guts, no glory. Whatever.

Title: Re: Fixing mismatched BBCode
Post by: Nao on December 9th, 2011, 10:46 AM

Code: [Select]

[quote][i] [b] Hello [/i] [/b] [/s][/quote]

In a situation like this, my original code would first fix the mismatched tags in the middle, then it would look for an opener to 's' and eventually add one at the very beginning... Which would break everything because there's a quote in between. So I added a 'last_safe' variable which pointed out what was the last *safe* place to insert something (i.e. anything BEFORE it is considered valid and thus shouldn't be messed with again.)
Problem is, in a situation like that, the variable would be, at best[1] set to... The s closer's position. So we'd end up with an opener, immediately followed by a closer.

So... I'd like to know which you think is best. Shall we:

1- silently remove closer tags if no openers were found?

Code: [Select]

[quote][i] [b] Hello [/b] [/i][/quote]

2- add an opener right before them?

Code: [Select]

[quote][i] [b] Hello [/b] [/i] [b] [/b] [s] [/s][/quote]

3- leave them be, whatever, except maybe for code tags?

Code: [Select]

[quote][i] [b] Hello [/b] [/i] [/b] [/s][/quote]

Posted: December 9th, 2011, 10:44 AM

:edit: Added a footnote. BTW, my favorite is (1), personally...

1.	I'm saying "at best" because there's still an open quote at this point, so I'd have to implement code that would check LATER tags to make sure the quote is actually closed itself and is thus safe...

Title: Re: Fixing mismatched BBCode
Post by: Arantor on December 9th, 2011, 10:52 AM

Quote

- wysiwyg is still hell... Saving as raw data is worse. Especially when you start changing the output for a bbcode...

Then don't change the output for bbcode and just save as raw data and be done with it. Other systems do this, and it would ease the parse_bbc overhead.

Quote

- the main issue with bbc vs html, I think, is that most forum users are used to bbcode

Not sure I agree with that. I suspect more people would use the WYSIWYG editor if it weren't so buggy. Hell, I could see myself using it if it had proper keybindings (which IIRC some of the editors do support), because it's even faster to hit Ctrl-B to go bold than it is to use the bbcode.

The biggest stumbling block to WYSIWYG will always be complex bbcode that has no direct equivalent in HTML, e.g. the code or quote tags, not the simple stuff.

Quote

- wiki markup isn't very popular, I reckon. If popularity wasn't an issue, I'd have switched to SPIP code long ago

Well, I'm also not sure I agree with that. It depends more on the aspect of it you're looking at.

Wiki link syntax isn't clever in a forum that supports bbcode due to overloading on the [] operators not to mention the fact that you're not generally linking to conveniently titled pages (though the external link syntax is usable enough outside of operator overloading), the : indentation operator is a bit of a joke (because you don't do that in a forum generally), pipe syntax for tables is beyond a headache, but the place where wiki markup really shines is in lists.

It's almost impossible to screw up a wiki syntax list or if you do, it's usually not difficult to get it to what you want. The caveat is that it's a single line only per entry as opposed to a fully delineated entity (e.g. with li tags that can contain paragraphs, though I'm not sure that's a huge problem)

I agree about the consistency factor of bbcode, overall, it does make things consistent both for the user and the code, but it might be nice to provide wiki-style conventions as well. Probably as a plugin, though, just because it's an extra level of complexity otherwise.

It's a brave thing you've taken on there - overhauling the preparser/parser for tag mismatches was always on my todo list but I haven't been brave enough to tackle it just yet.

Quote

3- leave them be, whatever, except maybe for code tags?

I'm inclined to go with this, provided that the user is made aware that there was a change and that they might want to review the post (since an extra b closer tag has been added and there is now going to be unparsed b closer left behind) and hopefully they'll spot that the s is unopened.

We can't reach inside their mind and to a point we shouldn't be trying to do so. The most reliable of cases we can do something about but where there's any room for error, leave it be.

Title: Re: Fixing mismatched BBCode
Post by: Nao on December 9th, 2011, 12:03 PM

Quote from Arantor on December 9th, 2011, 10:52 AM

Then don't change the output for bbcode and just save as raw data and be done with it.

Not changing the output is opening the door to many problems... No?
There are many cases where it can be 'fixed' by using CSS and only applying the changes to them, but when you start using JS in a bbc tag...

Quote

Not sure I agree with that. I suspect more people would use the WYSIWYG editor if it weren't so buggy. Hell, I could see myself using it if it had proper keybindings (which IIRC some of the editors do support), because it's even faster to hit Ctrl-B to go bold than it is to use the bbcode.

Hmm... I dunno.
Well, if we could get the wysiwyg editor to actually show quotes as HTML, it would certainly solve a lot of issues. (And that's where my automatic quote splitter would come in very handy, because that'd really be the only way to split quotes at all...)

Quote

The biggest stumbling block to WYSIWYG will always be complex bbcode that has no direct equivalent in HTML, e.g. the code or quote tags, not the simple stuff.

<we:code> and <we:quote>, that being said...

Quote

It's a brave thing you've taken on there - overhauling the preparser/parser for tag mismatches was always on my todo list but I haven't been brave enough to tackle it just yet.

At the risk of disappointing you -- I didn't plan for an overhaul or anything... I simply wanted to change SMF's behavior of fixing mismatched tags when it could simply return an error. Then it evolved into pointing out the exact location of the error... Then it evolved again into fixing the errors automatically when Wedge didn't, or couldn't, expect a $post_errors return message.

Quote

I'm inclined to go with this, provided that the user is made aware that there was a change and that they might want to review the post (since an extra b closer tag has been added and there is now going to be unparsed b closer left behind) and hopefully they'll spot that the s is unopened.

Hmm... I don't know, maybe we could silently remove any non-block tags, because they're mostly for details, while we could leave in the block tags where the 'bug' might actually break the post layout...?

Title: Re: Fixing mismatched BBCode
Post by: Arantor on December 9th, 2011, 06:38 PM

Quote

Not changing the output is opening the door to many problems... No?

Not if you've ascertained that it's valid HTML and you always treat it as such (like always throwing it back and forth to the WYSIWYG editor)

Quote

There are many cases where it can be 'fixed' by using CSS and only applying the changes to them, but when you start using JS in a bbc tag...

As Karl pointed out when he brought the subject up, there are libraries out there that will allow you to exclude scripting and other similar unwanted content. There's no reason why any other tags can't be excluded either.

Quote

Hmm... I dunno.
Well, if we could get the wysiwyg editor to actually show quotes as HTML, it would certainly solve a lot of issues. (And that's where my automatic quote splitter would come in very handy, because that'd really be the only way to split quotes at all...)

This is where it gets very problematic. None of the hybrid solutions I've seen that support both WYSIWYG and bbcode do this. They always leave the bbcode alone and render it from bbcode to HTML as needed.

The only practical alternative is to have it be able to identify the quote tag (and all the header-y bits) and be able to convert it, but if people will want to customise the look of quote tags that can't be done in CSS... it's going to be ugly.

Quote

<we:code> and <we:quote>, that being said...

That being said, it won't render properly in the context of WYSIWYG. WYSIWYG editors, conceptually, are voodoo, and not nice voodoo. (They all work principally the same way: take an iframe, receive all events while the 'textbox' has focus, and transmit all the events to handlers to manipulate the iframe as if it were a true RTE component.)

That's why SMF's code is so buggy, the different browsers generate different HTML fragments in the iframe.

Quote

At the risk of disappointing you -- I didn't plan for an overhaul or anything... I simply wanted to change SMF's behavior of fixing mismatched tags when it could simply return an error. Then it evolved into pointing out the exact location of the error... Then it evolved again into fixing the errors automatically when Wedge didn't, or couldn't, expect a $post_errors return message.

No disappointment at all. Anything that affects tag processing is a scary business.

Quote

Hmm... I don't know, maybe we could silently remove any non-block tags, because they're mostly for details, while we could leave in the block tags where the 'bug' might actually break the post layout...?

That seems reasonable, except for the cases when people do post things without realising that they'll have interesting side effects.

For example, more than once I've seen people shorten [Unknown]'s name to [U], and be surprised at the result. The reason for their surprise is that other systems just silently fail to render the tag at all if there isn't a safe matching closer (IIRC vBulletin does/did this)

There's no one right answer for this.

Title: Re: Fixing mismatched BBCode
Post by: Nao on December 9th, 2011, 10:41 PM

Okay, it's like there's an endless list of bugs to fix with this...
First, my code was foolishly taking care of mismatched tags INSIDE code tags. Woops...
Obviously, this had to go. Done. I'm also not fixing anything inside nobbc tags. Anything else?

Now to an interesting bug... My s tag (strike) was not closing correctly. What the hell?
Turns out -- and I only found out after a painful debugging session -- that 's' is actually considered as a block-level tag by Wedge... This isn't the case in SMF, so I'd say it's a typo by the unknown author who moved bbcodes to the database... :whistle:
Also, and this one IS in SMF, the br tag is not a block-level, while the hr tag is. It doesn't sound very logical to me... I'd tend to say they both are, though. Not that it will matter in my code anyway because it skips self-closing tags.
I'd also consider img tags to be block level... Shouldn't they?
li tags and things like that are block level, and THAT bothers me a bit... Because my code treats block level tags in a special way, and I suspect this wouldn't properly add the closer tags, and instead accumulate tags. i.e. [li]Hello[li]world[/li] would add a closer after the last closer, instead of before the last opener.

Also. I added the code to turn a closer quote into an opener if it turned out it's at the start of a line and is followed by anything but a newline. Then I figured, okay MAYBE people could actually start a new line, type in [ /quote] and then immediately after, another [/ quote] to close a nested quote. Or even [ /code] or whatever... So I added a test for '[/' after the tag. Now, I suppose there are suddenly dozens of new ways to break this... Obviously I suppose I could test 'simply' for a-z... I don't know. It's hell.
And all of that because I sometimes would have one of these closers at the start of a line when I meant to be using an opener.

Sometimes I wonder if we shouldn't just, ahem, call parse_bbc inside preparsecode, and have parse_bbc return a pseudo-BBC version instead of a html version... Would probably make things... easier. For me. Ah ah.
Well, I'm saying that but I've nearly reached the end... (And then in a weeks time I'll still be on it........... It's just a fucking pre-parser, arghhh!!)

Quote from Arantor on December 9th, 2011, 06:38 PM

This is where it gets very problematic. None of the hybrid solutions I've seen that support both WYSIWYG and bbcode do this. They always leave the bbcode alone and render it from bbcode to HTML as needed.

I don't see where it would be impossible to implement.
We just need to always have the same code for quotes for instance... I mean, the current code generator is very simple. It can easily be emulated through jQuery to actually add that code to the post.... Heck, we could even take the quote's HTML code from within the database, and put it into a JS string...! That's the logical way to do it, even...

Quote

That being said, it won't render properly in the context of WYSIWYG. WYSIWYG editors, conceptually, are voodoo, and not nice voodoo. (They all work principally the same way: take an iframe, receive all events while the 'textbox' has focus, and transmit all the events to handlers to manipulate the iframe as if it were a true RTE component.)

That, and contentEditable.

Quote

For example, more than once I've seen people shorten [Unknown]'s name to [U], and be surprised at the result. The reason for their surprise is that other systems just silently fail to render the tag at all if there isn't a safe matching closer (IIRC vBulletin does/did this)

That's... interesting.
But then again, as important Unknown is the the SMF community (it's a no-brainer), I'm prepared to have his name removed entirely from posts if people shorten it to 'u'...

Quote

There's no one right answer for this.

In our case, with my code, the missing u closer will be added.

Title: Re: Fixing mismatched BBCode
Post by: Arantor on December 9th, 2011, 11:00 PM

Quote

Okay, it's like there's an endless list of bugs to fix with this...
First, my code was foolishly taking care of mismatched tags INSIDE code tags. Woops...
Obviously, this had to go. Done. I'm also not fixing anything inside nobbc tags. Anything else?

The contents of html tags should be left untouched too.

Quote

Now to an interesting bug... My s tag (strike) was not closing correctly. What the hell?
Turns out -- and I only found out after a painful debugging session -- that 's' is actually considered as a block-level tag by Wedge... This isn't the case in SMF, so I'd say it's a typo by the unknown author who moved bbcodes to the database...

The hell? I see why I made the mistake now, though, there's a couple immediately before it and I just hit the wrong thing out of repetition.

Quote

Also, and this one IS in SMF, the br tag is not a block-level, while the hr tag is. It doesn't sound very logical to me... I'd tend to say they both are, though. Not that it will matter in my code anyway because it skips self-closing tags.
I'd also consider img tags to be block level... Shouldn't they?

hr is a block tag in HTML, br and img are not.

Putting aside the semantic changes in HTML5 surrounding the a tag, consider it this way: would you consider putting an img inside an a to be valid? Answer: yes. Can't be a block tag then, because you're implicitly allowing it to be running inline.

Same question for br... answer: yes. Same reason. Although it has a layout effect, it's not something that would logically or semantically split a block element. Splitting text nodes, sure, but not splitting an element semantically. A link with a br in it would just be a vertically set link.

Both br and img are special tags in their own right because they're the class of tag that isn't markup by definition (much as hr is part of the same group) but that they have content-replacement as their focus: each tag replaces itself with content in render-time, they're not affecting presentation or structure of any other content (as a div or a span or a strong tag might). But hr doesn't logically fit being part of a link, and you'd (theoretically) never use it in an inline context, and I don't think it's valid for it to be either.

Quote

li tags and things like that are block level, and THAT bothers me a bit... Because my code treats block level tags in a special way, and I suspect this wouldn't properly add the closer tags, and instead accumulate tags. i.e. [li]Hello[li]world[/li] would add a closer after the last closer, instead of before the last opener.

Here's the thing. Do we know what the user was intending to do with this? Ignoring what preparsecode will do with that, that is. Is it meant to be two separate list items, or a list item containing a sublist of another item?

We have two choices. We can manipulate it to the 'most likely' case of affairs, or we can avoid rendering it. If the regexps in preparsecode were not run, I can actually tell you what the parser would do: it would render that as a single list item with the bare li bbcode unmatched in the middle.[1]

Trouble is, with cases like this, it's sufficiently ambiguous (block or not) that we can't really make a call on how it should be changed.

FWIW, list items are considered block items because they can contain lists which are also implicitly block items.

Quote

Also. I added the code to turn a closer quote into an opener if it turned out it's at the start of a line and is followed by anything but a newline. Then I figured, okay MAYBE people could actually start a new line, type in [ /quote] and then immediately after, another [/ quote] to close a nested quote. Or even [ /code] or whatever... So I added a test for '[/' after the tag. Now, I suppose there are suddenly dozens of new ways to break this... Obviously I suppose I could test 'simply' for a-z... I don't know. It's hell.

People are dumb. Make the software smarter, you get dumber people come along. We can't predict how they'll do stuff and to a point I sort of think we shouldn't try and prejudge the intent of users anyway.

Quote

Sometimes I wonder if we shouldn't just, ahem, call parse_bbc inside preparsecode, and have parse_bbc return a pseudo-BBC version instead of a html version... Would probably make things... easier. For me. Ah ah.
Well, I'm saying that but I've nearly reached the end... (And then in a weeks time I'll still be on it........... It's just a fucking pre-parser, arghhh!!)

Something along these lines, yes, but I'd move/clone the code, I wouldn't supersize parse_bbc to be a hybrid HTML and legal bbc parser, since the uses are rather different even if the underlying logic is the same.

Quote

I don't see where it would be impossible to implement.
We just need to always have the same code for quotes for instance... I mean, the current code generator is very simple. It can easily be emulated through jQuery to actually add that code to the post.... Heck, we could even take the quote's HTML code from within the database, and put it into a JS string...! That's the logical way to do it, even...

Which is fine until someone modifies it. Or manually tries to add it. There's all kinds of ways that could be broken, which is why no-one does it.

Quote

That, and contentEditable.

I'm not 100% sure contentEditable is required, you know. I don't think certain browsers (glares at IE6) support it.

Quote

That's... interesting.
But then again, as important Unknown is the the SMF community (it's a no-brainer), I'm prepared to have his name removed entirely from posts if people shorten it to 'u'...

Thing is, that's what already happens in SMF. The missing u will already be cleaned up softly by parse_bbc at the end of a block tag or end of the post if it hasn't otherwise been closed, and it will underline the content. That's what I meant by unexpected behaviour.

Quote

In our case, with my code, the missing u closer will be added.

In our case we make the same decision only it's stored into the post rather than being added softly otherwise.

1.	Well, that's what my understanding of the requires_parent code intimates it should do, put it that way, since requires_parent for li says it requires a list bbcode as a parent and it won't have one, so it should be left alone...

Title: Re: Fixing mismatched BBCode
Post by: Nao on December 10th, 2011, 06:34 PM

I'm nearly done, and that's a feat really... Considering it took me two days to get it right...
I've only got a couple of quote tags opened at the end for no reason. There was an annoying number of bugs in my code which caused all of the problems -- most notably, I'm a bit ashamed of it, an array which I was accessing as if its first entry was $array[1]... Oops. No wonder it kept failing.

Anyway, I'm pretty disappointed overall with the 'dumbness' of the code, in that it doesn't really do much to help. It adds openers in boring places but at least it does it. Could be worse... Also, I'm still unable to get it right when it comes to block tags and inline tags. If you have, for instance, [b][s][/quote] and there's no quote opener in the post, Wedge won't even (as of now) close the b and s tags, it'll simply add an opener before the closer quote... It is definitely dumb. (It correctly closes the tags if it can find an opener quote before the b and s tags.) Problem is, I'm not even sure how best to fix this string, more precisely, where should I add the quote opener...? My guts tell me that it should be after the LAST closed block tag (if it finds any).
Oh, and don't get me started on more complicated setups of block tags being inserted inside non-block tags... I don't think I have any way to protect against these ultimately. Add to that the fact that these tags are to be inserted inside the original post, without its surrounding crap etc... And I haven't even started to consider checking whether a block closer is at the beginning of a line and immediately followed by contents -- in which case it should be turned into an opener for sure. It's so deadly... And tiring. And headache-inducing (literally.)

PS: this post needs no reply, I wrote it yesterday afternoon and forgot to post it... It's just a bonus post. I haven't worked on the feature today, as I'm very, very wary of the difficulties I know I will meet when finishing it. I'm quite tempted to just close unclosed tags when meeting a block closer and then add the block opener just before the closer... And as for non-block closers, just remove them entirely. Uh.

Title: Re: Fixing mismatched BBCode
Post by: Nao on December 10th, 2011, 11:49 PM

I'm trying to figure out a 'simple' way to prevent nesting 'li' tags inside others...
Does 'li' have some peculiarity in the bbcode table? I see there are things with require_parent, disallow_children and things like that, but I'm not sure I get the gist of it... li requires the list parent but doesn't disallow anything, right? Shouldn't it disallow li children..?

I didn't make much progress on fixNesting today. Only bug fixed is one that didn't allow me to precisely show where the errors were found... Now it should work fine. i.e. instead of giving you the bad tag in the list of tags, it shows you the bad tag, surrounded by the context in the message (post contents, other tags...)

Title: Re: Fixing mismatched BBCode
Post by: Arantor on December 10th, 2011, 11:56 PM

Quote

I'm trying to figure out a 'simple' way to prevent nesting 'li' tags inside others...

You're not entirely supposed to. An li can contain a sub list, provided that the sublist is contained inside a list tag (as HTML itself allows)

That's the beauty of the nesting logic as it stands - if the li > li isn't inside a list (i.e. it's not li > list > li), the inner li should not be parsed by the bbc parser. That's why preparsecode munges it first, so that you always get li being contained inside a list and not anything else.

You don't need to explicitly disallow li as a child, as IIRC disallowing as a child may even disallow it entirely in the nested chain which isn't necessarily desirable (because IIRC it would disallow the otherwise valid permutation of li > list > li)

Title: Re: Fixing mismatched BBCode
Post by: Nao on December 11th, 2011, 08:42 PM

I would assume that even if li > list > li is valid, it would have to be coded this way: li authorizes list and only that -- and list authorizes li and only that... That way, problem fixed. (i.e. I could rely on this setting to fix tag nesting.)

Anyway...

The current code is nearly TWO HUNDRED freaking lines long[1]. As opposed to the original dozen lines that did their job perfectly, i.e. point out any mismatched tags...
I'm not even sure it's worth committing the code. I mean, it (finally) works on my complex string, but OTOH, it's become such a beast that I'm sure we'll definitely find situations where it doesn't work very well... :-/

It was about 150 lines long when I suddenly decided Wedge should be nice enough to prevent using block tags inside non-block tags. Heck... Even then, if you have something strange like an opening "s", then an opening "b", then a closing "s", it will add the closing "b" before the opening block tag of course, but then again AFTER the closing "s". Don't ask too much of me...

Title: Re: Fixing mismatched BBCode
Post by: Arantor on December 11th, 2011, 09:12 PM

No, li doesn't have to authorise that. The only rule it really needs is that li must be inside a list. Then the trick is making sure where list is, which IIRC is anywhere that a block context allows it to be.

Title: Re: Fixing mismatched BBCode
Post by: Nao on December 11th, 2011, 10:13 PM

Quote from Arantor on December 11th, 2011, 09:12 PM

No, li doesn't have to authorise that. The only rule it really needs is that li must be inside a list. Then the trick is making sure where list is, which IIRC is anywhere that a block context allows it to be.

The preparsecode stuff to check for this is so fucked up, once again...

Okay, I'm... I'm... I think I'm finished with fixNestings. :sob:

It's really, really horribly complex. 200 lines of code after cleanup. It seems to work but it's so... fragile. I don't know if I should commit.

So I just need to know -- is anyone of you with read access, willing to commit some of their time to look through my new code, and try to ensure it won't break anything...?

Oh, and I just discovered that using specific footnote contents, you can break your page layout. Yay.

Title: Re: Fixing mismatched BBCode
Post by: Arantor on December 11th, 2011, 10:17 PM

Quote

The preparsecode stuff to check for this is so fucked up, once again...

preparsecode only checks that li is contained inside a list and that list contains li tags, or it should be. Anything else (like the parent/children stuff) is done at parse time.

As far as testing goes, I think it's the sort of thing that is simply going to require hammering until we can say that it doesn't break. We can construct unit tests for it, but they're only going to be generic cases, unfortunately.

Title: Re: Fixing mismatched BBCode
Post by: Nao on December 11th, 2011, 11:45 PM

Okay, done...

I've made so many tests using my test code, it should really be okay at this point... But you never know.
Also, I need to fix that bug with nb parsing... Grmpf. It's like it never ends... And my code is pretty useless -- I mean, the auto-fixer doesn't really have a point because parse_bbc already accounts for most errors...

Title: Re: Fixing mismatched BBCode
Post by: Arantor on December 11th, 2011, 11:48 PM

Yes, but if the parser can safely avoid accounting for those things, it can be made faster...

Title: Re: Fixing mismatched BBCode
Post by: Nao on December 12th, 2011, 12:02 AM

I'll leave it up to you to look into all of this... I feel I have done my share of the work on this for now! :P

Title: Re: Fixing mismatched BBCode
Post by: Arantor on December 12th, 2011, 12:06 AM

You have indeed :)

Title: Re: Fixing mismatched BBCode
Post by: groundup on December 27th, 2011, 04:21 AM

Quote from Nao on December 8th, 2011, 12:14 AM

Okay, I'll look into doing that tomorrow...

Still, complex errors won't be magically fixed, I'm afraid... Unless, unless I do a stricter check.
So while typing this post, it came to me that I could use a stack of tags and stack/unstack data and... Well, have a look at this code:

Code: [Select]
[/quote][quote author=Nao link=msg=1 date=1309111289]Lorem ipsum?[/quote]What is that? I don't speak rubbish! [nb]I'm wondering if Rory does, though? He had time to learn...[ /code] [ code][ nb][ /code][ /nb]

And... Here's the error message I'm getting following my latest rewrite. Which is actually SHORTER than the last one ;) It's not perfect but I'm working on it eheh.

Pretty cool uh?

PS: and yes, it works with tag nesting too, since it's a stack... i.e. if I have properly nested 'b' tags inside the quote, everything's fine.

PPS: the main issue with fixing tags in the middle of a message is that I would then have to find the exact position of the tag... I guess it's doable, though, but I'll have to go through a series of strpos etc or something to fill in the list first, so it'll definitely make the code bigger.

Looks very exciting. I would love to see a link on there to fix them automatically by "guessing" like we do now or at least give the user a hint.

Title: Re: Fixing mismatched BBCode
Post by: Nao on December 27th, 2011, 09:37 AM

You mean you'd like to see some pseudo code on how I did it? :p

Title: Re: Fixing mismatched BBCode
Post by: live627 on December 27th, 2011, 09:53 AM

What I think he's trying to say is to offer the user a link to try to fix the codes or simply tell the user of mismatched tags like you already have in the post error box

Title: Re: Fixing mismatched BBCode
Post by: Arantor on December 27th, 2011, 10:29 AM

The problem with auto fixing is that you can actually make things so much worse.

Title: Re: Fixing mismatched BBCode
Post by: Nao on December 27th, 2011, 10:43 AM

I think our current implementation works best.
But I'd like to ensure the error message path is used for quick edit as well. Firefox does it already iirc. At least on my custom noisen code. I need to remember working on that.