Wedge

Nao · « Reply #30, on December 7th, 2011, 09:03 PM »

Preparsecode can't refuse a post and send back a list of errors afaik...

Arantor · « Reply #31, on December 7th, 2011, 10:23 PM »

No, it can't. But it can set a global variable to be acted upon.

In any case there's more stuff that doesn't go through Post2 than just Aeva comments. Quick modify for one.

Nao · « Reply #32, on December 7th, 2011, 10:38 PM »

Maybe we could have the test in an external function. If called from Post2, return an error. Otherwise try to fix it by adding as many missing tags as required. Yay?

Arantor · « Reply #33, on December 7th, 2011, 10:59 PM »

Works for me :)

Purely academically, is the goal for fixing mismatched tags so that it can be removed from the bbc parser? (If so, we probably *should* also route imported posts through this code too)

Nao · « Reply #34, on December 8th, 2011, 12:14 AM »

Okay, I'll look into doing that tomorrow...

Still, complex errors won't be magically fixed, I'm afraid... Unless, unless I do a stricter check.
So while typing this post, it came to me that I could use a stack of tags and stack/unstack data and... Well, have a look at this code:

Code:

[/quote][quote author=Nao link=msg=1 date=1309111289]Lorem ipsum?[/quote]What is that?
I don't speak rubbish!
[nb]I'm wondering if Rory does, though? He had time to learn...[ /code]
[ code][ nb][ /code][ /nb]

And... Here's the error message I'm getting following my latest rewrite. Which is actually SHORTER than the last one ;) It's not perfect but I'm working on it eheh.

Pretty cool, huh?

PS: and yes, it works with tag nesting too, since it's a stack... i.e. if I have properly nested 'b' tags inside the quote, everything's fine.

PPS: the main issue with fixing tags in the middle of a message is that I would then have to find the exact position of the tag... I guess it's doable, though, but I'll have to go through a series of strpos etc or something to fill in the list first, so it'll definitely make the code bigger.
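A minimal sketch of the stack idea described above, written in Python rather than Wedge's own code. The tag list, the regular expression and the function name `find_tag_errors` are assumptions made for illustration. Each opener is pushed with its offset, so the error list can also say where the offending tag sits.

```python
import re

# Hypothetical list of paired tags; a real forum would read this from its BBCode definitions.
PAIRED_TAGS = {"quote", "b", "s", "nb", "code"}
TAG_RE = re.compile(r"\[(/?)([a-z]+)(?:[ =][^\]]*)?\]")

def find_tag_errors(text):
    """Return a sorted list of (offset, message), one entry per mismatched tag."""
    errors = []
    stack = []  # (tag name, offset) for every tag that is still open
    for m in TAG_RE.finditer(text):
        closing, name = m.group(1) == "/", m.group(2)
        if name not in PAIRED_TAGS:
            continue
        if not closing:
            stack.append((name, m.start()))
        elif any(n == name for n, _ in stack):
            # Unwind: whatever was opened after the matching opener was left unclosed.
            while stack[-1][0] != name:
                inner, offset = stack.pop()
                errors.append((offset, f"[{inner}] is not closed before [/{name}]"))
            stack.pop()
        else:
            errors.append((m.start(), f"[/{name}] has no matching opener"))
    # Anything still on the stack was never closed at all.
    for name, offset in stack:
        errors.append((offset, f"[{name}] is never closed"))
    return sorted(errors)
```

Properly nested input such as `[quote][b]ok[/b][/quote]` yields an empty list, which is the "it works with tag nesting too" property mentioned in the PS.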

Arantor · « Reply #35, on December 8th, 2011, 12:39 AM »

Oh, I think that's pretty awesome :)

billy2 · « Reply #36, on December 8th, 2011, 08:48 AM »

:wow:
Awesome indeed.
:)

Typo - 'show' should be 'shown'

Nao · « Reply #37, on December 8th, 2011, 09:29 AM »

Thanks, I didn't notice that one bit ;)

Okay, I'm in the process of moving the code to the wedit object (where it should have been from the beginning), and trying to fix the code automatically... So, considering the fact that the most important content (topic posts) is okay because we clearly specify the errors, fixing posts automatically shouldn't be too much of a hassle but obviously the code to actually fix them is going to be more complex...

So, let's say I have this:

Code:

[quote]post1[/quote]comment[/quote]post2[/quote]

I think it's safe to say that the poster added a / by mistake, and it should be removed, but I hardly see how Wedge is going to be able to spot it automatically without going into a great deal of large-scale testing. So I suppose we could do it this way...

Code:

[quote]post1[/quote][quote]comment[/quote][quote]post2[/quote]

i.e., the extra /quote knows there is no previous matching quote tag, so it simply looks for the *last* tag it found (or MAYBE the last closer tag of its kind, i.e. quote?), and it adds an opener tag right after it. Then we mark the tag as fixed (i.e. we remove 'quote' from the stack of opened tags). So, we are in a situation where 'comment' is suddenly stuck into a quote tag. Continue as usual. Then we spot the other closer quote, and we do the same, i.e. add an opener tag after the last closer, so in this case the 'post2' bit is fixed.
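The rule described above can be sketched roughly as follows. This is Python, not the actual Wedge code, and `PAIRED_TAGS`, `TAG_RE` and `add_missing_openers` are hypothetical names. An orphan closer gets an opener inserted right after the last tag that was seen before it.

```python
import re

# Hypothetical list of paired tags, for illustration only.
PAIRED_TAGS = {"quote", "b", "s"}
TAG_RE = re.compile(r"\[(/?)([a-z]+)(?:[ =][^\]]*)?\]")

def add_missing_openers(text):
    """Give every orphan closer an opener placed right after the previous tag."""
    stack = []          # names of the tags currently open
    insertions = []     # (offset in the original text, opener to insert)
    last_tag_end = 0    # end offset of the most recent tag; 0 if none yet
    for m in TAG_RE.finditer(text):
        closing, name = m.group(1) == "/", m.group(2)
        if name not in PAIRED_TAGS:
            continue
        if not closing:
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        else:
            # Orphan closer: its opener goes right after the last tag we saw.
            insertions.append((last_tag_end, f"[{name}]"))
        last_tag_end = m.end()
    # Apply from the end of the string so that earlier offsets stay valid.
    for pos, tag in reversed(insertions):
        text = text[:pos] + tag + text[pos:]
    return text
```

On the sample post this produces the three consecutive quotes shown above. Unclosed openers are deliberately left alone here, since this sketch only covers the orphan-closer rule.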

The obvious problem with this solution is that our comment is now in a quote. But because we'll have three separate quotes in a row, it'll be *relatively* obvious (not captain-obvious obvious, but still), that the middle one is a reply to the previous quote. What do you think..?

I was thinking of other solutions, like checking whether a tag is something like '[/quote author=Nao]', which in this case would mean "it's an opening quote where the / was added by mistake" but I don't really think it's a realistic case.

Now for another test case...

Code:

[quote][b] [s] post1[/quote]

Using the pseudo-code from above, this would be really messy -- an opening quote would be added just before the closer, and then the s and b tags would remain opened until the end of the post, where they would then be closed forcibly.
So in this case, I think it's best that when we look at the latest opened tag in the stack, and it's not our closer, we simply add the related closer automatically and then keep going through the stack in reverse order, closing tags as required, until we find ours (or not). This is actually pretty much what my code is doing right now, as opposed to the pseudo-code in the first example.
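A rough Python sketch of that unwinding step, under the assumption of a fixed tag list. The names `PAIRED_TAGS`, `TAG_RE` and `close_inner_tags` are made up for the example, and orphan closers are simply left untouched.

```python
import re

# Hypothetical list of paired tags, for illustration only.
PAIRED_TAGS = {"quote", "b", "s"}
TAG_RE = re.compile(r"\[(/?)([a-z]+)(?:[ =][^\]]*)?\]")

def close_inner_tags(text):
    """When a closer arrives, first close every tag opened after its opener."""
    stack, out, pos = [], [], 0
    for m in TAG_RE.finditer(text):
        closing, name = m.group(1) == "/", m.group(2)
        if name not in PAIRED_TAGS:
            continue
        out.append(text[pos:m.start()])   # plain text before this tag
        pos = m.end()
        if not closing:
            stack.append(name)
        elif name in stack:
            # Walk the stack in reverse order, adding closers until we find ours.
            while stack[-1] != name:
                out.append(f"[/{stack.pop()}]")
            stack.pop()
        # A closer with no opener at all is copied through unchanged in this sketch.
        out.append(m.group(0))
    out.append(text[pos:])
    # Whatever is still open at the end of the post gets closed there.
    out.extend(f"[/{name}]" for name in reversed(stack))
    return "".join(out)
```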

Now, if we mix our two examples...

Code:

[quote]post1[/quote][b] [s] comment[/quote]post2[/quote]

The closer quote will trigger a search for the last closed tag, which in this case is another closer quote, BUT between them it will find two unclosed tags... Which gets confusing, so it's best to first close any opened tags, and THEN go through the search for a place to add a new quote opener.
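Putting the two rules together in that order (close what is open first, then look for a place for the missing opener) might look like the sketch below. It is Python with hypothetical names (`fix_nesting` here is not the real Wedge method), and it makes one extra assumption: only tags opened since the last closer are force-closed, so that the inserted opener never ends up inside an older, still-open tag.

```python
import re

# Hypothetical list of paired tags, for illustration only.
PAIRED_TAGS = {"quote", "b", "s"}
TAG_RE = re.compile(r"\[(/?)([a-z]+)(?:[ =][^\]]*)?\]")

def fix_nesting(text):
    stack = []           # (name, offset) of the tags still open
    insertions = []      # (offset in the original text, tag), in decision order
    last_closer_end = 0  # end offset of the most recent closer
    for m in TAG_RE.finditer(text):
        closing, name = m.group(1) == "/", m.group(2)
        if name not in PAIRED_TAGS:
            continue
        if not closing:
            stack.append((name, m.start()))
            continue
        if any(n == name for n, _ in stack):
            # Known opener: close everything opened after it, innermost first.
            while stack[-1][0] != name:
                insertions.append((m.start(), f"[/{stack.pop()[0]}]"))
            stack.pop()
        else:
            # Orphan closer: first close whatever was opened since the last closer...
            while stack and stack[-1][1] >= last_closer_end:
                insertions.append((m.start(), f"[/{stack.pop()[0]}]"))
            # ...and only then add the missing opener right after that closer.
            insertions.append((last_closer_end, f"[{name}]"))
        last_closer_end = m.end()
    # Tags still open at the end of the post are closed there.
    insertions.extend((len(text), f"[/{n}]") for n, _ in reversed(stack))
    # Stable sort by offset, then apply back to front so offsets stay valid.
    for pos, tag in reversed(sorted(insertions, key=lambda item: item[0])):
        text = text[:pos] + tag + text[pos:]
    return text
```

The parallel list of positions is what makes this longer than the plain validator: every decision is recorded against the original offsets and only applied at the end.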

It's all very 'amusing' because I have to maintain a parallel stack of tag positions and the code has currently jumped from 15 lines to 50, which caused me to write this post in the hope that it'll allow me to sort things out... :lol:

Anyway, opinions welcome...
Hey, perhaps someone has heard of some BSD/MIT code available online that precisely does just that -- a flawless fix of all BBC or HTML tags left opened or closed... :P

Arantor · « Reply #38, on December 8th, 2011, 09:41 AM »

Quote

Using the pseudo-code from above, this would be really messy -- an opening quote would be added just before the closer, and then the s and b tags would remain opened until the end of the post, where they would then be closed forcibly.

They shouldn't be. The bbc parser should actually close both the s and b tags, honouring proper hierarchy, when it gets to the end of quote.

Quote

~~post1~~

Like so, in fact. (And if you check the source, you'll see it's unmodified.) It works because there's the end of a block level tag with unresolved non block tags inside it. The exact behaviour is incredibly complicated, and is no doubt one of the reasons why the bbc parser is so big and scary - but it's also resilient. This is why I asked if the idea behind this was partly to reduce its complexity or not, because it actually does a lot of silent fixing that most people don't even realise.

The problem as you've discovered with writing such a solution is that unless you can get inside the user's brain and figure out what they meant, rather than what they typed, you have no hope of getting it consistently right. This is, incidentally, probably the one good example of where WYSIWYG actually works, because it allows people to see directly what they're using which typically means fewer mashed up tags.

Nao · « Reply #39, on December 8th, 2011, 10:00 AM »

Quote from Arantor on December 8th, 2011, 09:41 AM

They shouldn't be. The bbc parser should actually close both the s and b tags, honouring proper hierarchy, when it gets to the end of quote.
Quote
~~post1~~
Like so, in fact.

Allow me to quote myself:

Quote

it's best to first close any opened tags, and THEN go through the search for a place to add a new quote opener.

So, yes, I totally agree... It's just that my original pseudo-code doesn't account for that, and I knew I was in for a complicated day if I didn't start by clearly writing a few broken posts and figuring out what would be the most reliable way to fix them.

Quote

(And if you check the source, you'll see it's unmodified.) It works because there's the end of a block level tag with unresolved non block tags inside it. The exact behaviour is incredibly complicated, and is no doubt one of the reasons why the bbc parser is so big and scary - but it's also resilient.

But *once* it is redone through ::fixNesting, with my super-solid code etc, won't all of the remaining code become unnecessary all of a sudden...? :P

Quote

This is why I asked if the idea behind this was partly to reduce its complexity or not, because it actually does a lot of silent fixing that most people don't even realise.

The only fixing I've seen preparsecode do is add code tags at the beginning or end of a post which sucks a bit...
::parse_bbc does some fixing on its own, IIRC, but if it does, it's in the wrong place. It should be done at save time, obviously. (AND, we should remove any fixer code from ::parse_bbc to force modders to go through ::preparsecode. Believe me, I didn't even know this function existed before Shitiz used it in SMG, and I didn't have the *reflex* to use it systematically until, err.... Now?)

Quote

The problem as you've discovered with writing such a solution is that unless you can get inside the user's brain and figure out what they meant, rather than what they typed, you have no hope of getting it consistently right.

But I'm not writing an AI... Otherwise we won't release before 2015...
Things like, "Okay, this is a closing tag BUT it's at the beginning of a line *and* is immediately followed by content, so MAYBE it's an opener, let's try to turn it into an opener and see if it suddenly validates"... These things are doable, but they take time to implement.

(Well, that particular solution would still be a very good one to the first test case I posted.)
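A small sketch of that "maybe it's really an opener" heuristic, in Python with hypothetical names. It simplifies the idea in one respect, which is an assumption of this example: it only checks that the orphan closer is immediately followed by content, not that it sits at the beginning of a line, because the first test case contains no line breaks.

```python
import re

# Hypothetical list of paired tags, for illustration only.
PAIRED_TAGS = {"quote", "b", "s"}
TAG_RE = re.compile(r"\[(/?)([a-z]+)(?:[ =][^\]]*)?\]")

def is_balanced(text):
    """True if every paired tag is opened and closed in proper nesting order."""
    stack = []
    for m in TAG_RE.finditer(text):
        closing, name = m.group(1) == "/", m.group(2)
        if name not in PAIRED_TAGS:
            continue
        if not closing:
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False
    return not stack

def flip_orphan_closer(text):
    """Turn one orphan closer into an opener if that makes the post validate."""
    stack = []
    for m in TAG_RE.finditer(text):
        closing, name = m.group(1) == "/", m.group(2)
        if name not in PAIRED_TAGS:
            continue
        if not closing:
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        else:
            following = text[m.end():m.end() + 1]
            # Only consider closers that are immediately followed by content.
            if following and following not in "[\n":
                candidate = text[:m.start()] + "[" + text[m.start() + 2:]
                if is_balanced(candidate):
                    return candidate
    return None  # the heuristic does not apply; fall back to another fixer
```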

Arantor · « Reply #40, on December 8th, 2011, 10:18 AM »

Quote

But *once* it is redone through ::fixNesting, with my super-solid code etc, won't all of the remaining code become unnecessary all of a sudden...?

Does fixNesting deal with block tag nesting mismatches or just nesting mismatches? Also, once nesting and mismatches are fixed, we will also need to look at dependencies and must-contain/must-not-contain rules too, which are also specifiable in the bbc parser...

Quote

The only fixing I've seen preparsecode do is add code tags at the beginning or end of a post which sucks a bit...
::parse_bbc does some fixing on its own, IIRC, but if it does, it's in the wrong place. It should be done at save time, obviously. (AND, we should remove any fixer code from ::parse_bbc to force modders to go through ::preparsecode. Believe me, I didn't even know this function existed before Shitiz used it in SMG, and I didn't have the *reflex* to use it systematically until, err.... Now?)

preparsecode does a lot of things, actually. I think I listed what it did in a previous post, but as well as rearranging tags (albeit naively), it also fixes html bbc for non-admin users, cleans up nobbc so they're safely unparsed later, and attempts to validate the contents of url and img tags (which, to my mind, made the 1.1.11 update pointless, but that's just me, unless there's a deeper issue that needs resolution).

Part of the reason parse_bbc has it and not preparsecode is that posts added to the DB through other sources won't have come through preparsecode originally, and that's not just for modders (for example, this will include the importer unless we push every post through some kind of fixer during the import).

Still, considering that the html bbcode will do bad things if inserted manually and not through preparsecode, that nobbc may or may not work properly and other stuff, I'm inclined to think that it's OK to remove some of this stuff from parse_bbc and move it to the preparser, provided that the preparser is able to do *everything* that parse_bbc does. As I've alluded to, that is more than just ensuring tag nesting is sane: it also has rules on what tags can contain other tags (url can't contain another url, for example) and on what tags must be where (list requires 1+ li, li must be inside a list, and the entire table->tr->td chain, where both parent and children get evaluated).
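To make the containment rules a bit more concrete, here is a Python sketch of a rule check that runs on a post whose tags are already balanced. The rule tables and the name `check_rules` are hypothetical and only cover the examples named above. It also assumes "inside" means "directly inside", which may be stricter than what parse_bbc really enforces.

```python
import re

TAG_RE = re.compile(r"\[(/?)([a-z]+)(?:[ =][^\]]*)?\]")

# Hypothetical rule tables, loosely modelled on the examples in the post.
REQUIRED_PARENT = {"li": "list", "tr": "table", "td": "tr"}
REQUIRED_CHILD = {"list": "li", "table": "tr", "tr": "td"}
NO_SELF_NESTING = {"url"}

def check_rules(text):
    """Return a list of rule violations for an already balanced post."""
    problems = []
    stack = []  # (tag name, set of direct child tag names)
    for m in TAG_RE.finditer(text):
        closing, name = m.group(1) == "/", m.group(2)
        if not closing:
            parent = stack[-1][0] if stack else None
            if name in REQUIRED_PARENT and parent != REQUIRED_PARENT[name]:
                problems.append(f"[{name}] must sit directly inside [{REQUIRED_PARENT[name]}]")
            if name in NO_SELF_NESTING and any(n == name for n, _ in stack):
                problems.append(f"[{name}] cannot contain another [{name}]")
            if stack:
                stack[-1][1].add(name)   # remember this child for the parent's check
            stack.append((name, set()))
        elif stack and stack[-1][0] == name:
            _, children = stack.pop()
            if name in REQUIRED_CHILD and REQUIRED_CHILD[name] not in children:
                problems.append(f"[{name}] needs at least one [{REQUIRED_CHILD[name]}]")
    return problems
```

Keeping the rules in plain tables rather than in regular expressions is one possible way to allow things like adding th next to td later without rewriting the checker.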

I always wanted to move that to the preparser anyway to remove this dependency on strange, naive regexps that didn't allow for customising the table tag or adding th tags, without rewriting all that stuff as well.

Nao · « Reply #41, on December 8th, 2011, 06:53 PM »

Quote from Arantor on December 8th, 2011, 10:18 AM

Does fixNesting deal with block tag nesting mismatches or just nesting mismatches?

What do you mean?
It supposedly fixes any mismatched tags and that's all... It adds missing openers when it finds orphan closers, and it adds closers at the end to match orphan openers. And believe me, it's already hard enough to manage as it is... I've been on it all day, and it's still pissing me off right now. Granted, it's quite a complex string I'm working with (basically -- if it works with it, it'll work with everything), but right now my code is headache-inducing, and it starts failing after a few fixes... (hint: try not to insert data into an array you're *looping through at the moment*... I should probably reset the loop every time instead of trying to account for all changes...)
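One common way around the "inserting while looping" trap, sketched in Python with a hypothetical helper name: scan the original string without touching it, record each fix as an (offset, text) pair, and apply the fixes from the last offset to the first so that no recorded offset is invalidated.

```python
def apply_insertions(text, insertions):
    """Apply (offset, snippet) pairs, where offsets refer to the ORIGINAL text."""
    # Stable sort by offset, then go back to front: inserting at a later offset
    # never shifts the position of an earlier one.
    for pos, snippet in reversed(sorted(insertions, key=lambda item: item[0])):
        text = text[:pos] + snippet + text[pos:]
    return text
```

For example, `apply_insertions("[b]x[/quote]", [(4, "[/b]"), (0, "[quote]")])` gives `[quote][b]x[/b][/quote]`, whatever order the two fixes were recorded in.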

Quote

Also, once nesting and mismatches are fixed, we will also need to look at dependencies and must-contain/must-not-contain rules too, which are also specifiable in the bbc parser...

Well, I'm mostly looking at removing the 'light' fixers in preparsecode and parse_bbc, like the code tag fixer. And maybe the quote fixer... I think there's one around...

Quote

preparsecode does a lot of things, actually. I think I listed what it did in a previous post,

Yes, you did and I do remember it :P

Quote

Part of the reason parse_bbc has it and not preparsecode is that posts added to the DB through other sources won't have come through preparsecode originally, and that's not just for modders (for example, this will include the importer unless we push every post through some kind of fixer during the import).

Well... I suppose we SHOULD call, maybe not preparsecode, but at least any functions that are called in Wedge and not in SMF, e.g. fixNesting.

Quote

I always wanted to move that to the preparser anyway to remove this dependency on strange, naive regexps that didn't allow for customising the table tag or adding th tags, without rewriting all that stuff as well.

Plus, anything to make parse_bbc faster... ;)

Arantor · « Reply #42, on December 8th, 2011, 09:00 PM »

What I mean is, the code that fixes the above example triggers not because of the mismatch of b/s/quote but because b and s are described as non block tags and quote is a block tag. When the quote tag ends, it looks for any non block tags that are currently open, in nesting order, and closes them. The exact rules are much more complex than I've indicated but it's what forces cases like lists to not be able to contain quotes and for it to safely end the list before the quote.
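A very reduced Python sketch of that block/non-block distinction. The two tag sets and the function name are assumptions for illustration, and the real parse_bbc rules are, as noted, much more complex. The point is that the auto-closing is triggered by the end of a block tag, not by a mismatch between non-block tags.

```python
import re

TAG_RE = re.compile(r"\[(/?)([a-z]+)(?:[ =][^\]]*)?\]")

# Hypothetical classification of tags, for illustration only.
BLOCK_TAGS = {"quote", "list", "code"}
INLINE_TAGS = {"b", "i", "s", "u"}

def close_inline_at_block_end(text):
    """When a block tag ends, close any non-block tags still open inside it."""
    stack, out, pos = [], [], 0
    for m in TAG_RE.finditer(text):
        closing, name = m.group(1) == "/", m.group(2)
        if name not in BLOCK_TAGS | INLINE_TAGS:
            continue
        out.append(text[pos:m.start()])
        pos = m.end()
        if not closing:
            stack.append(name)
        elif name in BLOCK_TAGS and name in stack:
            # End of a block: close open inline tags in nesting order.
            while stack[-1] in INLINE_TAGS:
                out.append(f"[/{stack.pop()}]")
            if stack[-1] == name:
                stack.pop()
        elif stack and stack[-1] == name:
            stack.pop()
        # Any other closer (e.g. a mismatched inline one) triggers nothing here.
        out.append(m.group(0))
    out.append(text[pos:])
    return "".join(out)
```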

I also seem to recall that the x bbcode list item is also handled in a similar way, seeing how it is not handled by preparsecode but solely within the BBC parser, and in a way that's fragile. (Frankly I don't have a problem removing the one character list builder shortcuts because I don't know anyone that uses them and whenever I've seen them used, they always break unexpectedly)

The light fixers aren't major pieces of effort in parse_bbc but anything that goes does save effort on page loads generally too.

Nao · « Reply #43, on December 8th, 2011, 10:22 PM »

Yeah... Didn't think about this all...

Well, it's all pretty fucked up really. A waste of my time... -_-
Basically, it works 100%, until you meet some of the more complicated stuff I have in my test case. And I don't see how to fix it without, tadaaam... Another full rewrite!
If I do this... I'll probably have to, hmm... Do it recursively... Oh, I don't like that... :-/

Y'know, like, "if a tag is opened and is not self-closed, re-run the function on the string AFTER that tag, asking for it to return after it meets the closer tag..." And if it meets a closer that's not the one we're expecting, we'll just add our closer there (manually), and return from the function.
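The recursive variant described above could be sketched like this in Python. All names are hypothetical, and the sketch leaves top-level orphan closers alone, which is exactly the part that still needs a separate rule.

```python
import re

# Hypothetical list of paired tags, for illustration only.
PAIRED_TAGS = {"quote", "b", "s"}
TAG_RE = re.compile(r"\[(/?)([a-z]+)(?:[ =][^\]]*)?\]")
TAG_SPLIT = re.compile(r"(\[/?[a-z]+(?:[ =][^\]]*)?\])")

def fix_recursive(tokens, i=0, expected=None):
    """Process tokens from index i until the closer for `expected` is met.

    Returns (fixed text, index of the next token to process).
    """
    out = []
    while i < len(tokens):
        tok = tokens[i]
        m = TAG_RE.fullmatch(tok)
        if m is None or m.group(2) not in PAIRED_TAGS:
            out.append(tok)                # plain text or an unknown tag
            i += 1
        elif m.group(1) == "":
            out.append(tok)                # opener: re-run on what follows it
            inner, i = fix_recursive(tokens, i + 1, m.group(2))
            out.append(inner)
        elif m.group(2) == expected:
            out.append(tok)                # the closer we were waiting for
            return "".join(out), i + 1
        elif expected is not None:
            # Some other closer: add ours manually and let the caller handle it.
            out.append(f"[/{expected}]")
            return "".join(out), i
        else:
            out.append(tok)                # top-level orphan closer, left as-is
            i += 1
    if expected is not None:
        out.append(f"[/{expected}]")       # the post ended before our closer
    return "".join(out), i

def fix_text(text):
    # re.split with a capturing group keeps the tags as separate tokens.
    return fix_recursive(TAG_SPLIT.split(text))[0]
```

The recursion depth equals the tag nesting depth, so a deeply nested post would need a guard or an explicit stack instead.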

What annoys me the most is that every time, I get this very simple idea that ends up being flawed in one aspect or another... I just don't want to spend another day on that.

PS: any special tags that aren't in the list of double tags, like x or * or whatever, are not important here because Wedge will simply ignore them anyway.

PPS: my test case is as such. It chokes on the first [/nb] (which doesn't have a matching opener at this point because we already rewrote the first opener to add a closer to it.)

Code:

[/quote][quote author=Nao link=msg=1][b]
Lorem ipsum?
[/b][/quote]What is that?
I don't speak rubbish![nb]I'm wondering if Rory, though? He had time to learn...[ /code][ code][ nb][ /code][ /nb][quote]post1[/quote][b] [s] comment[/quote]post2[/quote]

Arantor · « Reply #44, on December 8th, 2011, 11:04 PM »

It's a complicated task at the very best of times :( And it shows in browsers too: when they have malformed tags to deal with, some ignore them totally, and some make assumptions with interestingly unpredictable results. I remember having a conversation with a fellow geek back in 2000 which illustrates this perfectly: he was building a site with a big complex data-heavy table in it, and it worked perfectly in IE but broke horribly in Netscape. As I discovered... he wasn't putting in any of the closing tags because 'it doesn't need them'. Well, it obviously does!

It's also a rabbit hole of a problem, in that no matter how clever you get, you can pretty much always find another example that will break it. The issue is where the line gets drawn.

This is also why I think we will need to adopt the logic used in the parser to unwind and reprocess the tag nesting, simply because it's a lot more than just having balanced tags. It's a pain to get right but the result will be worth it in the end.

FWIW, I've been trying to play with this today, not getting very far with it, just because I'm still trying to get my head around how the post parser really works.

What do you reckon about the special list of tags like x or *? Do we really need them? I actually think they cause more trouble than they're worth - and they're a perfect poster child of why this whole issue is a problem, since no-one seems to know how to actually safely end such a list.


Re: Fixing mismatched BBCode
