Nao

  • Dadman with a boy
  • Posts: 16,082
Re: Fixing mismatched BBCode
« Reply #45, on December 8th, 2011, 11:19 PM »
Quote from Arantor on December 8th, 2011, 11:04 PM
It's a complicated task at the very best of times :( And it shows in browsers too, when they have malformed tags to deal with, some ignore them totally, some make assumptions with interestingly unpredictable results. I remember having a conversation with a fellow geek back in 2000 which illustrates this perfectly: he was building a site with a big complex data-heavy table in it, and it worked perfectly in IE but broke horribly in Netscape. As I discovered... he wasn't putting any of the closing tags because 'it doesn't need them'. Well, it obviously does!
With most of the browsers now using a unified parsing model (the HTML5 parser), it's no longer an issue. Unfortunately that pseudo-code is too complex for a 'simple' bbc fixer, so it's not an option to me...
Quote
It's also a rabbit hole of a problem, in that no matter how clever you get, you can pretty much always find another example that will break it. The issue is where the line gets drawn.
The problem with my current code (not the committed code, which is bug-free AFAIK but doesn't do a fantastic job at figuring out where to add tags), is that it doesn't *work* in a certain situation, and I can't think of a way to make it work correctly because my code assumes things that, in that situation, aren't always valid....
Quote
This is also why I think we will need to adopt the logic used in the parser to unwind and reprocess the tag nesting, simply because it's a lot more than just having balanced tags. It's a pain to get right but the result will be worth it in the end.
What do you mean unwind and reprocess?
Quote
What do you reckon about the special list of tags like x or *? Do we really need them? I actually think they cause more trouble than they're worth - and they're a perfect poster child of why this whole issue is a problem, since no-one seems to know how to actually safely end such a list.
Yeah... It'd be a candidate for deletion, I suppose.

SPIP actually supports turning opening dashes automatically into bullet points. At least it did back in 2003... Ah, good old times.

Arantor

  • As powerful as possible, as complex as necessary.
  • Posts: 14,278
Re: Fixing mismatched BBCode
« Reply #46, on December 8th, 2011, 11:31 PM »
Quote
With most of the browsers now using a unified parsing model (the HTML5 parser), it's no longer an issue.
They still deal with mismatched tags and even malformed tags differently.
Quote
What do you mean unwind and reprocess?
What the parser ultimately does is step through the post and figure out the tags in play, and when it hits a closer (especially of a block level) or certain combinations of block level opener, it reviews all the tags that are open and closes some or maybe all of them.

The way it does it, ultimately, is not much different to constructing a DOM and traversing the node tree, albeit it does it in a linear rather than strict hierarchical fashion.

At each point, not only is the list of open tags maintained, plus block level evaluation, but the potential tags that can be contained inside each other, plus dependencies, are all reviewed too.
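
A rough sketch of that kind of linear unwinding, in Python rather than the parser's PHP. The function name `unwind` and the simple tag pattern are assumptions made for illustration, not SMF's or Wedge's actual code, and the sketch treats every tag alike instead of distinguishing block and inline tags:

```python
import re

# Simplified pattern: bare tags only, no attributes such as [quote author=...].
TAG_RE = re.compile(r"\[(/?)([a-z]+)\]")

def unwind(post):
    """Walk the post linearly, keeping a stack of open tags.

    When a closer matches a tag somewhere in the stack, every tag opened
    after it is closed first (innermost first), then the tag itself.
    Returns the rewritten post and the list of tags still open at the end.
    """
    out = []
    stack = []  # names of currently open tags, outermost first
    pos = 0
    for m in TAG_RE.finditer(post):
        out.append(post[pos:m.start()])
        pos = m.end()
        closing, name = m.group(1) == "/", m.group(2)
        if not closing:
            stack.append(name)
            out.append(m.group(0))
        elif name in stack:
            # Unwind down to, and including, the matching opener.
            while stack:
                top = stack.pop()
                out.append("[/%s]" % top)
                if top == name:
                    break
        else:
            # Orphan closer: copied through untouched in this sketch.
            out.append(m.group(0))
    out.append(post[pos:])
    return "".join(out), stack
```

For example, `unwind("[quote][i]Hello[/quote]")` closes the italic before the quote is closed. A real parser would also consult the containment and dependency rules at each step, which this sketch leaves out.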

There's no way to evaluate a post and find all the instances of url bbc and immediately say beyond doubt that they will always be converted to links, because they won't even if they are otherwise legal. As I said before, there are some rules that even parsed tags cannot contain other tags (url can't contain email or url amongst other things), as well as being able to go up and down the tree in a *very* limited fashion to resolve the table-tr-td chain, which is actually defined in the bbc as well as the rules in preparsecode...

In fact, there is a very old bug on SMF's bug tracker where a table tag is malformed but gets through because it fits the preparser's naive rules: there are rules defined for it, but they aren't actually complete. IIRC, the preparser would actually break the nesting anyway under those circumstances, and if the preparser were excluded, the main parser's handling would work properly and it would just safely fail to render the table.[1]

This is where it gets interesting. The parser will fix really obvious cases, where tags should be able to be safely closed (non block tags not being terminated when ending a block tag) but it chokes on bigger cases where nesting is broken. But if the nesting rules can be properly enforced at parser level without being broken by the preparser, tags will be safely unrendered until they can be fixed.

Now, if we were to move the full logic from the parser to the preparser, we'd be able to trap improperly nested tags too, and report on them properly.
Quote
Yeah... It'd be a candidate for deletion, I suppose.

SPIP actually supports turning opening dashes automatically into bullet points. At least it did back in 2003... Ah, good old times.
Here's the problem: what ends the list item? What ends the list?

This is one place where I'm actually slightly envious of wiki markup because it actually does it sanely. It has no assumptions about hierarchy, one line is one list item, and the first blank line after the fact is the end of the list. If only it were that simple with the one character shortcuts.
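
To make the wiki-style rule concrete, here is a small Python sketch. The `* ` marker and the function name `wiki_lists_to_bbc` are assumptions for illustration; as a simplification, any line that is not an item (a blank line included) ends the list:

```python
def wiki_lists_to_bbc(text):
    """Convert runs of '* item' lines into [list][li]...[/li][/list].

    One line is one item; the first line that is not an item
    (a blank line included) ends the list. No hierarchy is assumed.
    """
    out = []
    in_list = False
    for line in text.split("\n"):
        if line.startswith("* "):
            if not in_list:
                out.append("[list]")
                in_list = True
            out.append("[li]" + line[2:].strip() + "[/li]")
        else:
            if in_list:
                out.append("[/list]")
                in_list = False
            out.append(line)
    if in_list:
        out.append("[/list]")  # the list ran to the end of the post
    return "\n".join(out)
```

Because the end of the line ends the item and the first non-item line ends the list, there is never an open item or an open list left to guess about.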

Mind you, if they worked more consistently, it might encourage more use of them.
 1. The alternative is extra closers being injected, causing layout malfunction, seeing how this started out in SMF 1.1.x with its tabular layout.

Nao

Re: Fixing mismatched BBCode
« Reply #47, on December 9th, 2011, 12:12 AM »
Oh... Not a good sign... I'm already lost in my recursive code... :-/

PHP has a PECL library for bbcode parsing, but it only returns html, not fixed bbc.
There's also a PEAR library written in PHP, it seems to be nicely written albeit a bit large (30KB for main source + more for specific tags like url...), and doesn't provide a fixer by default -- but it does seem to fix tags on its side.
I've also found "NBBC" on sourceforge, it has an extensive test suite and documentation. Unfortunately it doesn't provide a 'check' mode either, only converts directly to html but at least this one is BSD.
It doesn't generate opening tags automatically when finding an orphan closed tag, though... So apart from its alleged speed, it's not that interesting. Plus, it's huge.
Quote from Arantor on December 8th, 2011, 11:31 PM
Quote
With most of the browsers now using a unified parsing model (the HTML5 parser), it's no longer an issue.
They still deal with mismatched tags and even malformed tags differently.
I think the whole point of the html5 tokenizer is to provide for a common ground to handle errors...?
Quote
What the parser ultimately does is step through the post and figure out the tags in play, and when it hits a closer (especially of a block level) or certain combinations of block level opener, it reviews all the tags that are open and closes some or maybe all of them.
Yes, that's pretty much what fixNesting does...
Quote
At each point, not only is the list of open tags maintained, plus block level evaluation, but the potential tags that can be contained inside each other, plus dependencies, are all reviewed too.
NBBC does that. (Wedge, too, obviously. But not fixNesting.)
Quote
Now, if we were to move the full logic from the parser to the preparser, we'd be able to trap improperly nested tags too, and report on them properly.
The reporting is already being done... If it were JUST about that, I wouldn't have worked on the alternative 'fixer' today.
Unfortunately I can't really feel satisfied with just the report code because it's not going to be used in quick edit etc.
(Well... Unless we add an errorbox for quick edit, of course. Which would not be such a bad idea...)
Quote
Here's the problem: what ends the list item? What ends the list?
The end of the line? :P
Quote
This is one place where I'm actually slightly envious of wiki markup because it actually does it sanely. It has no assumptions about hierarchy, one line is one list item, and the first blank line after the fact is the end of the list. If only it were that simple with the one character shortcuts.
We could always draw our inspiration from the best elements in wiki code...

Anyway. Time for bed.
I would have posted my 'bad work in progress' from today, but the source code is pretty fucked up (commented out code, echos and print_rs everywhere...), and I don't want to ruin my reputation :P

Arantor

Re: Fixing mismatched BBCode
« Reply #48, on December 9th, 2011, 12:20 AM »
Quote
Yes, that's pretty much what fixNesting does...
Yes, but from what I understood of it, it was mostly being done on mismatched tags only; it wasn't fixing or reporting nesting cases that are syntactically correct but semantically invalid. A table without a tr in it, would it flag that up as an error? An li not inside a list? These are the cases I'm thinking of, not just mismatched tags. Properly handling that would solve a number of problems.
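
As an illustration of that kind of semantic check, a Python sketch follows. The rule tables `REQUIRED_PARENT` and `REQUIRED_CHILD` and the function `nesting_errors` are hypothetical, invented for this example; the real rules live in the bbc definitions and in preparsecode:

```python
import re

# Hypothetical rules: child tag -> parents it is allowed to sit directly in.
REQUIRED_PARENT = {"li": {"list"}, "tr": {"table"}, "td": {"tr"}}
# Hypothetical rules: container tag -> child it must directly contain.
REQUIRED_CHILD = {"table": "tr", "tr": "td"}

TAG_RE = re.compile(r"\[(/?)([a-z]+)\]")

def nesting_errors(post):
    """Report tags that are balanced but semantically misplaced."""
    errors = []
    stack = []  # entries are [tag name, set of direct child tag names]
    for m in TAG_RE.finditer(post):
        closing, name = m.group(1) == "/", m.group(2)
        if not closing:
            parent = stack[-1][0] if stack else None
            allowed = REQUIRED_PARENT.get(name)
            if allowed is not None and parent not in allowed:
                errors.append("%s outside %s" % (name, "/".join(sorted(allowed))))
            if stack:
                stack[-1][1].add(name)
            stack.append([name, set()])
        elif stack and stack[-1][0] == name:
            _, children = stack.pop()
            needed = REQUIRED_CHILD.get(name)
            if needed and needed not in children:
                errors.append("%s without %s" % (name, needed))
        # Mismatched closers are ignored: this pass assumes balanced input.
    return errors
```

Here `nesting_errors("[table][/table]")` reports a table without a tr, and `nesting_errors("[li]x[/li]")` reports an li outside a list, even though both inputs have perfectly matched tags.
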
Quote
(Well... Unless we add an errorbox for quick edit, of course. Which would not be such a bad idea...)
It always amazed me that even in the event of a failure, there wasn't proper handling of errors going back to the user.
Quote
The end of the line?
That's the theory but it seems to be broken. Every time I've used the shortcodes, the list items terminate correctly but the list itself doesn't, so I never use them.
Quote
We could always draw our inspiration from the best elements in wiki code...
It makes me wonder at times whether a full bbcode solution really is the best one. It certainly is for some cases (e.g. non-standard/site specific cases) and the quote tag is certainly better handled than the native HTML equivalent (and wiki markup's idea of quoting is... laughable), but it certainly does make for debate about whether bbcode is an overcomplication in some ways.
Quote
There's also a PEAR library written in PHP, it seems to be nicely written albeit a bit large (30KB for main source + more for specific tags like url...), and doesn't provide a fixer by default -- but it does seem to fix tags on its side.
I'm always slightly wary of PEAR code. I've had a lot of bad experiences. But I can't say I'd be entirely surprised at its size, depending on how thoroughly it's doing the job.

Nao

Re: Fixing mismatched BBCode
« Reply #49, on December 9th, 2011, 12:25 AM »
- only handles mismatched tags, but it's still a lot. I never really planned to rewrite the entire parser... then again, nothing prevents us from writing similar code for other uses.
- nothing prevents us from doing that either... (?)
- well, i dunno then.
- i certainly like the idea of a very fast parser, but we'd have to determine if it features code that can fix bbc without turning it to html. I doubt it has.
- PEAR isn't important here -- just requires a few rewrites to give up on the dependency. then again -- NBBC is > 120KB (60KB after some sort of minification), so that pear library isn't that horrible to begin with.

Arantor

Re: Fixing mismatched BBCode
« Reply #50, on December 9th, 2011, 01:13 AM »
Well, going to a bigger level, consider WYSIWYG. WYSIWYG in most systems removes the need for parsing entirely because you validate it on save and you're done, you just output raw content later. Notice also that in most cases with WYSIWYG, you also don't have to contend with tag mismatches much, though the sanitisation layer should still fix any issues just in case.

This is why I'm wondering whether bbcode is entirely the right tool for this job. For simple formatting, sure, but we can use basic HTML for that too. For more complex formatting, it's a trade off between the two (tables in particular aren't a lot different between basic HTML and bbcode) while things like lists can be done much more nicely in wiki markup than in bbcode.

It does imply that we might look beyond using bbcode anyway and would trim the fat by having it done in other places and other ways. But I do like the consistency of everything being bbcode, to a point.

Hmm. It's complicated.

Nao

Re: Fixing mismatched BBCode
« Reply #51, on December 9th, 2011, 09:59 AM »
- wysiwyg is still hell... Saving as raw data is worse. Especially when you start changing the output for a bbcode...
- the main issue with bbc vs html, I think, is that most forum users are used to bbcode. So, basically, even if you enabled basic html tags by default, we would still have to support bbcode for those who don't know about html etc... (at best, we could turn it into pseudo-html at parse time;)
- wiki markup isn't very popular, I reckon. If popularity wasn't an issue, I'd have switched to SPIP code long ago :P
- overall... yeah, bbcode is about consistency, mainly. Although I suppose nothing prevents us from adding alternative pseudo-code for people to choose from. But it always makes things more complicated.

I've given up on the recursive code. Anything that ruins my sleep is not something I should keep really. It's a little alarm clock in my head. So I'm back to my 'original' code and instead of going through the list of recorded tags, I'm going back through the stack... It's probably not going to give fantastic results - for instance, I was storing the tag type until now. If you were trying to find the best place for an opener for an orphan '/i', Wedge would go through previous tags, spot a closing quote and stop immediately because it can't enclose quotes inside italics. Things like that... That was pretty cool, but it doesn't work for that friggin' closer nb I mentioned before, and one day hunting for a bug is enough. I'll just make it pretty simple. I HOPE. No guts, no glory. Whatever.
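
The "stop at the quote" idea described above can be sketched in Python as follows. The `BLOCK_TAGS` set and the function `opener_position` are assumptions for illustration only and are not Wedge's actual code:

```python
import re

# Assumed set of block-level tags; the real list comes from the bbc definitions.
BLOCK_TAGS = {"quote", "code", "list", "li", "table", "tr", "td"}
TAG_RE = re.compile(r"\[(/?)([a-z]+)\]")

def opener_position(post, closer_start):
    """Return the index where an opener for an orphan inline closer could go.

    Looks at the tags that precede `closer_start` and keeps the end of the
    last block-level tag (opener or closer) among them, since an inline tag
    cannot enclose a block. With no block tag before the closer, the opener
    would go at the very start of the post.
    """
    position = 0
    for m in TAG_RE.finditer(post, 0, closer_start):
        if m.group(2) in BLOCK_TAGS:
            position = m.end()
    return position
```

With `post = "[quote]Hi[/quote] some text[/i]"`, the position returned for the orphan `[/i]` is right after `[/quote]`, so the inserted `[i]` never wraps the quote.
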
Re: Fixing mismatched BBCode
« Reply #52, on December 9th, 2011, 10:46 AM »
Code:
[quote][i] [b] Hello [/i] [/b] [/s][/quote]

In a situation like this, my original code would first fix the mismatched tags in the middle, then it would look for an opener to 's' and eventually add one at the very beginning... Which would break everything because there's a quote in between. So I added a 'last_safe' variable which pointed out what was the last *safe* place to insert something (i.e. anything BEFORE it is considered valid and thus shouldn't be messed with again.)
Problem is, in a situation like that, the variable would be, at best[1] set to... The s closer's position. So we'd end up with an opener, immediately followed by a closer.

So... I'd like to know which you think is best. Shall we:

1- silently remove closer tags if no openers were found?

Code:
[quote][i] [b] Hello [/b] [/i][/quote]

2- add an opener right before them?

Code:
[quote][i] [b] Hello [/b] [/i] [b] [/b] [s] [/s][/quote]

3- leave them be, whatever, except maybe for code tags?

Code:
[quote][i] [b] Hello [/b] [/i] [/b] [/s][/quote]
Posted: December 9th, 2011, 10:44 AM

:edit: Added a footnote. BTW, my favorite is (1), personally...
 1. I'm saying "at best" because there's still an open quote at this point, so I'd have to implement code that would check LATER tags to make sure the quote is actually closed itself and is thus safe...
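
For comparison, the three options can be sketched side by side in Python. The function `handle_orphan_closers` and its `mode` values are made up for this illustration; it only deals with closers that have no opener at all and assumes the other mismatches were fixed beforehand:

```python
import re

TAG_RE = re.compile(r"\[(/?)([a-z]+)\]")

def handle_orphan_closers(post, mode):
    """Apply one of the three policies to closers that have no opener.

    mode: "remove" (option 1), "pair" (option 2) or "keep" (option 3).
    """
    out = []
    open_tags = []
    pos = 0
    for m in TAG_RE.finditer(post):
        out.append(post[pos:m.start()])
        pos = m.end()
        closing, name = m.group(1) == "/", m.group(2)
        if not closing:
            open_tags.append(name)
            out.append(m.group(0))
        elif name in open_tags:
            # Forget the most recent matching opener.
            idx = len(open_tags) - 1 - open_tags[::-1].index(name)
            del open_tags[idx]
            out.append(m.group(0))
        elif mode == "remove":
            pass  # option 1: drop the closer silently
        elif mode == "pair":
            out.append("[%s]" % name + m.group(0))  # option 2: opener right before
        else:
            out.append(m.group(0))  # option 3: leave it alone
    out.append(post[pos:])
    return "".join(out)
```

Option 1 leaves the cleanest bbcode, option 2 produces empty tag pairs, and option 3 keeps whatever the user typed, which is the only one that never hides the mistake from them.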

Arantor

Re: Fixing mismatched BBCode
« Reply #53, on December 9th, 2011, 10:52 AM »
Quote
- wysiwyg is still hell... Saving as raw data is worse. Especially when you start changing the output for a bbcode...
Then don't change the output for bbcode and just save as raw data and be done with it. Other systems do this, and it would ease the parse_bbc overhead.
Quote
- the main issue with bbc vs html, I think, is that most forum users are used to bbcode
Not sure I agree with that. I suspect more people would use the WYSIWYG editor if it weren't so buggy. Hell, I could see myself using it if it had proper keybindings (which IIRC some of the editors do support), because it's even faster to hit Ctrl-B to go bold than it is to use the bbcode.

The biggest stumbling block to WYSIWYG will always be complex bbcode that has no direct equivalent in HTML, e.g. the code or quote tags, not the simple stuff.
Quote
- wiki markup isn't very popular, I reckon. If popularity wasn't an issue, I'd have switched to SPIP code long ago
Well, I'm also not sure I agree with that. It depends more on the aspect of it you're looking at.

Wiki link syntax isn't clever in a forum that supports bbcode, due to overloading on the [] operators, not to mention the fact that you're not generally linking to conveniently titled pages (though the external link syntax is usable enough outside of operator overloading). The : indentation operator is a bit of a joke (because you don't do that in a forum generally), and pipe syntax for tables is beyond a headache. But the place where wiki markup really shines is in lists.

It's almost impossible to screw up a wiki syntax list or if you do, it's usually not difficult to get it to what you want. The caveat is that it's a single line only per entry as opposed to a fully delineated entity (e.g. with li tags that can contain paragraphs, though I'm not sure that's a huge problem)

I agree about the consistency factor of bbcode, overall, it does make things consistent both for the user and the code, but it might be nice to provide wiki-style conventions as well. Probably as a plugin, though, just because it's an extra level of complexity otherwise.

It's a brave thing you've taken on there - overhauling the preparser/parser for tag mismatches was always on my todo list but I haven't been brave enough to tackle it just yet.
Quote
3- leave them be, whatever, except maybe for code tags?
I'm inclined to go with this, provided that the user is made aware that there was a change and that they might want to review the post (since an extra b closer tag has been added and there is now going to be an unparsed b closer left behind) and hopefully they'll spot that the s is unopened.

We can't reach inside their mind and to a point we shouldn't be trying to do so. The most reliable of cases we can do something about but where there's any room for error, leave it be.

Nao

Re: Fixing mismatched BBCode
« Reply #54, on December 9th, 2011, 12:03 PM »
Quote from Arantor on December 9th, 2011, 10:52 AM
Then don't change the output for bbcode and just save as raw data and be done with it.
Not changing the output is opening the door to many problems... No?
There are many cases where it can be 'fixed' by using CSS and only applying the changes to them, but when you start using JS in a bbc tag...
Quote
Not sure I agree with that. I suspect more people would use the WYSIWYG editor if it weren't so buggy. Hell, I could see myself using it if it had proper keybindings (which IIRC some of the editors do support), because it's even faster to hit Ctrl-B to go bold than it is to use the bbcode.
Hmm... I dunno.
Well, if we could get the wysiwyg editor to actually show quotes as HTML, it would certainly solve a lot of issues. (And that's where my automatic quote splitter would come in very handy, because that'd really be the only way to split quotes at all...)
Quote
The biggest stumbling block to WYSIWYG will always be complex bbcode that has no direct equivalent in HTML, e.g. the code or quote tags, not the simple stuff.
<we:code> and <we:quote>, that being said...
Quote
It's a brave thing you've taken on there - overhauling the preparser/parser for tag mismatches was always on my todo list but I haven't been brave enough to tackle it just yet.
At the risk of disappointing you -- I didn't plan for an overhaul or anything... I simply wanted to change SMF's behavior of fixing mismatched tags when it could simply return an error. Then it evolved into pointing out the exact location of the error... Then it evolved again into fixing the errors automatically when Wedge didn't, or couldn't, expect a $post_errors return message.
Quote
I'm inclined to go with this, provided that the user is made aware that there was a change and that they might want to review the post (since an extra b closer tag has been added and there is now going to be an unparsed b closer left behind) and hopefully they'll spot that the s is unopened.
Hmm... I don't know, maybe we could silently remove any non-block tags, because they're mostly for details, while we could leave in the block tags where the 'bug' might actually break the post layout...?

Arantor

Re: Fixing mismatched BBCode
« Reply #55, on December 9th, 2011, 06:38 PM »
Quote
Not changing the output is opening the door to many problems... No?
Not if you've ascertained that it's valid HTML and you always treat it as such (like always throwing it back and forth to the WYSIWYG editor)
Quote
There are many cases where it can be 'fixed' by using CSS and only applying the changes to them, but when you start using JS in a bbc tag...
As Karl pointed out when he brought the subject up, there are libraries out there that will allow you to exclude scripting and other similar unwanted content. There's no reason why any other tags can't be excluded either.
Quote
Hmm... I dunno.
Well, if we could get the wysiwyg editor to actually show quotes as HTML, it would certainly solve a lot of issues. (And that's where my automatic quote splitter would come in very handy, because that'd really be the only way to split quotes at all...)
This is where it gets very problematic. None of the hybrid solutions I've seen that support both WYSIWYG and bbcode do this. They always leave the bbcode alone and render it from bbcode to HTML as needed.

The only practical alternative is to have it be able to identify the quote tag (and all the header-y bits) and be able to convert it, but if people will want to customise the look of quote tags that can't be done in CSS... it's going to be ugly.
Quote
<we:code> and <we:quote>, that being said...
That being said, it won't render properly in the context of WYSIWYG. WYSIWYG editors, conceptually, are voodoo, and not nice voodoo. (They all work principally the same way: take an iframe, receive all events while the 'textbox' has focus, and transmit all the events to handlers to manipulate the iframe as if it were a true RTE component.)

That's why SMF's code is so buggy, the different browsers generate different HTML fragments in the iframe.
Quote
At the risk of disappointing you -- I didn't plan for an overhaul or anything... I simply wanted to change SMF's behavior of fixing mismatched tags when it could simply return an error. Then it evolved into pointing out the exact location of the error... Then it evolved again into fixing the errors automatically when Wedge didn't, or couldn't, expect a $post_errors return message.
No disappointment at all. Anything that affects tag processing is a scary business.
Quote
Hmm... I don't know, maybe we could silently remove any non-block tags, because they're mostly for details, while we could leave in the block tags where the 'bug' might actually break the post layout...?
That seems reasonable, except for the cases when people do post things without realising that they'll have interesting side effects.

For example, more than once I've seen people shorten [Unknown]'s name to [U], and be surprised at the result. The reason for their surprise is that other systems just silently fail to render the tag at all if there isn't a safe matching closer (IIRC vBulletin does/did this)

There's no one right answer for this.

Nao

Re: Fixing mismatched BBCode
« Reply #56, on December 9th, 2011, 10:41 PM »
Okay, it's like there's an endless list of bugs to fix with this...
First, my code was foolishly taking care of mismatched tags INSIDE code tags. Woops...
Obviously, this had to go. Done. I'm also not fixing anything inside nobbc tags. Anything else?
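
One way to keep a fixer away from code and nobbc contents is to mask those regions before fixing and restore them afterwards. The Python sketch below assumes a hypothetical `fixer` callable, bare tags without attributes, and a post that contains no NUL characters; none of this is taken from the actual Wedge code:

```python
import re

# Properly closed code/nobbc blocks; an unclosed [code] is not matched here.
PROTECTED_RE = re.compile(r"\[(code|nobbc)\].*?\[/\1\]", re.DOTALL)

def fix_outside_protected(post, fixer):
    """Run `fixer` on the post while leaving code/nobbc blocks untouched."""
    saved = []

    def stash(m):
        # Swap the whole block for a placeholder the fixer will not recognise.
        saved.append(m.group(0))
        return "\x00%d\x00" % (len(saved) - 1)

    masked = PROTECTED_RE.sub(stash, post)
    fixed = fixer(masked)
    # Put the original blocks back, byte for byte.
    return re.sub(r"\x00(\d+)\x00", lambda m: saved[int(m.group(1))], fixed)
```

Masking rather than splitting the post means the fixer still sees a tag that opens before a code block and closes after it as one pair.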

Now to an interesting bug... My s tag (strike) was not closing correctly. What the hell?
Turns out -- and I only found out after a painful debugging session -- that 's' is actually considered as a block-level tag by Wedge... This isn't the case in SMF, so I'd say it's a typo by the unknown author who moved bbcodes to the database... :whistle:
Also, and this one IS in SMF, the br tag is not block-level, while the hr tag is. It doesn't sound very logical to me... I'd tend to say they both are, though. Not that it will matter in my code anyway because it skips self-closing tags.
I'd also consider img tags to be block level... Shouldn't they be?
li tags and things like that are block level, and THAT bothers me a bit... Because my code treats block level tags in a special way, and I suspect this wouldn't properly add the closer tags, and instead accumulate tags. i.e. [li]Hello[li]world[/li] would add a closer after the last closer, instead of before the last opener.
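
A sketch of the "closer before the next opener" behaviour, in Python. It assumes the user meant sibling items and ignores nested lists entirely, and the function name `close_sibling_items` is made up for the example:

```python
import re

LI_RE = re.compile(r"\[(/?)li\]")

def close_sibling_items(text):
    """Close an open [li] before the next [li] opener (flat lists only)."""
    out = []
    pos = 0
    item_open = False
    for m in LI_RE.finditer(text):
        out.append(text[pos:m.start()])
        pos = m.end()
        if m.group(1):            # a [/li] closer
            item_open = False
        elif item_open:           # a [li] while an item is already open
            out.append("[/li]")   # close the previous item first
        else:
            item_open = True
        out.append(m.group(0))
    out.append(text[pos:])
    return "".join(out)
```

With this, `[li]Hello[li]world[/li]` becomes `[li]Hello[/li][li]world[/li]` instead of accumulating closers at the end.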

Also. I added the code to turn a closer quote into an opener if it turned out it's at the start of a line and is followed by anything but a newline. Then I figured, okay MAYBE people could actually start a new line, type in [ /quote] and then immediately after, another [/ quote] to close a nested quote. Or even [ /code] or whatever... So I added a test for '[/' after the tag. Now, I suppose there are suddenly dozens of new ways to break this... Obviously I suppose I could test 'simply' for a-z... I don't know. It's hell.
And all of that because I sometimes would have one of these closers at the start of a line when I meant to be using an opener.
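
The test described above could look something like this Python sketch. The pattern and the function name `closers_to_openers` are assumptions, and the sketch does not check whether the closer is actually an orphan, which a real fixer would do first:

```python
import re

# A [/quote] at the start of a line that is NOT followed by a line break,
# the end of the post, or another closer.
MISTYPED_RE = re.compile(r"^\[/quote\](?!\r?\n|\Z|\[/)", re.MULTILINE)

def closers_to_openers(post):
    """Turn a line-leading [/quote] that is followed by text into [quote]."""
    return MISTYPED_RE.sub("[quote]", post)
```

So `[/quote]Hello[/quote]` becomes `[quote]Hello[/quote]`, while a closer sitting alone on its line, or one directly followed by another closer, is left as it is.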

Sometimes I wonder if we shouldn't just, ahem, call parse_bbc inside preparsecode, and have parse_bbc return a pseudo-BBC version instead of a html version... Would probably make things... easier. For me. Ah ah.
Well, I'm saying that but I've nearly reached the end... (And then in a week's time I'll still be on it........... It's just a fucking pre-parser, arghhh!!)
Quote from Arantor on December 9th, 2011, 06:38 PM
This is where it gets very problematic. None of the hybrid solutions I've seen that support both WYSIWYG and bbcode do this. They always leave the bbcode alone and render it from bbcode to HTML as needed.
I don't see where it would be impossible to implement.
We just need to always have the same code for quotes for instance... I mean, the current code generator is very simple. It can easily be emulated through jQuery to actually add that code to the post.... Heck, we could even take the quote's HTML code from within the database, and put it into a JS string...! That's the logical way to do it, even...
Quote
That being said, it won't render properly in the context of WYSIWYG. WYSIWYG editors, conceptually, are voodoo, and not nice voodoo. (They all work principally the same way: take an iframe, receive all events while the 'textbox' has focus, and transmit all the events to handlers to manipulate the iframe as if it were a true RTE component.)
That, and contentEditable.
Quote
For example, more than once I've seen people shorten [Unknown]'s name to [U], and be surprised at the result. The reason for their surprise is that other systems just silently fail to render the tag at all if there isn't a safe matching closer (IIRC vBulletin does/did this)
That's... interesting.
But then again, as important as Unknown is to the SMF community (it's a no-brainer), I'm prepared to have his name removed entirely from posts if people shorten it to 'u'...
Quote
There's no one right answer for this.
In our case, with my code, the missing u closer will be added.

Arantor

Re: Fixing mismatched BBCode
« Reply #57, on December 9th, 2011, 11:00 PM »
Quote
Okay, it's like there's an endless list of bugs to fix with this...
First, my code was foolishly taking care of mismatched tags INSIDE code tags. Woops...
Obviously, this had to go. Done. I'm also not fixing anything inside nobbc tags. Anything else?
The contents of html tags should be left untouched too.
Quote
Now to an interesting bug... My s tag (strike) was not closing correctly. What the hell?
Turns out -- and I only found out after a painful debugging session -- that 's' is actually considered as a block-level tag by Wedge... This isn't the case in SMF, so I'd say it's a typo by the unknown author who moved bbcodes to the database...
The hell? I see why I made the mistake now, though, there's a couple immediately before it and I just hit the wrong thing out of repetition.
Quote
Also, and this one IS in SMF, the br tag is not block-level, while the hr tag is. It doesn't sound very logical to me... I'd tend to say they both are, though. Not that it will matter in my code anyway because it skips self-closing tags.
I'd also consider img tags to be block level... Shouldn't they be?
hr is a block tag in HTML, br and img are not.

Putting aside the semantic changes in HTML5 surrounding the a tag, consider it this way: would you consider putting an img inside an a to be valid? Answer: yes. Can't be a block tag then, because you're implicitly allowing it to be running inline.

Same question for br... answer: yes. Same reason. Although it has a layout effect, it's not something that would logically or semantically split a block element. Splitting text nodes, sure, but not splitting an element semantically. A link with a br in it would just be a vertically set link.

Both br and img are special tags in their own right because they're the class of tag that isn't markup by definition (hr is part of the same group): they have content-replacement as their focus. Each tag replaces itself with content at render time; they're not affecting presentation or structure of any other content (as a div or a span or a strong tag might). But hr doesn't logically fit being part of a link, and you'd (theoretically) never use it in an inline context, and I don't think it's valid for it to be either.
Quote
li tags and things like that are block level, and THAT bothers me a bit... Because my code treats block level tags in a special way, and I suspect this wouldn't properly add the closer tags, and instead accumulate tags. i.e. [li]Hello[li]world[/li] would add a closer after the last closer, instead of before the last opener.
Here's the thing. Do we know what the user was intending to do with this? Ignoring what preparsecode will do with that, that is. Is it meant to be two separate list items, or a list item containing a sublist of another item?

We have two choices. We can manipulate it to the 'most likely' case of affairs, or we can avoid rendering it. If the regexps in preparsecode were not run, I can actually tell you what the parser would do: it would render that as a single list item with the bare li bbcode unmatched in the middle.[1]

Trouble is, with cases like this, it's sufficiently ambiguous (block or not) that we can't really make a call on how it should be changed.

FWIW, list items are considered block items because they can contain lists which are also implicitly block items.
Quote
Also. I added the code to turn a closer quote into an opener if it turned out it's at the start of a line and is followed by anything but a newline. Then I figured, okay MAYBE people could actually start a new line, type in [ /quote] and then immediately after, another [/ quote] to close a nested quote. Or even [ /code] or whatever... So I added a test for '[/' after the tag. Now, I suppose there are suddenly dozens of new ways to break this... Obviously I suppose I could test 'simply' for a-z... I don't know. It's hell.
People are dumb. Make the software smarter and dumber people come along. We can't predict how they'll do stuff, and to a point I sort of think we shouldn't try to prejudge the intent of users anyway.
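For reference, the heuristic described in the quote above (a closer at the start of a line, followed by anything but a newline or another closer, becomes an opener) could be sketched like this in Python. The pattern and function name are assumptions for illustration, it only handles a lowercase quote tag, and it is not the code from Wedge.

```python
import re

# A [/quote] at the very start of a line, followed by something that is
# neither a line break nor the start of another closing tag ("[/").
MISPLACED_CLOSER = re.compile(r"^\[/quote\](?=[^\r\n])(?!\[/)", re.MULTILINE)


def closer_to_opener(text):
    """Turn a line-leading [/quote] that is followed by content into [quote]."""
    return MISPLACED_CLOSER.sub("[quote]", text)
```

A closer alone on its line, or one followed directly by another closer, is left untouched. Anything else at the start of a line is treated as a mistyped opener, which is precisely the kind of guess about intent being questioned here.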
Quote
Sometimes I wonder if we shouldn't just, ahem, call parse_bbc inside preparsecode, and have parse_bbc return a pseudo-BBC version instead of a html version... Would probably make things... easier. For me. Ah ah.
Well, I'm saying that but I've nearly reached the end... (And then in a week's time I'll still be on it........... It's just a fucking pre-parser, arghhh!!)
Something along these lines, yes, but I'd move/clone the code, I wouldn't supersize parse_bbc to be a hybrid HTML and legal bbc parser, since the uses are rather different even if the underlying logic is the same.
Quote
I don't see where it would be impossible to implement.
We just need to always have the same code for quotes for instance... I mean, the current code generator is very simple. It can easily be emulated through jQuery to actually add that code to the post.... Heck, we could even take the quote's HTML code from within the database, and put it into a JS string...! That's the logical way to do it, even...
Which is fine until someone modifies it. Or manually tries to add it. There's all kinds of ways that could be broken, which is why no-one does it.
Quote
That, and contentEditable.
I'm not 100% sure contentEditable is required, you know. I don't think certain browsers (glares at IE6) support it.
Quote
That's... interesting.
But then again, as important as Unknown is to the SMF community (it's a no-brainer), I'm prepared to have his name removed entirely from posts if people shorten it to 'u'...
Thing is, that's what already happens in SMF. The missing u will already be cleaned up softly by parse_bbc at the end of a block tag or end of the post if it hasn't otherwise been closed, and it will underline the content. That's what I meant by unexpected behaviour.
Quote
In our case, with my code, the missing u closer will be added.
In our case we make the same decision, only it's stored in the post rather than being added softly at render time.
 1. Well, that's what my understanding of the requires_parent code intimates it should do, put it that way, since requires_parent for li says it requires a list bbcode as a parent and it won't have one, so it should be left alone...

Nao

  • Dadman with a boy
  • Posts: 16,082
Re: Fixing mismatched BBCode
« Reply #58, on December 10th, 2011, 06:34 PM »
I'm nearly done, and that's a feat really... Considering it took me two days to get it right...
I've only got a couple of quote tags opened at the end for no reason. There was an annoying number of bugs in my code which caused all of the problems -- most notably, I'm a bit ashamed of it, an array which I was accessing as if its first entry was $array[1]... Oops. No wonder it kept failing.

Anyway, I'm pretty disappointed overall with the 'dumbness' of the code, in that it doesn't really do much to help. It adds openers in boring places but at least it does it. Could be worse... Also, I'm still unable to get it right when it comes to block tags and inline tags. If you have, for instance, [b][s][/quote] and there's no quote opener in the post, Wedge won't even (as of now) close the b and s tags, it'll simply add an opener before the closer quote... It is definitely dumb. (It correctly closes the tags if it can find an opener quote before the b and s tags.) Problem is, I'm not even sure how best to fix this string; more precisely, where should I add the quote opener...? My gut tells me that it should be after the LAST closed block tag (if it finds any).
Oh, and don't get me started on more complicated setups of block tags being inserted inside non-block tags... I don't think I have any way to protect against these ultimately. Add to that the fact that these tags are to be inserted inside the original post, without its surrounding crap etc... And I haven't even started to consider checking whether a block closer is at the beginning of a line and immediately followed by contents -- in which case it should be turned into an opener for sure. It's so deadly... And tiring. And headache-inducing (literally.)

PS: this post needs no reply, I wrote it yesterday afternoon and forgot to post it... It's just a bonus post. I haven't worked on the feature today, as I'm very, very wary of the difficulties I know I will meet when finishing it. I'm quite tempted to just close unclosed tags when meeting a block closer and then add the block opener just before the closer... And as for non-block closers, just remove them entirely. Uh.
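The fallback described in the PS (close unclosed tags on a block closer, add the missing block opener just before that closer, drop orphan non-block closers) can be sketched in Python as follows. This is an illustration under stated assumptions, not the Wedge implementation: the BLOCK_TAGS set and the fix_nesting name are hypothetical, tags with attributes are ignored, and closing leftover tags at the end of the post is an extra assumption.

```python
import re

TAG = re.compile(r"\[(/?)([a-z]+)\]")
BLOCK_TAGS = {"quote", "code", "list", "li"}  # hypothetical list of block tags


def fix_nesting(text):
    out, stack, pos = [], [], 0
    for m in TAG.finditer(text):
        out.append(text[pos:m.start()])
        pos = m.end()
        is_closer, name = m.group(1) == "/", m.group(2)
        if not is_closer:
            stack.append(name)
            out.append(m.group(0))
        elif name in stack:
            # Close everything opened after the matching opener, then the tag itself.
            while True:
                top = stack.pop()
                out.append(f"[/{top}]")
                if top == name:
                    break
        elif name in BLOCK_TAGS:
            # Orphan block closer: close the open inline tags, then add the
            # missing opener just before the closer.
            while stack and stack[-1] not in BLOCK_TAGS:
                out.append(f"[/{stack.pop()}]")
            out.append(f"[{name}][/{name}]")
        # Orphan non-block closer: removed entirely (nothing is emitted).
    out.append(text[pos:])
    while stack:  # assumption: close whatever is still open at the end of the post
        out.append(f"[/{stack.pop()}]")
    return "".join(out)
```

With this sketch, [b][s][/quote] becomes [b][s][/s][/b][quote][/quote]: balanced, but the empty quote shows how little the strategy knows about where the opener really belonged.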
Re: Fixing mismatched BBCode
« Reply #59, on December 10th, 2011, 11:49 PM »Last edited on December 11th, 2011, 09:01 AM by Nao
I'm trying to figure out a 'simple' way to prevent nesting 'li' tags inside others...
Does 'li' have some peculiarity in the bbcode table? I see there are things with require_parent, disallow_children and things like that, but I'm not sure I get the gist of it... li requires the list parent but doesn't disallow anything, right? Shouldn't it disallow li children..?
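To pin down what the two kinds of rule would mean, here is a small Python sketch of a parent/child check. The RULES table, its key names and the is_allowed function are hypothetical and only loosely modeled on the names mentioned above; this is not the actual bbcode table.

```python
# Hypothetical nesting rules, keyed by tag name.
RULES = {
    "list": {"disallow_children": set()},
    "li": {"require_parent": {"list"}, "disallow_children": {"li"}},
}


def is_allowed(tag, parent):
    """Return True if `tag` may appear directly inside `parent` (None = top level)."""
    required = RULES.get(tag, {}).get("require_parent")
    if required and parent not in required:
        return False  # the tag needs a specific parent and doesn't have it
    forbidden = RULES.get(parent, {}).get("disallow_children", set())
    return tag not in forbidden
```

In this sketch, is_allowed("li", "list") is True while is_allowed("li", "li") and is_allowed("li", None) are False. Note that if the parent requirement is checked against the immediate parent, as assumed here, a directly nested li is already rejected by that rule alone, and the disallow entry only matters if the requirement is checked against any ancestor.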

I didn't make much progress on fixNesting today. The only bug I fixed is one that didn't allow me to show precisely where the errors were found... Now it should work fine, i.e. instead of giving you the bad tag in the list of tags, it shows you the bad tag surrounded by its context in the message (post contents, other tags...)