if (preg_match_all('~\[(/)?code[^]]*\]~i', $message, $matches))
foreach ($matches[1] as $tag)
$num_tags += (empty($tag) ? 1 : -1);
// If we have open tags, close them.
while ($num_tags > 0 && $num_tags--)
$message .= '[/ code]';
// Open any ones that need to be open, only if we've never had a tag.
while ($num_tags < 0 && $num_tags++)
$message = '[code]' . $message;Why exactly is it so important to have matching tags?
But that would mean a fuller rewrite of preparsecode, which I'm not ready/willing to do...
So, I'm looking at the code tag stuff and all I see is that it's mainly used to deal with special tags only when they're outside code tags.
// Find all code blocks, and ensure they follow the [ code][ /code] syntax closely.
if (preg_match_all('~\[(/?)code[^]]*]~i', $message, $matches))
{
$pos = 0;
$tags = array();
foreach ($matches[0] as $id => $tag)
{
$tag_pos = strpos($message, $tag, $pos);
$length = strlen($tag);
$tags[] = array($tag_pos, $length, $matches[1][$id] === '/');
$pos = $tag_pos + $length;
}
$was_closed = true;
$offset = 0;
foreach ($tags as $tag)
{
if ($was_closed === $tag[2]) // consecutive closing tags, or closing tag at the beginning?
{
$message = substr($message, 0, $tag[0] + $offset) . substr($message, $tag[0] + $tag[1] + $offset);
$offset -= $tag[1];
}
$was_closed = $tag[2];
}
if (!$tag[2]) // no final closing tag?
$message .= '[ /code]';
}AND.... Interestingly, the same code can be reused for proper quote de-nesting :)
Which makes me think... I've often seen cases of people (including me, TBH) surrounding quotes with two closers instead of an opener and a closer. I'm trying to figure out a way to fix these automatically, but honestly it's hard to tell... :-/
| 1. | Actually, it starts by emptying $context['post_errors'] (note the plural) which is never used, meaning it's actually a typo... |
Oh, I see what you're getting at, misread it at first.
Also, Post2 is not the only place content updates are done (quick edit for example)
Here's the code, for those interested. It seems to work, at least on my test cases. Maybe you can figure out a faster/shorter way to do it ;)
[/quote][quote author=Nao link=msg=1 date=1309111289]Lorem ipsum?[/quote]What is that?
I don't speak rubbish!
[nb]I'm wondering if Rory does, though? He had time to learn...[ /code]
[ code][ nb][ /code][ /nb][quote]post1[/quote]comment[/quote]post2[/quote][quote]post1[/quote][quote]comment[/quote][quote]post2[/quote][quote][b] [s] post1[/quote][quote]post1[/quote][b] [s] comment[/quote]post2[/quote]Using the pseudo-code from above, this would be really messy -- an opening quote would be added just before the closer, and then the s and b tags would remain opened until the end of the post, where they would then be closed forcibly.
post1
They shouldn't be. The bbc parser should actually close both the s and b tags, honouring proper hierarchy, when it gets to the end of quote.Like so, in fact.Citation post1
it's best to first close any opened tags, and THEN go through the search for a place to add a new quote opener.
(And if you check the source, you'll see it's unmodified.) It works because there's the end of a block level tag with unresolved non block tags inside it. The exact behaviour is incredibly complicated, and is no doubt one of the reasons why the bbc parser is so big and scary - but it's also resilient.
This is why I asked if the idea behind this was partly to reduce its complexity or not, because it actually does a lot of silent fixing that most people don't even realise.
The problem as you've discovered with writing such a solution is that unless you can get inside the user's brain and figure out what they meant, rather than what they typed, you have no hope of getting it consistently right.
But *once* it is redone through ::fixNesting, with my super-solid code etc, won't all of the remaining code become unnecessary all of a sudden...?
The only fixing I've seen preparsecode do is add code tags at the beginning or end of a post which sucks a bit...
::parse_bbc does some fixing on its own, IIRC, but if it does, it's in the wrong place. It should be done at save time, obviously. (AND, we should remove any fixer code from ::parse_bbc to force modders to go through ::preparsecode. Believe me, I didn't even know this function existed before Shitiz used it in SMG, and I didn't have the *reflex* to use it systematically until, err.... Now?)
Does fixNesting deal with block tag nesting mismatches or just nesting mismatches?
Also, once nesting and mismatches is fixed, we will also need to look at dependencies and must-contain/must-not-contain rules too, which are also specifiable in the bbc parser...
preparsecode does a lot of things, actually. I think I listed what it did in a previous post,
Part of the reason parse_bbc has it and not preparsecode is that posts added to the DB through other sources that won't have come through preparsecode originally, and that's not just for modders (for example, this will include the importer unless we push every post through some kind of fixer during the import)
I always wanted to move that to the preparser anyway to remove this dependency on strange, naive regexps that didn't allow for customising the table tag or adding th tags, without rewriting all that stuff as well.
[/quote][quote author=Nao link=msg=1][b]
Lorem ipsum?
[/b][/quote]What is that?
I don't speak rubbish![nb]I'm wondering if Rory, though? He had time to learn...[ /code][ code][ nb][ /code][ /nb][quote]post1[/quote][b] [s] comment[/quote]post2[/quote]It's a complicated task at the very best of times :( And it shows in browsers too, when they have malformed tags to deal with, some ignore them totally, some make assumptions with interestingly unpredictable results. I remember having a conversation with a fellow geek back in 2000 which illustrates this perfectly: he was building a site with a big complex data-heavy table in it, and it worked perfectly in IE but broke horribly in Netscape. As I discovered... he wasn't putting any of the closing tags because 'it doesn't need them'. Well, it obviously does!
It's also a rabbit hole of a problem, in that no matter how clever you get, you can pretty much always find another example that will break it. The issue is where the line gets drawn.
This is also why I think we will need to adopt the logic used in the parser to unwind and reprocess the tag nesting, simply because it's a lot more than just having balanced tags. It's a pain to get right but the result will be worth it in the end.
What do you reckon about the special list of tags like x or *? Do we really need them? I actually think they cause more trouble than they're worth - and they're a perfect poster child of why this whole issue is a problem, since no-one seems to know how to actually safely end such a list.
With most of the browsers now using a unified parsing model (the HTML5 parser), it's no longer an issue.
What do you mean unwind and reprocess?
Yeah... It'd be a candidate for deletion, I suppose.
SPIP actually supports turning opening dashes automatically into bullet points. At least it did back in 2003... Ah, good old times.
| 1. | The alternative is extra closers being injected, causing layout malfunction, seeing how this started out in SMF 1.1.x with its tabular layout. |
They still deal with mismatched tags and even malformed tags differently.Citation With most of the browsers now using a unified parsing model (the HTML5 parser), it's no longer an issue.
What the parser ultimately does is step through the post and figures out the tags in play, and when it hits a closer (especially of a block level) or certain combinations of block level opener, it reviews all the tags that are open and closes some or maybe all of them.
At each point, not only is the list of open tags maintained, plus block level evaluation, but the potential tags that can be contained inside each other, plus dependencies, are all reviewed too.
Now, if we were to move the full logic from the parser to the preparser, we'd be able to trap improperly nested tags too, and report on them properly.
Here's the problem: what ends the list item? What ends the list?
This is one place where I'm actually slightly envious of wiki markup because it actually does it sanely. It has no assumptions about hierarchy, one line is one list item, and the first blank line after the fact is the end of the list. If only it were that simple with the one character shortcuts.
Yes, that's pretty much what fixNesting does...
(Well... Unless we add an errorbox for quick edit, of course. Which would not be such a bad idea...)
The end of the line?
We could always draw our inspiration from the best elements in wiki code...
There's also a PEAR library written in PHP, it seems to be nicely written albeit a bit large (30KB for main source + more for specific tags like url...), and doesn't provide a fixer by default -- but it does seem to fix tags on its side.
[quote][i] [b] Hello [/i] [/b] [/s][/quote][quote][i] [b] Hello [/b] [/i][/quote][quote][i] [b] Hello [/b] [/i] [b] [/b] [s] [/s][/quote][quote][i] [b] Hello [/b] [/i] [/b] [/s][/quote]| 1. | I'm saying "at best" because there's still an open quote at this point, so I'd have to implement code that would check LATER tags to make sure the quote is actually closed itself and is thus safe... |
- wysiwyg is still hell... Saving as raw data is worse. Especially when you start changing the output for a bbcode...
- the main issue with bbc vs html, I think, is that most forum users are used to bbcode
- wiki markup isn't very popular, I reckon. If popularity wasn't an issue, I'd have switched to SPIP code long ago
3- leave them be, whatever, except maybe for code tags?
Then don't change the output for bbcode and just save as raw data and be done with it.
Not sure I agree with that. I suspect more people would use the WYSIWYG editor if it weren't so buggy. Hell, I could see myself using it if it had proper keybindings (which IIRC some of the editors do support), because it's even faster to hit Ctrl-B to go bold than it is to use the bbcode.
The biggest stumbling block to WYSIWYG will always be complex bbcode that has no direct equivalent in HTML, e.g. the code or quote tags, not the simple stuff.
It's a brave thing you've taken on there - overhauling the preparser/parser for tag mismatches was always on my todo list but I haven't been brave enough to tackle it just yet.
I'm inclined to go with this, provided that the user is made aware that there was a change and that they might want to review the post (since an extra b closer tag has been added and there is now going to be unparsed b closer left behind) and hopefully they'll spot that the s is unopened.
Not changing the output is opening the door to many problems... No?
There are many cases where it can be 'fixed' by using CSS and only applying the changes to them, but when you start using JS in a bbc tag...
Hmm... I dunno.
Well, if we could get the wysiwyg editor to actually show quotes as HTML, it would certainly solve a lot of issues. (And that's where my automatic quote splitter would come in very handy, because that'd really be the only way to split quotes at all...)
<we:code> and <we:quote>, that being said...
At the risk of disappointing you -- I didn't plan for an overhaul or anything... I simply wanted to change SMF's behavior of fixing mismatched tags when it could simply return an error. Then it evolved into pointing out the exact location of the error... Then it evolved again into fixing the errors automatically when Wedge didn't, or couldn't, expect a $post_errors return message.
Hmm... I don't know, maybe we could silently remove any non-block tags, because they're mostly for details, while we could leave in the block tags where the 'bug' might actually break the post layout...?
This is where it gets very problematic. None of the hybrid solutions I've seen that support both WYSIWYG and bbcode do this. They always leave the bbcode alone and render it from bbcode to HTML as needed.
That being said, it won't render properly in the context of WYSIWYG. WYSIWYG editors, conceptually, are voodoo, and not nice voodoo. (They all work principally the same way: take an iframe, receive all events while the 'textbox' has focus, and transmit all the events to handlers to manipulate the iframe as if it were a true RTE component.)
For example, more than once I've seen people shorten [Unknown]'s name to [U], and be surprised at the result. The reason for their surprise is that other systems just silently fail to render the tag at all if there isn't a safe matching closer (IIRC vBulletin does/did this)
There's no one right answer for this.
Okay, it's like there's an endless list of bugs to fix with this...
First, my code was foolishly taking care of mismatched tags INSIDE code tags. Woops...
Obviously, this had to go. Done. I'm also not fixing anything inside nobbc tags. Anything else?
Now to an interesting bug... My s tag (strike) was not closing correctly. What the hell?
Turns out -- and I only found out after a painful debugging session -- that 's' is actually considered as a block-level tag by Wedge... This isn't the case in SMF, so I'd say it's a typo by the unknown author who moved bbcodes to the database...
Also, and this one IS in SMF, the br tag is not a block-level, while the hr tag is. It doesn't sound very logical to me... I'd tend to say they both are, though. Not that it will matter in my code anyway because it skips self-closing tags.
I'd also consider img tags to be block level... Shouldn't they?
li tags and things like that are block level, and THAT bothers me a bit... Because my code treats block level tags in a special way, and I suspect this wouldn't properly add the closer tags, and instead accumulate tags. i.e. [li]Hello[li]world[/li] would add a closer after the last closer, instead of before the last opener.
Also. I added the code to turn a closer quote into an opener if it turned out it's at the start of a line and is followed by anything but a newline. Then I figured, okay MAYBE people could actually start a new line, type in [ /quote] and then immediately after, another [/ quote] to close a nested quote. Or even [ /code] or whatever... So I added a test for '[/' after the tag. Now, I suppose there are suddenly dozens of new ways to break this... Obviously I suppose I could test 'simply' for a-z... I don't know. It's hell.
Sometimes I wonder if we shouldn't just, ahem, call parse_bbc inside preparsecode, and have parse_bbc return a pseudo-BBC version instead of a html version... Would probably make things... easier. For me. Ah ah.
Well, I'm saying that but I've nearly reached the end... (And then in a weeks time I'll still be on it........... It's just a fucking pre-parser, arghhh!!)
I don't see where it would be impossible to implement.
We just need to always have the same code for quotes for instance... I mean, the current code generator is very simple. It can easily be emulated through jQuery to actually add that code to the post.... Heck, we could even take the quote's HTML code from within the database, and put it into a JS string...! That's the logical way to do it, even...
That, and contentEditable.
That's... interesting.
But then again, as important Unknown is the the SMF community (it's a no-brainer), I'm prepared to have his name removed entirely from posts if people shorten it to 'u'...
In our case, with my code, the missing u closer will be added.
| 1. | Well, that's what my understanding of the requires_parent code intimates it should do, put it that way, since requires_parent for li says it requires a list bbcode as a parent and it won't have one, so it should be left alone... |
I'm trying to figure out a 'simple' way to prevent nesting 'li' tags inside others...
| 1. | It was about 150 lines long when I suddenly decided Wedge should be nice enough to prevent using block tags inside non-block tags. Heck... Even then, if you have something strange like an opening "s", then an opening "b", then a closing "s", it will add the closing "b" before the opening block tag of course, but then again AFTER the closing "s". Don't ask too much of me... |
No, li doesn't have to authorise that. The only rule it really needs is that li must be inside a list. Then the trick is making sure where list is, which IIRC is anywhere that a block context allows it to be.
The preparsecode stuff to check for this is so fucked up, once again...
Okay, I'll look into doing that tomorrow...
Still, complex errors won't be magically fixed, I'm afraid... Unless, unless I do a stricter check.
So while typing this post, it came to me that I could use a stack of tags and stack/unstack data and... Well, have a look at this code:Code: [Sélectionner] [/quote][quote author=Nao link=msg=1 date=1309111289]Lorem ipsum?[/quote]What is that?
I don't speak rubbish!
[nb]I'm wondering if Rory does, though? He had time to learn...[ /code]
[ code][ nb][ /code][ /nb]
And... Here's the error message I'm getting following my latest rewrite. Which is actually SHORTER than the last one ;) It's not perfect but I'm working on it eheh.
Pretty cool uh?
PS: and yes, it works with tag nesting too, since it's a stack... i.e. if I have properly nested 'b' tags inside the quote, everything's fine.
PPS: the main issue with fixing tags in the middle of a message is that I would then have to find the exact position of the tag... I guess it's doable, though, but I'll have to go through a series of strpos etc or something to fill in the list first, so it'll definitely make the code bigger.