htmlspecialchars while inserting into DB

Arantor

  • As powerful as possible, as complex as necessary.
  • Posts: 14,278
Re: htmlspecialchars while inserting into DB
« Reply #15, on June 21st, 2011, 06:04 PM »
Quote
This wasn't done for security reasons (although it's not necessarily a bad thing security wise.)
Oh? Well, I never went as far back as YaBB or YaBBSE, so I'm looking at it all from SMF 1.1 and later's perspective, and what I've seen said. My understanding was that SMF 1.0 did bbc parsing through regexp, on display, but that a vicious ReDoS could take it out because it wasn't protected against that, hence it was doing through string parsing and done on a regurgitation basis, and that security was one of the major factors.

But yes, your other comment about not changing it on upgrading is a good one. In our case, we explicitly allowed for that sort of thing because we were looking at fixing otherwise broken links when coming from non SMF sources.
Quote
And lastly because you need to be able to edit your posts, and we knew there would be bugs if we tried to "reverse the translation" - but were worried about the additional storage requirements of storing two copies of every single post.
Funny you should mention that, that's exactly what the WYSIWYG editor tries to do, badly. So much so that it doesn't bother converting anything other than simpler HTML.
Quote
1. String parsing in PHP is slow.
No argument. Interesting comment there, there were plans to rewrite the bbc parser in C under the banner of smflib. (Not sure if that was after your time or not) but it never really went anywhere. It's still in SMF's SVN, untouched in pretty much forever.

I think whatever happens, we can't really afford to leave PHP, it's not like we can just conjure up a parser in C, and even if we could and did, I'd honestly not want the hassle of support for something like that.
Quote
2. Caching posts needs a bit of a better solution
Correction: it needs *some* solution.

parse_bbc does caching, but only under certain circumstances: if a post takes over a certain length of time to parse, and the parse_tags parameter wasn't given. parse_tags is only applicable when you invoke the parser with a subset of tags to accept, e.g. for signatures that have a different set of rules to normal.

In real practice, only truly huge posts ever hit this limit anyway.
Quote
What I've done is an "indexed cache", which means I mark entries in the cache that are popular, and over time, cache those popular items.
This sounds like a good idea to me. I'm not sure offhand how we'd do it, but it seems more reasonable than just bulk throwing things at cache.
Quote
3. The parse_bbc() routine is far from perfect
I really need to get properly familiar with TOX-G's innards. Is there anything specific you're thinking about how it operates that you'd do differently?
Quote
4. Someone needs to profile parse_bbc()
*nods* It does need to be done properly. I never seem to find the time to do this though :/
Quote
Right.  I think htmlspecialchars is relatively cheap, so I'd either do it on display, or else communicate the data in a text format (such as json, although that has its own escaping) and set it as text via the DOM.
So, then, if we were to htmlspecialchars it on output, presumably we wouldn't do it on saving the data in the first place? Or would we unsanitise it before resanitising it?

My one fear with doing that is that it opens a door I don't really want opened. SMF's ecosystem is populated with those who write SSI functions and jiggle around with the innards, and I'd really rather not give them any more ways to make it easier to make them insecure.
Quote
Hmm.  I feel like we had something in other/tools/ that did this and did do topic reattribution, which Compuart I think wrote.  Weird.
Well, in 2.0, it's in the admin panel. Except it's broken.
Quote
centralized bug fixes in an unofficial package that I updated every day or two.
See, that's neat. And really, I'm not against the idea. I just wish they'd write rules and stick to them; it always used to be the case when I was on the Cust. team that bugfixes weren't allowed. Then they allow these bugfixes, and they even package up the bugfix for the IE 8 "jumping text bug" for 1.1.x as a mod rather than roll it into the update >_<
Quote
I'm the guy who always wanted to remove the calendar from the standard distribution of SMF.
Things like this I'm in one of two minds about - and it's been discussed here too. My take on it is straightforward enough: everything that's in the core of Wedge should be something that a "reasonable percentage of users" are going to use.

Right now the calendar is seemingly used by the relative minority. A small but vocal group use it for events, and a larger group turn it on because of birthdays.

So, my take on it, specifically, is that birthdays should be separated from the calendar, and then the calendar should either be moved out of the core, or made more useful, so as that more people would find it useful. For example, we've talked about accessing "Today's posts" from the calendar.

While I don't like to point at other systems and go "they do it" (it sort of goes against my "innovate, not imitate" philosophy), I look at WordPress and I see they have it. Different horses for different courses but they have a calendar in the core, enabled by default. Because it suits them to do so.

Given that we have blog support - because it's something we both want and I saw more and more requests for it at SMF (and the fact that if you have Wedge, you shouldn't really need to bridge to WP for a simplistic blog, have one system and be done with it), the calendar takes on a new light - and one that helps justify keeping it.
When we unite against a common enemy that attacks our ethos, it nurtures group solidarity. Trolls are sensational, yes, but we keep everyone honest. | Game Memorial

[Unknown]

  • Posts: 77
Re: htmlspecialchars while inserting into DB
« Reply #16, on June 21st, 2011, 07:22 PM »
Quote from Arantor on June 21st, 2011, 06:04 PM
Oh? Well, I never went as far back as YaBB or YaBBSE, so I'm looking at it all from SMF 1.1 and later's perspective, and what I've seen said. My understanding was that SMF 1.0 did bbc parsing through regexp, on display, but that a vicious ReDoS could take it out because it wasn't protected against that, hence it was doing through string parsing and done on a regurgitation basis, and that security was one of the major factors.
We didn't even know about ReDoS back then.  No, it was because people found workarounds - we kept blacklisting more characters - slash, NULs, etc. but Internet Explorer just had more "lax HTML parsing" than we could deal with.  It was a constant source of security holes.

It also had tons of bugs.  If you put tags in the wrong order, all sorts of nasty things could happen, and you couldn't quote my name (I see people have seen fit to bring that bug back, though.)  Autolinking was a constant source of issues too, and you'd often get HTML in your post because of it conflicting with the bbc regexps.

And I'm sure it was vulnerable to ReDoS (and would've had to be to workaround IE's horribleness.)

Given my experience with that, I'll never ever write a bbc parsing system that uses regexps.  If I did, I would be almost certain it was insecure no matter how much I tested it.
Quote from Arantor on June 21st, 2011, 06:04 PM
Funny you should mention that, that's exactly what the WYSIWYG editor tries to do, badly. So much so that it doesn't bother converting anything other than simpler HTML.
This is why I was against a WYSIWYG.  I didn't think the two could live side by side properly.
Quote from Arantor on June 21st, 2011, 06:04 PM
No argument. Interesting comment there, there were plans to rewrite the bbc parser in C under the banner of smflib. (Not sure if that was after your time or not) but it never really went anywhere. It's still in SMF's SVN, untouched in pretty much forever.
Oh yeah, me and Compuart worked on smflib together (mostly him, though.)  What made me (and I think him) sad was that it wasn't really faster.  But he had scripts that automatically converted PHP code into zval using C.  We were thinking to make it similar to what HipHop is today, I think.  Maybe David stole our idea (just kidding.)
Quote from Arantor on June 21st, 2011, 06:04 PM
I think whatever happens, we can't really afford to leave PHP, it's not like we can just conjure up a parser in C, and even if we could and did, I'd honestly not want the hassle of support for something like that.
If it was built with a discreet, simple protocol, and just did the nuts and bolts of the parsing (the slow part), and was optional like Sphinx, it could make sense for large forums.  But, sure.
Quote from Arantor on June 21st, 2011, 06:04 PM
if a post takes over a certain length of time to parse
This was my silly attempt (I think) at trying to not starve the cache of memory.
Quote from Arantor on June 21st, 2011, 06:04 PM
This sounds like a good idea to me. I'm not sure offhand how we'd do it, but it seems more reasonable than just bulk throwing things at cache.
You cache an index (even better with redis, with native types), and then garbage collect it.  I posted about this over at yourasoft.org a while back.
Quote from Arantor on June 21st, 2011, 06:04 PM
I really need to get properly familiar with TOX-G's innards. Is there anything specific you're thinking about how it operates that you'd do differently?
parse_bbc puts the string back together IIRC, which is super slow.  TOX-G tokenizes (a long standing parsing tradition), and maintains the tokens, then builds from those tokens into a stream.

So either storing them in an array, and join afterward, or maybe the right way would may even be to write to a php://memory stream, and then convert that to a variable, or just use ob_start and just echo.  Both of these *should* use better buffers.

Basically, and this is without profiling mind you, but I'm fairly sure the thing that makes parse_bbc slow is the number of mallocs.  Any string concatenation operation is gonna be slow.

But these are just guesses, I wouldn't do anything without profiling it and being able to compare.
Quote from Arantor on June 21st, 2011, 06:04 PM
*nods* It does need to be done properly. I never seem to find the time to do this though :/
Well, cutting it up is the hard part.  And since it's a bit of a beast, the test cases need to be built out some.  I don't know if it still exists, but when I first wrote parse_bbc(), I started some in other/ somewhere for it.
Quote from Arantor on June 21st, 2011, 06:04 PM
So, then, if we were to htmlspecialchars it on output, presumably we wouldn't do it on saving the data in the first place? Or would we unsanitise it before resanitising it?
Yeah, wouldn't do it on saving.  This again gives more options: for example, what if I want the subject to just be text, and I want to communicate it via application/json to an API?  And then insert it directly into the DOM as a text node?  There might be no reason for it to every be html escaped at all.
Quote from Arantor on June 21st, 2011, 06:04 PM
I'd really rather not give them any more ways to make it easier to make them insecure.
Totally agree - true security is easy security.  I think it can be made relatively clean and easy, one way or another.
Quote from Arantor on June 21st, 2011, 06:04 PM
Right now the calendar is seemingly used by the relative minority. A small but vocal group use it for events, and a larger group turn it on because of birthdays.
I agree with your definition, except that even if a bunch of people were using the calendar, it doesn't mean it's the only or best option.  I think if we want event integration, maybe it makes more sense to use Google Calendar.

Even if both were official options, I see them as official mods.
Quote from Arantor on June 21st, 2011, 06:04 PM
I look at WordPress and I see they have it. Different horses for different courses but they have a calendar in the core, enabled by default. Because it suits them to do so.
A full one like SMF?  I had no idea.  And will probably never use it.

The reason for my hate toward the calendar is because I consider it to be its own package on its own right.  Google did it right; Google Calendar isn't just "part of Gmail", it's a separate app, and it has a lot to it.

-[Unknown]