Print Page - htmlspecialchars while inserting into DB

Public area => The Pub => Off-topic => Topic started by: Dragooon on June 20th, 2011, 09:59 PM

Title: htmlspecialchars while inserting into DB
Post by: Dragooon on June 20th, 2011, 09:59 PM

Does this make sense? I've been wondering, wouldn't mysql_real_escape be enough while appending to DB?

Title: Re: htmlspecialchars while inserting into DB
Post by: Nao on June 20th, 2011, 10:28 PM

There ate subtle differences. I think Pete knows best.

Title: Re: htmlspecialchars while inserting into DB
Post by: CJ Jackson on June 20th, 2011, 10:35 PM

I never used mysql_real_escape, I always used prepared statement because I find them a lot easier.

I always used htmlspecialchars for data that goes in a html attribute, I rarely use them for anything else.

Title: Re: htmlspecialchars while inserting into DB
Post by: Dragooon on June 20th, 2011, 10:39 PM

But do you htmlspecialchar them while inserting or while outputting? And does prepared statement automatically take care of escaping and security?

Title: Re: htmlspecialchars while inserting into DB
Post by: CJ Jackson on June 20th, 2011, 10:46 PM

Only while outputting to the browser! Yes it does take care of escaping and security, prevent SQL injection if written properly! Also the data that comes out of the database tends to be dirty, it like to keep the back slashes, so clean it with stripslashes() and always before htmlspecialchars().

Title: Re: htmlspecialchars while inserting into DB
Post by: Arantor on June 21st, 2011, 12:05 AM

It depends what you're trying to protect against.

mysql_real_escape_string going in will protect you against SQL injections, and in SMF/Wedge's case, the parameterisation is sufficient combined with the aforementioned to effectively neutralise this problem.

htmlspecialchars isn't going to protect you against SQL injections but it's there for XSS and anti-malicious scripting purposes. We cannot ever assume what the user is giving us is clean, so we expressly neuter it before letting it anywhere near the DB.

You'll note that in SMF/Wedge's case, anything destined to be used in the html bbcode is reformatted slightly to de-neuter it.

SMF (and Wedge)'s specific brand of content encoding going into the DB, where bbcode is concerned at least, is slightly odd, and one day I'll figure out exactly why it was done the way it was (remove newlines, expressly inject br tags into the stored content, after htmlspecialchars has been run)

@CJ Jackson: I'd rather not be trying to sanitise on output, I'd rather sanitise it when capturing it so that if something screwball tries to dump the contents of the DB, it's still going to be safe because there isn't anything dirty in the DB.

SMF/Wedge is pretty meticulous inside about stuff going into the DB being clean, shame not all mods are quite so clean about it. And if you're getting slashes from the DB, there's something more broken about content going in, in the first place.

Title: Re: htmlspecialchars while inserting into DB
Post by: Dragooon on June 21st, 2011, 12:37 AM

Thanks, I am using tox-g which explicitly uses htmlspecialchars in outputs, although I guess one can't be too secure.

Title: Re: htmlspecialchars while inserting into DB
Post by: Arantor on June 21st, 2011, 12:39 AM

There is a second reason why you might/would htmlspecialchars content on output: simply that you may not always be able to guarantee the sanity of runtime variables etc. in the current context.

The thing you might need to watch out for is double-calling, no need to call htmlspecialchars on outputs where it has already been run before...

Title: Re: htmlspecialchars while inserting into DB
Post by: CJ Jackson on June 21st, 2011, 01:27 AM

I think I might overlooked magic quote, so magic quote might of been causing double escape? :lol:

Posted: June 21st, 2011, 12:55 AM

Just to make you feel better Pete, I did not break the content, WordPress did, it got a magic quote feature of it's own. How nice! :lol:

Title: Re: htmlspecialchars while inserting into DB
Post by: Arantor on June 21st, 2011, 01:32 AM

That's something that bothers me about WP: some people that know what they're doing use PDO etc, others use WP's own functions, and somewhere in the middle, shit happens.

At least with SMF, mods are vetted and generally have had oddities weeded out - so SMF, and Wedge, sort out magic quote once and for all, right up front, and mods are then allowed to use that content - plus they're all supposed to use the proper methodology for inserting data into the DB and querying it using the abstracted methods.

Unless your name is <censored> and you can 'legitimately ignore the rules' because you're one of the people who enforces them, and it worked 'OK' for the more wild and untamed SMF 1.1 so just re-wrapping it in the 2.0 style call is acceptable >:(

Title: Re: htmlspecialchars while inserting into DB
Post by: [Unknown] on June 21st, 2011, 05:03 AM

Quote from Dragooon on June 20th, 2011, 09:59 PM

Does this make sense? I've been wondering, wouldn't mysql_real_escape be enough while appending to DB?

I think it doesn't make sense. I say this even though I think SMF does do it for posts at least, but there are many mistakes I either made or didn't fix when I wrote SMF.

From a security standpoint, the assumption should be that everything else has been compromised. From an optimization standpoint, the assumption should be everything else is perfect. Quality lives somewhere between the two.

Generally, I will cast to int things even from the database - because I don't know if my database query had a SQL injection, or maybe something went wrong in my insertion, or another software was compromised and gained access into my database. The less that an attacker can "gain" even after they successfully exploit a small hole, the better.

Plus, htmlspecialchars'ing before you insert in the database increases space requirements, and makes integration with other systems harder.

Quote from CJ Jackson on June 20th, 2011, 10:35 PM

I never used mysql_real_escape, I always used prepared statement because I find them a lot easier.

If you're using ISO-8859-1, ASCII, or UTF-8, then as long as you escape all the right characters, you are safe. If your database connection (not your output) is ever in a charset other than utf-8, then it matters a lot and you MUST use mysql_real_escape_string. Examples are Shift_JIS or Big-5.

Quote from CJ Jackson on June 20th, 2011, 10:35 PM

I always used htmlspecialchars for data that goes in a html attribute, I rarely use them for anything else.

Not just attributes, but also text.

Quote from CJ Jackson on June 20th, 2011, 10:46 PM

Also the data that comes out of the database tends to be dirty, it like to keep the back slashes, so clean it with stripslashes() and always before htmlspecialchars().

You shouldn't escape twice (or allow magic GPC to get in your way) when inserting. If you have to strip slashes on the way out, you're storing it with slashes in the database.

Quote from Arantor on June 21st, 2011, 12:05 AM

SMF (and Wedge)'s specific brand of content encoding going into the DB, where bbcode is concerned at least, is slightly odd, and one day I'll figure out exactly why it was done the way it was (remove newlines, expressly inject br tags into the stored content, after htmlspecialchars has been run)

This may have been done in the name of optimization, and some parts of it predated me and some parts of it were my fault. I saw bbc as mostly as scary a beast as anyone did, so I was happy to make sure it worked. Ultimately, I think the original text should be stored, but it becomes an upgrade problem (and who wants the upgrade on large forums to change all the text?)

Quote from Arantor on June 21st, 2011, 12:05 AM

@CJ Jackson: I'd rather not be trying to sanitise on output, I'd rather sanitise it when capturing it so that if something screwball tries to dump the contents of the DB, it's still going to be safe because there isn't anything dirty in the DB.

No reason not to sanitize both ends, IMHO.

Quote from Arantor on June 21st, 2011, 01:32 AM

At least with SMF, mods are vetted and generally have had oddities weeded out

I like the methodology I tried to push in SMF: if it's broken, fix it. Same with XMLHttpRequest and etc. I'll never understand why most devs don't go that way. My company, in a standard test, has three lines that make XMLHttpRequest work in all browsers (yes, this test is getting a bit old...) We watch for developers to (a) remove the line and use something else, e.g. jQuery, (b) use XMLHttpRequest directly, or (c) leave the line AND use their own XMLHttpRequest/ActiveXObject or whatever code (that is much longer and harder to read.)

Of people who take it, I think <10% pick a or b. 90% pick route c. It baffles me.

That said, when people do detection or don't check ini settings properly, "just fixing it" can make integration harder.

-[Unknown]

Title: Re: htmlspecialchars while inserting into DB
Post by: Arantor on June 21st, 2011, 09:35 AM

Quote

I say this even though I think SMF does do it for posts at least, but there are many mistakes I either made or didn't fix when I wrote SMF.

SMF and Wedge do it for almost everything that moves into the database, except in a few specific places where the ability to use raw HTML is desirable and only the admin can set that up (board descriptions, censored words - which is why in 2.0 RC4 it became admin only as opposed to being moderate_forum before that).

That is to say, usernames, topic subjects and the like are htmlspecialchars'd with ENT_QUOTES if I remember right, while messages are htmlspecialchars'd, and pushed through preparsecode which fixes up certain bbcode requirements (neuters nobbc contents, enforces the html bbcode is only accessible by admin, attempts to fix/validate certain types of bbcode interaction, e.g. table->tr->td).

I wonder if preparsecode's changing everything into br tags is a holdover from YaBB where everything was in flat files. (I don't know how it was stored, but it seems feasible to me)

Quote

Ultimately, I think the original text should be stored, but it becomes an upgrade problem (and who wants the upgrade on large forums to change all the text?)

That's one of the things we can do something about. We've expressly set ourselves on the path of having an importer rather than just 'upgrading' the existing tables - aside from the fact it lets people run them both side by side to experiment, it also means we can do manipulations along the way like fix any of these things we decide to resolve.

Quote

No reason not to sanitize both ends, IMHO.

Depends what you mean by sanitise. htmlspecialchars both ends strikes me as a bad idea, for example.

Quote

I like the methodology I tried to push in SMF: if it's broken, fix it

Well, the methodology I was referring to was that all mods using the database should use $smcFunc's functions, and should be using the proper parameterisation, or the proper insert method made available that deals with escaping etc for you.

And yes, if it's broken, it should be fixed. That's one of the things that rattled me so much about 2.0 final: an astounding number of bugs were just deferred. While I as much as anyone accept that software is released with bugs, there's a difference between releasing with known bugs and releasing with unknown bugs, as well as releasing with architectural bugs that will need a complete rewrite of components to fix, and it irked me immensely that so many known bugs that weren't major but weren't that hard to fix (in more than one case a fix was provided and shown to work), and were still deferred through some "well, it might cause regressions" mantra.

Quote

Of people who take it, I think <10% pick a or b. 90% pick route c. It baffles me.

We use jQuery in Wedge. Not just because it means we get to minimise the code to be sent to users (since the admin can pick a CDN copy of jQuery), but the time we spent writing JS is shorter, and I suspect less time is also spent in debugging. Plus a lot of users do add stuff that makes use of jQuery, so having it in the core means plugins won't try each adding it and falling over each other.

Title: Re: htmlspecialchars while inserting into DB
Post by: [Unknown] on June 21st, 2011, 10:21 AM

Quote from Arantor on June 21st, 2011, 09:35 AM

I wonder if preparsecode's changing everything into br tags is a holdover from YaBB where everything was in flat files. (I don't know how it was stored, but it seems feasible to me)

Now that you explain which ones, I'm pretty sure they all were holdovers. Definitely the br thing was. And I know the nobbc/html thing was just because that was the easiest way.

Quote from Arantor on June 21st, 2011, 09:35 AM

That's one of the things we can do something about. We've expressly set ourselves on the path of having an importer rather than just 'upgrading' the existing tables - aside from the fact it lets people run them both side by side to experiment, it also means we can do manipulations along the way like fix any of these things we decide to resolve.

While I agree, it also means large forums (where's Douglas at anyway?) are a big problem.

Quote from Arantor on June 21st, 2011, 09:35 AM

Depends what you mean by sanitise. htmlspecialchars both ends strikes me as a bad idea, for example.

Sure. But I explicitly don't agree with the "garbage in, garbage out" philosophy when it comes to security. An example: just because you only put valid filenames into the attachments table doesn't mean you shouldn't validate that assumption on the way out too. This also improves upgrade scenarios anyway (if done right.)

Quote from Arantor on June 21st, 2011, 09:35 AM

Well, the methodology I was referring to was that all mods using the database should use $smcFunc's functions, and should be using the proper parameterisation, or the proper insert method made available that deals with escaping etc for you.

I was referring to where you talked about magic quotes. Maybe it was the wrong way to go at the time, but the codebase expected them on as well as register_globals, so I decided the most secure route was to move forward with magic quotes on and kill register_globals (because with either of those reversed, could've had security holes if I made mistakes, rather than just bugs. I prefer bugs.)

Quote from Arantor on June 21st, 2011, 09:35 AM

were still deferred through some "well, it might cause regressions" mantra.

Well, at work, this is a pretty common mantra. I'm even holding back changes on a project right now for this reason.

At the same time, I agree. Especially with the release of a new major version number, you gotta get it right or you'll pay for those mistakes for a while. I know Chrome and Firefox are going the way of loosey-goosey, but I'm not convinced that will work for either of them long term, especially Firefox.

Back whenever ago, I posted the same basic thing on simplemachines.org; if you need more RCs, release more RCs. There's no shame in releasing RCs. I'm still considering TOX-G alpha, after all (and probably will until I write better docs, some more useful samples, and add a couple nagging features that I want.) It's being used in production by a select few, and I'm ready to help those implementations upgrade if necessary if I have to make breaking changes.

Sad to hear that SMF didn't do the same.

Quote from Arantor on June 21st, 2011, 09:35 AM

We use jQuery in Wedge. Not just because it means we get to minimise the code to be sent to users (since the admin can pick a CDN copy of jQuery), but the time we spent writing JS is shorter, and I suspect less time is also spent in debugging. Plus a lot of users do add stuff that makes use of jQuery, so having it in the core means plugins won't try each adding it and falling over each other.

Well, jQuery is fine. Most people who take the test, even if they add jQuery, leave that line in and don't touch it; they clearly don't understand it. That's what's a bad sign.

[rant]

I don't hate jQuery, but I don't really like it either. It's not a bad choice and very sane for most web apps, though. My problem with it is that it's got a lot of features, most of which I don't need, and doesn't provide most of the features I do. It has animations, but doesn't do colors; it deals with nodelists spectacularly, but makes me learn a second DOM syntax to do so; it has whiz-bang solutions like $.each, but it doesn't have convenience funcs to build this-bound delegates; it makes it super easy to build html, but in a way that encourages XSS-vulnerable code; and it supports xmlns selectors but doesn't make them compat in IE browsers. Off the top of my head.

[/rant]

But yeah, it's popular and people are learning it, so it makes sense.

-[Unknown]

Title: Re: htmlspecialchars while inserting into DB
Post by: Arantor on June 21st, 2011, 11:59 AM

Quote

Definitely the br thing was. And I know the nobbc/html thing was just because that was the easiest way.

I'm not knocking the nobbc/html thing in any way, frankly I think it's the way to go.

Quote

While I agree, it also means large forums (where's Douglas at anyway?) are a big problem.

Large forums are always going to be a problem when you're moving to anything else, if it does anything differently. I'd rather handle it at conversion time and end up with something cleaner - start as you mean to go on.

Douglas is still running his forums but I think he's moved on from his old job at SolarVPS these days, not sure what he's up to now though.

Quote

Sure. But I explicitly don't agree with the "garbage in, garbage out" philosophy when it comes to security. An example: just because you only put valid filenames into the attachments table doesn't mean you shouldn't validate that assumption on the way out too. This also improves upgrade scenarios anyway (if done right.)

That's why I said it depends what you're talking about. As a minimum, what goes into the DB should be sane/safe. The real question about coming back out is how much effort it is to validate that assumption, which is where the performance/security balance comes back in.

The big example is parse_bbc. It is the single slowest component in SMF and Wedge, and it is a scary monster because it's expressly checking that what's coming out is safe, with certain protections carried out separately in preparsecode (like sanitising nobbc and html). I think it's important that some element of that is continued to be done, but how far do you go with it? At what point is it acceptable to trade off performance for security?

On the flip side, there are things like topic subjects, which aren't resanitised on display, what's in the DB is assumed to be safe.

Quote

Well, at work, this is a pretty common mantra. I'm even holding back changes on a project right now for this reason.

It's a common mantra, but in the cases I'm thinking of, it's misapplied.

One that really annoyed me is the reattribute posts function. It reattributes old posts to a new user, e.g. new account after deletion, but it doesn't do topic ownership at the same time. Someone demonstrated the query, it works and I don't see anything about it that would cause regression. But it was deferred.

At the same time, the dreaded "IE 8 jumping text bug", which they fixed late in the day, and that DID cause a regression which they sort of fixed in the templates without fixing the root cause - the editor component now can't be given a width from the create function, so any mod that invokes the editor and asks for the default of 70% width doesn't get it, it just fills the maximum available width, and it's "acceptable" to rewrite it in the template to set the width there, rather than setting it when it's invoked. Which means, I had to go fix all the instances of this in SimpleDesk, because they fixed one bug and made another, which to me is a regression.

Quote

Back whenever ago, I posted the same basic thing on simplemachines.org; if you need more RCs, release more RCs. There's no shame in releasing RCs

To be fair to them, a lot of people were very fed up of a new RC every six months or so that fixed some bugs and not a lot else. The problem is that all the signs hint at 2.0.1 not being a bugfix release but only in the event of security fixes, so some of the deferred bugs will only be solved in 2.1 or higher, which is a butt-ache if they've been on the tracker for a couple of years, especially if there are posted solutions.

You know something's wrong when they're accepting mods on the mod site that are bug-fix mods. I only wish I were kidding.

Quote

Well, jQuery is fine. Most people who take the test, even if they add jQuery, leave that line in and don't touch it; they clearly don't understand it. That's what's a bad sign.

On the other hand, if it's there, it's there for a reason, and usually that reason is because other stuff depends on it working exactly that way. If I were to add jQuery to something like that, I'd be flagging up that this other item was to be removed and all instances that use it should proceed to use jQuery.

I'm willing to bet that not everyone who left it alone did so because they didn't understand it - but I won't argue that it's certainly the more likely reason.

Quote

My problem with it is that it's got a lot of features, most of which I don't need, and doesn't provide most of the features I do.

You can say that about a lot of things, though. It's almost the same argument against something like Microsoft Word: just because you might only use 2% of its features, that sounds like an argument for stripping out the rest - until you realise that everyone else uses a different 2% of its features.

It's a convenient go-to place, but like everything else it has pros and cons that need to be weighed up for the project at hand. For us, it's convenient: it provides a consistent XHR object with some nice conveniences, animation which can be nice for UX if done properly, and it makes DOM manipulation easy for the sorts of thing we're doing.

Plus it's available via CDN, so it's not like users probably won't have it cached.

And, it's popular. But yes, it can encourage bad behaviours - but so too does PHP ;)

Title: Re: htmlspecialchars while inserting into DB
Post by: [Unknown] on June 21st, 2011, 05:00 PM

Quote from Arantor on June 21st, 2011, 11:59 AM

I think it's important that some element of that is continued to be done, but how far do you go with it? At what point is it acceptable to trade off performance for security?

This wasn't done for security reasons (although it's not necessarily a bad thing security wise.) It was done to make upgrading not require updating past posts, and because user settings and admin settings can affect the bbc and we didn't want to deal with "recalculating." And lastly because you need to be able to edit your posts, and we knew there would be bugs if we tried to "reverse the translation" - but were worried about the additional storage requirements of storing two copies of every single post.

Ultimately, for that, I think several things:

1. String parsing in PHP is slow.
Actually, it's slow in a lot of languages (but especially PHP), because they use a NUL at the end of the string, and parsing is slicing heavy. I think for larger forums, the only good solution is to write (parts of?) the parse_bbc algorithm in a tighter language, such as D(http://www.digitalmars.com/d/). This could be done with basic IPC kinda like memcached works.

2. Caching posts needs a bit of a better solution
A lot of people want to just "cache everything." Throw physical boxes at the problem, and use all of them as ram sticks, and just cache, cache, cache, cache. This is expensive, and becomes hard to manage, expunge, etc.
What I've done is an "indexed cache", which means I mark entries in the cache that are popular, and over time, cache those popular items. Then when they get unpopular, they are garbage collected. We use this at work, and it requires very little memory, but significantly benefits viral traffic (which I'm pretty sure is the same sort of traffic forums get.)

3. The parse_bbc() routine is far from perfect
The way TOX-G parses is much better. The BBC parsing routine was written in a similar way to how one would write it in a language like C. I naively hoped PHP would turn this into optimal code (and it was the easiest way to write anyway), but it doesn't.

4. Someone needs to profile parse_bbc()
I think Groundup was trying to do this, but I don't know how much he did with it. It needs to be broken up into smaller functions - ideally like ~5 of them or more - so that the performance problem can be identified. I have some ideas where performance may be bad, but guessing isn't the best road to improvement.

Quote from Arantor on June 21st, 2011, 11:59 AM

On the flip side, there are things like topic subjects, which aren't resanitised on display, what's in the DB is assumed to be safe.

Right. I think htmlspecialchars is relatively cheap, so I'd either do it on display, or else communicate the data in a text format (such as json, although that has its own escaping) and set it as text via the DOM.

Quote from Arantor on June 21st, 2011, 11:59 AM

It reattributes old posts to a new user, e.g. new account after deletion, but it doesn't do topic ownership at the same time.

Hmm. I feel like we had something in other/tools/ that did this and did do topic reattribution, which Compuart I think wrote. Weird.

Quote from Arantor on June 21st, 2011, 11:59 AM

You know something's wrong when they're accepting mods on the mod site that are bug-fix mods. I only wish I were kidding.

Ah, this reminds me of YaBB SE 1.5.x. That's when I "joined the scene." I created a package server (reverse engineered how) and centralized bug fixes in an unofficial package that I updated every day or two.

Quote from Arantor on June 21st, 2011, 11:59 AM

You can say that about a lot of things, though. It's almost the same argument against something like Microsoft Word: just because you might only use 2% of its features, that sounds like an argument for stripping out the rest - until you realise that everyone else uses a different 2% of its features.

Well, Word is a boat. I don't know about you, but I don't want to write software like that. I'm the guy who always wanted to remove the calendar from the standard distribution of SMF. It's higher attack surface, more checks for perf impact, more to download, etc.

-[Unknown]

Title: Re: htmlspecialchars while inserting into DB
Post by: Arantor on June 21st, 2011, 06:04 PM

Quote

This wasn't done for security reasons (although it's not necessarily a bad thing security wise.)

Oh? Well, I never went as far back as YaBB or YaBBSE, so I'm looking at it all from SMF 1.1 and later's perspective, and what I've seen said. My understanding was that SMF 1.0 did bbc parsing through regexp, on display, but that a vicious ReDoS could take it out because it wasn't protected against that, hence it was doing through string parsing and done on a regurgitation basis, and that security was one of the major factors.

But yes, your other comment about not changing it on upgrading is a good one. In our case, we explicitly allowed for that sort of thing because we were looking at fixing otherwise broken links when coming from non SMF sources.

Quote

And lastly because you need to be able to edit your posts, and we knew there would be bugs if we tried to "reverse the translation" - but were worried about the additional storage requirements of storing two copies of every single post.

Funny you should mention that, that's exactly what the WYSIWYG editor tries to do, badly. So much so that it doesn't bother converting anything other than simpler HTML.

Quote

1. String parsing in PHP is slow.

No argument. Interesting comment there, there were plans to rewrite the bbc parser in C under the banner of smflib. (Not sure if that was after your time or not) but it never really went anywhere. It's still in SMF's SVN, untouched in pretty much forever.

I think whatever happens, we can't really afford to leave PHP, it's not like we can just conjure up a parser in C, and even if we could and did, I'd honestly not want the hassle of support for something like that.

Quote

2. Caching posts needs a bit of a better solution

Correction: it needs *some* solution.

parse_bbc does caching, but only under certain circumstances: if a post takes over a certain length of time to parse, and the parse_tags parameter wasn't given. parse_tags is only applicable when you invoke the parser with a subset of tags to accept, e.g. for signatures that have a different set of rules to normal.

In real practice, only truly huge posts ever hit this limit anyway.

Quote

What I've done is an "indexed cache", which means I mark entries in the cache that are popular, and over time, cache those popular items.

This sounds like a good idea to me. I'm not sure offhand how we'd do it, but it seems more reasonable than just bulk throwing things at cache.

Quote

3. The parse_bbc() routine is far from perfect

I really need to get properly familiar with TOX-G's innards. Is there anything specific you're thinking about how it operates that you'd do differently?

Quote

4. Someone needs to profile parse_bbc()

*nods* It does need to be done properly. I never seem to find the time to do this though :/

Quote

Right. I think htmlspecialchars is relatively cheap, so I'd either do it on display, or else communicate the data in a text format (such as json, although that has its own escaping) and set it as text via the DOM.

So, then, if we were to htmlspecialchars it on output, presumably we wouldn't do it on saving the data in the first place? Or would we unsanitise it before resanitising it?

My one fear with doing that is that it opens a door I don't really want opened. SMF's ecosystem is populated with those who write SSI functions and jiggle around with the innards, and I'd really rather not give them any more ways to make it easier to make them insecure.

Quote

Hmm. I feel like we had something in other/tools/ that did this and did do topic reattribution, which Compuart I think wrote. Weird.

Well, in 2.0, it's in the admin panel. Except it's broken.

Quote

centralized bug fixes in an unofficial package that I updated every day or two.

See, that's neat. And really, I'm not against the idea. I just wish they'd write rules and stick to them; it always used to be the case when I was on the Cust. team that bugfixes weren't allowed. Then they allow these bugfixes, and they even package up the bugfix for the IE 8 "jumping text bug" for 1.1.x as a mod rather than roll it into the update >_<

Quote

I'm the guy who always wanted to remove the calendar from the standard distribution of SMF.

Things like this I'm in one of two minds about - and it's been discussed here too. My take on it is straightforward enough: everything that's in the core of Wedge should be something that a "reasonable percentage of users" are going to use.

Right now the calendar is seemingly used by the relative minority. A small but vocal group use it for events, and a larger group turn it on because of birthdays.

So, my take on it, specifically, is that birthdays should be separated from the calendar, and then the calendar should either be moved out of the core, or made more useful, so as that more people would find it useful. For example, we've talked about accessing "Today's posts" from the calendar.

While I don't like to point at other systems and go "they do it" (it sort of goes against my "innovate, not imitate" philosophy), I look at WordPress and I see they have it. Different horses for different courses but they have a calendar in the core, enabled by default. Because it suits them to do so.

Given that we have blog support - because it's something we both want and I saw more and more requests for it at SMF (and the fact that if you have Wedge, you shouldn't really need to bridge to WP for a simplistic blog, have one system and be done with it), the calendar takes on a new light - and one that helps justify keeping it.

Title: Re: htmlspecialchars while inserting into DB
Post by: [Unknown] on June 21st, 2011, 07:22 PM

Quote from Arantor on June 21st, 2011, 06:04 PM

Oh? Well, I never went as far back as YaBB or YaBBSE, so I'm looking at it all from SMF 1.1 and later's perspective, and what I've seen said. My understanding was that SMF 1.0 did bbc parsing through regexp, on display, but that a vicious ReDoS could take it out because it wasn't protected against that, hence it was doing through string parsing and done on a regurgitation basis, and that security was one of the major factors.

We didn't even know about ReDoS back then. No, it was because people found workarounds - we kept blacklisting more characters - slash, NULs, etc. but Internet Explorer just had more "lax HTML parsing" than we could deal with. It was a constant source of security holes.

It also had tons of bugs. If you put tags in the wrong order, all sorts of nasty things could happen, and you couldn't quote my name (I see people have seen fit to bring that bug back, though.) Autolinking was a constant source of issues too, and you'd often get HTML in your post because of it conflicting with the bbc regexps.

And I'm sure it was vulnerable to ReDoS (and would've had to be to workaround IE's horribleness.)

Given my experience with that, I'll never ever write a bbc parsing system that uses regexps. If I did, I would be almost certain it was insecure no matter how much I tested it.

Quote from Arantor on June 21st, 2011, 06:04 PM

Funny you should mention that, that's exactly what the WYSIWYG editor tries to do, badly. So much so that it doesn't bother converting anything other than simpler HTML.

This is why I was against a WYSIWYG. I didn't think the two could live side by side properly.

Quote from Arantor on June 21st, 2011, 06:04 PM

No argument. Interesting comment there, there were plans to rewrite the bbc parser in C under the banner of smflib. (Not sure if that was after your time or not) but it never really went anywhere. It's still in SMF's SVN, untouched in pretty much forever.

Oh yeah, me and Compuart worked on smflib together (mostly him, though.) What made me (and I think him) sad was that it wasn't really faster. But he had scripts that automatically converted PHP code into zval using C. We were thinking to make it similar to what HipHop is today, I think. Maybe David stole our idea (just kidding.)

Quote from Arantor on June 21st, 2011, 06:04 PM

I think whatever happens, we can't really afford to leave PHP, it's not like we can just conjure up a parser in C, and even if we could and did, I'd honestly not want the hassle of support for something like that.

If it was built with a discreet, simple protocol, and just did the nuts and bolts of the parsing (the slow part), and was optional like Sphinx, it could make sense for large forums. But, sure.

Quote from Arantor on June 21st, 2011, 06:04 PM

if a post takes over a certain length of time to parse

This was my silly attempt (I think) at trying to not starve the cache of memory.

Quote from Arantor on June 21st, 2011, 06:04 PM

This sounds like a good idea to me. I'm not sure offhand how we'd do it, but it seems more reasonable than just bulk throwing things at cache.

You cache an index (even better with redis, with native types), and then garbage collect it. I posted about this over at yourasoft.org a while back.

Quote from Arantor on June 21st, 2011, 06:04 PM

I really need to get properly familiar with TOX-G's innards. Is there anything specific you're thinking about how it operates that you'd do differently?

parse_bbc puts the string back together IIRC, which is super slow. TOX-G tokenizes (a long standing parsing tradition), and maintains the tokens, then builds from those tokens into a stream.

So either storing them in an array, and join afterward, or maybe the right way would may even be to write to a php://memory stream, and then convert that to a variable, or just use ob_start and just echo. Both of these *should* use better buffers.

Basically, and this is without profiling mind you, but I'm fairly sure the thing that makes parse_bbc slow is the number of mallocs. Any string concatenation operation is gonna be slow.

But these are just guesses, I wouldn't do anything without profiling it and being able to compare.

Quote from Arantor on June 21st, 2011, 06:04 PM

*nods* It does need to be done properly. I never seem to find the time to do this though :/

Well, cutting it up is the hard part. And since it's a bit of a beast, the test cases need to be built out some. I don't know if it still exists, but when I first wrote parse_bbc(), I started some in other/ somewhere for it.

Quote from Arantor on June 21st, 2011, 06:04 PM

So, then, if we were to htmlspecialchars it on output, presumably we wouldn't do it on saving the data in the first place? Or would we unsanitise it before resanitising it?

Yeah, wouldn't do it on saving. This again gives more options: for example, what if I want the subject to just be text, and I want to communicate it via application/json to an API? And then insert it directly into the DOM as a text node? There might be no reason for it to every be html escaped at all.

Quote from Arantor on June 21st, 2011, 06:04 PM

I'd really rather not give them any more ways to make it easier to make them insecure.

Totally agree - true security is easy security. I think it can be made relatively clean and easy, one way or another.

Quote from Arantor on June 21st, 2011, 06:04 PM

Right now the calendar is seemingly used by the relative minority. A small but vocal group use it for events, and a larger group turn it on because of birthdays.

I agree with your definition, except that even if a bunch of people were using the calendar, it doesn't mean it's the only or best option. I think if we want event integration, maybe it makes more sense to use Google Calendar.

Even if both were official options, I see them as official mods.

Quote from Arantor on June 21st, 2011, 06:04 PM

I look at WordPress and I see they have it. Different horses for different courses but they have a calendar in the core, enabled by default. Because it suits them to do so.

A full one like SMF? I had no idea. And will probably never use it.

The reason for my hate toward the calendar is because I consider it to be its own package on its own right. Google did it right; Google Calendar isn't just "part of Gmail", it's a separate app, and it has a lot to it.

-[Unknown]