htmlspecialchars while inserting into DB

Dragooon

  • I can code! Really!
  • polygon.com has to be one of the best sites I've seen recently.
  • Posts: 1,841
The way it's meant to be

Nao

  • Dadman with a boy
  • Posts: 16,082

CJ Jackson

  • I got myself a new iPad, a different world to the iPhone!
  • Posts: 241
Re: htmlspecialchars while inserting into DB
« Reply #2, on June 20th, 2011, 10:35 PM »
I never used mysql_real_escape, I always used prepared statement because I find them a lot easier.

I always used htmlspecialchars for data that goes in a html attribute, I rarely use them for anything else.

Dragooon

  • I can code! Really!
  • polygon.com has to be one of the best sites I've seen recently.
  • Posts: 1,841
Re: htmlspecialchars while inserting into DB
« Reply #3, on June 20th, 2011, 10:39 PM »
But do you htmlspecialchar them while inserting or while outputting? And does prepared statement automatically take care of escaping and security?

CJ Jackson

  • I got myself a new iPad, a different world to the iPhone!
  • Posts: 241
Re: htmlspecialchars while inserting into DB
« Reply #4, on June 20th, 2011, 10:46 PM »
Only while outputting to the browser!  Yes it does take care of escaping and security, prevent SQL injection if written properly!  Also the data that comes out of the database tends to be dirty, it like to keep the back slashes, so clean it with stripslashes() and always before htmlspecialchars().

Arantor

  • As powerful as possible, as complex as necessary.
  • Posts: 14,278
Re: htmlspecialchars while inserting into DB
« Reply #5, on June 21st, 2011, 12:05 AM »
It depends what you're trying to protect against.

mysql_real_escape_string going in will protect you against SQL injections, and in SMF/Wedge's case, the parameterisation is sufficient combined with the aforementioned to effectively neutralise this problem.

htmlspecialchars isn't going to protect you against SQL injections but it's there for XSS and anti-malicious scripting purposes. We cannot ever assume what the user is giving us is clean, so we expressly neuter it before letting it anywhere near the DB.

You'll note that in SMF/Wedge's case, anything destined to be used in the html bbcode is reformatted slightly to de-neuter it.

SMF (and Wedge)'s specific brand of content encoding going into the DB, where bbcode is concerned at least, is slightly odd, and one day I'll figure out exactly why it was done the way it was (remove newlines, expressly inject br tags into the stored content, after htmlspecialchars has been run)

@CJ Jackson: I'd rather not be trying to sanitise on output, I'd rather sanitise it when capturing it so that if something screwball tries to dump the contents of the DB, it's still going to be safe because there isn't anything dirty in the DB.

SMF/Wedge is pretty meticulous inside about stuff going into the DB being clean, shame not all mods are quite so clean about it. And if you're getting slashes from the DB, there's something more broken about content going in, in the first place.
When we unite against a common enemy that attacks our ethos, it nurtures group solidarity. Trolls are sensational, yes, but we keep everyone honest. | Game Memorial

Dragooon

  • I can code! Really!
  • polygon.com has to be one of the best sites I've seen recently.
  • Posts: 1,841

Arantor

  • As powerful as possible, as complex as necessary.
  • Posts: 14,278
Re: htmlspecialchars while inserting into DB
« Reply #7, on June 21st, 2011, 12:39 AM »
There is a second reason why you might/would htmlspecialchars content on output: simply that you may not always be able to guarantee the sanity of runtime variables etc. in the current context.

The thing you might need to watch out for is double-calling, no need to call htmlspecialchars on outputs where it has already been run before...

CJ Jackson

  • I got myself a new iPad, a different world to the iPhone!
  • Posts: 241
Re: htmlspecialchars while inserting into DB
« Reply #8, on June 21st, 2011, 01:27 AM »
I think I might overlooked magic quote, so magic quote might of been causing double escape? :lol:
Posted: June 21st, 2011, 12:55 AM

Just to make you feel better Pete, I did not break the content, WordPress did, it got a magic quote feature of it's own. How nice!  :lol:

Arantor

  • As powerful as possible, as complex as necessary.
  • Posts: 14,278
Re: htmlspecialchars while inserting into DB
« Reply #9, on June 21st, 2011, 01:32 AM »
That's something that bothers me about WP: some people that know what they're doing use PDO etc, others use WP's own functions, and somewhere in the middle, shit happens.

At least with SMF, mods are vetted and generally have had oddities weeded out - so SMF, and Wedge, sort out magic quote once and for all, right up front, and mods are then allowed to use that content - plus they're all supposed to use the proper methodology for inserting data into the DB and querying it using the abstracted methods.

Unless your name is <censored> and you can 'legitimately ignore the rules' because you're one of the people who enforces them, and it worked 'OK' for the more wild and untamed SMF 1.1 so just re-wrapping it in the 2.0 style call is acceptable >:(

[Unknown]

  • Posts: 77
Re: htmlspecialchars while inserting into DB
« Reply #10, on June 21st, 2011, 05:03 AM »
Quote from Dragooon on June 20th, 2011, 09:59 PM
Does this make sense? I've been wondering, wouldn't mysql_real_escape be enough while appending to DB?
I think it doesn't make sense.  I say this even though I think SMF does do it for posts at least, but there are many mistakes I either made or didn't fix when I wrote SMF.

From a security standpoint, the assumption should be that everything else has been compromised.  From an optimization standpoint, the assumption should be everything else is perfect.  Quality lives somewhere between the two.

Generally, I will cast to int things even from the database - because I don't know if my database query had a SQL injection, or maybe something went wrong in my insertion, or another software was compromised and gained access into my database.  The less that an attacker can "gain" even after they successfully exploit a small hole, the better.

Plus, htmlspecialchars'ing before you insert in the database increases space requirements, and makes integration with other systems harder.
Quote from CJ Jackson on June 20th, 2011, 10:35 PM
I never used mysql_real_escape, I always used prepared statement because I find them a lot easier.
If you're using ISO-8859-1, ASCII, or UTF-8, then as long as you escape all the right characters, you are safe.  If your database connection (not your output) is ever in a charset other than utf-8, then it matters a lot and you MUST use mysql_real_escape_string.  Examples are Shift_JIS or Big-5.
Quote from CJ Jackson on June 20th, 2011, 10:35 PM
I always used htmlspecialchars for data that goes in a html attribute, I rarely use them for anything else.
Not just attributes, but also text.
Quote from CJ Jackson on June 20th, 2011, 10:46 PM
Also the data that comes out of the database tends to be dirty, it like to keep the back slashes, so clean it with stripslashes() and always before htmlspecialchars().
You shouldn't escape twice (or allow magic GPC to get in your way) when inserting.  If you have to strip slashes on the way out, you're storing it with slashes in the database.
Quote from Arantor on June 21st, 2011, 12:05 AM
SMF (and Wedge)'s specific brand of content encoding going into the DB, where bbcode is concerned at least, is slightly odd, and one day I'll figure out exactly why it was done the way it was (remove newlines, expressly inject br tags into the stored content, after htmlspecialchars has been run)
This may have been done in the name of optimization, and some parts of it predated me and some parts of it were my fault.  I saw bbc as mostly as scary a beast as anyone did, so I was happy to make sure it worked.  Ultimately, I think the original text should be stored, but it becomes an upgrade problem (and who wants the upgrade on large forums to change all the text?)
Quote from Arantor on June 21st, 2011, 12:05 AM
@CJ Jackson: I'd rather not be trying to sanitise on output, I'd rather sanitise it when capturing it so that if something screwball tries to dump the contents of the DB, it's still going to be safe because there isn't anything dirty in the DB.
No reason not to sanitize both ends, IMHO.
Quote from Arantor on June 21st, 2011, 01:32 AM
At least with SMF, mods are vetted and generally have had oddities weeded out
I like the methodology I tried to push in SMF: if it's broken, fix it.  Same with XMLHttpRequest and etc.  I'll never understand why most devs don't go that way.  My company, in a standard test, has three lines that make XMLHttpRequest work in all browsers (yes, this test is getting a bit old...)  We watch for developers to (a) remove the line and use something else, e.g. jQuery, (b) use XMLHttpRequest directly, or (c) leave the line AND use their own XMLHttpRequest/ActiveXObject or whatever code (that is much longer and harder to read.)

Of people who take it, I think <10% pick a or b.  90% pick route c.  It baffles me.

That said, when people do detection or don't check ini settings properly, "just fixing it" can make integration harder.

-[Unknown]

Arantor

  • As powerful as possible, as complex as necessary.
  • Posts: 14,278
Re: htmlspecialchars while inserting into DB
« Reply #11, on June 21st, 2011, 09:35 AM »
Quote
I say this even though I think SMF does do it for posts at least, but there are many mistakes I either made or didn't fix when I wrote SMF.
SMF and Wedge do it for almost everything that moves into the database, except in a few specific places where the ability to use raw HTML is desirable and only the admin can set that up (board descriptions, censored words - which is why in 2.0 RC4 it became admin only as opposed to being moderate_forum before that).

That is to say, usernames, topic subjects and the like are htmlspecialchars'd with ENT_QUOTES if I remember right, while messages are htmlspecialchars'd, and pushed through preparsecode which fixes up certain bbcode requirements (neuters nobbc contents, enforces the html bbcode is only accessible by admin, attempts to fix/validate certain types of bbcode interaction, e.g. table->tr->td).

I wonder if preparsecode's changing everything into br tags is a holdover from YaBB where everything was in flat files. (I don't know how it was stored, but it seems feasible to me)
Quote
Ultimately, I think the original text should be stored, but it becomes an upgrade problem (and who wants the upgrade on large forums to change all the text?)
That's one of the things we can do something about. We've expressly set ourselves on the path of having an importer rather than just 'upgrading' the existing tables - aside from the fact it lets people run them both side by side to experiment, it also means we can do manipulations along the way like fix any of these things we decide to resolve.
Quote
No reason not to sanitize both ends, IMHO.
Depends what you mean by sanitise. htmlspecialchars both ends strikes me as a bad idea, for example.
Quote
I like the methodology I tried to push in SMF: if it's broken, fix it
Well, the methodology I was referring to was that all mods using the database should use $smcFunc's functions, and should be using the proper parameterisation, or the proper insert method made available that deals with escaping etc for you.

And yes, if it's broken, it should be fixed. That's one of the things that rattled me so much about 2.0 final: an astounding number of bugs were just deferred. While I as much as anyone accept that software is released with bugs, there's a difference between releasing with known bugs and releasing with unknown bugs, as well as releasing with architectural bugs that will need a complete rewrite of components to fix, and it irked me immensely that so many known bugs that weren't major but weren't that hard to fix (in more than one case a fix was provided and shown to work), and were still deferred through some "well, it might cause regressions" mantra.
Quote
Of people who take it, I think <10% pick a or b.  90% pick route c.  It baffles me.
We use jQuery in Wedge. Not just because it means we get to minimise the code to be sent to users (since the admin can pick a CDN copy of jQuery), but the time we spent writing JS is shorter, and I suspect less time is also spent in debugging. Plus a lot of users do add stuff that makes use of jQuery, so having it in the core means plugins won't try each adding it and falling over each other.

[Unknown]

  • Posts: 77
Re: htmlspecialchars while inserting into DB
« Reply #12, on June 21st, 2011, 10:21 AM »
Quote from Arantor on June 21st, 2011, 09:35 AM
I wonder if preparsecode's changing everything into br tags is a holdover from YaBB where everything was in flat files. (I don't know how it was stored, but it seems feasible to me)
Now that you explain which ones, I'm pretty sure they all were holdovers.  Definitely the br thing was.  And I know the nobbc/html thing was just because that was the easiest way.
Quote from Arantor on June 21st, 2011, 09:35 AM
That's one of the things we can do something about. We've expressly set ourselves on the path of having an importer rather than just 'upgrading' the existing tables - aside from the fact it lets people run them both side by side to experiment, it also means we can do manipulations along the way like fix any of these things we decide to resolve.
While I agree, it also means large forums (where's Douglas at anyway?) are a big problem.
Quote from Arantor on June 21st, 2011, 09:35 AM
Depends what you mean by sanitise. htmlspecialchars both ends strikes me as a bad idea, for example.
Sure.  But I explicitly don't agree with the "garbage in, garbage out" philosophy when it comes to security.  An example: just because you only put valid filenames into the attachments table doesn't mean you shouldn't validate that assumption on the way out too.  This also improves upgrade scenarios anyway (if done right.)
Quote from Arantor on June 21st, 2011, 09:35 AM
Well, the methodology I was referring to was that all mods using the database should use $smcFunc's functions, and should be using the proper parameterisation, or the proper insert method made available that deals with escaping etc for you.
I was referring to where you talked about magic quotes.  Maybe it was the wrong way to go at the time, but the codebase expected them on as well as register_globals, so I decided the most secure route was to move forward with magic quotes on and kill register_globals (because with either of those reversed, could've had security holes if I made mistakes, rather than just bugs.  I prefer bugs.)
Quote from Arantor on June 21st, 2011, 09:35 AM
were still deferred through some "well, it might cause regressions" mantra.
Well, at work, this is a pretty common mantra.  I'm even holding back changes on a project right now for this reason.

At the same time, I agree.  Especially with the release of a new major version number, you gotta get it right or you'll pay for those mistakes for a while.  I know Chrome and Firefox are going the way of loosey-goosey, but I'm not convinced that will work for either of them long term, especially Firefox.

Back whenever ago, I posted the same basic thing on simplemachines.org; if you need more RCs, release more RCs.  There's no shame in releasing RCs.  I'm still considering TOX-G alpha, after all (and probably will until I write better docs, some more useful samples, and add a couple nagging features that I want.)  It's being used in production by a select few, and I'm ready to help those implementations upgrade if necessary if I have to make breaking changes.

Sad to hear that SMF didn't do the same.
Quote from Arantor on June 21st, 2011, 09:35 AM
We use jQuery in Wedge. Not just because it means we get to minimise the code to be sent to users (since the admin can pick a CDN copy of jQuery), but the time we spent writing JS is shorter, and I suspect less time is also spent in debugging. Plus a lot of users do add stuff that makes use of jQuery, so having it in the core means plugins won't try each adding it and falling over each other.
Well, jQuery is fine.  Most people who take the test, even if they add jQuery, leave that line in and don't touch it; they clearly don't understand it.  That's what's a bad sign.

[rant]

I don't hate jQuery, but I don't really like it either.  It's not a bad choice and very sane for most web apps, though.  My problem with it is that it's got a lot of features, most of which I don't need, and doesn't provide most of the features I do.  It has animations, but doesn't do colors; it deals with nodelists spectacularly, but makes me learn a second DOM syntax to do so; it has whiz-bang solutions like $.each, but it doesn't have convenience funcs to build this-bound delegates; it makes it super easy to build html, but in a way that encourages XSS-vulnerable code; and it supports xmlns selectors but doesn't make them compat in IE browsers.  Off the top of my head.

[/rant]

But yeah, it's popular and people are learning it, so it makes sense.

-[Unknown]

Arantor

  • As powerful as possible, as complex as necessary.
  • Posts: 14,278
Re: htmlspecialchars while inserting into DB
« Reply #13, on June 21st, 2011, 11:59 AM »
Quote
Definitely the br thing was.  And I know the nobbc/html thing was just because that was the easiest way.
I'm not knocking the nobbc/html thing in any way, frankly I think it's the way to go.
Quote
While I agree, it also means large forums (where's Douglas at anyway?) are a big problem.
Large forums are always going to be a problem when you're moving to anything else, if it does anything differently. I'd rather handle it at conversion time and end up with something cleaner - start as you mean to go on.

Douglas is still running his forums but I think he's moved on from his old job at SolarVPS these days, not sure what he's up to now though.
Quote
Sure.  But I explicitly don't agree with the "garbage in, garbage out" philosophy when it comes to security.  An example: just because you only put valid filenames into the attachments table doesn't mean you shouldn't validate that assumption on the way out too.  This also improves upgrade scenarios anyway (if done right.)
That's why I said it depends what you're talking about. As a minimum, what goes into the DB should be sane/safe. The real question about coming back out is how much effort it is to validate that assumption, which is where the performance/security balance comes back in.

The big example is parse_bbc. It is the single slowest component in SMF and Wedge, and it is a scary monster because it's expressly checking that what's coming out is safe, with certain protections carried out separately in preparsecode (like sanitising nobbc and html). I think it's important that some element of that is continued to be done, but how far do you go with it? At what point is it acceptable to trade off performance for security?

On the flip side, there are things like topic subjects, which aren't resanitised on display, what's in the DB is assumed to be safe.
Quote
Well, at work, this is a pretty common mantra.  I'm even holding back changes on a project right now for this reason.
It's a common mantra, but in the cases I'm thinking of, it's misapplied.

One that really annoyed me is the reattribute posts function. It reattributes old posts to a new user, e.g. new account after deletion, but it doesn't do topic ownership at the same time. Someone demonstrated the query, it works and I don't see anything about it that would cause regression. But it was deferred.

At the same time, the dreaded "IE 8 jumping text bug", which they fixed late in the day, and that DID cause a regression which they sort of fixed in the templates without fixing the root cause - the editor component now can't be given a width from the create function, so any mod that invokes the editor and asks for the default of 70% width doesn't get it, it just fills the maximum available width, and it's "acceptable" to rewrite it in the template to set the width there, rather than setting it when it's invoked. Which means, I had to go fix all the instances of this in SimpleDesk, because they fixed one bug and made another, which to me is a regression.
Quote
Back whenever ago, I posted the same basic thing on simplemachines.org; if you need more RCs, release more RCs.  There's no shame in releasing RCs
To be fair to them, a lot of people were very fed up of a new RC every six months or so that fixed some bugs and not a lot else. The problem is that all the signs hint at 2.0.1 not being a bugfix release but only in the event of security fixes, so some of the deferred bugs will only be solved in 2.1 or higher, which is a butt-ache if they've been on the tracker for a couple of years, especially if there are posted solutions.

You know something's wrong when they're accepting mods on the mod site that are bug-fix mods. I only wish I were kidding.
Quote
Well, jQuery is fine.  Most people who take the test, even if they add jQuery, leave that line in and don't touch it; they clearly don't understand it.  That's what's a bad sign.
On the other hand, if it's there, it's there for a reason, and usually that reason is because other stuff depends on it working exactly that way. If I were to add jQuery to something like that, I'd be flagging up that this other item was to be removed and all instances that use it should proceed to use jQuery.

I'm willing to bet that not everyone who left it alone did so because they didn't understand it - but I won't argue that it's certainly the more likely reason.
Quote
My problem with it is that it's got a lot of features, most of which I don't need, and doesn't provide most of the features I do.
You can say that about a lot of things, though. It's almost the same argument against something like Microsoft Word: just because you might only use 2% of its features, that sounds like an argument for stripping out the rest - until you realise that everyone else uses a different 2% of its features.

It's a convenient go-to place, but like everything else it has pros and cons that need to be weighed up for the project at hand. For us, it's convenient: it provides a consistent XHR object with some nice conveniences, animation which can be nice for UX if done properly, and it makes DOM manipulation easy for the sorts of thing we're doing.

Plus it's available via CDN, so it's not like users probably won't have it cached.

And, it's popular. But yes, it can encourage bad behaviours - but so too does PHP ;)

[Unknown]

  • Posts: 77
Re: htmlspecialchars while inserting into DB
« Reply #14, on June 21st, 2011, 05:00 PM »
Quote from Arantor on June 21st, 2011, 11:59 AM
I think it's important that some element of that is continued to be done, but how far do you go with it? At what point is it acceptable to trade off performance for security?
This wasn't done for security reasons (although it's not necessarily a bad thing security wise.)  It was done to make upgrading not require updating past posts, and because user settings and admin settings can affect the bbc and we didn't want to deal with "recalculating."  And lastly because you need to be able to edit your posts, and we knew there would be bugs if we tried to "reverse the translation" - but were worried about the additional storage requirements of storing two copies of every single post.

Ultimately, for that, I think several things:

1. String parsing in PHP is slow.
Actually, it's slow in a lot of languages (but especially PHP), because they use a NUL at the end of the string, and parsing is slicing heavy.  I think for larger forums, the only good solution is to write (parts of?) the parse_bbc algorithm in a tighter language, such as D.  This could be done with basic IPC kinda like memcached works.

2. Caching posts needs a bit of a better solution
A lot of people want to just "cache everything."  Throw physical boxes at the problem, and use all of them as ram sticks, and just cache, cache, cache, cache.  This is expensive, and becomes hard to manage, expunge, etc.
What I've done is an "indexed cache", which means I mark entries in the cache that are popular, and over time, cache those popular items.  Then when they get unpopular, they are garbage collected.  We use this at work, and it requires very little memory, but significantly benefits viral traffic (which I'm pretty sure is the same sort of traffic forums get.)

3. The parse_bbc() routine is far from perfect
The way TOX-G parses is much better.  The BBC parsing routine was written in a similar way to how one would write it in a language like C.  I naively hoped PHP would turn this into optimal code (and it was the easiest way to write anyway), but it doesn't.

4. Someone needs to profile parse_bbc()
I think Groundup was trying to do this, but I don't know how much he did with it.  It needs to be broken up into smaller functions - ideally like ~5 of them or more - so that the performance problem can be identified.  I have some ideas where performance may be bad, but guessing isn't the best road to improvement.
Quote from Arantor on June 21st, 2011, 11:59 AM
On the flip side, there are things like topic subjects, which aren't resanitised on display, what's in the DB is assumed to be safe.
Right.  I think htmlspecialchars is relatively cheap, so I'd either do it on display, or else communicate the data in a text format (such as json, although that has its own escaping) and set it as text via the DOM.
Quote from Arantor on June 21st, 2011, 11:59 AM
It reattributes old posts to a new user, e.g. new account after deletion, but it doesn't do topic ownership at the same time.
Hmm.  I feel like we had something in other/tools/ that did this and did do topic reattribution, which Compuart I think wrote.  Weird.
Quote from Arantor on June 21st, 2011, 11:59 AM
You know something's wrong when they're accepting mods on the mod site that are bug-fix mods. I only wish I were kidding.
Ah, this reminds me of YaBB SE 1.5.x.  That's when I "joined the scene."  I created a package server (reverse engineered how) and centralized bug fixes in an unofficial package that I updated every day or two.
Quote from Arantor on June 21st, 2011, 11:59 AM
You can say that about a lot of things, though. It's almost the same argument against something like Microsoft Word: just because you might only use 2% of its features, that sounds like an argument for stripping out the rest - until you realise that everyone else uses a different 2% of its features.
Well, Word is a boat.  I don't know about you, but I don't want to write software like that.  I'm the guy who always wanted to remove the calendar from the standard distribution of SMF.  It's higher attack surface, more checks for perf impact, more to download, etc.

-[Unknown]