Buggy Feed links

Nao

  • Dadman with a boy
  • Posts: 16,082
Re: Buggy Feed links
« Reply #15, on April 14th, 2012, 10:41 PM »
But technically would it be possible to forget phpsessid in the first page view?
Maybe we could enforce loading a second page like login before logging in is accepted.

As for feeds, are we sure phpsessid is used in these? If yes we should certainly ensure it isn't included...

Arantor

  • As powerful as possible, as complex as necessary.
  • Posts: 14,278
Re: Buggy Feed links
« Reply #16, on April 14th, 2012, 10:50 PM »
Quote
But technically would it be possible to forget phpsessid in the first page view?
PHPSESSID is invoked when there isn't an existing Wedge cookie (like the start of every browsing session from a search engine) and that first page view, it will always shove PHPSESSID links everywhere. Remove your cookie, refresh, boom you'll see them.
Quote
Maybe we could enforce loading a second page like login before logging in is accepted.
We actually do, as it happens. You cannot go into login2 if you don't have a valid session. But that's not really the problem.

The problem that we have to weigh up is accuracy of reporting vs. PHPSESSID in URLs. Specifically, it's simply about tracking how many 'probably unique' guests there are, given the requests being made, since the whole nature of cookies is a friggin' bolt on to the specification in the first place (as HTTP is specifically designed to be stateless)
Quote
As for feeds, are we sure phpsessid is used in these? If yes we should certainly ensure it isn't included...
Yes, I am sure. It's not possible to avoid them, because it still goes through ob_sessrewrite to handle pretty URLs, so the usual logic applies - namely that if no cookie was found, PHPSESSID is injected. And given the approach made by Google Reader etc., they wouldn't have a cookie, so they get PHPSESSIDs, which is what we're seeing here (and I separately validated that PHPSESSID was added)
When we unite against a common enemy that attacks our ethos, it nurtures group solidarity. Trolls are sensational, yes, but we keep everyone honest. | Game Memorial

nolsilang

  • Lurking <i class=
  • Posts: 106

Nao

  • Dadman with a boy
  • Posts: 16,082
Re: Buggy Feed links
« Reply #18, on April 15th, 2012, 05:32 PM »
Currently testing Wedge with cookies disabled... (Fun.)
Quote from Arantor on April 14th, 2012, 10:50 PM
PHPSESSID is invoked when there isn't an existing Wedge cookie (like the start of every browsing session from a search engine) and that first page view, it will always shove PHPSESSID links everywhere. Remove your cookie, refresh, boom you'll see them.
Yeah, I know that from countless forums visited where I got the phpsessid on first page and it was removed later. ;)

However, Wedge does it this way... First of all, it determines whether user is a robot. If it is, it will skip adding PHPSESSID to links. Which is the reason why they never show up in our RSS feeds in Google Reader. So the reason why it broke in GR is that the URLs returned were not prettified.

0x, your screenshots would be more helpful if they showed what URL these entries linked to ;) (So, basically... All of them have ?topic=... in the URL, right?)
Quote
Quote
Maybe we could enforce loading a second page like login before logging in is accepted.
We actually do, as it happens. You cannot go into login2 if you don't have a valid session. But that's not really the problem.
I was thinking, maybe we could simply put the phpsessid links on the login links... Instead of all links. Maybe that would help... i.e. if you're not going to log in, there's no reason you should have a constant session (of course arises the problem of increased opened sessions, but it can be solved another way I suppose... We could even track people by IP...), while if they want to login, then they'll want to have their session followed just to give Wedge time to process the user name & password... By that time, the browser correctly reports a cookie anyway, so we won't need to add the phpsessid links by then.

Of course, there's still the problem of browsers that disable cookies entirely... i.e. if you're logged in and have cookies disabled like me right now, you absolutely need a phpsessid link in the URL.

Now, about phpsessid...
- Looking for it directly (in the code) is bad manners. Pretty URLs did that. I'm going to strip that code out -- in fact, it should be looking for SID, which is "PHPSESSID=abcdef" altogether. This also means doing a str_replace instead of a preg_replace on all links, because we know the session value. (Same for $session_var... No need for a preg_replace. I simply used $context['session_query'] instead. I knew that one would prove useful one day ;))
- I'm not sure why, but SMF and Wedge both add "&" at the end of the SID URL... Instead of simply using ";". I'm not sure SMF/Wedge would work *at all* if the installed PHP didn't support ";" as a separator.
- Also, SMF/Wedge use SID != '' in one place (ob_sessrewrite), and defined('SID') in others. I don't see how SID could possibly be, err, set to an empty string if it was defined... Or maybe it's always defined, in which case I should probably test against that when I get to re-enable cookies... Do you remember anything about that anyway?
Quote
Quote
As for feeds, are we sure phpsessid is used in these? If yes we should certainly ensure it isn't included...
Yes, I am sure.
(Jean Dujardin voice) Not so fast, Mr. Bond!
Quote
It's not possible to avoid them, because it still goes through ob_sessrewrite to handle pretty URLs, so the usual logic applies - namely that if no cookie was found, PHPSESSID is injected. And given the approach made by Google Reader etc., they wouldn't have a cookie, so they get PHPSESSIDs, which is what we're seeing here (and I separately validated that PHPSESSID was added)
As pointed out above, SID isn't injected is $user_info['possibly_robot'] is true, which is always the case for Google Bot. (And probably Google Reader's bot as well.)
Posted: April 15th, 2012, 05:30 PM

Yup, looks like SID is always defined, and just empty when cookies are set...

Arantor

  • As powerful as possible, as complex as necessary.
  • Posts: 14,278
Re: Buggy Feed links
« Reply #19, on April 15th, 2012, 05:58 PM »
Quote
However, Wedge does it this way... First of all, it determines whether user is a robot. If it is, it will skip adding PHPSESSID to links
Except that not all feed readers are detected - and IIRC, Google Reader does not trip that test.
Quote
So the reason why it broke in GR is that the URLs returned were not prettified.
So not being prettified will make every topic show up as a new topic (because of a unique URL) every time it looks? Even here, where URLs were being prettified, PHPSESSID was being injected - I specifically checked this when it was reported.
Quote
Of course, there's still the problem of browsers that disable cookies entirely... i.e. if you're logged in and have cookies disabled like me right now, you absolutely need a phpsessid link in the URL.
There is a (valid) argument about rejecting those cases and disallowing it entirely for security reasons.
Quote
I'm not sure why, but SMF and Wedge both add "&" at the end of the SID URL... Instead of simply using ";". I'm not sure SMF/Wedge would work *at all* if the installed PHP didn't support ";" as a separator.
Because & is the argument separator defined in PHP's configuration. Check the code in QueryString, if ; is arg-separator, it does one thing, but otherwise, it does something else to manually parse out the parameters. ; is just not the default. But we get cases, just for fun, of malformed entities being prepared occasionally too.
Quote
(Jean Dujardin voice) Not so fast, Mr. Bond!
Quote
As pointed out above, SID isn't injected is $user_info['possibly_robot'] is true, which is always the case for Google Bot. (And probably Google Reader's bot as well.)
Seriously, it IS injecting PHPSESSID. It does it on the RSS validator. I have no idea what user agent Google Reader uses but I'm willing to bet it doesn't trip possibly_robot. It's actually irrelevant in this case. It does not matter whether it trips possibly_robot or not, it should NEVER be issuing PHPSESSID in feeds, ever, because some other feed readers will choke on non unique URLs, I know Thunderbird used to have issues with it, for example (because that didn't bother with cookies at one time)

nolsilang

  • Lurking <i class=
  • Posts: 106

Nao

  • Dadman with a boy
  • Posts: 16,082
Re: Buggy Feed links
« Reply #21, on April 15th, 2012, 11:41 PM »
@0x> It SHOULD be working now... I made a test with Google Reader and it no longer shows PHPSESSID links :)
I took the lazy way out really... Added a context variable in Feed.php to say we don't want to insert the session ID. This is tested against in Subs-Template.php. The reason why I was lazy for it, is that (1) just testing for Feedfetcher-Google (or whatever it's called) isn't going to do any good for other feed reader bots, (2) there is VERY little reason to have a session ID in a feed URL. If you have cookies disabled, then your session won't be active forever. Your feed reader will soon end up trying to access an incorrect session ID anyway.
Quote from Arantor on April 15th, 2012, 05:58 PM
Quote
Of course, there's still the problem of browsers that disable cookies entirely... i.e. if you're logged in and have cookies disabled like me right now, you absolutely need a phpsessid link in the URL.
There is a (valid) argument about rejecting those cases and disallowing it entirely for security reasons.
So.. What do we do, eh!
Quote
Because & is the argument separator defined in PHP's configuration. Check the code in QueryString, if ; is arg-separator, it does one thing, but otherwise, it does something else to manually parse out the parameters. ; is just not the default. But we get cases, just for fun, of malformed entities being prepared occasionally too.
I'm aware of how SMF/Wedge deals with it. But I can't find of any (good) reason to disable the ability to specify a new argument separator. I for one never heard of such a case... And I don't even think it would make a forum work because we're assuming pretty much everywhere that ; is the separator, and using it in links... And I don't seem to remember Wedge parsing all links to replace ; to & manually ;) (Not only that, but it's silly to assume that adding & will work everywhere... If a link is inside CDATA tags, then you're screwed!)

I would be tempted to say that we should remove that compatibility code...
Quote
Seriously, it IS injecting PHPSESSID. It does it on the RSS validator. I have no idea what user agent Google Reader uses but I'm willing to bet it doesn't trip possibly_robot. It's actually irrelevant in this case. It does not matter whether it trips possibly_robot or not, it should NEVER be issuing PHPSESSID in feeds, ever, because some other feed readers will choke on non unique URLs, I know Thunderbird used to have issues with it, for example (because that didn't bother with cookies at one time)
And we should have dealt with that long ago. Seriously, I'm surprised we had the problem at all, because if you'll look at noisen.com, the phpsessid is never injected *at all* into feeds...
Re: Buggy Feed links
« Reply #22, on April 15th, 2012, 11:45 PM »
Hmm... Double-checked the Noisen codebase, and it doesn't have anything different in that respect... Which is probably why I was so positive Wedge didn't have the problem either.
Anyway, now that we're sure it's never gonna happen... I can safely move back to working on PURLs ;)

Arantor

  • As powerful as possible, as complex as necessary.
  • Posts: 14,278
Re: Buggy Feed links
« Reply #23, on April 16th, 2012, 12:06 AM »
Quote
I would be tempted to say that we should remove that compatibility code...
Then it WILL break. The default in PHP itself is &, as per http://php.net/manual/en/ini.core.php#ini.arg-separator.input and the choice to use ; is for avoiding all kinds of nightmares. QueryString.php, line 81 onwards.

Code: [Select]
// Are we going to need to parse the ; out?
if (strpos(@ini_get('arg_separator.input'), ';') === false && !empty($_SERVER['QUERY_STRING']))
{
// Get rid of the old one! You don't know where it's been!
$_GET = array();

// Was this redirected? If so, get the REDIRECT_QUERY_STRING.
$_SERVER['QUERY_STRING'] = urldecode(substr($_SERVER['QUERY_STRING'], 0, 5) === 'url=/' ? $_SERVER['REDIRECT_QUERY_STRING'] : $_SERVER['QUERY_STRING']);

// Replace ';' with '&' and '&something&' with '&something=&'. (This is done for compatibility...)
parse_str(preg_replace('/&(\w+)(?=&|$)/', '&$1=', strtr($_SERVER['QUERY_STRING'], array(';?' => '&', ';' => '&', '%00' => '', "\0" => ''))), $_GET);

// Magic quotes still applies with parse_str - so clean it up.
if (function_exists('get_magic_quotes_gpc') && @get_magic_quotes_gpc() != 0 && empty($settings['integrate_magic_quotes']))
$_GET = $removeMagicQuoteFunction($_GET);
}
elseif (strpos(@ini_get('arg_separator.input'), ';') !== false)
{
if (function_exists('get_magic_quotes_gpc') && @get_magic_quotes_gpc() != 0 && empty($settings['integrate_magic_quotes']))
$_GET = $removeMagicQuoteFunction($_GET);

// Search engines will send action=profile%3Bu=1, which confuses PHP.
foreach ($_GET as $k => $v)
{
if (is_string($v) && strpos($k, ';') !== false)
{
$temp = explode(';', $v);
$_GET[$k] = $temp[0];

for ($i = 1, $n = count($temp); $i < $n; $i++)
{
@list ($key, $val) = @explode('=', $temp[$i], 2);
if (!isset($_GET[$key]))
$_GET[$key] = $val;
}
}

// This helps a lot with integration!
if (strpos($k, '?') === 0)
{
$_GET[substr($k, 1)] = $v;
unset($_GET[$k]);
}
}
}

So we have to leave that code in, unless you plan on fixing every URL to not use ; (and with all the other problems related to it). SMF and Wedge using ; is a definite oddity though it does solve so many problems.
Quote
And we should have dealt with that long ago. Seriously, I'm surprised we had the problem at all, because if you'll look at noisen.com, the phpsessid is never injected *at all* into feeds...
I can think of multiple reasons why it might be different between the two. Either way, the bottom line is that it wasn't and while I have patched around it, I'm not happy with it (which is why I haven't yet committed it)
Quote
Added a context variable in Feed.php to say we don't want to insert the session ID. This is tested against in Subs-Template.php. The reason why I was lazy for it, is that (1) just testing for Feedfetcher-Google (or whatever it's called) isn't going to do any good for other feed reader bots, (2) there is VERY little reason to have a session ID in a feed URL
Oh, you've added it, then much as I had. And yes, exactly, there is no reason to have it in the URL, and testing for bots isn't enough because of the sheer variety of feed consumers.

Note that SID containing a & is injected by SMF and Wedge, and I have no idea why ; wasn't used there.
Quote
If you have cookies disabled, then your session won't be active forever. Your feed reader will soon end up trying to access an incorrect session ID anyway.
Sure that's the case. Hence having feeds never contain a session id - and also why I was so adamant about fixing the canonical URL link as well, which also got SID added into it.

The whole thing about using SID in URLs is an interesting one and one I've been unwilling to make a move on because I'm inclined to think that having 'probably accurate' stats about the number of guests is probably slightly more important than having an SEO benefit to it (though having a canonical URL should fix most issues)

If probably_robot were more thorough, we can be happier about leaving it in. On the other hand, note, cookies being disabled will break other functionality anyway. It's a tough one to call :/

nolsilang

  • Lurking <i class=
  • Posts: 106
Re: Buggy Feed links
« Reply #24, on April 16th, 2012, 02:34 AM »
Ah.. yes now the feed linked to wedge.org.

Nao

  • Dadman with a boy
  • Posts: 16,082
Re: Buggy Feed links
« Reply #25, on April 16th, 2012, 08:34 AM »
(Post written last evening... Axed because of GF axing me for being late :^^;:)

My current implementation has the index.php remover disabled if no cookie is set. This is (was) because the session ID injector wouldn't correctly identify static cache files as static, and would inject the SID into the names.
So I rewrote it to correctly test for a question mark or nothing at all (this was complicated by the fact I had to ensure there wasn't already a SID in the URL.) I'm not sure this is best. Instead of \\?? after the SID test (beginning of ob_sessrewrite), which is already a bit of a buggy thing (you only need to escape the question mark once... it'll still work though, it's just a waste!), I added (?:\?|(?=")), meaning, "either followed by a question mark or a double quote", so it should always inject to homepage links too. Dunno if it's going to be very fast... So maybe I should just not care and simply insert index.php links to cookieless guests..?! (They'll still be transformed to /? if the remover is enabled! Only homepage links won't.)

Links to the homepage are annoying for another reason... As they're not treated by purls, we always end up with a ";" at the end, the one that's injected by the SID test... (Currently '&' in SVN, which is even worse... Or maybe it isn't so? I remember removing a test against & which was probably done for that reason...?)

Arantor

  • As powerful as possible, as complex as necessary.
  • Posts: 14,278
Re: Buggy Feed links
« Reply #26, on April 16th, 2012, 11:50 AM »
Quote
This is (was) because the session ID injector wouldn't correctly identify static cache files as static, and would inject the SID into the names.
Yes, that's important. The number of people who got bitten by the 'trick' to alter where $scripturl was defined, was frightening.
Quote
As they're not treated by purls, we always end up with a ";" at the end, the one that's injected by the SID test... (Currently '&' in SVN, which is even worse... Or maybe it isn't so? I remember removing a test against & which was probably done for that reason...?)
Actually, it's worse than that. When this topic kicked off and I threw it at the validator, that error it was giving relates to the fact that there were *malformed* entities in the data, namely amp without the trailing ; in it, where injected by this code.

Nao

  • Dadman with a boy
  • Posts: 16,082
Re: Buggy Feed links
« Reply #27, on April 16th, 2012, 03:11 PM »
Quote from Arantor on April 16th, 2012, 12:06 AM
Then it WILL break.
It will break for servers that don't support both @ini_set and the specific setting that is arg_separator...
Quote
The default in PHP itself is &, as per http://[url]http://php.net/manual/en/ini.core.php#ini.arg-separator.input[/url] and the choice to use ; is for avoiding all kinds of nightmares. QueryString.php, line 81 onwards.
I know, I modified that area while harmonizing the preg_replace delimiters last week ;) Ended up being in my commit of today BTW...
That code block is about providing compatibility with & (which it doesn't do entirely because it doesn't change links in the HTML afaik..??), and fixing other things.
Quote
So we have to leave that code in, unless you plan on fixing every URL to not use ; (and with all the other problems related to it).
Actually, as I said, I'd rather we don't provide support for & links... (Well, inside PURLs if anything!)
Quote
SMF and Wedge using ; is a definite oddity though it does solve so many problems.
Oh, it's so much better than using &... Because we don't have to go through &(amp;)? when doing link checking... Of course, there's the issue with php_security or whatever it's called, but all ";id" calls are now ";in" or something anyway...
Quote
I can think of multiple reasons why it might be different between the two.
That's "multiple" better than me!
Quote
Either way, the bottom line is that it wasn't and while I have patched around it, I'm not happy with it (which is why I haven't yet committed it)
Committed what...? (I already feel a commit conflict coming...)
Quote
Note that SID containing a & is injected by SMF and Wedge, and I have no idea why ; wasn't used there.
Which is why I changed it into a ;...
I suspect that's a leftover from an old version.
Sometimes, you need to re-break things to understand what they tried to fix in the first place.
Quote
Sure that's the case. Hence having feeds never contain a session id - and also why I was so adamant about fixing the canonical URL link as well, which also got SID added into it.
And I really (really) think we should only add SID to the login and register links *on first visit*, i.e. if cookie is never set...
Quote
The whole thing about using SID in URLs is an interesting one and one I've been unwilling to make a move on because I'm inclined to think that having 'probably accurate' stats about the number of guests is probably slightly more important than having an SEO benefit to it (though having a canonical URL should fix most issues)
But how come we can't simply determine that a new session with the same IP is simply the same session...? (Or at least the same person, when it comes to calculating stats...)
If anything, it's actually more accurate to calculate things this way -- different IPs, rather than different sessions.
Posted: April 16th, 2012, 03:09 PM
Quote from Arantor on April 16th, 2012, 11:50 AM
Actually, it's worse than that. When this topic kicked off and I threw it at the validator, that error it was giving relates to the fact that there were *malformed* entities in the data, namely amp without the trailing ; in it, where injected by this code.
Which was due to my currently doing a rtrim() on the final URL and deleting any extra characters like ";"...
Previously, it was first removing & at the end and then removing ; or whatever.
When I noticed that, I simply asked Wedge to remove &amp (without the semicolon) from the URL...
And then I ended up completely removing it. Because it smelt bad.

Arantor

  • As powerful as possible, as complex as necessary.
  • Posts: 14,278
Re: Buggy Feed links
« Reply #28, on April 16th, 2012, 07:56 PM »
Before I reply, I should note that I've been fighting with DIY endeavours today and am very tired and frustrated (and still only got one of the two items built)
Quote
It will break for servers that don't support both @ini_set and the specific setting that is arg_separator...
You can't set it with ini_set. It's a PERDIR setting, not an ALL setting. You can only set it via .htaccess if your host allows you to do so (since it's a PHP setup thing, if you do it by ini_set, it's already too late as $_GET will by then already have been processed)
Quote
That code block is about providing compatibility with & (which it doesn't do entirely because it doesn't change links in the HTML afaik..??), and fixing other things.
No, no, no it doesn't. Read what the code does. It detects whether arg_separator is ; or not, if it is, it sanitises $_GET based on it, if it's not, it re-parses the query string.

You cannot do this in ini_set, and if you make it rely on having ; set as arg_separator, Wedge WILL break if that code is removed.
Quote
Actually, as I said, I'd rather we don't provide support for & links... (Well, inside PURLs if anything!)
Are you planning on restricting Wedge to hosts that are VPSes or better, or allow users to reconfigure things in .htaccess (specifically PHP settings via .htaccess)?

You don't actually have much of a choice on this. arg_separator is not ; by default, it's &, and since that's set up prior to a PHP script running, you don't have much choice in how that works. Either you restrict Wedge to hosts that allow custom configuration, or you leave the code alone because it works just fine as it is.
Quote
Committed what...? (I already feel a commit conflict coming...)
I did the same thing you did, a $context setting to have it never inject SID into URLs. But since you've committed it, I didn't bother (and I wasn't that happy with it, I felt there should have been something more elaborate)
Quote
Sometimes, you need to re-break things to understand what they tried to fix in the first place.
I know what they were trying to fix, I know exactly what and how they were trying to fix it, I just don't understand why that was there.
Quote
But how come we can't simply determine that a new session with the same IP is simply the same session...? (Or at least the same person, when it comes to calculating stats...)
For the same reason IP addresses are unreliable as sin anyway, things like NAT routing, where an IP address does not relate to a unique computer any more nor to a unique user.
Quote
Which was due to my currently doing a rtrim() on the final URL and deleting any extra characters like ";"...
Previously, it was first removing & at the end and then removing ; or whatever.
Except that I saw cases where the trailing ; was present and cases where it wasn't - in the same page.

Nao

  • Dadman with a boy
  • Posts: 16,082
Re: Buggy Feed links
« Reply #29, on April 18th, 2012, 05:58 PM »
Quote from Arantor on April 16th, 2012, 07:56 PM
You can't set it with ini_set. It's a PERDIR setting, not an ALL setting. You can only set it via .htaccess if your host allows you to do so (since it's a PHP setup thing, if you do it by ini_set, it's already too late as $_GET will by then already have been processed)
As I pointed out in my thought -- I got confused. Twas a bad day for me I guess... (Not that it's much better now.)
Obviously you can't set the separator *after* you started the PHP page, because you already have the query string in the $_SERVER variable, ah ah :)
I was a little too proud regarding that piece of code. Because I'd modified it for Noisen, I 'thought' I knew it by heart. I'd actually forgotten all about it...
Quote
No, no, no it doesn't. Read what the code does. It detects whether arg_separator is ; or not, if it is, it sanitises $_GET based on it, if it's not, it re-parses the query string.
We should always sanitize $_GET anyway. Just assume that ; is always used to separate stuff. Which is the case, anyway...
Occasionally though, SMF uses '&', as in the SID injector, and I don't know what it's there for. I'm assuming this is to 'emulate' enable_trans_sid being set to 1. Because I guess some servers won't allow you to use ini_set, if it's at 1, then it automatically injects & at the end of the SID and then you end up with two different URL styles to deal with -- ?PHPSESSID=...; and ?PHPSESSID=...& -- so they chose to inject & and deal with it later (which I don't believe they do...)
I for one couldn't care less about that, but I suppose we should deal with it overall?
Quote
I did the same thing you did, a $context setting to have it never inject SID into URLs. But since you've committed it, I didn't bother (and I wasn't that happy with it, I felt there should have been something more elaborate)
I hope you liked the name I chose for the setting :lol:
Although I'm never one to refuse SID! But only if it's C64-related ;)
Quote
For the same reason IP addresses are unreliable as sin anyway, things like NAT routing, where an IP address does not relate to a unique computer any more nor to a unique user.
The same could be said of a session ID. If I have my forum running on 3 different browsers, should Wedge say I have 3 users online...? For debugging purpose it's fine. What about Googlebot? Sometimes you get many requests from the same IP IIRC. Because they're all using different sessions, Wedge considers them as multiple bots. I for one would rather see it recognized as a single 'user'.
Quote
Quote
Which was due to my currently doing a rtrim() on the final URL and deleting any extra characters like ";"...
Previously, it was first removing & at the end and then removing ; or whatever.
Except that I saw cases where the trailing ; was present and cases where it wasn't - in the same page.
It should always be there in homepage links (index.php). And never elsewhere.