Wedge

Public area => Bug reports => The Pub => Archived fixes => Topic started by: nolsilang on April 14th, 2012, 01:30 PM

Title: Buggy Feed links
Post by: nolsilang on April 14th, 2012, 01:30 PM
I do not consider this is a bug yet, but today my Google Reader got bombarded from Wedge Development Blog. As if anything new got posted in Development Blog repeatedly[1].

On the other hand,I want to ask is the date posted in Development Blog correspond to latest reply date not the date of the thread posted?

Thanks.
 1. picture attached
Title: Re: Buggy Feed links
Post by: Arantor on April 14th, 2012, 01:47 PM
Well, you can look at the threads just as easily as I can, This. Is. Crazy was 'Posted by Nao, on April 3rd, 11:00 PM' for example.

I do suspect there's a problem here though after pushing the Dev Blog's new-topics feed link through a validator:
http://validator.w3.org/feed/check.cgi?url=http%3A%2F%2Fwedge.org%2Fblog%2F%3Faction%3Dfeed%3Bsa%3Dnews

The first issue:
Quote
In addition, interoperability with the widest range of feed readers could be improved by implementing the following recommendation.
line 2, column 0: Missing atom:link with rel="self"
That I can fix. The other error, though, I can't fix so readily - and I suspect it's why it's broken (because it's including the session ID in the URL, making each one unique, and breaking each one)
Title: Re: Buggy Feed links
Post by: Arantor on April 14th, 2012, 02:00 PM
OK, so I actually fixed the bigger issue quite easily, but now I need to go away and research the specification now.
Title: Re: Buggy Feed links
Post by: nolsilang on April 14th, 2012, 02:07 PM
I just got another in my reader :).

The title in feed is not pointing to thread url, but to wedge homepage.

At first I thought that is the intended behaviour so the subscriber know if any new reply have been made.(the frequency and date)
Title: Re: Buggy Feed links
Post by: Nao on April 14th, 2012, 02:34 PM
There is nothing to fix.
As soon as Google will retrieve the feed, it will try and find differences with the earlier feed. If anything changes, such as the URL, it will mark the item as new/unread.
Yesterday I made many tweaks to the pretty URL system, and as a result I broke feed links a few times. I think I fixed that no more than a couple of hours after it was broken, but it's enough for Google to mark something as new.

Oddly, though, I *thought* that with the system I'm using in Wedge (unique tags), it wouldn't bother with URL changes... That is one thing that really bothers me. But other than that -- I checked the current feed and it's all 'normal' to me...
Posted: April 14th, 2012, 02:31 PM

Note: a minor bug...
I was logged off when I posted my reply. It asked me to log in. I did my thing, and it said "Page not found" or something. Went back to the original page, clicked Submit again -- told me "Already submitted". Went back, copied my text, refreshed the page: it was NOT already submitted... So I just pasted the text and voilĂ .

So, two possible bugs:
- "already submitted" being said even when the post was actually refused...
- "page not found". I think it also does this in "View query" pages. Only admins can see that, though... (That bug may be due to my pretty URL tweaks though.)
Title: Re: Buggy Feed links
Post by: nolsilang on April 14th, 2012, 02:36 PM
Oh okay, I just report what I think odd. :) that's why I said earlier I don't think it's a bug yet, because I know wedge is still work in progress[1]

Thanks.
 1. Maybe I should wait more before reporting again
Title: Re: Buggy Feed links
Post by: Arantor on April 14th, 2012, 05:04 PM
Yes there is something to fix.

Every single time Google reads it, it will have PHPSESSID links in it, so each time it refetches the feed, it sees different items in it as a result (because whenever it hits, it has a different session id). I've already patched it so there's a new parameter ($context['no_phpsessid']), if empty it will inject PHPSESSID=whatever into the URLs if appropriate, or forcibly disable it if it's non-empty - there is absolutely no reason to submit PHPSESSIDs in feed items, and several reasons (like this) to expressly not do so.

There is still a bug with the XML not passing validation, though that won't cause this behaviour.
Title: Re: Buggy Feed links
Post by: Nao on April 14th, 2012, 05:12 PM
Ah, yes, I remember our talk about phpsessid... :)
Title: Re: Buggy Feed links
Post by: Arantor on April 14th, 2012, 05:16 PM
Oh, there's several things wrong with PHPSESSID but I'm beginning to think the magic injection into every link on the page is actually more of a hindrance than a help. Certainly it screws up a lot of search engine stuff (including SEO) and feeds are no exception.

I have thought about dropping it entirely and relying on only cookies to handle sessions but that will confuse the tracking of max users on systems that can't properly handle cookies (like some search engines, paranoid guests), but not sure yet.
Title: Re: Buggy Feed links
Post by: MultiformeIngegno on April 14th, 2012, 06:34 PM
Quote from Arantor on April 14th, 2012, 05:16 PM
I have thought about dropping it entirely and relying on only cookies to handle sessions but that will confuse the tracking of max users on systems that can't properly handle cookies (like some search engines, paranoid guests), but not sure yet.
This seems interesting! ;)
Title: Re: Buggy Feed links
Post by: Nao on April 14th, 2012, 07:24 PM
Yeah, I'm pro-removing sessid too. And sort of determining that if the client won't accept cookies, and it's not the first page they load, then give them a message saying to fuck off :P

Then again I'm not sure we won't be getting genuine feedback about horror stories that arose from that design choice ;)
Title: Re: Buggy Feed links
Post by: Arantor on April 14th, 2012, 08:56 PM
Well, it's a bit more complicated than that because cookies are used to handle sessions even for guests. Where it gets problematic is for tracking the number of unique guests, and without proper session support that just won't happen properly - and PHPSESSID is only ever sent when there isn't a cookie, which is where search engines use it.

It isn't just about not having cookie support, it is also about when there simply hasn't been a cookie, e.g. the very first visit, but search engines typically have 'new sessions', and you could very easily go from having '2 or 3' Google visits at a time to dozens where it can't properly handle the session.

That's why I haven't changed it, because I have a nasty feeling it would break the 'number of guests online at present'.
Title: Re: Buggy Feed links
Post by: Nao on April 14th, 2012, 09:06 PM
And google removes the phpsessid var from the URL iirc? Isn't it a default php name?
If it doesn't, we might want to look into whether it can remove anything.

Also my main issue with phpsessid is not with purls but with people posting links that contain it.
Title: Re: Buggy Feed links
Post by: Nao on April 14th, 2012, 09:07 PM
Wouldn't it work if we removed phpsessid for first visits (ip based), and restore it on the second visit if the cookie still isn't there..? (visit = pageview)
Title: Re: Buggy Feed links
Post by: Arantor on April 14th, 2012, 09:17 PM
Re Google... for *web searches*, yes it does - if you have Google Webmaster Tools and tell it not to do so. It doesn't do so automatically. And other services - like the feed reader that started this thread, no, it doesn't, because that's what causes it to repeatedly read in topics - because the URL changes.

The problem with that solution is that it's still not that reliable, especially for those who would actually trip it - we'd be better just accepting when it's wrong instead.
Title: Re: Buggy Feed links
Post by: Nao on April 14th, 2012, 10:41 PM
But technically would it be possible to forget phpsessid in the first page view?
Maybe we could enforce loading a second page like login before logging in is accepted.

As for feeds, are we sure phpsessid is used in these? If yes we should certainly ensure it isn't included...
Title: Re: Buggy Feed links
Post by: Arantor on April 14th, 2012, 10:50 PM
Quote
But technically would it be possible to forget phpsessid in the first page view?
PHPSESSID is invoked when there isn't an existing Wedge cookie (like the start of every browsing session from a search engine) and that first page view, it will always shove PHPSESSID links everywhere. Remove your cookie, refresh, boom you'll see them.
Quote
Maybe we could enforce loading a second page like login before logging in is accepted.
We actually do, as it happens. You cannot go into login2 if you don't have a valid session. But that's not really the problem.

The problem that we have to weigh up is accuracy of reporting vs. PHPSESSID in URLs. Specifically, it's simply about tracking how many 'probably unique' guests there are, given the requests being made, since the whole nature of cookies is a friggin' bolt on to the specification in the first place (as HTTP is specifically designed to be stateless)
Quote
As for feeds, are we sure phpsessid is used in these? If yes we should certainly ensure it isn't included...
Yes, I am sure. It's not possible to avoid them, because it still goes through ob_sessrewrite to handle pretty URLs, so the usual logic applies - namely that if no cookie was found, PHPSESSID is injected. And given the approach made by Google Reader etc., they wouldn't have a cookie, so they get PHPSESSIDs, which is what we're seeing here (and I separately validated that PHPSESSID was added)
Title: Re: Buggy Feed links
Post by: nolsilang on April 15th, 2012, 04:03 AM
I don't know if this helped or not, but from date on my reader. This started from April 12. Before that it seemed normal.
Title: Re: Buggy Feed links
Post by: Nao on April 15th, 2012, 05:32 PM
Currently testing Wedge with cookies disabled... (Fun.)
Quote from Arantor on April 14th, 2012, 10:50 PM
PHPSESSID is invoked when there isn't an existing Wedge cookie (like the start of every browsing session from a search engine) and that first page view, it will always shove PHPSESSID links everywhere. Remove your cookie, refresh, boom you'll see them.
Yeah, I know that from countless forums visited where I got the phpsessid on first page and it was removed later. ;)

However, Wedge does it this way... First of all, it determines whether user is a robot. If it is, it will skip adding PHPSESSID to links. Which is the reason why they never show up in our RSS feeds in Google Reader. So the reason why it broke in GR is that the URLs returned were not prettified.

0x, your screenshots would be more helpful if they showed what URL these entries linked to ;) (So, basically... All of them have ?topic=... in the URL, right?)
Quote
Quote
Maybe we could enforce loading a second page like login before logging in is accepted.
We actually do, as it happens. You cannot go into login2 if you don't have a valid session. But that's not really the problem.
I was thinking, maybe we could simply put the phpsessid links on the login links... Instead of all links. Maybe that would help... i.e. if you're not going to log in, there's no reason you should have a constant session (of course arises the problem of increased opened sessions, but it can be solved another way I suppose... We could even track people by IP...), while if they want to login, then they'll want to have their session followed just to give Wedge time to process the user name & password... By that time, the browser correctly reports a cookie anyway, so we won't need to add the phpsessid links by then.

Of course, there's still the problem of browsers that disable cookies entirely... i.e. if you're logged in and have cookies disabled like me right now, you absolutely need a phpsessid link in the URL.

Now, about phpsessid...
- Looking for it directly (in the code) is bad manners. Pretty URLs did that. I'm going to strip that code out -- in fact, it should be looking for SID, which is "PHPSESSID=abcdef" altogether. This also means doing a str_replace instead of a preg_replace on all links, because we know the session value. (Same for $session_var... No need for a preg_replace. I simply used $context['session_query'] instead. I knew that one would prove useful one day ;))
- I'm not sure why, but SMF and Wedge both add "&" at the end of the SID URL... Instead of simply using ";". I'm not sure SMF/Wedge would work *at all* if the installed PHP didn't support ";" as a separator.
- Also, SMF/Wedge use SID != '' in one place (ob_sessrewrite), and defined('SID') in others. I don't see how SID could possibly be, err, set to an empty string if it was defined... Or maybe it's always defined, in which case I should probably test against that when I get to re-enable cookies... Do you remember anything about that anyway?
Quote
Quote
As for feeds, are we sure phpsessid is used in these? If yes we should certainly ensure it isn't included...
Yes, I am sure.
(Jean Dujardin voice) Not so fast, Mr. Bond!
Quote
It's not possible to avoid them, because it still goes through ob_sessrewrite to handle pretty URLs, so the usual logic applies - namely that if no cookie was found, PHPSESSID is injected. And given the approach made by Google Reader etc., they wouldn't have a cookie, so they get PHPSESSIDs, which is what we're seeing here (and I separately validated that PHPSESSID was added)
As pointed out above, SID isn't injected is $user_info['possibly_robot'] is true, which is always the case for Google Bot. (And probably Google Reader's bot as well.)
Posted: April 15th, 2012, 05:30 PM

Yup, looks like SID is always defined, and just empty when cookies are set...
Title: Re: Buggy Feed links
Post by: Arantor on April 15th, 2012, 05:58 PM
Quote
However, Wedge does it this way... First of all, it determines whether user is a robot. If it is, it will skip adding PHPSESSID to links
Except that not all feed readers are detected - and IIRC, Google Reader does not trip that test.
Quote
So the reason why it broke in GR is that the URLs returned were not prettified.
So not being prettified will make every topic show up as a new topic (because of a unique URL) every time it looks? Even here, where URLs were being prettified, PHPSESSID was being injected - I specifically checked this when it was reported.
Quote
Of course, there's still the problem of browsers that disable cookies entirely... i.e. if you're logged in and have cookies disabled like me right now, you absolutely need a phpsessid link in the URL.
There is a (valid) argument about rejecting those cases and disallowing it entirely for security reasons.
Quote
I'm not sure why, but SMF and Wedge both add "&" at the end of the SID URL... Instead of simply using ";". I'm not sure SMF/Wedge would work *at all* if the installed PHP didn't support ";" as a separator.
Because & is the argument separator defined in PHP's configuration. Check the code in QueryString, if ; is arg-separator, it does one thing, but otherwise, it does something else to manually parse out the parameters. ; is just not the default. But we get cases, just for fun, of malformed entities being prepared occasionally too.
Quote
(Jean Dujardin voice) Not so fast, Mr. Bond!
Quote
As pointed out above, SID isn't injected is $user_info['possibly_robot'] is true, which is always the case for Google Bot. (And probably Google Reader's bot as well.)
Seriously, it IS injecting PHPSESSID. It does it on the RSS validator. I have no idea what user agent Google Reader uses but I'm willing to bet it doesn't trip possibly_robot. It's actually irrelevant in this case. It does not matter whether it trips possibly_robot or not, it should NEVER be issuing PHPSESSID in feeds, ever, because some other feed readers will choke on non unique URLs, I know Thunderbird used to have issues with it, for example (because that didn't bother with cookies at one time)
Title: Re: Buggy Feed links
Post by: nolsilang on April 15th, 2012, 06:19 PM
These feeds linked to index+session, not topic+session I'm afraid.
Title: Re: Buggy Feed links
Post by: Nao on April 15th, 2012, 11:41 PM
@0x> It SHOULD be working now... I made a test with Google Reader and it no longer shows PHPSESSID links :)
I took the lazy way out really... Added a context variable in Feed.php to say we don't want to insert the session ID. This is tested against in Subs-Template.php. The reason why I was lazy for it, is that (1) just testing for Feedfetcher-Google (or whatever it's called) isn't going to do any good for other feed reader bots, (2) there is VERY little reason to have a session ID in a feed URL. If you have cookies disabled, then your session won't be active forever. Your feed reader will soon end up trying to access an incorrect session ID anyway.
Quote from Arantor on April 15th, 2012, 05:58 PM
Quote
Of course, there's still the problem of browsers that disable cookies entirely... i.e. if you're logged in and have cookies disabled like me right now, you absolutely need a phpsessid link in the URL.
There is a (valid) argument about rejecting those cases and disallowing it entirely for security reasons.
So.. What do we do, eh!
Quote
Because & is the argument separator defined in PHP's configuration. Check the code in QueryString, if ; is arg-separator, it does one thing, but otherwise, it does something else to manually parse out the parameters. ; is just not the default. But we get cases, just for fun, of malformed entities being prepared occasionally too.
I'm aware of how SMF/Wedge deals with it. But I can't find of any (good) reason to disable the ability to specify a new argument separator. I for one never heard of such a case... And I don't even think it would make a forum work because we're assuming pretty much everywhere that ; is the separator, and using it in links... And I don't seem to remember Wedge parsing all links to replace ; to & manually ;) (Not only that, but it's silly to assume that adding & will work everywhere... If a link is inside CDATA tags, then you're screwed!)

I would be tempted to say that we should remove that compatibility code...
Quote
Seriously, it IS injecting PHPSESSID. It does it on the RSS validator. I have no idea what user agent Google Reader uses but I'm willing to bet it doesn't trip possibly_robot. It's actually irrelevant in this case. It does not matter whether it trips possibly_robot or not, it should NEVER be issuing PHPSESSID in feeds, ever, because some other feed readers will choke on non unique URLs, I know Thunderbird used to have issues with it, for example (because that didn't bother with cookies at one time)
And we should have dealt with that long ago. Seriously, I'm surprised we had the problem at all, because if you'll look at noisen.com, the phpsessid is never injected *at all* into feeds...
Title: Re: Buggy Feed links
Post by: Nao on April 15th, 2012, 11:45 PM
Hmm... Double-checked the Noisen codebase, and it doesn't have anything different in that respect... Which is probably why I was so positive Wedge didn't have the problem either.
Anyway, now that we're sure it's never gonna happen... I can safely move back to working on PURLs ;)
Title: Re: Buggy Feed links
Post by: Arantor on April 16th, 2012, 12:06 AM
Quote
I would be tempted to say that we should remove that compatibility code...
Then it WILL break. The default in PHP itself is &, as per http://php.net/manual/en/ini.core.php#ini.arg-separator.input and the choice to use ; is for avoiding all kinds of nightmares. QueryString.php, line 81 onwards.

Code: [Select]
// Are we going to need to parse the ; out?
if (strpos(@ini_get('arg_separator.input'), ';') === false && !empty($_SERVER['QUERY_STRING']))
{
// Get rid of the old one! You don't know where it's been!
$_GET = array();

// Was this redirected? If so, get the REDIRECT_QUERY_STRING.
$_SERVER['QUERY_STRING'] = urldecode(substr($_SERVER['QUERY_STRING'], 0, 5) === 'url=/' ? $_SERVER['REDIRECT_QUERY_STRING'] : $_SERVER['QUERY_STRING']);

// Replace ';' with '&' and '&something&' with '&something=&'. (This is done for compatibility...)
parse_str(preg_replace('/&(\w+)(?=&|$)/', '&$1=', strtr($_SERVER['QUERY_STRING'], array(';?' => '&', ';' => '&', '%00' => '', "\0" => ''))), $_GET);

// Magic quotes still applies with parse_str - so clean it up.
if (function_exists('get_magic_quotes_gpc') && @get_magic_quotes_gpc() != 0 && empty($settings['integrate_magic_quotes']))
$_GET = $removeMagicQuoteFunction($_GET);
}
elseif (strpos(@ini_get('arg_separator.input'), ';') !== false)
{
if (function_exists('get_magic_quotes_gpc') && @get_magic_quotes_gpc() != 0 && empty($settings['integrate_magic_quotes']))
$_GET = $removeMagicQuoteFunction($_GET);

// Search engines will send action=profile%3Bu=1, which confuses PHP.
foreach ($_GET as $k => $v)
{
if (is_string($v) && strpos($k, ';') !== false)
{
$temp = explode(';', $v);
$_GET[$k] = $temp[0];

for ($i = 1, $n = count($temp); $i < $n; $i++)
{
@list ($key, $val) = @explode('=', $temp[$i], 2);
if (!isset($_GET[$key]))
$_GET[$key] = $val;
}
}

// This helps a lot with integration!
if (strpos($k, '?') === 0)
{
$_GET[substr($k, 1)] = $v;
unset($_GET[$k]);
}
}
}

So we have to leave that code in, unless you plan on fixing every URL to not use ; (and with all the other problems related to it). SMF and Wedge using ; is a definite oddity though it does solve so many problems.
Quote
And we should have dealt with that long ago. Seriously, I'm surprised we had the problem at all, because if you'll look at noisen.com, the phpsessid is never injected *at all* into feeds...
I can think of multiple reasons why it might be different between the two. Either way, the bottom line is that it wasn't and while I have patched around it, I'm not happy with it (which is why I haven't yet committed it)
Quote
Added a context variable in Feed.php to say we don't want to insert the session ID. This is tested against in Subs-Template.php. The reason why I was lazy for it, is that (1) just testing for Feedfetcher-Google (or whatever it's called) isn't going to do any good for other feed reader bots, (2) there is VERY little reason to have a session ID in a feed URL
Oh, you've added it, then much as I had. And yes, exactly, there is no reason to have it in the URL, and testing for bots isn't enough because of the sheer variety of feed consumers.

Note that SID containing a & is injected by SMF and Wedge, and I have no idea why ; wasn't used there.
Quote
If you have cookies disabled, then your session won't be active forever. Your feed reader will soon end up trying to access an incorrect session ID anyway.
Sure that's the case. Hence having feeds never contain a session id - and also why I was so adamant about fixing the canonical URL link as well, which also got SID added into it.

The whole thing about using SID in URLs is an interesting one and one I've been unwilling to make a move on because I'm inclined to think that having 'probably accurate' stats about the number of guests is probably slightly more important than having an SEO benefit to it (though having a canonical URL should fix most issues)

If probably_robot were more thorough, we can be happier about leaving it in. On the other hand, note, cookies being disabled will break other functionality anyway. It's a tough one to call :/
Title: Re: Buggy Feed links
Post by: nolsilang on April 16th, 2012, 02:34 AM
Ah.. yes now the feed linked to wedge.org.
Title: Re: Buggy Feed links
Post by: Nao on April 16th, 2012, 08:34 AM
(Post written last evening... Axed because of GF axing me for being late :^^;:)

My current implementation has the index.php remover disabled if no cookie is set. This is (was) because the session ID injector wouldn't correctly identify static cache files as static, and would inject the SID into the names.
So I rewrote it to correctly test for a question mark or nothing at all (this was complicated by the fact I had to ensure there wasn't already a SID in the URL.) I'm not sure this is best. Instead of \\?? after the SID test (beginning of ob_sessrewrite), which is already a bit of a buggy thing (you only need to escape the question mark once... it'll still work though, it's just a waste!), I added (?:\?|(?=")), meaning, "either followed by a question mark or a double quote", so it should always inject to homepage links too. Dunno if it's going to be very fast... So maybe I should just not care and simply insert index.php links to cookieless guests..?! (They'll still be transformed to /? if the remover is enabled! Only homepage links won't.)

Links to the homepage are annoying for another reason... As they're not treated by purls, we always end up with a ";" at the end, the one that's injected by the SID test... (Currently '&' in SVN, which is even worse... Or maybe it isn't so? I remember removing a test against & which was probably done for that reason...?)
Title: Re: Buggy Feed links
Post by: Arantor on April 16th, 2012, 11:50 AM
Quote
This is (was) because the session ID injector wouldn't correctly identify static cache files as static, and would inject the SID into the names.
Yes, that's important. The number of people who got bitten by the 'trick' to alter where $scripturl was defined, was frightening.
Quote
As they're not treated by purls, we always end up with a ";" at the end, the one that's injected by the SID test... (Currently '&' in SVN, which is even worse... Or maybe it isn't so? I remember removing a test against & which was probably done for that reason...?)
Actually, it's worse than that. When this topic kicked off and I threw it at the validator, that error it was giving relates to the fact that there were *malformed* entities in the data, namely amp without the trailing ; in it, where injected by this code.
Title: Re: Buggy Feed links
Post by: Nao on April 16th, 2012, 03:11 PM
Quote from Arantor on April 16th, 2012, 12:06 AM
Then it WILL break.
It will break for servers that don't support both @ini_set and the specific setting that is arg_separator...
Quote
The default in PHP itself is &, as per http://[url]http://php.net/manual/en/ini.core.php#ini.arg-separator.input[/url] and the choice to use ; is for avoiding all kinds of nightmares. QueryString.php, line 81 onwards.
I know, I modified that area while harmonizing the preg_replace delimiters last week ;) Ended up being in my commit of today BTW...
That code block is about providing compatibility with & (which it doesn't do entirely because it doesn't change links in the HTML afaik..??), and fixing other things.
Quote
So we have to leave that code in, unless you plan on fixing every URL to not use ; (and with all the other problems related to it).
Actually, as I said, I'd rather we don't provide support for & links... (Well, inside PURLs if anything!)
Quote
SMF and Wedge using ; is a definite oddity though it does solve so many problems.
Oh, it's so much better than using &... Because we don't have to go through &(amp;)? when doing link checking... Of course, there's the issue with php_security or whatever it's called, but all ";id" calls are now ";in" or something anyway...
Quote
I can think of multiple reasons why it might be different between the two.
That's "multiple" better than me!
Quote
Either way, the bottom line is that it wasn't and while I have patched around it, I'm not happy with it (which is why I haven't yet committed it)
Committed what...? (I already feel a commit conflict coming...)
Quote
Note that SID containing a & is injected by SMF and Wedge, and I have no idea why ; wasn't used there.
Which is why I changed it into a ;...
I suspect that's a leftover from an old version.
Sometimes, you need to re-break things to understand what they tried to fix in the first place.
Quote
Sure that's the case. Hence having feeds never contain a session id - and also why I was so adamant about fixing the canonical URL link as well, which also got SID added into it.
And I really (really) think we should only add SID to the login and register links *on first visit*, i.e. if cookie is never set...
Quote
The whole thing about using SID in URLs is an interesting one and one I've been unwilling to make a move on because I'm inclined to think that having 'probably accurate' stats about the number of guests is probably slightly more important than having an SEO benefit to it (though having a canonical URL should fix most issues)
But how come we can't simply determine that a new session with the same IP is simply the same session...? (Or at least the same person, when it comes to calculating stats...)
If anything, it's actually more accurate to calculate things this way -- different IPs, rather than different sessions.
Posted: April 16th, 2012, 03:09 PM
Quote from Arantor on April 16th, 2012, 11:50 AM
Actually, it's worse than that. When this topic kicked off and I threw it at the validator, that error it was giving relates to the fact that there were *malformed* entities in the data, namely amp without the trailing ; in it, where injected by this code.
Which was due to my currently doing a rtrim() on the final URL and deleting any extra characters like ";"...
Previously, it was first removing & at the end and then removing ; or whatever.
When I noticed that, I simply asked Wedge to remove &amp (without the semicolon) from the URL...
And then I ended up completely removing it. Because it smelt bad.
Title: Re: Buggy Feed links
Post by: Arantor on April 16th, 2012, 07:56 PM
Before I reply, I should note that I've been fighting with DIY endeavours today and am very tired and frustrated (and still only got one of the two items built)
Quote
It will break for servers that don't support both @ini_set and the specific setting that is arg_separator...
You can't set it with ini_set. It's a PERDIR setting, not an ALL setting. You can only set it via .htaccess if your host allows you to do so (since it's a PHP setup thing, if you do it by ini_set, it's already too late as $_GET will by then already have been processed)
Quote
That code block is about providing compatibility with & (which it doesn't do entirely because it doesn't change links in the HTML afaik..??), and fixing other things.
No, no, no it doesn't. Read what the code does. It detects whether arg_separator is ; or not, if it is, it sanitises $_GET based on it, if it's not, it re-parses the query string.

You cannot do this in ini_set, and if you make it rely on having ; set as arg_separator, Wedge WILL break if that code is removed.
Quote
Actually, as I said, I'd rather we don't provide support for & links... (Well, inside PURLs if anything!)
Are you planning on restricting Wedge to hosts that are VPSes or better, or allow users to reconfigure things in .htaccess (specifically PHP settings via .htaccess)?

You don't actually have much of a choice on this. arg_separator is not ; by default, it's &, and since that's set up prior to a PHP script running, you don't have much choice in how that works. Either you restrict Wedge to hosts that allow custom configuration, or you leave the code alone because it works just fine as it is.
Quote
Committed what...? (I already feel a commit conflict coming...)
I did the same thing you did, a $context setting to have it never inject SID into URLs. But since you've committed it, I didn't bother (and I wasn't that happy with it, I felt there should have been something more elaborate)
Quote
Sometimes, you need to re-break things to understand what they tried to fix in the first place.
I know what they were trying to fix, I know exactly what and how they were trying to fix it, I just don't understand why that was there.
Quote
But how come we can't simply determine that a new session with the same IP is simply the same session...? (Or at least the same person, when it comes to calculating stats...)
For the same reason IP addresses are unreliable as sin anyway, things like NAT routing, where an IP address does not relate to a unique computer any more nor to a unique user.
Quote
Which was due to my currently doing a rtrim() on the final URL and deleting any extra characters like ";"...
Previously, it was first removing & at the end and then removing ; or whatever.
Except that I saw cases where the trailing ; was present and cases where it wasn't - in the same page.
Title: Re: Buggy Feed links
Post by: Nao on April 18th, 2012, 05:58 PM
Quote from Arantor on April 16th, 2012, 07:56 PM
You can't set it with ini_set. It's a PERDIR setting, not an ALL setting. You can only set it via .htaccess if your host allows you to do so (since it's a PHP setup thing, if you do it by ini_set, it's already too late as $_GET will by then already have been processed)
As I pointed out in my thought -- I got confused. Twas a bad day for me I guess... (Not that it's much better now.)
Obviously you can't set the separator *after* you started the PHP page, because you already have the query string in the $_SERVER variable, ah ah :)
I was a little too proud regarding that piece of code. Because I'd modified it for Noisen, I 'thought' I knew it by heart. I'd actually forgotten all about it...
Quote
No, no, no it doesn't. Read what the code does. It detects whether arg_separator is ; or not, if it is, it sanitises $_GET based on it, if it's not, it re-parses the query string.
We should always sanitize $_GET anyway. Just assume that ; is always used to separate stuff. Which is the case, anyway...
Occasionally though, SMF uses '&', as in the SID injector, and I don't know what it's there for. I'm assuming this is to 'emulate' enable_trans_sid being set to 1. Because I guess some servers won't allow you to use ini_set, if it's at 1, then it automatically injects & at the end of the SID and then you end up with two different URL styles to deal with -- ?PHPSESSID=...; and ?PHPSESSID=...& -- so they chose to inject & and deal with it later (which I don't believe they do...)
I for one couldn't care less about that, but I suppose we should deal with it overall?
Quote
I did the same thing you did, a $context setting to have it never inject SID into URLs. But since you've committed it, I didn't bother (and I wasn't that happy with it, I felt there should have been something more elaborate)
I hope you liked the name I chose for the setting :lol:
Although I'm never one to refuse SID! But only if it's C64-related ;)
Quote
For the same reason IP addresses are unreliable as sin anyway, things like NAT routing, where an IP address does not relate to a unique computer any more nor to a unique user.
The same could be said of a session ID. If I have my forum running on 3 different browsers, should Wedge say I have 3 users online...? For debugging purpose it's fine. What about Googlebot? Sometimes you get many requests from the same IP IIRC. Because they're all using different sessions, Wedge considers them as multiple bots. I for one would rather see it recognized as a single 'user'.
Quote
Quote
Which was due to my currently doing a rtrim() on the final URL and deleting any extra characters like ";"...
Previously, it was first removing & at the end and then removing ; or whatever.
Except that I saw cases where the trailing ; was present and cases where it wasn't - in the same page.
It should always be there in homepage links (index.php). And never elsewhere.
Title: Re: Buggy Feed links
Post by: Arantor on April 18th, 2012, 07:32 PM
Quote
We should always sanitize $_GET anyway. Just assume that ; is always used to separate stuff. Which is the case, anyway...
Occasionally though, SMF uses '&', as in the SID injector, and I don't know what it's there for. I'm assuming this is to 'emulate' enable_trans_sid being set to 1. Because I guess some servers won't allow you to use ini_set, if it's at 1, then it automatically injects & at the end of the SID and then you end up with two different URL styles to deal with -- ?PHPSESSID=...; and ?PHPSESSID=...& -- so they chose to inject & and deal with it later (which I don't believe they do...)
I for one couldn't care less about that, but I suppose we should deal with it overall?
We always do sanitise $_GET, and we do always assume that ; is used to separate stuff, as noted.

Yes, & is passed into SID, and I think you're probably right that it is related to handling enable_trans_sid.

Dealing with it is the big problem. How, exactly, do we proceed? I'm in no doubt the current situation is not ideal. But I'm not sure the other situations are any better either.
Quote
I hope you liked the name I chose for the setting :lol:
Although I'm never one to refuse SID! But only if it's C64-related
I haven't actually looked, been too busy to properly keep up with all the code changes :/
Quote
The same could be said of a session ID. If I have my forum running on 3 different browsers, should Wedge say I have 3 users online...? For debugging purpose it's fine. What about Googlebot? Sometimes you get many requests from the same IP IIRC. Because they're all using different sessions, Wedge considers them as multiple bots. I for one would rather see it recognized as a single 'user'.
If you have your forum running on three different browsers, it depends whether you're logged in or not. If you're logged in, you should be able to have three sessions but only the one online user. But if they're not all logged in, it is different users per se, and I'd argue that's more accurate because they're three separate experiences of browsing the site.

If you have two different sessions from Google, that's two different bots browsing, and should be reflected as such, despite maybe coming from separate - or the same - IP addresses. Here's the thing, even now you cannot rely on an IP address being reflective of a single bot.

Two bots from Google browsing, with two sessions, will typically have two IP addresses. Though it won't always be that way.
Quote
It should always be there in homepage links (index.php). And never elsewhere.
I was going off the feed validation I ran, I forget exactly what was, and what wasn't, broken