Pretty URL remarks

Nao

  • Dadman with a boy
  • Posts: 16,079
Pretty URL remarks
« on April 10th, 2012, 03:28 PM »
So... Support for pretty URLs is pretty solid, but I still had an issue with Wedge trying to re-parse URLs that were already parsed, and sometimes breaking them as a result.

This turned out to be due to the way I implemented my 'index.php' removal code (which, BTW, is an optional feature). Because $scripturl immediately ignores the filename, the pretty URL regex thus searches for any boardurl and will transform them.
I fixed it by instead doing the transform *after* pretty URLs are handled. (Currently it's done within the PURL code block but I'm not sure this should be associated with them -- one might want to just get rid of index.php without systematically transforming URLs...)

This adds a new problem. Because it's now done through a basic str_replace, I have no way to 'control' how index.php is removed from the page -- could be within a non-linked URL, for instance. Or whatever.

My belief is that if you want to remove index.php from your URL, you will want to remove it from ANY place on the page, even if it's not intended as being transformed. Making it a no-brainer to apply the technique I just devised.

I'd like some opinions on this.

Arantor

  • As powerful as possible, as complex as necessary.
  • Posts: 14,278
Re: Pretty URL remarks
« Reply #1, on April 10th, 2012, 03:37 PM »
How about, then, we remove all instances of $scripturl from anywhere in the code and only ever use <URL> internally, so when *that's* replaced, we fix it there once and only once?
When we unite against a common enemy that attacks our ethos, it nurtures group solidarity. Trolls are sensational, yes, but we keep everyone honest. | Game Memorial

Nao

  • Dadman with a boy
  • Posts: 16,079
Re: Pretty URL remarks
« Reply #2, on April 10th, 2012, 03:41 PM »
Actually thought of doing it that way first, but then I figured, "argh, it's gonna get dirty"... ;)

But certainly it would be... Interesting to actually *remove* $scripturl entirely from the global list!
There are still 2000+ occurrences of it in the code, and we'd have to make sure ob_sessrewrite is always called... (e.g. Xml and SSI... But it's pretty much the same issue as with PURLs to begin with!)
Posted: April 10th, 2012, 03:40 PM

Also, if someone decides to add a link to the forum and adds index.php manually (or just gets it through an old Google link etc), then these might not get replaced in the final code...

Arantor

  • As powerful as possible, as complex as necessary.
  • Posts: 14,278
Re: Pretty URL remarks
« Reply #3, on April 10th, 2012, 04:12 PM »
Hmm, I see what you mean. But it's still got canonical URLs, it's still got everything else it needs to have, so all that it means is that it's an extra variation of URL that's inbound - it's not a killer in any real sense.

I have no issue with manually calling ob_sessrewrite.

Nao

  • Dadman with a boy
  • Posts: 16,079
Re: Pretty URL remarks
« Reply #4, on April 10th, 2012, 11:16 PM »
Does anyone else has something to add? Pretty please? I'm not sure what's best here.
Re: Pretty URL remarks
« Reply #5, on April 10th, 2012, 11:19 PM »
An alternative would be to do it like the svn currently does, only we add [?#] directly after $preg_replace. This is the first thing I did but it will still preventively match all urls until the question mark is not found.

Arantor

  • As powerful as possible, as complex as necessary.
  • Posts: 14,278
Re: Pretty URL remarks
« Reply #6, on April 11th, 2012, 12:18 AM »
Give it a try and see how it works out :)

MultiformeIngegno

  • Posts: 1,337
Re: Pretty URL remarks
« Reply #7, on April 11th, 2012, 12:57 PM »
Quote from Nao on April 10th, 2012, 11:16 PM
Does anyone else has something to add? Pretty please? I'm not sure what's best here.
If you explain the matter in a newbie way I'll glad you with my opinion.. :P

Nao

  • Dadman with a boy
  • Posts: 16,079
Re: Pretty URL remarks
« Reply #8, on April 11th, 2012, 12:59 PM »
Well, it's very simple...

...Once I figure out how to explain it simply :P
Re: Pretty URL remarks
« Reply #9, on April 11th, 2012, 02:59 PM »
So... After doing some tests... I'm pretty much getting the same performance across all method. They're very variable, the server seems to be unstable in that respect. I was at first a bit surprised but not that much... When trying to match against a resource URL (image etc), Wedge will either stop at 'index.php' (if it's enabled), or at the [?;&#] if index.php is disabled. In both cases it should really be the same performance...

I'm not sure why but it seems that the 'original' solution (the svn one) removes the index.php automatically in the 'wedge.org/index.php' links, even though the regex doesn't seem to match them... Not that it's important.
Re: Pretty URL remarks
« Reply #10, on April 12th, 2012, 12:41 AM »
The more I simplify the code, the slower it gets... Sweet. And it broke this site without me knowing. Sorry.

Arantor

  • As powerful as possible, as complex as necessary.
  • Posts: 14,278
Re: Pretty URL remarks
« Reply #11, on April 12th, 2012, 12:49 AM »
The sad reality is that the complex code that is very hard to read and fathom out is the fastest.

Nao

  • Dadman with a boy
  • Posts: 16,079
Re: Pretty URL remarks
« Reply #12, on April 12th, 2012, 09:23 AM »
I'm sure it's just a 'simple' mistake somewhere...
Re: Pretty URL remarks
« Reply #13, on April 12th, 2012, 08:27 PM »
So... I think it's working again?

But right now, it's fairly... Slower, I should say. I'm not exactly sure why -- the regex should be faster (if anything it's simpler...), and I also optimized the underlying PHP, but still, it's up to twice slower than before to do the replacements.

I'm kinda hoping I could make it a bit faster by doing the overall page replacement only once:

- retrieve all content in the page like this: '\btopic=(\d+)|\bboard=(\d+)' (perhaps also \bcategory=(\d+)??)
- thus we have a list of topic and board IDs ($matches[1] has topics, $matches[2] has boards), including stuff that isn't ACTUALLY topic IDs to transform, but we get it as quick as we can. Now we simply do our search & replace regex...
- during the replacement, if we ever find 'topic=' or 'board=' in the match, and we don't have a list of topic/board names, fill the topic/board name cache(s).
- possibly call a new hook at this point...
- test for action=profile, replace with profile. (Has custom URL scheme)
- test for action=, replace with action. (Has custom URL scheme)

Should cater for everything...?! Except for the URL cache. (Which I've always contemplated dropping anyway... If performance is THAT important to you, maybe you shouldn't consider doing pretty URLs in the first place...?! Because that cache table ALWAYS gets huge after a while...)

What do you think?

Currently, this is pretty much how it is done...:
- run the search regex
- loop through all links and find uncached URLs
- call topic handling, loop through all links to find topics and transform them
- call action handling, loop through all links to find actions and transform them
- etc...
- run the replace regex with a callback that will look for the rebuilt URL in the temporary memory cache, and will do the replacement on all links.

I have a feeling that it *could* be made faster, see...!
Do you share my feeling or not?
Posted: April 12th, 2012, 08:16 PM

Funny... I just did a benchmark of my two regexes. And indeed the 'faster' one (in theory) is 4 times faster than the older one.
So the slowing down thingy must be somewhere else...

Arantor

  • As powerful as possible, as complex as necessary.
  • Posts: 14,278
Re: Pretty URL remarks
« Reply #14, on April 12th, 2012, 08:31 PM »
It's certainly possible to do it as a once-run thing, SlammedDime's mod (SimpleSEF) does that if I remember rightly.

I see where you're going with making it faster though, and I think you're right, that it can be made faster.