Geffy said:
I am sure it's stuff like that that just scares the bejeebus out of people
Sure as hell did when I first looked at regular expressions.
It is actually quite simple.
Code:
!(^|\s)((.+)://(\w+[^\[\]\s/]+\.\w{2,}\.?(:\w+[^\[\]\s/])?)/?(\w+[^\[\]\s]+)?)!i
Dissected below.
The (^|\s) at the front tells it to match only at the very start of the text or right after whitespace (a space, a tab, or a \n). Otherwise we don't do any matching, so a match can never start in the middle of a word, for example at the http inside "blahhttp://x-istence.com/".
The ! is the start of the expression (the delimiter). It could also be a # symbol, it does not make a difference, as long as it is not a letter, a digit, a backslash, or whitespace, and preferably not a character that already means something inside a pattern, such as [ ] | ( ). Basically, use common sense. If you choose one you will rarely use in your pattern, then you won't have to add backslashes to escape it. I use ! in this example.
A simple opening parenthesis. Everything matched inside a pair of parentheses is captured and made available as a numbered backreference (written with backslashes) for us to use. For instance, the whitespace above is available in \\1, so this group will be \\2.
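Match the <anythinghere>:// part. Straightforward, .+ means one or more of any character. Be aware that "any character" includes spaces, so text sitting in front of the URL on the same line gets pulled into this group too; a stricter \w+ would keep it to a single word.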
This one is available (protocol only, not the ://) at \\3
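A quick way to see what lands in \\3 is to run the pattern and print that group. This is only a sketch in Python's re module (my assumption; the post is about PHP's preg functions, but this pattern reads the same in both once the ! delimiters are dropped). The helper name protocol_of is made up for the example.

```python
import re

# The full pattern from the post, without PHP's ! delimiters; the trailing
# i modifier becomes re.IGNORECASE.
URL_RE = re.compile(
    r'(^|\s)((.+)://(\w+[^\[\]\s/]+\.\w{2,}\.?(:\w+[^\[\]\s/])?)/?(\w+[^\[\]\s]+)?)',
    re.IGNORECASE,
)

def protocol_of(text):
    """Return capture group 3 of the first match, or None if nothing matches."""
    m = URL_RE.search(text)
    return m.group(3) if m else None

print(protocol_of("http://osnn.net/"))                  # http
print(protocol_of("ftp://example.net/test/blah.html"))  # ftp
# Because . also matches spaces, text before the URL on the same line
# is captured as part of the "protocol".
print(protocol_of("see http://osnn.net/"))              # see http
```

The last call shows why a narrower class than .+ may be worth considering if the URLs are embedded in sentences.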
Opening parenthesis, this one will be available at \\4
Code:
\w+[^\[\]\s/]+\.\w{2,}\.?
This code is fairly simple. \w+ means to match one or more word characters (letters, digits and the underscore). After that we want to allow more than just word characters, but we still don't want spaces. So we add a bracket, and as the first symbol put a caret (^). This means that anything listed after it should cause it to NOT match. So we don't want to match a [, a ], a \s (any type of whitespace) or a /. We add a + to the end of the stuff in brackets, this means it can match one or more characters (or in this case, one or more characters that are not in the list).
Next is the \. This is just a way to show that we want to match the literal period.
So far we have matched:
www.somedomain
Next comes the com, net, org or any other ending after that period. The \w{2,} means we want to match 2 or more word characters, so that means we can also match hello.nl.
The last \.? matches an optional extra period at the end of the domain. A trailing period is allowed and marks the name as fully qualified; most browsers support this, so that URLs like this will work:
http://osnn.net./index.php
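If you want to poke at just the hostname piece on its own, here is a minimal sketch, again in Python's re module as a stand-in for PHP (my assumption; this fragment uses the same syntax in both). The helper name is_host and the sample strings other than the ones quoted above are made up for the example.

```python
import re

# Only the hostname fragment shown above, nothing else from the full pattern.
HOST_RE = re.compile(r'\w+[^\[\]\s/]+\.\w{2,}\.?', re.IGNORECASE)

def is_host(candidate):
    """True if the whole string is matched by the hostname fragment."""
    return HOST_RE.fullmatch(candidate) is not None

for candidate in ["www.somedomain.com", "hello.nl", "osnn.net.",
                  "localhost", "bad host.com"]:
    print(candidate, is_host(candidate))
```

The first three should come back True; "localhost" fails because there is no period followed by two or more word characters, and "bad host.com" fails because of the space.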
Look at the above for an explanation of the character classes, all we are doing here with (:\w+[^\[\]\s/])? is looking for the optional port number, so that matches like
http://osnn.net:80/ will work flawlessly. The ? at the end means it can match 0 or 1 times. In most cases it will not be matched, and \\5 will be empty, specifying that no port is available.
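To check the port group by hand, something like the following should do. It is a sketch in Python's re module rather than PHP (my assumption), and port_of is a made-up helper name. Note that the group includes the colon, and that Python reports an unmatched group as None where PHP would hand back an empty string.

```python
import re

# The full pattern from the post, without PHP's ! delimiters.
URL_RE = re.compile(
    r'(^|\s)((.+)://(\w+[^\[\]\s/]+\.\w{2,}\.?(:\w+[^\[\]\s/])?)/?(\w+[^\[\]\s]+)?)',
    re.IGNORECASE,
)

def port_of(url):
    """Return capture group 5 (colon plus port), or None if it did not match."""
    m = URL_RE.search(url)
    return m.group(5) if m else None

print(port_of("http://osnn.net:80/"))  # :80
print(port_of("http://osnn.net/"))     # None
```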
Close \\4. This group holds just the hostname and port number, which we use to make slashdot style links.
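As a rough idea of how \\4 could be used for that, here is a sketch that puts the host in brackets after each link. It uses Python's re.sub in place of PHP's preg_replace (my assumption), and both the linkify name and the exact HTML layout are made up for the example, not taken from any real site.

```python
import re

# The full pattern from the post, without PHP's ! delimiters.
URL_RE = re.compile(
    r'(^|\s)((.+)://(\w+[^\[\]\s/]+\.\w{2,}\.?(:\w+[^\[\]\s/])?)/?(\w+[^\[\]\s]+)?)',
    re.IGNORECASE,
)

def linkify(text):
    # \1 puts back the leading whitespace, \2 is the whole URL, and
    # \4 is the host (plus port, if any) shown in brackets after the link.
    return URL_RE.sub(r'\1<a href="\2">\2</a> [\4]', text)

print(linkify("http://osnn.net:80/index.php"))
```

Putting \1 back at the front of the replacement matters, otherwise the whitespace that was consumed in front of the URL would disappear from the text.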
With /?(\w+[^\[\]\s]+)? we look for the optional stuff following the domain name, so that things like
http://example.net/test/blah.html are also included in full when making it into a URL. The part after the slash ends up in \\6.
Close the entire thing. Now \\2 is filled with goodness, namely the entire URL without the extra whitespace at the beginning, which is stored in \\1.
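To see all the numbered groups side by side, a small sketch like this can help. It is written with Python's re module instead of PHP (my assumption), url_groups is a made-up helper name, and groups that did not take part in the match show up as None here rather than as empty strings.

```python
import re

# The full pattern from the post, without PHP's ! delimiters.
URL_RE = re.compile(
    r'(^|\s)((.+)://(\w+[^\[\]\s/]+\.\w{2,}\.?(:\w+[^\[\]\s/])?)/?(\w+[^\[\]\s]+)?)',
    re.IGNORECASE,
)

def url_groups(text):
    """Return groups 1 to 6 of the first match as a tuple, or None."""
    m = URL_RE.search(text)
    return m.groups() if m else None

# Prints each group number next to what it captured.
for number, value in enumerate(url_groups("http://example.net/test/blah.html"), 1):
    print(number, repr(value))
```

For that URL, group 1 is an empty string (the match began at the start of the text), group 2 is the whole URL, group 3 is http, group 4 is example.net, group 5 is None, and group 6 is test/blah.html.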
The ! means it is the end of the pattern. The extra "i" makes the pattern case insensitive. For more pattern modifiers check out
http://us2.php.net/manual/en/reference.pcre.pattern.modifiers.php.
I hope this helps just a little bit.