Regex to the Rescue

29 Feb

Problem:

At work, we had a requirement to transform plain-text URLs into clickable links. Ideally, these links would have been entered as well-formatted hyperlinks, but that wasn’t the case, and we didn’t have the option to implement that fix. Initially, the idea was to have end users include the http:// protocol on links, so that only text containing http:// would be transformed. This was rather optimistic, as end users didn’t follow the training. Users were entering links in all shapes and forms. Some links had no protocol or host. This led to many links being missed.

Solution:

To solve the problem, I wrote a little regex to capture the links and then reformatted them. The regex is below:

\b((https?|ftp)://)?([A-Za-z0-9-]+[.]){1,4}(com|org|us|net|edu)([A-Za-z0-9./?=_&%]+)?\b

This will match a link with or without a protocol or host specified, an alphanumeric domain with dashes, and an optional path and query string following it. It only matches the TLDs listed in the middle section (com, org, us, net, edu). This can be problematic or useful depending on what you want to match. Overall, this was a major improvement that allowed end users to continue with their behavior and still get the result we were looking for. This solution has been rock solid so far in capturing the links entered, but it can still miss certain URLs, such as those with a TLD that isn’t on the list. YMMV.
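
For illustration, here is a minimal sketch of how a pattern like this could be used to do the reformatting. The post doesn't say which language the original fix was written in, so Python, the function name linkify, and the choice to default to http:// when no protocol is given are all assumptions made for this example. Treat it as a starting point, not the actual production code.

```python
import html
import re

# The pattern from the post, written as a Python raw string.
# Group 1 is the optional protocol (e.g. "https://").
URL_PATTERN = re.compile(
    r"\b((https?|ftp)://)?([A-Za-z0-9-]+[.]){1,4}"
    r"(com|org|us|net|edu)([A-Za-z0-9./?=_&%]+)?\b"
)


def linkify(text):
    """Wrap every URL-like match in text in an HTML anchor tag.

    Hypothetical helper: the name and the http:// default are
    assumptions for this sketch, not taken from the original fix.
    """
    def to_anchor(match):
        url = match.group(0)
        # Add a protocol only when the user left it out.
        href = url if match.group(1) else "http://" + url
        # Escape characters such as & so the generated markup stays valid.
        return '<a href="{}">{}</a>'.format(html.escape(href), html.escape(url))

    return URL_PATTERN.sub(to_anchor, text)


if __name__ == "__main__":
    print(linkify("Docs are at www.example.org and https://example.com/a?b=1"))
```

Because the pattern ends with a word boundary, a period that closes a sentence is left outside the link. Text with a TLD that isn't in the list, such as example.io, passes through unchanged.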