
I need help capturing domains using regex


#1

I am trying to create a regular expression to capture domain names and subdomains, but I am running into issues. I am new to regex, so what I want may not even be possible.

I want my regex to capture domains like
http://domain.com
httpS://domain.com
http://app.domain.com
app.domain.com
domain.com
cool.app.domain.com

I want each section of the domain to be in its own group.
http:// group one
app. group two
domain. group three
com group four

The issue I am running into is that I can't expand the number of groups to handle an arbitrary number of subdomains. My end goal is to strip the http:// portion and the subdomains and keep just the main domain, but I can't get this to work.

This is my implementation; I just don't know how to expand the groups. Can anyone guide me?

/(https?:\/\/)?([a-z]+\.)([a-z]+\.)([a-z]+)/i

#2

Are you GREPing through files looking for URIs? If so, I would pass the entire URI match to cut or awk, like so:

[email protected]:/tmp$ cat test-doc
This file has a lot of domains. Like http://domain.com. and some more
domains. httpS://domain.com. sometimes they are in an html element like
<a href="http://app.domain.com">link</a>. But other times they are just
free floating. app.domain.com. And they don't always have a period at the end
cool.app.domain.com but sometime they do domain.com. and then the file is
done.

I think this expression would grab all your matches:
egrep -o -i "(https?:\/\/)?([a-z]+\.)+([a-z]+)"

[email protected]:/tmp$ egrep -o -i "(https?:\/\/)?([a-z]+\.)+([a-z]+)" test-doc
http://domain.com
httpS://domain.com
http://app.domain.com
app.domain.com
cool.app.domain.com
domain.com
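
A note on the original question about getting each subdomain into its own group: with most regex engines a repeated group such as ([a-z]+\.)+ only keeps its last repetition, so the number of groups cannot grow with the number of subdomains. That is why matching the whole name and splitting it afterwards is easier. A minimal sketch, assuming a sed that supports -E (GNU and BSD sed both do); show_groups is a made-up helper name, not part of any tool:

```shell
# Hypothetical helper: print what capture groups 2 and 3 hold after matching.
# Group 2 is the repeated "label." group, group 3 is the final label.
show_groups() {
  printf '%s\n' "$1" |
    sed -E 's!^(https?://)?([a-z]+\.)+([a-z]+)$!group2=\2 group3=\3!'
}

show_groups cool.app.domain.com
# prints: group2=domain. group3=com
```

The "cool." and "app." repetitions were matched but are not recoverable from the groups, only the last one ("domain.") is.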

and then just pipe those to sed to remove the protocol:

sed 's!http[s]\?://!!i'

and then to cut:

rev | cut -d '.' -f 1,2 | rev
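
In case the rev trick looks odd: cut can only count fields from the left, so reversing the line first turns "the last two fields" into "the first two fields", and the second rev puts the characters back in order. A small sketch of that one step on its own (last_two_labels is just an illustrative name I picked):

```shell
# Hypothetical helper: keep only the last two dot-separated labels of a name.
# cool.app.domain.com -> moc.niamod.ppa.looc -> moc.niamod -> domain.com
last_two_labels() {
  printf '%s\n' "$1" | rev | cut -d '.' -f 1,2 | rev
}

last_two_labels cool.app.domain.com
# prints: domain.com
```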

the whole thing:

[email protected]:/tmp$ egrep -o -i "(https?:\/\/)?([a-z]+\.)+([a-z]+)" test-doc | 
sed 's!http[s]\?://!!i' | rev | cut -d '.' -f 1,2 | rev
domain.com
domain.com
domain.com
domain.com
domain.com
domain.com
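
One caveat with keeping only the last two labels: it gives the wrong answer for names under a two-label suffix, e.g. www.example.co.uk would come out as co.uk. If that matters for your data, a rough sketch of one way around it is below. The function name main_domain and the three-entry suffix list are assumptions for illustration only; a real list of such suffixes is much longer.

```shell
# Hypothetical helper: read host names on stdin, print the main domain.
# Keeps three labels when the last two form a known two-label suffix,
# otherwise keeps two. The suffix list here is a tiny hand-picked sample.
main_domain() {
  awk -F. '
    BEGIN { multi["co.uk"]; multi["com.au"]; multi["co.jp"] }
    NF < 2 { print; next }              # no dot at all: pass through unchanged
    {
      n = 2
      suffix = $(NF-1) "." $NF
      if (NF >= 3 && (suffix in multi)) n = 3
      out = $(NF - n + 1)               # first label to keep
      for (i = NF - n + 2; i <= NF; i++) out = out "." $i
      print out
    }'
}

printf '%s\n' cool.app.domain.com www.example.co.uk | main_domain
# prints:
# domain.com
# example.co.uk
```

It would replace the rev | cut | rev stage at the end of the pipeline.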

Or, if you need fancier processing, you could use awk instead:

awk 'BEGIN{FS="."}{ for (i=1 ; i < NF+1 ; ++i ) printf "group %i = %s\n", i, $i; 
printf "TLD = %s.%s\n", $(NF-1), $NF}'

and the whole thing:

[email protected]:/tmp$ egrep -o -i "(https?:\/\/)?([a-z]+\.)+([a-z]+)" test-doc |
sed 's!http[s]\?://!!i' |
awk 'BEGIN{FS="."}{ printf "match = %s\n", $0; for (i=1 ; i < NF+1 ; ++i ) printf "group %i = %s\n", i, $i; printf "TLD = %s.%s\n\n", $(NF-1), $NF}'
match = domain.com
group 1 = domain
group 2 = com
TLD = domain.com

match = domain.com
group 1 = domain
group 2 = com
TLD = domain.com

match = app.domain.com
group 1 = app
group 2 = domain
group 3 = com
TLD = domain.com

match = app.domain.com
group 1 = app
group 2 = domain
group 3 = com
TLD = domain.com

match = cool.app.domain.com
group 1 = cool
group 2 = app
group 3 = domain
group 4 = com
TLD = domain.com

match = domain.com
group 1 = domain
group 2 = com
TLD = domain.com

If you are not using grep, what regex parser are you using? std::regex?