Are you grepping through files looking for URIs? If so, I would pass the entire URI match to cut or awk, like so:
clifford@Office-PC:/tmp$ cat test-doc
This file has a lot of domains. Like http://domain.com. and some more
domains. httpS://domain.com. sometimes they are in an html element like
<a href="http://app.domain.com">link</a>. But other times they are just
free floating. app.domain.com. And they don't always have a period at then end
cool.app.domain.com but sometime they do domain.com. and then the file is
done.
I think this expression would grab all your matches:
egrep -o -i "(https?://)?([a-z]+\.)+([a-z]+)"
clifford@Office-PC:/tmp$ egrep -o -i "(https?:\/\/)?([a-z]+\.)+([a-z]+)" test-doc
http://domain.com
httpS://domain.com
http://app.domain.com
app.domain.com
cool.app.domain.com
domain.com
Then pipe those matches to sed to remove the protocol:
sed 's!http[s]\?://!!i'
and then to rev and cut, which keeps only the last two dot-separated fields:
rev | cut -d '.' -f 1,2 | rev
The whole thing:
clifford@Office-PC:/tmp$ egrep -o -i "(https?:\/\/)?([a-z]+\.)+([a-z]+)" test-doc |
sed 's!http[s]\?://!!i' | rev | cut -d '.' -f 1,2 | rev
domain.com
domain.com
domain.com
domain.com
domain.com
domain.com
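The steps above can be wrapped into one reusable function. This is only a sketch: the name extract_domains is made up for illustration, it assumes GNU sed (for the case-insensitive I flag) and a system that ships rev, and it uses grep -E because recent GNU grep releases print an obsolescence warning for egrep.

```shell
# Hypothetical helper: print the last two dot-separated labels of every
# URL or bare hostname found in the given files (or on stdin if none).
extract_domains() {
    # -h suppresses file name prefixes when several files are passed
    grep -E -h -o -i '(https?://)?([a-z]+\.)+[a-z]+' "$@" |
        sed -E 's!^https?://!!I' |        # drop the scheme, any letter case (GNU sed)
        rev | cut -d '.' -f 1,2 | rev     # keep the last two fields
}

# Example run on a single line of text
printf '%s\n' 'See http://app.domain.com. or httpS://domain.com.' | extract_domains
```

Append | sort -u if you only want each domain once. Keep in mind that "last two labels" is a simplification: a host such as www.example.co.uk would come out as co.uk.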
Or, if you need fancier processing, you could use awk instead:
awk 'BEGIN{FS="."}{ for (i=1 ; i < NF+1 ; ++i ) printf "group %i = %s\n", i, $i;
printf "TLD = %s.%s\n", $(NF-1), $NF}'
And the whole thing:
clifford@Office-PC:/tmp$ egrep -o -i "(https?:\/\/)?([a-z]+\.)+([a-z]+)" test-doc | sed 's!http[s]\?://!!i' |
awk 'BEGIN{FS="."}{ printf "match = %s\n", $0; for (i=1 ; i < NF+1 ; ++i ) printf "group %i = %s\n", i, $i;
printf "TLD = %s.%s\n\n", $(NF-1), $NF}'
match = domain.com
group 1 = domain
group 2 = com
TLD = domain.com

match = domain.com
group 1 = domain
group 2 = com
TLD = domain.com

match = app.domain.com
group 1 = app
group 2 = domain
group 3 = com
TLD = domain.com

match = app.domain.com
group 1 = app
group 2 = domain
group 3 = com
TLD = domain.com

match = cool.app.domain.com
group 1 = cool
group 2 = app
group 3 = domain
group 4 = com
TLD = domain.com

match = domain.com
group 1 = domain
group 2 = com
TLD = domain.com
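If you only need the final two labels, awk can also take over the job of sed, so the post-processing becomes a single command. A minimal sketch, assuming one match per input line (as grep -o produces); the function name last_two_labels is hypothetical:

```shell
# Hypothetical helper: strip an optional http(s) scheme and print the
# last two dot-separated labels of each input line.
last_two_labels() {
    awk -F. '{
        # Remove a leading http:// or https:// in any letter case;
        # modifying $0 makes awk re-split the fields.
        sub(/^[Hh][Tt][Tt][Pp][Ss]?:\/\//, "")
        if (NF >= 2) print $(NF-1) "." $NF
    }'
}

printf '%s\n' 'httpS://example.org' 'cool.app.domain.com' | last_two_labels
```

The bracket expressions spell out the case-insensitive match by hand, so this variant should not depend on GNU-only flags.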
If you are not using grep, which regex engine are you using? std::regex?