Trying to make a command that tags files from HTML source


#1

Hi, I am fairly new to Linux scripting and I need help making a command that will search through the contents of one or more HTML files and use selected content from the HTML to rename the folder associated with each file. Basically, I have downloaded content into folders, and I need to pull tagging data from the webpage and append those tags to the end of the folder names to make them keyword searchable. The goal is for each folder to read as “John Doe [person1] [person2]”, etc.

I have already used wget to acquire all the HTML files, which are named like “index 01.html”. Each HTML file contains the string ‘<meta itemprop="name" content="John Doe" />’. This name is exactly the same as the current name of the folder, so it can be used to locate the folder that needs to be changed. Further down in the HTML is another string, ‘<meta name="twitter:description" content=" person1, person2, ect" />’, which contains the tags I wish to add to the folder.
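To make the target format concrete, here is a minimal bash sketch of turning the comma-separated tag string into the bracketed suffix described above. The variable names (raw, suffix, newname) are made up for illustration, and the sample value is simply the one quoted in this post; it assumes tags never contain commas themselves.

```shell
#!/bin/bash
# Sample value copied from the twitter:description line quoted in the post.
raw=' person1, person2, ect'

suffix=''
# Split on commas only; surrounding spaces stay attached to each piece for now.
IFS=',' read -ra parts <<< "$raw"
for part in "${parts[@]}"; do
    # Trim leading, then trailing, whitespace with parameter expansion.
    part="${part#"${part%%[![:space:]]*}"}"
    part="${part%"${part##*[![:space:]]}"}"
    # Skip empty pieces (for example from a trailing comma).
    if [ -n "$part" ]; then
        suffix+=" [$part]"
    fi
done

# "John Doe" stands in for the folder name taken from the itemprop line.
newname="John Doe$suffix"
printf '%s\n' "$newname"
```

With the sample value this prints John Doe [person1] [person2] [ect].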

I have looked all over for how to do this and I believe it involves using grep, but all of the examples I have seen just use grep to perform search-and-replace operations, not to search for a string and then pipe the contents of that string to another application.

Lastly, if the command would repeat the operation for every .html file within the folder that would save me a TON of time.


#2

If you change the name to anything other than index.html then any corresponding scripts or CSS files won’t be able to find it and it will break. Also, it will no longer be hostable.

If any of the stuff you downloaded to use offline was made with Webpack, or similar tool, then this will not be possible as it obfuscates everything.


#3

I'm not looking to preserve the HTML; I was planning to just remove it after the command finished. I just want to extract the tags and append them to the name of the folder so it can be searched for using catfish or some other search program. These files will be for offline use only.


#4

So you’re not going to use them, just store them?


#5

Indeed, sorry if I was not clear. I just want to extract text and match it with a corresponding folder name in order to append that text to the folder name, so it can be more easily searched for should I need to call up all instances of, say, “person1”.


#6

Ahh lol now I get it. Sorry. :bowing_man:


#7

No problem, I kind of went about explaining it backwards.


#8

OK, so I have started building this script, and so far I am using the method from https://stackoverflow.com/questions/15793452/how-do-i-get-value-of-variable-from-file-using-shell-script to find values within my HTML file. The command I am currently using is grep -Po '(?<=<meta itemprop="name" content=").*' index\ 01.html which outputs John Doe" /> so how would I modify the command to exclude the " /> at the end? I just want the text between the quotation marks.


#9

So I managed to fix this issue by piping the output to tr, so the command reads grep -Po '(?<=<meta itemprop="name" content=)".*"' index\ 01.html | tr -d '"' but this seems like a messy solution, and I am curious how you would do this with grep alone.
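One way to do this in grep alone, sketched here under the assumption that GNU grep with -P (PCRE) support is available, is to keep the opening quote inside the lookbehind and match [^"]* instead of .* so the match stops at the closing quote. The temporary sample file below only reproduces the two meta lines quoted earlier in the thread; \K is shown as an alternative to the lookbehind.

```shell
#!/bin/bash
# Throwaway sample file containing the two meta lines quoted in the thread.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<meta itemprop="name" content="John Doe" />
<meta name="twitter:description" content=" person1, person2, ect" />
EOF

# [^"]* matches everything up to, but not including, the next double quote,
# so the trailing " /> never becomes part of the output and tr is not needed.
name=$(grep -Po '(?<=<meta itemprop="name" content=")[^"]*' "$sample")

# \K discards everything matched so far, which avoids the lookbehind entirely.
tags=$(grep -Po '<meta name="twitter:description" content="\K[^"]*' "$sample")

printf 'name=%s\n' "$name"
printf 'tags=%s\n' "$tags"
rm -f "$sample"
```

Both forms depend on the attribute order and spacing being exactly as shown in the posts; a page that writes the tag differently would need a different pattern or a real HTML parser.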


#10

OK, so this is what I have got so far:

#!/bin/bash
##The badass tagging script

##TODO: make script loop

##Define variables
##Only the first .html file is handled for now
htmlfile=$(ls -1 *.html | head -1)
##"$htmlfile" is quoted because the file names contain spaces ("index 01.html")
targetname=$(grep -Po '(?<=<meta itemprop="name" content=)".*"' "$htmlfile" | tr -d '"')
tags=$(grep -Po '(?<=<meta name="twitter:description" content=)".*"' "$htmlfile" | tr -d ',"')
##Strip only the leading "./" that find prints; tr -d "./" also deleted every dot in the name
resolvedname=$(find . -type d -name "*$targetname*" | sed 's|^\./||')

##Execution Code

echo "Target Name is"
echo "$targetname"
echo "Resolved name is"
echo "$resolvedname"
mv "$resolvedname" "$targetname"
echo "Name Changed To"
echo "$targetname"
##Write the tags to a .tags file inside the renamed folder
##($tags is left unquoted so the leading space is dropped)
echo $tags > "$targetname/$targetname.tags"
echo "Removing Source HTML"
rm "$htmlfile"
echo "Repeat"

The issue is that this script is not very flexible, and unexpected names are causing issues. I also intend to make this script loop.
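A looped version could look roughly like the sketch below. It is not a drop-in replacement for the script above: it follows the original goal from the first post (appending bracketed tags to the folder name) rather than writing a .tags file. It assumes GNU grep with -P, that the folder name is exactly the itemprop name, and that names contain no HTML entities such as &amp;. The function name tag_folders and the demo directory are made up for illustration.

```shell
#!/bin/bash
# tag_folders DIR (hypothetical helper): for every .html file in DIR, rename the
# folder named in the itemprop line to "Name [tag1] [tag2] ..." and delete the file.
tag_folders() {
    local dir=$1 html name raw suffix part parts
    for html in "$dir"/*.html; do
        [ -e "$html" ] || continue    # the glob matched nothing
        # -m1 stops reading after the first matching line.
        name=$(grep -Po -m1 '<meta itemprop="name" content="\K[^"]*' "$html")
        raw=$(grep -Po -m1 '<meta name="twitter:description" content="\K[^"]*' "$html")
        if [ -z "$name" ] || [ ! -d "$dir/$name" ]; then
            echo "Skipping $html: no folder named '$name'" >&2
            continue
        fi
        # Build the " [tag1] [tag2]" suffix from the comma-separated tag string.
        suffix=''
        IFS=',' read -ra parts <<< "$raw"
        for part in "${parts[@]}"; do
            part="${part#"${part%%[![:space:]]*}"}"   # trim leading whitespace
            part="${part%"${part##*[![:space:]]}"}"   # trim trailing whitespace
            if [ -n "$part" ]; then suffix+=" [$part]"; fi
        done
        # Leave everything alone if the new name is taken (or no tags were found).
        if [ -e "$dir/$name$suffix" ]; then
            echo "Skipping $html: '$name$suffix' already exists" >&2
            continue
        fi
        # Only remove the HTML file once the rename has succeeded.
        mv -- "$dir/$name" "$dir/$name$suffix" && rm -- "$html"
    done
}

# Demo on a throwaway directory so nothing real is touched.
demo=$(mktemp -d)
mkdir "$demo/John Doe"
cat > "$demo/index 01.html" <<'EOF'
<meta itemprop="name" content="John Doe" />
<meta name="twitter:description" content=" person1, person2, ect" />
EOF
tag_folders "$demo"
ls "$demo"
```

Quoting every expansion and dropping eval is what lets names with spaces pass through unchanged. Running it against a copy of the real download folder first, for example tag_folders /path/to/copy, would be a sensible precaution since the HTML files are deleted as it goes.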