Saving info for use with LlamaIndex

I’ve been playing with LlamaIndex for about a week, and now I want to save a bunch of documentation so I have it available to query. I’ve been considering different approaches to saving the content of websites, and so far the most promising tool I’ve found is pandoc.

Bash script generated by Mistral v0.2, with some minor edits to use pandoc the way I want:
```bash
#!/bin/bash

# Check that the user provided a URL as an argument
if [ $# -ne 1 ]; then
    echo "Usage: ./save_url_as_markdown.sh <URL>"
    exit 1
fi

# Strip the scheme, then split the URL into its directory part and page name
full_url="$1"
url="${full_url#*://}"
domain="${url%/*}"
filename="${url##*/}"

# Create the output directory (mkdir -p creates the nested path directly)
mkdir -p "$domain"
cd "$domain" || exit 1

# Download the webpage content using wget (pass the original URL, scheme included)
wget -q -O "$filename.html" "$full_url"

# Convert HTML to Markdown using pandoc
pandoc -f html -t markdown "$filename.html" > "$filename.md"
rm "$filename.html"

# Print success message
echo "Markdown file '$filename.md' created successfully in $(pwd)"
```
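To make the directory layout concrete, here is the same parameter-expansion parsing in isolation, run against a hypothetical URL (`example.com/docs/page` is just an illustration, not a real target):

```shell
# Pure-bash URL parsing, as used in the script above
url="https://example.com/docs/page"
url="${url#*://}"        # strip the scheme -> example.com/docs/page
domain="${url%/*}"       # everything before the last slash -> example.com/docs
filename="${url##*/}"    # last path component -> page

# The markdown ends up at <domain path>/<page>.md
echo "$domain/$filename.md"   # -> example.com/docs/page.md
```

So a page downloaded from that URL would be saved as `example.com/docs/page.md`, which is what lets the LLM report the source URL later.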

This works all right: it saves the markdown in a directory that matches the URL of the website it came from, so I still have the original URL to return if I ask the LLM to cite where it got the info. The formatting can be a little hit or miss, but it generally seems to work. Ideally I would also like it to save images and embed them in the markdown, but at this point I might be asking too much.
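For the image part, pandoc does have a documented `--extract-media` flag that copies media linked from the source document into a local directory and rewrites the image paths in the output. Whether remote images are actually fetched depends on the pandoc version and on network access, so this is only a sketch; `page` and the `media/` directory name are arbitrary choices here, standing in for the script's `$filename`:

```shell
# Hypothetical page name, matching the script's $filename variable
filename="page"
media_dir="media/$filename"

# Only attempt the conversion if pandoc is installed and the HTML was downloaded;
# --extract-media saves linked images under $media_dir and updates the markdown
# image references to point at the local copies
if command -v pandoc >/dev/null 2>&1 && [ -f "$filename.html" ]; then
    pandoc -f html -t markdown --extract-media="$media_dir" "$filename.html" -o "$filename.md"
fi
```

If this works as hoped, the generated markdown would reference images under `media/page/` instead of the original remote URLs.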

I mostly just wanted to reach out and see whether people have been using LlamaIndex or archiving information they find online. The thread below has some suggestions if others are looking for alternative solutions.