Pdfgrep batch rename hell

Here is the desired file name format if the conditions are satisfied; if not, leave the file as it is:

yyyy-mm-dd_predetermined-keywords-if-found.pdf

Here are the problems I have:

1 multiple date formats: yyyy-mm-dd, dd-mm-yyyy
2 multiple date values in the PDF; how do you pick the “latest”?
3 some PDFs do have the date field, but the date is on the next line / below it, and I can't get pdfgrep to read the date

What sort of patterns have you tried?

Maybe explain a bit more about what you’re trying to do? I appreciate the attempt at brevity, but I feel like I’m missing a lot of context here.

You could write a bash function to parse out the dates, then compare Y/M/D.

Something like nested switch case?

This. You’d want to split your date strings into fields and check the length of each field, then re-order the dd-mm-yyyy format into yyyy-mm-dd. If you have two-digit years, you really won’t be able to parse those out programmatically with any ease, because nothing tells you which field is the year.
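A minimal sketch of that re-ordering, using a case statement to tell the two shapes apart. The function name normalize_date is made up for illustration, and it assumes dash-separated dates with zero-padded day and month; anything else is rejected rather than guessed at.

```shell
# normalize_date is a hypothetical helper, not part of pdfgrep.
# Prints the date as yyyy-mm-dd, or returns 1 if the shape is not recognised.
normalize_date() {
    case $1 in
        [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9])
            # Already yyyy-mm-dd.
            printf '%s\n' "$1" ;;
        [0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9])
            # dd-mm-yyyy: cut the three fields apart and reverse them.
            d=${1%%-*}
            y=${1##*-}
            m=${1#*-}; m=${m%-*}
            printf '%s-%s-%s\n' "$y" "$m" "$d" ;;
        *)
            return 1 ;;
    esac
}

normalize_date 17-05-2024   # prints 2024-05-17
normalize_date 2024-05-17   # prints 2024-05-17
```

Once every date is in yyyy-mm-dd form, comparing them as plain strings gives the same order as comparing them as dates.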

pdfgrep supports the -A flag, which prints a given number of lines after each matched line.
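A small sketch of how -A helps with a date that sits below its label. Plain grep on sample text stands in for pdfgrep here, since -A means the same thing in both; the "Invoice date" label and the sample lines are made up.

```shell
# Sample text standing in for a PDF page; the "Invoice date" label is made up.
sample='Invoice date
17-05-2024
Total 12.00'

# -A 1 keeps one line after each match, so the date below the label is still
# there for the second grep to pick out.
printf '%s\n' "$sample" | grep -A 1 'Invoice date' | grep -oE '[0-9]{2}-[0-9]{2}-[0-9]{4}'
# With a real file the first command would be: pdfgrep -A 1 'Invoice date' file.pdf
```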

Shame on me for not reading the docs properly.

So far I've got this working, but I've got some edge cases:

#!/bin/bash
keywords=("keyword1" "keyword2" "keyword3" "keyword4" "keyword5" "keyword6" "keyword7" "keyword8" "keyword9" "keyword10")

for file in *.pdf; do
    echo "Processing file: $file"
    dates=$(pdfgrep -A 3 "abc |abcd|abcd;" "$file" | grep -oE '202[0-9]-[0-9][0-9]-[0-9][0-9]|[0-9][0-9]-[0-9][0-9]-202[0-9]|[0-9][0-9]-[0-9][0-9]-2[0-9]|[0-9][0-9]\.[0-9][0-9]\.2[0-9]')
    echo "Dates found: $dates"
    if [ -n "$dates" ]; then
        latest_date=$(echo "$dates" | uniq | tail -n 1)
    else
        latest_date=""
    fi
    echo "Latest date: $latest_date"
    keyword=""
    for key in "${keywords[@]}"; do
        match=$(pdfgrep -i "$key" "$file" | grep -oE "$key")
        if [ -n "$match" ]; then
            keyword+="${match}_"
        fi
    done
    echo "Keywords found: $keyword"
    keyword=${keyword%_}
    if [ -n "$latest_date" ]; then
        if [ -n "$keyword" ]; then
            new_filename="${latest_date}_${keyword}.pdf"
        else
            new_filename="${latest_date}.pdf"
        fi
        counter=1
        while [ -e "$new_filename" ]; do
            new_filename="${latest_date}_${keyword}_$counter.pdf"
            counter=$((counter + 1))
        done
        mv "$file" "$new_filename"
        echo "Renamed: $file to $new_filename"
    fi
done

You will always have edge cases. Word of advice: if your script handles 90% of the files, and covering the remaining 10% in code would take ten times as long as renaming them by hand, just make sure the script detects the ones it can't handle and fix those manually. Problem solved.

That’s great, but with rename tasks it is very easy to delete a lot of files by renaming them to the same name, or otherwise screw up the names. I’ve seen people lose a lot of data quite a few times.

mv -i can help avoid this, but a mess can still be the result.

IME a good way to avoid these problems is to use a script to generate a script of mv -i commands. Then you can check it, and check it again, or maybe use an editor to fix up edge cases. It may not matter much if the generated script is millions of lines long.
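A rough sketch of the generate-then-review idea. The helper names quote and emit_mv are made up for illustration; quote wraps a name in single quotes so spaces and apostrophes survive, though names that end in a newline are not handled.

```shell
# quote STRING: print STRING wrapped in single quotes, with any embedded
# single quote rewritten as '\'' so the result can be pasted into a script.
quote() {
    printf "'%s'" "$(printf '%s' "$1" | sed "s/'/'\\\\''/g")"
}

# emit_mv OLD NEW: print a mv -i command instead of running it.
emit_mv() {
    printf 'mv -i -- %s %s\n' "$(quote "$1")" "$(quote "$2")"
}

emit_mv "scan 001.pdf" "2024-05-17_invoice.pdf"
emit_mv "bob's bill.pdf" "2024-06-01_bill.pdf"
```

In a rename loop you would call emit_mv where the script currently calls mv, redirect the output to a file such as rename.sh, read it through, and only then run it with sh rename.sh.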

Also, for the “the date is in the next line” problem I’ve used pdftotext (from poppler-utils on Ubuntu and Debian) or pdf2txt (python3-pdfminer on ubuntu, pdfminer on debian) to extract the text, and an awk script to put things together. It worked well for getting data from utility bills.
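Something along these lines, as a sketch only: the awk program remembers that it has just seen a label-only line and prints the line that follows. The "Invoice date:" label, the sample text, and the function name value_after_label are all made up.

```shell
# Sample pdftotext-style output; the "Invoice date:" label is a made-up example.
sample='Customer 1234
Invoice date:
17-05-2024
Total 12.00'

# value_after_label: read text on stdin and print the line that follows a
# line consisting only of the label.
value_after_label() {
    awk '
        want { print; want = 0 }
        /^Invoice date:[ \t]*$/ { want = 1 }
    '
}

printf '%s\n' "$sample" | value_after_label   # prints 17-05-2024
```

With a real file the input would come from pdftotext file.pdf - instead of the sample variable.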

Folks… I know this is old, about 7 months, and I know gnatfnt probably won't come back to see this, but I found this moderately useful. I was in a different but similar situation: I needed to take the PDF title from the metadata, since the filenames weren't even close, then sort the files and archive them in an encrypted zip. So I did this in my script. I added some error handling and made it not so dumb:

#!/bin/sh

# Set a flag to track whether any files were renamed
renamed_any=0

# Find all PDF files in the current directory
for pdf in *.pdf; do
  # Only process if it's a file
  [ -f "$pdf" ] || continue

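  # Extract the title from the PDF file using pdfinfo.
  # Only the leading "Title:" label is stripped, so titles containing ": " stay whole.
  title=$(pdfinfo "$pdf" | sed -n '/^Title:/{s/^Title:[[:space:]]*//;s/[[:space:]]*$//;p;}')
  
  # Check for non-empty title
  if [ -n "$title" ]; then
    # Replace spaces with underscores, and slashes too since they cannot appear in a file name
    new_name=$(printf '%s' "$title" | tr ' /' '__').pdf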

    # Check for filename conflict
    if [ "$new_name" != "$pdf" ] && [ ! -e "$new_name" ]; then
      if mv "$pdf" "$new_name"; then
        renamed_any=1
      else
        echo "Failed to rename '$pdf' to '$new_name'" >&2
      fi
    else
      echo "Cannot rename '$pdf': target '$new_name' exists or is the same as the source." >&2
    fi
  else
    echo "No title found for '$pdf', skipping." >&2
  fi
done

# Archive every PDF in the directory into a password-protected zip file if any were renamed
if [ "$renamed_any" -eq 1 ]; then
  zip -P "CHANGEME" archived_pdfs.zip *.pdf || echo "Failed to create zip archive." >&2
else
  echo "No files were renamed; skipping zip creation." >&2
fi

As for my feedback for the OP

I did make some changes to what you did. Minor ones and I will explain them after I output the script here:

#!/bin/sh

keywords="keyword1 keyword2 keyword3 keyword4 keyword5 keyword6 keyword7 keyword8 keyword9 keyword10"

for file in *.pdf; do
    [ -e "$file" ] || continue  # Skip if no PDF files exist
    echo "Processing file: $file"
    
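    # Extract dates from the PDF using pdfgrep.
    # Both commands take extended regexes here, so alternation is a bare |
    # (a backslashed \| would match a literal pipe character).
    dates=$(pdfgrep -A 3 "abc|abcd|abcd;" "$file" | \
            grep -oE '202[0-9]-[0-9][0-9]-[0-9][0-9]|[0-9][0-9]-[0-9][0-9]-202[0-9]|[0-9][0-9]-[0-9][0-9]-2[0-9]|[0-9][0-9]\.[0-9][0-9]\.2[0-9]')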
    
    # Check for valid date presence
    if [ -n "$dates" ]; then
        latest_date=$(echo "$dates" | sort -u | tail -n 1)
    else
        echo "No valid dates found, skipping..."
        continue
    fi
    
    echo "Latest date: $latest_date"
    keyword=""

    # Loop over keywords to build keyword string
    for key in $keywords; do
        if match=$(pdfgrep -i "$key" "$file" | grep -oE "$key"); then
            keyword="${keyword}${match}_"
        fi
    done

    # Remove trailing underscore if any keywords were found
    keyword=${keyword%_}
    echo "Keywords found: $keyword"

    # Determine new filename based on dates and keywords
    if [ -n "$latest_date" ]; then
        new_filename="${latest_date}${keyword:+_$keyword}.pdf"

        # Handle filename collisions
        counter=1
        while [ -e "$new_filename" ]; do
            new_filename="${latest_date}${keyword:+_$keyword}_$counter.pdf"
            counter=$((counter + 1))
        done
        
        mv "$file" "$new_filename" && echo "Renamed: $file to $new_filename"
    fi
done

So I am a bit pedantic. I like scripts to be AS PORTABLE as possible, which means full POSIX compliance. It's often overlooked. What did I do to achieve POSIX compliance? First I changed #!/bin/bash to #!/bin/sh so the script runs under any POSIX-compliant shell. On its own that is not enough: anything bash-specific in the body has to go as well.

Also I did change how the keyword handling works. Instead of the array syntax, which is specific to bash, I went with a space-separated list, which works with sh. The trade-off is that a keyword can no longer contain a space.

I think you shouldn’t skip out on error handling if you ever intend to use this script later. So I added a check for the case where no files are found: [ -e "$file" ] || continue. When nothing matches, the shell leaves the pattern *.pdf unexpanded and the loop runs once with that literal string; the check skips that iteration, so the script does not run without valid input.
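To see the guard in action, here is a small sketch; count_pdfs is a made-up function that counts the PDFs in a directory using the same check.

```shell
# count_pdfs DIR: count the .pdf files in DIR (hypothetical helper).
count_pdfs() {
    n=0
    for file in "$1"/*.pdf; do
        # An unmatched glob stays as the literal ".../*.pdf"; skip it.
        [ -e "$file" ] || continue
        n=$((n + 1))
    done
    echo "$n"
}

empty_dir=$(mktemp -d)
count_pdfs "$empty_dir"    # prints 0
touch "$empty_dir/a.pdf"
count_pdfs "$empty_dir"    # prints 1
rm -r "$empty_dir"
```

Without the check, the first call would report 1, because the loop body runs once for the unexpanded pattern.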

In addition to the above changes, I split the pdfgrep | grep pipeline across two lines for better readability. Just a little creature comfort here. Additionally, sort -u replaces uniq, so the dates are sorted and de-duplicated before tail -n 1 picks the last one.
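One thing to keep in mind: sort compares text, not dates, so the last line is only the latest date when every entry is already yyyy-mm-dd. A quick illustration with made-up dates:

```shell
# All ISO dates: the last line after sorting is the latest date.
printf '%s\n' 2024-11-02 2025-01-15 2024-03-09 | sort -u | tail -n 1
# prints 2025-01-15

# Mixed formats: the dd-mm-yyyy entry "31-01-2024" sorts last because it
# starts with "3", even though 2025-01-15 is the later date.
printf '%s\n' 2025-01-15 31-01-2024 | sort -u | tail -n 1
# prints 31-01-2024
```

So the dates want converting to one format before this step.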

The next change I made was with your keyword building. Instead of appending with +=, which plain sh does not have, I used ordinary expansion, keyword="${keyword}${match}_", and the test for a match now rides on grep's exit status in the if. You might find this useful for future scripts. :wink:

Now moving on to the next thing. Something I make a habit of in my scripts, and which you may find helpful later, is building new filenames carefully. I used the conditional expansion ${keyword:+_$keyword} to include the keyword part in the filename only if keyword is non-empty, which also removes the need for the separate if check in your code.
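A tiny sketch of what that expansion does; build_name is a made-up helper and the date and keyword are sample values.

```shell
# build_name DATE KEYWORDS: KEYWORDS may be empty (hypothetical helper).
build_name() {
    # ${2:+_$2} expands to "_<keywords>" when $2 is set and non-empty,
    # and to nothing otherwise.
    printf '%s%s.pdf\n' "$1" "${2:+_$2}"
}

build_name 2024-05-17 invoice   # prints 2024-05-17_invoice.pdf
build_name 2024-05-17 ""        # prints 2024-05-17.pdf
```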

The last bit of error handling was for your move operation. The mv command is now followed by && so that the rename message is echoed only if the move succeeded, which in my opinion is just another way to make sure the script reports what actually happened.

If you never come back @gnatfnt hopefully this email notification finds you well :joy:

Hi there, the script is currently handling over 90% of the PDFs it was built for:



#!/bin/bash

keywords+=(
  "stuff"
)


extract_clean_dates() {
  local text="$1"
  local clean_dates=""
  # One combined pattern. The regex engine returns the leftmost match, so
  # "2024-05-17" is taken whole instead of being picked up as "24-05-17".
  local re='([0-9]{4}[-/][0-9]{1,2}[-/][0-9]{1,2})|([0-9]{1,2}[-/.][0-9]{1,2}[-/.][0-9]{2,4})'
  
  while IFS= read -r line; do
    while [[ "$line" =~ $re ]]; do
      clean_dates+="${BASH_REMATCH[0]}"$'\n'
      # Drop everything up to and including this match, then look again.
      line="${line#*"${BASH_REMATCH[0]}"}"
    done
  done <<< "$text"
  
  echo "$clean_dates"
}

format_date() {
  local date_str="$1"
  local formatted_date=""

  date_str=$(echo "$date_str" | xargs)
  
  if [[ "$date_str" =~ ^([0-9]{4})[-/]([0-9]{1,2})[-/]([0-9]{1,2})$ ]]; then
    # yyyy-mm-dd or yyyy/mm/dd
    year="${BASH_REMATCH[1]}"
    month="${BASH_REMATCH[2]}"
    day="${BASH_REMATCH[3]}"
  elif [[ "$date_str" =~ ^([0-9]{1,2})[-/]([0-9]{1,2})[-/]([0-9]{4})$ ]]; then
    # dd-mm-yyyy or dd/mm/yyyy
    day="${BASH_REMATCH[1]}"
    month="${BASH_REMATCH[2]}"
    year="${BASH_REMATCH[3]}"
  elif [[ "$date_str" =~ ^([0-9]{1,2})[.]([0-9]{1,2})[.]([0-9]{2})$ ]]; then
    # dd.mm.yy, taken to mean 20yy
    day="${BASH_REMATCH[1]}"
    month="${BASH_REMATCH[2]}"
    year="20${BASH_REMATCH[3]}"
  elif [[ "$date_str" =~ ^([0-9]{1,2})[-]([0-9]{1,2})[-]([0-9]{2})$ ]]; then
    # dd-mm-yy, taken to mean 20yy
    day="${BASH_REMATCH[1]}"
    month="${BASH_REMATCH[2]}"
    year="20${BASH_REMATCH[3]}"
  else
    echo "Invalid date format: $date_str" >&2
    return 1
  fi

  # Zero-pad month and day once for all formats. 10# forces base 10 so that
  # "08" and "09" are not rejected as invalid octal numbers.
  month=$(printf "%02d" "$((10#$month))")
  day=$(printf "%02d" "$((10#$day))")

  if (( 10#$year >= 2024 && 10#$year <= 2025 )) && (( 10#$month >= 1 && 10#$month <= 12 )) && (( 10#$day >= 1 && 10#$day <= 31 )); then
    case $month in
      "04"|"06"|"09"|"11")
        if (( 10#$day > 30 )); then
          echo "Invalid day for month $month: $day" >&2
          return 1
        fi
        ;;
      "02")
        max_days=28
        if (( 10#$year % 4 == 0 )) && (( 10#$year % 100 != 0 || 10#$year % 400 == 0 )); then
          max_days=29
        fi
        if (( 10#$day > max_days )); then
          echo "Invalid day for February $year: $day" >&2
          return 1
        fi
        ;;
    esac
    
    formatted_date="$year-$month-$day"
  else
    echo "Invalid date components: $year-$month-$day" >&2
    return 1
  fi

  echo "$formatted_date"
}

for file in *.pdf; do
  [[ -f "$file" ]] || continue
  
  echo "Processing file: $file"

  # Extract the text once. "-" tells pdftotext to write to stdout, so no
  # temporary file is created or left behind. Fall back to pdfgrep on failure.
  raw_text=""
  if raw_text=$(pdftotext "$file" - 2>/dev/null); then
    dates=$(extract_clean_dates "$raw_text")
  else
    echo "Error extracting text from $file. Trying pdfgrep method."
    raw_dates=$(pdfgrep -o -P "([0-9]{1,2})[-/\.]([0-9]{1,2})[-/\.]([0-9]{2,4})|([0-9]{4})[-/]([0-9]{1,2})[-/]([0-9]{1,2})" "$file")
    dates=$(extract_clean_dates "$raw_dates")
  fi

  echo "Dates found:"
  echo "$dates"

  # Normalised dates are yyyy-mm-dd, so string comparison orders them by date.
  latest_date=""
  while IFS= read -r date_str; do
    [[ -z "$date_str" ]] && continue
    
    formatted=$(format_date "$date_str")
    if [ $? -eq 0 ] && [ -n "$formatted" ]; then
      if [ -z "$latest_date" ] || [ "$formatted" \> "$latest_date" ]; then
        latest_date="$formatted"
      fi
    fi
  done <<< "$dates"

  echo "Latest date: $latest_date"

  declare -a matched_keywords=()
  for key in "${keywords[@]}"; do
    # Search the text extracted above; only go back to the PDF if there is none.
    if [ -n "$raw_text" ]; then
      match=$(grep -io -- "$key" <<< "$raw_text" | head -n 1)
    else
      match=$(pdfgrep -i "$key" "$file" | grep -io -- "$key" | head -n 1)
    fi
    
    if [ -n "$match" ]; then
      matched_keywords+=("${match,,}")
    fi
  done

  first_keyword=""
  second_keyword=""
  if [[ ${#matched_keywords[@]} -ge 1 ]]; then
    first_keyword="${matched_keywords[0]}"
  fi
  if [[ ${#matched_keywords[@]} -ge 2 ]]; then
    second_keyword="${matched_keywords[1]}"
  fi

  keyword=""
  if [ -n "$first_keyword" ]; then
    keyword="$first_keyword"
    if [ -n "$second_keyword" ]; then
      keyword="${keyword}_$second_keyword"
    fi
  fi

  echo "Keywords found: $keyword"

  if [ -n "$latest_date" ]; then
    new_filename="${latest_date}"
    if [ -n "$keyword" ]; then
      new_filename="${new_filename}_${keyword}"
    fi
    new_filename=$(echo "$new_filename" | tr '[:upper:]' '[:lower:]' | tr ' ' '_')
    new_filename="${new_filename}.pdf"

    counter=1
    while [ -e "$new_filename" ] && [ "$file" != "$new_filename" ]; do
      new_filename="${latest_date}"
      if [ -n "$keyword" ]; then
        new_filename="${new_filename}_${keyword}_${counter}"
      else
        new_filename="${new_filename}_${counter}"
      fi
      new_filename=$(echo "$new_filename" | tr '[:upper:]' '[:lower:]' | tr ' ' '_')
      new_filename="${new_filename}.pdf"
      counter=$((counter + 1))
    done

    if [ "$file" != "$new_filename" ]; then
      mv "$file" "$new_filename"
      echo "Renamed: $file to $new_filename"
      echo "$file -> $new_filename (date: $latest_date, keywords: $keyword)" >> renaming_log.txt
    else
      echo "File already has the correct name: $file"
    fi
  else
    echo "No valid date found in $file, leaving unchanged"
    echo "$file -> NOT RENAMED (no valid date found, keywords: $keyword)" >> renaming_log.txt
  fi
done