TheCakeIsNaOH's Devember - xBrowserSync to Archivebox and Misc programming

TheCakeIsNaOH · October 15, 2020, 11:21pm

I am working on a script that downloads bookmarks from xBrowserSync and saves them into an ArchiveBox instance.

xBrowserSync (abbreviated xbs) is a browser extension and Android app that is an alternative to the built-in bookmark syncing in Chrome and Firefox. Its advantages are that it can sync between Firefox and Chromium-based browsers, it is open source, and it is easy to self-host. Firefox Sync can be self-hosted too, but it is a bit of a pain to set up.

ArchiveBox is basically a self-hosted archive.org Wayback Machine. It is also open source.

I am using Python, mainly because that is what ArchiveBox is written in. I plan for the script to be run on the machine that hosts the ArchiveBox instance. Secondarily, there is a Python implementation of the compression algorithm that xBrowserSync uses (lzutf8), available here.

The repository is here:
https://github.com/TheCakeIsNaOH/xbs-to-archivebox

TheCakeIsNaOH · October 15, 2020, 11:34pm

Tasks:

Clean up code, figure out ways to improve it
Integrate more closely with archivebox

Completed:

Check if URL and sync ID are valid
Download encrypted bookmark data
Decrypt data with password
Decompress data
Parse the data (it’s in json)
Select which bookmarks to save
- Blacklist selected folders via regex
- Input list via args
Get list of URLs from filtered bookmarks
Run archivebox with said list.
Write up a readme

TheCakeIsNaOH · October 29, 2020, 12:48am

So, I have the downloading, decrypting and decompressing parts done, and it can now output prettified json.

I’m currently trying to wrap my head around how to parse the nested json, and how to filter it to the correct entries. After that, getting the URLs should be really easy.

TheCakeIsNaOH · October 29, 2020, 2:06am

Here is a sample of the json.

[
    {
        "title": "[xbs] Other",
        "children": [
            {
                "title": "xBrowserSync to Archivebox - Saving bookmarks easily",
                "url": "https://forum.level1techs.com/t/xbrowsersync-to-archivebox-saving-bookmarks-easily/162649",
                "description": "I am working on a script that downloads bookmarks from xBrowserSync and saves them into an ArchiveBox instance.  xBrowserSync (abbreviated xbs) is a browser extension/android app that is an alternative to the chrome and firefox built-in bookmark syncing. The advantage of it is that it can sync\u2026",
                "id": 2
            },
            {
                "title": "xBrowserSync - Browser syncing as it should be: secure, anonymous and free!",
                "url": "https://www.xbrowsersync.org/",
                "description": "Free and open source tool for syncing your bookmarks and browser data between your various browsers and devices.",
                "tags": [
                    "anonymous",
                    "bookmarks sync",
                    "browser sync",
                    "open source",
                    "privacy",
                    "xbrowsersync"
                ],
                "id": 3
            }
        ],
        "id": 1
    },
    {
        "title": "[xbs] Toolbar",
        "children": [
            {
                "title": "testA",
                "children": [
                    {
                        "title": "Internet Archive: Digital Library of Free & Borrowable Books, Movies, Music & Wayback Machine",
                        "url": "https://archive.org/",
                        "id": 5
                    }
                ],
                "id": 8
            },
            {
                "title": "TestB",
                "children": [
                    {
                        "title": "ArchiveBox | \ud83d\uddc3 The open source self-hosted web archive. Takes browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more\u2026",
                        "url": "https://archivebox.io/",
                        "id": 4
                    },
                    {
                        "title": "nestedfolder",
                        "children": [
                            {
                                "title": "xBrowserSync to Archivebox - Saving bookmarks easily",
                                "url": "https://forum.level1techs.com/",
                                "description": "I am working on a script that downloads bookmarks from xBrowserSync and saves them into an ArchiveBox instance.  xBrowserSync (abbreviated xbs) is a browser extension/android app that is an alternative to the chrome and firefox built-in bookmark syncing. The advantage of it is that it can sync\u2026",
                                "id": 11
                            },
                            {
                                "title": "Level 1 Techs",
                                "url": "https://store.level1techs.com/",
                                "id": 12
                            }
                        ],
                        "id": 10
                    }
                ],
                "id": 9
            },
            {
                "title": "TheCakeIsNaOH/xbs-to-archivebox",
                "url": "https://github.com/TheCakeIsNaOH/xbs-to-archivebox",
                "description": "Contribute to TheCakeIsNaOH/xbs-to-archivebox development by creating an account on GitHub.",
                "id": 6
            }
        ],
        "id": 7
    }
]
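
For anyone following along, here is a rough sketch of one way to walk a structure like the sample above. It is not the code from the repository: the function name iter_bookmarks and the trimmed-down sample data are made up for illustration. It assumes, as in the sample, that folders carry a "children" list and bookmarks carry a "url".

```python
import json


def iter_bookmarks(nodes, path=()):
    """Yield (folder_path, bookmark) pairs for every entry that has a "url".

    iter_bookmarks is a hypothetical helper name, not part of xbs-to-archivebox.
    """
    for node in nodes:
        if "children" in node:
            # A folder: recurse into it and remember its title as part of the path.
            yield from iter_bookmarks(node["children"], path + (node["title"],))
        elif "url" in node:
            # A bookmark: hand it back together with the folders it sits in.
            yield path, node


# Trimmed-down version of the sample, kept only to make the sketch runnable.
sample = json.loads("""
[
    {"title": "[xbs] Toolbar", "id": 7, "children": [
        {"title": "TestB", "id": 9, "children": [
            {"title": "ArchiveBox", "url": "https://archivebox.io/", "id": 4},
            {"title": "nestedfolder", "id": 10, "children": [
                {"title": "Level 1 Techs", "url": "https://store.level1techs.com/", "id": 12}
            ]}
        ]},
        {"title": "TheCakeIsNaOH/xbs-to-archivebox",
         "url": "https://github.com/TheCakeIsNaOH/xbs-to-archivebox", "id": 6}
    ]}
]
""")

found = list(iter_bookmarks(sample))
for folder_path, bookmark in found:
    print("/".join(folder_path), "->", bookmark["url"])
```

Keeping the folder path alongside each bookmark means a later filtering step can decide based on where a bookmark lives, not just on its own title.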

TheCakeIsNaOH · October 29, 2020, 3:29am

I got the parsing part done. Now I need to figure out how to efficiently filter via regex rather than via an “in” statement. Oh, and take the blacklist as input via args, and return the URLs rather than print them. But those are very doable.
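
As a rough sketch of what those three pieces could look like together (not necessarily what the script ends up doing): the --exclude flag, the function names, and the tiny bookmark tree below are all placeholders I made up for illustration. Folder titles are matched with re.search, so a pattern needs ^ and $ anchors to match a whole title.

```python
import argparse
import re


def parse_args(argv=None):
    """Parse a repeatable --exclude option (hypothetical flag name)."""
    parser = argparse.ArgumentParser(description="Filter bookmark folders by regex")
    parser.add_argument("--exclude", action="append", default=[], metavar="REGEX",
                        help="skip folders whose title matches this regex")
    return parser.parse_args(argv)


def collect_urls(nodes, blacklist):
    """Return URLs as a list, skipping folders whose title matches the blacklist."""
    urls = []
    for node in nodes:
        if "children" in node:
            if any(pattern.search(node["title"]) for pattern in blacklist):
                continue  # blacklisted folder: drop it and everything under it
            urls.extend(collect_urls(node["children"], blacklist))
        elif "url" in node:
            urls.append(node["url"])
    return urls


# Illustrative data only, shaped like the sample json.
bookmarks = [
    {"title": "[xbs] Toolbar", "children": [
        {"title": "testA", "children": [
            {"title": "Internet Archive", "url": "https://archive.org/"}]},
        {"title": "TestB", "children": [
            {"title": "ArchiveBox", "url": "https://archivebox.io/"}]},
        {"title": "repo", "url": "https://github.com/TheCakeIsNaOH/xbs-to-archivebox"},
    ]},
]

# An explicit argv list is passed here so the sketch runs anywhere;
# a real script would call parse_args() with no arguments.
args = parse_args(["--exclude", "^testA$"])
patterns = [re.compile(expr) for expr in args.exclude]
urls = collect_urls(bookmarks, patterns)
print(urls)
```

Compiling the patterns once up front avoids recompiling them for every folder, which is most of the efficiency gain there is to get here.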

If anyone wants to look at my code and point out things I am doing wrong, feel free. This is my first time using Python for anything more than a couple of lines of code, so I am probably missing lots of best practices.

TheCakeIsNaOH · October 29, 2020, 5:23pm

The main functionality is now done.

I’m not importing the URLs directly into ArchiveBox. Instead, I’m writing a file with one URL per line. This file can then be imported into ArchiveBox with something like cat ~/urls.txt | archivebox add.
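
The file-writing step is simple enough to sketch. This is an illustration, not the code from the repository: write_url_file is a made-up name, and dropping duplicate URLs is an extra I added on the assumption that the same page can be bookmarked in more than one folder.

```python
import tempfile
from pathlib import Path


def write_url_file(urls, path):
    """Write one URL per line and return how many lines were written.

    Hypothetical helper; duplicates are dropped while keeping first-seen order.
    """
    unique = list(dict.fromkeys(urls))
    Path(path).write_text("".join(url + "\n" for url in unique), encoding="utf-8")
    return len(unique)


# A temporary directory stands in for something like ~/urls.txt.
with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "urls.txt"
    written = write_url_file(
        ["https://archive.org/", "https://archivebox.io/", "https://archive.org/"],
        out_file,
    )
    contents = out_file.read_text(encoding="utf-8")

print(written)
print(contents, end="")
```

The resulting file is plain text with a trailing newline after each URL, which is the shape the cat ~/urls.txt | archivebox add pipeline expects.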

Next up is looking for bugs and ways to improve the code, along with writing up some documentation.

TheCakeIsNaOH · December 11, 2020, 9:51pm

So I wrote at least a bit of a readme.

It looks like the people over at ArchiveBox are ok with adding more formats to import, and there is a nice PR here showing where stuff goes for a new format. So I’m thinking of adding at least the parsing directly into ArchiveBox.

I created a feature request to see if there is any feedback on my plan, and I haven’t gotten any yet, so I’ll probably go ahead when I have the time and create a PR.

TheCakeIsNaOH · December 11, 2020, 9:55pm

Since the ArchiveBox issue kind of stalled, I have been working on some other programming stuff as well.

I figured out how to add a custom parameter to an Inno Setup script, specifically to improve the support for silent installs for Lively. It involved figuring out some of the syntax for Pascal, and some more of the anatomy of Inno Setup files. PR here:

TheCakeIsNaOH · December 11, 2020, 10:05pm

I have also been working on improving various Chocolatey packages I use. Some changes fix issues, and others are related to “internalizing” the packages (some do both). Internalizing is when the package includes the zip, exe, msi, or whatever inside the .nupkg file instead of downloading it at runtime.

PRs that fix issues:

PRs that internalize packages:

TheCakeIsNaOH · December 12, 2020, 5:29pm

So another thing I made is a python script to count the total words in my calibre ebook library.

All it needs is 1. an install of calibre, which you should already have if you have a calibre library, 2. the word count plugin, and possibly 3. an install of Python, although that is generally already installed as a dependency on Linux.

First, you need to set up the word count plugin, then use it to count the words in your ebooks and store the result in a custom column. Instructions are included in the plugin.

Next, this script can be run:

#!/usr/bin/env calibre-debug

# Run directly if you have calibre-debug installed and on your PATH.
# Otherwise, run calibre-debug and paste in the contents,
# or run: calibre-debug calibre-wordcount.py

import locale

from calibre.library import db

# Change this path to your calibre library directory (the directory, not metadata.db).
library = db('/home/user/Documents/ebooks').new_api
# Paste in a search expression; it can be copied from the calibre GUI search box.
epub_ids = library.search('')
count = 0

for epub_id in epub_ids:
    # Edit '#wc' if the lookup name of your word count column is different.
    # It can also be changed to '#pc' or similar for page count.
    # Books with no value in the column return None, so treat those as 0.
    count += library.field_for('#wc', epub_id) or 0

# locale.format() was removed in newer Python versions; format_string() replaces it.
locale.setlocale(locale.LC_ALL, '')
print(locale.format_string("%d", count, grouping=True))

And it will print the total word count in your calibre library. Or at least the total word count of all of the books that have their word count stored in the custom #wc column.

The calibre db python API is really nice because you can just paste in a search from the calibre GUI and it will only select the books from the search.

I think I have a decent library:

TheCakeIsNaOH · January 3, 2021, 12:22am

So to finish off Devember I did a number of things to my choco-remixer project on github.

Figured out linting via PSScriptAnalyzer, and updated the formatting and applied some of the other linting recommendations.
Added checksum support so the binaries downloaded are checked against the checksum in the install script.
Added support for a bunch more packages.
A gigantic amount of refactoring, some of it (hopefully) following best practices, the rest because I thought the code could be improved. The largest part of the refactoring was moving more code into individual functions.