Using the offline database worked very well. With the "synonyms" it provides, it catches a lot of the variations in naming. It is far from perfect, though. Here is a list of missed lookups from a random request with 70 entries:
# Multiple languages in name:
Moriarty the Patriot (Yuukoku no Moriarty)
Hagure Yuusha no Estetica (Aesthetica of a Rogue Hero)
Talentless Nana (Munou na Nana)
# Season number provided instead of season name:
IDOLiSH7 S2
# Random extra information which was not filtered out:
The Melancholy of Haruhi Suzumiya (Broadcast Order)
Katsute Kami Datta Kemono-Tachi E (To the Abandoned Sacred Beasts) + Extras
# Unicode causing issues:
Healin Good Precure
Healin` Good Precure
Lapis Re։Lights
# Random shorthands for shows:
Hxeros
Taimadou Gakuen
Multiple languages
This is widely used, but it is probably mostly solvable as well. Most of them are in the exact form "language1 (language2)", so all I probably need is a special case for that which detects whether there is an exact match for both languages in the list.
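A minimal sketch of that special case, assuming the known titles and synonyms have already been collected into a plain dict from name to anime id. The names `PAIR`, `lookup_bilingual`, and `index`, as well as the ids, are made up for illustration:

```python
import re

# Matches "language1 (language2)": some text, then one trailing parenthesized part.
PAIR = re.compile(r"^(?P<first>.+?)\s*\((?P<second>[^()]+)\)\s*$")

def lookup_bilingual(title, index):
    """Return the anime id for `title`, or None if it cannot be resolved.

    `index` is a hypothetical dict mapping every known title/synonym to an id.
    """
    match = PAIR.match(title)
    if match is None:
        return index.get(title)
    first = index.get(match["first"])
    second = index.get(match["second"])
    # Only accept the split when both halves point at the same entry.
    if first is not None and first == second:
        return first
    return None

# Made-up ids, purely for demonstration.
index = {
    "Moriarty the Patriot": 101,
    "Yuukoku no Moriarty": 101,
    "The Melancholy of Haruhi Suzumiya": 202,
}
print(lookup_bilingual("Moriarty the Patriot (Yuukoku no Moriarty)", index))
```

Requiring both halves to agree keeps a title like "The Melancholy of Haruhi Suzumiya (Broadcast Order)" from being accepted by accident, since the part in parentheses is not a known name.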
Season information
The season number is the one causing issues. The offline database does provide anime relations, but without any direction on them it is unlikely that they could be used to figure out which season it is. And while most don't use season numbers, some consistently do.
The slightly annoying part here is that it is used inconsistently. The example from the list, "IDOLiSH7 S2", is supposed to be "IDOLiSH7: Second BEAT!". However, some include the season number unnecessarily, like this: "IDOLiSH7: Second BEAT! S2". This makes it hard to tell what the intent is.
However, while writing this, I started to think this might be an issue with anitopy's parsing of the file names, as the season number is separated from the title. I think the ones that include both are like "Full title - S02E09" and the other ones are like "Base title S2 - 09". I need to check, but if it is like that, it could be possible to detect what was intended by the season reference.
And I also realized that perhaps I can guess the season numbers from the offline database anyway. The related anime list does not say how the entries are related, but each entry includes the TV airing season, such as "Fall 2020". That could at least be used to order them correctly by airing date, and while there are a bunch of issues with "season 0", specials/movies, etc., I think it is worth a try.
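The ordering idea could be sketched roughly like this. The `Entry` fields, the season names, and the `kind` values are assumptions about how the data might be represented, not the offline database's actual schema, and the titles are placeholders:

```python
from dataclasses import dataclass

# Order of airing seasons within one year (assumed naming).
SEASON_ORDER = {"WINTER": 0, "SPRING": 1, "SUMMER": 2, "FALL": 3}

@dataclass
class Entry:
    title: str
    season: str      # e.g. "FALL"
    year: int        # e.g. 2020
    kind: str = "TV"  # hypothetical type field: "TV", "MOVIE", "SPECIAL", ...

def guess_season(entries, number):
    """Return the `number`-th TV entry by airing date, or None if out of range."""
    # Skip movies/specials so they do not shift the season count.
    tv = [e for e in entries if e.kind == "TV"]
    tv.sort(key=lambda e: (e.year, SEASON_ORDER[e.season]))
    if 1 <= number <= len(tv):
        return tv[number - 1]
    return None

related = [
    Entry("Example Show: Third", "SUMMER", 2021),
    Entry("Example Show", "WINTER", 2018),
    Entry("Example Show: Movie", "FALL", 2019, kind="MOVIE"),
    Entry("Example Show: Second", "SPRING", 2020),
]
print(guess_season(related, 2).title)  # Example Show: Second
```

This deliberately ignores the hard cases mentioned above: split cours, "season 0", and related entries that are spin-offs rather than sequels would all need extra handling.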
Unicode shenanigans
Ugh, this consistently pops up with strange cases. In most cases it is probably safe to ignore special characters, but new ones keep popping up. "Healin Good Precure" is listed as "Healin' Good♡Precure". And then there is "Lapis Re։Lights", which I was wondering why it failed, but that is an Armenian full stop, not a colon... How does this even happen?
I remember the Unicode specification having several different types of equality, and I think some of it was to handle multiple representations of some characters. I will have to look that up and see if there is anything that might be able to help with this situation. I never hear anyone talking about these aspects of Unicode though, which is a bit strange.
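For reference, the Unicode feature in question is normalization (the NFC/NFD/NFKC/NFKD forms), which Python exposes through `unicodedata.normalize`. Normalization alone would not help with the Armenian full stop, since it is a distinct character with no decomposition to a colon. Here is a rough sketch of a lossy comparison key that combines NFKC with simply dropping everything that is not a letter or digit. The function name `match_key` is made up:

```python
import unicodedata

def match_key(title):
    """Build a lossy key for comparing titles.

    NFKC folds compatibility variants (e.g. fullwidth letters) into their
    plain forms, casefold removes case differences, and the final filter
    drops spaces, punctuation, and symbols entirely.
    """
    normalized = unicodedata.normalize("NFKC", title).casefold()
    return "".join(ch for ch in normalized if ch.isalnum())

# "\u2661" is the heart symbol, "\u0589" is the Armenian full stop.
print(match_key("Healin` Good Precure"))        # healingoodprecure
print(match_key("Healin' Good\u2661Precure"))   # healingoodprecure
print(match_key("Lapis Re\u0589Lights") == match_key("Lapis Re:Lights"))  # True
```

The obvious downside is that titles which differ only in punctuation collapse into the same key, so this is better suited as a fallback after an exact match has failed.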
Shorthands and other stuff
I don't think there is anything to do here that would work reliably, so it is probably safe to ignore. Some of them are batches containing multiple seasons or specials anyway.
Current progress
This screenshot is spliced, as I wanted to showcase some examples here. First, notice that the first two entries are actually the same episode but are treated as different anime. That season of Sword Art Online is split into two cours (a cour being one TV airing/schedule season), but lists handle this differently. Some treat it as one season with 23 episodes; others, like AniList here, treat it as two seasons with 12 and 11 episodes respectively. The issue is that it is sometimes ambiguous even from the studio's side, and even when the show explicitly displays episode numbers, the fandom does whatever it pleases. Only certain high-profile shows use this cour system, so I'm not sure if there is some way to get this working reliably.
The third show in this list is another misdetection. This is actually "Digimon Adventure (2020)", which I guess is a parsing issue with anitopy that I need to check, but the real issue here is that I'm hiding the information needed to tell that there is an issue. I'm not showing the full filename, so it is difficult to notice errors, and if I don't notice them I can't fix them.
Progress information
Anyway, as you probably noticed, there are a bunch of different colors now. With the AniList id found for each entry, I'm able to correlate it all to my anime list on AniList.
To do this I had to grab the information from AniList. One of the reasons I migrated to AniList was that they have an open API for accessing their entire database, and rather than something custom they use GraphQL. It is not something I have ever used before, but I managed to hack something together. The main issue is that I need to grab my entire anime list each time, which is 26 requests with 50 entries each. So I can't just do this willy-nilly all the time. For now I'm updating it manually, but in the long run it is probably better if this happens automatically once a day or something like that.
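One way the "once a day" refresh could look is a small file cache in front of the paged download. This is only a sketch: `fetch_page` stands in for whatever function performs a single GraphQL request, and its signature, the cache file layout, and the helper names are all assumptions rather than anything from the AniList API:

```python
import json
import time
from pathlib import Path

def fetch_all_pages(fetch_page, per_page=50):
    """Call the hypothetical fetch_page(page, per_page) until a short page arrives."""
    entries, page = [], 1
    while True:
        batch = fetch_page(page, per_page)
        entries.extend(batch)
        if len(batch) < per_page:  # a short (or empty) page means we reached the end
            return entries
        page += 1

def load_anime_list(cache_path, fetch_all, max_age=24 * 60 * 60, now=None):
    """Return the cached list if it is younger than max_age seconds, else refetch."""
    now = time.time() if now is None else now
    path = Path(cache_path)
    if path.exists():
        cached = json.loads(path.read_text(encoding="utf-8"))
        if now - cached["fetched_at"] < max_age:
            return cached["entries"]
    entries = fetch_all()
    path.write_text(
        json.dumps({"fetched_at": now, "entries": entries}), encoding="utf-8"
    )
    return entries
```

With this split, the expensive part only runs when the cache is stale, for example `load_anime_list("list.json", lambda: fetch_all_pages(my_fetch_page))`, where `my_fetch_page` is the real request function.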
But the end result, at least, is that I'm changing the background color depending on the progress state in my anime list, and I'm highlighting specific episodes if I have not yet seen them (as seen with "ep13" in the screenshot, which is incorrectly highlighted due to the season mismatch). The pictures and colors are great for quickly identifying relevant entries.
Performance issues
Searching through 13.5k anime (each of which has 5-10 synonyms) for each entry does cause a noticeable delay. I'm improving this by first grouping titles like before, doing only one lookup per group, and then regrouping the results to combine the rest. However, it still takes 2-3 seconds to process a request.
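The group, lookup, regroup step might look something like the sketch below. `lookup` is a placeholder for the actual database search, and the function name and return shape are assumptions for illustration:

```python
from collections import defaultdict

def group_and_lookup(titles, lookup):
    """Look up each distinct title once, then regroup positions by anime id.

    Returns a dict mapping anime id -> positions in `titles` that resolved to it.
    """
    # First grouping: identical titles share a single lookup.
    by_title = defaultdict(list)
    for position, title in enumerate(titles):
        by_title[title].append(position)

    # Second grouping: different spellings that resolve to the same id are merged.
    by_anime = defaultdict(list)
    for title, positions in by_title.items():
        anime_id = lookup(title)
        by_anime[anime_id].extend(positions)
    return dict(by_anime)
```

Titles that fail the lookup all end up under whatever `lookup` returns for a miss (for example `None`), which makes them easy to list separately.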
So I need to make an index of some kind to speed this up. The problem is that I'm not sure yet what my final approach will be. If I just need an exact match like I do now, it will probably be very easy to hack something together, using either a binary search or just a hash map.
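For the exact-match case, the hash map variant could be as small as the sketch below. It assumes the database has been reduced to a dict from anime id to its list of names (title plus synonyms). The case folding is an extra assumption on top of "exact match", and all names and ids here are made up:

```python
def build_index(database):
    """Map every casefolded title/synonym to the set of anime ids using it."""
    index = {}
    for anime_id, names in database.items():
        for name in names:
            # A set, because different entries can share a synonym.
            index.setdefault(name.casefold(), set()).add(anime_id)
    return index

def lookup(index, title):
    """Return the set of ids whose names match `title` (empty set on a miss)."""
    return index.get(title.casefold(), set())

# Made-up ids for demonstration.
database = {
    101: ["Talentless Nana", "Munou na Nana"],
    202: ["IDOLiSH7"],
}
index = build_index(database)
print(lookup(index, "munou na nana"))  # {101}
```

Building the index is a one-time pass over all names, after which each lookup is a single dict access instead of a scan over the whole database.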
For checking if the current title is a substring of any title in the database, I found the Aho-Corasick algorithm, which appears to be a good solution. It doesn't work in the other direction, however. And what if I need something more advanced? I did try the ahocorasick package, though, and it is fairly simple to use if I need it.
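To illustrate what the algorithm does, here is a small pure-Python sketch of Aho-Corasick. It is not the ahocorasick package's API, and the function names are made up. It builds one automaton from a set of patterns and then reports every pattern that occurs as a substring of a text in a single pass over that text:

```python
from collections import deque

def build_automaton(patterns):
    """Build trie transitions, failure links, and output lists for the patterns."""
    goto, fail, out = [{}], [0], [[]]
    for pattern in patterns:
        node = 0
        for ch in pattern:
            nxt = goto[node].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[node][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append([])
            node = nxt
        out[node].append(pattern)
    # Breadth-first pass: each failure link points at the longest proper
    # suffix of the current path that is also a prefix in the trie.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit patterns that end at the failure target.
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out

def find_all(automaton, text):
    """Return (start_index, pattern) for every pattern occurrence in text."""
    goto, fail, out = automaton
    node, found = 0, []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for pattern in out[node]:
            found.append((i - len(pattern) + 1, pattern))
    return found

automaton = build_automaton(["sword art online", "art", "online"])
print(find_all(automaton, "sword art online alicization"))
```

The automaton is built once and can be reused for every text, which is where the speedup over checking each pattern separately comes from.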