Nyaa interface for grouping entries and detecting release schedules

HorribleSubs just closed down its operations, and while they will not be missed by everyone, it was a convenient site. (For those of us outside North America, a lot of anime series are region blocked to the point that even English trailers get blocked and you have to watch the Japanese trailer instead.)

Well, back to using Nyaa directly again. The two nice things HorribleSubs had were a schedule showing what aired today and better grouping of releases, so the different video resolutions were grouped together. But that is probably not too hard to do automatically.

So the plan is to create a local webservice providing a simple interface to Nyaa. Nyaa provides RSS feeds, so I can use https://github.com/erengy/anitomy to extract information about group, resolution, episode number, etc. and organize the listing a little better instead of just sorting by date.
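As a taste of what that extraction looks like, here is a rough sketch using anitopy (a Python port of anitomy, which is what I ended up playing with); the filename and the expected values are just illustrative:

import anitopy  # Python port of anitomy

# The filename here is made up, but the parsed fields are the ones I care about
info = anitopy.parse('[SomeGroup] Some Show - 05 (1080p).mkv')
info.get('anime_title')       # e.g. 'Some Show'
info.get('episode_number')    # e.g. '05'
info.get('release_group')     # e.g. 'SomeGroup'
info.get('video_resolution')  # e.g. '1080p'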

The next step would be to try to analyze when episodes are released for a certain show+group and make a sort of “expected” release calendar for those show/group combinations you care about.

It could also be interesting to try to link it together with my AniList account to deemphasize stuff I have already watched, highlight new releases for series I watch, etc.


I have only just started to look into it, so no real progress yet. Anitomy has been ported to a lot of different languages, so there is no need to use C++. While it is my preferred language, it is not suited for quick and dirty coding. Node.js or Go are probably the more obvious choices, but I don’t have experience with them, so I will go with Python for now. Once I have a clearer picture of what I want to do, I might switch to something else.

Okay, XML parsers in Python for the RSS. Searching for quick recommendations for something easy and simple… wait, why are there comments about security here?

Python’s interfaces for processing XML are grouped in the xml package.

Warning

The XML modules are not secure against erroneous or maliciously constructed data. If you need to parse untrusted or unauthenticated data see the XML vulnerabilities and The defusedxml Package sections.

Oh… Seriously, WTF? When is XML NOT untrusted? And the official Python libraries just go “well, we don’t care, use something else if you are scared”? I’m starting to reconsider my choices here.


6 Likes

I’d look into what the external entity expansion is exactly. It should be an optional feature, as I don’t believe XML parsers embedding local data into XMLs you parse is something anybody would expect by default. (I mean, let’s hope it’s a feature anyway; it would still be a security risk as a feature as well.) But maybe you want to attach a file to it… then you might need it. If you do that based on user input and run your server process as root, you’ve got a problem.

Looked it up here https://www.linkedin.com/pulse/xxe-attacks-python-django-applications-jerin-jose?trk=related_artice_XXE%20Attacks%20In%20Python%20%2F%20Django%20Applications_article-card_title

So… not opt-in; disappointing. Looks like you can include anything straight into XMLs by default there. I have never really seen that actually used in an API. Though you can disable it, and you’ll get an exception if somebody adds an entity reference of any kind to the XML. If you need it enabled, it would also require you to send the request back in the response to actually leak any data, so there is that. You could do many things to avoid sending back unintended data. Create separate response objects and map the data you intend to send to them before sending it back, for instance. Which is commonly done regardless for API versioning/consistency/maintenance reasons.


The rest does not (really) seem that severe. It is just stupid XML layouts that can take an awfully long time to parse. I can make a recursive object and tell any language to serialize it. It will either tell me I can’t, or it will serialize until my RAM is full, or something in between the two.


Btw, since you considered node. You’ll see a lot of security problems listed when importing almost anything from npm also. :sweat_smile:

1 Like

Given that Nyaa is a torrent site, and we have rules against linking to pirated material, Staff is requesting that you do not provide any links to Nyaa over the course of this thread. That said, this seems like a great devember project and I look forward to seeing the project progress. :+1:

8 Likes

By reconsidering I was thinking of the fact that a standard library, not a third-party one, had such a nonchalant approach to security. Aren’t official libraries supposed to be high-quality, bug-free implementations that serve as an example to the community of how to make good libraries? The tone continues with http.server:

Warning
http.server is not recommended for production. It only implements basic security checks.

Have some fun, but don’t actually use it for anything? Well it is Python I guess.

A parser which will read your filesystem and download files from the internet just to parse a string is crazy. It seems like it is disabled by default from Python 3.7.1 though, and the defusedxml package provides exactly the same interfaces, so there is little reason not to use it.
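The swap is basically a drop-in replacement. A minimal sketch of the feed parsing (the URL is a placeholder, per the staff request above):

import requests
import defusedxml.ElementTree as ET  # same interface as xml.etree.ElementTree

FEED_URL = 'https://example.com/rss'  # placeholder, not the real feed

def fetch_items(url=FEED_URL):
    # Parse the RSS with the hardened parser and pull out the item titles and links
    root = ET.fromstring(requests.get(url).content)
    return [(item.findtext('title'), item.findtext('link')) for item in root.iter('item')]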

It is not like I’m making a publicly accessible web server, so the attack vector is fairly limited. Using a secure parser, and as long as I make sure to escape everything before inserting it into the HTML shown in my browser, it will probably be fine (famous last words).

RSS feeds contain links, so inserting the data into a link and hoping the user will click on it would probably do the trick.

Thanks for the heads up.

3 Likes

I think you are better off using Python than JavaScript, honestly. As you mentioned before, just do your own data sanitization before sending anything out, and make sure to validate data before using it when receiving from the web.

Since this is a Devember project, I think the focus is to get more people coding and get proofs of concept out there. Once you have it working, or interesting enough, you can go back and add the security and nice-to-have features. I say this as a CISSP who is focused on security first, but as a one-man band and for the sake of self-motivation, I think you are okay in your assumptions, barring Nyaa being attacked by some state actor to get access to your system in particular.

Careful, it’s thinking like that that landed me three quick contracts over the last 6 months, cleaning up worms that came through unauthorized HTTP services some employees set up that weren’t adequately protected. Automated spambots and ransomware are not fun to deal with. :frowning:

1 Like

I have something sort of working now:


And yeah, there are a few issues. Slight differences in naming/punctuation split up some shows, and some episodes have way too many versions, so further grouping might be an idea. And then there is the ep['01', '02'] case, because the episode number is suddenly a list instead of a string in some edge cases. I have barely started this project and I’m already starting to miss static typing.


As suggested, I will not spend too much time on security in the beginning, but I do think it is something worth considering down the line.

2 Likes

Congrats on the quick work.

A suggestion about your grouping could be to use some fuzzy logic. If the title contains 90% of the same words, group it together with the 100% match and then weigh the relevance of the matches in order based on how much of the 90% they match. So Tsukiuta, The Animation would be considered a match to Tsukiuta The Animation 2, but the 2 would be the 100% match. Then you can separate the subs based on the creators/type.

I have been doing some minor fixes, but nothing major to this project. I haven’t worked too much on it as there hasn’t been a pressing need (and I’m terrible at forcing myself to do work). It does the job and I have been using it without major issues.
The major shortcoming I have found is that it doesn’t work well for searching for older stuff. You start getting movies, OVAs, other kinds of specials, batches, and the naming is all over the place. It is difficult to know exactly what the different entries are and group them in a meaningful way automatically. Information about seed count and file size is also a lot more important there; I’m not showing it because it doesn’t matter for weekly releases, and I would need a way to show it without cluttering the interface.

@Mastic_Warrior I tried to start with a simple Levenshtein distance and thought about making my own implementation which could include special distances for ō to ou substitution, etc. But I quickly noted cases where this is not going to work:

  • Is the order a rabbit? Bloom (Gochuumon wa Usagi Desu ka? Bloom)
  • Gochuumon wa Usagi Desu ka BLOOM
  • Gochuumon wa Usagi Desuka Bloom
  • Gochuumon wa Usagi Desu ka S3
  • Gochuumon wa Usagi Desu Ka?? | Is The Order a Rabbit??
  • Gochuumon wa Usagi Desu ka ~Dear My Sister~

English translation together with the Japanese name, sometimes different English translations. There is no way to tell that “Dear My Sister” means a movie, or that “Is the order a rabbit?” is a translation. And the worst part? One question mark means season 1, two question marks mean season 2. Doing a fuzzy match which allows one character to change causes season 1 and 2 to be mixed up.
Some changes could be dealt with, however. Spaces could probably be ignored without issues. The full-width question mark could be normalized. Maybe normalizing the ō would make sense as well.

But as I see it, it is not going to work well without manual substitutions. Which will be fine for weekly series you often search for, but not for those one-time searches for old stuff. Perhaps it is possible to do this more automatically by using the anime databases, but there are no local versions of those, so doing that on the fly would be way too slow.
Mapping titles to actual IDs in anime databases, even in a partial fashion, might be useful however… Hmm.


EDIT: Might be able to use this actually: https://github.com/manami-project/anime-offline-database
It provides IDs to online databases, the show name in different languages, and other meta-data. If I could link names to AniList ids automatically, that would be pretty huge, as it shouldn’t be too hard to grab the list of my currently watched shows. And I might be able to do a way better job at disambiguation if I have a list of real titles to compare against.

1 Like

As a student of the Japanese language, I see where that could fall apart, a lot.

You may be on to something with trying to find a way to provide IDs. You need something that you can center on to make each major item unique. Then you can do a fuzzy match based on how closely the additional items relate. Even then, you will have to battle with Hepburn and Kunrei-shiki romanization, let alone dealing with symbols and characters and a lack of context.

Using the offline database worked very well. Using the ‘synonyms’ it provides, it catches a lot of the variations in the naming. It is far from perfect though. Here is a list of missed lookups from a random request with 70 entries:

# Multiple languages in name:
Moriarty the Patriot (Yuukoku no Moriarty)
Hagure Yuusha no Estetica (Aesthetica of a Rogue Hero)
Talentless Nana (Munou na Nana)
# Season number provided instead of season name:
IDOLiSH7 S2
# Random extra information which was not filtered out:
The Melancholy of Haruhi Suzumiya (Broadcast Order)
Katsute Kami Datta Kemono-Tachi E (To the Abandoned Sacred Beasts) + Extras
# Unicode causing issues:
Healin Good Precure
Healin` Good Precure
Lapis Re։Lights
# Random shorthands for shows:
Hxeros
Taimadou Gakuen

Multiple languages

This is widely used, but it is probably mostly solvable as well. Most of them are in the exact form ‘language1 (language2)’, so all I probably need is a special case for that and to detect if there is an exact match for both languages in the list.

Season information

The season number is the one causing issues. The offline database does provide anime relations, but without any direction information it is unlikely that this could be used to figure out which season it is. And while most don’t use season numbers, some consistently do.

The slightly annoying part here is that it is used inconsistently. The example from the list, IDOLiSH7 S2, is supposed to be IDOLiSH7: Second BEAT!. However, some include the season number unnecessarily, like this: IDOLiSH7: Second BEAT! S2. This makes it hard to tell what the intent is.

However, while writing this, I started to think this might be an issue with anitopy's parsing of the file names, as the season number is separated from the title. I think the ones that include both are like Full title - S02E09 and the other ones are like Base title S2 - 09. I need to check, but if that is the case it could be possible to detect what was intended by the season reference.

And I also realized that perhaps I can guess the season numbers from the offline database anyway. The related anime do not provide any information about how they are related, but each entry includes the TV airing season, such as ‘Fall 2020’. That could at least be used to order them correctly based on airing date, and while there are a bunch of issues with ‘season 0’, specials/movies, etc., I think it is worth a try.
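A rough sketch of what that ordering could look like (assuming the animeSeason field holds a season name and year, only counting TV entries, and ignoring cours, specials and spin-offs entirely):

SEASON_ORDER = {'WINTER': 0, 'SPRING': 1, 'SUMMER': 2, 'FALL': 3}

def guess_season_numbers(related_entries):
    # related_entries: the offline-database entries (dicts) for one franchise
    tv_only = [e for e in related_entries if e.get('type') == 'TV']
    tv_only.sort(key=lambda e: (e.get('animeSeason', {}).get('year') or 0,
                                SEASON_ORDER.get(e.get('animeSeason', {}).get('season'), -1)))
    # Season number is just the position in airing order
    return {e['title']: i + 1 for i, e in enumerate(tv_only)}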

Unicode shenanigans

Ugh, this consistently pops up with strange cases. In most cases it is probably safe to ignore special characters, but new ones keep popping up. The Healin Good Precure is listed as Healin' Good♥Precure. And then there is Lapis Re։Lights, which I was wondering why it failed, until I noticed that is an Armenian full stop, not a colon… How does this even happen?

I remember the Unicode specification having several different types of equality, and I think some of them were meant to handle multiple representations of certain characters. I will have to look that up and see if there is anything that might help with this situation. I never hear anyone talking about these aspects of Unicode though, which is a bit strange.

Short hands and other stuff

I don’t think there is anything to do here that would work reliably, so it is probably safe to ignore. Some of them are batches containing multiple seasons or specials anyway.

Current progress

This screenshot is spliced, as I wanted to showcase some examples here. First, notice the first two entries are actually the same episode but are treated as different anime. That season of Sword Art Online is split into two cours (a cour meaning one TV airing/schedule season), but this is listed differently in different places. Some treat it as one season with 23 episodes, while others, like AniList here, treat it as two seasons with 12 and 11 episodes respectively. The issue is that even from the studio's side it is sometimes ambiguous, and even when the show explicitly displays episode numbers, the fandom does whatever it pleases. Only certain high-profile shows use this cour system, so I’m not sure if there is some way to get this working reliably.

The third show in this list is another misdetection. This is actually Digimon Adventure (2020), which I guess is a parsing issue with anitopy that I need to check, but the real issue here is that I’m hiding the information needed to tell that there is an issue. I’m not showing the full filename, so it is difficult to notice errors, and if I don’t notice them I can’t fix them.

Progress information

Anyway, as you probably noticed, there is a bunch of different colors now. With the AniList id found for each entry, I’m able to correlate it all to my anime list on AniList.
To do this I had to grab the information from AniList. One of the reasons I migrated to AniList was that they have an open API to access their entire database, and rather than something custom they are using GraphQL. It is not something I have ever used before, but I managed to hack something together. The main issue is that I need to grab my entire anime list each time, which is 26 requests with 50 entries each. So I can’t just do this willy-nilly all the time. For now I’m updating it manually, but in the long run it is probably better if this happens automatically once a day or something like that.
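For anyone curious, the paginated query looks roughly like this (simplified to the bare minimum, not my exact code):

import requests

ANILIST_URL = 'https://graphql.anilist.co'
QUERY = '''
query ($userName: String, $page: Int) {
  Page(page: $page, perPage: 50) {
    pageInfo { hasNextPage }
    mediaList(userName: $userName, type: ANIME) {
      status
      progress
      media { id }
    }
  }
}
'''

def fetch_anime_list(user_name):
    # Page through the user's anime list, 50 entries per request
    entries, page, has_next = [], 1, True
    while has_next:
        variables = {'userName': user_name, 'page': page}
        data = requests.post(ANILIST_URL, json={'query': QUERY, 'variables': variables}).json()
        page_data = data['data']['Page']
        entries += page_data['mediaList']
        has_next = page_data['pageInfo']['hasNextPage']
        page += 1
    return entries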

But the end result at least is that I’m changing the background color depending on the progress state in my anime list, and for specific episodes I’m highlighting them if I have not yet seen it (as seen with ‘ep13’ in the screenshot, which is incorrectly highlighted due to the season mismatch). The pictures and colors are great for quickly identifying relevant entries.

Performance issues

Searching through 13.5k anime (each of which has 5-10 synonyms) for every entry does cause a noticeable delay. I’m improving this by first grouping titles like before, only doing a lookup for each group, and then regrouping the result to combine the rest. However, it still takes 2-3 seconds to process a request.

So I need to make an index of some kind to speed this up. The problem is I’m not sure yet what my final approach will be. If I just need an exact match like I do now, it will probably be very easy to hack something together, either a binary search or just a hash map.
For checking if the current title is a substring of any title in the database, I found the Aho-Corasick algorithm, which appears to be a good solution. It doesn’t work in the other direction however. And what if I need something more advanced? I did try the ahocorasick package, and it is fairly simple to use if I end up needing it.
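For the direction mentioned above (is a parsed title a substring of any database title?), the ahocorasick usage would look roughly like this:

import ahocorasick  # the pyahocorasick package

def titles_contained_in_db(parsed_titles, db_titles):
    # Build the automaton over the (few) parsed titles...
    automaton = ahocorasick.Automaton()
    for title in parsed_titles:
        automaton.add_word(title, title)
    automaton.make_automaton()

    # ...then scan every database title and collect the parsed titles occurring inside them
    hits = set()
    for db_title in db_titles:
        for _end, matched in automaton.iter(db_title):
            hits.add(matched)
    return hits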

1 Like

Updates:

  • Unicode normalization
  • Performance optimizations
  • Multiple language detection
  • Grab updated anime-offline-database automatically through Github API
  • Graph anime relations with Dot
  • Finding local files

Unicode normalization

Unicode has four normalization forms: NFD, NFC, NFKD, and NFKC. The K normalization forms are slightly fuzzy and change similar characters into consistent codepoints, which is exactly what I need here. Interestingly, there are also some shenanigans with making strings lowercase, so you are required to perform the normalization twice, like this:

name = unicodedata.normalize('NFKD', unicodedata.normalize('NFKD', name).casefold())  # lowercase and compatibility normalization

Still, some characters are not normalized, so I still need to do some of it manually. I’m also stripping all control characters by looking up the category for each codepoint:

return ''.join(ch for ch in s if unicodedata.category(ch)[0] != 'C')  # drop codepoints in the 'C' (control/other) categories

This is however very slow.

Performance optimizations

The Unicode normalization and control character removal made everything way too slow, so I had to do something about it. I just used a ‘string -> id’ hashmap; it is sufficient and handles my needs for now.
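The lookup itself is basically just a dictionary built once over every title and synonym, something like this (normalize() standing in for the double NFKD/casefold step from above, and assuming the database JSON keeps its entries under a 'data' key):

import json

def build_title_index(db_path='anime-offline-database.json'):
    # Map every normalized title/synonym to its entry index; first match wins
    with open(db_path, encoding='utf-8') as f:
        entries = json.load(f)['data']
    index = {}
    for i, entry in enumerate(entries):
        for title in [entry['title'], *entry.get('synonyms', [])]:
            index.setdefault(normalize(title), i)
    return entries, index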

Multiple language detection

I have set up some regexes to match these patterns:

  • lang1 (lang2)
  • lang1 / lang2
  • lang1 / lang2 (lang3)

I always forget the regex syntax, so thank god for sites like regex101 which make it easy to test and understand what is going on.
All the parts need to match the same anime, and quite a few get missed, either because the pattern is slightly off or because one of the lookups fails. But it does catch a lot of them.
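For the first pattern, the idea is roughly this (lookup() standing in for the hashmap lookup above; both halves have to resolve to the same id):

import re

# 'lang1 (lang2)', e.g. 'Moriarty the Patriot (Yuukoku no Moriarty)'
BOTH_LANGS = re.compile(r'^(?P<lang1>.+?)\s*\((?P<lang2>.+)\)$')

def lookup_multilang(title, lookup):
    m = BOTH_LANGS.match(title)
    if not m:
        return None
    id1, id2 = lookup(m.group('lang1')), lookup(m.group('lang2'))
    # Only accept the match if both languages point to the same anime
    return id1 if id1 is not None and id1 == id2 else None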

Grab updated anime-offline-database automatically through Github API

I found that Github has a simple JSON API, so I have made a small Python script to check the latest release of anime-offline-database and, if a new release is available, download and extract it.

import requests

github_url = 'https://api.github.com/repos/manami-project/anime-offline-database/releases/latest'
release = requests.get(github_url).json()
# writeBytes is a small helper that writes the given bytes to a file path
writeBytes(zip_path, requests.get(release['zipball_url']).content)

Graph anime relations with Dot

I have tried to see if I could use the anime relations to figure out the season information. To show the entries as a graph, I converted them to the Dot format and used an online viewer, GraphvizOnline. If we graph everything connected to Attack on Titan, we get quite a complicated graph:


For the season number, normally only the TV shows matter. Simplifying the graph gives this:

The main problem is that spin-offs are not marked. If we check the website, each relation is marked with a type, such as ‘sequel’ or ‘spin-off’, but this is not available in the offline database. If those types had been there, I would say this approach would be quite useful. Without them, you see examples like this, where Attack on Titan is connected to Ghost in the Shell through some random crossover specials. I’m not sure how common those are, but spin-offs are quite common and they would need to be handled.
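The conversion to Dot itself is tiny. Roughly something like this, assuming the entries for the franchise are keyed by their source URLs and that related entries are listed under a 'relatedAnime' field (I might be misremembering the exact field name):

def relations_to_dot(entries):
    # entries: {source_url: offline-database entry} for one franchise
    lines = ['digraph relations {']
    for entry in entries.values():
        lines.append(f'  "{entry["title"]}";')
        for rel in entry.get('relatedAnime', []):
            if rel in entries:  # only draw edges inside the selected set
                lines.append(f'  "{entry["title"]}" -> "{entries[rel]["title"]}";')
    lines.append('}')
    return '\n'.join(lines)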

Another thing worth trying is to look at the synonyms to see if some of them contain the season number. It seems to be quite common, but there is a lot of garbage in those synonyms as well.

Finding local files

As a small bonus of some of the stuff I tried here, I used the anitopy package to automate cutting out small video sections. I have a long-running project about stitching anime slides called Overmix, and I have been trying to build a small collection of slides for testing, but finding the video file and setting up the FFmpeg command takes time, and it is kinda annoying to interrupt watching an episode for too long to do it.
So I have set up a script to look through all my video files, use anitopy to match against title and episode number, and then copy the FFmpeg command to the clipboard with all the settings I need and an automatically matching output filename. Much more convenient. I’m copying the command instead of running it directly, as I’m splitting on keyframes and sometimes need to adjust the timestamps to get it right, and editing the FFmpeg command is easier.
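The script is roughly this shape (pyperclip for the clipboard; the path and the ffmpeg flags here are simplified placeholders rather than my real settings):

import pathlib
import anitopy
import pyperclip  # cross-platform clipboard access

VIDEO_DIR = pathlib.Path('~/Videos/anime').expanduser()  # placeholder path

def find_episode(title, episode):
    # Parse every filename and return the first one matching title + episode number
    for path in VIDEO_DIR.rglob('*.mkv'):
        info = anitopy.parse(path.name)
        if info.get('anime_title', '').lower() == title.lower() \
                and info.get('episode_number') == episode:
            return path
    return None

def copy_cut_command(title, episode, start, length):
    src = find_episode(title, episode)
    out = f'{title} {episode} {start}.mkv'
    # '-c copy' cuts on keyframes, which is exactly why the timestamps
    # sometimes need manual tweaking before running the command
    pyperclip.copy(f'ffmpeg -ss {start} -t {length} -i "{src}" -c copy "{out}"')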

2 Likes

Wow you have been really busy. Have you heard about pressures to shutdown Nyaa?

Actually, the 1 hour a day goal hasn’t really been working out for me; it has been 16 days of progress, but I have spent perhaps around 7 hours in total, spread over a few days. The code is just barely sufficient to get something working; I have been focusing more on experimenting than on proper engineering.

I heard some rumors about Nyaa two weeks ago, but it is not something I’m worried about. Nyaa already shut down once without warning, and two Nyaa alternatives were up within a week (NyaaPantsu and the current Nyaa). The source code for both the new Nyaa and NyaaPantsu is on Github, so it should be even faster to get a replacement up and running. And even though the old Nyaa tracker was gone and only magnet links were/are available from the old Nyaa, it worked surprisingly well anyway.
Most of the code I have is not specific to Nyaa; it mainly depends on anitopy and anime-offline-database, so it should be easy to get it working with a new site if needed.

I remember NyaaPantsu and NyaPaantsuCat after the first shutdown. That is when I just stopped using torrents and started using VPNs to get to hosted content. I was also able to get better internet too.

Progress has been very slow the last two weeks; I have finally switched back to Arch Linux from Windows, and you never run out of work with Arch Linux, especially in the beginning…


I realized that there was an issue with the anime-offline-database update script: while the database is updated once or twice a week, the latest release is actually from back in August. It is just a single file in the repository, so I just need to check the directory contents instead.
I made a random attempt at changing the API URL to see if it worked, and I got this:

{
  "message": "Not Found",
  "documentation_url": "https://docs.github.com/rest/reference/git#get-a-tree"
}

It actually gives a link to the documentation if you make an invalid request. Genius!
So I apparently need to request this URL: https://api.github.com/repos/manami-project/anime-offline-database/contents and then I get a list of files like this:

  {
    "name": "anime-offline-database.json",
    "path": "anime-offline-database.json",
    "sha": "1798aba3c689aafdc4e28fe400ac79cdd3ed27e0",
    "size": 33019866,
    "url": "https://api.github.com/repos/manami-project/anime-offline-database/contents/anime-offline-database.json?ref=master",
    "html_url": "https://github.com/manami-project/anime-offline-database/blob/master/anime-offline-database.json",
    "git_url": "https://api.github.com/repos/manami-project/anime-offline-database/git/blobs/1798aba3c689aafdc4e28fe400ac79cdd3ed27e0",
    "download_url": "https://raw.githubusercontent.com/manami-project/anime-offline-database/master/anime-offline-database.json",
    "type": "file",
    "_links": {
      "self": "https://api.github.com/repos/manami-project/anime-offline-database/contents/anime-offline-database.json?ref=master",
      "git": "https://api.github.com/repos/manami-project/anime-offline-database/git/blobs/1798aba3c689aafdc4e28fe400ac79cdd3ed27e0",
      "html": "https://github.com/manami-project/anime-offline-database/blob/master/anime-offline-database.json"
    }
  },

As you might notice, the download_url I want is actually static, so I don’t technically need to do this, but doing it this way I can check if the SHA has changed before downloading a 30 MB file.
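So the check ends up being something like this (the helper names and paths are mine):

import requests

CONTENTS_URL = 'https://api.github.com/repos/manami-project/anime-offline-database/contents'

def update_database(json_path='anime-offline-database.json', sha_path='db.sha'):
    # Find the database entry in the repository file listing
    files = requests.get(CONTENTS_URL).json()
    entry = next(f for f in files if f['name'] == 'anime-offline-database.json')

    # Only download the 30 MB file if the blob SHA changed since last time
    try:
        old_sha = open(sha_path).read()
    except FileNotFoundError:
        old_sha = ''
    if entry['sha'] != old_sha:
        with open(json_path, 'wb') as f:
            f.write(requests.get(entry['download_url']).content)
        with open(sha_path, 'w') as f:
            f.write(entry['sha'])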

One of the reasons I have gotten a lot done in a very short amount of work/time is that I have kept it really simple. The web server only handles a single page with a search argument, I’m using a bunch of JSON files instead of a database, and these JSON files are created using separate scripts/programs instead of some admin web interface.
One annoying consequence of this is that I need to restart the server if I update these JSON files, say when updating the anime database. Loading and parsing 30 MB of JSON for each request would be a bit much.
So I added a small cache mechanism which checks the last-modified and size properties of the JSON file and reloads it if it has changed. Works great, and the webserver is still only able to read from hardcoded filepaths on my computer.
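The cache is just a few lines; the idea is roughly:

import json, os

class CachedJson:
    # Reload a JSON file only when its modification time or size changes
    def __init__(self, path):
        self.path = path
        self.stamp = None
        self.data = None

    def get(self):
        st = os.stat(self.path)
        stamp = (st.st_mtime, st.st_size)
        if stamp != self.stamp:
            with open(self.path, encoding='utf-8') as f:
                self.data = json.load(f)
            self.stamp = stamp
        return self.data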

My next goal is to get some schedule detection done. Again, the plan is to create a separate CLI program that analyses the releases and creates/updates a calendar schedule in a JSON file the webserver can display (if I even want that; perhaps I would rather have it add entries to a Google calendar instead).

1 Like

Thanks for the breakdown of how and why. I hope this helps anyone that is trying to learn web services and backend development.

So I got started on the schedule detection experiments.

I started with one sample and saved the request to avoid spamming the server with a new request on each test run of the program. Not because it probably makes a difference, but I prefer to remain a little bit stealthy with these kinds of suspiciously repetitive requests. (I later discovered I had forgotten to disable the HTTP request code, so it did it anyway.)

So for one show, from a specific uploader, at 1080p, these times were reported (all in UTC):

2020-10-06  03:03:38
2020-10-09  16:09:00
2020-10-16  16:06:48
2020-10-23  16:06:31
2020-10-30  16:06:39
2020-11-06  17:06:28
2020-11-13  17:06:34
2020-11-20  17:06:44
2020-11-27  17:06:37
2020-12-04  17:06:53
2020-12-11  17:06:29

Other than episode 1 being a few days late, they are all very close to exactly one week apart. The late episode 1 is probably caused by the HorribleSubs transition, but nevertheless, these kinds of outliers need to be ignored. I’m thinking I need to test whether the median of these releases actually matches a weekly schedule and warn if something is off, and then try to remove the outliers by calculating the standard deviation to use as the basis for a threshold.
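One way that filtering could look, just to sketch the idea (untested, and the threshold is arbitrary):

import numpy as np

def filter_outliers(episodes, times, period=7 * 24 * 3600):
    # times as Unix timestamps; drop releases far from the typical weekly offset
    episodes = np.asarray(episodes, dtype=float)
    times = np.asarray(times, dtype=float)
    offsets = times - episodes * period             # position relative to a perfect weekly schedule
    deviation = offsets - np.median(offsets)
    threshold = max(3 * np.std(deviation), 3600.0)  # allow at least an hour of slack
    keep = np.abs(deviation) < threshold
    return episodes[keep], times[keep]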

However that one hour shift in the middle bugged me. It finally clicked when I converted the times to my own timezone:

2020-10-06  03:03:38  +00:00  -  2020-10-06  05:03:38  +02:00
2020-10-09  16:09:00  +00:00  -  2020-10-09  18:09:00  +02:00
2020-10-16  16:06:48  +00:00  -  2020-10-16  18:06:48  +02:00
2020-10-23  16:06:31  +00:00  -  2020-10-23  18:06:31  +02:00
2020-10-30  16:06:39  +00:00  -  2020-10-30  17:06:39  +01:00
2020-11-06  17:06:28  +00:00  -  2020-11-06  18:06:28  +01:00
2020-11-13  17:06:34  +00:00  -  2020-11-13  18:06:34  +01:00
2020-11-20  17:06:44  +00:00  -  2020-11-20  18:06:44  +01:00
2020-11-27  17:06:37  +00:00  -  2020-11-27  18:06:37  +01:00
2020-12-04  17:06:53  +00:00  -  2020-12-04  18:06:53  +01:00
2020-12-11  17:06:29  +00:00  -  2020-12-11  18:06:29  +01:00

Now only one episode is off by one hour, randomly in the middle, and the timezone also changes in the middle of it all. Of course time is not that simple… It is Daylight Saving Time (DST).

Looking into it, Japan does not use DST anymore. There were talks about introducing it for the Olympics, but that was abandoned. So the shows should air at a consistent interval in Japan. Apparently, then, the English stream release is not tied to the Japanese release, but is done in a local American time slot. And in the US, DST ended on 11-01 while where I live it ended on 10-25, explaining why the episode on 10-30 is off: it falls in the mismatch between those two periods.

To get a good estimate, I need to take this into account, as math doesn’t care about humanity’s messy timekeeping, and it wasn’t quite clear how to do this properly with the datetime API in Python. I tried switching to the Arrow package, which promises to simplify and fix the datetime API, but there still wasn’t an obvious way to handle this. In the end I resorted to just offsetting the times by the DST offset and then applying the reverse offset afterwards.
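With zoneinfo the offsetting boils down to something like this (assuming the releases follow a US time slot; release_times_utc is a stand-in name for a list of timezone-aware UTC datetimes):

from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+

RELEASE_TZ = ZoneInfo('America/New_York')  # assumption: the uploads follow a US time slot

def dst_offset(dt_utc: datetime):
    # DST offset in effect at this instant in the assumed release timezone
    return dt_utc.astimezone(RELEASE_TZ).dst()

# Shift DST-period releases onto the non-DST slot before fitting,
# and subtract the offset again when turning predictions back into real times
normalized = [t + dst_offset(t) for t in release_times_utc]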

For extrapolating the times, I used good old linear regression, which is very straightforward using sklearn (sadly, all the packages implementing linear regression seem to be huge machine learning frameworks, but whatever):

import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array(episodes).reshape((-1, 1))  # episode numbers
y = np.array(times)                      # release times as numeric values (e.g. timestamps)

model = LinearRegression()
model.fit(x, y)

def predict(x):
	in_x = np.array([x]).reshape((-1, 1))
	return model.predict(in_x)[0]

And if I ignore the first episode I get a nice schedule like this:

Episode 1 at 2020-10-02 17:07:34
Episode 2 at 2020-10-09 17:07:27
Episode 3 at 2020-10-16 17:07:19
Episode 4 at 2020-10-23 17:07:11
Episode 5 at 2020-10-30 17:07:03
Episode 6 at 2020-11-06 17:06:56
Episode 7 at 2020-11-13 17:06:48
Episode 8 at 2020-11-20 17:06:40
Episode 9 at 2020-11-27 17:06:33
Episode 10 at 2020-12-04 17:06:25
Episode 11 at 2020-12-11 17:06:17
Episode 12 at 2020-12-18 17:06:09

(This is without applying the DST offset back.) If I ignore the second episode as well, all the times become 17:06:38, because these uploads are surprisingly consistent, all being within half a minute of each other. Even considering it is a bot uploading these, I would still have expected the differences to be much larger.

So the basic idea works; now I just need to get it to work reliably for all kinds of uploads, not just the single one I have been testing on. I expect some series to skip a week due to recap episodes, etc., so those cases need to be handled as well. The devil is in the details, after all.

I also need to work on the issue with anime season numbers. I didn’t manage to get a reply from the anime-offline-database maintainer, so I will need to handle it in some other way. I don’t like the idea of having to crawl the entire site, so instead I will try to see if I can extract hints from the synonyms for each anime, which sometimes contain the season number. It will likely be a huge pain though.

1 Like

I have been improving the release schedule detection a bit: I now have automatic removal of outliers and detection of weeks skipped due to specials, etc. I have also improved the evaluation methods for seeing how good the estimation is, along with more understandable plotting for a visual representation of the fit. I will probably use the numeric evaluation to test the effect of correcting for DST.
It would be great to have this ready for next season, so I will need to improve it some more and add some special handling for when only one release is out, etc. And then some way to import it into a calendar.

I have been up to some other things not really related to the devember project.

3D printing

I have bought myself a 3D printer. I haven’t had time to do much with it yet, but I tried printing out the top part of a dumped character model from the game Trails of Cold Steel:


The first test did not have any hair, since I had to fix that manually in Blender, which was a huge pain. It is an old game, so the front hair did not have any polygons on the back side.
Printing still appears to be somewhat of a black art, but the result is certainly better than I expected at this scale. (Each grid square in the image is 1 cm.)

Depth estimation using focus stacking

On to the more interesting side project. I have been wondering if I could use focus stacking to estimate a depth channel for a photograph.
Focus stacking is a photography technique where you take several images with a shallow depth of field (i.e. a lot of the image is out of focus) at different focus points and then combine the images. This is usually done to improve image sharpness and is especially common in macro photography.

So I spent some time with Entangle, which is an open source application for controlling your camera over USB, but I could not find any way to control the focus on my Sony Alpha 6300. I wanted to make a script to automate the process (using gphoto2, which is the underlying library/CLI), but it seems this is either not implemented or not supported for my camera model.

So I had to do this manually, moving the focus ring each time. Even at ISO 100 the images were rather noisy, so I had to redo it, and this time I took 5 images for each focus position to average out the noise. So I took 5 x 8 images in total.

For processing the images I used the codebase from my Overmix project (C++, not Python), which is a simple panorama application for stitching anime screenshots. Mainly because it has the infrastructure to handle processing a group of images, so I could quickly add a new render for focus stacking and the required UI for it.

My initial approach was to use edge detection to find the areas in focus in each image. (The areas in focus have sharper edges and should therefore give a higher response.) Then for each pixel I pick the image which has the strongest edge response.
This however gave a rather noisy image, so I added a 30x30 px filter which picks the image with the highest edge response inside that 30x30 px region. The result looks like this:


The left result is the unfiltered best-image ID, the middle is the highest edge response from any image, and the right one is the filtered result. And it does kinda work: the bright areas are close and the darker ones are farther back. You can especially see it at the bottom of the image, where there is an obvious gradient.
The stacked image looks like this:

For such a simple implementation, it actually works rather well. (I spent a full day photographing and implementing the code.) There are obvious issues in areas where there is a sharp transition from one focus distance to another:
issues
This is due to the very simple filtering. But disregarding that, everything is in focus. (The background is kinda iffy in places though.)
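For illustration, the selection idea boils down to something like this in numpy (the real implementation is C++ inside Overmix; this is just the gist):

import numpy as np
from scipy import ndimage

def focus_stack(images):
    # images: list of aligned grayscale float arrays of the same shape
    # Edge response per image: absolute Laplacian, box-filtered so the choice
    # is based on a whole neighbourhood instead of single noisy pixels
    responses = [ndimage.uniform_filter(np.abs(ndimage.laplace(img)), size=30)
                 for img in images]
    best = np.argmax(np.stack(responses), axis=0)  # rough depth / selection map
    stacked = np.choose(best, np.stack(images))    # take each pixel from its best image
    return stacked, best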

My depth map however isn’t nearly as nice as I was hoping it would be. The issue is that there are a lot of flat areas in the image where you actually can’t tell if something is in focus, so there is no information to do depth estimation with either. You could probably do some interpolation from known areas to fill these areas out, but I was hoping for something much more detailed.

I was ready to call this a bust, but then I tried some old images I took of this same model when I first tried focus stacking. These were taken with a Hasselblad X1D Mark II (a much more expensive camera), and since you can control the focus from the software, I have 17 focus steps instead of 8. And a single image from the Hasselblad is less noisy than the 5 averaged images from my Sony. I should just have used the Hasselblad from the start (I have one from work at home due to the pandemic).
This is the depth map I got from that:


The background is very noisy because I intentionally didn’t take any photos where the background was in focus back then, but the rest actually looks very promising. Here is the combined image for reference:

The stacking actually worked very well here. I had used Adobe Photoshop to do the focus stacking back then, and I actually think my result looks better:

My result is on the left and Photoshop's on the right. Photoshop left several areas out of focus, and while it also has these ugly transitions between different images, it does have a lot fewer of them.

The code is available as part of Overmix and this is the relevant commit/code:

Now I need to get back to the devember project and actually try to get the schedule feature functional before the end of the year…


2 Likes

Nice work. Just be careful that Adobe does not come knocking at your door requesting royalties.