I have a small VPS with Gitea on it. I have a few mirrors of various projects like KWin, the Linux kernel, and so on.
I want some repositories to be accessible on the open internet because I want to be able to quickly make a branch with some patches for someone.
For the past 4-5 days, Amazon has been trying its best to take down my VPS.
What should I do?
I’m using the Caddy web server as a reverse proxy for my services and I’d like to keep it, because it takes care of all of the SSL and security circus for me; however, it doesn’t have a rate limiting feature.
So far I stopped the attack with iptables:
sudo iptables -A INPUT -s 57.141.0.0/24 -j DROP
But the bots might be back soon from another subnet.
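If they do come back from another range, one option (a rough sketch I haven’t battle-tested; the set name and the extra range are placeholders) would be to keep the blocked subnets in an ipset so there’s only one iptables rule to manage:

# one set holds all the blocked networks
sudo ipset create crawler-block hash:net
sudo ipset add crawler-block 57.141.0.0/24
# a single rule drops anything in the set
sudo iptables -I INPUT -m set --match-set crawler-block src -j DROP
# when they show up from a new range, just extend the set:
# sudo ipset add crawler-block <new-subnet>/24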
Well, it’s a VPS hosted at OVH, so I don’t think there is anything I can do.
Amazon itself is effectively running this “attack”. To be more specific, it’s their horribly written crawler going haywire on the kwin_x11 repository, causing Gitea to run git blame, git diff, and so on across basically all commits and branches in that repository.
clownflare
I don’t think I’m going to touch their services with a 10-meter pole after they took down a parody site, clownstrike (a reference to the bad quality of CrowdStrike’s software).
It is because they are not rate limiting themselves to something reasonable; the CPU was pinned at 100% and everything hosted on that machine was horribly unresponsive until I banned that entire IP range in iptables.
Hm, correct. I guess I got triggered by seeing “02:28:01.353741 IP 44-217-177-142.crawl.amazonbot.amazon.5088” in the logs all the time and just assumed not all of the IPs were correctly reverse resolved.
You said it was git operations, right? Well, what version of git were you using? Certain versions had performance issues, which is why specific versions of git are recommended. Might want to look into that. And, as others have said, the robots.txt file.
Yeah, I experienced something very similar on two of our websites last year: all of the major crawlers came to visit at the same time even though nothing had changed on our sites, which were all already indexed. The sites were effectively down for half a day until I banned a dozen /10s, /9s and /8s… insane.
I’m trying to help local businesses fix things like this but they’re stubborn and/or locked by contract to some dev who won’t fix the slowdowns and broken HTTPS certificates.
I even have a local Toyota dealership with a 404 on their homepage. They didn’t get back to me.
It is because they are not rate limiting themselves to something reasonable; the CPU was pinned at 100% and everything hosted on that machine was horribly unresponsive until I banned that entire IP range in iptables.
And that is still not DDoS. That’s absolutely normal on the public internet, and you are being tripped up because you have failed to read up on and set up the standard mitigations on your side of the fence. These techniques have been in place for more than a decade.
There are badly behaved scrapers and crawlers out there, but DoS and DDoS are an entirely different level of “fuck you in particular”.
set up a valid robots.txt file <— bare minimum; if it’s not up, you have no grounds to complain about being crawled at all (see the sketch after this list)
set up access logs and keep historical statistics
set up basic rate limiting to keep things sane; 1 req/s per IP is a sane default for private instances
look up OVH DDoS/bot protection services and pricing
consider using Cloudflare’s free tier protection services
There is no DIY solution that will protect you from a real DDoS; you need to proxy your ingress via Cloudflare or your ISP-provided services for that.
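For the robots.txt step, a minimal sketch (the file path and the user-agent names are just examples; check the Gitea docs for where your version picks up a custom robots.txt, and your access logs for the real agent strings):

# exact path depends on your Gitea version/layout (older ones use custom/robots.txt)
cat > /var/lib/gitea/custom/public/robots.txt <<'EOF'
User-agent: Amazonbot
Disallow: /

User-agent: *
# wildcard paths are a de-facto extension, but the big crawlers honour it;
# the patterns assume the usual /owner/repo/... Gitea URLs
Disallow: /*/*/blame/
Disallow: /*/*/commits/
EOF

Well-behaved crawlers will back off from that; the bad ones are what the rate limiting and firewall are for.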
Why are you being passive-aggressive towards me? I think it is common knowledge that if you are scraping and you notice that the server takes very long to respond, you are very likely causing issues and should throttle down.
The issue is not that someone is indexing/crawling; the issue is that someone is doing it in a way that makes the “quality of service” go down.
And as others have pointed out, robots.txt is just a suggestion to the crawler.
Sorry for being rude, but I don’t think that’s something I want to spend my time on when I log off of work. Hosting things on the internet is sort of a hobby for me, but unfortunately my time is limited.
It doesn’t look like Caddy supports it out of the box; you have to install some unofficial extension, which then adds another chore onto my list: checking that it’s not horribly broken and updating it from time to time. With Caddy as a system package from the Debian repo I can just let unattended upgrades do their work.
I could use nginx, but then the entire Let’s Encrypt setup feels a bit fragile: loose pieces of software editing seemingly random files on a cron job… it doesn’t sound as reliable as a web server with the integration built in.
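For reference, this is roughly what the unofficial route would look like (the module is mholt/caddy-ratelimit; the syntax is going by its README and untested on my side, and the hostname and port are placeholders for my Gitea instance):

# build Caddy with the third-party rate limit module
xcaddy build --with github.com/mholt/caddy-ratelimit

# hypothetical example Caddyfile: roughly 1 req/s per client IP in front of Gitea
cat > Caddyfile.example <<'EOF'
{
	# unofficial directives need an explicit order, per the module README
	order rate_limit before basicauth
}

git.example.com {
	rate_limit {
		zone per_ip {
			key    {remote_host}
			events 60
			window 1m
		}
	}
	reverse_proxy localhost:3000
}
EOF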
I think I forgot to add some context here: this is a very small server just for me and one or two friends. There are some public files and some public repositories hosted there, but this is more similar to hosting a Minecraft server with friends, open to guests, than to some serious web hosting operation.
Thanks, will look into it.
What I think is happening is that people at those companies just put together crawlers at breakneck speed and the side effects are not their problem.
Oh, this isn’t aggression, it’s a summation of the basics. Your premise of being under DDoS is false, since you lack experience handling this kind of thing. What you got is a relatively friendly knock-knock from Meta.
And as others have pointed out, robots.txt is just a suggestion to the crawler.
And skipping this step is shooting yourself in the foot for no reason. You set up robots.txt to protect yourself from the well-behaved crawlers, then you mitigate the bad ones.
It makes just as much sense as setting empty passwords: weak ones can be cracked, so why bother setting any?
Sorry for being rude, but I don’t think that’s something I want to spend my time on when I log off of work.
Then you will operate blind, without any ability to tell what is actually happening. And you are already having issues with erratic third party traffic.
It’s not that hard. But if you cannot do it, stop using a VPS and start using SaaS; they will handle it for you.
I think I forgot to add some context here: this is a very small server just for me and one or two friends. There are some public files and some public repositories hosted there, but this is more similar to hosting a Minecraft server with friends, open to guests, than to some serious web hosting operation.
Then set up robots.txt as instructed above. This isn’t serious web hosting, it’s barebones basics. There also isn’t anything more to do beyond setting that up or using a security proxy service. For your hobby use case, you can use the free tiers. They will block malicious traffic before it reaches your servers.
If you plan on running your infra in the open, these are the kinds of issues you will face regularly. Saying it’s too much work won’t help with anything, but thankfully today you can offload some of that to third parties.
Yeah, I mostly agree with greatnull. You’re not getting DDoSed or even DoSed; it’s not really an attack, nor distributed, even if I think the ethics of that kind of mass scraping are dubious (given it’s not indexing but diffing through whole repos).
It should be fixable with robots.txt, and that is easy enough to do. I would be curious to know whether it actually works, though, since not all web scrapers will respect it.
I’m also curious what kind of VPS you use. It must be a pretty low-powered one if git requests bring it to its knees? Or is that a massive, complicated repo?
I get the confusion here:
Public service down → checked web server logs → saw requests from multiple IPs → assumed DDoS.
A DDoS (my definition) is a flood of IP traffic from many, many, many IP addresses, so many that blocking each subnet is a gargantuan ordeal. I’d classify this as an information-gathering attempt from Meta (going off the above).
Running your infra in public is asking for this type of trouble, from threat actors and big corporations alike.
Not sure what your clients are expecting (and will tolerate), but taking your git repos behind a VPN is another viable option. No more scraping from the public internet if they can’t access it. Kind of like how self-hosting Vaultwarden with certs is done.
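The firewall side of that can be as small as this (a sketch, assuming a WireGuard setup handing out 10.8.0.0/24; the subnet and port are placeholders for whatever your VPN and services actually use):

# allow HTTPS only from the VPN subnet, drop it for everyone else
sudo iptables -A INPUT -p tcp --dport 443 -s 10.8.0.0/24 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 443 -j DROP
# git-over-SSH can stay public, or give port 22 the same treatment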