Help keep robots out of my website!

tiger_sty1e · June 7, 2020, 1:59pm

Hello! I am playing around with a VPS and I want to host a very simple one-page website. Since this is a just-for-fun project, I don’t want it listed in Google or any other search engine.

I did some research and there seem to be 3 ways to handle this…

robots.txt
It’s a separate file placed in the root folder of a website. It can block legitimate crawlers from accessing the site entirely, but the site may still end up in the index if other websites link to it.

robots meta tag
It’s a meta tag placed in the HEAD of the HTML document. It can tell legitimate crawlers not to index the page, but it only applies to the HTML document itself, so additional images or videos in the server directory can still be indexed (?). If a robots.txt file blocks access entirely, crawlers never fetch the page, so this meta tag doesn’t do anything.

X-Robots-Tag
It’s configured at the server level and sent as a header in the HTTP response. It can block all indexing by legitimate crawlers, including the HTML files and any pictures, videos etc. hosted on the server.
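For the X-Robots-Tag option, here is a rough sketch of the idea using Python's built-in http.server as a stand-in for a real web server. The handler name NoIndexHandler, the helper fetch_x_robots_tag, and the page body are made up for illustration; on an actual VPS the header would more likely be set in the web server's own configuration.

```python
import http.client
import http.server
import threading


class NoIndexHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical stand-in for a real web server: one page, one extra header."""

    def do_GET(self):
        body = b"<html><body>just for fun</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def end_headers(self):
        # Runs for every response this handler sends, so the directive
        # is attached regardless of what kind of content is served.
        self.send_header("X-Robots-Tag", "noindex, nofollow")
        super().end_headers()

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch_x_robots_tag():
    """Start the server on a free local port, request "/", return the header value."""
    server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), NoIndexHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
        conn.request("GET", "/")
        response = conn.getresponse()
        response.read()
        conn.close()
        return response.getheader("X-Robots-Tag")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(fetch_x_robots_tag())
```

Because the header is added in end_headers rather than in do_GET, the same approach would cover images or other files if the handler served them.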

In my case I have just an index.html file in the root directory and no robots.txt file, and I have the robots meta tag below in the HEAD of the document. Is this enough to keep my site from being indexed?

<meta name="robots" content="noindex, noimageindex, nofollow, noarchive, nocache, nosnippet">
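One way to double-check that the tag is really in the served page is to parse the HTML and list the directives. This is only a sketch using Python's standard html.parser; the class RobotsMetaParser and the function robots_directives are hypothetical names, and the sample page is a minimal stand-in for my index.html. It shows what a crawler would read, not whether any particular crawler honors each directive.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives from every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Directives are comma-separated; normalize whitespace and case.
            self.directives += [d.strip().lower() for d in content.split(",") if d.strip()]


def robots_directives(html):
    """Return the list of robots directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Minimal stand-in page (assumption: the real index.html has the tag in <head>).
page = (
    "<html><head>"
    '<meta name="robots" content="noindex, noimageindex, nofollow, '
    'noarchive, nocache, nosnippet">'
    "</head><body>hello</body></html>"
)

if __name__ == "__main__":
    print(robots_directives(page))
```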

Thanks!

MazeFrame · June 7, 2020, 10:50pm

In the DEF CON talk linked below, the speaker goes over how one can confuse scan tools by returning “unexpected” status codes (from minute 25:44 onward).
AFAIK all browsers will just display content, even when the header says 204 or 404.
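A minimal sketch of that trick, assuming Python's built-in http.server: the page content is sent as usual, but with a 404 status line. The names Fake404Handler and fetch_status_and_body are hypothetical. The sketch uses 404 only, since a 204 response is defined as carrying no body.

```python
import http.client
import http.server
import threading


class Fake404Handler(http.server.BaseHTTPRequestHandler):
    """Hypothetical handler: real content, misleading status code."""

    def do_GET(self):
        body = b"<html><body><h1>Real page content</h1></body></html>"
        self.send_response(404)  # status says "Not Found"...
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)  # ...but the page is delivered anyway

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch_status_and_body():
    """Start the server on a free local port and return (status, body) for "/"."""
    server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), Fake404Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
        conn.request("GET", "/")
        response = conn.getresponse()
        body = response.read()
        conn.close()
        return response.status, body
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    status, body = fetch_status_and_body()
    print(status, body.decode())
```

A client that only looks at the status code would treat the page as missing, while one that renders the body still gets the content.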

StirlingClaire · December 19, 2020, 11:33pm

That sounds like a first to me: trying to hide your site. Have you thought about getting it ready offline first, with some help, and then launching it?