Script To Fix HTML Bugs?

Argon · May 16, 2017, 1:45pm

I personally know that you're not meant to actually type in symbols straight into your HTML, but rather use the symbol entity codes. I.e. I've learned from having typed it so many times that '-' = -, I automatically use the entity code for it now.

Anyway, I've gotta work with about 1,000 or so files of static content, and I wanna know if there's a neat script, front end or back end, that can fix any invalid entities. From checking some of the files, I've noticed that they output the likes of canâ€™t. I will not be going through each file, searching line by line for invalid characters.

It's bad enough that nearly every other page has a unique layout, don't ask, whoever made the site before is clearly not a very good developer... I mean it's impressive that he didn't stick to a theme of some sort, but a good thing, personally, I think not.

mihawk90 · May 16, 2017, 4:42pm

If you're getting those strange characters, it's usually more a problem of file encoding and/or headers rather than entities. Browsers are rather relaxed about them these days.

But, for the question itself:

Haven't tried this, but looks promising.

Argon · May 17, 2017, 9:50am

May as well give it a go aye? And only thing I've thought of trying is different char sets to be perfectly honest.

HOWEVER, and please don't ask me why, it actually hurts me head, but they have some characters that, as far as I'm aware, are VERY odd characters to use for web content, an example being a collection of the following symbols:

▓ , ֍ , ۞ , ۩ , ܏ , ߷ , ፠ , ⎈ , ↈ , ƒ , ᴥ , ☬ , ☭ , 卐

.... I'm sure you get the picture around about now? .... I mean I know each symbol has its own meaning and purpose, but half the time, with this current site I'm working on, I just wonder why they don't use a graphic instead, like with the hot beverage symbol....

And just so no one's gonna be silly or childish, yes, I included a swastika, and no, that one is not a Nazi swastika but a Buddhist one (so I believe, may be Hindu, I'm not an expert on the matter, either way that one is actually a religious symbol).

Argon · May 17, 2017, 9:56am

As for the encoder, I just ran the demo, and it seems to work just fine. The only concern I have is if it'll work before the browser tries to output the HTML, cause when you view the source code, it'll even display in there as canâ€™t.

jak_ub · May 17, 2017, 10:57am

I assume it's supposed to be "can't"?

If the source contains canâ€™t, then it usually means the source files were opened and saved at least once with an incorrect encoding. That means you now have two issues to solve.
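To see how canâ€™t can come about, here is a minimal Python sketch. It assumes the original text was UTF-8 and was at some point read as Windows-1252 (cp1252). That is a common cause, but it is only a guess for these particular files.

```python
# Assumption: the text was saved as UTF-8, then opened as Windows-1252.
original = "can\u2019t"                  # "can’t" with a curly apostrophe (U+2019)
utf8_bytes = original.encode("utf-8")    # the apostrophe becomes three bytes
garbled = utf8_bytes.decode("cp1252")    # each byte is misread as its own character

print(utf8_bytes)  # b'can\xe2\x80\x99t'
print(garbled)     # canâ€™t
```

Saving the file again after that misread bakes the three wrong characters into the source, which is why they show up in "view source" too.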

Source character "trenches" usually need to be fixed by manual rules (I do not believe there is a tool that guesses the sequence of read-writes with wrong encodings to restore the characters).

You could use an IDE that lets you run a regex or simple replacement, case by case, across all files.
Or you would need to write your own script/program.
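If you go the "own script" route, a rough Python sketch could look like the one below. It assumes the damage is exactly one round of "UTF-8 bytes read as cp1252", that the files are currently readable as UTF-8, and that they match `*.html`. The names `repair_mojibake` and `repair_tree` are made up for illustration.

```python
import re
from pathlib import Path

# Any run of non-ASCII characters is a candidate for repair.
_NON_ASCII_RUN = re.compile(r"[^\x00-\x7f]+")


def repair_mojibake(text):
    """Undo one round of 'UTF-8 read as cp1252' wherever the round trip works."""

    def fix_run(match):
        run = match.group(0)
        try:
            # Turn the characters back into the bytes they were misread from,
            # then decode those bytes the way they were meant to be decoded.
            return run.encode("cp1252").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            # Not a cp1252 misreading of UTF-8 (e.g. a genuine é or 卐): keep it.
            return run

    return _NON_ASCII_RUN.sub(fix_run, text)


def repair_tree(root, pattern="*.html", dry_run=True):
    """Return the files that would change; only write when dry_run is False."""
    changed = []
    for path in Path(root).rglob(pattern):
        # newline="" keeps the existing line endings untouched.
        with path.open("r", encoding="utf-8", newline="") as handle:
            text = handle.read()
        fixed = repair_mojibake(text)
        if fixed != text:
            changed.append(path)
            if not dry_run:
                with path.open("w", encoding="utf-8", newline="") as handle:
                    handle.write(fixed)
    return changed
```

Run it with `dry_run=True` first and look over the list of files it reports. It will not repair text that was mangled more than once, and a run that mixes genuine and garbled characters is left alone.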

Argon · May 17, 2017, 11:10am

Correct, I don't know why, I thought I already stated that, if not, it was kinda a simple one to work out, so no biggy aye?

As for those options, well, I'm not sure what the hell to do.

jak_ub · May 17, 2017, 11:31am

The issue with trenches in the source code can be fixed with Notepad++ (proper escaping of all characters would be step 2, as escaping trenches would still create trenches lol):

Argon · May 17, 2017, 12:02pm

Although that seems like a good solution, I could imagine something not working correctly, I mean I may as well give it a go anyway, cause there's still a decent chance that nothing will go wrong aye?

jak_ub · May 17, 2017, 12:06pm

Just always keep originals (e.g. in GIT).

_adrian · May 17, 2017, 1:13pm

I know this was not your actual question, but it's important to know that the idea that you shouldn't type symbols straight into your HTML is not correct.

You should always prefer the actual character, except in cases where that would be unclear/confusing (e.g., nonprinting characters) and where the character actually needs to be escaped in html context (i.e., <, >, &, and sometimes ', ").

"htmlentity all the things" is from days when no one had their sh*t together with unicode support. It's not been necessary/advisable for decades: declare your character encoding properly and that's all there is to it.

https://www.w3.org/International/questions/qa-escapes#not
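As a rough illustration of that rule, here is a small Python sketch using the standard `html` module. The sample strings are made up. `html.escape` only touches the characters that are special in HTML, and `html.unescape` turns existing entities back into the actual characters.

```python
import html

# Only &, <, > and quotes are escaped; the curly apostrophe is left as it is.
text = 'Tom & Jerry <3 "can\u2019t"'
escaped = html.escape(text)
print(escaped)    # Tom &amp; Jerry &lt;3 &quot;can’t&quot;

# Going the other way: replace entity codes with the real characters.
legacy = "can&rsquo;t &#169; 2017"
decoded = html.unescape(legacy)
print(decoded)    # can’t © 2017
```

This only helps if the page is saved as UTF-8 and also declares it, for example with `<meta charset="utf-8">`.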

mihawk90 · May 17, 2017, 2:57pm

They don't use a graphic because it saves bandwidth.
Why use a graphic when the font (that's on the system) already has the symbols you need?

GIT even complains when you switch character encoding and uses UTF-8 by default (which is the fallback standard in HTML5 too I think).

While it's true that escaping everything isn't necessary any more, it still doesn't hurt to do it even now (well, technically bandwidth and stuff, but seriously.. no.)

Argon · May 18, 2017, 10:28am

Touché...

I'm pretty sure you're right.

Long story short, the actual source code of this site is F$@KING DISGUSTING.... I wouldn't even write code this bad if I hated my life, I swear that this website was made by people who've just begun to learn HTML & CSS, like they've read a handful of articles on W3Schools or something, it's actually foul. This is by far the worst source code I've ever seen. I spent 1 day tidying up 1 file, it was thousands of lines long, BUT..... EVERYTHING... And I mean literally EVERYTHING was in line, they had not used the enter button when writing this code, like at all....

I swear to god, jobs like this will cause me to smoke 50 fags a day at this rate. I mean I don't mind a pretty crappy file, full of br tags, empty divs, silly crap like that, I can handle that just fine, but literally ~5,000 lines of code, all inline, that's not even funny, my eyes are genuinely burning from trying to see what's f$@ked up and what isn't, some tags weren't closed correctly, some had wrong names, some inline css, all pretty disgusting things...

jak_ub · May 18, 2017, 10:57am

I assume you checked the line-ending characters? Most editors should recognize them anyway.
I ask, since putting everything on one line requires an actual "effort" when editing files, or at least an automated process before sending them (and that alone is mean).

Argon · May 18, 2017, 11:17am

I honestly don't know why everything is inline, I mean I just can't see why on earth you'd do that. I mean I could even somewhat understand if it was a 'compiled' version, but this is legit the developer's files. Just why?

Argon · May 18, 2017, 2:09pm

Thank F$!K I've just finished rebuilding all of the web pages, it has without a doubt been a painful process, I can certainly say I've learned how to NOT web dev. I mean I've gone through 1,000 files within a week and a bit, and I'm honestly SO relieved now that it's all done, I can put that f$@king nightmare behind me and hopefully go onto projects where the previous developers have at least tried to make an effort!

It has been so painful that I built myself a tool to help me validate every individual page.
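The tool itself isn't shown in this thread. For anyone wanting something similar, below is a minimal sketch of one such check (unclosed and stray tags) using Python's standard `html.parser`. The names `TagBalanceChecker` and `check_html` are hypothetical and this is not the tool mentioned above.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TagBalanceChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []      # (tag, line) for every element still open
        self.problems = []   # human-readable findings

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append((tag, self.getpos()[0]))

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if tag not in [open_tag for open_tag, _ in self.stack]:
            self.problems.append(f"line {self.getpos()[0]}: stray </{tag}>")
            return
        # Pop back to the matching opener; everything above it was left open.
        while self.stack:
            open_tag, open_line = self.stack.pop()
            if open_tag == tag:
                break
            self.problems.append(f"line {open_line}: <{open_tag}> never closed")


def check_html(source):
    """Return a list of tag-balance problems found in an HTML string."""
    checker = TagBalanceChecker()
    checker.feed(source)
    checker.close()
    for open_tag, open_line in checker.stack:
        checker.problems.append(f"line {open_line}: <{open_tag}> never closed")
    return checker.problems


print(check_html("<div><span>hi</div>"))  # ['line 1: <span> never closed']
```

It is stricter than HTML itself: end tags that the spec allows you to omit (such as `</p>` or `</li>`) get reported as unclosed, so treat the output as hints rather than a verdict.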

mihawk90 · May 18, 2017, 3:13pm

It's called compression/uglifying/minifying lol (see bandwidth again).

google HTML beautifier to reverse the process, you don't need to clean it up yourself. Every decent IDE/HTML editor has something like it built in.

Or it's just that your editor doesn't recognize the line endings (like if you open a unix-style line-ending file in Notepad, that happens too).
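A quick way to rule the line-ending theory in or out is to count the line endings in the raw bytes. A minimal Python sketch follows. The function name `count_line_endings` is made up, and you would feed it the bytes of one of the suspect files.

```python
def count_line_endings(data):
    """Count Windows (CRLF), Unix (LF) and old Mac (CR) line endings in bytes."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf   # lone \n that is not part of \r\n
    cr = data.count(b"\r") - crlf   # lone \r that is not part of \r\n
    return {"crlf": crlf, "lf": lf, "cr": cr}


sample = b"<div>\r\n<p>one</p>\n<p>two</p>\r</div>"
print(count_line_endings(sample))  # {'crlf': 1, 'lf': 1, 'cr': 1}
```

If a real file (read with `open(path, "rb").read()`) reports zero for all three, it really is one long line. If it shows only `lf` or only `cr`, the editor is the more likely culprit.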

Argon · May 18, 2017, 4:12pm

Oh no, I mean that was literally the developer version, before having been compressed or whatever, the compressed version is actually nicer, it has fewer tags. ... They actually sorted out their br tags in that one...

And I would've just used an IDE where it does that, but not only was the code foul to look at, but it was screwed up, closing tags missing, sometimes too many tags, etc, you get the picture? ... I mean I'd be impressed if there is an IDE that could both make it look neater and actually rearrange some of the tags? .... I think that's asking too much, but I don't actually know if such a thing exists personally?

BUT in general, an HTML beautifier would've been useful either way, I genuinely didn't think of that, but then again, it would've been annoying 'cause I'd basically have to use it every 5 seconds... .... Either way, it's done now!