look at that mother fucking screen

Started by the last yatto, December 26, 2010, 05:14:38 AM

Triple Zero

That's not the way PDF works.

Not only does the PDF format not support word wrap, it doesn't even support line breaks.

That's right. Every line of text on a PDF page is a separate text box, with its starting point specified as coordinates relative to the page.

That is, if you're lucky.

Because there are also things like "kerning", where certain combinations of characters need to be spaced closer together or farther apart to make the letters look more even, and all sorts of other things that can break these line boxes into even smaller chunks.
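
To give an idea of what a converter has to undo here, a rough sketch in Python. The `Chunk` type, the `join_chunks` name and the `space_gap` threshold are all made up for illustration, and it assumes the chunks have already been identified as sitting on the same line. A real tool would derive the threshold from the font size instead of hardcoding it.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    x: float      # left edge of the chunk on the page
    width: float  # rendered width of the chunk


def join_chunks(chunks, space_gap=1.5):
    """Rebuild one line of text from chunks that share a line.

    A horizontal gap of at least `space_gap` units counts as a word space;
    anything smaller (including the negative gaps kerning produces) is
    treated as the same word continuing.
    """
    parts = []
    prev_end = None
    for chunk in sorted(chunks, key=lambda c: c.x):
        if prev_end is not None and chunk.x - prev_end >= space_gap:
            parts.append(" ")
        parts.append(chunk.text)
        prev_end = chunk.x + chunk.width
    return "".join(parts)


# "AVATAR" split into three chunks by kerning, followed by a separate word.
line = [
    Chunk("rocks", 48.0, 25.0),
    Chunk("A", 10.0, 6.0),
    Chunk("ATAR", 21.0, 24.0),
    Chunk("V", 15.2, 6.0),
]
print(join_chunks(line))  # prints "AVATAR rocks"
```

Pick the threshold too small and words get split in half, too large and they get glued together, which is why this is guesswork.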

In addition to that, there is no requirement that these text boxes appear in the PDF document in the natural order of the text, because as long as they're on the same page, their coordinates alone determine where they'll be rendered.
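
A minimal sketch of putting boxes back into reading order, again in Python with hypothetical names (`reading_order`, `line_tolerance`). It assumes each box is an `(x, y, text)` tuple with y measured upward from the bottom of the page, and a simple single-column layout. Multiple columns, captions and pull-quotes would break it immediately.

```python
def reading_order(boxes, line_tolerance=2.0):
    """Return the text of each line, top of page first.

    boxes: iterable of (x, y, text) tuples; larger y means higher on the page.
    Boxes whose y differs by at most `line_tolerance` from the first box of
    the current line are treated as part of that line.
    """
    lines = []
    for box in sorted(boxes, key=lambda b: -b[1]):
        if lines and abs(lines[-1][0][1] - box[1]) <= line_tolerance:
            lines[-1].append(box)
        else:
            lines.append([box])
    # Within a line, sort left to right by x before joining.
    return [" ".join(text for _, _, text in sorted(line)) for line in lines]


# Boxes stored in an arbitrary order, as they might appear in the file.
boxes = [
    (72, 700.0, "line breaks."),
    (150, 714.0, "support"),
    (72, 728.0, "Title"),
    (72, 714.5, "PDF does not"),
]
print(reading_order(boxes))
# prints ['Title', 'PDF does not support', 'line breaks.']
```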

Combine that with captions below figures, big pull-quotes, drop caps, and those little superscript footnote numbers, and any kind of PDF-to-text converter needs some serious black-magic heuristics just to determine which text boxes make up one paragraph, one line, or even a single word.

Now, not all PDFs are equally bad in that regard, but you don't know how they were generated, so the only thing a converter program can do is somehow guess which text boxes belong to the same piece of text by how close together they are, and by assuming most lines are nearly (but not always exactly ...) the same distance from each other, etc.
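
Roughly what that guessing looks like, as a Python sketch. The `split_paragraphs` name and the `slack` factor are invented for this example. It assumes the lines are already in reading order as `(y, text)` pairs, with y measured upward from the bottom of the page, and it takes the median gap between lines as the "normal" line spacing.

```python
from statistics import median


def split_paragraphs(lines, slack=1.3):
    """Group lines into paragraphs by vertical spacing.

    lines: list of (y, text) pairs, top line first.
    A gap more than `slack` times the median gap starts a new paragraph.
    """
    if not lines:
        return []
    if len(lines) == 1:
        return [lines[0][1]]
    gaps = [upper[0] - lower[0] for upper, lower in zip(lines, lines[1:])]
    typical = median(gaps)
    paragraphs = [[lines[0][1]]]
    for gap, (_, text) in zip(gaps, lines[1:]):
        if gap > typical * slack:
            paragraphs.append([text])
        else:
            paragraphs[-1].append(text)
    return [" ".join(paragraph) for paragraph in paragraphs]


# Line spacing is "nearly but not exactly" 14 units, with one bigger jump.
page = [
    (700.0, "Every line is"),
    (686.1, "its own box."),
    (672.0, "So tools guess."),
    (650.0, "New paragraph"),
    (636.2, "starts here."),
]
for paragraph in split_paragraphs(page):
    print(paragraph)
```

On this sample the larger jump before "New paragraph" is the only gap above the threshold, so you get two paragraphs. A page that mixes font sizes or has a figure in the middle would need something smarter than one median.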

And yes, it is completely stupid that this jumbled mess is the most widely used format, the only one that everybody can read and be mostly certain will look pretty much exactly the same on every computer system and viewer, and probably the best we've got right now. But that's the way it is in computerland.
Ex-Soviet Bloc Sexual Attack Swede of Tomorrow™
e-prime disclaimer: let it seem fairly unclear I understand the apparent subjectivity of the above statements. maybe.

INFORMATION SO POWERFUL, YOU ACTUALLY NEED LESS.

Requia ☣

Quote from: Triple Zero on January 14, 2011, 05:59:06 PM

The zoom on a Sony supports word wrap on PDFs regardless of these problems. It's not ideal, but it does keep you from needing to scroll on a device where scrolling is a very bad idea.
Inflatable dolls are not recognized flotation devices.

Triple Zero

Certainly. I'm just saying it's not very straightforward to do.