| <<<Back 1 day (to 2020/04/27) | Fwd 1 day (to 2020/04/29)>>> | 20200428 |
sarmols | Does any documentation of stext exist? Because unfortunately I could not find any (except commented source code) | 08:03.55 |
pedr0 | hi all, I came across a very puzzling outcome while dealing with a PDF - apparently the rendering mode is set to '3' - namely invisible - yet the text appears on the page. https://pastebin.com/sPcncSpL | 08:32.31 |
kens | Going to need to see the PDF file | 08:32.48 |
pedr0 | Would anybody in here be able to point me in the right direction? I suspect that the 'ri' operation may somehow affect the rendering, but the PDF reference wasn't very clear, at least at my level, about what that entails. | 08:32.51 |
kens | Not just a fragment of it | 08:32.54 |
pedr0 | Oks | 08:33.00 |
kens | The ri is the rendering intent and won't (or shouldn't) affect the text rendering mode | 08:33.11 |
| It's to do with the precise way that CIE colours are reproduced. | 08:33.46 |
pedr0 | How can I share the pdf ? can I send it over here ? | 08:39.04 |
kens | Putting it on a download site is easiest | 08:39.15 |
| Dropbox or similar | 08:39.19 |
| and sahre hte URl | 08:39.22 |
| <sigh> typos... share the URL | 08:39.37 |
| I'm assuming this is a file you're happy to share publicly | 08:39.59 |
pedr0 | It's a public domain document - nothing secret. | 08:42.35 |
| https://file.io/zQ2XeGz6 | 08:42.35 |
kens | ok 1 second while I get a copy | 08:42.47 |
| Hmm, either it's a big file or it's a slow download..... | 08:43.43 |
ator | _YKY_: what's obvious for one is not obvious for all. the home/end keys serve a different function, jumping to the beginning and end of the current line when editing text. making them jump to the ends of the document could be very disruptive. | 08:44.26 |
kens | Looks like this is a scanned image, with OCR text added in Text rendering mode 3 | 08:44.31 |
pedr0 | I extracted a page from it as an example - in my company we process financial data and we've written some tooling to extract text from PDFs - often I need to get to the nitty-gritty of the pdf format (which is fairly sophisticated). In that specific case I don't understand how it is possible, given the '3 Tr' instruction at the beginning of the stream, for the text to be visible | 08:44.51 |
kens | There is no text afaics | 08:45.08 |
| The content of the page is a scanned image, a bitmap picture | 08:45.18 |
| An OCR package has been used to generate a textual representation of that | 08:45.35 |
| Then the same text has been added to the PDF file, in text rendering mode 3 | 08:45.50 |
| So the text isn't drawn, but the picture of the text is still there | 08:46.00 |
pedr0 | AH, that's why the text within the stream is so simple then | 08:46.25 |
kens | Note that this single page is 37MB! | 08:46.29 |
| Decompressed, it's nearly 500MB | 08:46.47 |
pedr0 | Yes, it all makes sense, how did I miss that | 08:46.49 |
kens | It's easy to miss, I'm glad you follow the explanation, it's kind of hard to describe | 08:47.06 |
pedr0 | how did you spot that right away ? | 08:47.07 |
kens | Well, firstly, the text when displayed looks like a JPEG image, there are artefacts behind the text | 08:47.31 |
ator | sarmols: I'm afraid not, the commented source code and headers are the best source for now. | 08:47.32 |
kens | Then the sheer size is a giveaway | 08:47.42 |
| 37MB for a single page of text is *massive* | 08:47.52 |
pedr0 | it must be that /Im0 Do | 08:48.24 |
kens | And of course, I've come across this technique before (making an image PDF searchable using OCR) so it's not a surprise to see it | 08:48.25 |
| Yes, Do is the image operator in PDF | 08:48.41 |
pedr0 | right right, I was so unsure about that /RelativeColormetric ... well at least now I know for sure that does not play a role. So the text is indeed invisible. | 08:49.28 |
| thanks a lot | 08:49.32 |
kens | NP | 08:49.37 |
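The pattern kens describes above (a page-sized image painted with Do, plus an OCR text layer in text rendering mode 3) can be checked for roughly as in the sketch below. It is only an illustration: the function name `summarize_stream` and the sample stream are made up, the input is assumed to be an already decompressed content stream, and the regex matching is deliberately naive (the same bytes can occur inside string operands, and Do can also paint form XObjects, not only images).

```python
import re

# Naive operator patterns. A real tool should tokenize the content stream
# properly; these regexes can be fooled by operands inside strings.
TR_RE = re.compile(rb"(?<![\w.])(\d)\s+Tr\b")           # e.g. "3 Tr"
DO_RE = re.compile(rb"/([^\s/\[\]<>()]+)\s+Do\b")       # e.g. "/Im0 Do"


def summarize_stream(content: bytes) -> dict:
    """Report text rendering modes and XObjects painted in a content stream.

    `content` is assumed to be a decompressed page content stream.
    """
    modes = [int(m.group(1)) for m in TR_RE.finditer(content)]
    xobjects = [m.group(1).decode("latin-1") for m in DO_RE.finditer(content)]
    # Heuristic only: something is painted with Do, and every Tr operator
    # seen selects mode 3 (invisible text). With no Tr at all the default
    # mode is 0 (fill), so the text would be visible.
    looks_like_ocr_layer = bool(xobjects) and bool(modes) and all(m == 3 for m in modes)
    return {
        "render_modes": modes,
        "xobjects": xobjects,
        "looks_like_ocr_layer": looks_like_ocr_layer,
    }


if __name__ == "__main__":
    # Hypothetical stream: a scaled image followed by invisible text.
    sample = b"q 612 0 0 792 0 0 cm /Im0 Do Q\nBT 3 Tr /F1 10 Tf 72 700 Td (Hello) Tj ET\n"
    print(summarize_stream(sample))
```

For a tool that extracts text from PDFs, a positive result here suggests the extracted text came from an OCR pass over a scan rather than from the document's original text, which may matter for how much the output is trusted.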
sarmols | ator: thanks, guess I'll have to exercise reading code then :) | 08:53.46 |
jklowden | Hi, I'm adding support for regex search to mupdf, and I'm looking for a compile option to make debugging easier. Most public symbols aren't exported, so for example I can't stop on fz_search_stext_page. | 13:34.03 |
ator | jklowden: http://git.ghostscript.com/?p=user/tor/mupdf.git;a=commitdiff;h=1806b2e8964fd0916988ecac8b5e637ebd032cdd | 13:46.35 |
| make build=debug | 13:46.44 |
malc_ | ator: the latest commit made me wonder whether comma after "Note" is needed (just curious) | 14:08.01 |
jklowden | ator: thanks, will check back. Do you want the patch if it works, and if so do I need to sign anything? | 14:10.08 |
ator | malc_: not needed, but neither is it needed, afaik. | 14:29.48 |
| one of those dramatic pause commas | 14:30.07 |
malc_ | ator: well commas are hell, in russian at least. and then there's oxford one... oh well | 14:31.56 |
sebras | malc_: in chinese there is a comma between sentences. only once a topic is finished do you put a period. well... a 。 actually. | 14:39.01 |
malc_ | sebras: i don't know what chinese is :) Wu? Putonghua? Yue? | 14:40.12 |
sebras | malc_: well, the chinese I speak is formally Guoyu (literally country language). | 14:40.53 |
malc_ | sebras: putonghua for all intents and purposes, right? | 14:41.48 |
sebras | malc_: to a large extent, but using traditional characters of course. | 14:45.35 |
malc_ | sebras: right right... and using that weird non pinyin romanization you mentioned the other day | 14:46.43 |