Log of #mupdf at irc.freenode.net.

20200428
sarmols Does any documentation of stext exist? Because unfortunately I could not find any (except commented source code)08:03.55 
pedr0 hi all, I came across a very puzzling outcome while dealing with a PDF - apparently the rendering mode is set to '3' - namely invisible - yet the text appears on the page. https://pastebin.com/sPcncSpL08:32.31 
kens Going to need to see the PDF file08:32.48 
pedr0 Would anybody in here be able to point me in the right direction? I suspect that the 'ri' operation may somehow affect the rendering, but the PDF reference wasn't very clear, at least at my level, about what that entails.08:32.51 
kens Not just a fragment of it08:32.54 
pedr0 Oks08:33.00 
kens The ri is the rendering intent and won't (or shouldn't) affect the text rendering mode08:33.11 
  It's to do with the precise way that CIE colours are reproduced.08:33.46 
pedr0 How can I share the pdf ? can I send it over here ?08:39.04 
kens Put it on a download site is easiest08:39.15 
  Dropbox or similar08:39.19 
  and sahre hte URl08:39.22 
  <sigh> typos... share the URL08:39.37 
  I'm assuming this is a file you're happy to share publicly08:39.59 
pedr0 It's a public domain document - nothing secret.08:42.35 
  https://file.io/zQ2XeGz608:42.35 
kens ok 1 second while I get a copy08:42.47 
  Hmm, either it's a big file or it's a slow download.....08:43.43 
ator _YKY_: what's obvious for one is not obvious for all. the home/end keys serve a different function, jumping to the beginning and end of the current line when editing text. making them jump to the ends of the document could be very disruptive.08:44.26 
kens Looks like this is a scanned image, with OCR text added in Text rendering mode 308:44.31 
pedr0 I extracted a page from it as an example - in my company we process financial data and we've written some tooling to extract text from PDFs - often I need to get to the nitty-gritty of the pdf format (which is fairly sophisticated). In that specific case I don't understand how it is possible, given the '3 Tr' instruction at the beginning of the stream, for the text to be visible08:44.51 
kens There is no text afaics08:45.08 
  The content of the page is a scanned image, a bitmap picture08:45.18 
  An OCR package has been used to generate a textual representation of that08:45.35 
  Then the same text has been added to the PDF file, in text rendering mode 308:45.50 
  So the text isn't drawn, but the picture of the text is still there08:46.00 
pedr0 AH, that's why the text within the stream is so simple then08:46.25 
kens Note that this single page is 37MB!08:46.29 
  Decompressed, it's nearly 500MB08:46.47 
pedr0 Yes, it all makes sense, how did I miss that08:46.49 
kens It's easy to miss. I'm glad you follow the explanation; it's kind of hard to describe08:47.06 
pedr0 how did you spot that right away ?08:47.07 
kens Well, firstly, the text when displayed looks like a JPEG image, there are artefacts behind the text08:47.31 
ator sarmols: I'm afraid not, the commented source code and headers is the best source for now.08:47.32 
kens Then the sheer size is a giveaway08:47.42 
  37MB for a single page of text is *massive*08:47.52 
pedr0 it must be that /Im0 Do08:48.24 
kens And of course, I've come across this technique before (making an image PDF searchable using OCR) so it's not a surprise to see it08:48.25 
  Yes, Do is the image operator in PDF08:48.41 
pedr0 right right, I was so unsure about that /RelativeColorimetric ... well, at least now I know for sure that it does not play a role. So the text is indeed invisible.08:49.28 
  thanks a lot08:49.32 
kens NP08:49.37 
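
[Editor's illustration] The check kens describes above (text shown in rendering mode 3 on top of an image painted with Do) can be sketched roughly as follows. This is a simplified sketch, not how MuPDF or Ghostscript do it: it assumes the content stream is already decompressed, it ignores inline images and nested parentheses in strings, and the names summarize and looks_like_ocr_layer are hypothetical helpers invented for this example.

```python
import re

# Very rough content-stream tokenizer: literal strings, hex strings,
# array brackets, names (/Im0), and everything else (numbers, operators).
TOKEN = re.compile(
    rb"\((?:\\.|[^\\()])*\)"      # (literal string), no nested parentheses
    rb"|<[0-9A-Fa-f\s]*>"         # <hex string>
    rb"|[\[\]]"                   # array delimiters
    rb"|/[^\s/\[\]()<>]+"         # /Name
    rb"|[^\s/\[\]()<>]+"          # number or operator
)

TEXT_SHOW_OPS = (b"Tj", b"TJ", b"'", b'"')


def summarize(stream: bytes) -> dict:
    """Count text-showing operators per text rendering mode and list XObjects painted with Do."""
    mode = 0            # the default text rendering mode is 0 (fill)
    saved_modes = []    # q/Q save and restore the graphics state, which includes the mode
    shows_by_mode = {}
    xobjects = []
    prev = b""
    for match in TOKEN.finditer(stream):
        tok = match.group()
        if tok == b"q":
            saved_modes.append(mode)
        elif tok == b"Q" and saved_modes:
            mode = saved_modes.pop()
        elif tok == b"Tr" and prev.isdigit():
            mode = int(prev)                      # e.g. "3 Tr" selects invisible text
        elif tok in TEXT_SHOW_OPS:
            shows_by_mode[mode] = shows_by_mode.get(mode, 0) + 1
        elif tok == b"Do" and prev.startswith(b"/"):
            xobjects.append(prev[1:].decode("ascii"))
        prev = tok
    return {"shows_by_mode": shows_by_mode, "xobjects": xobjects}


def looks_like_ocr_layer(summary: dict) -> bool:
    """Heuristic: all text is invisible (mode 3) and at least one XObject is painted."""
    modes = summary["shows_by_mode"]
    return bool(modes) and set(modes) == {3} and bool(summary["xobjects"])


if __name__ == "__main__":
    sample = b"q 100 0 0 100 0 0 cm /Im0 Do Q\nBT 3 Tr /F1 10 Tf (hidden OCR text) Tj ET"
    result = summarize(sample)
    print(result, looks_like_ocr_layer(result))
```

Note that Do paints any XObject, not only images, so confirming that /Im0 really is a bitmap would still require looking up its /Subtype in the page resources.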
sarmols ator: thanks, guess I'll have to exercise reading code then :)08:53.46 
jklowden Hi, I'm adding support for regex search to mupdf, and I'm looking for a compile option to make debugging easier. Most public symbols aren't exported, so for example I can't stop on fz_search_stext_page.13:34.03 
ator jklowden: http://git.ghostscript.com/?p=user/tor/mupdf.git;a=commitdiff;h=1806b2e8964fd0916988ecac8b5e637ebd032cdd13:46.35 
  make build=debug13:46.44 
malc_ ator: the latest commit made me wonder whether comma after "Note" is needed (just curious)14:08.01 
jklowden ator: thanks, will check back. Do you want the patch if it works, and if so do I need to sign anything?14:10.08 
ator malc_: not needed, but neither is it needed, afaik.14:29.48 
  one of those dramatic pause commas14:30.07 
malc_ ator: well commas are hell, in russian at least. and then there's oxford one... oh well14:31.56 
sebras malc_: in chinese there is a comma between sentences. only once a topic is finished do you put a period. well... a 。 actually.14:39.01 
malc_ sebras: i don't know what chinese is :) Wu? Putonghua? Yue?14:40.12 
sebras malc_: well, the chinese I speak is formally Guoyu (literally country language).14:40.53 
malc_ sebras: putonghua for all intents and purposes, right?14:41.48 
sebras malc_: to a large extent, but using traditional characters of course.14:45.35 
malc_ sebras: right right... and using that weird non-pinyin romanization you mentioned the other day14:46.43 