Log of #mupdf at irc.freenode.net.

20190611
pietrop hi everybody, still here banging my head :-)08:44.00 
  is it possible to use mutool to compress a pdf previously obtained through "mutool clean -asdifggg" ?08:44.52 
ator pietrop: mutool clean -z (as in zip)08:46.45 
  pietrop: mutool clean -diz is probably the best combination of flags08:47.32 
  decompress, then zip, to get rid of the 'ascii' filter08:47.48 
pietrop oh sorry, that was in the documentation08:52.30 
  I did not spot it08:52.35 
  did you mean -daz ?08:53.26 
ator no, I meant 'i' to not decompress images (or you'll lose JPEG compression on JPEG images)08:53.59 
pulsarpietro Hopefully I am not asking a silly question, I am learning a bit about text drawings as some of the watermarks I want to remove are not made of images unfortunately. I frequently came across the TJ/Tj instructions and very often the text isn't plain text, but it is encoded somehow. Is it possible to mutool-ize it to be plain text ? I was also wondering how such text is obtained - is the font's charset10:51.03 
kens what do you mean by 'plain text' ?10:51.31 
  For example, Chinese isn't going to come out as ASCII no matter what you do10:51.48 
  Beyond that, fonts may use a custom Encoding, which means that the character code bears no relation to the glyph drawn.10:52.28
  Often (but it's optional) a font may include a ToUnicode CMap which maps the character codes to Unicode code points.10:52.57
  I believe that MuPDF already uses this to create text output10:53.17 
  I'm not sure exactly how MuPDF does its text extraction, but I imagine it follows similar rules to Ghostscript: if there is a ToUnicode CMap, use it, it's the most reliable information10:54.19
  In the absence of a ToUnicode CMap, if the font is TrueType, try to use the cmap subtable from the font. If the font is PostScript then try to use the glyph name from the Encoding or CharStrings dictionary. If it's a recognised name, translate it to Unicode; if it's a Unicode-style name, strip off the prefix and use the numeric portion as the Unicode code point.10:55.31
  If all else fails, treat it as ASCII.10:55.42
  Oh I forgot, if it's a CIDFont, then you can (sometimes) use the CMap associated with the CIDFont to give you the Unicode code points.10:56.13
  I'm afraid that in the absence of a ToUnicode CMap it's all guesswork10:56.38
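The fallback order kens describes can be sketched as a small lookup chain. This is only an illustration of the ordering, not MuPDF or Ghostscript code: the function name, its arguments, and the dictionary shapes are all hypothetical, the TrueType cmap is simplified to a direct code-to-code-point table, the "uni" prefix is assumed from the usual glyph-naming convention, and the CIDFont case is left out.

```python
import re

def guess_unicode(code, to_unicode=None, truetype_cmap=None,
                  glyph_name=None, known_names=None):
    """Return a best-guess Unicode string for one character code.

    Hypothetical helper: every parameter is an illustrative stand-in
    for data a real PDF interpreter would pull from the font resource.
    """
    # 1. A ToUnicode CMap, when present, is the most reliable source.
    if to_unicode and code in to_unicode:
        return to_unicode[code]
    # 2. TrueType: fall back to the font's cmap subtable
    #    (simplified here to a code -> code point dict).
    if truetype_cmap and code in truetype_cmap:
        return chr(truetype_cmap[code])
    # 3. PostScript-style fonts: go via the glyph name.
    if glyph_name:
        if known_names and glyph_name in known_names:
            return known_names[glyph_name]
        # Assumed convention: "uni" followed by four hex digits.
        match = re.fullmatch(r"uni([0-9A-Fa-f]{4})", glyph_name)
        if match:
            return chr(int(match.group(1), 16))
    # 4. Last resort: treat the code as ASCII.
    return chr(code)
```

Each step only runs when the more reliable ones above it have nothing to offer, which is why everything past step 1 is guesswork.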
ator pulsarpietro: the arguments to the text showing operators (TJ, Tj, ', ") are just bytes, interpreted according to the font resource's Encoding entry. This does not have to correspond to any actual text encoding, it can be any arbitrary bytes that happen to produce human readable text using the embedded font.11:00.31 
pulsarpietro I meant something I can read and recognize from the "rendered version of PDF" - they are all in English and we do render them all with pdftotext - therefore I am assuming such mapping (about which I am not very knowledgeable) is present.11:01.16 
ator pulsarpietro: there is a "structured text device" that extracts human readable text, if available11:01.43 
kens As ator said, not really, no. There *might* be a ToUnicode CMap, but there might not.11:01.43
ator mutool draw -Ftext11:01.50 
  this is not guaranteed to work, but on well formed files it will generally work.11:02.14 
  it all depends on the input. garbage in, garbage out, that whole deal.11:02.28 
pulsarpietro eheh11:02.34 
kens For example, a subset font will often be created where character code 1 is assigned to the first character, character code 2 to the second and so on. So "Hello World" would become <0102030304050604070308>11:03.04
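A minimal sketch of that subsetting scheme, assuming codes are handed out in order of first appearance and fit in one byte. The function name is made up for illustration. "Hello World" has 11 characters, so the encoded string is 11 bytes long, and the inverted table plays the role a ToUnicode CMap would.

```python
def subset_encode(text):
    """Assign code 1 to the first distinct character, 2 to the next, and so on.

    Hypothetical helper; a real PDF producer may number glyphs differently.
    Only works for up to 255 distinct characters (one byte per code).
    """
    codes = {}
    out = bytearray()
    for ch in text:
        if ch not in codes:
            codes[ch] = len(codes) + 1
        out.append(codes[ch])
    return bytes(out), codes

encoded, codes = subset_encode("Hello World")
print("<" + encoded.hex() + ">")

# Without the reverse table the bytes are meaningless; with it the text comes back.
to_unicode = {code: ch for ch, code in codes.items()}
decoded = "".join(to_unicode[b] for b in encoded)
print(decoded)
```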
ator in a *well* made PDF, the fonts all use sensible encodings and have ToUnicode CMaps to allow software to convert the text drawing operations into unicode text.11:03.27 
kens NB you often can't create a ToUnicode CMap or a 'sensible' encoding if the PDF file was created from PostScript :-)11:04.08 
ator but as kens said, it's nowhere near a guarantee that this is the case. plenty of PDF creators never bother to write the information needed to get unicode text back out.11:04.15 
pulsarpietro mmm....I need to do some more reading.. but I can use pdftotext and it works fine, which means that the text is printed out correctly, doesn't that mean that a ToUnicode CMap is associated with the font ?11:09.49
kens It might, or it might mean the encoding makes sense11:10.08 
pulsarpietro oks11:10.19 
kens mutool can extract text I think11:10.27 
pulsarpietro I am experimenting with mutool draw -Ftext, is it possible to preserve the layout ?11:25.04
paulgardiner What's the magic incantation to make mupdf-gl save the document?11:25.16 
  A perusal of do_app and do_help within gl-main.c has so far not helped.11:31.07 
kens Hmm, sorry I don't know the answers to these questions. AFAIK mupdf doesn't maintain the text layout when extracting, though it does write the text position information when outputting 'XML'. It's hard to recreate a rich text layout when emitting ASCII; Ghostscript makes an attempt with the txtwrite device, but it's not terribly reliable.11:49.40
ator paulgardiner: 'a' to open annotations and there's a button in the lower right11:54.59 
  paulgardiner: I probably should add a menu bar or something ... but sebras would object!11:55.34 
  to wasting screen space11:55.41 
sebras ator: I would indeed. :)12:07.22 
paulgardiner ator: ah yes. Thanks. That did it.12:20.54 
pulsarpietro @kens: thanks12:53.09 
kens NP12:53.17 
paulgardiner Robin_Watts: some time ago, you pointed out a bug in the signature support code. There's a fix for the problem on my master branch, I believe.12:57.40 
Robin_Watts looking.12:58.03 
  sizeof("STRING") == strlen("STRING") + 1 ?12:59.34 
paulgardiner Great. Ta12:59.36 
Robin_Watts "STRING" is a char *, or an array of chars ?13:00.10 
  I guess it's the latter, but can be seen as the former when required.13:00.28 
paulgardiner I'd never used that before and had to google it.13:01.33 
Robin_Watts ok, as long as one of us has :)13:01.58 
  You're assuming that the data is there as "/ByteRange" rather than /B#79teRange etc.13:02.32
  but I guess we're writing the dictionary you're searching for ?13:02.51 
paulgardiner Yes, I believe we write it.13:03.03 
Robin_Watts Then lgtm.13:03.10 
paulgardiner Thank you13:04.42 
  cluster seems happy too.13:05.14 
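The concern about alternative spellings of /ByteRange can be illustrated with a short sketch. PDF names may hex-escape any character as # followed by two hex digits, so /B#79teRange names the same key as /ByteRange and a plain byte search for the literal would miss it. The helper name below is hypothetical and this is not the MuPDF code under review.

```python
import re

def decode_pdf_name(raw: bytes) -> bytes:
    """Expand #xx hex escapes in a PDF name token (leading '/' is kept)."""
    return re.sub(
        rb"#([0-9A-Fa-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        raw,
    )

dictionary = b"<< /B#79teRange [0 10 20 30] >>"

# A literal search only works when the writer emitted the plain spelling.
found_literally = b"/ByteRange" in dictionary
found_after_decoding = b"/ByteRange" in decode_pdf_name(dictionary)
print(found_literally, found_after_decoding)
```

When the same code also writes the dictionary, as in this case, the plain spelling is guaranteed and the literal search is enough.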
dzonerzy Hey guys, could you check this please? https://github.com/ccxvii/mujs/issues/105 19:51.00
sebras dzonerzy: let me try to reproduce.20:53.16 
  dzonerzy: what version of mujs are you running?20:57.24 
dzonerzy master from git20:59.29 
  to reproduce just write20:59.38 
sebras dzonerzy: e2b59201d5049a8ad509b280e729a871756abc99 ?20:59.43 
dzonerzy /.(?:)/21:00.14 
  this would compile as regex and trigger the bug21:00.25 
  sure e2b59201d5049a8ad509b280e729a871756abc9921:00.49 
sebras dzonerzy: ok. I'm unable to reproduce.21:01.16 
dzonerzy sorry checked now it was the previous commit my bad :D21:01.34 
  it was fixed without any answer on the issue, thanks much21:01.59
sebras dzonerzy: maybe ator forgot to close issue 105.21:02.42 
dzonerzy ye thanks much for the fast fix :) i love mujs21:03.06 
sebras dzonerzy: I'll pass that on to ator. :)21:03.20 
dzonerzy I already closed the issue21:04.04 
sebras dzonerzy: thanks.21:04.24 
dzonerzy found a weird behavior21:50.33 
  max size for array is 0x7fffffff21:50.43 
  anyway if you declare var a=[] and then do a[1]=1 , length is increased and becomes 1, but if you set a[0xffffffff+1]=1 length remains 0 but the property is there21:51.36
  @ator @sebras21:51.51 
sebras dzonerzy (for the logs): I'll leave that for ator to figure out tomorrow. perhaps there's a limit of 2**31-1 elements in an array. I wouldn't be surprised. should mujs detect it and error out somehow? perhaps.21:59.42 
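For context, the ECMAScript specification only treats a property key as an array index when it is the canonical decimal string of an integer from 0 to 2**32 - 2; any other key is stored as an ordinary property and leaves length alone. The sketch below models that rule in Python with made-up names. Whether mujs applies exactly this limit or a smaller one, as sebras guesses, is not established in this log.

```python
def is_array_index(key: str) -> bool:
    """ECMAScript rule: canonical decimal string of an integer in [0, 2**32 - 2]."""
    if not (key.isascii() and key.isdigit()):
        return False
    if str(int(key)) != key:  # rejects non-canonical forms such as "01"
        return False
    return int(key) <= 2**32 - 2

class SketchArray:
    """Toy model of an array object: a property bag plus a length counter."""

    def __init__(self):
        self.props = {}
        self.length = 0

    def set(self, key, value):
        key = str(key)
        self.props[key] = value
        # Only genuine array indices can grow length.
        if is_array_index(key) and int(key) >= self.length:
            self.length = int(key) + 1

a = SketchArray()
a.set(0xFFFFFFFF + 1, 1)   # key "4294967296" is not an array index
print(a.length, "4294967296" in a.props)
```

Under that rule the observed behaviour (property present, length still 0) is what a conforming engine does for index 2**32, rather than a sign of corruption.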