Log of #mupdf at irc.freenode.net.

20190610
pulsarpietro Hi all, I am experimenting with the MuPDF library and I've had a read of the docs folder; however, I couldn't find the API documentation.08:34.24
kens [ul08:34.38 
pulsarpietro Is it somewhere in the source path ? Or is it generated at build time ?08:34.50
kens All the documentation is, as far as I know, in the source tree, either as PDF or in the header files. I am not, however, a MuPDF developer08:35.25
  Everyone is travelling back from a staff meeting currently, so it may be a few hours before they are able to respond08:35.50
  If you stick around, or read the logs at your convenience, someone more knowledgeable than me will reply08:36.16 
pulsarpietro Thanks a lot08:36.53 
  are you guys based in the US ?08:50.11 
kens All over the place :-)08:50.21 
pulsarpietro eheh OK08:50.27 
kens The MuPDF developers are mostly European08:50.33 
pulsarpietro So no chance I can get the right timetable08:50.42 
  ah OK. That's good I am based in Europe too08:50.56 
  *right timezone08:51.01 
kens Well, there's someone around 'most' of the time. One of the MuPDF developers spends up to 6 months at a time in Taiwan, one is in Sweden, one in the UK, some in California08:51.23 
  But when we have staff meetings we are all in the same place. Then we all have to get home, and people often tack a little holiday on the end too, if we are somewhere interesting for the meeting08:52.25 
pulsarpietro Have you ever built the library from sources ?09:02.54
kens I do fairly frequently, yes09:03.06 
pulsarpietro What's the target you give out ? I am using the default but apparently it fails with an obscure assembler error09:04.04 
  Fatal error: can't create build/release/thirdparty/freeglut/src/fg_callbacks.o: Permission denied09:04.05 
kens You must be using some flavour of Linux, I usually use Windows09:04.36 
Robin_Watts That's not an assembler error.09:04.44 
pulsarpietro Right, I use Debian09:04.44 
Robin_Watts That's a permission error.09:04.52 
kens Morning Robin_Watts09:04.56 
Robin_Watts which suggests that you have a problem with the writability of that file.09:05.16 
pulsarpietro Indeed, a bit of context may help09:05.16 
Robin_Watts That's out of our control.09:05.21 
  kens: Morning.09:05.24 
pulsarpietro CC build/release/thirdparty/freeglut/src/fg_callbacks.o Assembler messages: Fatal error: can't create build/release/thirdparty/freeglut/src/fg_callbacks.o: Permission denied Makethird:372: recipe for target 'build/release/thirdparty/freeglut/src/fg_callbacks.o' failed make: *** [build/release/thirdparty/freeglut/src/fg_callbacks.o] Error 209:05.25 
  ahhrg, bad formatting. Sorry09:05.34 
Robin_Watts is not here for another hour at least, really.09:05.47 
pulsarpietro Assembler messages:09:06.02 
  Fatal error: can't create build/release/thirdparty/freeglut/src/fg_callbacks.o: Permission denied09:06.03 
  Oks, thanks. Will double check my stuff.09:06.14 
kens Well, it 'seems like' the compiler can't create the file, either because the directory doesn't exist, or you don't have write permissions in the directory09:06.15
  Hmm, wait, why is it creating the .o file in the *src* directory ?09:06.44 
  Did you pull the sources from our Git repository ?09:07.10 
kens boots up a Linux09:07.36 
pulsarpietro Yes but wait - I was the silly man in the room09:08.31 
  :-)09:08.39 
kens Ah, well I needed to clone the repository in my current setup anyway09:08.55
pulsarpietro It's all right. I probably started a build as a root user before - while installing the needed packages.09:09.12 
  sorry about that09:09.21 
kens Oh, that would probably not work yes09:09.23 
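A build started as root typically leaves root-owned files in the output directory, which then produce "Permission denied" for a normal user. The following is a minimal sketch for spotting such files, assuming a Unix-like system; the function name `files_not_owned_by_me` is made up for illustration, and `build` is simply the directory named in the error message.

```python
import os


def files_not_owned_by_me(root: str) -> list[str]:
    """Return paths at or under root whose owner is not the current user."""
    me = os.getuid()
    # Include the root directory itself, if it exists.
    paths = [root] if os.path.lexists(root) else []
    for dirpath, dirnames, filenames in os.walk(root):
        paths.extend(os.path.join(dirpath, name) for name in dirnames + filenames)
    # lstat() inspects symlinks themselves rather than their targets.
    return [path for path in paths if os.lstat(path).st_uid != me]


if __name__ == "__main__":
    # "build" is the output directory seen in the error message (assumption:
    # the script is run from the top of the source tree).
    for path in files_not_owned_by_me("build"):
        print(path)
```

Anything it prints would need its ownership fixed, or the build directory removed, before rebuilding as a normal user.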
  Not a problem, I should have a MuPDF available to me in Linux as well, but I upgraded Ubuntu a while back and hadn't gotten round to it yet09:09.54 
  So this at least prompted me to do that09:10.08
  cd mupdf09:10.25 
  LOL09:10.30 
pulsarpietro I downloaded the 1.15.0 source tarball, I am not on mainstream. I think it is reasonable to tell you what I am trying to experiment with. I'd like to "open" a PDF and scan for all objects in a given page, gathering information when I want to, this is not for a specific reason as yet but more for my understanding. An initial experiment would be to see if there is a recurrent object in all pages which can be identified as a waterm09:14.59 
  Please tell me if you reckon I am completely crazy and what I am trying to do makes no sense, it would help me as to know that I don't know enough ... it's something I guess09:16.03 
kens I'm 'reasonably' sure you can do that, but I'm no expert on the internals. A watermark might be a single object common to all pages, or it might be a number of (identical) objects one per page. It would be easy to spot the first case, less so for the second. Though an annotation would be easier to see.09:16.28
  I just pulled the current HEAD from Git and it's compiling now09:17.22
  Hmm, no GL, so it failed to build09:18.26 
pulsarpietro I needed to install CCcc201809:20.06 
  ops09:20.09 
  libglu1-mesa-dev freeglut3-dev09:20.19 
kens It's OpenGL I'm missing, just installing it09:20.22
  <sigh> one step forward.....09:22.06 
pulsarpietro libxrandr-dev libxi-dev09:25.28 
  that is what I needed to install as well09:25.33
kens I need X1109:25.40 
pulsarpietro would you have, off the top of your head, a good example of a C file which traverses a PDF file and gets some data out of it ? :-)09:26.21
kens Not me, sorry09:26.38 
pulsarpietro np09:26.45 
kens OK looks like that build completed09:28.10 
pulsarpietro ;-)09:37.29 
ator jarindyk2: the mutool run javascript tool can be used to traverse and print the values of form fields fairly easily09:37.49 
  jarindyk2: there's also "mutool show file.pdf form" that prints all the form fields (but I'm not sure if it prints the current value, if not that should be trivial to add, I use it to find out what javascripts and actions are hooked up to the fields when debugging files)09:38.36 
kens Morning ator, was your flight OK ? Looked busy.....09:38.38 
ator kens: very busy, very late09:38.50 
  but we got home eventually09:38.55 
kens :-(09:38.56 
  30 minute delay for us09:39.18 
ator yeah, I saw your flight (and many others) having delays09:39.46 
  and passport control back in sweden was another 30 minute wait in line... that's a first :(09:40.01 
kens 30 minutes at the end of the day is not unexpected really. Of course we didn't have to do passport control....09:40.18
ator kens: no, but usually you have 10hrs of flying to do, 30 minute delay on a 30 minute flight is quite a bit more proportionally :)09:40.56 
kens Well, it's a 70 minute flight, but yes09:41.15
ator kens: oh, that long?09:41.25 
kens We should have a meeting in Sweden. Or at least Copenhagen09:41.32 
  ator most of the time is spent going up and then down.09:41.43 
ator kens: yeah. and circling waiting for a landing slot :)09:41.57 
kens It's 300+ miles, so that's only 30 minutes at cruising speed09:41.59
  Gatwick is better than Heathrow when it comes to holding patterns :-)09:42.39 
ator pulsarpietro: there are some undocumented features that could help with what you're wanting to do, if you're willing to dig through the source a bit09:44.25 
  I'm working on a new set of API documentation, but it's far from complete yet.09:44.50 
  https://ghostscript.com/~robin/mupdf_explored.pdf can also be a handy bit of documentation. bits of it may be a bit out of date, but as a general overview it should be good.09:45.58 
  section 31.4 "PDF Operator Processors" is what I think may help you analyse and find common patterns09:47.15
  probably easier than at the "Device" level where it's all baked down into low level graphics drawing commands09:47.41 
pulsarpietro many thanks indeed. I will have a look at both. I'd like to dig into the sources but it's probably going to take me a LONG time.09:48.20
  I am still reading a lot of the PDF standards and the book "Developing with PDF" which I am finding very useful09:48.50
kens Hmm, don't know that one09:49.28
pulsarpietro If you think the easiest way to start off is to use the mutool run javascript I'd go for it09:49.30 
ator pulsarpietro: my suggestion would be to read the PDF Reference 1.3 (the last "good" version before Adobe started bloating the format with lots of useless features)09:49.31 
pulsarpietro I need to start somewhere :)09:49.34
ator it's short and readable, and if you understand PDF 1.3, then you've got most of the basics09:49.51 
pulsarpietro Which is the one I've got here, only reason is that it was the cheapest :)09:50.04 
ator PDF 1.4 adds a horribly complicated transparency model09:50.14 
  PDF 1.5 adds compressed object streams, which are just bleh, but trivial to understand if you already know the PDF format09:50.45 
  PDF 1.6 and 1.7 just add more annotation types and encryption algorithms09:51.01 
  https://www.adobe.com/devnet/pdf/pdf_reference_archive.html here's where you can download the different specs09:51.39 
pulsarpietro cheers09:51.59 
kens ator, ran across this interesting SO question regarding Acrobat and encryption09:53.09 
  https://stackoverflow.com/questions/56507280/trying-to-figure-out-why-a-pdf-is-invalid-for-acrobat-reader-but-opens-fine-in-a?noredirect=1#comment99621141_56507280:09:53.09 
  Looks like Acrobat doesn't like Version 2 encryption handler with a PDF 1.5 file using compressed objects and xref09:53.39 
  GS and MuPDF, of course, are entirely happy with it09:53.50 
  But for creation of encrypted PDF files, it may be something to consider, maybe sebras should look at this with his work on producing encrypted PDF files.09:54.55
ator kens: huh, that's ... sad but not surprising09:55.40 
kens Yeah :-(09:55.48 
ator we don't create compressed object streams, but should we do so, that'd definitely be something to keep in mind09:56.00
kens I can't prove the hypothesis, but everyone except Acrobat can open the file, and I can't make a file like that from Acrobat09:56.11 
  Ah, wasn't aware you didn't do compressed streams yet. That's a TODO for me too09:56.30 
ator kens: the whole "new" security handler stuff they added in recentish versions of adobe with the per stream CryptFilt stuff seems not very well thought through09:57.01 
  kens: I hate the very idea, it makes the files just gobble up more memory09:57.16 
kens I'd say the whole encryption thing is not well thought through :-)09:57.24 
ator if you're concerned about transfer bandwidth, web servers can do gzip compression, and if you worry about disk space, stop creating hugely bloated PDF files :)09:57.47 
  kens: yeah. there's just so many ways the information can leak.09:58.04 
kens Totally true. Also the compressed objects and xrefs really don't save much, generally, which is why I've not put any priority on it09:58.24
ator IMO it's only there to "enforce" the permissions by making it a lot of work to get around it09:58.43 
  s/O/belief/09:59.08 
kens And there are plenty of sites that will do it for you, or tell you how.09:59.08
ator kens: or just run it through any open source software....09:59.23 
kens Yes, many of them recommend using GS09:59.33 
ator now you can use mutool clean -D (or -E with a new password, using sebras' code) to strip or change the password, but don't mention it to management.10:01.07 
kens :-D10:01.18 
ator convincing them what a pointlessly bad idea it would be to start enforcing permissions in open source software would be ... a waste of everybody's time.10:01.58 
kens Well, GS does enforce permissions, but if you open a PDF file and run it through pdfwrite, you end up with a PDF file with no restrictions10:02.45 
  Assuming you don't need a user password to even open it of course10:03.05 
ator kens: how do you enforce the "no-print" permission?10:03.15 
kens I think we simply refuse to process the file10:03.34 
  Because we are, after all, a 'printer'10:03.43
ator kens: so how do you run it through pdfwrite in that case?10:03.54 
kens Well in that case you can't of course.10:04.05
ator kens: supporting 'no-annotate' is trivial in GS :)10:04.15 
kens But as you say, it's trivial to change the software10:04.15
  Actually, we don't support no-annotate, and with GS and pdfmark, you can do it!10:04.45 
ator kens: oh! :)10:04.57 
kens I really don't propose to try and deal with that, we use pdfmark internally for too many things. Like passing existing annotations10:05.46 
ator yeah, and as you said, in open source software, it's trivial to remove any such checks anyway, so why waste time making life harder for users?10:07.16 
pulsarpietro hi all, I am trying to use the mutool run trace-device.js but I get an error when trying to do so. I apologise in advance if I haven't spotted a silly mistake I've made ..10:42.58 
  mutool run docs/examples/trace-device.js ~/pdfs/minimal.pdf 110:43.07 
  ReferenceError: 'scriptArgs' is not defined10:43.19 
ator pulsarpietro: sounds like your mutool is too old10:52.48 
  pulsarpietro: what does 'mutool -v' say?10:52.52 
pulsarpietro 1.9a10:58.08 
  I've got this tarball, is it too old mupdf-1.15.0-source ?10:58.35 
ator well, 1.9 is several years old by now11:00.30 
  that tarball is new enough, but that's not the mutool you ran (it should print "1.15.0")11:00.58 
pulsarpietro shall I clone the GIT repo and jump to a branch ?11:01.00 
  oh dear11:01.19 
  forgot about it11:01.23 
  forget about it - I meant to override the path but I may have opened another terminal ...11:01.43 
  sorry about that II11:02.29 
sebras Robin_Watts: did you ever try Alexei Podtelezhnikov's suggestion of including intrin.h and using some pragma?11:29.50 
Robin_Watts sebras: No. It's on my list to try.11:32.03 
sebras Robin_Watts: ok, I was worried that you might forget after the meeting. :)11:32.30 
  (and having forgotten myself, I found the conversation in my inbox just now)11:32.50 
pulsarpietro hello, sorry for hammering here. I am getting a bit lost in the pdf-run/murun.c files. Is there a quick win to show, for each page leaf, the objects referenced in the "Contents" key using the javascript binding ?12:07.02
sebras Robin_Watts: that result still means that the change is in freetype, right?12:13.14 
Robin_Watts sebras: Yes. We still need to have a slightly changed freetype thirdparty thing until we take a new release from them.12:13.43 
ator pulsarpietro: the "Contents" entry of a page object is a stream, which contains drawing commands12:14.28 
Robin_Watts but I was figuring we'd wait for them to put a commit on their dev branch, then cherry pick that onto a branch that hangs off the last release tag.12:14.29 
  then use that.12:14.35 
sebras Robin_Watts: alright, I'll keep my eye out for that one, and update the thirdparty html accordingly. ;)12:14.40
Robin_Watts sebras: Cool.12:14.47 
ator pulsarpietro: some drawing commands can refer to other resources (fonts, images, other content streams) that are defined in the page Resources dictionaries12:15.17 
sebras Robin_Watts: seems like it is this one..? https://git.savannah.gnu.org/cgit/freetype/freetype2.git/commit/?id=e13c1f46dc1afb1b2287849be5fa74ef70e0607b12:15.40 
Robin_Watts sebras: Looks like it, yes.12:16.02 
  Do you want to handle that or should I?12:16.23 
ator pulsarpietro: if you want to look at the contents of a page, the 'docs/examples/trace-device.js' is one place to start12:16.51
sebras Robin_Watts: you can compile test this, so it is better if you do it.12:16.54 
Robin_Watts sebras: OK. I'll get a review up after lunch.12:17.06 
sebras Robin_Watts: I was referring to the 2.10.1 or 2.11.0 upstream tag. :)12:17.08 
pulsarpietro Doesn't it reference a stream, or an array of streams ? It may be me referring to an older PDF spec though.12:18.25
  https://web.archive.org/web/20101214132912/http://partners.adobe.com/public/developer/en/pdf/PDFReference13.pdf12:18.25
  page 620 (or 604 in the actual document)12:18.43 
Robin_Watts Page contents are a stream or array of streams, yes12:19.06 
  The trace device converts those to a printable list of graphics operations.12:20.05 
pulsarpietro the trace-device goes really low level I reckon, right into the stream's commands. In my silly attempt I would like to print all *references* contained in the "Contents" entry for all page leaves.12:21.34
  *The silly example I am working at ...12:22.00 
  In other words, I need only to access the Page Objects12:23.01 
ator pulsarpietro: if you run the file through "mutool clean -d input.pdf output.pdf" then you can open up the 'output.pdf' file in a text editor and have an easier way of looking at it12:23.49 
  the '-d' option decompresses all the streams, so you can look at them in the editor without compression getting in the way12:24.11 
  the "mutool show file.pdf pages/1" will display the page object for page 112:25.35 
  mutool show file.pdf pages/1/Contents will show the content stream12:25.56 
  mutool show file.pdf pages/1/Resources will show the resource dictionaries12:26.06 
  etc.12:26.06 
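For the narrower goal of listing the references held in a page's Contents entry, here is a minimal sketch that pulls indirect references out of the entry's value as it appears in PDF syntax (either a single `N G R` or an array of them). It assumes you already have that value as text, for example copied from a decompressed file; the function name `contents_refs` is hypothetical and this is not part of MuPDF.

```python
import re

# An indirect reference in PDF syntax: object number, generation number, "R".
_REF = re.compile(r"(\d+)\s+(\d+)\s+R\b")


def contents_refs(value: str) -> list[tuple[int, int]]:
    """Return (object, generation) pairs found in a Contents value."""
    return [(int(num), int(gen)) for num, gen in _REF.findall(value)]


# A single stream reference.
print(contents_refs("12 0 R"))
# An array of stream references.
print(contents_refs("[ 364 0 R 359 0 R 63 0 R 360 0 R 365 0 R ]"))
```

Each pair identifies one content stream object, which can then be looked up in the file.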
sebras ator: how about the first (two?) commits on sebras/master?13:23.31 
ator sebras: LGTM.13:25.39 
Robin_Watts ator, sebras: http://git.ghostscript.com/?p=user/robin/mupdf.git;a=commitdiff;h=3c75966a611468faeac8d8ef290817df508dd50c13:39.51 
pulsarpietro ator: thanks13:40.00 
ator Robin_Watts: LGTM13:41.42 
Robin_Watts Ta. I'll push it once the (belt and braces) cluster run finishes.13:42.26 
sebras Robin_Watts: LGTM2.13:47.20 
Robin_Watts Ta2.13:47.37 
pulsarpietro ator: can I render a single content item only on the screen somehow ? For instance if I've got an array of Contents which is [ 364 0 R 359 0 R 63 0 R 360 0 R 365 0 R ] and I want to render them one by one on the screen to see they are.13:56.11 
  *what they are13:56.27 
kens Not reliably13:56.40 
  Each content stream need not be independent13:56.49 
  The graphics state may be set up by a prior content stream, and relied upon by a future one13:57.16 
Robin_Watts pulsarpietro: As an example, your complete stream might be "0 0 100 100 re f" say.13:57.31
pulsarpietro yeah it makes sense13:57.46 
Robin_Watts And your first stream could be "0 0 1" and the second "00 100 re f"13:57.50 
  You can't even assume it'll break at a token boundary.13:58.10 
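Robin_Watts' example can be reproduced with a toy tokenizer. This is only a sketch: it splits on whitespace, which is far cruder than real PDF lexing, and it assumes the parts are joined byte for byte with no separator (a real reader may handle the join differently).

```python
def tokens(stream: str) -> list[str]:
    """Split a content stream on whitespace (a crude stand-in for PDF lexing)."""
    return stream.split()


# The two partial streams from the example above.
parts = ["0 0 1", "00 100 re f"]

# Tokenized separately, the split number shows up as "1" and "00".
print([tokens(part) for part in parts])

# Joined with no separator, the pieces form the intended operand "100".
print(tokens("".join(parts)))
```

Read on its own, neither part contains the rectangle that the joined stream draws, which is why rendering one Contents entry at a time is unreliable.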
pulsarpietro My naive idea was that a watermark would be a content item listed in all pages13:58.44
Robin_Watts pulsarpietro: Are you looking to REMOVE watermarks?13:59.07 
pulsarpietro but it could be anything, "embedded" within all page's streams13:59.19 
Robin_Watts Yes.13:59.23 
pulsarpietro I am playing with it yes13:59.28 
Robin_Watts It's a fairly frequent thing that the first thing a stream does is to clear the entire backdrop with a fill operation.13:59.52
  (It's not required, but lots of PDF producers do it).14:00.04 
  so any watermark done by prepending a "draw the watermark" bit of content, then the original content wouldn't work.14:00.32 
  so watermarks are more likely to be done using transparency operations AFTER the page content has been drawn.14:00.53 
  But, you can't be sure that the PDF graphics state at the end of the page writing will be sensible.14:01.28 
pulsarpietro Oks - I am completely off track then. I did not expect it to be simple though.14:01.34 
Robin_Watts So you might think you can do "q" <original PDF content> "Q" <watermark content>14:02.04
  BUT even that can fail, as the original PDF content might not have matching q/Q counts.14:02.24 
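A rough way to see whether a decompressed content stream has matching q/Q counts is to track the nesting depth. The sketch below is a naive check, not MuPDF's filter processor: it splits on whitespace, so a `q` or `Q` inside a string literal or inline image data would be miscounted. The function names are made up for illustration.

```python
def q_balance(stream: str) -> tuple[int, int]:
    """Return (final_depth, lowest_depth) of q/Q nesting in a content stream."""
    depth = 0
    lowest = 0
    for token in stream.split():
        if token == "q":      # save graphics state
            depth += 1
        elif token == "Q":    # restore graphics state
            depth -= 1
            lowest = min(lowest, depth)
    return depth, lowest


def is_balanced(stream: str) -> bool:
    """True if every Q has an earlier q and no q is left open."""
    depth, lowest = q_balance(stream)
    return depth == 0 and lowest == 0


print(is_balanced("q 1 0 0 1 10 10 cm 0 0 100 100 re f Q"))
print(q_balance("q 0 0 100 100 re f"))
print(q_balance("0 0 100 100 re f Q"))
```

A stream that ends with a non-zero depth, or dips below zero, is one where wrapping the original content in an extra q/Q pair would not isolate it as intended.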
  This is part of the reason that our pdf filter processor exists; so we can sanitize existing streams to make sure they are 'sane' so we can append watermarks etc properly.14:03.23 
pulsarpietro This is complicated stuff, it looks like.14:06.21
kens It would be a lot less complicated if PDF producers created decent files14:06.49 
pulsarpietro sorry folks, I run mutool clean -d input.txt output.txt 114:09.32 
  but the output file is unreadable (vim). The input source is uncompressed, not sure if that makes any difference14:10.22 
  and by saying that I mean that I can read PDF objects (tags and this kind of stuff, not streams). I'd like to see the streams' contents (aka the operations)14:11.33 
kens objects are often uncompressed (compressed object and xref streams are a PDF 1.5 feature) so that doesn't mean it's an uncompressed file.14:12.09
  When you decompress it any images and fonts will be uncompressed as well14:12.30
  Which means you end up with a file with loads of binary in it, this often confuses editors14:12.46 
pulsarpietro is there a specific tool for handling "source" pdfs ?14:14.57 
kens Not sure what you mean.14:15.13 
Robin_Watts mutool clean -difggg input.txt output.txt 114:15.50 
  D'Oh. wait.14:16.03 
  mutool clean -difggg input.pdf output.pdf 114:16.11 
kens was still reading the 'help'14:16.31 
Robin_Watts That takes input.pdf and produces output.pdf from page 1 of it.14:16.34 
  -d says "decompress the contents"14:16.47 
  -i says "don't decompress images"14:16.54 
  -f says "don't decompress fonts"14:17.02 
pulsarpietro Yeah I am reading the documentation14:17.13 
Robin_Watts -ggg says "garbage collect away as many objects as possible"14:17.19
pulsarpietro It does work14:17.29 
  :)14:17.34 
Robin_Watts So output.pdf should be a fairly readable form.14:17.51 
  If you throw in a -s, then we'll "sanitize" the page contents too.14:18.09 
  but then you're not really looking at the source.14:18.25 
ator pulsarpietro: you can add -a to the mutool clean options, to make sure everything is ASCII (but this will sometimes hide stream contents by asciihex encoding them if they have binary data)14:37.40 
pulsarpietro Guys - I'd like to thank you for your generosity. I know it can be a pain to explain things to a newbie. I've managed to remove the watermark from my pdf, as it is an image listed under XObject and it is applied using transparency. I don't think it is a universal solution but it does the job in my case and, more importantly, I got a better understanding of the document's structure.14:54.46
kens It's always good to have a project to give you a goal14:55.19
  It's not entirely unusual to have watermarks defined that way, but they are relatively easy to remove; the producers got wise to that and made it harder in later versions of their software14:55.55
sebras wrt 701182 I was also thinking about named destinations and outlines which I suspect may both leak information.14:58.01 
  to make redactions water tight is a hard problem imho.14:59.06 
ator sebras: garbage collecting everything not referenced anymore after redacting would probably be necessary -- name tables, resources, table of content outline entries, etc.15:01.19 
  name trees*15:01.31 
sebras ator: yes, and since redaction is an internal destructive operation I'd expect that removing anything unused (but not actively redacted) would not pose a problem.15:06.51 
pulsarpietro hello, I've stumbled upon a bizarre behaviour of mutool clean - if I give to it an awkward file name it seems to hang. Try with mutool clean -asdifggg pdffile ./pdf.p59B6bGnE for example.17:13.44 
ator pulsarpietro: try with one fewer 'g'17:21.18 
  the first two are useful, the third one makes things a *lot* slower17:21.44 
  I do not know why it hangs with three g's, that would be a question for Robin_Watts17:22.17 
  the -ggg deduplication can make for some fairly small savings, but at a huge processing cost17:22.54 
Robin_Watts Really? The filename makes a difference?17:22.59 
ator Robin_Watts: no, but -ggg with those other options does17:23.09
Robin_Watts ator: pulsarpietro seems to be claiming the name makes a difference.17:24.05 
pulsarpietro Nono here the filename makes the whole difference17:24.09 
Robin_Watts I agree that using 3 g's will make it much slower (so much so, potentially, that you might think it's hung)17:24.33 
pulsarpietro I've cut down my doc to 2 pages to make that clear17:24.57
  mutool clean -asdifggg TAKKO_1Q17.pdf.sub ./pdf.p59B6bGnE17:25.05 
  that does not terminate - waited for a few mins17:25.15 
  time mutool clean -asdifggg TAKKO_1Q17.pdf.sub ./a.pdf real 0m1.104s user 0m1.000s sys 0m0.088s17:25.29
  I can't rule out I am making some silly mistake, but I can't see what at the moment17:25.51 
ator pulsarpietro: it does not treat your second argument as an output filename; since it doesn't end with ".pdf" it tries to parse it as a page number/range17:26.02 
pulsarpietro mutool version 1.15.017:26.02 
Robin_Watts ator: Ah...17:26.25 
pulsarpietro right so the extension does make the difference17:26.32 
ator I just stumbled on a file which hangs (or takes more than a minute) to do -ggg17:26.38 
Robin_Watts ator: pdf_reference17.pdf :)17:26.52 
ator pulsarpietro: it gets stuck parsing your filename as an infinite sequence of "1-1" page ranges17:28.21 
  the page range parsing doesn't do a lot of error checking :)17:28.31 
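To illustrate the kind of error checking being discussed, here is a sketch of a strict page-range parser that rejects anything that is not a number or a `first-last` pair. It is not MuPDF's parser and deliberately supports less than mutool's page syntax; the function name `parse_page_ranges` and the accepted grammar are assumptions made for the example.

```python
import re

# One list item: a page number, optionally followed by "-" and a second number.
_RANGE = re.compile(r"(\d+)(?:-(\d+))?")


def parse_page_ranges(spec: str, page_count: int) -> list[tuple[int, int]]:
    """Parse "1,3-5" into [(1, 1), (3, 5)], raising ValueError on bad input."""
    ranges = []
    for part in spec.split(","):
        match = _RANGE.fullmatch(part.strip())
        if match is None:
            raise ValueError(f"not a page range: {part!r}")
        first = int(match.group(1))
        last = int(match.group(2)) if match.group(2) else first
        if not (1 <= first <= page_count and 1 <= last <= page_count):
            raise ValueError(f"page out of bounds: {part!r}")
        ranges.append((first, last))
    return ranges


print(parse_page_ranges("1,3-5", 10))

try:
    parse_page_ranges("./pdf.p59B6bGnE", 2)
except ValueError as error:
    print("rejected:", error)
```

With a check like this, an output filename mistaken for a page list fails immediately instead of looping.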
  Robin_Watts: almost, pdfref17.pdf cut down to the first page by a previous mutool clean :)17:28.59 