| 20180105 |
tor8 | Robin_Watts: 3 commits for review on tor/master | 10:58.21 |
Robin_Watts | tor8: Including the random one? | 11:44.26 |
tor8 | all excluding the freetype update one | 11:44.44 |
Robin_Watts | I need to get the fix for cody in, and that's gated on the random one. | 11:44.45 |
tor8 | yes, that includes the random one | 11:44.55 |
Robin_Watts | I see it, thanks. | 11:45.34 |
| yeah, those 3 look good. ta. | 11:46.20 |
| I shall rebase the ones on robin/master then. | 11:46.29 |
| So there are 3 commits of mine on top of yours in robin/master ready for review. | 11:47.47 |
tor8 | Robin_Watts: Fix "being able to search for redacted text" bug. LGTM | 11:49.41 |
| Robin_Watts: "Enable saving of encrypted PDF files." is unchanged, right? | 11:52.05 |
Robin_Watts | yes. | 11:52.12 |
| but it too was gated on the random thing I think. | 11:52.20 |
tor8 | The "Add ascii option to PDF object output." is not what we discussed, and it's broken too. | 11:52.34 |
| the first two LGTM, but hold off on the "Add ascii option" commit | 11:52.59 |
| I thought you were making the 'if not ascii' option print raw unescaped binary strings to save space | 11:55.08 |
| it looks now like 'if ascii' we print ALL strings as <hexstrings> instead, which is not what we discussed (nor do I see the value of such a behavior) | 11:56.12 |
Robin_Watts | tor8: ok, then we were at cross purposes. | 11:57.46 |
tor8 | Anyway, after thinking about it over my vacation, I am happy with encrypted strings going out as your current "Enable saving" patch does it. | 11:59.01 |
Robin_Watts | ok, so I'll put that in and look again at the isascii patch. | 11:59.26 |
tor8 | I thought you were looking to squeeze more bytes out of it by saving unescaped binary strings. | 11:59.51 |
Robin_Watts | tor8: I wasn't aware that unescaped strings were actually an option. | 12:00.12 |
tor8 | where if the 'isascii' is false, fmt_str would not do octal escapes | 12:00.15 |
Robin_Watts | I'll need to reread that bit of the spec. | 12:00.38 |
tor8 | aha. then I see where we may have talked across each other, yes. | 12:00.38 |
| you can have all bytes in a PDF string except '(', ')', and '\' | 12:01.26 |
| parentheses must be balanced (or escaped) and backslashes must be escaped for obvious reasons | 12:01.45 |
Robin_Watts | tor8: Right, so it's simple enough. I'll put that on the list. | 12:03.26 |
tor8 | so we should end up with 3 ways to write strings: hexstrings, raw strings, and escaped strings. raw strings if !ascii, escaped/hex strings whichever is smaller if ascii. | 12:04.18 |
| and most things should default to ascii, IMO. | 12:04.53 |
| like in the JNI bindings | 12:05.04 |
Robin_Watts | tor8: I believe that's how I have it set up. | 12:07.02 |
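The three output forms tor8 describes (raw strings if !ascii, and the smaller of escaped or hex strings if ascii) can be sketched as below. This is a minimal illustration, not MuPDF's fmt_str: the names escaped_len and format_pdf_string are hypothetical, and for simplicity every parenthesis is escaped rather than checked for balance. One assumption goes beyond the chat: the raw form also escapes carriage returns, because a PDF reader treats an unescaped end-of-line inside a literal string as a line feed.

```c
#include <stddef.h>

/* Length of s written as an escaped literal string, parentheses included. */
static size_t escaped_len(const unsigned char *s, size_t n)
{
    size_t i, len = 2;
    for (i = 0; i < n; i++) {
        unsigned char c = s[i];
        if (c == '(' || c == ')' || c == '\\' || c == '\n' || c == '\r' || c == '\t')
            len += 2;           /* backslash plus one character */
        else if (c < 32 || c > 126)
            len += 4;           /* backslash plus three octal digits */
        else
            len += 1;
    }
    return len;
}

/* Hypothetical helper: writes the n bytes at s as a PDF string object into
   out and returns the number of bytes written (the raw form may contain NUL
   bytes). Returns 0 if cap is below the worst case of 4 * n + 3 bytes. */
size_t format_pdf_string(char *out, size_t cap,
                         const unsigned char *s, size_t n, int ascii)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t i, k = 0;

    if (cap < 4 * n + 3)
        return 0;

    if (ascii && 2 * n + 2 < escaped_len(s, n)) {
        /* Hex string: two digits per byte, wins for mostly binary data. */
        out[k++] = '<';
        for (i = 0; i < n; i++) {
            out[k++] = hex[s[i] >> 4];
            out[k++] = hex[s[i] & 15];
        }
        out[k++] = '>';
    } else {
        out[k++] = '(';
        for (i = 0; i < n; i++) {
            unsigned char c = s[i];
            if (c == '(' || c == ')' || c == '\\') {
                out[k++] = '\\';
                out[k++] = (char)c;
            } else if (c == '\r') {
                /* Escaped in both modes (assumption, see the note above). */
                out[k++] = '\\';
                out[k++] = 'r';
            } else if (ascii && c == '\n') {
                out[k++] = '\\';
                out[k++] = 'n';
            } else if (ascii && c == '\t') {
                out[k++] = '\\';
                out[k++] = 't';
            } else if (ascii && (c < 32 || c > 126)) {
                /* Always three octal digits so a following digit is safe. */
                out[k++] = '\\';
                out[k++] = (char)('0' + (c >> 6));
                out[k++] = (char)('0' + ((c >> 3) & 7));
                out[k++] = (char)('0' + (c & 7));
            } else {
                out[k++] = (char)c;   /* raw mode passes other bytes through */
            }
        }
        out[k++] = ')';
    }
    out[k] = '\0';
    return k;
}
```

With ascii set, the four bytes FF FE 00 80 come out as a 10-byte hex string, because the escaped form would need 18 bytes; with ascii clear, the same bytes take 6.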
Guest66018 | sebras, thanks that cleared some of it up | 15:08.57 |
| i guess what i'm trying to do is figure out why the example code i was given is producing entirely different output | 15:09.31 |
| here is what the code i was given produces: https://pastebin.com/Hyp5auPA | 15:09.39 |
| and here is what mutool show grep produces: https://pastebin.com/Bwmd5Svu | 15:10.49 |
| and I can't figure out how to reconcile them | 15:10.55 |
| ideally I need to produce the same output as the code I was given gets but it appears to be traversing things differently and getting different results | 15:11.25 |
| both of those were examining the same file btw | 15:11.40 |
sebras | Guest66018: the python code you show with the /Pages/Kids/Parent/MediaBox style "paths" is indeed resolving PDF object references and trying to express how these objects are related. | 15:12.21 |
| Guest66018: in the greppable output, if you look at object 1 (search for :1:) you can see that there is a /Metadata entry in that dictionary. | 15:14.04 |
| Guest66018: its value is 81 0 R which means that is an object reference to object 81. | 15:14.23 |
Guest66018 | right, object 1 maps to the first few paths | 15:14.24 |
sebras | Guest66018: next search for :81: | 15:14.27 |
Guest66018 | but after that it seems to diverge | 15:14.33 |
sebras | in that dictionary you have Length Subtype and Type entries. | 15:14.44 |
| now, in the python output if you look at e.g. /Metadata/PDFStream/Length you can see that it started in object 1, found the Metadata entry, realized that the Metadata entry points to an object which is a PDFStream and then lists the entries in that stream object's dictionary part. | 15:15.41 |
Guest66018 | okay, i follow that | 15:16.24 |
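The path resolution sebras walks through (start in object 1, find Metadata, follow the reference to object 81, list its entries) can be sketched with a toy object table. Everything here is hypothetical: the struct, the table, and lookup_path are made up for illustration, the entry values are invented rather than taken from the file under discussion, and none of it is MuPDF or pdf_paths.py code.

```c
#include <stddef.h>
#include <string.h>

/* Toy model: one row per dictionary entry. ref is the object number the
   value points at, or 0 when the value is a plain literal. */
struct entry { int obj; const char *key; const char *value; int ref; };

/* Invented contents: only the key names of object 81 come from the chat. */
static const struct entry toy_file[] = {
    { 1,  "Type",     "/Catalog",  0  },
    { 1,  "Metadata", "81 0 R",    81 },
    { 81, "Type",     "/Metadata", 0  },
    { 81, "Subtype",  "/XML",      0  },
    { 81, "Length",   "3072",      0  },
};

/* Resolve a slash-separated path such as "Metadata/Length" starting in
   object obj, following indirect references between path components.
   Returns the value text of the final entry, or NULL if the path fails. */
const char *lookup_path(int obj, const char *path)
{
    while (*path) {
        size_t len = strcspn(path, "/");
        const struct entry *hit = NULL;
        size_t i;

        for (i = 0; i < sizeof toy_file / sizeof toy_file[0]; i++) {
            const struct entry *e = &toy_file[i];
            if (e->obj == obj && strlen(e->key) == len &&
                memcmp(e->key, path, len) == 0) {
                hit = e;
                break;
            }
        }
        if (!hit)
            return NULL;            /* no such key in this object */

        path += len;
        if (*path == '/')
            path++;
        if (*path == '\0')
            return hit->value;      /* last component: report its value */
        if (!hit->ref)
            return NULL;            /* cannot descend into a literal */
        obj = hit->ref;             /* follow the reference, e.g. 1 -> 81 */
    }
    return NULL;
}
```

Calling lookup_path(1, "Metadata/Length") follows the reference from object 1 to object 81 and returns that object's Length value. Unlike pdf_paths.py, this sketch does not insert a synthetic "PDFStream" component into the path.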
| so if i am using the mupdf library, i can just extract the dictionary from the object? | 15:16.40 |
sebras | how does pdf_paths.py work in detail and what objects does it start with? I don't know. perhaps object 1 is the /Root object in the trailer of the PDF..? does pdf_paths.py ignore some objects? I don't know. :) | 15:16.46 |
| Guest66018: you can manipulate objects programmatically from C (or Java) yes. so you ought to be able to list the entries in objects. | 15:17.32 |
Guest66018 | pdf_paths does ignore some objects yes | 15:17.53 |
sebras | Guest66018: pdf_trailer() would e.g. give you the trailer that contains the /Root entry which is presumably object :1: in your particular file. | 15:18.03 |
Guest66018 | it starts with a pdfquery.PDFQuery(name).doc.catalog object | 15:18.38 |
sebras | Guest66018: pdf_paths.py also makes up fake object names in its paths like "PDFStream". | 15:18.47 |
Guest66018 | then after deque-ing it, walks it from there | 15:18.48 |
| ah, yes, it is doing exactly that | 15:19.16 |
| thanks, this helps a lot | 15:20.37 |
| i may have other questions if that's alright | 15:20.46 |
sebras | Guest66018: ok, so you need to open the document with something like fz_open_document(), next call pdf_specifics() to access the PDF part of the document, next call pdf_trailer(), and then you might need pdf_dict_get_key() to iterate, or pdf_dict_gets() if you already know the name of the thing. | 15:20.51 |
Guest66018 | also, "81 0 R" | 15:21.07 |
| 81 is the object index, what is 0 and R? | 15:21.13 |
sebras | 0 is the generation number. think of it like a version number. they used to be used when documents were updated, but recent PDFs don't really make use of them. | 15:22.19 |
| R means it is an indirect object reference. | 15:22.27 |
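The three parts of "81 0 R" (object number, generation number, and the R keyword) can be pulled apart with a few lines of plain C. This is only a sketch for reading the greppable output by hand: parse_indirect_ref is a hypothetical name, not a MuPDF function, and real code would let the library's own parser resolve references.

```c
#include <stdio.h>

/* Hypothetical helper: parse text such as "81 0 R" into an object number
   and a generation number. Returns 1 on success, 0 otherwise. */
int parse_indirect_ref(const char *s, int *num, int *gen)
{
    int end = -1;

    /* %n records how far the scan got; it stays -1 unless the literal
       'R' after the two integers was actually matched. */
    if (sscanf(s, "%d %d R%n", num, gen, &end) != 2)
        return 0;
    if (end < 0)
        return 0;
    return *num >= 0 && *gen >= 0;
}
```

For "81 0 R" this yields object number 81 and generation 0, while a line such as "81 0 obj" is rejected because the R keyword is missing.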
| you also have pdf_print_obj() (and fz_stdout()) if you want to print an object in its entirety (note that indirect references are not resolved) | 15:23.25 |
| Guest66018: I hope this will get you started. :) | 15:23.45 |
Guest66018 | thanks, this is much better start than where i was | 15:24.39 |
sebras | Guest66018: all of this presumes you are writing it in C. | 15:24.52 |
Guest66018 | yep, in C | 15:24.57 |
sebras | Guest66018: there ought to be similar calls in javascript which you can try using mutool run documented over at https://mupdf.com/docs/manual-mutool-run.html if you like. | 15:25.15 |
Guest66018 | thanks, i'll take a look at that too | 15:31.20 |
hellion | Hello! Has anyone on here successfully added mupdf to heroku for use with rails 5.2 ActiveStorage? | 17:07.18 |
Robin_Watts | hellion: Are you familiar with the works of Gary Larson? :) | 17:08.49 |
| https://c1.staticflickr.com/1/47/153603564_7281ad0588.jpg | 17:09.26 |
| "blah blah blah blah MuPDF blah blah blah" :) | 17:09.45 |
hellion | I am not familiar | 17:13.54 |
Robin_Watts | He drew a cartoon strip called "The Far Side" | 17:14.26 |
hellion | The Far Side....I do know him | 17:14.39 |
Robin_Watts | My point was that most of that question went straight over my head :) | 17:15.10 |
hellion | And, yet here you are....in the #mupdf discussion :) | 17:15.39 |
Robin_Watts | MuPDF was the one bit I understood. | 17:16.01 |
hellion | perhaps I should move over to an ActiveStorage discussion | 17:16.10 |
Robin_Watts | I know nothing about heroku or rails. | 17:16.22 |
| Best of luck. | 17:16.37 |
hellion | thanks! | 17:16.41 |