Log of #mupdf at irc.freenode.net.

 <<<Back 1 day (to 2021/01/13)Fwd 1 day (to 2021/01/15)>>>20210114 
artifexirc-bot <Robin_Watts> @fredrossperry No, the freetype submodule is different. I'll investigate more in the morning.00:21.01 
  <Robin_Watts> malc_: Yes, stuff has been force pushed, and likely more will be tomorrow.00:21.46 
  <Robin_Watts> If I was you, I'd wait for it to settle down, then we'll help you.00:22.01 
malc_ Robin_Watts: okay. still i'm curious if there is a "blessed" way to do that00:27.15 
artifexirc-bot <Robin_Watts> There are many ways to do it.00:40.18 
  <Robin_Watts> Git fetch, then git rebase is what I'd do.00:40.30 
malc_ Robin_Watts: thanks. i would have done: "git reset --hard origin/master"01:07.42 
artifexirc-bot <Robin_Watts> @sebras ping13:14.11 
  <sebras> @Robin_Watts pong13:17.43 
  <Robin_Watts> @sebras Sanity has hopefully been restored to the cluster.13:17.55 
  <Robin_Watts> So if you want to rebase your stuff, and then push the reviewed commits, they can go in.13:18.18 
  <sebras> so freetype wasn't to blame for the issues?13:20.45 
  <sebras> (I see that it has moved to a new commit, but is still in)13:21.08 
  <Robin_Watts> @sebras There is an issue with the cluster nodes not updating submodules correctly/consistently when they change.13:22.56 
  <Robin_Watts> I've solved it by forcing a deletion. It's ugly but it'll work.13:23.19 
  <sebras> ok, so this needs to be done manually whenever we update submodules (until the underlying reason is found)? i.e. should I be extra wary when updating submodules in the future?13:25.44 
  <Robin_Watts> No. The cluster now contains lines to delete the submodule on every update.13:27.23 
  <Robin_Watts> I've traded slightly more work at the start of each run for reliability.13:27.51 
  <Robin_Watts> at some point someone could try to understand the problem more, but... not me, not now. What I have should be safe.13:28.22 
artifexirc-bot <Robin_Watts> lunches13:28.28 
  <sebras> @Robin_Watts ok, I just needed to know if I should be doing the updates more carefully in the future. I wasn't complaining. 🙂13:33.31 
pedr0 hi all - quick question about the new version of mutool which uses tesseract - is it tied to a specific version of it ?13:47.25 
  I am curious to know if it uses the latest version of it - 4.13:47.43 
artifexirc-bot <KenSharp> We appear to be using V 4.1.013:53.28 
  <KenSharp> Leptonica appears to be version 1.79.013:54.20 
  <KenSharp> To be honest, I'd suggest using the code we ship at the moment13:56.07 
pedr0 Thanks for that, yes I hoped V4 would be in use.13:57.21 
  is there an example somewhere which shows the new OCR feature ?14:06.29 
sebras @KenSharp I had to apply a fix to mupdf to get it to work both with the version of tesseract used there as well as the version on my system.14:14.32 
artifexirc-bot <KenSharp> @sebras you were trying to use a system library ?14:15.14 
  <KenSharp> pedr0 yes there's an example, but I can't help you, not a MuPDF developer 🙂14:15.40 
  <KenSharp> Maybe sebras can tell you14:15.46 
sebras @KenSharp of tesseract, yes. mupdf can do both.14:16.10 
pedr0 where is it ? I can't find it. I can find the file that contains the device 'fz_new_ocr_device' though14:16.33 
artifexirc-bot <Robin_Watts> pedr0: We can work with either tesseract 4 or 5, I believe.14:16.59 
sebras pedr0: we mainly test with the version of tesseract bundled as a git submodule. I have previously successfully built with version 4.x installed on my system.14:17.13 
artifexirc-bot <Robin_Watts> pedr0: source/fitz/ocr-device.c contains fz_new_ocr_device14:18.07 
  <KenSharp> @Robin_Watts there's a tesseract 5 ? I didn't see it on Github14:18.32 
  <KenSharp> ah, pre-release14:18.52 
  <Robin_Watts> Tesseract 5 is the current development version. Stable releases are of 4.14:19.01 
pedr0 to build it with tesseract - do I merely need to follow cat thirdparty/tesseract.txt and build as usual thereafter ?14:59.04 
artifexirc-bot <KenSharp> @pedr0 I'm sorry I can't answer that, let me ping @Robin_Watts15:00.31 
  <Robin_Watts> pedr0: If you're building from our sources then yes.15:01.54 
  <Robin_Watts> Though that file looks truncated. 😦15:02.42 
  <KenSharp> Really ? It looked sufficient to me, but then what do I know 😄15:04.11 
  <Robin_Watts> It ends with "e.g." and no following example 🙂15:04.55 
pedr0 I must be doing something wrong15:05.00 
  make: *** No rule to make target 'build/release/thirdparty/tesseract/src/ccutil/globaloc.o', needed by 'build/release/libmupdf-third.a'. Stop.15:05.00 
artifexirc-bot <KenSharp> @Robin_Watts my copy has an example using wget15:05.26 
  <KenSharp> e.g.15:05.51 
  <KenSharp> wget https://github.com/tesseract-ocr/tessdata_fast/raw/master/eng.traineddata15:05.53 
  <Robin_Watts> Ah, the lack of a newline is confusing less 🙂15:05.54 
pedr0 yeah it isn't too clear about the training data, but I must admit it's the first time I came across tesseract and I may not be able to connect dots very well ...15:05.57 
artifexirc-bot <KenSharp> Oh! I was using notepad....15:06.14 
  <KenSharp> @pedr0 AIUI Tesseract needs to know which language you are using in order to intelligently guess at the text, so you need to supply it with training data for the language15:07.21 
pedr0 all I did was to clone repos following the file's instructions and running a regular build 'make' no specific flags15:07.24 
  I see15:07.51 
artifexirc-bot <KenSharp> I'm not currently running Linux, but let me give it a try on Windows15:07.51 
pedr0 the file downloaded by wget - where does it need to be ?15:08.06 
artifexirc-bot <Robin_Watts> pedr0: I can't see a reference to globaloc anywhere in the Makefiles.15:08.14 
  <Robin_Watts> What SHA are you on?15:08.29 
pedr0 you meant commit id ? I think I've downloaded the sources from the release page15:09.28 
  and now I think about it I've the feeling you've explained to me all this already in the past :(15:09.48 
  what branch do I need to check-out ?15:10.27 
artifexirc-bot <KenSharp> I've got master checked out15:10.38 
  <KenSharp> Give me a second until this clone of leptonica finishes and I'll tell you what SHA I have15:11.00 
  <KenSharp> My SHA is 4e3019fcb28d52468ddeeda0cd8c5980b3fb10e915:11.26 
  <KenSharp> fixed detection of out-of-range when converting float15:11.46 
pedr0 do I need to place the file downloaded by wget in a specific folder ?15:13.09 
artifexirc-bot <Robin_Watts> pedr0: If you want to work with tesseract, work with a git checkout.15:13.32 
pedr0 I am at 2c51d019b3ac1295c5d64249c638eb778082e6f915:14.23 
  Update freetype to version 
artifexirc-bot <Robin_Watts> You can't mix working with a release archive with working with git.15:14.32 
  <KenSharp> pedr0 that should be recent enough I think, @Robin_Watts would know better than me, but it's only a couple of commits behind what I have here15:15.07 
pedr0 sorry for the confusion, yes I am working on a git checkout of mu sources now15:15.21 
artifexirc-bot <KenSharp> My build completed15:15.35 
  <KenSharp> No errors15:15.38 
  <Robin_Watts> Ok. So a git checkout of the sources, and having followed the tesseract.txt instructions, does the build complete?15:15.50 
  <KenSharp> I'd have to boot up a Linux machine to try that15:15.51 
  <Robin_Watts> pedr0: You want to do something like: export TESSDATA_PREFIX=<directory with the traineddata file in it>15:16.49 
artifexirc-bot <KenSharp> is updating my Linux MuPDF15:17.30 
pedr0 I'd suggest to add that line into the tesseract.txt file - even though I am not sure if that path is used at runtime or at build time15:18.39 
  my build completed too, no errors15:18.53 
artifexirc-bot <Robin_Watts> runtime, and it'll tell you when you run it.15:19.48 
  <Robin_Watts> Fab. /me scuttles back under my rock.15:19.57 
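The runtime setup described above can be sketched as a small shell helper. This is a minimal sketch and not taken from the MuPDF sources: use_tessdata is a hypothetical name, and $HOME/tessdata is an arbitrary directory chosen for illustration. The wget URL is the one quoted from thirdparty/tesseract.txt earlier in the log.

```shell
# Sketch: point Tesseract at a directory of *.traineddata files at run time.
# use_tessdata is a hypothetical helper, not part of MuPDF or Tesseract.
use_tessdata() {
    dir="$1"
    # Fail early if the directory holds no language data.
    if ! ls "$dir"/*.traineddata >/dev/null 2>&1; then
        echo "no .traineddata files in $dir" >&2
        return 1
    fi
    # TESSDATA_PREFIX is read at run time, not at build time.
    export TESSDATA_PREFIX="$dir"
}

# Usage (the directory is an assumption; the URL is from tesseract.txt):
#   mkdir -p "$HOME/tessdata"
#   wget -P "$HOME/tessdata" https://github.com/tesseract-ocr/tessdata_fast/raw/master/eng.traineddata
#   use_tessdata "$HOME/tessdata"
```

The check for a .traineddata file is only a convenience, so that a wrong path shows up before mutool is run.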
pedr0 can I now use the mutool command line utility to OCR a doc ?15:20.36 
artifexirc-bot <KenSharp> I'm going to have to admit to some ignorance, I know what the Ghostscript version will do....15:21.35 
  <Robin_Watts> pedr0: There are certain operations you can do with OCR, yes.15:23.31 
pedr0 mutool draw -F ocr.text <file>15:24.27 
  I reckon that should do, however I get15:24.38 
  error: OCR Disabled in this build15:24.39 
artifexirc-bot <Robin_Watts> mutool draw -o out.txt in.pdf15:24.46 
  <Robin_Watts> OK. I don't have time to look into that now, sorry.15:25.29 
pedr0 no problem at all15:25.49 
  I needed to set HAVE_TESSERACT=yes HAVE_LEPTONICA=yes to get it to work. Thanks15:45.23 
malc_ Robin_Watts: any news on the force push issue i had yesterday?15:48.55 
artifexirc-bot <Robin_Watts> I think just USE_TESSERACT=yes is what you want. Then it should detect HAVE_TESSERACT and HAVE_LEPTONICA automatically.15:49.35 
  <Robin_Watts> malc_: Yeah, the repo has settled down again now.15:49.50 
  <Robin_Watts> If you want to update your repo to exactly match ours, then indeed, git fetch && git reset --hard origin/master should do what you want.15:50.30 
malc_ Robin_Watts: gotcha, ta15:50.42 
artifexirc-bot <Robin_Watts> BUT that assumes you have none of your own commits you want to keep.15:50.43 
malc_ Robin_Watts: i don't15:51.14 
artifexirc-bot <Robin_Watts> Then no problem.15:51.23 
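The fetch-and-reset approach discussed above can be sketched as follows. This is a minimal sketch: sync_to_upstream is a hypothetical helper name, and the defaults of origin and master simply mirror the command quoted in the log. It discards local commits and uncommitted changes on the current branch, so it only suits the case described here, where there is nothing local to keep.

```shell
# Sketch: make the current branch match the remote exactly after an
# upstream force push. sync_to_upstream is a hypothetical helper name.
# WARNING: local commits and uncommitted changes on this branch are lost.
sync_to_upstream() {
    remote="${1:-origin}"
    branch="${2:-master}"
    git fetch "$remote" || return 1
    # Show commits that exist locally but not upstream; these get dropped.
    git --no-pager log --oneline "$remote/$branch..HEAD"
    git reset --hard "$remote/$branch"
}

# Usage, from inside the checkout:
#   sync_to_upstream origin master
```

If there are local commits worth keeping, git fetch followed by git rebase, as suggested earlier in the log, is the alternative.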
pedr0 I can't recall an option that would produce text from a PDF retaining the layout ... achieving a similar result to pdftotext ... has it ever existed or did I dream it ?15:54.29 
artifexirc-bot <Robin_Watts> Depends what you want.15:55.20 
  <Robin_Watts> ocr.html or ocr.xhtml may contain what you want.15:55.48 
  <Robin_Watts> Or ocr.stext will contain positional information.15:56.03 
  <Robin_Watts> Or ocr.pdf will be a PDF with the pages as bitmaps, and the OCR detected text invisibly overlaid, so searching/cut/paste works.15:56.42 
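The OCR output formats listed above can be tried with a small wrapper around mutool draw. This is a sketch under assumptions: ocr_draw and the MUTOOL override variable are hypothetical names, and it relies only on the -F and -o options of mutool draw and the ocr.* format names mentioned in this log. It also assumes a build with OCR enabled, for example one made with USE_TESSERACT=yes as suggested above.

```shell
# Sketch: run "mutool draw" with one of the OCR formats named in the log.
# ocr_draw and MUTOOL are hypothetical names used for illustration only.
# Assumes a Tesseract-enabled build, e.g.:  make USE_TESSERACT=yes
ocr_draw() {
    format="$1"; input="$2"; output="$3"
    case "$format" in
        text|html|xhtml|stext|pdf) ;;
        *) echo "unsupported OCR format: $format" >&2; return 1 ;;
    esac
    # MUTOOL may point at a locally built binary instead of one on PATH.
    "${MUTOOL:-mutool}" draw -F "ocr.$format" -o "$output" "$input"
}

# Usage:
#   ocr_draw stext in.pdf out.stext   # text with positional information
#   ocr_draw pdf   in.pdf out.pdf     # page bitmaps with invisible text
```

Without OCR support compiled in, the underlying command is expected to fail with the "OCR Disabled in this build" error shown earlier.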
pedr0 I tried with HTML but its output is really messy - do you want me to email you the source and the output ?16:18.56 
artifexirc-bot <Robin_Watts> I don't have time to look at this at the moment, sorry.16:37.44 
  <Robin_Watts> basically, deriving "structure" from a mass of "this char here" information (which is what both PDF and OCR effectively are) is a massive research project.16:38.30 
  <Robin_Watts> @sebras Ping!17:05.56 
pedr0 I am sure it is. I was only wondering if you could be interested in the output as it is very messed and may be the source of a problem. No intention to bug. So I'll do the best thing to help: stay quiet :)17:38.08 
artifexirc-bot <Robin_Watts> @ator ping?17:49.32 
  <ator> pong.18:03.44 
  <Robin_Watts> A couple of little commits...18:05.36 
  <Robin_Watts> https://git.ghostscript.com/?p=user/robin/mupdf.git;a=commitdiff;h=68f1520f5c74471fccd36c048808578a9bd1371618:06.06 
  <Robin_Watts> and https://git.ghostscript.com/?p=user/robin/mupdf.git;a=commitdiff;h=a6c46c2696e85cf7da63320475c082cead5705c118:06.15 
  <Robin_Watts> bah. typo in the commit message of the first. will fix.18:06.24 
  <ator> both LGTM18:08.37 
  <Robin_Watts> @ator I'll cluster them to be sure, then push. thanks!18:19.11 
malc_ ator, sebras: https://boblycat.org/~malc/scratch/tspeed.org18:29.30 
artifexirc-bot <sebras> @Robin_Watts pong.22:53.32 
  <sebras> @Robin_Watts ah, I saw the commits. lgtm even though they're probably merged already. 🙂22:55.46 