MuPDF IRC logs

	<<<Back 1 day (to 2021/01/13)	Fwd 1 day (to 2021/01/15)>>>	20210114
artifexirc-bot	<Robin_Watts> @fredrossperry No, the freetype submodule is different. I'll investigate more in the morning.		00:21.01
	<Robin_Watts> malc_: Yes, stuff has been force pushed, and likely more will be tomorrow.		00:21.46
	<Robin_Watts> If I were you, I'd wait for it to settle down, then we'll help you.		00:22.01
malc_	Robin_Watts: okay. still i'm curious if there is a "blessed" way to do that		00:27.15
artifexirc-bot	<Robin_Watts> There are many ways to do it.		00:40.18
	<Robin_Watts> Git fetch, then git rebase is what I'd do.		00:40.30
malc_	Robin_Watts: thanks. i would have done: "git reset --hard origin/master"		01:07.42
artifexirc-bot	<Robin_Watts> @sebras ping		13:14.11
	<sebras> @Robin_Watts pong		13:17.43
	<Robin_Watts> @sebras Sanity has hopefully been restored to the cluster.		13:17.55
	<Robin_Watts> So if you want to rebase your stuff, and then push the reviewed commits, they can go in.		13:18.18
	<sebras> so freetype wasn't to blame for the issues?		13:20.45
	<sebras> (I see that it has moved to a new commit, but is still in)		13:21.08
	<Robin_Watts> @sebras There is an issue with the cluster nodes not updating submodules correctly/consistently when they change.		13:22.56
	<Robin_Watts> I've solved it by forcing a deletion. It's ugly but it'll work.		13:23.19
	<sebras> ok, so this needs to be done manually whenever we update submodules (until the underlying reason is found)? i.e. should I be extra wary when updating submodules in the future?		13:25.44
	<Robin_Watts> No. The cluster now contains lines to delete the submodule on every update.		13:27.23
	<Robin_Watts> I've traded slightly more work at the start of each run for reliability.		13:27.51
	<Robin_Watts> at some point someone could try to understand the problem more, but... not me, not now. What I have should be safe.		13:28.22
*artifexirc-bot*	<Robin_Watts> lunches		13:28.28
	<sebras> @Robin_Watts ok, I just needed to know if I should be doing the updates more carefully in the future. I wasn't complaining. 🙂		13:33.31
pedr0	hi all - quick question about the new version of mutool which uses tesseract - is it tied to a specific version of it ?		13:47.25
	I am curious to know if it uses the latest version of it - 4.		13:47.43
artifexirc-bot	<KenSharp> We appear to be using V 4.1.0		13:53.28
	<KenSharp> Leptonica appears to be version 1.79.0		13:54.20
	<KenSharp> To be honest, I'd suggest using the code we ship at the moment		13:56.07
pedr0	Thanks for that, yes I hoped V4 would be in use.		13:57.21
	is there an example somewhere which shows the new OCR feature ?		14:06.29
sebras	@KenSharp I had to apply a fix to mupdf to get it to work with both the version of tesseract used there and the version on my system.		14:14.32
artifexirc-bot	<KenSharp> @sebras you were trying to use a system library ?		14:15.14
	<KenSharp> pedr0 yes there's an example, but I can't help you, not a MuPDF developer 🙂		14:15.40
	<KenSharp> Maybe sebras can tell you		14:15.46
sebras	@KenSharp of tesseract, yes. mupdf can do both.		14:16.10
pedr0	where is it ? I can't find it. I can find the file that contains the device 'fz_new_ocr_device' though		14:16.33
artifexirc-bot	<Robin_Watts> pedr0: We can work with either tesseract 4 or 5, I believe.		14:16.59
sebras	pedr0: we mainly test with the version of tesseract bundled as a git submodule. I have previously successfully built with version 4.x installed on my system.		14:17.13
artifexirc-bot	<Robin_Watts> pedr0: source/fitz/ocr-device.c contains fz_new_ocr_device		14:18.07
	<KenSharp> @Robin_Watts there's a tesseract 5 ? I didn't see it on Github		14:18.32
	<KenSharp> ah, pre-release		14:18.52
	<Robin_Watts> Tesseract 5 is the current development version. Stable releases are of 4.		14:19.01
pedr0	to build it with tesseract - do I merely need to follow cat thirdparty/tesseract.txt and build as usual thereafter ?		14:59.04
artifexirc-bot	<KenSharp> @pedr0 I'm sorry I can't answer that, let me ping @Robin_Watts		15:00.31
	<Robin_Watts> pedr0: If you're building from our sources then yes.		15:01.54
	<Robin_Watts> Though that file looks truncated. 😦		15:02.42
	<KenSharp> Really ? It looked sufficient to me, but then what do I know 😄		15:04.11
	<Robin_Watts> It ends with "e.g." and no following example 🙂		15:04.55
pedr0	I must be doing something wrong		15:05.00
	make: *** No rule to make target 'build/release/thirdparty/tesseract/src/ccutil/globaloc.o', needed by 'build/release/libmupdf-third.a'. Stop.		15:05.00
artifexirc-bot	<KenSharp> @Robin_Watts my copy has an example using wget		15:05.26
	<KenSharp> e.g.		15:05.51
	<KenSharp>		15:05.51
	<KenSharp> wget https://github.com/tesseract-ocr/tessdata_fast/raw/master/eng.traineddata		15:05.53
	<Robin_Watts> Ah, the lack of a newline is confusing less 🙂		15:05.54
pedr0	yeah it isn't too clear about the training data, but I must admit it's the first time I came across tesseract and I may not be able to connect dots very well ...		15:05.57
artifexirc-bot	<KenSharp> Oh! I was using notepad....		15:06.14
	<KenSharp> @pedr0 AIUI Tesseract needs to know which language you are using in order to intelligently guess at the text, so you need to supply it with training data for the language		15:07.21
pedr0	all I did was to clone repos following the file's instructions and running a regular build 'make' no specific flags		15:07.24
	I see		15:07.51
artifexirc-bot	<KenSharp> I'm not currently running Linux, but let me give it a try on Windows		15:07.51
pedr0	the file downloaded by wget - where does it need to be ?		15:08.06
artifexirc-bot	<Robin_Watts> pedr0: I can't see a reference to globaloc anywhere in the Makefiles.		15:08.14
	<Robin_Watts> What SHA are you on?		15:08.29
pedr0	you meant commit id ? I think I've downloaded the sources from the release page		15:09.28
	and now I think about it I've the feeling you've explained to me all this already in the past :(		15:09.48
	what branch do I need to check-out ?		15:10.27
artifexirc-bot	<KenSharp> I've got master checked out		15:10.38
	<KenSharp> Give me a second until this clone of leptonica finishes and I'll tell you what SHA I have		15:11.00
	<KenSharp> My SHA is 4e3019fcb28d52468ddeeda0cd8c5980b3fb10e9		15:11.26
	<KenSharp> fixed detection of out-of-range when converting float		15:11.46
pedr0	do I need to place the file downloaded by wget in a specific folder ?		15:13.09
artifexirc-bot	<Robin_Watts> pedr0: If you want to work with tesseract, work with a git checkout.		15:13.32
pedr0	I am at 2c51d019b3ac1295c5d64249c638eb778082e6f9		15:14.23
	Update freetype to version 2.10.4.		15:14.30
artifexirc-bot	<Robin_Watts> You can't mix working with a release archive with working with git.		15:14.32
	<KenSharp> pedr0 that should be recent enough I think, @Robin_Watts would know better than me, but it's only a couple of commits behind what I have here		15:15.07
pedr0	sorry for the confusion, yes I am working on a git checkout of mu sources now		15:15.21
artifexirc-bot	<KenSharp> My build completed		15:15.35
	<KenSharp> No errors		15:15.38
	<Robin_Watts> Ok. So a git checkout of the sources, and having followed the tesseract.txt instructions, does the build complete?		15:15.50
	<KenSharp> I'd have to boot up a Linux machine to try that		15:15.51
	<Robin_Watts> pedr0: You want to do something like: export TESSDATA_PREFIX=<directory with the traineddata file in it>		15:16.49
*artifexirc-bot*	<KenSharp> is updating my Linux MuPDF		15:17.30
pedr0	I'd suggest to add that line into the tesseract.txt file - even though I am not sure if that path is used at runtime or at build time		15:18.39
	my build completed too, no errors		15:18.53
artifexirc-bot	<Robin_Watts> runtime, and it'll tell you when you run it.		15:19.48
	<Robin_Watts> Fab. /me scuttles back under my rock.		15:19.57
pedr0	can I now use the mutool command line utility to OCR a doc ?		15:20.36
artifexirc-bot	<KenSharp> I'm going to have to admit to some ignorance, I know what the Ghostscript version will do....		15:21.35
	<Robin_Watts> pedr0: There are certain operations you can do with OCR, yes.		15:23.31
pedr0	mutool draw -F ocr.text <file>		15:24.27
	I reckon that should do, however I get		15:24.38
	error: OCR Disabled in this build		15:24.39
artifexirc-bot	<Robin_Watts> mutool draw -o out.txt in.pdf		15:24.46
	<Robin_Watts> OK. I don't have time to look into that now, sorry.		15:25.29
pedr0	no problem at all		15:25.49
	I needed to set HAVE_TESSERACT=yes HAVE_LEPTONICA=yes to get it to work. Thanks		15:45.23
malc_	Robin_Watts: any news on the force push issue i had yesterday?		15:48.55
artifexirc-bot	<Robin_Watts> I think just USE_TESSERACT=yes is what you want. Then it should detect HAVE_TESSERACT and HAVE_LEPTONICA automatically.		15:49.35
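The build and runtime settings mentioned so far can be collected into a short shell sketch. It assumes a git checkout of MuPDF with the third-party sources already fetched as described in thirdparty/tesseract.txt. The function names build_with_ocr and tessdata_ready are hypothetical helpers, not part of MuPDF, and the check only mirrors the advice that TESSDATA_PREFIX should name the directory containing the traineddata file.

```shell
# Build step: run from the top of the MuPDF git checkout.
# USE_TESSERACT=yes is the switch suggested in the log; HAVE_TESSERACT and
# HAVE_LEPTONICA should then be detected automatically.
build_with_ocr() {
    make USE_TESSERACT=yes
}

# Hypothetical preflight check (not a MuPDF command): succeeds only if
# TESSDATA_PREFIX is set and holds <lang>.traineddata (default: eng).
tessdata_ready() {
    lang="${1:-eng}"
    [ -n "${TESSDATA_PREFIX:-}" ] && [ -f "$TESSDATA_PREFIX/$lang.traineddata" ]
}
```

A typical run would download eng.traineddata with the wget command quoted earlier, export TESSDATA_PREFIX to the download directory, and then call "tessdata_ready && mutool draw -F ocr.text in.pdf".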
	<Robin_Watts> malc_: Yeah, the repo has settled down again now.		15:49.50
	<Robin_Watts> If you want to update your repo to exactly match ours, then indeed, git fetch && git reset --hard origin/master should do what you want.		15:50.30
malc_	Robin_Watts: gotcha, ta		15:50.42
artifexirc-bot	<Robin_Watts> BUT that assumes you have none of your own commits you want to keep.		15:50.43
malc_	Robin_Watts: i don't		15:51.14
artifexirc-bot	<Robin_Watts> Then no problem.		15:51.23
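The two ways of catching up after a force push discussed above (rebase to keep local commits, hard reset to discard them) can be sketched as one small shell helper. The function name sync_with_upstream and its keep/discard argument are hypothetical, chosen for illustration; origin and master are the remote and branch named in the log.

```shell
# Hypothetical helper (name and arguments are illustrative only).
# Usage: sync_with_upstream keep|discard [branch]
sync_with_upstream() {
    mode="$1"
    branch="${2:-master}"
    case "$mode" in
        keep|discard) ;;
        *)
            echo "usage: sync_with_upstream keep|discard [branch]" >&2
            return 2
            ;;
    esac
    git fetch origin || return 1
    if [ "$mode" = keep ]; then
        # Replay local commits on top of the rewritten remote branch.
        git rebase "origin/$branch"
    else
        # Make the local branch match the remote exactly. This throws away
        # local commits and uncommitted changes to tracked files.
        git reset --hard "origin/$branch"
    fi
}
```

With no local commits to preserve, "sync_with_upstream discard" is equivalent to the git fetch && git reset --hard origin/master sequence quoted above.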
pedr0	I can't recall an option that would produce text from a PDF while retaining the layout ... achieving a similar result to pdftotext ... has it ever existed or did I dream it ?		15:54.29
artifexirc-bot	<Robin_Watts> Depends what you want.		15:55.20
	<Robin_Watts> ocr.html or ocr.xhtml may contain what you want.		15:55.48
	<Robin_Watts> Or ocr.stext will contain positional information.		15:56.03
	<Robin_Watts> Or ocr.pdf will be a PDF with the pages as bitmaps, and the OCR detected text invisibly overlaid, so searching/cut/paste works.		15:56.42
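To compare the OCR output formats listed above side by side, a loop along the following lines could be used. This is a sketch only: the function name ocr_all_formats, the MUTOOL override variable, and the out.<format> file names are assumptions made for illustration, and it presumes a mutool built with OCR enabled and TESSDATA_PREFIX already set.

```shell
# Hypothetical wrapper (not part of MuPDF): run mutool draw once per OCR
# output format named in the log. MUTOOL may point at a specific binary.
ocr_all_formats() {
    input="$1"
    outdir="${2:-.}"
    for fmt in text html xhtml stext pdf; do
        "${MUTOOL:-mutool}" draw -F "ocr.$fmt" -o "$outdir/out.$fmt" "$input" || return 1
    done
}
```

Calling "ocr_all_formats in.pdf" writes one file per format to the current directory, which makes it easier to see which representation keeps the layout best for a given document.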
pedr0	I tried with HTML but its output is really messy - do you want me to email you the source and the output ?		16:18.56
artifexirc-bot	<Robin_Watts> I don't have time to look at this at the moment, sorry.		16:37.44
	<Robin_Watts> basically, deriving "structure" from a mass of "this char here" information (which is what both PDF and OCR effectively are) is a massive research project.		16:38.30
	<Robin_Watts> @sebras Ping!		17:05.56
pedr0	I am sure it is. I was only wondering if you could be interested in the output as it is very messed up and may be the source of a problem. No intention to bug. So I'll do the best thing to help: stay quiet :)		17:38.08
	symptom		17:38.56
artifexirc-bot	<Robin_Watts> @ator ping?		17:49.32
	<ator> pong.		18:03.44
	<Robin_Watts> A couple of little commits...		18:05.36
	<Robin_Watts> https://git.ghostscript.com/?p=user/robin/mupdf.git;a=commitdiff;h=68f1520f5c74471fccd36c048808578a9bd13716		18:06.06
	<Robin_Watts> and https://git.ghostscript.com/?p=user/robin/mupdf.git;a=commitdiff;h=a6c46c2696e85cf7da63320475c082cead5705c1		18:06.15
	<Robin_Watts> bah. typo in the commit message of the first. will fix.		18:06.24
	<ator> both LGTM		18:08.37
	<Robin_Watts> @ator I'll cluster them to be sure, then push. thanks!		18:19.11
malc_	ator, sebras: https://boblycat.org/~malc/scratch/tspeed.org		18:29.30
artifexirc-bot	<sebras> @Robin_Watts pong.		22:53.32
	<sebras> @Robin_Watts ah, I saw the commits. lgtm even though they're probably merged already. 🙂		22:55.46

Log of #mupdf at irc.freenode.net.