MuPDF IRC logs

20201030
pedr0	hi all - I am getting this error when compiling 1.18: error: call of overloaded ‘abs(float)’ is ambiguous		09:49.28
	https://pastebin.pl/view/52294434		09:50.05
	Is there any new dependency needed which wasn't needed prior to 1.18 ?		09:50.39
kens	pedr0: there isn't anyone around who can answer that right at the moment, either stick around a bit or check back later in the logs		09:53.26
pedr0	Thanks. I think this has something to do with the CPP bindings which have been introduced, is there a way not to compile them at all ?		09:57.42
	Did this chap here get its way into the 1.18 release ? It does not look like that looking at the bug tracker		09:58.40
	https://bugs.ghostscript.com/show_bug.cgi?id=702715		09:58.40
kens	pedr0: sorry I'm not one of the MuPDF developers and I can't answer any of your questions so far :-(		10:00.48
	There will be developers along in the next few hours who will be better placed to help		10:01.12
pedr0	Yes, no problem at all, I'm just leaving a few lines here so when they log in they can answer, or tell me that the questions are terribly silly :-)		10:02.27
kens	you'll want either sebras (who is in Taiwan at the moment), ator (who is a late starter) or Robin_Watts_ (who is probably off shooting at clay things right now). One or more of them should be around 'soon'. I'm just filling in to say 'don't give up!' if anyone asks questions :-)		10:04.26
Robin_Watts_	pedr0: That commit did make it to 1.18.0, yes.		10:05.29
*kens*	is surprised to see Robin_Watts_		10:05.39
Robin_Watts_	Too windy to shoot clays this morning, apparently :)		10:05.52
kens	Oh! never thought of that....		10:06.14
Robin_Watts_	pedr0: 1.18.0 includes tesseract and leptonica as dependencies.		10:07.50
pedr0	ok, thanks. Are you guys scraping text from images or anything of that sort ?		10:16.39
	I did not know MULIB had such capabilities		10:16.51
	I still get the same error:		10:17.47
	usr/src/app/w/src/TextFilter.h: In member function ‘bool Character::operator==(const Character&) const’:		10:17.49
	if( abs( this->m_x0 - a.m_x0 ) > 1)		10:17.49
	^		10:17.49
	In file included from /usr/include/c++/6/cstdlib:75:0,		10:17.49
kens	All new :-)		10:17.49
pedr0	from /usr/include/c++/6/stdlib.h:36,		10:17.52
	from /usr/local/include/mupdf/memento.h:187,		10:17.54
	from /usr/local/include/mupdf/fitz/system.h:36,		10:17.56
	from /usr/local/include/mupdf/fitz.h:10,		10:17.58
	from /usr/src/app/w/src/TextFilter.h:8,		10:18.00
	from /usr/src/app/w/src/TextFilter.cpp:14:		10:18.02
	extern int abs (int __x) __THROW __attribute__ ((__const__)) __wur;		10:18.04
	^~~		10:18.06
	That's what I've installed: tesseract-ocr leptonica-progs		10:18.28
	I've the feeling this has to do with the C++ STDLIB I've installed on my system. What do you compile this stuff with ? g++ ?		10:18.59
sh4rm4^bnc	use std::abs in your C++ code		10:19.00
pedr0	that's not my code tough :)		10:20.24
	though		10:20.32
sh4rm4^bnc	oh, mupdf uses C++ now too ? eek		10:20.44
kens	MuPDF does not, tesseract does		10:21.01
	If you want to build Leptonica and Tesseract you need to use C++, same goes for Harfbuzz		10:21.21
sh4rm4^bnc	*harfbutt		10:21.36
*kens*	laughs		10:21.48
sh4rm4^bnc	oh lol leptonica is C++ now too? wtf		10:21.55
kens	But AIUI you ought to be able to write pure C code and link it, however this is distinctly not my field		10:22.15
	I'm not certain about Leptonica		10:22.25
	but Tesseract needs it and so they got pulled in together		10:22.40
sh4rm4^bnc	the versions i used used to be in C		10:22.44
kens	It's possible it still is		10:23.06
sh4rm4^bnc	but dan bloomberg used to stuff so many example programs into his tarball that it went from 4MB to 10MB in a year so i stopped updating it		10:23.19
kens	Robin_Watts_ did the integration for both MuPDF and Ghostscript, so I'm hazy on the details. I tend to lump the two together as they arrived together		10:23.43
sh4rm4^bnc	<pedr0> if( abs( this->m_x0 - a.m_x0 ) > 1)		10:24.07
	anyway the fix is to make this std::abs		10:24.15
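
A minimal sketch of the fix suggested above, assuming m_x0 is a float (the compiler message reports abs(float)). The helper name differs_by_more_than_one is hypothetical; only the expression abs( this->m_x0 - a.m_x0 ) > 1 comes from the pasted error. With <cmath> included, std::abs has a float overload, so the call is unambiguous and the difference is not converted to int first:

```cpp
#include <cmath>    // std::abs overloads for float, double, long double
#include <cstdlib>  // std::abs overloads for int, long, long long

// Hypothetical helper standing in for the comparison inside
// Character::operator== from the pasted error. Assumes the
// coordinates are floats, as the "abs(float)" message suggests.
bool differs_by_more_than_one(float x0, float other_x0) {
    // Qualifying the call as std::abs selects the float overload
    // declared in <cmath>, so overload resolution is unambiguous.
    return std::abs(x0 - other_x0) > 1;
}
```

Under these assumptions, differs_by_more_than_one(0.0f, 1.5f) is true, because the absolute difference 1.5 is kept as a float rather than being narrowed to an integer.
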
kens	IIRC there's a build option (possibly --without-tesseract) which won't try to build them in, or you can just delete the leptonica and tesseract directories		10:24.58
	Obviously that means the OCR stuff won't be built either		10:25.18
ator	pedr0: um, TextFilter.cpp is not a file that's part of mupdf...		10:25.33
	harfbuzz is implemented in C++ but has no dependency on the C++ standard library, and exposes a pure C api.		10:26.17
	leptonica is still C		10:26.30
sh4rm4^bnc	<3		10:26.35
ator	tesseract is icky C++ though		10:26.39
	neither tesseract nor leptonica should build by default, you need "make tesseract=yes" to enable it		10:27.16
kens	Morning ator		10:27.18
ator	kens: Morning!		10:27.46
kens	I'm glad you and Robin are here, I can stop talking cautiously about stuff I don't understand :-)		10:28.14
ator	thanks for holding down the fort!		10:29.19
	or however that saying goes		10:29.28
kens	Sounds right to me :-)		10:29.57
pedr0	How can I take advantage of the new features that use tesseract/leptonica ? What puzzles me is that I've a system where it compiles no problem, and another system, based on debian 9, which does not.		10:30.42
sh4rm4^bnc	pedr0, i think it depends on the GCC version used		10:31.38
	i have a program that compiled fine with GCC 4.7.4, but threw the exact same abs error with GCC 6.5.0		10:32.16
ator	pedr0: exactly what are you building? the error you show is in a file that is NOT a part of any mupdf or thirdparty sources shipped with mupdf.		10:32.17
	pedr0: to take advantage of tesseract, you pass "tesseract=yes" as an argument to make		10:33.29
pedr0	I just run make		10:37.54
	hang on		10:38.15
	am I a complete fool ?		10:38.28
sh4rm4^bnc	so where's usr/src/app/w/src/TextFilter.h from?		10:39.10
pedr0	The reason why I thought it was is that the only thing that I've done is to upgrade the MULIB library version		10:39.23
	Yeah it was my application source		10:42.40
	what an idiot, why on earth was this not a problem prior to the upgrade of the release .. I do not know.		10:43.03
	Sorry everybody for bothering such a thing.		10:43.28
	*for such a thing		10:43.35
	my English is leaving me		10:43.41
sh4rm4^bnc	tranquilo		10:43.59
pedr0	:-)		10:44.13
	I'm quite interested in the new OCR/tesseract feature, is it documented somewhere ?		10:44.24
kens	Hmm documentation... Novel concept :-)		10:45.43
sh4rm4^bnc	hehe		10:45.55
kens	It's probably in the release notes		10:46.03
pedr0	:-)		10:49.11
	https://bugs.ghostscript.com/show_bug.cgi?id=702715		10:49.12
	I reckon that is still on the master branch and it hasn't been released as yet		10:49.36
Robin_Watts_	pedr0: As I said earlier, that DID make it to 1.18.0		10:49.55
pedr0	Oh. Sorry, trying to do too many things at the same time.		10:51.50
	Yeah the release note mentions that 'api: Optional use of Tesseract to use OCR to extract text.' Is there an example anywhere ?		11:11.58
ator	pedr0: grep for fz_new_ocr_device		11:14.24
pedr0	Thanks		11:15.30
Robin_Watts_	pedr0: Are you calling at the C level? Or are you calling mutool ?		11:44.16
pedr0	I am interested in both really		11:45.17
	but I am used to navigating through the device's definition in the C files to read the documentation there, which is generally good		11:45.48
	can I OCR a file from mutool as well ?		11:45.57
Robin_Watts_	ok, so at the mutool level, formats with .ocr in them will use the OCR stuff.		11:46.10
	mutool draw -o out.ocr.txt -r200 in.pdf		11:46.28
pedr0	Sorry, maybe I am getting confused. Does that mean that simply using the suffix 'ocr' will cause the program to try to OCR the images within a given PDF ?		11:47.58
Robin_Watts_	pedr0: Not quite.		11:48.15
	We look at the suffix to the file given on -o to guess a format.		11:48.31
	or you can specify a format using -F.		11:48.40
	And certain formats, namely: ocr.txt, ocr.html, ocr.xhtml, ocr.stext will trigger the use of ocr.		11:49.06
	Also ocr.pdf will trigger the use of bitmap-wrapped-as-pdf with OCR.		11:49.28
	see the usage message for mutool.		11:49.46
	You'll need traineddata for tesseract to use.		11:50.24
	like eng.traineddata from here: https://github.com/tesseract-ocr/tessdata_fast		11:50.50
	and you can specify the language(s) to use by using "-t eng" or "-t eng+ara" etc.		11:51.17
	default is "eng".		11:51.34
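
A rough sketch of the suffix rule described above, not MuPDF's actual code. The function names ends_with and output_name_triggers_ocr are hypothetical, and the assumption that the format must follow a '.' in the output name is mine; only the list of OCR formats comes from the log:

```cpp
#include <string>
#include <vector>

// Returns true if `name` ends with `suffix`.
static bool ends_with(const std::string& name, const std::string& suffix) {
    return name.size() >= suffix.size() &&
           name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Hypothetical helper: guesses from the -o file name whether one of the
// OCR formats named in the log would be selected. Illustration only.
bool output_name_triggers_ocr(const std::string& output_name) {
    // Formats listed in the log; ocr.pdf is the bitmap-wrapped-as-pdf case.
    static const std::vector<std::string> ocr_formats = {
        "ocr.txt", "ocr.html", "ocr.xhtml", "ocr.stext", "ocr.pdf"
    };
    for (const auto& format : ocr_formats) {
        // Assumption: the format is the part of the name after a '.'.
        if (ends_with(output_name, "." + format)) {
            return true;
        }
    }
    return false;
}
```

With this sketch, out.ocr.txt (the name used in the mutool draw example) matches, while a plain out.txt does not.
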
pedr0	I think I get the gist, does it OCR only images or does it try to OCR the whole PDF seen as an image ? it must be the former now that I think about it		11:52.49
Robin_Watts_	pedr0: We render the pdf to a bitmap, images/text/everything. Then we ocr that.		11:53.26
pedr0	I see, thanks a lot as usual, really helpful. I will let you know how it went.		11:54.12
Robin_Watts_	specifically we want to OCR stuff that's text already, cos frequently, in badly generated PDFs, which are way too common, the text might look right, but it has crap unicode values, so cutting and pasting doesn't work.		11:54.37
pedr0	ah I see		11:55.19
	and does it work well generally ? Or does it require a lot of manual intervention thereafter		11:57.18
	no need to answer to that one, I will check myself, it depends on an awful lot of factors I would answer myself :-)		11:58.26
Robin_Watts_	pedr0: It uses the tesseract engine, which is reputedly a good one, but as you say, all sorts of factors involved.		11:58.48

Log of #mupdf at irc.freenode.net.