Log of #mupdf at irc.freenode.net.

 2021/02/23
artifexirc-bot <paulgardiner> We used to have this function:10:33.02 
  <paulgardiner> ```C10:33.02 
  <paulgardiner> static fz_text *fit_text(fz_context *ctx, font_info *font_rec, char *str, fz_rect *bounds)10:33.03 
  <paulgardiner> ```10:33.04 
  <paulgardiner> I could reintroduce it, but I'm wondering if we already have something similar from the ebook layout code I could exploit.10:33.06 
pedr0 hi all - I am OCR'ing a doc and I get a lot of '&#xfffd;' - is this possible at all ? I reckoned that the OCR processing shouldn't return any of those character codes and I kept looking for my mistake but I can't find any (yet). Is my assumption that OCR can't return those correct ?10:34.01 
artifexirc-bot <KenSharp> That's a 'replacement character' in Unicode10:45.41 
  <KenSharp> IIRC it means 'don't know what this is'10:45.54 
  <KenSharp> Which may mean the character is in a language not supported by the pack you are using, or it might genuinely mean 'can't figure it out'10:46.19 
  <KenSharp> Either way it seems possible to me, but @Robin_Watts would know more than me10:46.32 
pedr0 thanks for that. Bizarre though as it seems to me that they should all be spaces.10:47.41 
artifexirc-bot <KenSharp> I'd have to see the file to even make a guess I'm afraid10:48.02 
pedr0 I get stuff like 'Intangible´┐Żassets'10:48.09 
  what file - the PDF ?10:48.17 
artifexirc-bot <KenSharp> But the OCR should skip spaces altogether I think10:48.19 
pedr0 I can send that over10:48.20 
artifexirc-bot <KenSharp> The PDF file yes10:48.23 
pedr0 where do you wish that to be sent ?10:48.47 
artifexirc-bot <KenSharp> Maybe there is some garbage in the space character. Noise on the scan or something10:48.50 
  <KenSharp> Umm can you put it somewhere public like dropbox or something ?10:49.02 
pedr0 OKs10:49.08 
artifexirc-bot <KenSharp> The only place we have for files is Bugzilla10:49.12 
  <KenSharp> BTW this may well need to wait for Robin's attention. I don't have much to do with MuPDF10:49.58 
  <ator> @paulgardiner not as convenient, you'd have to build up the full fz_html_box structure with CSS styles, etc., in order to reuse the HTML text layout and rendering functions10:50.56 
  <ator> and we allow customers to do builds without the HTML stuff in there, to save space10:51.32 
pedr0 https://tmpfiles.org/download/161527/out.pdf10:54.45 
artifexirc-bot <KenSharp> OK got it, let's see....11:01.02 
  <KenSharp> Well I'm inclined to agree, I cannot see anything obviously likely to cause an OCR problem there11:02.25 
  <KenSharp> It's all linework, not an image, so that should render well and cleanly. The original fonts are odd in some way, looks like the FontBBox is incorrect, and there is no ToUnicode information in them, so they don't copy properly at all.11:04.16 
pedr0 exactly.11:05.30 
  let's see if anybody from Artifex gives me a good tip, I do not want to open a bug if I am talking rubbish11:06.01 
  tip = hint :-)11:06.35 
artifexirc-bot <KenSharp> Well it may be something to do with re-assembling the text from the OCR engine back into PDF. I don't recall but I think MuPDF does that with an invisible text layer. This (obviously) isn't really my area, I could talk a lot about how Ghostscript uses the OCR engine, but not so much with MuPDF11:07.32 
  <KenSharp> @Robin_Watts did all the work on the OCR stuff, especially with MuPDF so I'm going to defer to him on this one, sorry.11:07.58 
pedr0 no problem at all, thanks a lot for now. At least what I see doesn't seem odd only to myself :-)11:09.02 
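For reference, the replacement character mentioned above is U+FFFD, which UTF-8 encodes as the three bytes EF BF BD. A minimal sketch for counting it in extracted text follows. The function name count_replacement_chars is hypothetical and not part of MuPDF, and the input is assumed to be a NUL-terminated UTF-8 string.

```C
#include <assert.h>
#include <stddef.h>

/*
 * Count occurrences of U+FFFD (UTF-8 bytes EF BF BD) in a
 * NUL-terminated UTF-8 string. Hypothetical helper, not a MuPDF API.
 */
size_t count_replacement_chars(const char *utf8)
{
	const unsigned char *p = (const unsigned char *)utf8;
	size_t n = 0;

	assert(utf8 != NULL);
	while (*p)
	{
		/* && short-circuits, so we never read past the terminating NUL. */
		if (p[0] == 0xEF && p[1] == 0xBF && p[2] == 0xBD)
		{
			n++;
			p += 3;
		}
		else
			p++;
	}
	return n;
}
```

Running this over the OCR output gives a quick measure of how many characters came back unrecognised, which may help narrow down whether they cluster where spaces are expected.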
artifexirc-bot <paulgardiner> Okay thanks. So it does look like I need to reintroduce that function I mentioned and that pulls in quite a few others. I guess I should just go for it and we can sort out if there are any simpler alternatives in review.11:22.00 
  <ator> @paulgardiner gl-font.c has some layout stuff you might be able to reuse11:30.41 
  <ator> ui_break_lines has the basic algorithm I think you're looking for11:30.58 
  <ator> with the same fz_font structures11:31.04 
  <paulgardiner> Okay, I'll take a look.12:30.09 
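For readers unfamiliar with the kind of algorithm being discussed, here is a generic greedy line breaker. It is a sketch under stated assumptions and not the code of ui_break_lines. The names break_lines, struct line, advance_fn and fixed_advance are all hypothetical, text is measured one byte at a time, and a callback stands in for real glyph advances taken from a font.

```C
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: advance width of one byte of text. */
typedef float (*advance_fn)(unsigned char c, void *user);

/* One output line: a slice of the input text (not NUL-terminated). */
struct line { const char *start; size_t len; };

/* Stand-in metric for illustration: every byte is 1 unit wide. */
float fixed_advance(unsigned char c, void *user)
{
	(void)c; (void)user;
	return 1.0f;
}

/* Greedy breaking: fill each line, then break at the last space seen
 * (or mid-word if there is none). Returns the number of lines written. */
size_t break_lines(const char *text, float max_width, advance_fn advance,
		void *user, struct line *lines, size_t max_lines)
{
	const char *start = text; /* first byte of the current line */
	const char *space = NULL; /* last space seen on the current line */
	const char *p = text;
	float width = 0;          /* accumulated width of [start, p) */
	size_t n = 0;

	assert(text && advance && lines);
	while (*p && n < max_lines)
	{
		float w;
		if (*p == '\n')
		{
			/* Hard break: emit the line without the newline. */
			lines[n].start = start;
			lines[n].len = (size_t)(p - start);
			n++;
			start = ++p; space = NULL; width = 0;
			continue;
		}
		if (*p == ' ')
			space = p;
		w = advance((unsigned char)*p, user);
		if (width + w > max_width && p > start)
		{
			/* Overflow: prefer the last space, else split before p. */
			int at_space = (space != NULL && space > start);
			const char *end = at_space ? space : p;
			lines[n].start = start;
			lines[n].len = (size_t)(end - start);
			n++;
			start = at_space ? space + 1 : p;
			p = start; space = NULL; width = 0;
			continue;
		}
		width += w;
		p++;
	}
	if (n < max_lines && p > start)
	{
		lines[n].start = start;
		lines[n].len = (size_t)(p - start);
		n++;
	}
	return n;
}
```

A fit-to-bounds routine could call such a breaker repeatedly with different font sizes until the resulting lines fit the rectangle. That is one possible design, not necessarily what the original fit_text did.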