| <<<Back 1 day (to 2022/04/28) | Fwd 1 day (to 2022/04/30) >>> | 20220429 |
artifexirc-bot | <qwertynik> Understood a bit about Type 3 fonts here: https://www.prepressure.com/fonts/basics/type3. It is now clearer why they were not converted to black: probably because they are drawn. | 05:15.38 |
| <qwertynik> If the information that the vector comes from a Type 3 font is available, can a flag be added to convert Type 3 font 'drawings' too? | 05:15.39 |
| <qwertynik> Yes @KenSharp detecting the background would be complex. However the goal here is to convert all text to black, remove all drawings and images and output the PDF. | 05:17.00 |
| <mvrhel> Maybe | 05:23.09 |
| <mvrhel> But I am not a font guy. And I am sure it will not work for PDF output | 05:23.37 |
| <mvrhel> or pretty sure anyway | 05:23.46 |
| <mvrhel> Rendering maybe | 05:23.56 |
| <RayJohnston> IANAFG is sort of like IANAL (I Am Not A Font Guy / I Am Not A Lawyer). I am neither 🙂 | 05:26.17 |
| <mvrhel> funny | 05:28.44 |
| <RayJohnston> AFGIAN (A Font Guy I Am Not) is shorter. ALIAN (A Lawyer I Am Not), not to be confused with ALIEN | 05:29.02 |
| <RayJohnston> I suspect that IANAL caught on because of the ANAL part, and how legal interpretations of things tend to be ... | 05:30.37 |
| <qwertynik> Ok @mvrhel. Is this because of a limitation in the PDF format, or, the current ghostscript code does not support it? | 05:31.30 |
| <mvrhel> I believe it is the PDF format and how the color is set for Type 3 fonts | 05:33.32 |
| <mvrhel> but that is for Ken to tell you when he gets here in a couple hours | 05:33.48 |
| <mvrhel> I think it is like the uncolored pattern stuff | 05:34.30 |
| <mvrhel> which relies upon whatever color is set in the graphic state | 05:34.45 |
| <mvrhel> as to what gets drawn | 05:34.56 |
| <mvrhel> easy enough to deal with when rendering | 05:35.04 |
| <RayJohnston> Type 3 fonts are painted with generalized vector and image operations. In Ghostscript, they are all painted as text (using the text enumerator), but I suspect the problem is that the image/bitmap and vector operations don't know that they are being painted as part of text | 05:35.08 |
| <mvrhel> hard to pack into the output pdf file | 05:35.09 |
| <RayJohnston> and it gets even more hairy when going to the pdfwrite device which tries to preserve the input graphic state as much as possible. | 05:36.37 |
| <RayJohnston> easy for you, maybe 🙂 | 05:37.04 |
| <RayJohnston> knowing that painting (at some point) came from text, particularly when colorspaces may be pattern or indexed or whatever, doesn't seem simple. | 05:38.02 |
| <mvrhel> we added -dBlackVector recently | 05:38.16 |
| <mvrhel> and that packs into PDF output too | 05:38.25 |
| <mvrhel> but type3 fonts | 05:38.41 |
| <mvrhel> no | 05:38.42 |
| <RayJohnston> what does BlackVector do with colored Patterns and images ? | 05:38.57 |
| <mvrhel> images are not converted | 05:39.09 |
| <mvrhel> some patterns are | 05:39.14 |
| <mvrhel> not uncolored ones | 05:39.19 |
| <mvrhel> well not to pdf output | 05:39.28 |
| <RayJohnston> "some" ... | 05:39.31 |
| <mvrhel> rendered yes | 05:39.32 |
| <mvrhel> yes. if you have a pattern that has an image in it | 05:39.50 |
| <mvrhel> it is not going to be converted to black | 05:39.55 |
| <mvrhel> if it has a vector drawing in it | 05:40.03 |
| <mvrhel> that content will be rendered black | 05:40.10 |
| <mvrhel> it is a difficult problem | 05:40.45 |
| <mvrhel> a customer wanted this | 05:40.50 |
| <mvrhel> and that was the best I could do without charging them a hefty NRE | 05:41.05 |
| <RayJohnston> right, makes sense, because BlackVector will capture the vector parts of the pattern as black (I assume) | 05:41.20 |
| <mvrhel> yes | 05:41.25 |
| <mvrhel> but there are uncolored patterns | 05:41.38 |
| <mvrhel> which use the current graphic state color value | 05:41.47 |
| <mvrhel> those don't translate easily to PDF output | 05:41.59 |
| <mvrhel> but render black | 05:42.08 |
| <RayJohnston> yep. bitmapped (uncolored "stencil") patterns don't have a color | 05:42.51 |
| <mvrhel> right | 05:43.24 |
| <mvrhel> off to bed | 05:45.43 |
| <RayJohnston> for raster output, probably a "tagged" output would retain the text tag, but that doesn't help with PDF output. | 05:45.45 |
| <RayJohnston> me, too (bed) | 05:46.12 |
| <qwertynik> From the conversation so far, it is mostly clear that the use of Type 3 fonts makes it difficult to change the color of the text (as it appears to the human eye) to black. | 05:51.01 |
| <qwertynik> Before finding the -dBlackText flag in Ghostscript, was using MuPDF via a Python library to extract the text and then create a new PDF with black text. However, this was challenging considering the different orientation, and other attributes associated with text rendering. | 05:51.10 |
| <qwertynik> But, since the text using a Type 3 font was extracted as 'text' using MuPDF, wondering if a supplementary flag can be added that says: even if it is a Type 3 font rendered 'object', render it in black, assuming it is text. Do not care about it being a drawing. | 05:51.27 |
| <RayJohnston> @qwertynik what about using -dFILTERIMAGE -dFILTERVECTOR to just leave the text -- will that help you? | 05:52.09 |
| <qwertynik> Yes certainly, that has helped so far. Have been using this command: `gs -o op.pdf -sDEVICE=pdfwrite -dFILTERIMAGE -dFILTERVECTOR -dBlackText -f BlackTextGenerationTesting.pdf`. And now bumped into Type 3 fonts in the PDF 🙂 | 05:53.34 |
| <RayJohnston> so that, with -dBlackVector ?? | 05:54.26 |
| <RayJohnston> and/or -sColorConversionStrategy=Gray ?? | 05:57.53 |
| <RayJohnston> (pdfwrite options are quite myriad and have lots of "interesting" corner cases in themselves) | 05:58.40 |
| <qwertynik> With this, wouldn't the other vectors (the ones that are actually drawings and are to be removed) also remain? | 05:59.20 |
| <RayJohnston> the "filter" subclass device filters BEFORE any color conversion done by -dBlackVector (I am pretty sure -- @mvrhel, or testing, would have to confirm that) | 06:01.02 |
| <RayJohnston> There are several pseudo devices that can affect the processing: "FirstPage LastPage" that swallow ALL operations for a page, "ObjectHandler" that filters object type calls, and "Nup" that places objects on a "master" page. | 06:06.00 |
| <RayJohnston> These all happen before any "target" device (including pdfwrite) sees the operation. | 06:06.51 |
| <RayJohnston> so, I would expect -dFILTERIMAGE -dFILTERVECTOR to only leave the text | 06:09.15 |
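The single-pass combinations suggested so far (`-dBlackText` alone, with `-dBlackVector`, or with `-sColorConversionStrategy=Gray`) can be compared with a small wrapper. This is only a sketch: `run` and `text_only_black` are hypothetical helper names, not Ghostscript features, and the flags are exactly the ones quoted in the conversation. Note that `Gray` converts to grayscale rather than forcing black, and nothing here is claimed to fix the Type 3 case.

```shell
#!/bin/sh
# Sketch only: wraps the single-pass gs commands discussed in the chat.
# run and text_only_black are hypothetical helper names.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# usage: text_only_black MODE INPUT.pdf OUTPUT.pdf
#   MODE = text   -> -dBlackText only
#          vector -> -dBlackText -dBlackVector
#          gray   -> -dBlackText -sColorConversionStrategy=Gray
text_only_black() {
    mode=$1
    src=$2
    dst=$3
    case $mode in
        text)   extra="" ;;
        vector) extra="-dBlackVector" ;;
        gray)   extra="-sColorConversionStrategy=Gray" ;;
        *)      echo "unknown mode: $mode" >&2; return 2 ;;
    esac
    # $extra is intentionally unquoted so an empty value adds no argument.
    run gs -o "$dst" -sDEVICE=pdfwrite -dFILTERIMAGE -dFILTERVECTOR \
        -dBlackText $extra -f "$src"
}
```

For example, `DRY_RUN=1 text_only_black vector in.pdf out.pdf` prints the command without running it; drop `DRY_RUN=1` to actually invoke `gs`.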
| <qwertynik> In that case, wouldn't the vector be filtered out, which in this case includes the relevant text (in Type 3 font) too? | 06:09.37 |
| <qwertynik> Attempted this command and the required text (using Type 3 font) was **removed** | 06:09.39 |
| <qwertynik> `gs -o op.pdf -sDEVICE=pdfwrite -dFILTERIMAGE -dFILTERVECTOR -dBlackText -dBlackVector -f MIAA-emailer6.pdf` | 06:09.40 |
| <qwertynik> This is the test PDF. | 06:10.09 |
| <qwertynik> https://cdn.discordapp.com/attachments/773567375458828329/969480693627375636/MIAA-emailer6.pdf | 06:10.10 |
| <RayJohnston> @qwertynik if you are looking to get the text "MERZ INSTITUTE", etc. it is NOT text. It is part of an embedded image. The "Dear Doctor" stuff IS text | 06:20.59 |
| <RayJohnston> The "MERZ NEWS" is also text | 06:25.47 |
| <qwertynik> @RayJohnston Yes, that's part of the image. Not looking to extract that. | 06:29.35 |
| <qwertynik> Looking to get the highlighted portion as text: | 06:29.36 |
| <qwertynik> https://cdn.discordapp.com/attachments/773567375458828329/969485371584950312/unknown.png | 06:29.37 |
| <KenSharp> The current architecture of the pdfwrite device does not support this feature. The way it is implemented for rendering means that it mostly works for pdfwrite, but not always. | 07:32.09 |
| <KenSharp> Specifically it is known (and expected) not to work for uncoloured type 3 fonts and uncoloured patterns, due to the way colour processing is lazily written. | 07:34.36 |
| <qwertynik> Can a different device and command be used before running this command `gs -o op.pdf -sDEVICE=pdfwrite -dFILTERIMAGE -dFILTERVECTOR -dBlackText -f BlackTextGenerationTesting.pdf` to account for the lazy color processing? | 07:36.49 |
| <KenSharp> No. | 07:37.21 |
| <KenSharp> Or at least, not without rendering the text to an image | 07:37.35 |
| <KenSharp> But then BlackText won't work | 07:37.49 |
| <KenSharp> You might be able to convert the text to outlines using NoOutputFonts, and then use -dBlackVector | 07:38.19 |
| <KenSharp> That's a 2-step process you understand ? First run to pdfwrite with NoOutputFonts, and then run the result through pdfwrite with -dBlackVector | 07:38.57 |
| <KenSharp> Of course the text will no longer be text after that, and the file will be considerably larger | 07:39.19 |
| <qwertynik> Sounds plausible. However, wouldn't this also keep other vectors that are actually drawings? | 07:40.41 |
| <KenSharp> If you start with FILTERIMAGE and FILTERVECTOR along with NoOutputFonts, no | 07:41.07 |
| <qwertynik> I think the text using Type 3 font is also being filtered out with the FILTERVECTOR flag. Will recheck now | 07:43.14 |
| <KenSharp> It shouldn't be, it wasn't previously, was it ? | 07:43.42 |
| <qwertynik> Yes, it is being filtered out. | 07:44.08 |
| <qwertynik> Just re-verified. | 07:44.14 |
| <KenSharp> Well you're stuck then | 07:44.24 |
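The two-pass process Ken describes above can be sketched as follows. This is a hedged illustration, not a verified recipe: `run` and `two_pass_black` are hypothetical helper names, the intermediate file name is arbitrary, and as reported just above, `-dFILTERVECTOR` also removed the Type 3 text in qwertynik's test file, so pass 1 may need to omit it (which keeps other drawings too).

```shell
#!/bin/sh
# Sketch only: the two-pass NoOutputFonts + BlackVector idea from the chat.
# run and two_pass_black are hypothetical helper names.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# usage: two_pass_black INPUT.pdf INTERMEDIATE.pdf OUTPUT.pdf
two_pass_black() {
    src=$1
    mid=$2
    dst=$3
    # Pass 1: drop images and vector drawings, and write text as outlines
    # rather than fonts (-dNoOutputFonts).
    run gs -o "$mid" -sDEVICE=pdfwrite -dFILTERIMAGE -dFILTERVECTOR \
        -dNoOutputFonts -f "$src" || return
    # Pass 2: the former text is now vector content, so -dBlackVector applies.
    run gs -o "$dst" -sDEVICE=pdfwrite -dBlackVector -f "$mid"
}
```

A call such as `two_pass_black MIAA-emailer6.pdf outlines.pdf op.pdf` runs both passes; with `DRY_RUN=1` it only prints them. As Ken notes, the result no longer contains real text and is likely to be considerably larger.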
| <qwertynik> Reposting my earlier message (05:51) in case it was missed: MuPDF extracted the Type 3 text as 'text', so could a supplementary flag be added to render Type 3 font 'objects' in black, assuming they are text? | 07:45.33 |
| <KenSharp> No, I saw it, but I don't see the relevance, if you want to ask questions about MuPDF you'd be better off in the #mupdf channel | 07:46.18 |
| <qwertynik> Ok sure. Posted here assuming they could be related and it would 'click' some idea. | 07:47.30 |
| <KenSharp> The MuPDF developers mostly don't read #ghostscript and vice-versa, we've generally got enough to do with our own products..... | 07:48.03 |
| <qwertynik> Is supporting such a flag technically possible in Ghostscript? | 07:49.24 |
| <KenSharp> Supporting what flag ? | 07:59.34 |
| <KenSharp> The problem isn't anything to do with recognising text, we know it's text, the problem is writing something other than the current colour as the colour to use for the text. While not destroying the current colour in case it also gets used for something else, like a fill. | 08:03.56 |
| <KenSharp> Is it possible to do that ? Almost certainly, it's software. But I don't intend to invest the amount of effort required to support that. | 08:04.36 |
| <qwertynik> Ok. Just realized that in the generated PDF most of the text has been changed to black. But for some text (annotated and highlighted below) the color **isn't changed**, because a Type 3 font is used 🤦‍♂️ | 09:31.39 |
| <qwertynik> https://cdn.discordapp.com/attachments/773567375458828329/969528497464815636/unknown.png | 09:31.40 |
| <Robin_Watts> @chrisl OK, I think what is there now should work. | 15:11.50 |
| <KenSharp> possibly in #ghostscript-tech ? | 15:12.22 |