| 2013/02/01 |
sebras | Robin_Watts: ~sebras/tmp/torture-test.pdf | 04:04.40 |
henrys | hi ray_laptop | 06:25.06 |
ray_laptop | henrys: hi | 07:16.21 |
| I didn't notice IRC alert... | 07:16.49 |
Richard | Hello. | 07:51.42 |
kens | Good Morning | 07:51.57 |
Guest78590 | I was wondering if anyone could tell me the PHP requirements for Ghostscript. | 07:52.25 |
kens | Nope, absolutely no idea. | 07:52.37 |
Guest78590 | It doesn't say on the site. | 07:52.53 |
kens | Apart from the fact that Ghostscript doesn't use or require PHP | 07:52.56 |
Guest78590 | It doesn´t= | 07:53.15 |
| ? | 07:53.17 |
kens | No, it doesn't | 07:53.21 |
Guest78590 | Oh.. | 07:53.26 |
kens | It's a C application | 07:53.36 |
| After compilation it's a platform-specific binary executable | 07:53.58 |
Guest78590 | We're going to use ImageMagick when we upgraded our servers and we need Ghostscript for PDFs. | 07:54.06 |
| And as far as I know, we need 5.1.3 at least for ImageMagick. | 07:54.55 |
kens | No idea, but I can't see why ImageMagick requires PHP eithr | 07:55.12 |
| either | 07:55.16 |
Guest78590 | As an extension it does. | 07:55.30 |
kens | No idea what you mean, ImageMagick is an executable, or at least it was last time I looked | 07:55.55 |
Guest78590 | http://www.php.net/manual/en/imagick.requirements.php | 07:56.16 |
| It says it here. | 07:56.19 |
kens | Means little or nothing to me, I don't speak PHP | 07:56.35 |
| Right, you have it upside down | 07:56.50 |
Guest78590 | Sorry? | 07:57.16 |
kens | PHP requires a certain version of ImageMagick, ImageMagick does not require a particular version of PHP | 07:57.17 |
Guest78590 | Ahhhhhhhhhh, I see! | 07:57.26 |
kens | (or indeed any version of PHP) | 07:57.28 |
Guest78590 | I see now. | 07:57.40 |
| I understand, thank you. | 07:58.02 |
kens | No problem | 07:58.06 |
Guest78590 | I shall be leaving now then, there is work to be done. Thanks for your help. | 07:58.31 |
kens | You are welcome | 07:58.37 |
Guest78590 | Have a nice day. | 07:58.43 |
| Bye. | 07:58.58 |
kens | Great 2 emails from Aaron this morning | 08:36.31 |
chrisl | Hmm, well, my nice surprise this morning is that the sim that Len put appears to require VS2010 Pro - which I ain't got.... :-( | 08:37.57 |
paulgardiner | Robin_Watts: ping | 10:41.26 |
sebras | paulgardiner: pong. Wrong route to host. ;) | 11:00.47 |
paulgardiner | reboots his router | 11:01.43 |
Robin_Watts | pong | 11:30.05 |
| paulgardiner: Random thought... Can the gesture recognisers on android cope with trickier things than currently? | 13:00.24 |
| In particular, can we get 2 fingered rotate detected? | 13:00.36 |
| I'm thinking of the case where we have a landscape page in a portrait file. | 13:01.03 |
| Rotating the device doesn't help us, cos the page rotates with it. | 13:01.21 |
paulgardiner | I'm certain we can but I don't know if it would require direct recognition or whether the built in recognisers cope | 13:01.23 |
Robin_Watts | Looks like there is hope: http://stackoverflow.com/questions/10682019/android-two-finger-rotation | 13:03.18 |
chrisl | <sigh> So much for a half hour task of moving a computer to a different room, and installing a wireless adapter :-( | 13:03.18 |
kens2 | Well that's half a day in anyone's book | 13:03.52 |
paulgardiner | 4h for a "half hour task"? Sounds about right. | 13:04.23 |
kens2 | Wow, I actually seem to be able to print XPS files | 13:05.25 |
chrisl | Didn't work out - wireless signal too weak, tried moving the router, then the ADSL wouldn't reconnect. Spent most of the time on the phone to tech support, and still not working | 13:05.26 |
Robin_Watts | chrisl: wireless adapters are all shite. | 13:05.41 |
| Have you considered power line networking instead? | 13:06.17 |
chrisl | Robin_Watts: I wanted to ask you about that - I thought they needed to be on the same ring, is that still the case? | 13:06.40 |
Robin_Watts | chrisl: Within the same substation. | 13:06.51 |
| Certainly not within the same ringmain. | 13:07.08 |
kens2 | I have mine on 3 different rings | 13:07.23 |
chrisl | Oh, cool, so that's a good option. I'll recommend it - when they get the ADSL going again! | 13:07.54 |
Robin_Watts | http://www.7dayshop.com/7dayshop-200mbps-mini-homeplug-powerline-ethernet-network-adapter-value-twin-pack?backUrl=L2NvbXB1dGVyLW5ldHdvcmstd2lmaS9uZXR3b3JrLWFuZC13aWZp | 13:07.57 |
chrisl | I really only agreed to try wireless because they already have wireless, and I had a space USB adapter. But a *thick* stone wall just kills the signal | 13:09.49 |
| s/space/spare | 13:10.01 |
Robin_Watts | I was really reluctant to try powerline networking as it seemed like a horrid hack, but it's worked in both places I've tried it admirably well. | 13:10.44 |
| whereas wifi is always "not quite strong enough" wherever I use it. | 13:11.22 |
chrisl | I did suggest it might be the better solution but the theory was we could try the wireless free, and if it worked - yay! But it all went horribly wrong | 13:11.55 |
Robin_Watts | http://www.spiegel.de/static/360grad/kamtschatka/ | 13:19.19 |
sebras | http://www.lrde.epita.fr/~adl/dl/autotools.pdf | 13:31.02 |
| search for "AC_DEFINE" and notice that you get hits on pages 407-409, but not on the last line of 406... | 13:31.36 |
Robin_Watts | Ah, that may be a known bug to do with the text extraction device not flushing the last thing. I think a customer mentioned it, together with a crap fix, and we had a better fix. | 13:32.42 |
sebras | no it's not. it's because the pdf generator decided to sometimes emit the underscore glyph and sometimes just draw it as a path! | 13:34.19 |
| however searching for "define" in that document produces hits on pages where the text is likely hidden. I guess we should detect this somehow. | 13:34.51 |
| and ignore those pages. | 13:34.54 |
Robin_Watts | sebras: Ah, well, that's probably best coped with by smartness in the search code. | 13:35.31 |
| similar smartness to "search for e, find &eacute;" etc. | 13:35.48 |
sebras | yes. | 13:36.28 |
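The "search for e, find &eacute;" behaviour mentioned above can be sketched as accent folding before comparison. This is only an illustration in Python, not MuPDF's search code; `fold` and `loose_contains` are hypothetical names, and the sketch does nothing for the separate problem of an underscore that was drawn as a path rather than emitted as a glyph.

```python
import unicodedata

def fold(text):
    """Reduce text to a loose search key: no accents, no case.

    NFD splits a precomposed letter such as U+00E9 into a base letter
    plus a combining mark; dropping the combining marks leaves the base.
    """
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

def loose_contains(haystack, needle):
    """True if needle occurs in haystack once both are folded."""
    return fold(needle) in fold(haystack)
```

A real search would also need to map hit positions in the folded text back to the original characters so that the right glyphs get highlighted; that bookkeeping is left out here.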
Robin_Watts | morning tor8. | 14:19.07 |
tor8 | hey robin | 14:19.15 |
Robin_Watts | I'm just making notes on the text extraction stuff, what it currently does, what it doesn't do, what we need to do to do realistic extraction to html etc. | 14:20.24 |
tor8 | right. | 14:20.40 |
Robin_Watts | Did you have an opinion on the emails between paul/henrys/myself this morning ? | 14:21.04 |
tor8 | so we currently can assemble lines and columns (page headers and footers end up as their own columns) with styling information | 14:21.07 |
Robin_Watts | tor8: Hmm. is that true? | 14:21.27 |
| We collate chars/spans/lines/blocks. | 14:21.48 |
tor8 | to go from that to something that flows in a webview in the android viewer shouldn't be more than a few days for paul (or you, depending on who's more familiar with android hacking) | 14:21.53 |
| blocks are essentially columns | 14:21.59 |
| broken by empty lines, so sometimes paragraphs | 14:22.08 |
| but paragraphs with first-line indent don't get broken into separate blocks | 14:22.29 |
Robin_Watts | OK. I wasn't sure where we drew the line between columns and paragraphs. | 14:22.45 |
tor8 | so to go from what we have to usable html output we'll need to (a) figure out and skip page headers and footers | 14:23.05 |
| (b) split blocks into paragraphs based on first-line indent | 14:23.20 |
Robin_Watts | If blocks are columns, then we need a separate paragraph recognition phase. | 14:23.20 |
tor8 | © merge paragraphs across pages | 14:23.34 |
Robin_Watts | and last line outdent. | 14:23.35 |
tor8 | damn apple autocorrect... | 14:23.42 |
Robin_Watts | tor8: yeah, let me finish my email with the list in, and then we can bash it all into shape. | 14:24.05 |
tor8 | and (d) get images in there | 14:24.21 |
| for A+ rating, we'd want to render line art into images as well | 14:24.36 |
| and tables and fancy math and other formatting will be basically impossible... | 14:24.53 |
| maybe just detect that it's not a simple block of text and drop into "render as an image" mode | 14:25.10 |
| (all the while closing our eyes and pretending that hebrew, arabic and chinese don't exist) | 14:25.50 |
Robin_Watts | tor8: sent. | 14:45.30 |
tor8 | 1f can be fixed by reusing the fz_text_sheet stylesheet between extractions | 14:47.25 |
Robin_Watts | tor8: We already do use the same sheet in mudraw | 14:47.40 |
| and we fail to match styles. | 14:47.48 |
| Try mudraw -tt pdf_reference17.pdf 2-3 | 14:48.02 |
tor8 | sounds like a bug | 14:48.04 |
Robin_Watts | yeah. | 14:48.10 |
| s2 and s3 look to be the same to me. | 14:48.31 |
| s0 == s4 ? | 14:48.37 |
| and the last few spans outputted should have been put together. | 14:49.12 |
tor8 | Robin_Watts: not a bug, they're using different subset fonts | 14:54.49 |
| which doesn't show in the css output | 14:55.01 |
Robin_Watts | Right, so we need to spot that and combine the two. | 14:55.14 |
| Either in the initial collection of spans, or in some 'analysis' phase, or in the output phase. | 14:55.36 |
tor8 | yes. will go a lot slower to compare styles then though, so probably in a later analysis phase | 14:55.44 |
Robin_Watts | tor8: Some things I forgot to put in there: | 14:55.58 |
| We should spot left/right/centred alignment. | 14:56.11 |
tor8 | Robin_Watts: all those things (and paragraph detection) would need to go as a post-process pass | 14:56.32 |
Robin_Watts | possibly we should spot align left/justify/align right too (not sure if css can do that? I bet it can) | 14:56.44 |
tor8 | since then we can look at the column width and detect deviations (such as first-line indent and right alignment) | 14:56.57 |
| Robin_Watts: oh, it can :) | 14:57.11 |
Robin_Watts | tor8: I agree. But we should add them to the list. | 14:57.25 |
tor8 | I had a prototype that did the text/span/block assembly independent of the source file order | 14:57.37 |
Robin_Watts | The idea was to get everything into the list and then think about what we can/can't do. | 14:57.40 |
tor8 | it worked very slowly O(n^2) | 14:57.59 |
| sadly, it was more reliable to just use the file order. too many odd cases to handle :( | 14:58.17 |
Robin_Watts | right, so we need to find a faster algorithm. Or we need to allow for a post analysis phase to fix it up or something. | 14:58.43 |
tor8 | the plan for the current text structure was that it'd be easy to do further sorting and analysis | 14:58.45 |
Robin_Watts | right. | 14:58.54 |
| We should also cope with things like 'SAMPLE' being printed diagonally across each page. | 14:59.19 |
tor8 | drop all non-orthogonal text into the "graph" mode | 14:59.40 |
Robin_Watts | i.e. we still want to extract everything and not get sucked into the 'oh, the whole page is a chart' thing. | 14:59.53 |
tor8 | so basically I think we could render the page to a nominal resolution, and then exclude the text we extract from the page and the rest can get segmented out and extracted as images | 15:00.11 |
Robin_Watts | tor8: Right, except if we do that, we end up extracting all the textual labels on charts as text. | 15:00.37 |
| which is wrong. | 15:00.42 |
| I think we should spot charts by area, and not extract the text from them. | 15:01.10 |
tor8 | and then detect areas where text is too close to non-text (such as tables and overlays) and drop those from the text device | 15:01.13 |
| so blocks of text without line art or images nearby will make it to text, and everything else gets rendered | 15:01.45 |
Robin_Watts | tor8: yeah. | 15:02.02 |
tor8 | since everything in the text device has a bbox, filtering out chars/spans that intersect graphs should be easy | 15:02.23 |
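The bbox filtering described above could look roughly like the following. It is a sketch only, in Python rather than MuPDF's C; the `(x0, y0, x1, y1)` tuples, the `"bbox"` key and both function names are assumptions made for the illustration, not the text device's actual structures.

```python
def intersects(a, b):
    """True if rectangles a and b, given as (x0, y0, x1, y1), share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def keep_plain_text(spans, graph_rects):
    """Drop every span whose bbox touches one of the graph rectangles.

    spans is a list of dicts with a "bbox" entry; graph_rects is a list
    of rectangles that were classified as charts or line art.
    """
    return [
        span for span in spans
        if not any(intersects(span["bbox"], g) for g in graph_rects)
    ]
```

This is a linear scan per span, so it only stays cheap while the number of graph rectangles per page is small.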
| speaking of RTL, our current text extraction doesn't do an RTL fixup pass. we ought to fix that for copy/paste and search. | 15:02.58 |
Robin_Watts | tor8: Yes, I have an idea for another data structure that should help with this. I am writing an email about it now. | 15:04.46 |
| (or maybe it's just a different view onto the data structures that we already have - we can talk about it) | 15:05.02 |
henrys | catching up on the reading. I'll be traveling today - picked a bad day to take a day off... | 15:15.53 |
Robin_Watts | henrys: I suspect we'll be bashing the list around for a while before it's complete. | 15:16.23 |
| The key thing is to get your opinion on whether trying to see if we can get a reflow prototype in the android app within a couple of days is worthwhile. | 15:17.10 |
| I think (in the absence of any showstoppers) it should be feasible to get something working, and it won't detract from the preparation of the quote. | 15:17.39 |
chrisl | Robin_Watts: did you get very_sleepy to profile cust 532's simulator? | 15:17.53 |
henrys | it is if it doesn't slow down the other part. | 15:17.59 |
Robin_Watts | And it makes our quote much more compelling that we're not starting from nothing. | 15:18.01 |
| henrys: I'm sure it won't. | 15:18.15 |
| chrisl: I did. | 15:18.24 |
| but I had problems with very sleepy, so ended up using the VS2008 Team edition profiler. | 15:18.46 |
henrys | Robin_Watts:also we are bound to learn something from doing it that will affect the todo list. | 15:18.52 |
Robin_Watts | henrys: indeed. | 15:18.59 |
chrisl | Robin_Watts: Can you remember how? I'm only getting data for the main thread, not the rip thread | 15:19.09 |
Robin_Watts | chrisl: With what? VS or VS? ;) | 15:19.26 |
kens2 | henrys did you have a 'feature list' for the xpsprint thingy ? | 15:19.35 |
Robin_Watts | old versions of Very Sleepy didn't support threads. | 15:19.44 |
chrisl | Robin_Watts: With very_sleepy | 15:19.47 |
kens2 | I have something that, when given a printer name and an XPS file, will print it | 15:19.52 |
chrisl | This is the latest version - downloaded today | 15:20.07 |
Robin_Watts | Are you launching a new process from VSl or attaching to a running one ? | 15:20.21 |
chrisl | I did it from the command line | 15:20.35 |
henrys | kens: no, but I was going to check that code in next week; I didn't want to put it in before the release. | 15:21.12 |
Robin_Watts | Try running Very Sleepy then using 'File -> Launch' ? | 15:21.15 |
henrys | kens2 ^^^ | 15:21.34 |
| why? | 15:21.39 |
kens2 | because you assigned it to me at last meeting | 15:21.53 |
chrisl | Robin_Watts: I had problems with that because the sim depends on some custom environment variables | 15:22.11 |
henrys | kens2:oh that's a good reason | 15:22.30 |
| let me draw up some list of stuff | 15:22.42 |
kens2 | :) | 15:22.47 |
Robin_Watts | chrisl: So modify the setsim.bat to be setsleepysim.bat. | 15:23.06 |
| To launch very sleepy rather than msvc with the same env vars. | 15:23.27 |
chrisl | Robin_Watts: actually do you have a paid for VS2010? | 15:23.38 |
Robin_Watts | I do. | 15:23.55 |
henrys | you know I have google alerts for Ghostscript, Adobe, Mupdf, a few other things and XPS - I get email notices for everything except XPS, nobody talks about it. Is it alive? | 15:24.01 |
kens2 | barely breathing, but it *is* at the heart of Windows 8, more so than Windows 7 | 15:33.06 |
| Microsoft really push you at it, but as long as they have a legacy path, I think nobody is interested, and if they don't have a legacy path, nobody will buy it.... | 15:34.08 |
chrisl | kens2: just saw the bug about 64 bit vb - wasn't there something similar a couple of months ago that turned out to be down to using the wrong calling convention, or something? | 15:56.42 |
kens2 | chrisl could be, I don't recall, but I know almost nothing about VB | 15:57.00 |
chrisl | I don't think the previous one was VB, maybe .net or something - I'm not having much luck searching bugzilla, though | 15:57.37 |
kens2 | I vaguely recall something similar but I think it was C | 15:58.06 |
Robin_Watts | pops to postoffice. | 16:04.53 |
Robin_Watts | sends wife to post office instead. much better. | 16:10.34 |
henrys | what does one do at the post office in 2013? | 16:22.09 |
kens2 | send letters ? | 16:22.25 |
chrisl | Could be sending something to me..... | 16:22.44 |
Robin_Watts | "Never underestimate the bandwidth of a station wagon full of tapes" | 16:22.51 |
henrys | ;-) | 16:23.02 |
Robin_Watts | bandwidth of post > bandwidth of ADSL in some cases. | 16:23.19 |
| tor8: Are you editing my list of text extraction issues etc to include yours/ correct mine? | 16:25.40 |
| paulgardiner: Could you spot anything in there that I'd missed? | 16:26.05 |
paulgardiner | Not really. The only thoughts I had were that someone might wish to be mostly viewing in normal mode and every now and then see a reflowed version of the current screen contents. Also that you get a lot of benefits even if not running pages together and removing headings and page numbers. | 16:28.35 |
Robin_Watts | paulgardiner: Right. We should possibly separate page mode and non-page mode into different task lists. | 16:29.27 |
paulgardiner | I'd guess the main call for reflow is when the text is too small with complete lines visible. Having page numbers still present might even be advantageous | 16:29.36 |
Robin_Watts | Also, maybe we should allow people to select a section and then say "reflow this" ? | 16:29.48 |
paulgardiner | Robin_Watts: yes, that would be good. | 16:29.49 |
| I'm less clear how the unpaged mode would work in the app, but I'm sure we can suss it out | 16:30.15 |
henrys | It probably would be useful to see what foxit is doing, does anyone have a copy? | 16:36.48 |
Robin_Watts | I do not. | 16:36.57 |
mvrhel_laptop | the native google app does reflow | 16:38.14 |
Robin_Watts | native google app where? | 16:41.10 |
| on android? in chrome? | 16:41.22 |
mvrhel_laptop | uh the one that comes with the nexus | 16:41.43 |
Robin_Watts | AIUI there is no 'standard' PDF app for android. every device is different. | 16:41.47 |
henrys | yes there is when you install mupdf you get to pick which app to run | 16:42.08 |
Robin_Watts | right. The nexus might have a PDF capable app built in, but that isn't standard across devices. | 16:42.47 |
mvrhel_laptop | it is on google nexus device Robin_Watts | 16:42.58 |
Robin_Watts | I don't have the same choices on the transformer or on my phone. | 16:43.01 |
mvrhel_laptop | not general android devices | 16:43.30 |
Robin_Watts | so what's the name of this app that's standard on nexus' ? | 16:44.02 |
paulgardiner | Polaris Office maintains page boundaries | 16:44.33 |
mvrhel_laptop | hold on let me see what it is called | 16:45.10 |
paulgardiner | Adobe Reader has mapped "Reflow text" to "make unusable" | 16:46.11 |
Robin_Watts | Repligo's is pretty horrid. | 16:48.27 |
mvrhel_laptop | Robin_Watts: so it is called Quickoffice HD | 16:48.44 |
Robin_Watts | perfect, thanks. | 16:48.52 |
mvrhel_laptop | Not sure if you can get it at the android store or not | 16:49.01 |
Robin_Watts | paulgardiner: Can you get polaris office to reflow at all? | 16:50.41 |
paulgardiner | Polaris Office seems to reflow fine for me | 16:51.11 |
| Well, on the one doc I tried | 16:51.23 |
henrys | well I'm off check in later this evening | 16:59.38 |
Robin_Watts | have a nice trip. | 17:06.59 |
mvrhel_laptop | bye henrys | 17:07.27 |
Robin_Watts | I wonder if it's worth us asking Raph if their interest is in "Reflow this page" or "Reflow this entire document" ? | 17:08.33 |
henrys | include that in the roadmap and this evening I'll bring raph on to the roadmap email recipient list. Or you guys can do it if you want. I just thought once we had something reasonably coherent he should be included. | 17:14.00 |
mvrhel_laptop | Robin_Watts: so the native one only does the page | 17:14.08 |
| nexus native that is | 17:15.19 |
Robin_Watts | mvrhel_laptop: Thanks. | 17:15.51 |
tor8 | Robin_Watts: http://tronche.com/gui/x/xlib/utilities/regions/ | 17:45.27 |
Robin_Watts | As a possible implementation for MarkedRects or UnmarkedRects you mean? | 17:46.00 |
| Do you have an opinion on whether MarkedRects/UnmarkedRects are useful? I had a discussion with Paul on the phone and I think I convinced him. | 17:47.07 |
paulgardiner | Yes you did | 17:47.33 |
Robin_Watts | I don't think the implementation for MarkedRects/UnmarkedRects is hard. | 17:47.47 |
tor8 | I've done a region implementation once, ages ago. I think it'd be useful to have, as per your mail. | 17:48.40 |
| but I'd pick 3 different "region" objects, one for each type of operation. | 17:48.59 |
| region as I implemented was just a list of rects that got merged if possible when adding new rects to the region so the rects in the list would never overlap | 17:49.50 |
Robin_Watts | Suppose I have line art, and text and an image that all overlay the same area. What goes in the structure ? | 17:50.09 |
tor8 | three structs, one for each type | 17:50.21 |
| then you can check for overlapping areas of different types | 17:50.46 |
| or did you have something else in mind? | 17:50.51 |
Robin_Watts | I was imagining that MarkedRect would basically be a set of struct { fz_rect rectangle, int flags} | 17:51.18 |
tor8 | right. that'd work too, I think. | 17:51.33 |
Robin_Watts | where flags & 1 => text is here, flags & 2 => line art is here, flags & 4 => image is here | 17:51.47 |
tor8 | it'd be less obvious how to implement the standard region functions, since you'd have to avoid merging rects with different flags | 17:52.16 |
Robin_Watts | Surely we'd want to merge the rects ? | 17:54.25 |
tor8 | well, you want accurate flags for the areas | 17:54.43 |
| so if you have an old rect with text and add a lineart that partially overlaps, we'd want three out | 17:54.58 |
| text, text+lineart, lineart | 17:55.06 |
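The "three out" behaviour (text, text+lineart, lineart) can be sketched with a flagged region that never lets its rectangles overlap. This is an illustration in Python, not an existing MuPDF structure; the flag values follow the `flags & 1`, `flags & 2`, `flags & 4` convention from the discussion, while `intersect`, `subtract` and `add_rect` are hypothetical helpers.

```python
TEXT, LINEART, IMAGE = 1, 2, 4

def intersect(a, b):
    """Overlap of two (x0, y0, x1, y1) rects, or None if they share no area."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    return (x0, y0, x1, y1) if x0 < x1 and y0 < y1 else None

def subtract(a, b):
    """Parts of a not covered by b, as up to four non-overlapping rects."""
    i = intersect(a, b)
    if i is None:
        return [a]
    parts = []
    if a[1] < i[1]:
        parts.append((a[0], a[1], a[2], i[1]))  # full-width band above
    if i[3] < a[3]:
        parts.append((a[0], i[3], a[2], a[3]))  # full-width band below
    if a[0] < i[0]:
        parts.append((a[0], i[1], i[0], i[3]))  # strip to the left
    if i[2] < a[2]:
        parts.append((i[2], i[1], a[2], i[3]))  # strip to the right
    return parts

def add_rect(region, rect, flags):
    """Add rect to a list of non-overlapping (rect, flags) entries.

    Where rect overlaps an existing entry the overlap carries both sets
    of flags; the uncovered remainders keep their original flags.
    """
    result = []
    pending = [rect]  # pieces of rect not yet claimed by an old entry
    for old, old_flags in region:
        overlap = intersect(old, rect)
        if overlap is None:
            result.append((old, old_flags))
            continue
        result.extend((part, old_flags) for part in subtract(old, overlap))
        result.append((overlap, old_flags | flags))
        pending = [q for p in pending for q in subtract(p, old)]
    result.extend((part, flags) for part in pending)
    return result
```

Nothing here merges neighbouring rectangles that end up with equal flags, so a real region implementation would want a coalescing step to keep the list short.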
Robin_Watts | I'd have thought we wanted to make the whole area text + lineart. | 17:55.19 |
| Imagine a graph with text labels on the Y and X axis, and then some lineart axes. | 17:55.57 |
tor8 | then a background image or "WATERMARK" image would infect all the text on the page | 17:56.05 |
Robin_Watts | we want the whole chart region to be 'text + lineart'. | 17:56.11 |
| Right. So we deliberately spot and exclude background images. | 17:56.25 |
| and watermarks :) | 17:56.38 |
tor8 | I'm thinking given accurate region flagging, analysing and categorising what you want to extract would be easier from that in a separate step | 17:56.54 |
Robin_Watts | Sending the text from the chart and the line art separately would be bad. | 17:57.02 |
tor8 | so you would see that the text regions from the chart overlap the lineart | 17:57.23 |
| or lie close enough depending on the font size | 17:57.33 |
Robin_Watts | tor8: I am not convinced yet, either way. I could be swayed either way with a decent argument. | 17:58.13 |
| but something in this ballpark sounds correct to me. | 17:58.26 |
tor8 | we want a region implementation for performance reasons, which will merge rects to make future insertions and lookups faster | 17:58.47 |
| or we'll hit O(n^2) | 17:58.57 |
| and we'll definitely want to check text for overlap and find contiguous regions of both text and lineart to extract into blocks or as images | 17:59.31 |
| so I think a good first step would be creating regions of text and images from a device | 17:59.48 |
| I think it'd be easier to have three separate region sets, one for each type | 18:00.03 |
| but that's just a gut feeling without having actually implemented any algorithms on it | 18:00.22 |
Robin_Watts | tor8: I was proposing that we put the rectangles in from 'blocks'. So there won't be that many on a page, and they will tend to be non overlapping. | 18:00.26 |
tor8 | Robin_Watts: right. that could do for a start, but line art will be different. each path will create its own rect there. | 18:01.05 |
Robin_Watts | Line art and stuff, I was imagining we'd union the regions if they overlap. | 18:01.11 |
tor8 | L shaped graphics would hurt us there | 18:01.27 |
| like borders around a page | 18:01.34 |
Robin_Watts | but that may be bad.... indeed. | 18:01.35 |
| Yes, ok, so scratch that. | 18:01.41 |
| but I don't think we'll end up with stupidly many regions on a page. | 18:02.07 |
tor8 | we could snap the rects to a relatively coarse grid | 18:02.09 |
| say 12pt or something similar | 18:02.24 |
Robin_Watts | do a 'fuzzy' union of rects. | 18:02.29 |
| something like that. | 18:02.39 |
tor8 | Robin_Watts: yeah. I'd say implement regions and make the merges fuzzy | 18:02.55 |
Robin_Watts | that sounds better than the coarse grid to me, yes. | 18:03.12 |
tor8 | but we're in agreement on a general road forward at least | 18:03.12 |
Robin_Watts | yeah. | 18:03.16 |
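One possible reading of a "fuzzy" merge is sketched below: a new rectangle absorbs any existing rectangle that lies within a tolerance of it, and the result is their bounding box. This is an assumption about what was meant, written in Python with hypothetical names (`near`, `union`, `add_fuzzy`); because it takes bounding boxes it still swallows the inside of an L-shaped graphic or a page border, which is the concern raised earlier in the discussion.

```python
def near(a, b, fuzz):
    """True if rects a and b overlap or lie within fuzz of each other on both axes."""
    return (a[0] - fuzz <= b[2] and b[0] - fuzz <= a[2] and
            a[1] - fuzz <= b[3] and b[1] - fuzz <= a[3])

def union(a, b):
    """Bounding box of two (x0, y0, x1, y1) rects."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def add_fuzzy(region, rect, fuzz):
    """Insert rect into a list of rects, merging everything within fuzz of it.

    Merging can grow the rectangle enough to reach others, so the scan
    repeats until a full pass makes no further merge.
    """
    merged = rect
    rest = list(region)
    changed = True
    while changed:
        changed = False
        keep = []
        for r in rest:
            if near(merged, r, fuzz):
                merged = union(merged, r)
                changed = True
            else:
                keep.append(r)
        rest = keep
    return rest + [merged]
```

The tolerance is left as a required parameter on purpose; tying it to the local font size, as suggested for the chart labels, would be one way to choose it.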
tor8 | if we have a good region implementation, that could be useful to create the "blocks" level of text extraction | 18:03.57 |
| but what we have now generally works okay, we don't have to rewrite that when there's so much else we need to do | 18:04.27 |
Robin_Watts | yeah, the overall shape of the algorithm I have in mind builds on what we have. | 18:04.47 |
tor8 | looking at the blocks to find and split out paragraphs would be a good first step | 18:04.58 |
Robin_Watts | and it should allow us to code something 'simple' and yet to have the scope to expand it to add more heuristics/smartness later. | 18:05.24 |
| I'm trying to sketch it out now. | 18:05.32 |
tor8 | take the median of the left column (and look at the standard deviation in case of right/centered/left adjusted text) and find deviations | 18:05.32 |
Robin_Watts | median what of the left column ? | 18:05.51 |
| median line gap? | 18:06.52 |
mvrhel_laptop | bbiaw | 18:07.57 |
tor8 | left column | 18:16.01 |
| or min of the left column | 18:16.09 |
Robin_Watts | median minx position of the left column? | 18:16.27 |
tor8 | that was my thought, yes | 18:17.06 |
| should get the left coordinate of the column | 18:17.20 |
| then we can find first-line indents | 18:17.25 |
Robin_Watts | My sketch here has a step "Attempt to detect paragraphs within blocks. If we find them, split so that blocks represent paragraphs rather than columns." | 18:17.40 |
tor8 | and median line height should let us discover paragraph breaks with blanks between | 18:17.45 |
Robin_Watts | I hadn't gone into details, but yes, I think something like you describe should work. | 18:18.02 |
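The paragraph heuristics just described (median left edge for first-line indents, median line height for blank-line breaks) could be sketched like this. It is a Python illustration under stated assumptions, not MuPDF code: lines are `(x0, y0, y1, text)` tuples sorted top to bottom with y growing downward, and `split_paragraphs`, `indent_tol` and `gap_factor` are made-up names and thresholds.

```python
from statistics import median

def split_paragraphs(lines, indent_tol=2.0, gap_factor=1.8):
    """Split one block of lines into paragraphs.

    A line starts a new paragraph if it is indented past the block's
    median left edge, or if it sits more than gap_factor median line
    heights below the top of the previous line (a blank line between).
    """
    if not lines:
        return []
    left = median(line[0] for line in lines)
    height = median(line[2] - line[1] for line in lines)
    paragraphs = [[lines[0]]]
    for prev, cur in zip(lines, lines[1:]):
        indented = cur[0] - left > indent_tol
        blank_before = cur[1] - prev[1] > gap_factor * height
        if indented or blank_before:
            paragraphs.append([cur])
        else:
            paragraphs[-1].append(cur)
    return paragraphs
```

The sketch assumes left-aligned or justified text; centred or right-aligned blocks would need the standard-deviation check mentioned above before the indent test can be trusted.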
tor8 | Robin_Watts: yeah. we could insert those as post processing steps when closing the text device. | 18:18.09 |
Robin_Watts | that's what I have, yes. | 18:18.18 |
tor8 | and also add justification to the styles | 18:18.28 |
| I'd make a separate pass for every bit that needs detecting, easier to toggle things on and off later if we need/want to | 18:19.01 |
Robin_Watts | tor8: Yes, each step in this sketch is hopefully separable. | 18:19.26 |
Robin_Watts | sends sketch email to tor8/paulgardiner for demolition :) | 18:43.23 |
tor8 | whole page images should go to a separate place | 18:45.02 |
| for say scanned documents | 18:45.07 |
Robin_Watts | tor8: I think that if you attempt to reflow a scanned document and you get a blank page... then that's fair game. | 18:45.31 |
tor8 | or a cover page | 18:45.53 |
| should render a screen sized version and drop into the doc, IMO | 18:46.16 |
malc_ | tor8: whole bbox/irect/rect thing is strange: a) union_bbox disappeared, b) bbox_device doesn't take bbox but a rect.. | 18:46.22 |
Robin_Watts | malc_: Be aware that there are more API changes to come before the release, I think. I hope that we'll move to pass by reference rather than value for matrixes and rects. | 18:47.38 |
malc_ | Robin_Watts: why? (do you hope that, that is) | 18:48.42 |
tor8 | malc_: bboxes should only be used for pixmaps | 18:48.44 |
Robin_Watts | malc_: Cos I've written the patch, and it's awaiting review. | 18:48.58 |
malc_ | Robin_Watts: let me rephrase, what's the benefit? | 18:49.15 |
Robin_Watts | It gives noticeable speedups on embedded systems, and doesn't hurt on windows. | 18:49.16 |
malc_ | ah | 18:49.21 |
| (couldn't care less about windows though) | 18:49.31 |
Robin_Watts | s/windows/x86/ | 18:49.39 |
malc_ | couldn't care less about x86 either :) | 18:49.55 |
Robin_Watts | specifically it helps on ARM. I tested on beagleboard and saw a 5-10% increase in speed, but I suspect it'll be even better on lower ARMs. | 18:50.46 |
malc_ | Robin_Watts: gcc sucks that much when passing structs by value? weird (then again not familiar with the ABI) | 18:51.18 |
Robin_Watts | More testing is required to know if those figures are entirely representative. | 18:51.24 |
| malc_: It's not gcc's fault. | 18:51.33 |
malc_ | Robin_Watts: what then? | 18:51.45 |
Robin_Watts | (though any question that begins "does gcc suck..." is answered by "yes" when talking about ARM) | 18:52.06 |
malc_ | on to the greener pastures (hills) then | 18:53.11 |
Robin_Watts | malc_: If you pass a matrix by value that's 6 FP load/stores on every call, then assuming you pass it on into a function to do something with it, (like a matrix multiplication say) that's another 6 stores | 18:54.17 |
tor8 | malc_: passing matrices by reference also allows a reduction in the number of matrix multiplications | 18:54.22 |
Robin_Watts | loads/stores even. | 18:54.26 |
| and what tor8 said. | 18:54.33 |
| It reduces the number of fp ops overall. | 18:54.47 |
| tor8: At the moment, what I'd sketched there was only for single page operation. | 18:55.32 |
| For multiple page operation, we'd need to modify that a bit, and at that point cover pages may become more of an issue. | 18:56.07 |
tor8 | Robin_Watts: we could do multiple pages by stacking coordinate spaces | 18:56.36 |
| might hit fp precision issues on hundred-page-plus documents though | 18:57.14 |
malc_ | tor8: sorry, don't see how multiplication fits into this | 18:59.05 |
Robin_Watts | Stacking would be bad. Imagine 2 pages with 2 columns on each page. | 18:59.52 |
| flow should not proceed down across the pages. | 19:00.08 |
| multiple pages 'just' requires us to solve: headers/footers (e.g. page numbers, titles being spotted and removed), and joining paragraphs across the page divide. | 19:01.04 |
| I strongly suspect that we don't need multiple pages at least initially. | 19:01.35 |
| no one else seems to do that, so as long as we at least don't rule it out from future consideration, I think we're OK. | 19:02.07 |
Gustavo | Hi.....I am wondering if it is possible with Ghostscript to read the coordinates of a graphic object such as Rectangle (for instance re), Line etc.... | 19:34.08 |
tor8 | Robin_Watts: right. | 19:48.55 |