
Dev diary #2 - Introducing Chantal

2022 was so bad in terms of junk emails and noise that I started the Virtual Secretary, a Python framework for writing intelligent email filters that cross-reference several sources of information to guess what incoming emails are and whether they are important or urgent. By junk emails, I also mean GitHub notifications, pings on pixls.us (thank God I closed my account on that stupid forum), YouTube, and direct emails from people hoping to get some help in private.

I have become "the face" of darktable, mostly because I'm one of the few who bother providing user education and training instead of just pissing code. I didn't see that coming, and I wasn't prepared. A lot of people now mistake me for the front desk, which doesn't help abstract thinking on coding matters, let alone taking the time to actually produce art. The problem is that all the time lost dealing with info/noise/input is not spent solving problems, and time is the only thing for which you cannot get a refund.

After a while, I figured it would be nice to extend the Virtual Secretary with a machine-learning classifier that would guess which folder incoming emails should go into, by learning from the content of the emails already in that folder. It's actually much easier to implement than I thought, but the time-consuming bit is writing text filters to clean up the input (because garbage in, garbage out, especially for spam emails, which are generally improperly formatted).
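To give an idea of what such cleaning filters and a folder guess can look like, here is a minimal sketch using only the standard library. It is not the Virtual Secretary code: the regular expressions, the function names (`clean_email_body`, `guess_folder`) and the bag-of-words cosine similarity are all assumptions made for illustration, standing in for the real filters and classifier.

```python
import html
import math
import re
from collections import Counter

# Hypothetical cleaning filters; the patterns are illustrative guesses.
_QUOTED = re.compile(r"^[ \t]*>.*$", re.MULTILINE)  # quoted reply lines
_TAG = re.compile(r"<[^>]+>")                       # HTML tags
_URL = re.compile(r"https?://\S+")                  # links
_SPACES = re.compile(r"\s+")


def clean_email_body(raw: str) -> str:
    """Reduce a raw email body to lowercase plain text."""
    text = _QUOTED.sub(" ", raw)
    text = _TAG.sub(" ", text)
    text = html.unescape(text)  # decode entities such as &amp;
    text = _URL.sub(" ", text)
    return _SPACES.sub(" ", text).strip().lower()


def bag_of_words(raw: str) -> Counter:
    """Count the words of a cleaned email body."""
    return Counter(clean_email_body(raw).split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def guess_folder(raw: str, folders: dict) -> str:
    """Return the folder whose existing emails look most like `raw`.

    `folders` maps a folder name to the list of email bodies it holds.
    """
    query = bag_of_words(raw)
    profiles = {
        name: sum((bag_of_words(body) for body in bodies), Counter())
        for name, bodies in folders.items()
    }
    return max(profiles, key=lambda name: cosine(query, profiles[name]))
```

Cleaning comes first on purpose: without it, markup and tracking links would dominate the word counts and the folder profiles would mostly describe formatting noise.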

But the ultimate goal, in my wildest dreams, was to build an autoresponder for people asking questions already answered on one of the many websites I have contributed to over the years. It's a constant frustration to see that all the pages of doc I have written over the years are lost in Internet limbo. On FLOSS-centric forums, benevolent guys also tend to experience the same kind of fatigue: repeating the same info again and again, linking the same pages, to never-ending hordes of newbies who don't know what to look for. Just look at the darktable Reddit: every 14 days, someone else asks why the lighttable thumbnails don't look like the darkroom preview. Even setting aside the frustration and anger, the number of man-hours lost in repeating is staggering. Just because information is lost.

The true problem with search engines is that you need to know what keywords to look for. Which circles back to the fact that newbies don't know the slang. So they don't know what to look for. They don't have any entry point into the matrix. Except other humans. Which sucks for the ones having to do the work, usually for free.

After merging a neural layer of word2vec word embedding (big words to say it's unsupervised machine learning that finds how words are contextually related in sentences, that is, finding syntactic structures, synonyms and the like) as a first step in my email classifier (which is now up to 92% accuracy), I wondered whether this could be used to build a context-aware and synonym-aware search engine, able to look past exact keywords.
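The idea of searching past exact keywords can be sketched as follows. This is only an illustration under loud assumptions: the 3-dimensional vectors in `EMBEDDINGS` are hand-written toy values standing in for embeddings that word2vec would learn from a real corpus, and averaging word vectors is one simple way to compare a query with a document, not necessarily the one used here.

```python
import math

# Toy vectors (made up) standing in for trained word2vec embeddings.
EMBEDDINGS = {
    "thumbnail": (0.9, 0.1, 0.0),
    "preview": (0.8, 0.2, 0.1),
    "lighttable": (0.7, 0.3, 0.0),
    "email": (0.0, 0.1, 0.9),
    "spam": (0.1, 0.0, 0.8),
}


def embed(text):
    """Average the vectors of the known words in `text`, or None if none."""
    vectors = [EMBEDDINGS[w] for w in text.lower().split() if w in EMBEDDINGS]
    if not vectors:
        return None
    return tuple(sum(axis) / len(vectors) for axis in zip(*vectors))


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query, documents):
    """Rank documents by similarity to the query in embedding space."""
    q = embed(query)
    if q is None:
        return []
    scored = []
    for doc in documents:
        d = embed(doc)
        if d is not None:
            scored.append((cosine(q, d), doc))
    return [doc for _, doc in sorted(scored, reverse=True)]
```

With these toy values, a query for "thumbnail" ranks a document mentioning only "lighttable" and "preview" above one about spam emails, although the word "thumbnail" appears in neither.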

Turns out a couple of guys from Bing had the same idea in 2016 and published their maths, so I implemented them. Then I proceeded to add a web interface on top. That gave birth to Chantal, the AI you are kindly asked to bother before bothering me. The current version is trained against 101,000 internet pages from my own websites, darktable & Ansel docs, along with some reliable color-science resources. It indexes 15,500 pages in French and English and can process search queries in either or both of these languages. One of its main features is to suggest a list of keywords associated with your query, so you can refine/reorient/try things you wouldn't have thought of before.
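Keyword suggestion of this kind can be approximated by a nearest-neighbour lookup in the embedding space. The sketch below is an assumption about how such a feature could work, not Chantal's code and not the published maths: `VECTORS` holds made-up toy values and `suggest_keywords` is a hypothetical name.

```python
import math

# Toy vectors (made up) standing in for trained word embeddings.
VECTORS = {
    "thumbnail": (0.9, 0.1, 0.0),
    "preview": (0.85, 0.15, 0.0),
    "lighttable": (0.7, 0.3, 0.1),
    "email": (0.0, 0.1, 0.9),
    "spam": (0.1, 0.0, 0.8),
}


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def suggest_keywords(query, top_n=3):
    """Return the `top_n` known words closest to the query, excluding its own words."""
    words = [w for w in query.lower().split() if w in VECTORS]
    if not words:
        return []
    # Centroid of the query words in embedding space.
    centroid = [sum(axis) / len(words) for axis in zip(*(VECTORS[w] for w in words))]
    candidates = [w for w in VECTORS if w not in words]
    candidates.sort(key=lambda w: cosine(centroid, VECTORS[w]), reverse=True)
    return candidates[:top_n]
```

A newbie typing "thumbnail" would be offered "preview" and "lighttable" first, which is the kind of slang they did not know to search for.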

Hope that helps.

That work showed me how poorly indexable many websites are. To account for the lack of an XML sitemap on forums.darktable.fr and color.org, I had to write a recursive crawler. But even then, many pages have neither description meta tags nor a proper date tag. It means you need to use regular expressions and indirect methods to try to identify the metadata, and manually tune the HTML parser to extract the actual content part of the webpage (discarding sidebars, menus, asides and advertising, if any).
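As a rough illustration of that workflow, here is a minimal recursive crawler built on the standard library. Everything specific in it is an assumption: the list of skipped tags, the ISO-like date regex used as a fallback, the 80-character description fallback, and the `fetch` callable (injected so the sketch can run against an in-memory site instead of the network).

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed list of containers that hold no actual content.
SKIPPED = {"nav", "aside", "header", "footer", "script", "style"}
# Fallback when no date tag exists: first ISO-like date found in the text.
DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


class PageParser(HTMLParser):
    """Collect links, the meta description and the text outside SKIPPED tags."""

    def __init__(self):
        super().__init__()
        self.links, self.text, self.description = [], [], None
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in SKIPPED:
            self._skip_depth += 1
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])  # menu links are still followed
        elif tag == "meta" and attrs.get("name") == "description":
            self.description = attrs.get("content")

    def handle_endtag(self, tag):
        if tag in SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text.append(data.strip())


def crawl(url, fetch, seen=None):
    """Recursively visit same-host pages; `fetch(url)` returns HTML or None."""
    seen = {} if seen is None else seen
    if url in seen:
        return seen
    page = fetch(url)
    if page is None:
        return seen
    parser = PageParser()
    parser.feed(page)
    content = " ".join(parser.text)
    date = DATE.search(content)
    seen[url] = {
        "description": parser.description or content[:80],
        "date": date.group(0) if date else None,
        "content": content,
    }
    for href in parser.links:
        target = urljoin(url, href).split("#")[0]  # drop fragments
        if urlparse(target).netloc == urlparse(url).netloc:
            crawl(target, fetch, seen)
    return seen
```

On a real forum, plain recursion would hit Python's recursion limit quickly, so an explicit queue and some rate limiting would be needed; the sketch only shows the shape of the problem.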

Then, you get to love Q&A forums like Stack Overflow, where proper questions start a thread, proper answers follow, and the best answers are selected by the community. "Thank you" and "me too" messages are explicitly forbidden in the conditions of use. On forums like pixls.us or forums.darktable.fr, proper technical information gets lost in the middle of semi-technical rambling, life stories, bros bonding over tales of software and gross misunderstandings of color theory, in a continuous thread where nothing distinguishes relevant from irrelevant or accurate from inaccurate. From a machine-crawling perspective, there is very little to exploit here, and investing time in such platforms is a dead loss.

Comments (2)
    • I agree with most of it and understand you; all of us have experienced those problems.

      We will see how this AI project evolves and whether it helps to find info. I am not much of a fan of it; I have not had good experiences with that kind of assistant.

      As a user, I have the same problem from the opposite side.

      It is not easy to find good info about darktable or other subjects; too much info is about as bad as no info.

      If you let me offer some advice, I would say the way to go is a good book about Ansel.

      I know it is hard to write a good book, but you have the ability for it, and a book can be sold easily.

      I know you have provided a lot of written info and many video tutorials, but it lacks organization for a newbie. That is great if you want to know how a module works, but not as good for someone who wants to learn how to use Ansel and get good results.

      Forums lack good organization and persistence, as good info soon gets lost under lots of new info.

      The book should cover all aspects, from basic install to configuration, import and development, from a photographer's point of view and not from a color expert's, with examples of different kinds of photos.

      Overly technical and detailed info would be better linked to your original articles.

      Maybe you need help from someone who has good in-depth knowledge of Ansel/DT as a photographer.

      You have enough material and the teaching ability to write it.

      That would be a great source of organized info and maybe a good source of income.

      • I have written 2 books in my life, and the number of hours to put in is insane. For topics like software, where the life expectancy of the information is so short, it's not worth it… A book would get outdated too soon.

        Have you checked the growing "workflow" section on the website?
