
Introducing spellchecking

Simon Prévost · 5 min read

Spellchecking can be simple but useless, or complex but effective.

Adding a spellchecker had been one of my long-term tasks for Accent. Since Accent is a tool built for developers, it is often they (us) who write the texts and translations. It is therefore important to reduce the chance of typos or grammatical errors slipping into our projects without using an external tool or relying on a browser extension.


Here is the path I took to add this feature to Accent.

Goals

Must be right 🔎

We want valid corrections and as few false positives as possible. We also want a corrector that flags errors the user actually cares about.

Must be quick ⚡️

We want to run the corrector on a single text, but also on an entire project containing 1,000 strings. We don't want feedback in an hour; we want it to feel instantaneous.

Must be integrated into Accent 💻

That means free, without additional configuration.

Research

As a developer, my first instinct was to search for existing solutions. Implementing a spellchecker from scratch was not on my to-do list.

Hunspell

A corrector based on a custom dictionary format that provides suggestions, mainly for unknown words. It's the tool used by browsers (and operating systems) to underline misspelled words with the classic red squiggly line.

  • πŸ”Ž Lots of false positives, only individual word corrections so quite limited and useless for the majority of the text in Accent
  • ⚑️ Very quick. You can launch a hunspell process which listens to stdin and returns the corrections in stdout. Easy interface for an Elixir program. More on that later.
  • πŸ’» Easy, we bundle hunspell in the Docker image and everything is done.

Observation: The usefulness is quite limited given hunspell's very basic features. The same goes for similar tools like enchant, ispell, and aspell.

AI

In search of the ultimate spellchecker, I of course ventured into the world of artificial intelligence.

  • πŸ”Ž Very complete in correcting mistakes. Even to explain mistakes. Unbeatable for the quality of information.
  • ⚑️ Unless you are running Accent on a machine with an extraordinary GPU, correcting 1000 strings at the same time is not really viable for an AI. 😬
  • πŸ’» Very complex to have something usable. We don't want to depend on OpenAI even if it improves the "speed" aspect.

Observation: Speed is too big a barrier to the adoption of artificial intelligence for this task.

In addition, the AI assistant feature already exists in Accent, so users can have a "correction" shortcut integrated with OpenAI themselves ✨ 😀.

LanguageTool

Looking for more complete solutions, Google directed me to LanguageTool and Grammarly. The latter is proprietary software, so we'll pass. LanguageTool is open source, with an easy way to run it as a server or from the command line! It supports multiple languages too!

Now I'm excited!

  • πŸ”Ž Very complete in correcting mistakes. The way LanguageTool works is essentially a series of "rules" written by a human. Given the quantity of "rules" per language we can almost speak of artificial intelligence.
  • ⚑️ Very fast even with a large amount of rules. 1ms to validate a text with the HTTP server. We also don't want to depend on a private HTTP API.
  • πŸ’» Running a separate HTTP server is not really possible and the command-line takes a long time to initialize for each validation (500ms) 😭

Observation: The solutions LanguageTool offers out of the box do not integrate well with Accent.

Solution

I knew LanguageTool was the best solution for Accent. But the fact that it is written in Java was a bit of a barrier to using it easily from our Elixir code.

Behold documentation! -> https://dev.languagetool.org/java-api
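Following that documentation, the core Java API is pleasantly small. Here is a minimal sketch of what it exposes, assuming the languagetool-core jar and a language module are on the classpath (this won't compile without those dependencies; the sample sentence is illustrative):

```java
import java.util.List;

import org.languagetool.JLanguageTool;
import org.languagetool.language.French;
import org.languagetool.rules.RuleMatch;

public class CheckExample {
    public static void main(String[] args) throws Exception {
        // Constructing a JLanguageTool instance loads every rule for the
        // language -- this is the slow part we only want to pay once.
        JLanguageTool langTool = new JLanguageTool(new French());

        // check() returns one RuleMatch per detected problem.
        List<RuleMatch> matches = langTool.check("Bonjours tout le monde");
        for (RuleMatch match : matches) {
            System.out.println("Error at " + match.getFromPos() + "-" + match.getToPos()
                + ": " + match.getMessage()
                + " Suggestions: " + match.getSuggestedReplacements());
        }
    }
}
```

The expensive constructor and the cheap, repeatable `check` call are exactly the shape that fits a long-lived process.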

What if I made my own Java library that exposes exactly what I want and integrates well with Accent?

ChatGPT:

In Java, write a program that listens to stdin and sends the input to an external Java package. Write the result to stdout.

It works! Here is the entire process:

  • When Accent boots, we start a GenServer which calls the .jar.
  • This Java program listens to stdin and returns data through stdout.
  • The GenServer receives a message with the language, the text and some information to filter the placeholders and markups to send it directly to the Java program.
  • The GenServer reads the stdout of the program to get the result. By using a GenServer, we ensure that the interface with the Java program will always be maintained and that if a crash occurs, it will restart automatically.

In Elixir: LanguageTool.check("fr", "Hello!")

  • πŸ”Ž LanguageTool validates syntax, grammar, unknown words and can even validate ambiguities between the translator's spoken language.
  • ⚑️ Very fast, we pay for the slow initialization of the JVM only once, during startup.
  • πŸ’» We install JRE on the docker image and copy the .jar into the final image, Very simple!

Examples


Note: The Java program is written in Kotlin and is part of the open-source repository on GitHub.

Conclusion

I think Accent is the only tool of its kind that integrates a robust spellchecker this well. Even though LanguageTool is perfect for our use case, there is still work to do before this feature is truly fun to use: an inline auto-correct button, an "ignore this error" action, etc. Stay tuned for more awesome features!