Nicole Filz, Saif eddine Hasnaoui, Sarah Alaghbari
March 8, 2024
Deciphering the nuances of communication can be quite challenging, not least understanding the intent of a written text. However, there are technologies that can accurately understand content written by humans. One of them is the elevait suite, which helps users organize their documents by automatically labeling them, allowing for an easier distinction of document types such as invoices, curricula, or plans.
What lies behind such seemingly magical features is Natural Language Processing (NLP), an intersection of linguistics and computer science that focuses on enabling computers to understand human language and attempts to gain meaningful insights from textual data.
The basis for such NLP systems is a process called text annotation. During text annotation, data is streamlined for machines by providing clear markers for relevant information, which add valuable context and meaning to the raw text data. Through careful analysis and tagging performed by humans, it is then possible to derive patterns and to apply those insights to unknown text documents, in order to categorize them and to identify recurring entities such as names or places.
At elevait, we have developed a custom Text Annotation Tool (TAT), which allows us to train and frequently retrain our models - because just like us, an AI never stops learning, and new domains require new training. Are you wondering how we approached the development of this tool, and which challenges we faced? We’ll tell you, from two different perspectives: the UX integration and the development view! ✨
Since we were lucky enough to work closely with the future users of the TAT, we could easily establish a routine of frequent discussions and exchanges with them. First, we collected and analyzed their requirements and current workflows in order to understand and prioritize their feature requests. Of course, we also tried our best to take special requests into account, for example the possibility to translate the text to be annotated. 🤝
Among the core features requested by the users were the flexible definition of labels, an automated review mechanism, and translation support.
Next, it was time to talk visuals. We analyzed similar tools regarding their advantages and disadvantages (you wouldn’t want to reinvent the wheel all over again). Based on those insights and the requested features, we then created a mockup - a purely visual draft of the user interface showing the rough layout of the UI elements and their basic interaction. The users then provided us with valuable feedback during a further exchange round, which we gratefully integrated into the development.
We established a continuous feedback loop with the users. This iterative procedure enabled us to continuously improve the software and ensure that it met the user needs and expectations in the best possible way.
Now that we know some of the users’ pain points, the question remains: which annotations can be performed inside the TAT? In general, the tool supports two types of annotation tasks: assigning labels to selected words or passages within a text, and assigning classes to whole documents.
The labels and classes to be used can be defined by authorized users, before as well as during the annotation process. In this way, the TAT stays generic and can be flexibly used for various use cases and domains.
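As a sketch of what this flexibility could look like, here is a minimal label registry in TypeScript - the names and structure are purely illustrative assumptions, not the actual elevait data model:

```typescript
// Illustrative sketch only: labels can be registered for a task type
// before or during annotation, keeping the tool generic across domains.
type TaskType = "span-labeling" | "document-classification";

interface LabelDefinition {
  name: string;
  taskType: TaskType;
}

class LabelRegistry {
  private labels = new Map<string, LabelDefinition>();

  // Authorized users can add new labels at any time.
  add(label: LabelDefinition): void {
    this.labels.set(label.name, label);
  }

  // List all labels available for a given annotation task type.
  forTask(taskType: TaskType): LabelDefinition[] {
    return [...this.labels.values()].filter(l => l.taskType === taskType);
  }
}
```

Because labels are plain data rather than hard-coded options, adding a new domain boils down to registering its labels.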
Individual text elements are annotated by first selecting the text and then assigning a label. After these annotations have been made and confirmed, a review is carried out - either manually by a reviewer or automatically.
Since the interviews made clear that one big pain point was the high effort required for manually reviewing annotated data, we developed an auto-approval mechanism that relies on the agreement among annotators: admins can specify the number of matching annotations that is regarded as reliable. Once this number is reached, the corresponding annotations are automatically approved and don’t need to be revised manually. 🥳
This number of required annotations can be specified when setting up the annotation tasks. During this setup step, the task type needs to be selected, as well as the data sources and the users who should perform the annotations and reviews, respectively. A bunch of tasks will then be created automatically by mapping data to annotators - without any additional manual assignment effort.
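The agreement check itself can be sketched in a few lines. The following TypeScript function is a deliberate simplification (names and fields are illustrative assumptions, not our actual implementation): an annotation is auto-approved once enough annotators have assigned the same label to the same span.

```typescript
// Illustrative sketch of agreement-based auto-approval.
interface Annotation {
  documentId: string;
  label: string;      // the label assigned by one annotator
  spanStart: number;  // character offsets of the annotated span
  spanEnd: number;
}

// Returns true if at least `required` annotators agree on the same
// label for the same span; otherwise the item goes to manual review.
function isAutoApproved(annotations: Annotation[], required: number): boolean {
  const counts = new Map<string, number>();
  for (const a of annotations) {
    const key = `${a.documentId}:${a.spanStart}-${a.spanEnd}:${a.label}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return [...counts.values()].some(c => c >= required);
}
```

With `required` set at task setup, this check runs after each confirmed annotation, so reviewers only ever see the disputed cases.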
At the very end, all finished and approved annotations can be exported and used for training purposes.
Let’s dive into the development now. During the implementation, we stumbled upon some fascinating challenges that needed to be tackled: enabling the communication between different components, handling diverse data, and providing a way for users to translate the text they annotate.
The text annotation tool is decomposed into numerous small, dedicated components, each handling one specific task. This approach addresses the difficulties of testing, debugging, and maintaining large components. The smaller components are designed to be reused in different contexts, both within the tool itself and across the whole application.
Challenge: The decomposition introduces a communication challenge between components, particularly when they dynamically alter the states of variables based on user-defined actions.
Solution: Reactive programming using NgRx. NgRx is a state management library for Angular applications inspired by the Redux pattern. It provides a predictable and centralized state management approach, making it easier to manage the state of the application in a scalable and maintainable way.
Key concepts are:
- Store: a single, centralized source of truth for the application state.
- Actions: plain objects describing events triggered by components or services.
- Reducers: pure functions that compute the next state from the current state and an action.
- Selectors: pure functions that derive slices of the state for components to consume.
- Effects: handle side effects such as backend calls, keeping the reducers pure.
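NgRx itself lives in the @ngrx/store package; the pattern behind it can be sketched in dependency-free TypeScript. The following toy example (state shape and action names are invented for illustration) shows how two components - one selecting text, one assigning a label - communicate through a store instead of calling each other directly:

```typescript
// Minimal Redux-style pattern: one state, typed actions, a pure reducer.
interface AnnotationState {
  selectedText: string | null;
  assignedLabel: string | null;
}

type Action =
  | { type: "selectText"; text: string }
  | { type: "assignLabel"; label: string };

const initialState: AnnotationState = { selectedText: null, assignedLabel: null };

// Pure reducer: next state is computed from current state and an action.
function reducer(state: AnnotationState, action: Action): AnnotationState {
  switch (action.type) {
    case "selectText":
      return { ...state, selectedText: action.text };
    case "assignLabel":
      return { ...state, assignedLabel: action.label };
  }
}

// Tiny store: components dispatch actions and subscribe to state changes,
// so no component needs a direct reference to another.
class Store {
  private listeners: Array<(s: AnnotationState) => void> = [];
  constructor(private state: AnnotationState) {}
  dispatch(action: Action): void {
    this.state = reducer(this.state, action);
    this.listeners.forEach(l => l(this.state));
  }
  subscribe(listener: (s: AnnotationState) => void): void {
    this.listeners.push(listener);
  }
  get(): AnnotationState { return this.state; }
}
```

In the real tool, NgRx additionally provides selectors for memoized state slices and effects for backend calls, but the dispatch/subscribe flow above is the core of the communication model.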
In essence, annotating involves the assignment of attributes to word(s) or classes to documents. However, these attributes and classes extend beyond mere data; they represent nodes interconnected through relationships, forming knowledge graphs. These knowledge graphs enable the integration of data from diverse sources and formats, presenting a unified view that facilitates improved decision-making and analysis.
Challenge: Alongside using knowledge graphs, we encounter the task of handling diverse data such as documents themselves, users, permissions, and so on.
Solution: We use two databases, each with its own purpose and its own data model.
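To make the graph idea concrete, here is a toy in-memory graph in TypeScript - the node types and relation names are purely illustrative, and a real knowledge graph would of course live in a dedicated database:

```typescript
// Illustrative sketch: annotated entities become nodes, connected by
// typed relationships that can later be traversed for analysis.
interface GraphNode {
  id: string;
  type: string;   // e.g. "Document", "Label", "Annotator" (illustrative)
}

interface GraphEdge {
  from: string;     // source node id
  to: string;       // target node id
  relation: string; // e.g. "hasLabel", "annotatedBy" (illustrative)
}

class KnowledgeGraph {
  private nodes = new Map<string, GraphNode>();
  private edges: GraphEdge[] = [];

  addNode(node: GraphNode): void { this.nodes.set(node.id, node); }
  addEdge(edge: GraphEdge): void { this.edges.push(edge); }

  // Follow a relation from one node to its neighbours.
  neighbours(id: string, relation: string): GraphNode[] {
    return this.edges
      .filter(e => e.from === id && e.relation === relation)
      .map(e => this.nodes.get(e.to)!)
      .filter(n => n !== undefined);
  }
}
```

Because data from different sources all ends up as nodes and edges in one graph, queries like "all documents labeled as invoices" become simple traversals.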
Within our dynamic annotation tool, documents can arrive through different channels over which we have no control - for example as emails or scanned invoices.
Challenge: Annotating documents can often pose difficulties, particularly when dealing with content in an unfamiliar language.
Solution: In response to the diverse linguistic challenges posed by documents, we introduced a translator. This feature offers annotators the possibility to choose the language they are familiar with, enabling the seamless translation of the text within a split-view interface. This translator is integrated as a service within our infrastructure to streamline the annotation process, ensuring accessibility and efficiency.
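Conceptually, the split view pairs each original text segment with its translation. The following TypeScript sketch uses an assumed `TranslationService` interface purely for illustration; the real translator is a separate service in our infrastructure:

```typescript
// Hypothetical interface standing in for the translation service.
interface TranslationService {
  translate(text: string, targetLang: string): string;
}

// One row of the split-view: original text next to its translation.
interface SplitViewRow {
  original: string;
  translated: string;
}

// Pair each segment of the document with its translation so the
// annotator can read the familiar language while labeling the original.
function buildSplitView(
  segments: string[],
  translator: TranslationService,
  targetLang: string,
): SplitViewRow[] {
  return segments.map(s => ({
    original: s,
    translated: translator.translate(s, targetLang),
  }));
}
```

Keeping the translator behind an interface like this also makes the UI testable with a fake implementation, independent of the actual translation backend.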
Are we done with the tool then? Not quite yet. Some implementation details remain to be tackled. One of them is word splitting within an annotation task, which demands a specific approach: our plan is to employ a machine-learning-based tokenizer that understands words based on their context. The conventional strategy relies on spaces and special characters as natural word dividers, but it breaks down when confronted with entities like emails, phone numbers, or addresses. In these instances, the conventional rules fail, demanding a more sophisticated mechanism capable of recognizing and preserving the integrity of these entities. This complexity is why we want a service-based, machine-learning tokenizer that follows regular language rules but can also adjust to different types of information, making sure our annotation process runs smoothly.
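To see why plain whitespace splitting is not enough, consider a tokenizer that matches such entities first. The patterns below are deliberately simplified toy regexes, not the planned solution - which, as described above, will be ML-based:

```typescript
// Toy illustration: entity patterns are tried before the generic word
// pattern, so emails and phone numbers survive as single tokens.
const EMAIL = /[\w.+-]+@[\w-]+\.[\w.]+/y;       // simplified email pattern
const PHONE = /\+?\d[\d\s/-]{5,}\d/y;           // simplified phone pattern
const WORD = /[\p{L}\p{N}]+/uy;                 // generic word pattern

function tokenize(text: string): string[] {
  const tokens: string[] = [];
  let i = 0;
  while (i < text.length) {
    let matched = false;
    for (const re of [EMAIL, PHONE, WORD]) {
      re.lastIndex = i; // sticky regexes match only at position i
      const m = re.exec(text);
      if (m) {
        tokens.push(m[0]);
        i = re.lastIndex;
        matched = true;
        break;
      }
    }
    if (!matched) i++; // skip whitespace and punctuation
  }
  return tokens;
}
```

Naive splitting would break "+49 351 123456" into three meaningless fragments; ordering the entity patterns first keeps it intact. A learned tokenizer generalizes this idea to entities no fixed regex can anticipate.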
Apart from this, we ran an internal testing session where some developers clicked their way through the tool. The findings, combined with our latest round of feedback, revealed multiple avenues for future development of the TAT, such as annotation statistics, improved accessibility, and maybe even elements of gamification.
Implementing the tool ourselves makes individual customization possible: functionality that is missing from existing solutions can be added at any time, and we can respond individually to the specific requirements of our users.
But for now, we are positive that the TAT can become a central training tool at elevait and ultimately help to bridge the gap between technology and human expression. After all, the benefits we can gain from NLP go beyond words.