Tommi's Scribbles
Your AI is only as good as your data

- Published on 2024-10-04
In our team, we have a saying: stool in, stool out. This is especially true for AI-related work. People often have the misconception that a RAG-based AI will magically deliver great results. However, this is not the case.
Misconception One: AI can clean up your badly formatted internal knowledge
If your internal knowledge is hard for people to read and understand, feeding it to an AI will likely not yield any better results. Inconsistent language, misaligned metadata, and bad formatting all reduce the quality of a RAG-based solution. Google Gemini now supports massive context windows, which could in theory let you skip indexing and searching entirely, but with bad data the results still fall short even in that scenario. Not to mention you will break the bank with the massive number of tokens you feed in.
This is good news for the data engineers out there. Adding an ETL step before RAG can greatly improve the quality of the responses you get out of a RAG-based AI.
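As a minimal sketch of what such a pre-RAG cleanup step might look like (the function name and the specific normalizations are illustrative assumptions, not a prescription):

```python
import re

def clean_chunk(text: str) -> str:
    """Hypothetical ETL step: normalize a raw documentation chunk
    before it is embedded and indexed. Strips leftover HTML tags,
    straightens curly quotes, and collapses whitespace so that
    identical content is represented consistently."""
    text = re.sub(r"<[^>]+>", " ", text)                  # drop HTML remnants
    text = text.replace("\u201c", '"').replace("\u201d", '"')  # normalize quotes
    text = re.sub(r"\s+", " ", text).strip()              # collapse whitespace
    return text

raw = "<p>Reset  X:</p>\n  login   to \u201cApp A\u201d"
print(clean_chunk(raw))  # Reset X: login to "App A"
```

A real pipeline would also normalize metadata fields and chunk boundaries, but even trivial normalization like this tends to make retrieval more consistent.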
Misconception Two: AI can fill in the gaps or do negative inference
It is easy to think AI is capable of independent reasoning. That is, if you have two documents, where one says "We have operations in Canada and France" and the other says "Our HQ is in the USA", some people expect a RAG-based AI to reliably answer questions like "Do you have operations in the USA?" or "Do you have operations in Mexico?".
While a human with the above information can infer the missing pieces, the RAG-based AI does not actually have the data necessary to answer the question. This is again great news for all the data people out there: a data-savvy engineer can point out the gaps and suggest reformatting the source material to deliver the results needed.
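To make the gap concrete, here is a toy check (the helper function is invented for illustration; think of it as a stand-in for what a retriever can actually hand to the LLM):

```python
docs = [
    "We have operations in Canada and France.",
    "Our HQ is in the USA.",
]

def states_operations_in(country: str, docs: list[str]) -> bool:
    """Does any single document explicitly link 'operations' to the
    given country? This mimics what grounded retrieval can supply:
    only statements that actually exist in the source data."""
    return any(
        "operations" in d.lower() and country.lower() in d.lower()
        for d in docs
    )

print(states_operations_in("Canada", docs))  # True: stated explicitly
print(states_operations_in("USA", docs))     # False: HQ doc never says "operations"
print(states_operations_in("Mexico", docs))  # False: no document mentions Mexico
```

The human-obvious inferences (an HQ implies operations; no mention of Mexico implies none) are exactly the statements missing from the source data, so the model has nothing grounded to retrieve.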
Misconception Three: AI will fix your bad documentation
The final misconception I would like to point out here relates to the sometimes unrealistic expectation that a RAG-based LLM will fix bad internal documentation. As an example, a troubleshooting document A might have a side note: "To reset X, log in to App A and click X." Meanwhile, troubleshooting document B describes in detail how to use App B, including resetting X. Now, when you run the query "How to reset X", you have two conflicting pieces of information, and the results are likely suboptimal, especially if Apps A and B are not interchangeable (two ways to do the same thing).
To resolve such situations, you need a skilled person who can debug unexpected AI behavior and who understands how conflicts in the source data can be resolved, especially when the internal data contains a lot of semantic similarity or is stale.
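One practical starting point for that debugging work is simply flagging chunks that are lexically similar enough to compete for the same query. A rough sketch using only the standard library (the threshold is an illustrative guess, not a tuned value):

```python
from difflib import SequenceMatcher

chunks = [
    "To reset X, log in to App A and click X.",
    "To reset X in App B, open Settings and choose Reset X.",
]

def flag_conflicts(chunks: list[str], threshold: float = 0.4) -> list[tuple]:
    """Return (i, j, similarity) for chunk pairs whose lexical
    similarity exceeds the threshold. High-similarity pairs are
    candidates for conflicting or redundant instructions that a
    human should review and consolidate."""
    pairs = []
    for i in range(len(chunks)):
        for j in range(i + 1, len(chunks)):
            ratio = SequenceMatcher(
                None, chunks[i].lower(), chunks[j].lower()
            ).ratio()
            if ratio >= threshold:
                pairs.append((i, j, round(ratio, 2)))
    return pairs

print(flag_conflicts(chunks))  # flags the App A / App B pair for review
```

A similarity flag cannot decide which instruction is correct, and that is the point: surfacing the collision is automatable, but resolving it still takes a person who knows whether Apps A and B are interchangeable.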
Closing thoughts
Hopefully, the above helps you understand that AI is not a magic superpower that will fix poor-quality internal data. I also hope it makes clear that while your out-of-the-box or first attempts at using AI with RAG might not deliver immediately perfect results, employing skilled individuals and doing some tuning can get you the value and performance you want.
AI is by no means a silver bullet, and adopting it is not a sprint either. Treating AI integration as a marathon will get you much better results.