The AI Last Mile: Why Imperfect Data Matters More Than Bigger Models

Joe Rose, president of JBS Dev, a strategic technology services company, believes it is important to dispel a myth about using generative and agentic AI tools. “One common mistake when trying to build out any type of AI is thinking that your data needs to be perfect first,” he asserts.

As an article published in AI Fieldbook describes, it comes as no surprise that vendors advise putting large amounts of data in lakes, while consultants advise a multi-year data transformation plan. As a result, executives find it difficult to make sense of it all. According to Rose, however, the available tools have never been better at working with poor-quality data. “LLMs can do an incredible job understanding even a half-written prompt,” he continues.

It makes perfect sense: if you have access to such a tool, you should use it, provided you take the appropriate precautions. Because models are inherently unpredictable, you need a plan for dealing with poor-quality outputs. The resilience Rose describes applies to text or categorical data. “People are… used to ‘we build it, it works, we forget about it,’” notes Rose. “That’s just not how these systems work.”

As an example of a flawed data set, Rose cites a client in the healthcare field. The company wanted to reconcile its bills before migrating to another system. Some records were PDFs and others were image files, and fields could be mixed up: the procedure might be written in the doctor’s name field, or the doctor’s name in the patient’s name field, and vice versa. After OCR on the images and text extraction on the PDFs, generative AI was able to produce clean data from a simple prompt. Further agentic techniques were then applied, such as comparing a customer’s record against an insurance plan to check whether the bill was accurate.
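The final check in that workflow, comparing a cleaned record against an insurance plan, could look roughly like the sketch below. It assumes the extraction step has already produced a structured record. The field names, procedure codes, and plan layout are hypothetical and are not taken from the client project Rose describes.

```python
from dataclasses import dataclass


@dataclass
class BillRecord:
    # Hypothetical fields; a real record would carry many more.
    patient: str
    doctor: str
    procedure_code: str
    billed_amount: float


# Hypothetical plan: procedure code -> maximum amount the plan allows.
PLAN_ALLOWED = {"PROC-A": 120.00, "PROC-B": 450.00}


def check_bill(record, plan):
    """Return a list of issues; an empty list means the bill looks consistent."""
    issues = []
    allowed = plan.get(record.procedure_code)
    if allowed is None:
        issues.append(f"procedure {record.procedure_code} is not in the plan")
    elif record.billed_amount > allowed:
        issues.append(
            f"billed {record.billed_amount:.2f} exceeds allowed {allowed:.2f}"
        )
    return issues
```

A bill that returns any issues would be a natural candidate for human review rather than automatic approval.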

“You start to stack different use cases upon each other,” explains Rose. “The point I’m trying to make here is that it’s not always perfect – you still require a human in the loop. But what you should be saying is, ‘we started off at 20% automation, and then 40%, and then 60, 80%’, and growing that over time.”
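One way to picture the human-in-the-loop approach Rose describes is a simple router that sends low-confidence results to a reviewer and tracks the share handled automatically. This is a minimal sketch, assuming each processed record carries a confidence score. The score, the `confidence` key, and the 0.9 threshold are illustrative assumptions, not details from the article.

```python
def route(records, threshold=0.9):
    """Split records into those handled automatically and those sent to a human."""
    automated, needs_review = [], []
    for rec in records:
        # "confidence" is a hypothetical score attached by an earlier step.
        if rec["confidence"] >= threshold:
            automated.append(rec)
        else:
            needs_review.append(rec)
    return automated, needs_review


def automation_rate(automated, needs_review):
    """Fraction of records that required no human review."""
    total = len(automated) + len(needs_review)
    return len(automated) / total if total else 0.0


batch = [{"id": i, "confidence": c}
         for i, c in enumerate([0.95, 0.50, 0.99, 0.70, 0.91])]
auto, review = route(batch)
print(f"{automation_rate(auto, review):.0%}")  # prints 60%
```

Tracking this rate over time gives a concrete measure of the progression from 20% to 80% automation that Rose mentions.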

In the coming years, according to Rose, conversations about these models will revolve around cost and portability. “I think the conversation will change quite a bit where, rather than thinking about radical innovation and capabilities of a certain model, the focus will become more on ‘how do we ensure that the cost becomes more sustainable so that we don’t have to keep building data centres at this rate?’” he adds.

“Where the discussion really lies in the last mile would be in regards to ‘how do we run them on a laptop or on a phone instead of having to run them in a data centre?’ Models are built using a set of data, which is effectively all pages on the internet, and other data sets. The idea is that there’s not a lot of data left out there

Rose is looking forward to the discussions at AI & Big Data Expo, where JBS Dev will be present. Another controversial opinion he plans to share there is that companies should stop buying from SaaS providers, since they can build the same thing themselves. “I can assure you it is easier than you think,” Rose says. “Everyone has some sort of cloud service, and this is where I would recommend starting, considering the tooling capabilities offered by the cloud provider, especially the three biggest ones… All you need is there.”