
The model was never the problem.
Across enterprise AI deployments, the same root cause keeps surfacing: bad data. According to Informatica's 2025 CDO Insights survey, 43% of organisations cite data quality and readiness as the number one obstacle to AI success. It beats out lack of technical maturity. It beats out skills shortages. It comes first, consistently, because it determines the ceiling for everything else.
Choosing your model before you've sorted your data is like hiring a brilliant analyst and then handing them a filing cabinet full of shredded documents.
How do I prepare documents for a RAG knowledge base?
Most enterprise AI systems rely on Retrieval Augmented Generation, or RAG. Instead of relying solely on what the AI model was trained on, you feed your own documents, policies, reports, and data into a knowledge base; when a query comes in, the system retrieves the relevant passages and the model uses them to compose an answer. Ask a question and you get an answer drawn from your actual content.
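To make that flow concrete, here is a minimal sketch of the retrieve-then-generate loop. The keyword-overlap scorer and the stubbed generate_answer function are stand-ins for a real embedding model and LLM call, and the sample knowledge base is invented for illustration.

```python
# Minimal retrieve-then-generate loop. The keyword-overlap scorer and the
# stubbed generate_answer() stand in for a real embedding model and LLM call.

def score(query: str, chunk: str) -> float:
    """Toy relevance score: fraction of query words present in the chunk."""
    q_words = set(query.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / max(len(q_words), 1)

def retrieve(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Return the top_k chunks most relevant to the query."""
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:top_k]

def generate_answer(query: str, context: list[str]) -> str:
    """Placeholder for an LLM call that answers using only the retrieved context."""
    return f"Answer to {query!r} drawn from {len(context)} retrieved chunk(s)."

knowledge_base = [
    "Annual leave policy: employees accrue 25 days per year.",
    "Expense claims must be submitted within 30 days of purchase.",
    "Remote working requests are approved by line managers.",
]

query = "How many days of annual leave do employees get?"
context = retrieve(query, knowledge_base)
print(generate_answer(query, context))
```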
The problem is that "feeding in your documents" sounds simple and rarely is.
Documents in the real world are messy. They are PDFs with two-column layouts that are extracted as scrambled text. They are scanned images with no machine-readable content at all. They are Excel files where the meaningful data sits in merged cells, or Word documents where headings are formatted as bold paragraphs rather than proper heading styles. They are SharePoint pages with navigation menus that get ingested alongside the actual content.
When these documents enter a RAG knowledge base, the AI has no way to distinguish a page number from a footnote, or a table header from a table cell. It simply processes the input it is given. If the input is flawed, the AI will confidently synthesise answers from that flawed data: garbage in, garbage out.
Preparing documents for a RAG system means solving at least four distinct problems before a single query runs.
Extraction: Getting the raw text out of whatever format it lives in (PDF, Word, HTML, Excel, PowerPoint) in a way that preserves the logical structure, not just the character sequence.
Cleaning: Removing noise. Headers and footers that repeat on every page. Navigation elements. Watermarks. Boilerplate legal text that has nothing to do with the substance of the document.
Chunking: Dividing the cleaned text into segments the AI can retrieve. Too small and the chunks lose context. Too large and the model struggles to identify what's relevant within them.
Metadata tagging: Attaching information about each chunk (the source document, the date, the author, the relevant department or product line) so the system can filter intelligently and answers can be traced back to a source.
None of this happens automatically. All of it takes time.
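As a rough illustration of the last two steps, the sketch below splits cleaned text into overlapping word windows and attaches metadata to each chunk. The chunk size, overlap, and metadata fields are assumptions for illustration, not recommended values.

```python
# Split cleaned text into overlapping, word-based chunks and attach source
# metadata to each one. Sizes and fields here are illustrative only.

def chunk_with_metadata(text: str, source: str, department: str,
                        chunk_size: int = 200, overlap: int = 40) -> list[dict]:
    """Split text into overlapping word windows, each tagged with its origin."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        chunks.append({
            "text": " ".join(window),
            "source": source,          # which document the chunk came from
            "department": department,  # used later for filtering and access control
            "position": start,         # where in the document the chunk begins
        })
    return chunks

cleaned_text = "Employees accrue 25 days of annual leave per year. " * 100
chunks = chunk_with_metadata(cleaned_text, source="leave_policy.txt", department="HR")
print(len(chunks), "chunks ready for indexing")
```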
What percentage of AI project budgets should go to data preparation?
Most teams budget a sprint or two for "data clean-up" at the start of an AI project, then move on to what feels like the real work. The assumption is that data preparation is a preliminary task rather than a primary one.
Informatica's analysis is blunt about where the effort goes: the bulk of the work in enterprise AI is data preparation, including RAG pipeline construction, feature selection, prompt engineering, and governance. The actual model work is a small portion of the total. Programmes that succeed treat data readiness as the primary investment, not a precondition that gets checked off before the real work begins.
Teams that neglect this phase often discover the gap only once the system is in testing, which is the most expensive point at which to fix it.
Why do messy PDFs and scanned documents break RAG systems?
Scanned documents are images. Without optical character recognition (OCR), an AI system sees a JPEG of text rather than the text itself. OCR has improved significantly, but it introduces its own errors: misread characters, broken words, merged columns. These errors compound downstream.
Even digitally created PDFs cause problems. PDF is a presentation format, designed to make documents look right on paper rather than to be read by machines. A PDF with a two-column layout might be extracted with the columns interleaved. A table might be extracted as a list. A document with callout boxes might have those boxes extracted in arbitrary positions relative to the main text.
The AI has no way to flag when this has happened. It processes what it receives and produces answers that sound plausible, because that is what large language models do. The errors are invisible until a user receives a confidently wrong answer and you go looking for why.
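One practical mitigation is a triage pass that flags pages where extraction yields little or nothing, so they can be routed to OCR or manual review. The sketch below uses pdfplumber, one of several PDF extraction libraries; the 50-character threshold and the file name are arbitrary assumptions.

```python
# Rough triage pass over a PDF: extract text per page and flag pages that
# yield little or none, which usually means a scanned image needing OCR.
import pdfplumber

def triage_pdf(path: str, min_chars: int = 50) -> list[dict]:
    report = []
    with pdfplumber.open(path) as pdf:
        for i, page in enumerate(pdf.pages, start=1):
            text = page.extract_text() or ""   # None on image-only pages
            report.append({
                "page": i,
                "chars": len(text),
                "tables": len(page.extract_tables()),
                "likely_scanned": len(text) < min_chars,  # candidate for OCR
            })
    return report

for row in triage_pdf("annual_report.pdf"):   # hypothetical file name
    print(row)
```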
How do I clean and structure data for generative AI ingestion?
Practical data preparation for a RAG system involves four categories of work, typically in this order.
Audit first. Before touching anything, catalogue what you have. How many documents? What formats? How old? Who owns them? Are there duplicates? Are there conflicting versions of the same policy or product specification? An audit takes time but prevents you from building on a foundation you don't fully understand.
Standardise formats. Where possible, convert documents to formats that extract cleanly and preserve logical structure. For PDFs and scanned documents, invest in a proper extraction and OCR pipeline rather than a one-click converter.
Define and enforce metadata. Decide what metadata matters: date, source, author, topic, access level. Enforce it consistently. Without metadata, your AI system can't filter by recency or restrict access by role.
Version and maintain. A knowledge base is an ongoing commitment, not a pre-launch checkbox. Documents change. Policies are updated. Products are discontinued. A RAG system built on stale content will give answers that were accurate six months ago and are wrong today.
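The audit step, at least, can be partly scripted. The sketch below walks a document folder, counts file formats, flags files more than a year old, and spots byte-identical duplicates via content hashes; the folder path and the one-year threshold are illustrative, and near-duplicates or conflicting versions still need human judgement.

```python
# Starting point for a document audit: catalogue formats, ages, and
# byte-identical duplicates in a folder tree. Uses only the standard library.
import hashlib
from collections import Counter
from datetime import datetime, timezone
from pathlib import Path

def audit(folder: str) -> None:
    formats, hashes = Counter(), {}
    now = datetime.now(timezone.utc)
    for path in Path(folder).rglob("*"):
        if not path.is_file():
            continue
        formats[path.suffix.lower()] += 1
        modified = datetime.fromtimestamp(path.stat().st_mtime, timezone.utc)
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in hashes:
            print(f"Duplicate: {path} == {hashes[digest]}")
        hashes[digest] = path
        if (now - modified).days > 365:
            print(f"Over a year old: {path}")
    print("Formats found:", dict(formats))

audit("./knowledge_base_documents")   # hypothetical folder
```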
What is 'AI-ready data' and how do I know if my organisation has it?
AI-ready data is content that can be reliably extracted, chunked, and retrieved in a way that supports accurate, traceable answers. Readiness is a spectrum rather than a pass/fail state; most organisations sit somewhere in the middle, and a useful diagnostic is asking four questions.
Can you extract clean, structured text from your most important documents without manual intervention? If the answer involves significant human review of extraction quality, the data is not ready.
Do you know which version of each document is current? If multiple versions of the same policy document are in circulation, the AI will have no way to determine which one to trust.
Do your documents have consistent metadata? If documents lack dates, authors, or topic tags, retrieval has nothing to filter on: no recency, no topic, no ownership.
Is there a process for updating the knowledge base when documents change? Without one, you're building a system that will degrade over time with no visible signal that it's happening.
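One lightweight signal is possible if you record a content hash for each document at indexing time: re-hashing the files later reveals anything that has changed or disappeared since. The dictionary of indexed hashes and the file names below are invented for illustration.

```python
# Simple staleness check, assuming a content hash was stored for each
# document when it was indexed. Any mismatch means the knowledge base is
# serving answers from an outdated version.
import hashlib
from pathlib import Path

def find_stale(indexed_hashes: dict[str, str]) -> list[str]:
    stale = []
    for name, indexed in indexed_hashes.items():
        path = Path(name)
        if not path.exists():
            stale.append(f"{name}: deleted since indexing")
            continue
        current = hashlib.sha256(path.read_bytes()).hexdigest()
        if current != indexed:
            stale.append(f"{name}: changed since indexing")
    return stale

indexed_hashes = {"leave_policy.txt": "0f3a19", "pricing_2024.pdf": "9c1d77"}  # illustrative
for issue in find_stale(indexed_hashes):
    print(issue)
```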
The uncomfortable truth about model selection
Every month brings a new model benchmark. Every month, someone argues that the new frontier model makes the previous one obsolete. These conversations consume a disproportionate share of strategic attention in organisations that are new to AI.
Swapping one frontier model for another, or moving from one embedding model to another, is unlikely to produce the improvement teams are hoping for if the underlying data problem hasn't been addressed. A better model will synthesise bad answers slightly more fluently. It will still synthesise bad answers.
The Informatica survey puts it plainly: the real supercharger for AI is data management. The model is just the last step in a pipeline that depends entirely on the quality of what flows through it.
Fix the data, and most models will perform well. Ignore the data, and no model will save you.



