Establish a data strategy today
For a long time, we’ve been told that data is important; some even claim that “data is the new oil”. With the rise of AI, that claim is no longer an exaggeration. Anyone working with AI should become familiar with popular dataset repositories such as Hugging Face and Kaggle: depending on the use case, their datasets can be useful even though the data is not our own. My point, however, is that every company should have a process in place to generate and centralize its own high-quality data.
High-quality data is data that is relevant to our needs, accurate, consistent, and stored in a usable format. AI systems perform significantly better when they are trained or evaluated on high-quality data. The most common reason for being unable to build or improve an AI system is simple: we don’t have the right data. In many cases we could have had it, but we haven’t been storing it.
If frontier models (such as OpenAI’s GPT or Google’s Gemini) are already trained on practically the entire Internet, why do we still need our own high-quality data?
Frontier models might not be cost-effective for every task; some problems can be solved more efficiently with smaller, local models. For those models to perform accurately, high-quality data is essential: it can serve as the knowledge base for a Retrieval-Augmented Generation (RAG) system, or it can be used to fine-tune the models.
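As a minimal sketch of the fine-tuning path: labeled question-and-answer pairs can be exported to the JSONL chat format that several fine-tuning APIs (OpenAI’s, among others) accept. The file name and the example pairs below are hypothetical.

```python
import json

# Hypothetical labeled (security_question, security_answer) pairs;
# in practice these would come from our own curated dataset.
pairs = [
    ("Do we encrypt backups at rest?", "Yes, all backups are encrypted with AES-256."),
    ("Who approves firewall rule changes?", "The network security team, via a Jira request."),
]

# One JSON object per line: the chat-style format used by several
# fine-tuning APIs (e.g., OpenAI's fine-tuning endpoint).
with open("finetune_data.jsonl", "w", encoding="utf-8") as f:
    for question, answer in pairs:
        record = {
            "messages": [
                {"role": "system", "content": "You answer internal security questions."},
                {"role": "user", "content": question},
                {"role": "assistant", "content": answer},
            ]
        }
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```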
We may also want to tailor responses to our specific business context, so that the system works from our own data and delivers maximum value. For example, a chatbot that answers questions about our company or products depends on high-quality data to function effectively.
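To illustrate the retrieval half of such a chatbot, here is a minimal sketch using the open-source sentence-transformers library over a tiny in-memory list of company documents. The documents are invented, and a production system would use a proper vector database instead.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

# Hypothetical company documents that would come from our own data.
documents = [
    "Our VPN requires multi-factor authentication for all employees.",
    "Product X stores customer data in the EU region only.",
    "Security incidents must be reported to the SOC within one hour.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = model.encode(documents, normalize_embeddings=True)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the question (cosine similarity)."""
    q = model.encode([question], normalize_embeddings=True)
    scores = doc_embeddings @ q[0]          # cosine similarity via dot product
    top = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top]

# The retrieved snippets would be placed in the LLM prompt as context.
print(retrieve("Where is customer data stored?"))
```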
AI systems are often criticized for being non-deterministic and for producing hallucinations. One way to reduce the risk of these behaviors is through systematic testing: rigorously comparing their outputs with known correct answers. To make such comparisons possible, we need high-quality data.
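As a minimal sketch of such a test, assume a small labeled test set and a hypothetical ask_model() function wrapping the system under evaluation. Exact string matching is the simplest possible scoring rule; real evaluations often use fuzzier comparisons.

```python
# Hypothetical (security_question, security_answer) test set drawn from
# our labeled data; ask_model() stands in for the system under test.
test_set = [
    {"question": "Is TLS 1.0 allowed on public endpoints?", "expected": "no"},
    {"question": "Do we rotate API keys automatically?", "expected": "yes"},
]

def ask_model(question: str) -> str:
    raise NotImplementedError("Call the AI system under evaluation here.")

def evaluate(tests: list[dict]) -> float:
    """Return the fraction of answers that match the expected label."""
    hits = 0
    for case in tests:
        answer = ask_model(case["question"]).strip().lower()
        hits += answer == case["expected"]
    return hits / len(tests)

# With a real ask_model(), this yields an accuracy score we can track
# across model versions to detect regressions and hallucinations.
# print(f"accuracy: {evaluate(test_set):.2%}")
```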
Although it is not absolutely necessary, data becomes much more valuable when it is labeled. Labeled data means we have pairs of (datapoint, classification). Examples in the cybersecurity space include the following (a storage sketch follows the list):
(jira_issue, security_problem): pairs consisting of the summary and description of each Jira issue, along with the description of the related security problem, if any.
(log, indicator_of_incident): pairs of logs from different devices and systems, together with indicators of incidents or attacks identified in those logs.
(source_code, security_vulnerability): pairs of code snippets and the security vulnerabilities they contain, if any.
(security_question, security_answer): pairs of questions from internal or external customers about security topics, along with their respective answers.
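A minimal sketch of how such pairs could be stored, one JSON object per line; the field names and example records are illustrative, not a standard schema.

```python
import json

# Illustrative (log, indicator_of_incident) pairs; "datapoint" and
# "classification" are just one way to keep the pair together.
examples = [
    {
        "datapoint": "Failed password for root from 203.0.113.7 port 22 ssh2",
        "classification": "brute-force attempt against SSH",
    },
    {
        "datapoint": "User jdoe logged in from the usual office network.",
        "classification": None,  # benign log line, no indicator
    },
]

with open("log_incident_pairs.jsonl", "w", encoding="utf-8") as f:
    for pair in examples:
        f.write(json.dumps(pair, ensure_ascii=False) + "\n")
```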
Creating these datasets from scratch at the moment a use case is implemented is costly in both time and money. It is therefore important to establish processes to collect this data as early as possible, so that it is available when it is needed.
If you plan to produce and work with labeled data, something I strongly recommend, you can use open-source tools like Label Studio. It offers a range of features for managing labeled data, including tools that make it easier to label previously unlabeled datasets.
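Label Studio imports unlabeled data as JSON tasks, each with a "data" key whose fields match the project’s labeling configuration. Below is a minimal sketch, with hypothetical log lines and file name, that prepares such a file for upload through the UI or the API.

```python
import json

# Hypothetical raw, unlabeled log lines collected from our systems.
raw_logs = [
    "Failed password for admin from 198.51.100.4 port 22 ssh2",
    "Accepted publickey for deploy from 10.0.0.5 port 51762 ssh2",
]

# Label Studio imports tasks as JSON objects with a "data" key; the
# inner key ("text" here) must match the variable used in the
# project's labeling configuration.
tasks = [{"data": {"text": line}} for line in raw_logs]

with open("label_studio_tasks.json", "w", encoding="utf-8") as f:
    json.dump(tasks, f, ensure_ascii=False, indent=2)
```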
To obtain maximum value from AI, every company should have a data strategy that defines what data to collect, how to curate it, and how to store it. I’m not a data specialist, but it is clear that a lack of high-quality data limits the value a company can derive from AI, so organizations should start gathering this data now if they haven’t already.

