- Data is of no use if it does not come with a promise of quality
- Data needs both technical and semantic cleanup
- Semantic cleanup needs domain expertise
There is a trend among big industrial manufacturers to consolidate their data into data lakes, and that is a good move. It lets organizations harness the data to drive business insights and has the potential to significantly improve both the top and bottom line.
Data lakes are different
Data lakes differ from data warehouses in that they store data in raw form and in varied formats. One purpose of such an initiative is to run various analytics and data science/AI algorithms on a broader set of data to get better insights. A data lake is usually cheaper to assemble than a data warehouse because data can be dumped into it with little or no processing. However, that also creates a problem!
Technology is now the easiest part of a data lake initiative, since multiple cloud providers offer the capability. Storage options exist for every format of data, from structured to unstructured.
Anyone doing even rudimentary data science understands the importance of data quality. Is the data clean and deduplicated? Are the data points unified? Without that, the best you can expect is “garbage in, garbage out”, and this wisdom is not new.
Any data lake initiative should be accompanied by a data governance strategy that continuously ensures the data is, and remains, of high quality. It is not a one-off exercise but a continuous effort. Think of it like a filtration plant: even data that has already passed through the plant needs to be filtered again after sitting in storage for a long time, to ensure its quality remains intact.
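The "filter it again" idea can be sketched as a simple scheduling check. This is only an illustration: the dataset names, the `due_for_recheck` helper, and the 30-day threshold are all assumptions made for the example, not part of any particular governance product.

```python
from datetime import datetime, timedelta

def due_for_recheck(last_checked_by_dataset, now, max_age=timedelta(days=30)):
    """Return dataset names whose last quality check is missing or older than max_age.

    The 30-day default is an arbitrary illustrative threshold.
    """
    return [
        name
        for name, last_checked in last_checked_by_dataset.items()
        if last_checked is None or now - last_checked > max_age
    ]

# Hypothetical datasets sitting in the lake, with the time of their last quality check.
now = datetime(2024, 6, 1)
last_checked_by_dataset = {
    "service_tickets": datetime(2024, 5, 25),   # checked recently
    "installed_base": datetime(2024, 1, 10),    # stale, needs another pass
    "parts_orders": None,                       # never checked
}
stale = due_for_recheck(last_checked_by_dataset, now)
```

In this sketch, anything returned in `stale` would be sent back through the same quality checks that new data passes on arrival.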
“Data lake success needs both technical and semantic data quality, period”
Data quality should also not be viewed myopically as just data cleaning, limited to filling missing values, enrichment, and deduplication. I would term these technical cleanup. They are important, but it is equally important for the data cleaning engine to have a notion of semantic cleanliness.
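To make "technical cleanup" concrete, here is a minimal sketch of two of the steps named above, deduplication and filling missing values. The field names (`serial_no`, `run_hours`) and the choice of median imputation are assumptions made for the example; real pipelines would pick keys and fill strategies per dataset.

```python
from statistics import median

def technical_cleanup(records):
    """Deduplicate on a normalized serial number, then fill missing run hours with the median."""
    seen = set()
    unique = []
    for rec in records:
        # Normalize the key so "ab-100 " and "AB-100" count as the same record.
        key = rec["serial_no"].strip().upper()
        if key in seen:
            continue
        seen.add(key)
        unique.append({**rec, "serial_no": key})

    # Median imputation is one simple strategy among many (assumed here for illustration).
    known = [r["run_hours"] for r in unique if r["run_hours"] is not None]
    fill = median(known) if known else None
    return [
        {**r, "run_hours": fill if r["run_hours"] is None else r["run_hours"]}
        for r in unique
    ]

# Hypothetical raw records as they might land in the lake.
raw = [
    {"serial_no": "ab-100 ", "run_hours": 1200.0},
    {"serial_no": "AB-100", "run_hours": 1200.0},   # duplicate after normalization
    {"serial_no": "CD-200", "run_hours": None},     # missing value
    {"serial_no": "EF-300", "run_hours": 800.0},
]
cleaned = technical_cleanup(raw)
```

Note that nothing in this step knows what a serial number or a run hour means in the business; that is the gap semantic cleanup is meant to fill.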
Semantic cleanup needs domain knowledge and purpose-built data quality engines. To be effective, the engine should understand the domain and the relationships between its objects. For example, in the industrial OEM world, equipment and parts are two important categories of objects, and classifying each item correctly as one or the other is important for many analyses to be meaningful.
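As a rough illustration of a semantic check, the sketch below encodes two hypothetical domain rules about equipment and parts: a part should hang off a known piece of equipment, and an item labeled as equipment that has a parent may really be a part. The `Item` model and both rules are assumptions for the example; actual rules would come from domain experts.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Item:
    item_id: str
    category: str                    # expected: "equipment" or "part"
    parent_id: Optional[str] = None  # equipment the item is installed on, if any

def semantic_issues(items):
    """Flag items whose category conflicts with their relationships (illustrative rules only)."""
    by_id = {item.item_id: item for item in items}
    issues = []
    for item in items:
        if item.category == "part":
            parent = by_id.get(item.parent_id) if item.parent_id else None
            if parent is None:
                issues.append((item.item_id, "part has no known parent equipment"))
            elif parent.category != "equipment":
                issues.append((item.item_id, "part is attached to a non-equipment item"))
        elif item.category == "equipment":
            if item.parent_id is not None:
                issues.append((item.item_id, "equipment has a parent; may be a misclassified part"))
        else:
            issues.append((item.item_id, f"unknown category {item.category!r}"))
    return issues

# Hypothetical catalog entries.
items = [
    Item("PUMP-1", "equipment"),
    Item("SEAL-9", "part", parent_id="PUMP-1"),
    Item("IMPELLER-3", "equipment", parent_id="PUMP-1"),  # suspicious classification
    Item("GASKET-7", "part"),                             # orphaned part
]
issues = semantic_issues(items)
```

Every record here would pass a purely technical check (no missing keys, no duplicates), yet two of the four are flagged once the relationship between objects is taken into account.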
Ideally, OEMs would have curated catalogs that can be fed into the system, but real-world warriors know that this is hardly ever the case. There are, though, gladiators in the organization who, with their tribal knowledge, know how to fit things together. Institutionalizing that knowledge is very important, and for that reason solutions or platforms that treat semantic notions as first-class concepts become important for any analysis to be effective.
The rules and insights can then be captured in automated data science algorithms to make them scalable. Automation succeeds only when it is built with both technical and semantic inputs.
Evaluate your data lake initiatives and make sure that data governance and data quality, in both technical and semantic dimensions, are core elements.
Contact us today if you have any questions or would like to set up a conversation with our data team.