At a recent conference, AI pioneer Andrew Ng argued that AI development needs to focus more on data and less on models.
As optimizing AI models produces diminishing returns, producing "consistently high-quality data" leads to much higher accuracy. Producing high-quality data is not only the key to AI performance; it is the key to succeeding in digital transformation. Even if your data lake were full, low-quality and inconsistent data would never produce anything useful. As a result, the first thing every organization should do when working with data is to ensure data quality and consistency.
Factors impacting data quality for industrial data
The importance of data quality becomes even more evident when businesses work with industrial data. Industrial data is noisier and can be affected by the following factors:
- Inconsistent data schema. Industrial data has minimal standardization. Every vendor and operator can define their own schema. If you are a large manufacturer, you are likely dealing with hundreds of data schemas from your equipment and edge data sources.
- Changing data schema. Data schemas can change over time. For example, an operator could change the data schema of a robot without realizing that the change could break downstream analytics.
- Unreliable data. Industrial data comes from physical data sources like sensors and on-prem equipment. These can malfunction or become miscalibrated, turning their output into "bad data." Unreliable network connectivity can also corrupt data packets.
- Sparse information. Physical data sources typically produce long periods of "normal" data punctuated by a few occurrences of "abnormal" data. For example, the temperature of a piece of equipment, sampled at 5-second intervals, may show normal values for many months before a sudden abnormal event occurs.
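To make the schema problem concrete, here is a hypothetical sketch of how two vendors might report the same temperature reading. The payloads and field names are invented for illustration: the names, units, and timestamp conventions all differ, so a query written for one schema silently breaks on the other.

```python
# Hypothetical payloads: two vendors reporting the same physical measurement,
# with different field names, units (Celsius vs. Fahrenheit), and timestamp formats.
vendor_a = {"ts": 1718000000, "temp_c": 71.3, "dev": "press-01"}
vendor_b = {"Timestamp": "2024-06-10T07:33:20Z", "TemperatureF": 160.3, "DeviceID": "PRESS_01"}

def read_celsius(packet):
    # A reader written against vendor A's schema...
    return packet["temp_c"]

print(read_celsius(vendor_a))   # → 71.3
# read_celsius(vendor_b)        # ...raises KeyError: 'temp_c' on vendor B's schema
```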
Ensuring data quality for industrial data
To ensure optimal data quality and consistency, data quality management should be built into the data pipeline and applied to the raw data. This ensures that consistently high-quality data is delivered to data warehouses and analytics tools. These steps include:
- Transformation. Data is transformed from its raw schema into a unified schema. This homogenizes the diverse data schemas coming from equipment and edge data sources, making it easy to analyze the data downstream with the same queries. In addition, data should be contextualized by adding metadata that gives meaning to the data.
- Validation. Data should always be validated against a set of validation rules, such as:
- Does the data packet include all required fields?
- Do all data types match?
- Are all data within range?
- Does data arrive at the expected data rate?
- Cleansing. Data cleansing algorithms can detect abnormal or miscalibrated data that is very difficult to detect manually. However, data cleansing often requires domain expertise, so it is desirable to use a data cleansing solution that allows domain experts to customize algorithms based on the particular properties of the data.
- Lineage. Data lineage is the process of tracking changes in the data schema over time. This allows users to see when the data schema was changed and what was changed, so they can understand the impact of those changes.
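The transformation step above can be sketched in a few lines. The unified schema, field names, and site metadata below are all assumptions for illustration; the point is that once both vendors' payloads are mapped onto one schema (with units converted and context attached), the same downstream query works on both.

```python
from datetime import datetime, timezone

def normalize_vendor_a(packet, site):
    """Map vendor A's raw schema (hypothetical) onto a unified schema."""
    return {
        "timestamp": datetime.fromtimestamp(packet["ts"], tz=timezone.utc).isoformat(),
        "temperature_c": packet["temp_c"],
        "device_id": packet["dev"].lower(),
        "site": site,  # contextual metadata added during transformation
    }

def normalize_vendor_b(packet, site):
    """Map vendor B's raw schema (hypothetical) onto the same unified schema."""
    return {
        "timestamp": packet["Timestamp"],
        "temperature_c": round((packet["TemperatureF"] - 32) * 5 / 9, 1),  # °F → °C
        "device_id": packet["DeviceID"].replace("_", "-").lower(),
        "site": site,
    }

a = normalize_vendor_a({"ts": 1718000000, "temp_c": 71.3, "dev": "PRESS-01"}, site="plant-7")
b = normalize_vendor_b(
    {"Timestamp": "2024-06-10T07:33:20+00:00", "TemperatureF": 160.3, "DeviceID": "PRESS_01"},
    site="plant-7",
)
# Both records now share one schema: same fields, same units, same identifiers.
assert a["device_id"] == b["device_id"] == "press-01"
assert a["temperature_c"] == b["temperature_c"] == 71.3
```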
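The validation rules listed above (required fields, types, ranges) translate naturally into a rule table. This is a minimal sketch; the specific fields and the -40 to 150 °C range are illustrative assumptions, not a real standard.

```python
# Hypothetical rule table: field → (expected type, allowed range or None).
RULES = {
    "timestamp": (str, None),
    "temperature_c": (float, (-40.0, 150.0)),
    "device_id": (str, None),
}

def validate(packet):
    """Return a list of rule violations; an empty list means the packet is valid."""
    errors = []
    for field, (expected_type, value_range) in RULES.items():
        if field not in packet:                      # required field present?
            errors.append(f"missing field: {field}")
            continue
        value = packet[field]
        if not isinstance(value, expected_type):     # type matches?
            errors.append(f"bad type for {field}: {type(value).__name__}")
            continue
        if value_range is not None:                  # value within range?
            low, high = value_range
            if not (low <= value <= high):
                errors.append(f"{field} out of range: {value}")
    return errors

ok = {"timestamp": "2024-06-10T07:33:20Z", "temperature_c": 71.3, "device_id": "press-01"}
bad = {"temperature_c": 900.0, "device_id": "press-01"}
assert validate(ok) == []
assert validate(bad) == ["missing field: timestamp", "temperature_c out of range: 900.0"]
```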
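For the cleansing step, one common family of algorithms flags points that deviate sharply from recent history. The sketch below uses a trailing-window z-score; the `window` and `threshold` parameters are exactly the kind of knobs a domain expert would tune for a particular signal, and the sample readings are invented.

```python
import statistics

def flag_outliers(values, window=20, threshold=4.0):
    """Flag points more than `threshold` standard deviations away from the
    mean of the trailing `window` points. Both parameters are domain-tunable."""
    flags = []
    for i, v in enumerate(values):
        history = values[max(0, i - window):i]
        if len(history) < 3:        # not enough context to judge yet
            flags.append(False)
            continue
        mean = statistics.fmean(history)
        stdev = statistics.stdev(history)
        flags.append(stdev > 0 and abs(v - mean) / stdev > threshold)
    return flags

# A stable temperature signal with one sensor glitch at index 5.
readings = [71.2, 71.3, 71.1, 71.4, 71.2, 250.0, 71.3]
flags = flag_outliers(readings, window=5, threshold=4.0)
assert flags[5] and sum(flags) == 1   # only the glitch is flagged
```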
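Finally, a lineage record can be as simple as a diff between two schema versions. This hypothetical sketch compares field-to-type mappings and reports what was added, removed, or retyped, which is the information a user needs to assess downstream impact:

```python
def schema_diff(old, new):
    """Report fields added, removed, or retyped between two schema versions,
    where each schema is a mapping of field name → type name."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(f for f in set(old) & set(new) if old[f] != new[f])
    return {"added": added, "removed": removed, "changed": changed}

# Hypothetical schema versions: v2 retypes "ts" and adds a "site" field.
v1 = {"ts": "int", "temp_c": "float"}
v2 = {"ts": "str", "temp_c": "float", "site": "str"}
assert schema_diff(v1, v2) == {"added": ["site"], "removed": [], "changed": ["ts"]}
```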
Every organization today should make producing consistently high-quality data its first priority. Once consistent, high-quality data is generated and made available, using this data effectively in analytics and AI becomes an achievable objective.