Data catalogs are becoming increasingly important as the amount of information held by an organization grows. Metadata about all data assets from various data sources can be stored and shared in a data catalog. Data storage locations, data owners, data types, and other descriptive and statistical information can all be found in the metadata. Any analyst, developer, or data scientist looking for readily usable data assets should start their search in the data catalog. The foundations of a productive data catalog are automation and cooperation. By adding business metadata to technical metadata, automation ensures that users have a more complete picture of an asset. Since not all metadata can be automatically enriched, it is also important to provide a straightforward interface where users can enrich some metadata on their own, discuss it, and contribute to the collective body of knowledge.
Suggested Business Terms
When a user adds something new to the data catalog (for example, a table from a relational database), they can enrich its metadata by labeling the entity with terms from a company-wide business glossary. Depending on the company’s internal procedures, the meaning of a given product code, for example, may vary. Detecting business terms is easy to automate with regular expressions and comparisons against reference data sets. However, this strategy lacks adaptability: a regular expression is static and always produces the same result for the same data (as described by the same metadata), no matter how users have interacted with it. Machine learning-based business term tagging needs no explicitly predefined assignment logic. It actively learns from user input and suggests terms based on previous user actions, inferring suggestions from the degree of similarity between items in the data catalog.
Machine learning allows the data catalog to adapt to new users and suggest relevant tags.
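The contrast between a static rule and similarity-based suggestions can be sketched in a few lines. This is only an illustration, not how any particular catalog implements it: the glossary terms, column names, the token-based Jaccard similarity, and the 0.3 cutoff are all assumptions chosen for the example.

```python
import re
from collections import Counter

# Static approach: a hypothetical regex rule mapped to a glossary term.
REGEX_RULES = {"Email Address": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")}

def regex_terms(sample_values):
    """Return terms whose pattern matches every sampled value."""
    return [term for term, pattern in REGEX_RULES.items()
            if sample_values and all(pattern.match(v) for v in sample_values)]

def tokens(column_name):
    """Split a column name such as 'cust_email' into lowercase tokens."""
    return set(re.findall(r"[a-z0-9]+", column_name.lower()))

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b) if a | b else 0.0

def suggest_terms(column_name, tagged_columns, min_similarity=0.3):
    """Suggest terms from columns users already tagged, best match first."""
    scores = Counter()
    target = tokens(column_name)
    for other_name, term in tagged_columns:
        similarity = jaccard(target, tokens(other_name))
        if similarity >= min_similarity:
            scores[term] = max(scores[term], similarity)
    return [term for term, _ in scores.most_common()]

# Illustrative user actions recorded so far: (column name, assigned term).
history = [("customer_email", "Email Address"),
           ("cust_phone_no", "Phone Number"),
           ("product_code", "Product Code")]

print(regex_terms(["a@example.com", "b@example.org"]))
print(suggest_terms("cust_email", history))
```

Every new tag a user assigns extends `history`, so the suggestions change with user behavior, whereas the regular expression stays fixed until someone edits it.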
Finding New Connections
Recognizing similar items in a data catalog is just the beginning of what machine learning can do. In most organizations, logically related data is spread across numerous systems, yet few mechanisms exist for formally recording the connections between these disparate data sets. With numerous undocumented ETL jobs chained together, it can be hard to get a clear picture of which data is related. Imagine a team of analysts starting a project with a data set they have been given. Employee turnover has likely eroded the context for the data, so some of the codes in it are hard to interpret (for example, the data set contains a code value that presumably refers to a description stored elsewhere, but the analysts have no idea where to find it). Through a process called “relationship discovery,” the data catalog can determine whether the code value is a reference to another data set in the company. This gives the analytics team more context for understanding the data and helps them produce reliable reports.
The data catalog uses similarity detection to find different kinds of relationships besides foreign keys. A machine learning-enabled data catalog will be able to identify relations between tables in different environments and duplicate data caused by multiple overlapping extracts, both of which are important in understanding the overall data lineage and structure of systems.
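One simple signal for relationship discovery is value containment: if nearly all values of one column also appear in a unique column of another table, the first column may reference the second. The sketch below assumes small in-memory tables and a 0.9 threshold; the table and column names are hypothetical, and a real catalog would likely combine several signals rather than rely on this one.

```python
def containment(child_values, parent_values):
    """Share of distinct non-null child values that also occur in the parent."""
    child = {v for v in child_values if v is not None}
    parent = set(parent_values)
    return len(child & parent) / len(child) if child else 0.0

def discover_relationships(tables, threshold=0.9):
    """Return (child column, parent column, score) candidates across tables."""
    columns = [(table, f"{table}.{column}", values)
               for table, cols in tables.items()
               for column, values in cols.items()]
    found = []
    for child_table, child_name, child_values in columns:
        for parent_table, parent_name, parent_values in columns:
            if child_table == parent_table:
                continue  # only look for links between different tables
            # A lookup target should hold unique values, like a key column.
            if len(set(parent_values)) != len(parent_values):
                continue
            score = containment(child_values, parent_values)
            if score >= threshold:
                found.append((child_name, parent_name, score))
    return found

# Hypothetical data: an orders table with an unexplained status code.
tables = {
    "orders": {"order_id": [1, 2, 3, 4],
               "status_cd": ["A", "B", "A", "C"]},
    "status_ref": {"code": ["A", "B", "C", "D"],
                   "description": ["Active", "Blocked", "Closed", "Draft"]},
}

for child, parent, score in discover_relationships(tables):
    print(f"{child} -> {parent} (containment {score:.2f})")
```

Here the only candidate reported is `orders.status_cd -> status_ref.code`, which is the kind of hint that would point the analysts to the table holding the code descriptions.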
Machine Learning for MDM Optimization
Master data management use cases also present a great opportunity for applying machine learning. Mastering is a crucial part of any reliable MDM solution. It consists of acquiring data from various sources, matching records that represent distinct instances of the same entity (e.g., the same person, company, or product) into buckets, and then merging each bucket into a single, superior record (the golden record). The rules for bucketing and for creating the golden record are typically set by hand. Due to regulatory requirements, especially when dealing with personal data, the rules often need to be deterministic and completely transparent (i.e., a white box). In addition to matching, users must have the option to manually split records that have been matched. All these requirements mean that the rules usually need to be configured up front and designed carefully.
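To make the bucketing and golden-record steps concrete, here is a minimal sketch of a deterministic, white-box rule set. The matching rule (same normalized email), the survivorship rule (most recent non-empty value wins), and the record fields are assumptions made for illustration only.

```python
from collections import defaultdict

def match_key(record):
    """Transparent matching rule: same normalized email means same entity."""
    return record["email"].strip().lower()

def bucket(records):
    """Group records that the matching rule considers the same entity."""
    buckets = defaultdict(list)
    for record in records:
        buckets[match_key(record)].append(record)
    return list(buckets.values())

def golden_record(bucket_records, attributes=("name", "email", "phone")):
    """Per attribute, keep the most recently updated non-empty value."""
    newest_first = sorted(bucket_records,
                          key=lambda r: r["updated"], reverse=True)
    return {a: next((r[a] for r in newest_first if r.get(a)), None)
            for a in attributes}

# Hypothetical source records from two systems.
records = [
    {"source": "crm", "name": "Jane Doe", "email": "Jane.Doe@Example.com",
     "phone": "", "updated": "2023-01-10"},
    {"source": "billing", "name": "J. Doe", "email": "jane.doe@example.com",
     "phone": "555-0100", "updated": "2022-06-01"},
    {"source": "crm", "name": "John Roe", "email": "john.roe@example.com",
     "phone": "555-0199", "updated": "2023-03-05"},
]

golden = [golden_record(b) for b in bucket(records)]
for record in golden:
    print(record)
```

Because both rules are plain functions, a steward or auditor can read exactly why two records were matched and where each golden value came from, which is the property the regulatory requirements call for.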
Analyzing the Efficiency of Rules
In practice, mastering data is an iterative process. It begins with consolidating data from various sources and standardizing its format. Once the initial rules have been prepared, the system is run on a sample dataset. The outcomes are examined, the rules are modified, and the procedure is repeated. The output can contain many false positives and false negatives: the engine may, for instance, match records that do not belong together and fail to match others that should be matched. If a rule is reliable enough, matching can be done automatically; otherwise, the rule can generate matching proposals that must be resolved by hand. Automatic matching saves businesses a lot of money, but manual splits of incorrectly merged records are more costly from a business perspective, so conservative matching rules are the norm. In addition, you can set up explicit overrides that prevent specific records from being merged.
Data stewards resolve matching proposals, address data quality issues, and manually label the reasons for their corrections. All of these interactions are a source of user feedback. Because data changes over time, the initial rules may no longer be suitable after some time in production. For example, a matching process may suddenly benefit greatly from an attribute that used to be empty and now arrives filled in. It would be valuable if the system could recommend changes to the rules, such as adding a new rule, removing an old one, or modifying an existing one. Some rules may also duplicate each other’s work, or their execution order may hurt efficiency, and the system should be able to help make the process as efficient as possible in these cases too.
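Steward decisions can be turned into a simple efficiency measure per rule. The sketch below assumes that each rule's proposed matches and the steward-confirmed matches are available as pairs of record IDs, and it uses precision and recall as the measure; the thresholds and recommendation texts are hypothetical.

```python
def rule_metrics(proposed_pairs, confirmed_pairs):
    """Precision and recall of one matching rule against steward decisions."""
    proposed = {frozenset(pair) for pair in proposed_pairs}
    confirmed = {frozenset(pair) for pair in confirmed_pairs}
    true_positives = len(proposed & confirmed)
    precision = true_positives / len(proposed) if proposed else 0.0
    recall = true_positives / len(confirmed) if confirmed else 0.0
    return precision, recall

def recommend(precision, auto_threshold=0.95, keep_threshold=0.5):
    """Map a rule's precision to a hypothetical recommendation."""
    if precision >= auto_threshold:
        return "merge automatically"
    if precision >= keep_threshold:
        return "create proposals for manual review"
    return "review or remove rule"

# Pairs of record IDs: what the rule proposed vs. what stewards confirmed.
proposed = [(1, 2), (3, 4), (5, 6), (7, 8)]
confirmed = [(2, 1), (3, 4), (5, 6), (9, 10)]

precision, recall = rule_metrics(proposed, confirmed)
print(precision, recall, recommend(precision))
```

A rule with high precision is a candidate for automatic merging, while low recall across all rules suggests that a new rule, or a newly populated attribute, could catch the matches currently being missed.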
Anomaly Detection and Machine Learning
Large-scale data processing jobs, crucial to most data governance architectures, are routinely scheduled and triggered by a wide range of events. These jobs typically transfer and transform data, and they frequently run as interconnected chains. Monitoring them closely is crucial, and one reason is semantic correctness: a job can complete successfully even when something unexpected has happened. Imagine a job that processes a full day’s worth of transactional data from a regional branch of a multinational corporation. This branch typically has far fewer transactions on Wednesday than on Friday, and none at all on weekends. In a given week, the Friday job produces fewer results than anticipated, while the Saturday job produces results when none were expected. Even though both jobs succeed, this behavior is suspicious. Machine learning excels at predicting well-behaved time series (i.e., time series with a trend and seasonality). In most cases, business processes behave predictably enough to warrant the use of machine learning models for outlier detection.
Ataccama ONE employs AI to identify anomalies in incoming data loads, such as sudden spikes or drops in data volume, statistical outliers, or unexpected shifts in other data characteristics. The user’s reaction to reported anomalies is used to refine the solution over time.
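The weekday example can be illustrated with a deliberately simple seasonal baseline. This is not a description of how Ataccama ONE detects anomalies; it is a sketch that assumes weekly seasonality only, uses the median and the median absolute deviation (MAD) per weekday, and works on made-up record counts.

```python
from collections import defaultdict
from statistics import median

def weekday_baselines(history):
    """history: (weekday, record_count) pairs; weekday 0 = Monday."""
    by_day = defaultdict(list)
    for weekday, count in history:
        by_day[weekday].append(count)
    baselines = {}
    for weekday, counts in by_day.items():
        center = median(counts)
        mad = median([abs(c - center) for c in counts])
        baselines[weekday] = (center, mad)
    return baselines

def is_anomaly(weekday, count, baselines, tolerance=5.0, min_spread=1.0):
    """Flag counts far from the usual level for that weekday."""
    center, mad = baselines[weekday]
    return abs(count - center) > tolerance * max(mad, min_spread)

# Four weeks of illustrative history for Wednesday (2), Friday (4), Saturday (5).
history = ([(2, c) for c in (100, 110, 95, 105)]
           + [(4, c) for c in (300, 310, 290, 305)]
           + [(5, c) for c in (0, 0, 0, 0)])
baselines = weekday_baselines(history)

print(is_anomaly(4, 120, baselines))  # Friday with far fewer records
print(is_anomaly(5, 250, baselines))  # Saturday with unexpected records
print(is_anomaly(2, 98, baselines))   # ordinary Wednesday
```

Both jobs from the example would be flagged even though they finished successfully, while an ordinary Wednesday load passes. A production model would also need to handle trend, holidays, and user feedback on reported anomalies.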
Maintaining High-Quality Data
Another use case for anomaly detection is the evaluation of a set of manually crafted data quality rules at predetermined intervals. These rules produce success rates over time, giving us a time series for each rule. In many cases, we can set a fixed percentage threshold per rule and alert the relevant data stewards if the result falls below that level. The behavior of the results might not be that simple, however, so a system that learns from user interaction what should and shouldn’t be considered an anomaly would be useful. Data discrepancies that show up as something other than quantitative variations in data feeds are significantly harder to understand and foresee. Since there may be more complex relations between the various data sources, it is important in this context to learn, identify, and explain the common causes of sudden changes in data quality. To do so, the solution needs to generate hypotheses that users can review and either confirm or reject.
With the help of AI and machine learning, the Ataccama ONE Data Quality Management module can automatically monitor data quality and make configuration recommendations based on historical trends, allowing businesses to make better decisions in less time.
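The difference between a fixed threshold and a monitor that learns from feedback can be sketched as follows. The class below is a hypothetical illustration, not the product's algorithm: it compares each new pass rate with the recent history of the same rule and lets steward feedback adjust its sensitivity by an arbitrary 10 percent.

```python
from statistics import mean, stdev

def fixed_threshold_alert(pass_rate, threshold=0.95):
    """Classic approach: alert whenever the pass rate is below a fixed level."""
    return pass_rate < threshold

class AdaptiveMonitor:
    """Flags pass rates far below recent history; feedback tunes sensitivity."""

    def __init__(self, sensitivity=3.0, window=7):
        self.sensitivity = sensitivity
        self.window = window
        self.history = []

    def check(self, pass_rate):
        recent = self.history[-self.window:]
        self.history.append(pass_rate)
        if len(recent) < 3:
            return False  # not enough history to judge yet
        spread = max(stdev(recent), 0.001)  # avoid division by zero
        return (mean(recent) - pass_rate) / spread > self.sensitivity

    def feedback(self, was_real_anomaly):
        # Confirmed alerts make the monitor more sensitive, dismissed ones less.
        self.sensitivity *= 0.9 if was_real_anomaly else 1.1

# A rule that normally passes for about 91% of records.
monitor = AdaptiveMonitor()
results = [monitor.check(rate) for rate in (0.91, 0.90, 0.92, 0.91, 0.80)]
print(results)
print(fixed_threshold_alert(0.91))
```

With a fixed 95 percent threshold this rule would alert on every run, whereas the adaptive monitor stays quiet until the drop to 80 percent. Calling `monitor.feedback(False)` after a dismissed alert raises the bar for the next one.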
Continuously Tracking More Data
Anomaly detection can be put to good use in master data management as well. When delta batches of records are processed on a regular basis, their aggregate characteristics (such as the number of records or the distribution of attribute values) are closely monitored for anomalies. If the previously predictable behavior of these characteristics suddenly changes, the data steward is alerted to take action, and the system learns interactively from the steward’s response to improve its accuracy.
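Monitoring the distribution of attribute values in a delta batch can be sketched by comparing it with a baseline distribution. The example assumes a single categorical attribute, uses total variation distance as the comparison, and picks 0.2 as an arbitrary alert limit; the country codes are made up.

```python
from collections import Counter

def distribution(values):
    """Relative frequency of each value; empty input gives an empty dict."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: n / total for value, n in counts.items()}

def total_variation(p, q):
    """Half the L1 distance between two discrete distributions (0 to 1)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def batch_is_anomalous(baseline_values, batch_values, max_distance=0.2):
    """Alert when the batch distribution drifts too far from the baseline."""
    distance = total_variation(distribution(baseline_values),
                               distribution(batch_values))
    return distance > max_distance

# Illustrative country codes from earlier batches and two new delta batches.
baseline = ["US"] * 60 + ["DE"] * 30 + ["FR"] * 10
normal_batch = ["US"] * 29 + ["DE"] * 16 + ["FR"] * 5
shifted_batch = ["US"] * 10 + ["XX"] * 40

print(batch_is_anomalous(baseline, normal_batch))
print(batch_is_anomalous(baseline, shifted_batch))
```

The second batch is dominated by a value never seen before, so it is flagged for the data steward. The steward's verdict on such alerts is the feedback from which the alert limit could be adjusted.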
Advantages of Self-Driving Data Management
It is not enough to simply have a large amount of data at your disposal; you also need to be able to quickly and easily access the specific data you need in order to gain actionable insights. Because of this, modern businesses require a data catalog driven by artificial intelligence. Machine learning is becoming increasingly important not only in the organization and maintenance of your data, but also in the management of records. Your data will be correct, easily searchable, and organized thanks to the use of machine learning in data management. Is your company curious about the benefits of our Self-Driving Data Management & Governance platform? Request a demonstration by contacting us.