Thomas George, Chief Data Officer, Global Investment Banking, Societe Generale Global Solution Centre India. Thomas has around 20 years of experience in data quality management initiatives and has held multiple leadership roles within technology functions. In this blog post, he discusses data lineage through a simple example.
Regulators under the supervision of the Basel Committee on Banking Supervision (BCBS) have been encouraging global banks to publish the legitimacy of data transformations to prevent financial fraud. While it’s important to follow the regulators’ guidelines, it’s equally important for us as a financial institution to turn the tables and be trailblazers, managing the data flow for the organisation’s own needs and benefit. One such topic is data lineage.
Let’s understand more about data lineage through a simple example.
Think of a water distribution system. It has a centralized location from which the water supply for each household originates. A map of the system gives all the details: the main water origin point, transfer locations, secondary storage points, water pumps and so on.
If someone wants a new connection, the map gives a clear indication of a water point with the required pressure. Now think of a scenario with multiple origination points, storage points and water pumps: things become complicated, and the need for a map becomes even more critical.
The same applies to the numerous data attributes that are generated in an organisation. Data lineage works like a map that shows where the data comes from (source), where it flows to (consumption points and storage) and what happens to it along the way (transformation).
There are two ways data lineage is mapped in any organisation:
- Business data lineage (also called functional lineage): Journey of data without fine-grained technical details (high level)
- Technical lineage: Journey of data with fine-grained technical details of applications, tables/schemas and transformations
Another term that gets confused with ‘Data Lineage’ is ‘Data Provenance’, which is used in the context of business data lineage as well as for identifying the origin of data along with the processes that affect its creation at source. That is a discussion we will keep for later.
For now, let’s understand ‘Technical Data Lineage’.
Many organisations, especially banks, have moved from technical lineage to business data lineage, as it’s easier to do and can be done using expert judgement. Technical lineage becomes complicated because of duplicate, legacy, proprietary and disparate systems, among other things.
Technical lineage is still at a nascent stage, and we are yet to see a successful implementation of it in any industry. Solution providers that claim to have done successful implementations include Talend, Informatica Metadata Manager and MANTA, to name a few.
Some key benefits of technical lineage include:
- Data consumers collect huge volumes of data but rarely use it in processes or insights. Why waste compute and storage keeping this data? Such cases can be identified with technical data lineage
- Each data consumer reaches out to the sources directly, not knowing that the data is already in the data warehouse, which causes duplication. Technical lineage makes this visible
- When moving to a cloud data lake, mindful use and transfer of data is key, as it directly affects the bottom line. When moving house, you discard things you have not used for six months; the same should apply to data migration to the cloud. Get rid of data attributes that have not been used in the organisation for a long time. Technical lineage can be used to decide which data to dump and which to migrate, and unused data elements can be kept in the data lake as cold data
- Lineage is an important element of data virtualization (the capability to manipulate data without having to transfer it or know where it resides). For this to be efficient and effective, there is a need to know during setup what data resides where
- Without lineage, root cause analysis and impact analysis remain incomplete. This is like fixing leaks in plumbing: we always think the source has been identified, but that is rarely the case. If we are able to trace the flow from the source to the consuming point, we have it covered. Surprisingly, this holds good for data as well
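Several of these benefits come down to walking the lineage map in one direction or the other. The Python sketch below is only an illustration under stated assumptions: the attribute names, the edge list and the convention that consumption points start with "report." are all hypothetical, and a real lineage store would hold far more metadata.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges as (upstream, downstream) attribute pairs.
EDGES = [
    ("crm.customer.id", "dwh.customer.id"),
    ("crm.customer.segment", "dwh.customer.segment"),
    ("dwh.customer.id", "report.exposure.customer_id"),
    ("crm.customer.fax", "dwh.customer.fax"),
]


def build_graph(edges, reverse=False):
    """Adjacency map; reverse=True points from downstream to upstream."""
    graph = defaultdict(set)
    for up, down in edges:
        if reverse:
            graph[down].add(up)
        else:
            graph[up].add(down)
    return graph


def reachable(graph, start):
    """Breadth-first search: all nodes that can be reached from start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def impact_of(edges, attribute):
    """Impact analysis: everything downstream of an attribute."""
    return reachable(build_graph(edges), attribute)


def root_causes_of(edges, attribute):
    """Root cause analysis: everything upstream of an attribute."""
    return reachable(build_graph(edges, reverse=True), attribute)


def unused_attributes(edges, consumption_prefix="report."):
    """Stored attributes that never flow into any consumption point."""
    graph = build_graph(edges)
    nodes = {name for edge in edges for name in edge}
    unused = set()
    for node in nodes:
        if node.startswith(consumption_prefix):
            continue
        downstream = reachable(graph, node)
        if not any(d.startswith(consumption_prefix) for d in downstream):
            unused.add(node)
    return unused
```

In this toy map, `impact_of(EDGES, "crm.customer.id")` follows the flow to the warehouse and the report, `root_causes_of` walks the same path backwards from the report, and `unused_attributes(EDGES)` flags the segment and fax attributes as candidates for cold storage or removal.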
Approach to data lineage
Steps involved in building technical lineage:
- Identify the source – system, schema and format
- Identify transfer – everywhere the data is transferred to, and where illegitimate manipulation of the data is possible
- Identify transformation – the process the data goes through and any enrichment that gets done
- Identify consumption points – data users
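The four steps above can be captured as one record per data attribute. What follows is a minimal Python sketch rather than a reference design: the class names, the fields and the example systems (trading_platform, staging, data_warehouse, risk_reporting) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Source:
    """Step 1: where the attribute originates."""
    system: str
    schema: str
    data_format: str


@dataclass
class Hop:
    """Steps 2 and 3: one transfer, plus any transformation applied on it."""
    target_system: str
    transformation: str = "none"
    reviewed_for_manipulation: bool = False


@dataclass
class LineageRecord:
    attribute: str
    source: Source
    hops: List[Hop] = field(default_factory=list)
    consumers: List[str] = field(default_factory=list)  # Step 4: data users

    def unreviewed_transfers(self):
        """Transfers not yet checked for possible illegitimate manipulation."""
        return [h.target_system for h in self.hops
                if not h.reviewed_for_manipulation]

    def describe(self):
        """Readable path from the source through every transfer."""
        path = [self.source.system] + [h.target_system for h in self.hops]
        return f"{self.attribute}: " + " -> ".join(path)


# Hypothetical example attribute and systems.
record = LineageRecord(
    attribute="trade_notional",
    source=Source("trading_platform", "trades", "relational table"),
    hops=[
        Hop("staging", reviewed_for_manipulation=True),
        Hop("data_warehouse", transformation="convert to EUR"),
    ],
    consumers=["risk_reporting"],
)

print(record.describe())
print(record.unreviewed_transfers())
```

Keeping the transformation on the hop rather than on the record is a design choice made here so that each enrichment stays attached to the transfer where it happens.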
Ways of doing technical data lineage:
- Manual lineage is where experts sit together and find out what data goes where and what transformation is being done. With almost 2,000 systems, this becomes a herculean task requiring quite a lot of capex investment
- Pattern-based lineage uses AI/ML solutions to look at the data and arrive at the suspected links and the exact logic, which then need only a manual check to confirm. It requires less capex investment and gives relatively good accuracy
- Tagging is not so great: it involves tagging each attribute, which means adding a column for each attribute. A lot of coding and optimization is needed to avoid performance impact
- Parsing is relatively good, provided we find a solution that can read through the different programming languages used in the organisation. Even though this is the best approach, my quest for such a solution has not yielded any good results
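To make the pattern-based idea concrete, here is a deliberately simple Python sketch. It is a stand-in for the AI/ML solutions mentioned above, not a description of any vendor's method: it only compares the distinct values of two columns (Jaccard overlap) and proposes likely links for an expert to confirm. The column names, sample values and the 0.8 threshold are all assumptions.

```python
def jaccard(a, b):
    """Share of distinct values that two columns have in common."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def suggest_links(source_columns, target_columns, threshold=0.8):
    """Propose (source, target, score) candidates for manual confirmation."""
    suggestions = []
    for s_name, s_values in source_columns.items():
        for t_name, t_values in target_columns.items():
            score = jaccard(s_values, t_values)
            if score >= threshold:
                suggestions.append((s_name, t_name, round(score, 2)))
    # Strongest candidates first, so experts review the likeliest links early.
    return sorted(suggestions, key=lambda item: -item[2])


# Hypothetical samples from a source system and a warehouse table.
SOURCE = {
    "crm.customer.id": [101, 102, 103, 104],
    "crm.customer.country": ["FR", "IN", "FR", "UK"],
}
TARGET = {
    "dwh.cust.customer_key": [101, 102, 103, 104, 105],
    "dwh.cust.ctry": ["FR", "IN", "UK"],
}

for candidate in suggest_links(SOURCE, TARGET):
    print(candidate)
```

Value overlap only catches data that is copied more or less unchanged. Transformed or enriched attributes would need richer matching, which is where the manual check remains essential.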
Given the points above, I would say that there is a need for technical data lineage. The implementation should be a combination of pattern-based and manual solutions, which would go above and beyond what regulators need and would give us control over our data and hence its value. We are running helter-skelter to manage data, and this should change.
Above all, as going green is today’s call for a better tomorrow, with data lineage you build good karma and earn the gratitude of the future generations who will work on CRM, REG, CPLE, etc.