Devember 2021 - Web-application-hosted data catalog using Hadoop database objects with Java/JavaScript

My Data Governance project at work got shut down, so I’ve decided to build it anyway. Ideally, a user of this web application would be able to catalog and index any number of datasets of any kind.

The primary project will consist of 4 parts:

Web Interface
Data Ingest
Database
Python/Java-based data science tool (data query tool). I will not be making this; I will be borrowing an existing one.

Minimum Product:

  • Accessible web-interface
  • Functional Hadoop data catalog
  • Python data parsing into metadata-defined Hadoop catalog entries
  • User interface for importing data analysis tools

Stretch goals:

  • Alpha Invitation request form
  • Additional database adapters to allow for easy integration
  • Docker Container for self-hosting
  • Compatibility testing for as many tools as possible
  • Metadata defined data/metadata tool suggestion

Technologies I’m going to use:

  • HBase
  • Solr
  • Python3
  • Java/JavaScript
  • Flask
  • Docker

Stretch Adapters

  • Access Controls
  • Data Domain Sorting AI
  • Microsoft Azure SQL Server
  • Microsoft SQL Server
  • Oracle Database

Fortunately, I already have all the tools available, so I’ve already started provisioning my Linode. I anticipate getting this done in about 1 month.

Day 1.
I’ve started setting up the web server today and will probably finish tomorrow. DNS is configured, but I haven’t opened the ports yet, and I still need to add the records.

I have a website template to repurpose as a login/commerce portal.

To do:

  • Database!
  • Java data science tool adapters

Successful day so far. Services are now configured; waiting on the website cutover from the nginx blog until the landing page is complete. Pip added, because Debian thinks I really don’t need it… Webapp SFTP update sync settings are complete.

ODIn

Still need to figure out https settings to require it.
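One way to require HTTPS would be a redirect at the application layer. Below is a minimal sketch of the redirect logic in plain Python; the Flask wiring shown in the comments is a hypothetical example, since the actual proxy/TLS setup on the Linode isn’t decided yet.

```python
from typing import Optional

def https_redirect_target(url: str) -> Optional[str]:
    """Return the HTTPS URL to redirect to, or None if already secure."""
    if url.startswith("http://"):
        return "https://" + url[len("http://"):]
    return None

# In a Flask app this could hang off a before_request hook, e.g.:
#
#   @app.before_request
#   def require_https():
#       target = https_redirect_target(request.url)
#       if target:
#           return redirect(target, code=301)
```

In practice it is often cleaner to terminate TLS and redirect in the web server (nginx) itself, but the app-level hook works as a fallback.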

New web interface added; some CSS reworks may be in order. Content to be done this weekend.

So far I have been working on the Python parsing scripts. Some are tested and mostly working, and some are still to do.

Format List:

  • XML - complete
  • CSV - complete
  • HTML - in progress
  • JSON - not started
  • GIS - not started
  • … - not started
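To give a flavor of what these per-format parsers do, here is a minimal sketch using CSV as the example. The output field names (name, format, columns, rows) are my assumptions for illustration, not the project’s actual catalog schema.

```python
import csv
import io

def parse_csv_metadata(text: str, name: str = "unknown") -> dict:
    """Extract column names and a row count from CSV content."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, [])          # first row is assumed to be the header
    rows = sum(1 for _ in reader)      # count the remaining data rows
    return {"name": name, "format": "csv", "columns": header, "rows": rows}
```

Each of the other formats (XML, HTML, JSON, …) would get an equivalent function emitting the same record shape.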

I’m sure there are other formats used to store metadata, but I can hardly think anymore today. My next step will be to combine them into one script using XML as an intermediate format, and let all the parsing functions run fully parallel and asynchronous.
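The combined script could look something like the sketch below: each format handler runs concurrently via asyncio and emits a fragment of the common XML intermediate. The handlers here are stubs standing in for the real parsers.

```python
import asyncio
import xml.etree.ElementTree as ET

async def parse_stub(name: str, fmt: str) -> ET.Element:
    """Placeholder for a real per-format parser; returns an XML fragment."""
    await asyncio.sleep(0)  # real parsers would do async I/O here
    return ET.Element("dataset", {"name": name, "format": fmt})

async def catalog(files: list) -> str:
    """Parse all files concurrently and merge results into one XML catalog."""
    root = ET.Element("catalog")
    entries = await asyncio.gather(*(parse_stub(n, f) for n, f in files))
    root.extend(entries)
    return ET.tostring(root, encoding="unicode")

# Example: asyncio.run(catalog([("sales.csv", "csv"), ("notes.xml", "xml")]))
```

Because `asyncio.gather` schedules all the coroutines at once, slow parsers don’t block fast ones, which matches the fully-parallel goal above.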

I really should have worked on this more during the holidays, but progress is progress.

Metadata Dictionary development is complete. It can pull column-name information to provide additional context about the data contained in the column. For now this is an Excel spreadsheet using XLOOKUP, but it is formatted so that it can work agnostically across different data warehouse environments. As a tool, it can be used to add additional context to any other tagging or CMS (Content Management System) tool.

The worksheet consists of 2 tables:

  • The Extract and Load Table - In this case I am querying an Oracle Database, so I extract SchemaName, Table_Name, Column_Name, DataType, and Comments. Each record then has its Table_Name and Column_Name parsed into Terms. (6-10 columns should work in most cases: Term01, Term02…)

  • Metadata Dictionary - This table contains 4 columns: Term, Context, Term: Context, Source
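The term-parsing step from the Extract and Load Table can be sketched in Python. This assumes names are underscore-delimited (which is where the naming-standards point below comes in) and pads out to a fixed Term01…TermNN width so the table stays rectangular.

```python
def split_terms(name: str, max_terms: int = 10) -> list:
    """Split an underscore-delimited name into a fixed-width list of terms."""
    terms = [t for t in name.upper().split("_") if t]
    # Pad with empty strings so every record has the same number of columns.
    return (terms + [""] * max_terms)[:max_terms]
```

For example, a column named Customer_Order_Date would yield the terms CUSTOMER, ORDER, and DATE, each of which can then be looked up in the Metadata Dictionary.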

For each unique term a new record is added to the Metadata Dictionary.
This is where all of the magic happens.

In the Context column, put whatever will give the user an idea of what the data in the column is, how it is used, where it came from, and anything else you can think of to make your data more useful. I often note the format for things like dates, surrogate keys, or relational associations, because I want to use this to create new views.
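The XLOOKUP step could be reproduced outside Excel roughly like this: given the dictionary’s Term → Context pairs, assemble the combined "Term: Context" string for a column. The dictionary contents in the usage comment are invented examples.

```python
def column_context(column_name: str, dictionary: dict) -> str:
    """Join the Context entries for each term found in a column name."""
    terms = column_name.upper().split("_")
    parts = [f"{t}: {dictionary[t]}" for t in terms if t in dictionary]
    return "; ".join(parts)

# md = {"DT": "Date, stored as YYYYMMDD", "ORDER": "Sales order header"}
# column_context("Order_Dt", md)
```

Porting this logic out of the spreadsheet is what would let the same dictionary drive tagging in other warehouse environments.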

Some things to consider about this method of Metadata Management:

  1. It takes a bit of up front time to define each of your terms, especially if you have many columns with freeform text.
  2. If you choose to use a similar system in a large environment, such as your local electrical company, you will also want to enforce Naming Standards, but this topic branches into the advantages of Data Warehouses versus Data Lakes. Suffice it to say, it will be far more time-consuming to implement such a system in a Data Lake.