Friday 11 August 2017

SchemaSpy 6, a data model extraction tool






SchemaSpy is a database schema and data model extraction tool: a single Java jar, schemaspy-6.0.0-rc1.jar, which can generate beautiful static HTML pages full of graphics and data when pointed at a database schema.

When you download the jar file for SchemaSpy from http://schemaspy.org, all you have to do is run a one-line command like the one below to get a website full of information about your database schema! No kidding. This is what the generated static HTML website looks like: http://schemaspy.org/sample/index.html

# run on my ubuntu desktop to extract a data model for an oracle schema

$ java -jar /home/kubilay/schemaspy/schemaspy-6.0.0-rc1.jar -t orathin -db your_sid -s your_schema -u your_schema -p your_password -o /home/kubilay/schema/tst/your_output_dir -dp /home/kubilay/drivers/ojdbc7.jar -host your_host -port 1521 -noviews -loglevel finest

As a result, /home/kubilay/schema/tst/your_output_dir will be full of the static HTML files and images the jar generates using Graphviz. Open index.html and start browsing your database schema. Check out the Relationships tab, which has a very nice drawing of your data model. The information SchemaSpy generates for a database schema is very rich and very easy to get: documentation for your database, ready to publish and share.

Well done to John Currier, the original developer of this tool, which is now also available on GitHub. Reading schemaspy.org, I see that JetBrains recently showed some interest in the tool and is helping with its further development. I just can't praise this tool enough.

Some prerequisites for the tool to run are:

- a Java runtime, to run the jar
- Graphviz, which SchemaSpy uses to draw the diagrams
- the JDBC driver for your database (ojdbc7.jar in the example above)
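As a quick sanity check before running SchemaSpy, a short script like this can confirm the basics are in place. This is just a sketch: the driver path is the illustrative one from the command above, not something SchemaSpy itself requires to be there.

```python
# Sanity-check SchemaSpy's prerequisites: a Java runtime and Graphviz
# (the "dot" binary) need to be on the PATH.
import os
import shutil

for tool in ("java", "dot"):        # dot is installed as part of Graphviz
    found = shutil.which(tool)
    print(f"{tool}: {found or 'MISSING'}")

# Illustrative JDBC driver path, taken from the example command above.
driver = "/home/kubilay/drivers/ojdbc7.jar"
print("jdbc driver:", "present" if os.path.exists(driver) else "not found")
```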


Enjoy your data model findings!





Thursday 10 August 2017

To schema or not



This is my first post on my new blog, and I would like to start with a fundamental topic: to schema or not? Coming from a database background, I can't ignore the schema-on-write versus schema-on-read debate, often raised in light of new developments in the field. The debate asks: should you know your schema before you store your data, or not?

Some definitions first (source: a Google search):
Schema-on-write
The data must all be cleansed, transformed and made to fit the structure before it can be stored, in a process generally referred to as ETL (Extract, Transform, Load). That is why it is called "schema on write": the data structure is already defined when the data is written and stored.
Schema-on-read
Schema on read refers to an innovative data analysis strategy in new data-handling tools like Hadoop and other more involved database technologies. In schema on read, data is applied to a plan or schema as it is pulled out of a stored location, rather than as it goes in.
They sound like two opposites, each suggesting the choice has to be one or the other. Which approach would you choose? Would you build a schema first and then save your data, or just save it and try to build the schema as you read the data later on? A safe answer here would be something like 'one size does not fit all' or 'it depends on the use case'. You can't choose only one and argue it is better than the other. Each has its own use case. Right?

Before you make a choice, let's look a bit more closely at the phenomenon of schemata.

Schemata on-write are data models which help us understand and communicate the meaning of data ahead of collection. They help us understand what it is we are looking for in the data. With schema-on-write, the data is defined, shared and agreed upon across the enterprise before it is collected. By looking at the abstractions of the data models we can quickly tell what we have and what we don't. The schema abstractions, when established ahead of data entry, enable us to cross-reference schemata of other domains and discover even more correlations and links in the data we intend to collect. The metadata that schema-on-write schemata provide gives us a path for analysis and classification of data prior to any querying. This is very useful knowledge to have when building data extraction engines and integration interfaces. With schemata on-write we can easily tell whether our data is reference data, master data, metadata or granular transactional data, ahead of any analysis. All this knowledge is available to us before we start analysing our data. Schema-on-write is quick to produce data analysis results; it is governed, coherent and carries a shared understanding across the enterprise. It is a good thing to have, but it comes at a price: it is rigid and inflexible.
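A minimal sketch of schema-on-write, using Python's built-in sqlite3 as a stand-in for any relational database: the structure must be agreed before any row is stored, and writes that do not fit the schema are rejected up front. The table and column names are purely illustrative.

```python
# Schema-on-write: the structure exists before the data, and the database
# enforces it at write time.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

# This row fits the agreed structure, so the write succeeds.
conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Alice')")

# This row references a column the schema never defined, so the write fails.
try:
    conn.execute("INSERT INTO customers (id, name, age) VALUES (2, 'Bob', 41)")
except sqlite3.OperationalError as e:
    print("rejected at write time:", e)
```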

The schema-on-read option postpones the fundamental questions for later. Schemata on-read are a collect-first, analyse-later approach. It is a method which gives infinite flexibility, where you don't have to collectively agree or decide ahead of time on what you can collect and define as data. It is the approach which lets you put everything in its original format, structured and unstructured, into a data domain - a data lake. You don't have to understand your data; just store it, then look at it and try to understand it later. This approach gives you a tremendous ability to quickly write vast amounts of data obtained from any kind of data source. But as the schema is not defined, understood and agreed upon across the enterprise as in schema-on-write, schema-on-read is an ungoverned data anarchy. It is an approach where the understanding of the data is postponed to read time, and the meaning of the data is provided by the data analyst only then. A drawback of this approach is that it takes time to establish coherence in the enterprise on what constitutes useful and actionable data. Schema-on-read also lacks the ability to provide you metadata ahead of analysis, as it takes time to extract, analyse and catalogue the vast amounts of data.
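The same contrast, sketched from the schema-on-read side: raw records are stored exactly as they arrive, and a structure is only imposed when the analyst reads them back. The JSON records here are made up for illustration; in a real data lake the "store" would be files or objects rather than a Python list.

```python
# Schema-on-read: store first, understand later. Heterogeneous records
# land unchanged; the "schema" is decided at read time by the analyst.
import json

raw_store = [
    '{"name": "Alice", "age": 30}',
    '{"name": "Bob", "city": "Zurich"}',   # different shape, still accepted
    '{"sensor": 7, "reading": 3.14}',      # not even the same domain
]

# At read time the analyst imposes a structure, e.g. (name, age);
# records that do not fit it simply yield gaps.
for raw in raw_store:
    record = json.loads(raw)
    print(record.get("name", "?"), record.get("age", "?"))
```

Note how nothing stopped the sensor reading from landing next to the customer records: that is the flexibility, and the trash-data risk, in one place.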

To help you choose what is best, below is a list of what is bad about each of these approaches, so that, depending on your use case, you can pick the least bad option for your data.

Schema-on-write: Impossible to store something which is not defined in the schema.
Schema-on-read: Not knowing your schema ahead of time may lead you to collect trash data, because you can always think about the schema later. This is a collect-everything approach.

Schema-on-write: Unstructured data has to be understood first and given structure; what good is raw binary if not analysed?
Schema-on-read: A "you don't have to understand it, just store it" approach: store everything, structured or unstructured, it doesn't matter.

Schema-on-write: The predefined data models, tables and columns in the schema inhibit creativity and flexibility.
Schema-on-read: Very subjective; depends on the analyst's understanding of the data. One can make invalid assumptions about what the data is. No coherent enterprise-wide meaning of the data.

Schema-on-write: Difficult to introduce new structures to the schema; takes time.
Schema-on-read: Time to deliver from the data lake is slow.

Schema-on-write: Data is governed, coherent, well understood and agreed upon.
Schema-on-read: No data governance, but no headaches with data format and structure either.