Schematographer

Sunday, 17 September 2017

The Wonders Of Data Modelling!

If you need to create a database, you need a data model first. Don't even think of looking at an existing database, without studying the data model first. The data model is the map of a database. It is the first user manual you have to look at before trying to understand the data in the database. The data model will tell you a lot about what you will find in the database, for example the tables, columns and other entities. No matter what type of database, relational or other types, you will need a data model to start with. This post is about the data models you find in relational databases like Oracle, MS SQL Server, DB2, MySQL and many others.

A data model is a diagram, a schematic representation of an understanding of the data requirements in a process. If you are in the business domain, your data models will describe the data structures and their relations with each other, in your business processes. For example customers and orders. Data models are the manuscripts- the plans of a database. Therefore, the data models are created before the databases and not the other way around. You have to grasp the requirement and the problem first before attempting to solve it with a data model and then a database. To create a data model you will need to have a problem to solve first, hence you will need a problem statement.

The creation of a data model doesn't happen all at once- it is created in stages. The process of its creation usually comes in three stages. After you have read the requirement in the problem statement, you first create a conceptual data model. Then you evolve it to a logical data model, which is sometimes called a relational schema. Lastly, the final stage is the physical data model, a model ready to be implemented in SQL and in relational databases.

Now, let's walk through on how to create a data model for a business problem statement, say in a commercial business which sells products to customers.

Problem statement

Customers buy products of the business, via orders. All orders must be recorded in the database indicating which product was bought by which customer.

Here are the possible data models which convey understanding of the above problem statement:

From left to right, you can see the evolution of the data model as it solves the problem statement. The conceptual model is very 'high level'. It barely records the entities of Customers, Orders and Products involved in the problem statement. The lines indicate some sort of relationships between the entities. The second stage, the logical model, puts the unique identifiers of future records in these entities- the primary keys. The lines indicate relationships a bit clearer and you can see the primary keys of Customers and Products posted in Orders as foreign keys to construct the relationship. Chen's notation is used to show the participation of the entities in the relationships here. For example, a product can be ordered several times. Finally, the physical data model is the one which is ready to be scripted and put in the database. It has physical attributes and their data types, such as customer name and product name of data type VARCHAR.

Below are some more techie comments on the three stages of data modelling:

What is conceptual data model?

It is the data model which records the data requirements of the business processes. It is technology independent, captures initial requirements in a business process. Say you are a business selling products to customers as seen above. The conceptual model, at a very high level, describes objects or entities which will store data to help you accomplish your business process. Conceptual data models are straightforward simple box-and-line diagrams.

What is logical data model?

This model is a bit more specific. The logical data model is still technology independent. The logical model tries to identify entities, their relationships with other entities and their primary keys and foreign keys. The logical data model is a bit more involved with the specifics of the problem statement and a single conceptual data model can require multiple logical data models.

What is a physical data model?

It is the data model for transforming the logical data model into physical data model. It is the data model which organises the data into tables and user accounts and security. The physical data model describes all entities and the data types of their specific attributes and columns. The physical data model can also have technology specific elements such as special data types and other constraints used only by a certain technology or even vendor. The physical model is implementation ready. Finalised to the greatest detail and ready to be picked up for implementation by DBAs or database developers.

The data models are living schemata. There is no final data model. They evolve and they extend with time. Data models are great instruments which help us understand the meaning of data and databases.

Friday, 11 August 2017

Schemaspy6 a data model extraction tool

SchemaSpy is a database schema and data model extraction software. A single java jar schemaspy-6.0.0-rc1.jar which can generate beautiful static html files with graphics and data when pointed to database schemata.

When you download the jar file for schemaspy from http://schemaspy.org all you have to do is run a one line command like below to get a website full of information about your database schema! No kidding. This is how the generated static html website looks like: http://schemaspy.org/sample/index.html

# run on my ubuntu desktop to extract a data model for an oracle schema

$ java -jar /home/kubilay/schemaspy/schemaspy-6.0.0-rc1.jar -t orathin -db your_sid -s your_schema -u your_schema -p your_password -o /home/kubilay/schema/tst/your_output_dir -dp /home/kubilay/drivers/ojdbc7.jar -host your_host -port 1521 -noviews -loglevel finest

As a result you will get your /home/kubilay/schema/tst/your_output_dir full of static html files and images the jar file will generate using Grapvhiz. Click on index.html and start browsing your database schema. Check out the relationships tab where it has a very nice drawing of your data model. The information schemaspy generates for a database schema is really very rich and so easy to get. A database schema documentation for your database, ready to publish and share.

Well done to John Currier the first developer of this tool now also available on Github. Reading on schemaspy.org, I see that JetBrains recently showed some interest in the tool and is helping with its further development. I just can't praise enough this tool.

Some prerequisites for the tool to run are:

java 7 or 8 installed https://java.com/en/download/
You must have graphviz installed http://www.graphviz.org/ (I installed graphviz 2.38.0-1-saucy amd64.deb)
jdbc drivers (ojcdbc7.jar) installed http://www.oracle.com/technetwork/database/features/jdbc/jdbc-drivers-12c-download-1958347.html (I installed Oracle thin drivers)

Enjoy your data model findings!

Thursday, 10 August 2017

To schema or not

This is my first blog post in my new blog and I would like to start with a fundamental topic. To schema or not? Coming from a database background I can’t ignore the schema-on-write and schema-on-read debate often posted in light of new developments in the field. The debate which asks if you should know your schema before you store your data or not?

Some definitions first -source Google search

Schema-on-write
The data must all be cleansed, transformed and made to fit in that structure before it can be stored in a process generally referred to as ETL (Extract Transform Load). That is why it is called “schema on write” because the data structure is already defined when the data is written and stored.

Schema-on-read
Schema on read refers to an innovative data analysis strategy in new data-handling tools like Hadoop and other more involved database technologies. In schema on read, data is applied to a plan or schema as it is pulled out of a stored location, rather than as it goes in.

They sound like two opposites each suggesting the choice has to be one. Which approach would you choose? Would you build a schema first and save your data or just save and try to build the schema as you read the data later on. A safe answer here would be to say something like: ‘one size does not fit all’ or ‘depends on the use case’. You can’t choose only one and argue is better than the other one. Each has it’s own use case. Right?

Before you make a choice let’s look a bit closer to the phenomena of schemata.

Schemata on-write are data models which help us understand and communicate the meaning of data, ahead of collection. They help us understand what is it that we are looking for to find in data. With schema-on-write data is defined, shared and agreed upon in the enterprise before the data is collected. By looking at the abstractions of data models of schemata we can quickly know what we have and what we don’t. The schema abstractions, when established ahead of data entry, enable us to cross reference other schemata of different domains and discover even more correlations and links to the data that we intend to collect. The metadata the schema-on-write schemata provide us a path of analysis and classification of data prior to any querying of data. This is a very useful knowledge to have when building data extraction engines and integration interfaces. With schemata on-write we can easily tell if our data is reference data, master data, metadata or granular transactional data ahead of any analysis. With schema-on-write, all this knowledge is available to us before we start analysing our data. Schema-on-write is quick to produce data analysis results, is governed, coherent and has a shared understanding across the enterprise. It is a good thing to have, but it comes at a price, is rigid and inflexible.

The schema-on-read option is postponing the fundamental questions to be answered, for later. Schemata on-read are a collect first and analyse later approach. A method which gives infinite flexibility, where you don’t have to collectively agree or decide ahead of time on what you can collect and define as data. Is the approach which lets you put everything in it’s original format, structured and unstructured, as a data domain - in a data lake. You don’t have to understand your data, just store it and look at and try to understand it later. This approach gives you tremendous ability to write fast, vast amounts of data, obtained from any kind of data source. As the schema is not defined and understood and agreed upon across the enterprise like in schema-on-write, schema-on-read is an ungoverned data anarchy. Schema-on-read is an approach where the understanding of the data is postponed to read-time. The understanding of the meaning of data is provided by the data analyst only during data read time. A drawback of this approach can be that it takes time to establish coherence in the enterprise on to what is the meaning of useful and actionable data. Schema-on-read also lacks the ability of providing you metadata ahead of analysis as it will take time to extract and analyse and catalog the vast amounts of data.

To help you choose what is best, next is a list of what is bad about each of these approaches, so that depending on your use case, you can choose the least bad for your data.

Schema-on-write	Schema-on-read
Impossible to store something which is not defined in the schema	Not knowing your schema ahead may lead you to collect trash data. Because you can always think about the schema later. This is a collect everything approach.
Unstructured data has to be understood first and get structured. Of course, what good is just binary if not analysed.	You don’t have to understand it just store it approach, store everything, structured or unstructured, it doesn't matter.
The predefined data models, tables and columns in the schema inhibit creativity and flexibility.	Very subjective depends on the analysis and understanding of data from the analyst point of view. One can make invalid assumptions on to what is data. No cohere enterprise wide meaning on what is data.
Difficult to introduce new structures to the schema, takes time.	Time to deliver from the data lake is slow.
Data is governed, coherent and well understood and agreed upon.	No data governance, no headaches with the data format and data structure