Wednesday, January 19, 2011

Code Canvas improves knowledge transfer in code and models

Software developers, programmers, and architects share the same habit of diagramming on whiteboards. Much actual code is developed based on the ideas implied in these diagrams and graphs. In fact, these visual elements become essential when the code or system is made available to other people: users, new developers, customers, and so on. Traditionally, software development includes a modelling step before and around the coding in order to capture this meta-knowledge. How about some effort to make these steps easier and faster?

The Code Canvas prototype from Microsoft Research is an interesting step toward automating the creation of code maps, which ease the understanding of complex code structures and class diagrams. As shown in Kael Rowan's blog entry (here), Code Canvas looks like a future add-on to Visual Studio. As mentioned in an article in Communications of the ACM ("Software Development with Code Maps," Vol. 53, No. 8, 2010), having a decent code map makes it much faster for newcomers to adapt to a project and complete their tasks.

Friday, December 31, 2010

The dilemma of consistency in technology architecture

As one of the key elements of enterprise architecture, the "platform" or "technology" architecture concerns which software and hardware tools and systems to use across the whole organization.

One can imagine that there are ways of maintaining a "buying list" of such systems for an enterprise. Historically, and even today, large enterprises tend to put a lot of effort into maintaining the "open system" concept, meaning that the enterprise uses software systems that accept and enable interoperability, portability, and open software standards. In reality, however, not all software vendors fully support open standards, and using systems from various technology and vendor backgrounds means extra cost of ownership for IT managers. The following two phenomena illustrate the general dilemma most enterprises face in technology choices.

  • Due to business and economic requirements, different departments are becoming more and more vendor-dependent in spite of the enterprise-wide strategy of supporting the open system concept.
  • Although most departments use standard products, whether from a single vendor or a limited group of them, local units still choose packaged solutions (often called appliances) in situations where they find it hard (or expensive) to use the standard tools.
So should an architect put all her/his effort into ensuring the consistent use of standard tools?

Although consistency is certainly a personal virtue, it does not extend well to IT architecture. Instead of spending all the effort on keeping things consistent, it is more useful to maintain the architectural strategy in a timely manner and be ready to accept new ideas and changes from time to time.

In terms of integrating different systems, it is always important to consider a set of capabilities in the infrastructure when adding new items to the buying list:

  1. Middleware and gateways that enable the integration;
  2. Communication protocols such as web services;
  3. Information brokers, such as those that transform data types or character sets (e.g., ASCII to EBCDIC), or perform XML-based transformations (see the sketch after this list);
  4. BPM tools that cope with processes at various frequencies;
  5. Event and alert management tools;
  6. Message-based systems, i.e., solutions that can keep messages for the various systems;
  7. Application-oriented adapters that support integration of applications with other existing solutions.
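
To make item 3 concrete, here is a minimal Python sketch of an information-broker-style transformation. It is only an illustration under assumptions of my own: code page cp037 stands in for EBCDIC, and the element names in transform_order are hypothetical.

```python
# A sketch of an information-broker-style transformation, assuming Python's
# built-in cp037 codec (one common EBCDIC code page) and the standard
# xml.etree library; the element names in transform_order are hypothetical.
import xml.etree.ElementTree as ET

def to_ebcdic(text: str) -> bytes:
    """Re-encode an ASCII/Unicode string into EBCDIC (code page 037)."""
    return text.encode("cp037")

def from_ebcdic(data: bytes) -> str:
    """Decode EBCDIC bytes back into a regular Python string."""
    return data.decode("cp037")

def transform_order(source_xml: str) -> str:
    """Map a hypothetical source <order> document onto a target schema."""
    src = ET.fromstring(source_xml)
    target = ET.Element("PurchaseOrder")
    ET.SubElement(target, "Id").text = src.findtext("order-id")
    ET.SubElement(target, "Amount").text = src.findtext("total")
    return ET.tostring(target, encoding="unicode")

if __name__ == "__main__":
    print(from_ebcdic(to_ebcdic("HELLO")))  # simple round-trip check
    print(transform_order(
        "<order><order-id>42</order-id><total>99.50</total></order>"))
```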

Friday, October 22, 2010

OO, star-schema, and anchor modelling

Apparently there are many data modelling methodologies besides relational modelling. In the data warehouse and BI world, the keyword "multi-dimensional model" has been dominant for more than two decades, and Kimball's theory of creating dimensional models has been widely adopted by the industry. What many people have found out, often through hard experience, is that multi-dimensional data models are well suited to data analysis purposes and fit the analytical mindset of most business users. They are, however, not a proper model for maintaining a large data warehouse where multiple data sources are ETL-ed into a single version of the truth.
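
For readers who have not seen a dimensional model before, here is a minimal star-schema sketch, assuming a toy retail example; the table and column names are purely illustrative and not taken from any particular Kimball design.

```python
# A minimal star-schema sketch in SQLite: dimension tables plus one fact table.
import sqlite3

ddl = """
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, full_date TEXT, year INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_store   (store_key INTEGER PRIMARY KEY, city TEXT, region TEXT);
-- The fact table holds the measures and foreign keys to the dimensions.
CREATE TABLE fact_sales (
    date_key    INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    store_key   INTEGER REFERENCES dim_store(store_key),
    quantity    INTEGER,
    amount      REAL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(ddl)

# A typical analytical query: slice the measures by dimension attributes.
query = """
SELECT d.year, p.category, SUM(f.amount) AS revenue
FROM fact_sales f
JOIN dim_date d    ON f.date_key = d.date_key
JOIN dim_product p ON f.product_key = p.product_key
GROUP BY d.year, p.category;
"""
print(conn.execute(query).fetchall())  # empty until the tables are loaded
```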

Object-oriented modelling entered the database world more than ten years ago. As things stand, I can only see that Oracle has adopted parts of this methodology into its commercial product. Maybe OO is simply not the right way to manage data, at least in the OLTP/OLAP world.

In the data warehouse modelling world, one of the key challenges for every data warehouse is how to keep the history of data. Different data has different profiles: some changes often, some needs a traceable history of its changes, some never changes, and some only needs the most current value. To model the various enterprise data and keep the histories properly, the concept of anchor modelling has recently been discussed at the latest ER conferences. It is worth reading up on this topic to learn the details of the method.
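
As a rough illustration of how anchor modelling keeps history, here is a small sketch. It shows only one anchor with one historized attribute; real anchor models also include ties and knots, and the naming here is my own simplification of the published method.

```python
# A rough anchor-modelling sketch: one anchor plus one historized attribute.
import sqlite3

ddl = """
-- The anchor holds nothing but the identity of the entity.
CREATE TABLE customer_anchor (
    customer_id INTEGER PRIMARY KEY
);
-- Each attribute lives in its own table, historized with a valid-from date,
-- so a change never overwrites history: it just adds a new row.
CREATE TABLE customer_name (
    customer_id INTEGER REFERENCES customer_anchor(customer_id),
    name        TEXT,
    valid_from  TEXT,
    PRIMARY KEY (customer_id, valid_from)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(ddl)
conn.execute("INSERT INTO customer_anchor VALUES (1)")
conn.execute("INSERT INTO customer_name VALUES (1, 'Smith AB', '2009-01-01')")
conn.execute("INSERT INTO customer_name VALUES (1, 'Smith Ltd', '2010-06-15')")

# The current value per customer is the row with the greatest valid_from date.
latest = """
SELECT customer_id, name FROM customer_name n
WHERE valid_from = (SELECT MAX(valid_from) FROM customer_name
                    WHERE customer_id = n.customer_id);
"""
print(conn.execute(latest).fetchall())  # [(1, 'Smith Ltd')]
```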

Friday, September 24, 2010

Relational model and E-R diagram

Well, anyone can pick up the definition of the relational model by Googling it and reading the Wikipedia page. What I want to point out, based on my experience, is that the relational model is a clean way to sort out the concepts around data in an enterprise setting. There have been other types of models in database theory, such as hierarchical and network models. Relational models fit the needs of OLTP system design almost perfectly and have been the dominant modelling method for more than 40 years. Nowadays, object-oriented models and other variations of relational models also exist in the industry.

One of the key challenges of relational modelling is the management of history inside a model. In relational theory, the normal forms from 1NF and 2NF up through BCNF, 4NF, and 5NF are still not enough to keep historical data in a clean way; 6NF has been introduced to manage historical data within relational theory.

The key concepts in relational modelling are, for example, entities, entity sets, relationships, relationship sets, and one-to-many, many-to-many, and one-to-one relationships. So how do people in the enterprise world understand a data model? Diagrams, yes, diagrams. Most people cannot make sense of modelling tools such as IDA, RSM, or WBM until they see the diagrams shown in the tool. Diagrams are the key output of a modelling session. A small example of how these cardinalities typically end up in a schema is sketched below.
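
Here is that sketch, in SQLite; the table and column names are illustrative only.

```python
# A sketch of how the cardinalities above typically map onto tables.
import sqlite3

ddl = """
-- One-to-many: each order belongs to exactly one customer.
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE customer_order (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(customer_id)
);
-- Many-to-many: a junction (relationship-set) table with a composite key.
CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE order_line (
    order_id   INTEGER REFERENCES customer_order(order_id),
    product_id INTEGER REFERENCES product(product_id),
    quantity   INTEGER,
    PRIMARY KEY (order_id, product_id)
);
"""
sqlite3.connect(":memory:").executescript(ddl)  # the schema parses and loads
```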

So, improve your "diagramming" skill if you plan to be a modeller :)

Friday, September 10, 2010

How data modelling has fared in enterprises

Well, it all depends...

For an enterprise where modelling is considered an important step towards the maturity of IT development, process, functionality, and data modelling (and others, such as user experience) are an important part of developers' lives.

What is interesting to see is that data modelling has not been considered as important as the functionality or process parts in most organizations. This is not difficult to understand. The E-R modelling discipline has served most transactional system designs, and other techniques, such as multidimensional modelling, have been well used in most situations. And I believe many small-to-medium, or close to large, organizations will still have no problem doing only basic E-R modelling and hiring good DBAs to take care of the rest for the next decades.

The difference is that some organizations (I cannot find a better word than "some") have had good business in the past, acquired different lines of business, and owned different systems for many years. When they integrate their IT solutions (and they end up doing this most of the time), it has proven extremely difficult to integrate at the process level, the functionality level, or any other level. The data model is the only reasonably easy way to make the integration succeed. So it is only in recent years that many large or extra-large organizations have started to realize the importance of controlling their data, meaning data modelling, data quality, master data, and metadata.

Another angle is to look at the vendors of data modelling tools and data models. There are quite a lot of data modelling tools on the market (we mentioned this in a previous note), but there are only a few leaders, most of whom also provide large database engines. And vendors of data models are limited to a handful of giant software vendors. So this market has, up to now, been very limited.

This phenomenon looks interesting and even funny to me. Most people talk about the information explosion of the new IT era, but it is the same people who choose not to understand their data, which is then translated into information for some purpose. :)

Friday, August 27, 2010

Useful resources about data modelling

I would definitely start with Wikipedia: http://en.wikipedia.org/wiki/Data_modeling, which has a lot of good links, and http://www.databaseanswers.org, which is also very useful.


To get a deeper and more detailed understanding of data modelling, there are plenty of tutorials, books, and presentations on the subject. Among these, I found the following quite useful.

  • Len Silverston (and his team)'s work, "The Data Model Resource Book," is definitely a must-have for all data modellers. The three volumes of the series are a make-a-living tool for many data modellers in the industry.
  • The classic "Data Modelling Essentials" by Graeme Simsion and Graham Witt is another book I recommend everybody keep on their bookshelf.
  • If you are more on the business side and would like to know a bit more about data modelling, I would recommend Steve Hoberman's book "Data Modelling Made Simple."
  • As we have been talking about BI and data warehousing in this blog, the typical theory of multidimensional modelling, star schemas, and snowflake schemas is well introduced in Kimball's book series; here I would recommend the dimensional modelling book in that series.
  • Besides these books for industry users, many database theory books spend one or two chapters on E-R modelling, normalization, and dimensional modelling. That is fair enough for university students, and I would actually recommend newcomers to the data modelling world to start their reading with these classical theories. (I introduced two such books in a previous posting.)
  • More of this intellectual work exists as products or services from various companies rather than in books or blog notes. The IBM industry data models have been well developed for several industries. Similarly, Teradata sells its industry data models together with its platform. Even SAS has its own data model to support its well-known BI platform. What one needs is to find an employer willing to hire and train you so that you can learn these industry products. Right?

Friday, August 20, 2010

Data modelling tools and vendors

It is important to point out that a data modelling tool (I mean, a decent and correct tool) should support at least logical and physical modelling. A tool whose only purpose is to create a database schema or database design is DEFINITELY not a data modelling tool, but a database development tool.

I found this link very useful: http://www.databaseanswers.org/modelling_tools.htm
It is actually kept quite up to date with the most recent modelling tools.

For enterprise users, Erwin, IDA (previously RDA), MS Visio, Sybase PowerDesigner and Data Architect, and the Oracle tools are the most relevant on the list. I believe SAS also has tools that support a certain level of data modelling.

There is no need to compare these tools here, but there are a few things to keep in mind if you are evaluating data modelling tools.

First, a data modelling tool should be able to support different types of models: conceptual, logical, and physical. Typical database theory says that one should start from conceptual modelling and move down to logical models and then to physical data models.

Second, UML modelling or diagramming is not data modelling. Creating UML elements and showing some diagrams (Visio is a good example here) is not modelling but sketching. Data modelling requires one to create entities, relationships, attributes, keys, etc. (I listed them in a previous note), and the right data modelling tool keeps this metadata in the data model. Creating diagrams is a small part of data modelling: when the data model gets complicated, it is impossible to show all the details in the diagram, and that is when you need the modelling tool to hold the design details. A rough sketch of what "keeping the metadata" can mean is shown below.
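
The sketch uses plain Python dataclasses to hold the kind of metadata a modelling tool would keep behind the diagram; the class and field names are my own invention, not taken from any particular tool.

```python
# A rough sketch of the metadata behind a diagram: entities, attributes,
# and relationships as data, not as pictures.
from dataclasses import dataclass, field

@dataclass
class Attribute:
    name: str
    data_type: str
    is_key: bool = False

@dataclass
class Entity:
    name: str
    attributes: list[Attribute] = field(default_factory=list)

@dataclass
class Relationship:
    name: str
    from_entity: str
    to_entity: str
    cardinality: str  # e.g. "1:N" or "M:N"

# The diagram only shows boxes and lines; the model itself is this metadata.
customer = Entity("Customer", [Attribute("customer_id", "INTEGER", is_key=True),
                               Attribute("name", "VARCHAR(100)")])
order = Entity("Order", [Attribute("order_id", "INTEGER", is_key=True)])
places = Relationship("places", "Customer", "Order", "1:N")
print(places)
```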

Third, most of these data modelling tools only provide functionality and do not come with any model content. That is absolutely fine: data modelling is a process of designing intellectual property. There are a few vendors of data model content, such as the IBM industry models, the Teradata model, the SAS data model, etc. For small and medium-sized enterprises, I believe it is quite OK to just take Silverston's books (The Data Model Resource Book, Vols. I, II, and III) and copy parts of the content into the modelling tool.

Fourth, I believe vendors matter a great deal for data modelling tools in large enterprises. One reason is that when your developers work on a model and find something wrong in the modelling tool, a good vendor will be able to provide sufficient consultancy support in time. The other is that many large enterprises tend to tailor the modelling tool to leave their own "mark" on the assets created with it, which requires a close relationship between vendor and user.

So, this is about data modelling tools and vendors. I will write something about useful books, references, and resources on data modelling in the coming post....

Sunday, August 15, 2010

Why do we do logical data modelling?

In my line of work, there has been a long-running debate on the purpose of logical data modelling. There are people who care mostly about the end result, a database design, and choose to focus on the database part and call it "physical data modelling." There are also people who care about both logical and physical data models but are in doubt about whether the two are the same or different.

Well, by opening the few books left on my shelf and checking the Wikipedia page, I think I have some ways to answer these questions.

First, what is a logical data model? I guess the Wikipedia definition below is fair enough for people to understand.

"A logical data model (LDM) in systems engineering is a representation of an organization's data, organized in terms of entities and relationships and is independent of any particular data management technology."

So the principle is that no specific database technology should be involved in logical data modelling. If we look at typical database theory, such as what you find in the standard textbooks, these are the key areas to consider during the logical modelling phase:
  • Entities
  • Attributes
  • Relationships: binary/ternary/n-ary
  • Roles
  • Participation
  • Keys: super keys, candidate keys, primary key
  • Weak entity types
  • Ternary relationships
  • Multi-valued attributes
  • Lossless-join decomposition
  • Functional dependencies
  • Normal forms
  • Boyce-Codd Normal Form (BCNF)
  • Third Normal Form (3NF)
  • Update anomalies
It is important to note that part of the E-R modelling is initiated in the conceptual modelling phase. The logical data modelling activity starts by inspecting the E-R model and deciding how the entities and relationships are further arranged into tables. A small sketch of one of the checks listed above, the functional dependency, follows below.
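
The sketch tests whether a functional dependency holds over some in-memory sample rows; a candidate key is simply the special case where a set of attributes determines every other attribute. The data and names are illustrative only, not taken from any particular textbook.

```python
# A small sketch of checking whether a functional dependency X -> Y holds
# over sample rows represented as dictionaries.
def fd_holds(rows, lhs, rhs):
    """Return True if every combination of lhs values maps to a single rhs value."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        if seen.setdefault(key, value) != value:
            return False
    return True

rows = [
    {"emp_id": 1, "dept": "HR",    "dept_city": "Oslo"},
    {"emp_id": 2, "dept": "HR",    "dept_city": "Oslo"},
    {"emp_id": 3, "dept": "Sales", "dept_city": "Bergen"},
]

print(fd_holds(rows, ["dept"], ["dept_city"]))            # True: dept -> dept_city
print(fd_holds(rows, ["emp_id"], ["dept", "dept_city"]))  # True: emp_id acts as a key here
```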

The term 'Logical Data Model' is sometimes used as a synonym for 'Domain Model' or as an alternative to it. While the two concepts are closely related and have overlapping goals, a domain model focuses more on capturing the concepts of the problem domain than on the structure of the data associated with that domain.

Second, why do we need logical data modelling? I think Wikipedia makes a few good points here, for example, "Helps common understanding of business data elements and requirements" and "Facilitates avoidance of data redundancy and thus prevent data & business transaction inconsistency."

It is quite apparent that logical data modelling provides the foundation for designing the database schema. However, many people choose to ignore this fact by creating the database design directly. In fact, the "logical data model" is already in these people's minds when they create the database tables. Otherwise, how could one say that attribute A and attribute B should be in the same table? How could one determine the unique key of a table?

Another vital part of logical data modelling is the set of decisions that help to reuse and share data. Most people only appreciate this point after the database has been in use for several years (and new requirements arrive at the database design). It is hard to say whether this is a shame, but a lot of enterprises only learn the importance of their data after losing billions re-creating solutions every time the database cannot accommodate the changes.

Third, there have always been debates and discussions about the boundary between logical data modelling and physical database design. When one comes to generic data modelling, where an individual or an organization is modelled generically as an "Involved Party," the decisions about rolling up or down the class hierarchy can be either a logical data model decision or a physical database design decision. It is hard to say who should make the final decision in most enterprises. A better way to handle the situation is to involve both the logical data modellers and the DBAs in the discussion and reach a design that the DBAs agree with. To make the point clear: it is a logical data model decision as long as no specific database technology is involved; the DBAs of a specific database technology can be consulted (so that they feel involved and engaged) to find out whether rework will be needed when the logical design is applied to the physical database design.
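
To make the roll-up/roll-down point concrete, here is a minimal sketch assuming a simplified "Involved Party" with person and organization subtypes; the table and column names are illustrative and not taken from any vendor model.

```python
# A minimal sketch of the roll-up vs roll-down choice for a supertype hierarchy.
import sqlite3

rolled_up = """
-- Rolled up: one table for the supertype, with a discriminator column.
CREATE TABLE involved_party (
    party_id   INTEGER PRIMARY KEY,
    party_type TEXT CHECK (party_type IN ('PERSON', 'ORGANIZATION')),
    name       TEXT,
    birth_date TEXT,          -- only meaningful for persons
    org_number TEXT           -- only meaningful for organizations
);
"""

rolled_down = """
-- Rolled down: one table per subtype, sharing the supertype key.
CREATE TABLE party        (party_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE person       (party_id INTEGER PRIMARY KEY REFERENCES party(party_id),
                           birth_date TEXT);
CREATE TABLE organization (party_id INTEGER PRIMARY KEY REFERENCES party(party_id),
                           org_number TEXT);
"""

# Both are valid designs; choosing between them is the debate described above.
sqlite3.connect(":memory:").executescript(rolled_up)
sqlite3.connect(":memory:").executescript(rolled_down)
```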