Friday, August 27, 2010

Useful resources about data modelling

I would definitely start with Wikipedia: http://en.wikipedia.org/wiki/Data_modeling, which has a lot of good links, and http://www.databaseanswers.org, which is also very useful.


To get a deeper and more detailed understanding of data modelling, there are plenty of tutorials, books, and presentations to choose from. Among these, I found the following quite useful.

  • Len Silverston's (and his team's) work "The Data Model Resource Book" is definitely a must-have for all data modellers. The three volumes are a "make-a-living" tool for many data modellers in the industry.
  • The classic "Data Modeling Essentials" by Graeme Simsion and Graham Witt is another book that I recommend everybody keep on their bookshelf.
  • If you are more on the business side and would like to know a bit more about data modelling, I would recommend Steve Hoberman's book "Data Modeling Made Simple."
  • As we've been talking about BI and data warehousing on this blog, the typical theory of multidimensional modelling, star schemas, and snowflake schemas is well introduced in Kimball's book series. Here I would recommend the dimensional modelling book in his series (a minimal star-schema sketch follows this list).
  • Besides these books for industry users, there exist many database theory books that spend one or two chapters on E-R modelling, normalization, and dimensional modelling. That is fair enough for university students, and I would actually recommend newcomers to the data modelling world to start their reading with these classical theories. (I've introduced two such books in a previous post.)
  • More intellectual work exists as products or services from various companies rather than being described in books or blog posts. The IBM industry data models are well made for several industries. Similarly, Teradata sells its industry data models together with its platform. Even SAS has its own data model to support its well-known BI platform. What one needs is to find an employer who will hire you and train you in these industry products. Right?
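
To make the dimensional-modelling idea from Kimball's books a bit more concrete, here is a minimal sketch of a star schema using Python's built-in sqlite3 module. The table and column names (fact_sales, dim_date, dim_product) are invented for illustration and are not taken from any of the books above.

import sqlite3

# A minimal star schema: one fact table surrounded by dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (
    date_key      INTEGER PRIMARY KEY,
    calendar_date TEXT,
    month         TEXT,
    year          INTEGER
);
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category     TEXT
);
CREATE TABLE fact_sales (
    date_key     INTEGER REFERENCES dim_date(date_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    quantity     INTEGER,
    sales_amount REAL
);
""")

# A typical dimensional query: aggregate the fact table by dimension attributes.
rows = conn.execute("""
    SELECT d.year, p.category, SUM(f.sales_amount)
    FROM fact_sales f
    JOIN dim_date d    ON f.date_key = d.date_key
    JOIN dim_product p ON f.product_key = p.product_key
    GROUP BY d.year, p.category
""").fetchall()

A snowflake schema would simply normalize the dimensions further, for example splitting category out of dim_product into its own table.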

    Friday, August 20, 2010

    Data modelling tools and vendors

It is important to point out that a data modelling tool (I mean a decent and correct tool) should support at least logical and physical modelling. A tool that can only be used to create a database schema or a database design is DEFINITELY not a data modelling tool, but a database development tool.

I found this link very useful: http://www.databaseanswers.org/modelling_tools.htm
It is also kept quite up to date with the most recent modelling tools.

For enterprise users, Erwin, IDA (previously RDA), MS Visio, Sybase PowerDesigner, Data Architect, and the Oracle tools are the most relevant from the list. I believe SAS also has tools to support a certain level of data modelling.

There is no need to compare these tools here, but there are a few things to keep in mind if you are evaluating data modelling tools.

First, a data modelling tool should be able to support different types of models, such as conceptual, logical, and physical models. Typical database theory says that one should start with conceptual modelling and then move down to logical models and on to physical data models.

Second, UML modelling or diagramming is not data modelling. Creating UML elements and showing some diagrams (Visio is a good example here) is not modelling but sketching. Data modelling requires one to create entities, relationships, attributes, keys, etc. (I listed them in a previous note), and the right data modelling tool should keep that metadata in the data models. Creating diagrams is only a small part of data modelling. When the data model gets complicated, it is impossible to show all the details in a diagram; that is when you need the modelling tool to keep the design details.
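
To make the "metadata, not diagrams" point a bit more concrete, here is a minimal sketch, assuming nothing about any particular tool: the entity, attribute, and key definitions are held as data, and a physical CREATE TABLE statement is derived from them. All names (Customer, customer_id, and so on) are invented for illustration.

# Minimal sketch: a logical entity held as metadata, from which a
# physical DDL statement can be derived. All names are illustrative.
logical_entity = {
    "entity": "Customer",
    "attributes": [
        {"name": "customer_id", "type": "INTEGER", "primary_key": True},
        {"name": "full_name",   "type": "TEXT",    "primary_key": False},
        {"name": "birth_date",  "type": "TEXT",    "primary_key": False},
    ],
}

def to_physical_ddl(entity):
    """Render a simple CREATE TABLE statement from the entity metadata."""
    columns = []
    for attr in entity["attributes"]:
        column = f'{attr["name"]} {attr["type"]}'
        if attr["primary_key"]:
            column += " PRIMARY KEY"
        columns.append(column)
    return (f'CREATE TABLE {entity["entity"]} (\n    '
            + ",\n    ".join(columns) + "\n);")

print(to_physical_ddl(logical_entity))

A real modelling tool of course keeps far richer metadata (relationships, domains, definitions, naming standards), but the point is that the model is the metadata, not the picture.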

Third, most of these data modelling tools only provide functionality and do not come with any model content. This is absolutely OK. Data modelling is a process of designing intellectual property. There are a few vendors of data model content, such as the IBM industry models, the Teradata model, the SAS data model, etc. For small and medium-sized enterprises, I believe it is quite OK to just take Silverston's books (The Data Model Resource Book, Vol. I, II, and III) and copy some of the content into the modelling tool.

Fourth, I believe that vendors are quite important for data modelling tools in large enterprises. One thing is that when the developers at your company work on a model and find something wrong in the modelling tool, a good vendor will be able to provide sufficient consultancy support in time. The other thing is that many large enterprises tend to tailor the modelling tool so that it leaves their own "scar" on the assets created by the tool. This requires a close relationship between vendor and user.

    So, this is about data modelling tools and vendors. I will write something about useful books, references and resources on data modelling in the coming post....

    Sunday, August 15, 2010

    Why do we do logical data modelling?

In the domain of my work, there has been a long-running debate on the purpose of logical data modelling. There are people who care more about the end result of a database design, choose to focus on the database part, and call it "physical data modelling." There are also people who care about both logical and physical data models but are in doubt about whether the two are the same or different.

Well, by opening a few of the books left on my shelf and checking the Wikipedia article, I think I have some ways to answer these questions.

First, what is a logical data model? I think the Wikipedia definition below is fair enough for people to understand.

    "A logical data model (LDM) in systems engineering is a representation of an organization's data, organized in terms entities and relationships and is independent of any particular data management technology."

So, the principle is that no specific database technology should be involved in the logical data modelling part. If we look at typical database theory, such as what you can get from these books, here are the key areas that we look at during the logical modelling phase.
    • Entities
    • Attributes
    • Relationships: binary/ternary/n-ary
    • Roles
    • Participation
    • Keys: super keys, candidate keys, primary key
    • Weak Entity Types
    • Ternary relationships
    • Multi-valued Attributes
    • Lossless-Join Decomposition
    • Functional Dependency
    • Normal Forms
    • Boyce-Codd Normal Form (BCNF)
    • 3rd Normal Form (3NF)
    • Update anomalies
It is important to note that part of the E-R modelling is initiated in the conceptual modelling phase. The logical data modelling activity starts by inspecting the E-R model and deciding how the entities and relationships are further arranged into tables.
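
As a small illustration of arranging entities and relationships into tables, here is a sketch with invented names (Student, Course, Enrollment), not taken from any particular methodology: a many-to-many relationship typically becomes an associative table whose primary key combines the keys of the two participating entities.

import sqlite3

# Two entities and a many-to-many relationship between them. The
# relationship is arranged into its own table (Enrollment) whose
# composite primary key is made of the two entities' keys.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Student (
    student_id INTEGER PRIMARY KEY,
    name       TEXT
);
CREATE TABLE Course (
    course_id INTEGER PRIMARY KEY,
    title     TEXT
);
CREATE TABLE Enrollment (
    student_id  INTEGER REFERENCES Student(student_id),
    course_id   INTEGER REFERENCES Course(course_id),
    enrolled_on TEXT,
    PRIMARY KEY (student_id, course_id)
);
""")

One-to-many relationships, by contrast, are usually carried as a foreign key on the "many" side rather than as a separate table.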

    The term 'Logical Data Model' is sometimes used as a synonym of 'Domain Model' or as an alternative to the domain model. While the two concepts are closely related, and have overlapping goals, a domain model is more focused on capturing the concepts in the problem domain rather than the structure of the data associated with that domain.

Second, why do we need logical data modelling? I think Wikipedia has a few good points here. For example, "Helps common understanding of business data elements and requirements" and "Facilitates avoidance of data redundancy and thus prevent data & business transaction inconsistency."

It is quite apparent that logical data modelling provides the foundation for designing the database schema. However, many people choose to ignore this fact by creating the database design directly. In fact, the "logical data model" is already inside these people's minds when they are creating the database tables. Otherwise, how could one say that attribute A and attribute B should be in the same table? How could one determine the unique key of a table? (A small sketch of this kind of reasoning follows below.)
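
Here is a minimal sketch of that reasoning, with invented attributes and dependencies rather than a real model: given a set of functional dependencies, one can compute the closure of a set of attributes and check whether it determines every attribute, i.e. whether it is a candidate key for the table.

# Sketch: compute an attribute-set closure under functional dependencies
# and use it to check whether a set of attributes is a candidate key.
# The attributes and dependencies below are invented for illustration.
def closure(attrs, fds):
    """Return the closure of attrs; fds is a list of (lhs, rhs) set pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

all_attrs = {"order_id", "line_no", "product_id", "quantity", "unit_price"}
fds = [
    ({"order_id", "line_no"}, {"product_id", "quantity"}),
    ({"product_id"}, {"unit_price"}),
]

# {order_id, line_no} determines every attribute, so it is a candidate key;
# {order_id} alone does not.
print(closure({"order_id", "line_no"}, fds) == all_attrs)  # True
print(closure({"order_id"}, fds) == all_attrs)             # False

The same functional dependencies are what drive normalization: unit_price depends only on product_id, so keeping it in the order-line table would violate 3NF.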

Another vital part of logical data modelling is the set of decisions that help to reuse and share data. Most people only understand this point after the database has been in use for several years (and new requirements arrive for the database design). It is hard to say whether this is a shame, but a lot of enterprises only learn the importance of data after losing billions re-creating solutions every time the database cannot accommodate the changes.

Third, there have always been debates and discussions about the boundary between logical data modelling and physical database design. When one comes to generic data modelling, where an individual or an organization is generally considered an "Involved Party," the decisions about rolling up or down in the class hierarchy can be either a logical data model decision or a physical database design decision. It is hard to say who should make the final decision in most enterprises. However, a better way to handle this situation is to involve both the logical data modellers and the DBAs in the discussion and reach a design that the DBAs agree with. To make this point clear: it is a logical data model decision as long as no specific database technology is involved. The DBAs of a specific database technology can be consulted (so that they feel involved and engaged) to find out whether rework has to be done when the logical design is applied to the physical database design. (A small sketch of the two roll-up/roll-down alternatives follows below.)
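
As a minimal sketch of those two alternatives, with invented table names rather than anything taken from a vendor model: the generic "Involved Party" supertype can either be rolled up into a single table with a type discriminator, or rolled down into separate subtype tables that share the supertype's key.

import sqlite3

# Rolled up: one table, a type discriminator, and nullable subtype columns.
ROLLED_UP = """
CREATE TABLE involved_party (
    party_id    INTEGER PRIMARY KEY,
    party_type  TEXT CHECK (party_type IN ('PERSON', 'ORGANIZATION')),
    person_name TEXT,   -- populated only when party_type = 'PERSON'
    org_name    TEXT    -- populated only when party_type = 'ORGANIZATION'
);
"""

# Rolled down: one table per subtype sharing the supertype's key.
ROLLED_DOWN = """
CREATE TABLE involved_party (
    party_id INTEGER PRIMARY KEY
);
CREATE TABLE person (
    party_id    INTEGER PRIMARY KEY REFERENCES involved_party(party_id),
    person_name TEXT
);
CREATE TABLE organization (
    party_id INTEGER PRIMARY KEY REFERENCES involved_party(party_id),
    org_name TEXT
);
"""

# Each alternative is valid DDL; create each in its own in-memory database.
sqlite3.connect(":memory:").executescript(ROLLED_UP)
sqlite3.connect(":memory:").executescript(ROLLED_DOWN)

Nothing in either variant depends on a specific database product, which is why the choice itself belongs to the logical model, even though the DBAs will care deeply about its physical consequences.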

    Wednesday, August 11, 2010

    Good article on Eclipse Development Projects

    I've found a very decent list of excellent Eclipse development projects from this link at eweek.com.
    To make it easy to read, I've extracted part of the content here.

    Eclipse Modeling Framework (EMF)
Eclipse is huge in the modeling community. EMF is the core framework and code generation facility that allows developers to create applications based on a structured data model.
    Xtext
Xtext is a relatively new project but is quickly becoming very popular for creating domain-specific languages. With Xtext you can easily create your own programming languages and domain-specific languages (DSLs). The framework supports the development of language infrastructures including compilers and interpreters as well as full-blown Eclipse-based IDE integration.

    Jetty
Jetty is an open-source project providing an HTTP server, HTTP client and javax.servlet container. Jetty is a very popular Web server and servlet container and is often found embedded in applications such as the Yahoo Hadoop Cluster, Google App Engine and Zimbra. Jetty also provides support for WebSockets, OSGi, JMX, JNDI, JASPI, AJP and many other integrations.

    CDT
The CDT Project provides a fully functional C and C++ Integrated Development Environment based on the Eclipse platform. CDT is now the de facto C/C++ IDE in the non-Microsoft world. Most embedded vendors and Linux distros use CDT as their C/C++ IDE.

    PDT Eclipse PHP Development Tools
The PDT project provides a PHP Development Tools framework for the Eclipse platform. This project encompasses all development components necessary to develop PHP and facilitate extensibility. It leverages the existing Web Tools Platform (WTP) and Dynamic Languages Toolkit (DLTK) to provide developers with PHP capabilities. PDT has quickly become one of the more popular IDEs in the Eclipse community.

    Mylyn Framework
    Mylyn is the task and application lifecycle management (ALM) framework for Eclipse. Over the last three years Mylyn has become the hub or integration point for many of the Agile ALM vendors. Mylyn has over 45 different connectors that make it possible to link different ALM tools to its unique task perspective.

    BIRT—The Business Intelligence and Reporting Tools
    BIRT is an open-source, Eclipse-based reporting system that integrates with your Java/J2EE application to produce compelling reports. BIRT provides core reporting features such as report layout, data access and scripting. BIRT has become a popular reporting solution for Java developers.

    Web Tools/Java EE Tools/Eclipse Java Development Tools (JDT)
Eclipse continues to be the standard for Java developers. If you are creating Java applications, chances are you are using some combination of the JDT and the Web Tools or Java EE Tools projects.

    Eclipse Plug-in Development Environment (PDE)
    The Plug-in Development Environment (PDE) provides tools to create, develop, test, debug, build and deploy Eclipse plug-ins, fragments, features, update sites and RCP products. PDE also provides comprehensive OSGi tooling, which makes it an ideal environment for component programming, not just Eclipse plug-in development.

    eGit Version Control
    The rest of this list highlights up and coming projects that have become popular with developers. One of them is eGit, which is an Eclipse Team provider for the Git version control system. Git is a distributed SCM, which means every developer has a full copy of all history of every revision of the code, making queries against the history very fast and versatile. The eGit project is implementing Eclipse tooling on top of the JGit Java implementation of Git. Git is becoming a very popular source code management system. eGit is a new Eclipse project to provide tight integration between Eclipse and Git.

    Gemini
    The Enterprise Modules Project—Gemini—is all about modular implementations of Java EE technology. It provides the ability for users to consume individual modules as needed, without requiring unnecessary additional runtime pieces. Gemini implements many of the OSGi Enterprise Specifications.

    Memory Analyzer (MAT)
    The Eclipse Memory Analyzer is a fast and feature-rich Java heap analyzer that helps developers find memory leaks and reduce memory consumption. Memory Analyzer is becoming a very popular tool with Java developers.

    Connected Data Objects (CDO)
    CDO is both a technology for distributed shared EMF models and a fast server-based object-relational (O/R) mapping solution. With CDO you can easily enhance your existing models in such a way that saving a resource transparently commits the applied changes to a relational database. CDO is a model repository for EMF models. It provides the scalability and transactional capabilities required to use EMF for large scale applications. CDO has a 3-tier architecture supporting EMF-based client applications, featuring a central model repository server and leveraging different types of pluggable data storage back-ends like relational databases, object databases and file systems.

    Eclipse Device Software Development Platform (DSDP) Project
    The Eclipse Device Software Development Platform (DSDP) Project is an open source collaborative software development project dedicated to providing an extensible, standards-based platform to address a broad range of needs in the device software development space using the Eclipse platform. DSDP is a top-level container project that includes several independent technology sub-projects focused on the embedded and mobile space. Sub-projects under the DSDP include Blinki, Device Debugging, Mobile Tools for Java, Native Application Builder, Real-Time Software Components (RTSC), Sequoyah, Target Management, and Tools for Mobile Linux.

    JavaScript Development Tools
    The JavaScript Development Tools provide plug-ins that implement an IDE supporting the development of JavaScript applications and JavaScript within web applications. It adds a JavaScript project type and perspective to the Eclipse Workbench as well as a number of views, editors, wizards, and builders.

    Eclipse Marketplace
The Eclipse Marketplace offers the Eclipse community a convenient portal to help users find open-source and commercial Eclipse-related offerings. The new Marketplace client makes it easier for users to download and install tools from Instantiations and others.

    I think these books about Eclipse are very useful....

    Sunday, August 8, 2010

    Agile BI in my view (1)

TDWI has had a few good discussions on Agile BI recently, such as the following.

    Reflection on an Agile BI program

    An Imperative to Build, Not Buy, Agile BI

While most of these articles focus on establishing processes, guidelines or toolsets to ensure Agile BI success, I have a somewhat different point of view.

If one looks at the current HR setup in the IT branches of large enterprises, the key to ensuring agility in BI is having qualified people to do the good work. In most cases this work is done by contractors such as external consultants, outsourcing partners, or experts who will be head-hunted to a better-paying job in 1-2 years.

Even if the right processes, tools and guidelines are available, the more important part is that the developers have the awareness, competency, and willingness to follow agile development.

Awareness means that the developers (IT and business) know about the agile process and know how to follow it. In large organizations, it sometimes takes more time to learn how the process works than to execute it.

Competency is always an issue for managers in large enterprises. Employees who are eager to learn and improve their skills are normally looking for new challenges most of the time, and once the project is done it is hard to keep this competency inside the organization. In order to do agile BI, it is very important that the developers have a good understanding of the toolset. Otherwise, the first 3-5 sprints will be used just to train the developers. Do we still consider such training "agile development?" That is one of the reasons that a lot of companies use external consultants.

By the way, using a single toolset such as the MS BI tools, SAS tools or Cognos tools seems to be much better than using different tools from different vendors at the same time. There are two reasons: (i) it is impossible for your developers to have knowledge of all these tools; (ii) the communication between these tools comes at a potentially large cost.

Willingness is an interesting issue. Sometimes people may not follow the process even if they know the process and have the right competency. Think about a team with a hybrid structure: employees who have been working with you for 20 years (and may know COBOL very well), external consultants from company A with the latest BI tool knowledge, external consultants from company B with knowledge of another BI toolset, and developers from outsourcing partners. It is hard to imagine that all the team members will work together perfectly in such a situation.

Another important issue for agile BI is communication: communication inside the sprint team and with the outside world.

    By the way, I think these books are very useful when you need to learn about agile methods and BI/DW.