2016-09-24

This blog post is focused at developers and software architects. I’m probably not writing at the right place. You’re on an infrastructure experts blog and the author is an Oracle DBA. So what can you learn from someone working on that 30 years old technology talking about that old SQL language ? You run with modern languages, powerful frameworks, multi-layer architecture, micro-services, distributed database and of course all open-source. You hate your DBA because he is the major slow-down for your agile development. You don’t want SQL. You don’t want databases. You don’t want DBA.

How can I encourage you to read this blog post? I was not always an DBA. I started as a developer, more than 20 years ago. And believe me, it was not prehistory at all. Object-Oriented design, Rapid Application Development, Automatic programming (remember C.A.S.E.?), visual programming (have you ever seen an IDE like IBM Visual Age?), query generation (early days of Business-Objects). All these evolved with more and more languages, frameworks, layers, micro-services, XML, SOA, JSON, REST,… but only one technology remained: the critial persistent data is still in a relational database and accessed by SQL.

What is a database

Most of developers think that a database is there to store and retrieve data. I’m sorry but that’s wrong. That may have been right a long time ago, with key-value storage and hierarchical databases, but that’s too old for me. When I started to work, databases were already doing far more than that. Let me explain. With those prehistoric databases, you retrieved data in the same way you stored it. You insert with a key, you fetch with that key. It is easy to explain to modern developers because they “invented” it few years ago, calling it CRUD (Create Read Update Delete). First argument of those CRUD methods is a key value. And you store unformatted data as XML or JSON associated to that value. If this is the only feature that you need, then for sure you don’t want a database.

Relational database management systems (RDBMS) are doing a lot more than that. First, you can query data in a completely different way than you inserted it. And this is the real life-cycle of data. For example, You take orders, one by one, with customer and product information for each of them. Of course you can update and read it with the order ID that has been generated, but that’s only a small use case and probably not the most critical. Warehouse users will query (and update) orders by product. Delivery users will query (and update) orders by customer. Marketing users will query by channels and many other dimensions. Finance will join with accounting. With a persistence only system, you have to code all that. Of course if you declared the mapping of associations, you can navigate through them. But the user requirement is to get a set of orders, or a quantity of products in stock, or a subset of customers, which is different from navigating through orders one by one. With a database, you dont need to code anything. With a proper data model what you inserted can be manipulated without its key value. All data that you have inserted can be accessed from any different point of view. And you don’t have to code anything for that. Imagine a Data Access Object with ‘QueryBy methods covering any combination of columns and operators.

A database system does not only store data, it processes data and provide a service to manipulate data.

SQL

SQL is not a language to code how to get the information. SQL only describes what you want. It’s a question you ask to a data service. Same idea as Uber where you enter your destination and desired car and the service manages everything for you: the path, the communication, the paiement, the security. You may not like the SQL syntax, but it can be generated. I’m not talking about generating CRUD statements here, but generating SQL syntax from a SQL semantic expressed in Java or example. There’s a very good example for that: jOOQ (look at the exemples there).

I understand that you can hate SQL for it’s syntax. SQL was build for pre-compilers, not for execution time parsing of text, and I’ll come back on that later with static SQL. But you can’t say that SQL semantic is not modern. It’s a 4th generation language that saves all the procedural coding you have to do with 3rd generation languages. SQL is a declarative language build on a mathematics theory. It goes far beyond the for() loops and if()else.

In SQL you describe the result that you want. How to retrieve the data is done by the database system. The optimizer builds the procedural code (know as the execution plan) and the execution engine takes care of everything (concurrency, maintaining redundant structures for performance, caching, multithreading, monitoring, debugging, etc). Do you really want to code all that or do you prefer to rely on a data service that does everything for you?

You know why developers don’t like SQL? Because SQL has not been designed for programmers. It was for users. The goal was that a non-programmer can ask its question to the system (such as “give me the country of the top customers having bought a specific product in last 3 months”) without the need of a developer. There was no GUI at that time, only Command Line Interface, and SQL was the User Friendly Interface to the database. Today we have GUIs and we don’t need SQL. But it is there so programmers build tools or framework to generate SQL from a programming language. Sure it is ridiculous and it would be better to have a programming language that directly calls the SQL semantic without generating plain old English text. We need a Structured Query Language (SQL) we just don’t need it to be in English.

Set vs loops

So why do people prefer to code everything in procedural language (3GL)? Because this is only what they learned. If at school you learned only loops and comparisons, then you are going to access data in loops. If you learned to think about data as sets, then you don’t need loops. Unfortunately, the set concepts are teached in mathematics classes but not in IT.

Imagine you have to print “Hello World” 5 times. Which pseudo-code so you prefer?

print("Hello World\n")

print("Hello World\n")

print("Hello World\n")

print("Hello World\n")

print("Hello World\n")

or

print ( "Hello World\n" + "Hello World\n" + "Hello World\n" + "Hello World\n" + "Hello World\n" )

I’ve put that in pseudo-code. I don’t want to play with String and StringBuffer here. But the idea is only to explain that if you have to process a set of things it is more efficient to process them as a set rather than one-by-one. That works for everything. And this is where databases rocks: they process sets of rows. If you have to increment the column N by one in every row of your table T, you don’t need to start a loop and increment the column row-by-row. Just ask your RDBMS data service to do it: ‘/* PLEASE */ UPDATE T set N=N+1′. The “please” is in comment because everything that is not there to describe the result is not part of SQL. You have to use hints to force the way to do it, but they are written as comments because SQL do not allow any way to tell how to do it. This was a joke of course, the “please” is not mandatory because we are talking to a machine.

ACID

I’m not sure you get all the magic that is behind:

UPDATE T set N=N+1;

it’s not a simple loop as:

for each row in T

set N=N+1

The RDBMS does more than that. Imagine that there is a unique index on the column N. How many lines of code do you need to do that N=N+1 row by row and be sure that at any point you don’t have duplicates? Imagine that after updating half of the rows you encounter someone else currently updating the same row. You have to wait for his commit. But then, if he updated the value of N, do you increment the past value or the new one? You can’t increment the old one or his modification will be lost. But if you increment the new one, your previous incremented rows are inconsistent because they were based on a previous state of data.

I was talking about having an index. You have to maintain this index as well. You have to be sure that what is in cache is consistent with what is in disk. That modifications made in the cache will not be lost in case of server failure. And if you run in a cluster, those caches must be synchronized.

Coding the same as this “UPDATE T set N=N+1″ in a procedural language is not easy and can become very complex in a multi-user environment.

Maybe you have all the tools you need to generate that code. But if you code it you have to test it. Are your tests covering all concurrency cases (sessions updating or reading same rows, or different rows from same table,…). What is already coded within the database has already been tested. It’s a service and you just have to use it.

Static SQL

I claimed above that SQL is there to be pre-compiled. Yes, SQL is witten in plain text, like most of programming languages, and must be parsed, optimized, compiled. It’s not only for performance. The main reason is that you prefer to get errors at compile time than at runtime. If you put SQL in text strings in your code it will remain text until execution time when it will be prepared. And only then you will get errors. The second reason is that when the SQL is parsed, it is easy to find the dependencies. Want to see all SQL statements touching to a specific column? Do you prefer to do guess on some text search or to methodically follow dependencies?

Yes, SQL is there to be static and not dynamic. That claim may look strange for an Oracle DBA because all statements are dynamic in Oracle. Even at the time of precompilers (such as Pro*C) the statements were parsed but were put as text in the binary. And at first execution, they are parsed again and optimized. If you want the execution plan to be defined at deployment time, you have to use Outlines or SQL Plan Baselines. There is no direct way to bind the execution plan at deployment time in Oracle. In my opinion the static SQL as it is known on DB2 for example is really missing in Oracle. OLTP Software Vendors would love to ship the optimized execution plans with their application. Imagine that all SQL statements in an OLTP application are parsed and optimized, compiled as bound procedures, similar to stored procedures, with procedural access (the execution plan) and you just have to call them. For reporting, DSS, BI you need the plans to adapt to the values and volume of data, but for OLTP you need stability. And from the application, you just call those static SQL like a data service.

Talking about procedural execution stored in the database, I’m coming to stored procedures and PL/SQL of course.

Stored Procedures

When you code in your 3GL language, do you have functions that update global variables (BASIC is the first language I learned and this was the idea) or do you define classes which encapsulate the function and the data definition? The revolution of Object Oriented concepts was to put data and logic at the same place. It’s better for code maintainability with direct dependency procedural code and data structures. It’s better for security because data is accessible only through provided methods. And it’s better for performance because procedural code access data at the same place.

Yes Object Oriented design rocks and this why you need to put business logic in the database. Putting the data on one component and running the code on another component of an information system is the worst you can do. Exactly as if in your Object Oriented application you store the object attributes on one node and run the methods on another one. And this is exactly what you do with the business logic outside of the database. Your DAO objects do not hold the data. The database does. Your objects can hold only a copy of the data, but the master copy where are managed concurrency management, high availability and persistance is in the database.

We will talk about the language later, this is only about the fact that the procedural code run in the same machine and the same processes than the data access.

There are a lot of myths about running business logic in the database. Most of them come from ignorance. Until last Monday I believed that one argument against running business logic in the database was unbeatable: You pay Oracle licences on the number of CPU, so you don’t want to use the database CPUs to run something that can run on a free server. I agreed with that without testing it, and this is where myths come from.

But Toon Koppelaars has tested it and he proved that you use more database CPU when you put the business logic outside of the database. I hope his presentation from Oak Table World 2016 will be available soon. He proved that by analyzing exactly what is running in the database, using linux perf and flame graphs: https://twitter.com/ChrisAntognini/status/778273744242352128

All those rountrips from remote compute server, all those row-by-row processing coming from that design have an higher footprint on the database CPUs that directly running the same on the database server.

PL/SQL

Running business logic on the database server can be done with any language. You can create stored procedures in Java. You can code external procedures in C. But those languages have not been designed for data manipulation. It is sufficient to call SQL statements but not when you need procedural data access. PL/SQL is a language made for data processing. It’s not only for stored procedure. But it’s the only language that is coupled with your data structure. As I said above, it’s better to think in sets with SQL. But it may be sometimes complex. With PL/SQL you have a procedural language which is intermediate between row-by-row and sets because it has some bulk processing capabilities.

In pseudo-code the Hello World above is something like that:

forall s in ["Hello World\n","Hello World\n","Hello World\n","Hello World\n","Hello World\n"] print(s)

It looks like a loop but it is not. The whole array is passed to the print() function and loop is done at lower level.

In PL/SQL you can also use pipeline functions where rows are processed with a procedural language but as a data stream (like SQL does) rather than loops and calls.

I’ll go to other advantages of PL/SQL stored procedures but here again there is one reason frequently raised to refuse PL/SQL. You can find more developers in .Net or Java than in PL/SQL. And because they are rare, they are more expensive. But there is a counter argument I heard this week at Oracle Open World (but I don’t remember who raised that point unfortunately). PL/SQL is easy to learn. Really easy. You have begin – exception – end blocks, you declare all variables, you can be modular with procedures and inline procedures, you separate signature and body, you have very good IDE, excellent debugger and easy profiler,… and you can run it on Oracle XE which is free. So, if you have a good Java developer he can write efficient PL/SQL in a few days. By good developer, I mean someone who understands multi-user concurrency problems, bulk processing, multi-threading, etc.

There are less PL/SQL developers than Java developers because you don’t use PL/SQL. It’s not the opposite. If you use PL/SQL you will find developers and there are many software vendors that code their application in PL/SQL. Of course PL/SQL is not free (except in Oracle XE) but it runs on all platforms and on all editions.

Continuous Integration and Deployment, dependency and versioning

I come back quickly to the advantages of using a language that is coupled with your data.

PL/SQL stored procedures are compiled and all dependencies are stored. With one query on DBA_DEPENDENCIES you can know which tables your procedure is using and which procedures use a specific table. If you change the data model, the procedures that have to be changed are immediately invalidated. I don’t know any other language that does that. You don’t want to break the continuous integration build every time you change something in a table structure? Then go to PL/SQL.

Let’s go beyond continuous integration. How do you manage database changes in continuous deployment? Do you know that with PL/SQL you can modify your data model online, with your application running and without breaking anything? I said above that procedures impacted by the change are invalidated and the must be adapted to be able to be compiled. But this is only for the new version. You can deploy a new version of those procedures while the previous version is running. You can test this new version and only when everything is ok you switch the application to the new version. The feature is called Edition Based Redefinition (EBR) it exists since 11g in all Oracle editions. It’s not known and used enough, but all people I know that use it are very happy with it.

In development environment and continuous integration, it is common to say that the database always cause problem. Yes it is true but it’s not inherent to the database but the data. Data is shared and durable and this is what makes it complex. The code can be deployed in different places, and can be re-deployed if lost. Data can be updated at only one place and visible to all users. Upgrading to a new version of application is easy: you stop the old version and start the new version. For data it is different: you cannot start from scratch and you must keep and upgrade the previous data.

Object-Relational impedance

I’m probably going too far in this blog post but the fact that data is shared and durable is the main reason why we cannot apply same concepts to data objects (business objects) and presentation objects (GUI components). Application objects are transient. When you restart the application, you create other objects. The identity of those objects is an address in memory: it’s different on other systems and it’s different once application is restarted. Business objects are different. When you manipulate a business entity, it must have the same identity for any users, and this identity do not change when application is restarted, not even when application is upgraded. All other points given as “object-relational impedance” are minor. But the sharing and durability of business object identity is the reason why you have to think differently.

Where to put business logic?

If you’re still there, you’ve probably understood that it makes sense to run the data logic in the database, with declarative SQL or procedural PL/SQL stored procedures, working in sets or at least in bulk, and with static SQL as much as possible, and versioned by EBR.

Where to put business logic then? Well, business logic is data logic for most of it. But you’ve always learned that business logic must be in the application tier. Rather than taking reasons given one by one and explain what’s wrong with them, let me tell you how came this idea of business logic outside of the database. The idea came from my generation: the awesome idea of client/server.

At first, data was processed on the servers and only the presentation layer was on the user side (for example ISAM was very similar with what we do with thin web pages). And this worked very well, but it was only green text terminals. Then came PCs and Windows 3.11 and we wanted graphical applications. So we built applications on our PCs. But that was so easy that we implemented all business logic there. Not because it’s a better architecture, but because anyone can build his application without asking to the sysops. This was heaven for developers and a nightmare for operations to deploy those applications on all the enterprise PCs.

But this is where offloading business logic started. Application written with nice IDEs (I did this with Borland Paradox and Delphi) connecting directly to the database with SQL. Because application was de-correlated from the database everything was possible. We even wanted to have applications agnostic of the database, running in any RDBMS. Using standard SQL and standard ODBC. Even better: full flexibility for the developer by using only one table with Entity-Value-Attribute.

Actually, the worst design anti-patterns have been invented at that time and we still see them in current applications – totally unscalable.

When finally the deployment of those client/server applications became a nightmare, and because internet was coming with http, html, java, etc. we went to 3-tier design. Unfortunately, the business logic remained offloaded in the application server instead of being part again of the database server.

I mentioned ODBC and it was another contributor to that confusion. ODBC looks like a logical separation of the application layer and the database layer. But that’s wrong. ODBC is not a protocol. ODBC is an API. ODBC do not offer a service: it is a driver running on both layers and that magically communicates through network: code to process data on one side and data begin on the other.

A real data service encapsulates many SQL statements and some procedural code. And it is exactly the purpose of stored procedures. This is how all data applications were designed before that client/server orgy and this is how they should be designed today when we focus on centralization and as micro-services applications.

So what?

This blog post is already too long. It comes from 20 years experience as developer, application DBA, and operation DBA. I decided to write this when coming back from the Oracle Open World where several people are still advocating for the right design, especially Toon Koppelaars about Thick Database at Oak Table World and the amazing panel about “Thinking clearly about application architecture” with Toon Koppelaars, Bryn Llewellyn, Gerald Venzl, Cary Millsap, Connor McDonald

The dream of every software architect should be to attend that panel w/ @ToonKoppelaars @BrynLite @GeraldVenzl @CaryMillsap @connor_mc_d pic.twitter.com/npLzpnktMK

— Franck Pachot (@FranckPachot) September 22, 2016

Beside the marketing stuff, I was really impressed by the technical content around the Oracle Database this year at OOW16.

Cet article Modern software architecture – what is a database? est apparu en premier sur Blog dbi services.

Show more