Sometimes you need to perform some heavy database operations. I don't know why very often programmers are afraid of using databases for that. They usually have some fancy ORM which performs all the operations, and the only way to change the data is to make some SELECT * from a table, create a bunch of unneeded objects, change one field, convert those changed objects into queries and send that to the database.
Have you ever thought about the cost of the roundtrip of data? The cost of getting all the data from database just to send changed data into the database? Why do that if there would be much faster way of achieving the same results?
Imagine that you have quite a heavy operation. Let's make something which normally databases cannot do, some more complicated operation. Many programmers just don't know that there is any other way than writing this in the application code. Let's change all the HTML entities into real characters.
The HTML entities are a way of writing many different characters in HTML. This way you can write for instance the Euro currency sign "€" in HTML even if you don't have it on your keyboard. You just have to write € or € instead. I don't have to, as when I use UTF-8 encoding and write this character directly, it should be showed normally. What's more I have this character on my keyboard.
I will convert the text stored in database changing all the htmlentities into real unicode characters. I will do it using three different methods.
The first will be a simple query run inside PostgreSQL
The second will be an external program which downloads the text column from database, changes it externally and loads into database.
The third method will be almost the same as the second, however it will download whole rows.
Generate Data
So, for this test I need to have some data. Let's write a simple data generator.
First, a simple function for returning a random number within the given range.
Now the function for generating random texts of random length filled with the HTML entities.
The table for storing the data is created with the following query:
Then I filled this table using a query generating 50k rows with random data:
Let's check the table size:
As the table is filled with random data, I need to have two tables with exactly the same data.
The Tests
SQL
Many programmers think that such operations are not normally available inside a database. However PostgreSQL has quite a nice feature, it can execute functions written in many different languages. For the purpose of this test I will use the language pl/perlu which allows me to use external libraries. I will also use HTML::Entities package for the conversion.
The function I wrote is quite simple:
The update of the data can be done using the following query:
Application
In order to have those tests comparable, I will write a simple perl script using exactly the same package for converting html entities.
The Worst Application
There is another terrible idea implemented by programmers too often. Why select only the column you want to change? Let's select all the rows and send them back to database.
This script will look like this (the important changes are in lines 17 and 23)
Results.
The query using pl/perlu function executed in 26 seconds.
The script changing data externally execuded in 2 minutes 10 seconds (5 times slower)
The worst script getting and resending whole rows finished in 4 minutes 35 seconds (10 times slower).
I used quite a small number of rows. There were just 50k rows (about 200MB). On production servers the numbers are much bigger.
Just imagine that the code you developed for changing data could run 10 times faster if you'd do this in the database.