2016-03-23

Introduction

With recent versions of Oracle, there is more and more emphasis on handling unstructured text, as well as the traditional fully structured data such as numbers, dates and short text fields.

Here I want to consider various methods of getting "documents" into the Oracle database.

None of these methods are completely trivial, and the choice of which to use will depend on various factors.

What are the characteristics of a "document"? In general, I'm referring to text which may be too long to fit into a PL/SQL VARCHAR2 variable. These variables are limited to 32767 bytes. The document most likely originates from a file on a file system, and it may be in a character set which differs from the database character set. In many cases, documents will have metadata (such as author and creation date) that we want to load at the same time.

We should distinguish between four types of data:

Metadata. This is short textual, numeric or date data which describes the main document. For example: document id, author, date created.

Semi-structured Data. This is textual strings in a specific format, such as JSON or XML. We may be able to make assertions about such data, such as "will always start with '<'", or "does not contain any newline characters".

Full Text.  This is text of arbitrary length. It can and usually will contain all sorts of characters including newlines and quotes. It may consist of multibyte characters in UTF-8 or UTF-16 (or other) character sets.

Binary Data. Arbitrary length data which is not character encoded. May contain any bytes or combinations of bytes including nulls and end-of-file markers.

Loading a mixture of these types of data can be problematic. If we want to load metadata and full text from a single file, then we need some method of determining where the full text starts and finishes. The easiest way to do that is to enclose the full text in some sort of quotes. But then what if the full text contains those quote characters?  So it's usually easier to arrange for each full text document to exist in a separate file, referenced from the main loading method.

It would be impossible to load multiple binary data records from a single file, unless the binary data was first encoded using a scheme such as BASE64, since by its nature you can never specify a delimiter string that could not appear in the binary data.

For these examples, we're going to consider three use-cases:

A set of JSON documents. These documents reside in a single file, one per line.

A set of files, which contain one document per file. Each file/document has metadata associated with it in a "summary" file.

As above, but one of the files contains multibyte (UTF-8) data.

Client / Server considerations - where is your source data?

If your client and server are the same machine, then any of these methods should work. However, if you are connecting to your database from a remote client over SQL*Net, you might first decide whether you want to load files from the client (the machine from which you connect) or the server (the machine where the database resides).

Load Method        Data on Client    Data on Server

SQL*Plus           Yes               No
SQL*Loader         Yes               No
External Tables    No                Yes
BFILES             No                Yes

Method 1: Using SQL*Plus Scripts

Let's consider we have a table with several short columns and a CLOB column:
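A minimal definition might look like this (the table and column names here are my own, chosen to match the examples that follow):

create table my_table (
  id        number,
  author    varchar2(30),
  created   date,
  jsontext  clob
);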

If I just want to do some "quick and dirty" loading of this table, I can just create a SQL script file and run that in SQL*Plus. For example:
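A sketch of such a script, with invented values:

insert into my_table (id, author, created, jsontext)
  values (1, 'John Smith', to_date('2015-04-29', 'YYYY-MM-DD'),
    '{"id":1,"author":"John Smith","text":"some document text"}');
commit;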

Simple and quick, and fairly easy to generate from source files using a language such as Perl or Python. But do this with any quantity of data and you'll rapidly hit limitations:

Any single quotes in the text need to be doubled up:
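For example, to load a value containing "It's":

insert into my_table (id, jsontext)
  values (2, '{"text":"It''s a fine day"}');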

Any ampersand ("&") characters in the script will cause problems unless you've put SET DEFINE OFF somewhere in the script.

An input line to SQL*Plus cannot be longer than 2499 characters. If you want to include more than this you'll need to concatenate strings:
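For example (a sketch - each individual line must stay under the limit):

insert into my_table (id, jsontext)
  values (3, '{"text":"the first chunk of a long document ' ||
    'the second chunk of the same document ' ||
    'the final chunk"}');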

Even then, that concatenation method will only allow you to assemble up to 4000 characters. To continue longer than that would require you to create temporary lobs, and use DBMS_LOB.APPEND for each chunk.
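A minimal sketch of that approach. Note it uses DBMS_LOB.WRITEAPPEND, the variant that accepts a VARCHAR2 buffer directly, rather than DBMS_LOB.APPEND, which expects another LOB as its source:

declare
  l_clob clob;
begin
  -- assemble the document in a temporary lob, one chunk at a time
  dbms_lob.createtemporary(l_clob, true);
  dbms_lob.writeappend(l_clob, length('the first chunk '), 'the first chunk ');
  dbms_lob.writeappend(l_clob, length('the second chunk'), 'the second chunk');
  insert into my_table (id, jsontext) values (4, l_clob);
  dbms_lob.freetemporary(l_clob);
end;
/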

It's slower than most other methods, which will be an issue if you're loading large quantities of data.

Let's take our use case 1 - the JSON documents. We could cut-and-paste from the file into a hand-coded script, but it's probably better to do it using a text-processing language such as Perl.  The attached file jsonToSql.pl replaces any single quotes with two single quotes, and breaks each line into chunks no longer than 2400 characters (allowing some space for syntactic overheads).

Assuming we have perl installed and available, we can call it from the command line (Windows or Unix) using
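perl jsonToSql.pl json.txt > json.sql

(Exact usage depends on the script; here we assume it reads json.txt and writes the SQL to standard output.)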

And we can then load the resulting file json.sql (after creating the necessary table - see above) using
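sqlplus user/password @json.sql

(Substitute your own credentials and connect string.)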

SQL*Plus isn't very good for loading from separate files, so we'll leave scenarios two and three out for this method.

Multibyte Considerations

SQL*Plus isn't very good at dealing with multibyte / UTF-8 characters. To get correct (and viewable!) output you will probably want to use SQL Developer to run the scripts.

Method 2: Using SQL*Loader

SQL*Loader is a utility from the very early days of Oracle. It's reasonably simple to use, but the vast number of options can make it intimidating for the uninitiated.

SQL*Loader loads files from the client.  It has no access to files on the server.

It is normal to provide SQL*Loader with two inputs: a data file containing the data to be loaded to the database, and a control file which specifies how the data file is laid out, and which columns the data should be written to (it's possible to include the data inside the control file, but we'll stick with separate files for this example).

Use case 1:  JSON documents in a single file

We'll use the same table "my_table" as defined earlier.

This is fairly straightforward.  We create a control file called jsonloader.ctl which consists of
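The following is a sketch: the data file name is assumed, RECNUM simply numbers the records for the id column, and X'0A' (newline) terminates each document:

LOAD DATA
INFILE 'json.txt'
APPEND
INTO TABLE my_table
( id       RECNUM,
  jsontext CHAR(32000) TERMINATED BY X'0A'
)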

Then we load it using the command:
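sqlldr userid=user/password control=jsonloader.ctl log=jsonloader.log bad=jsonloader.bad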

The "log" and "bad" arguments aren't required, but can be most useful for debugging.

Note that we've limited the length of the jsontext field to 32000 characters, but you can increase that size if you need to.

Use case 2: Loading from separate files

A quick reminder about our second scenario. We have a table which we'll define as:
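A sketch (again, the table and column names are my own):

create table my_docs (
  id       number,
  author   varchar2(30),
  created  date,
  doctext  clob
);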

and into this table we want to load the files file1.txt and file2.txt. Their one-line contents, shown below, are invented for illustration.

file1.txt:
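This is the full text of the first document, all on a single line.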

file2.txt:
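This is the full text of the second document, also on a single line.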

These files only have one line in each, but they could just as easily have multiple lines per file. One file is loaded into each database record.

If we want to use SQL*Loader to load CLOB or BLOB data from separate files, then the filenames for those files must be included in the data file. We use the "FILLER" syntax to specify that the filename isn't to be loaded into any column, and then the LOBFILE syntax to load the text into a column.

So here's our control file. We can see that it's similar to the previous control file but the "FIELDS" section is more complicated, to allow for multiple metadata fields, and the reference to the external file to be loaded.

loader.ctl:
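A sketch, matching the my_docs table above and the loader.txt data file below (note the DATE mask matches the date format used in loader.txt):

LOAD DATA
INFILE 'loader.txt'
APPEND
INTO TABLE my_docs
FIELDS TERMINATED BY ','
( id       CHAR(10),
  author   CHAR(30),
  created  DATE "YYYY-MM-DD",
  fname    FILLER CHAR(80),
  doctext  LOBFILE(fname) TERMINATED BY EOF
)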

"fname" is an arbitrary label, we could have used "fred" and it would have worked exactly the same. It just needs to be the same on the two lines where it is used.

Note that the control file includes the name of our data file, loader.txt. loader.txt contains the metadata associated with each file, and the filename.

loader.txt:
1,John Smith,2015-04-29,file1.txt
2,Pete Jones,2013-01-31,file2.txt

So we're loading two files, along with the metadata for each.

Now we can run SQL*Loader from the command line as follows:
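sqlldr userid=user/password control=loader.ctl log=loader.log bad=loader.bad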

(again, the "log" and "bad" arguments aren't required, but can be most useful for debugging).

Then we can use SQL*Plus or SQL Developer to examine our loaded table:
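select id, author, created, doctext from my_docs;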

Common mistakes:

Do not be tempted to add spacing to format the loader.txt file nicely. If you use
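1, John Smith, 2015-04-29, file1.txt
2, Pete Jones, 2013-01-31, file2.txt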

Then not only will the author be loaded with a space in front of "John", but you'll see an obscure error, because SQL*Loader will look for a LOB file whose name starts with a space (" file1.txt") and fail to open it.

Multi-byte considerations

So what if your file contains multi-byte characters? Perhaps some Chinese text mixed up with English?
You will simply need to specify the character set of the file in the LOBFILE directive. For example, I have a file encoded in UTF-8, so I modify the control file to read:

Loading Binary Files with SQL*Loader

The above method (without the CHARACTERSET addition) will work exactly the same if our column is a binary BLOB column rather than a character CLOB. There is no difference in the loading mechanism - all that changes is that the database does not attempt to do any kind of character set conversion on the data as it is loaded into the database.

Method 3: Using BFILES and PL/SQL

A BFILE is a non-standard datatype which provides read-only access to data held on an external file system.

NOTE: This method will only work with files on the server. BFILES have no access to files on the client machine.

It's possible to work directly with BFILEs, but it's perhaps more common to use them as a kind of temporary staging area to allow you to read data into database columns.

BFILEs require a DIRECTORY - which is an internal database pointer to a folder/directory on the file system. To create a directory you must have CREATE ANY DIRECTORY privilege - or you can have a DBA create the directory and grant you access to it. This provides a layer of protection to prevent users from reading or writing files on the database server.
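For example (the directory name, path and grantee are illustrative; the grant is only needed if a different user will be reading the files):

create directory my_directory as '/home/oracle/documents';
grant read on directory my_directory to appuser;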

The supplied package DBMS_LOB provides a number of utility functions, including LOADFROMFILE.

On the face of it, this might seem all we need, but in fact there's some preparation work we need to do.

LOADFROMFILE will only load into an open clob. So we need to prepare a lob locator, open it, call LOADFROMFILE, then close it afterwards.

The following example shows a procedure which writes to our previously-created my_table.

Firstly it creates a BFILE from the supplied directory and filename, then it inserts the metadata into my_table, fetching back the lob locator for the newly created row. It then uses LOADFROMFILE to write into that lob:
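Here's a sketch of such a procedure; the name and parameters are my own, and it writes into the my_table defined earlier:

create or replace procedure load_file_to_my_table (
  p_id       number,
  p_author   varchar2,
  p_dir      varchar2,
  p_filename varchar2 ) is
  v_bfile  bfile;
  v_clob   clob;
begin
  -- create a bfile locator pointing at the external file
  v_bfile := bfilename(p_dir, p_filename);
  -- insert the metadata, fetching back the empty lob locator
  insert into my_table (id, author, created, jsontext)
    values (p_id, p_author, sysdate, empty_clob())
    returning jsontext into v_clob;
  -- open the source file and destination lob, copy the data, close both
  dbms_lob.fileopen(v_bfile, dbms_lob.file_readonly);
  dbms_lob.open(v_clob, dbms_lob.lob_readwrite);
  dbms_lob.loadfromfile(v_clob, v_bfile, dbms_lob.getlength(v_bfile));
  dbms_lob.close(v_clob);
  dbms_lob.fileclose(v_bfile);
end;
/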

To use this procedure, we might do:
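For example (the directory and file names are illustrative):

begin
  load_file_to_my_table(3, 'Roger Ford', 'MY_DIRECTORY', 'file3.txt');
end;
/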

Note that in this case we're using SYSDATE for the creation date, which varies slightly from previous examples. We could of course include create_date as another argument to our procedure.

Common Mistakes:

Remember that the file must be found on the database server. You cannot use this method to load a file which is located on a local machine if you are using SQL*Plus or SQL Developer remotely.

The directory name is an identifier, similar to a username, table name or column name. As such, if you specify it without double quotes, it will be automatically upper-cased. But BFILENAME expects the directory name to be passed in as a string, so you must specify it in upper case if the directory was originally created without quotes.

No character set conversion is applied by LOADFROMFILE. This generally means the above procedure will not work if your data is not simple single-byte characters. Often you'll find the text is inexplicably truncated, because of the difference between the number of characters and the number of bytes. See the next section for a solution.

Multibyte Considerations

If we want to load multibyte character data, we must forget DBMS_LOB.LOADFROMFILE and copy the contents of the BFILE into the CLOB using a more "manual" technique. This involves reading chunks into a RAW buffer, and then using UTL_RAW.CAST_TO_VARCHAR2 to allow us to write the RAW data into the CLOB value:
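A sketch of that technique. It assumes the database character set is AL32UTF8, so the UTF-8 bytes can be appended without conversion; a production version would also need to ensure a multibyte character is never split across two chunks:

declare
  v_bfile   bfile := bfilename('MY_DIRECTORY', 'chinese.txt');
  v_clob    clob;
  v_raw     raw(32767);
  v_text    varchar2(32767);
  v_amount  integer;
  v_offset  integer := 1;
begin
  insert into my_table (id, author, created, jsontext)
    values (4, 'Anon', sysdate, empty_clob())
    returning jsontext into v_clob;
  dbms_lob.fileopen(v_bfile, dbms_lob.file_readonly);
  loop
    v_amount := 32767;
    begin
      -- read the next chunk of raw bytes from the bfile
      dbms_lob.read(v_bfile, v_amount, v_offset, v_raw);
    exception
      when no_data_found then exit;  -- end of file reached
    end;
    v_offset := v_offset + v_amount;
    -- reinterpret the bytes as character data and append to the clob
    v_text := utl_raw.cast_to_varchar2(v_raw);
    dbms_lob.writeappend(v_clob, length(v_text), v_text);
  end loop;
  dbms_lob.fileclose(v_bfile);
end;
/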

Method 4: Using External Tables

External tables appear to the database much like normal tables, but their data resides on the file system. The layout of the external files is described using a variation on SQL*Loader syntax.

However, SQL*Loader loads from the client file system, whereas external tables expect their data files to exist on the server file system.

External tables are particularly useful for loading large numbers of records from a single file, so each record appears in its own row in the table.
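Here's a sketch of an external table for records that combine metadata and full text (the table name, directory and data file name are my own):

create table ext_docs (
  id       number,
  author   varchar2(30),
  created  date,
  doctext  clob
)
organization external (
  type oracle_loader
  default directory my_directory
  access parameters (
    records delimited by newline
    fields terminated by ',' optionally enclosed by '"'
    ( id,
      author,
      created char(10) date_format date mask "YYYY-MM-DD",
      doctext char(32000) )
  )
  location ('external.dmp')
);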

And here's our data file external.dmp (contents invented for illustration):
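1,John Smith,2015-04-29,The quick brown fox jumps over the lazy dog
2,Pete Jones,2013-01-31,"Some longer text, which contains a comma"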

Note the second line requires quotes around the full text string, as otherwise the field would be terminated at the comma.  Of course we meet the usual problem that the full text string may contain quotes. We can counter that by doubling them:
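2,Pete Jones,2013-01-31,"Some longer text, containing a comma and some ""doubled"" quotes"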

But we still have the issue that the full text might contain the record delimiter character (X'0A', or newline). So as with SQL*Loader, we can specify a separate file for each document. External tables do not support the FILLER directive, so instead we have to use a COLUMN TRANSFORMS line to specify that our fname field contains the name of our external file:
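Here's a sketch, reusing the loader.txt metadata file and the individual text files from the SQL*Loader example (again, the names are my own):

create table ext_docs2 (
  id       number,
  author   varchar2(30),
  created  date,
  doctext  clob
)
organization external (
  type oracle_loader
  default directory my_directory
  access parameters (
    records delimited by newline
    fields terminated by ','
    ( id,
      author,
      created char(10) date_format date mask "YYYY-MM-DD",
      fname   char(80) )
    column transforms ( doctext from lobfile(fname) from (my_directory) clob )
  )
  location ('loader.txt')
);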

Note: the "FROM (MY_DIRECTORY)" clause is not necessary here since we've already specified that as the default directory.
However if the loader.txt file and the individual clob files were in different locations, it would be needed.

Once your data is in an external table, you can either leave it there and query it like any normal table (external tables are read-only, so you can't run DML against them), or you can use the external table as a "staging" table to get the data into a normal table. You can simply do
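create table my_normal_table as
  select * from ext_docs;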

to create a new normal table. And of course you can similarly do:
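insert into my_docs (id, author, created, doctext)
  select id, author, created, doctext from ext_docs2;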

Conclusion

There are several ways to load documents into LOB columns in Oracle database tables. The choice of method will depend on the nature and location of your data, the performance required for loading, and perhaps the loading method you're already most familiar with.

The author welcomes corrections and comments:  roger.ford@oracle.com
