2014-10-23

The set-up:

This post is about testing the performances of solutions to the following problem:

A cell array of S strings formatted as numbers separated by underscores is given, e.g.:

list_of_words = repmat({'02_04_04_52_23_14_54_672_0'},10,1);

Is guaranteed for all strings that:

they contain only decimal digits and underscores;

the number of underscores is the same;

there are no two consecutive underscores.

Write MATLAB code that would convert the cell array of strings to a numeric matrix S×M of doubles (values 1000 time smaller), where S is the number of strings and M is the number of numbers in a string.

A very similar problem was posted on StackOverflow, with several solutions proposed. The original goal was improving speed performance.

The code:

Two of the solutions, designed for speed, seem to have wildly varying performances from one MATLAB installation to another, or even from one run to another. Also, they have really different implementation styles.

In the dark corner you'll find a solution that goes against MATLAB's most sacred tenets: eval is evil, and loops should be avoided at all costs. However, the code tries to optimise the speed by avoiding repeated memory allocations, by using the fast ways of converting strings to numbers, and by doing in-place operations:

Note: The original solution returned the result as M×S double array. The code was adapted for S×M; however, this adaptation consumes twice the memory. And yes, I wrote this solution.

In the clear corner you'll find a solution true to MATLAB spirit, that avoids using loops and eval, favoring a single sscanf read of all the strings (which is a clever way of avoiding the overhead of calling the function S times):

Note: This solution belongs to @Divakar.

The referee is made up of two functions: one that generates test cases, and one that does the timig.

The test case generator groups in a cell array underscore-delimited strings out of 10 random integers between 1 and 9999 (for the sake of simplicity); however, an implementation should only assume that the numbers are positive or zero, and the number should fit in a double:

The timing function takes as arguments a test case and an arbitrary number of processing function handles; it is implemented such as errors in one function should not bother the execution of the others. It uses timeit as recommended by @Divakar and @LuisMendo; for those who don't have this function in their default MATLAB installation, it may be downloaded from MATLAB Central / File Exchange:

The problem:

As said before, the results seem to vary from one MATLAB installation to another, and even from one run to another. The problem that I propose is three-fold:

On your specific MATLAB installation (version + OS + HW), how well the two solutions perform compared to eachother?

Is it possible to improve one solution or the other in a sizable way?

Are there even faster MATLAB-native (e.g. no MEX or special toolboxes) solutions based on completely different ideas/functions?

For point 1. (benchmarking), please create in your MATLAB path the four function files eval_and_loops_solution.m, single_sscanf_solution.m, referee_test_case.m, referee_timig.m and other functions that you might want to test (for example, the implementations proposed by answers). Use them for several numbers of words, e.g. like this script:

When posting the results, please specify your MATLAB version, the OS, the size of RAM and the CPU model. Please be aware that these tests might take some time for large number of words! However, running this script will not alter the content of your current workspace.

For points 2. or 3. you can post code that use the same in/out conventions as eval_and_loops_solution.m and single_sscanf_solution.m, along with the benchmarking that supports the claim.

The bounties:

I will up-vote every benchmarking answer, and I encourage everybody to do that. Is the least that one can do for people that contribute with their skills, time and processing power.

A +500 bounty bonus will go to the author of the fastest code, be it one of the two proposed, or a third new one that outperforms them. I really hope that this will match the general agreement. The bonus will be given in a
month
week from the original date of posting.

Show more