2014-09-09

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Seth Mottaghinejad

Let's review the Collatz conjecture, which says that given a positive integer n, the following recursive algorithm will always terminate:

if n is 1, stop, otherwise recurse on the following

if n is even, then divide it by 2

if n is odd, then multiply it by 3 and add 1

In our last post, we created a function called 'cpp_collatz', which given an integer vector, returns an integer vector of their corresponding stopping times (number of iterations for the above algorithm to reach 1).  For example, when n = 5 we have

5 -> 3*5+1 = 16 -> 16/2 = 8 -> 8/2 = 4 -> 4/2 = 2 -> 2/2 = 1,

giving us a stopping time of 5 iterations.

In today's article, we want to perform some exploratory data analysis to see if we can find any pattern relating an integer to its stopping time. As part of the analysis, we will extract some features out of the integers that could help us explain any differences in stopping times.

Here are some examples of potentially useful features:

Is the integer a prime number?

What are its proper divisors?

What is the remainder upon dividing the integer by some other number?

What are the sum of its digits?

Is the integer a triangular number?

Is the integer a square number?

Is the integer a pentagonal number?

In case you are encountering these terms for the first time, a triangular number is any number m that can be written as m = n(n+1)/2, where n is some positive integer. To determine if a number is triangular, we can rewrite the above equation as n^2 + n - 2m = 0, and use the quadriatic formula to get n = (-1 + sqrt(1 + 8m))/2 and (-1 - sqrt(1 + 8m))/2. Since n must be a positive integer, we ignore the latter solution, leaving us with (-1 + sqrt(1 + 8m))/2.

Thus, if plugging m in the above formula results in an integer, we can say that m is a triangular number. Similar rules exist to determine if an integer is square or pentagonal, but I will refer you to Wikipedia for the details.

For the purpose of conducting our analysis, we created some other functions in C++ and R to help us. Let's take a look at these functions:

Here is a list of other helper functions in collatz.cpp:

'is_prime': given an integer vector, returns a logical vector

'gen_primes': given some integer input, n, generates the first n prime numbers

'gen_divisors': given an integer n, returns an integer vector of all its proper divisors

'is_perfect': given an integer vector, returns a logical vector

As you can guess, many of the above functions (such as 'is_prime' and 'gen_divisors') rely on loops, which makes C++ the ideal place to perform the computation. So we farmed out as much of the heavy-duty computations to C++ leaving R with the task of processing and analyzing the resulting data.

Let's get started. We will perform the analysis on all integers below 10^5, since R is memory-bound and we can run into a bottleneck quickly. But next time, we will show you how to overcome this limitation using the 'RevoScaleR' package, which will allow us to scale the analysis to much larger integers.

One small caveat before we start: I enjoy dabbling in mathematics but I know very little about number theory. The analysis we are about to perform is not meant to be rigorous. Instead, we will attempt to approach the problem using EDA the same way that we approach any data-driven problem.

To determine if a numeric number is an integer or not, we need to be careful about not using the '==' operator in R, as it is not guaranteed to work, because of minute rounding errors that may occur. Here's an example:

.3 == .4 - .1 # we expect to get TRUE but get FALSE instead
[1] FALSE

The solution is to check whether the absolute difference between the above numbers is smaller than some tolerance threshold.

eps <- 1e-9
abs(.3 - (.4 - .1)) < eps # returns TRUE
[1] TRUE

Finally, we will create a variable listing all of the integer's proper prime divisors. Every composite integer can be recunstruced out of these basic building blocks, a mathematical result known as the 'unique factorization theorem'. We can use the function 'gen_divisors' to get a vector of an integers proper divisors, and the 'is_prime' function to only keep the ones that are prime. Finally, because the return object must be a singleton, we can use 'paste' with the 'collapse' argument to join all of the prime divisors into a single comma-separated string.

Lastly, on its own, we may not find the variable 'all_prime_divs' especially helpful. Instead, we cangenerate multiple flag variables out of it indicating whether or not a specific prime number is a divisor for the integer. We will generate 25 flag variables, one for each of the first 25 prime numbers.

There are many more features that we can extract from the underlying integers, but we will stop here. As we mentioned earlier, our goal is not to provide a rigorous mathematical work, but show you how the tools of data analysis can be brought to bear to tackle a problem of such nature.
Here's a sample of 10 rows from the data:

df[sample.int(nrow(df), 10), ]
int st sum_digits is_triangular is_square is_pentagonal is_prime
21721 21721 162 13 FALSE FALSE FALSE FALSE
36084 36084 142 21 FALSE FALSE FALSE FALSE
40793 40793 119 23 FALSE FALSE FALSE FALSE
3374 3374 43 17 FALSE FALSE FALSE FALSE
48257 48257 44 26 FALSE FALSE FALSE FALSE
42906 42906 49 21 FALSE FALSE FALSE FALSE
37283 37283 62 23 FALSE FALSE FALSE FALSE
55156 55156 60 22 FALSE FALSE FALSE FALSE
6169 6169 111 22 FALSE FALSE FALSE FALSE
77694 77694 231 33 FALSE FALSE FALSE FALSE
is_perfect all_prime_divs is_div_by_2 is_div_by_3 is_div_by_5 is_div_by_7
21721 FALSE 7,29,107 FALSE FALSE FALSE TRUE
36084 FALSE 2,3,31,97 TRUE TRUE FALSE FALSE
40793 FALSE 19,113 FALSE FALSE FALSE FALSE
3374 FALSE 2,7,241 TRUE FALSE FALSE TRUE
48257 FALSE 11,41,107 FALSE FALSE FALSE FALSE
42906 FALSE 2,3,7151 TRUE TRUE FALSE FALSE
37283 FALSE 23,1621 FALSE FALSE FALSE FALSE
55156 FALSE 2,13789 TRUE FALSE FALSE FALSE
6169 FALSE 31,199 FALSE FALSE FALSE FALSE
77694 FALSE 2,3,23,563 TRUE TRUE FALSE FALSE
is_div_by_11 is_div_by_13 is_div_by_17 is_div_by_19 is_div_by_23
21721 FALSE FALSE FALSE FALSE FALSE
36084 FALSE FALSE FALSE FALSE FALSE
40793 FALSE FALSE FALSE TRUE FALSE
3374 FALSE FALSE FALSE FALSE FALSE
48257 TRUE FALSE FALSE FALSE FALSE
42906 FALSE FALSE FALSE FALSE FALSE
37283 FALSE FALSE FALSE FALSE TRUE
55156 FALSE FALSE FALSE FALSE FALSE
6169 FALSE FALSE FALSE FALSE FALSE
77694 FALSE FALSE FALSE FALSE TRUE
is_div_by_29 is_div_by_31 is_div_by_37 is_div_by_41 is_div_by_43
21721 TRUE FALSE FALSE FALSE FALSE
36084 FALSE TRUE FALSE FALSE FALSE
40793 FALSE FALSE FALSE FALSE FALSE
3374 FALSE FALSE FALSE FALSE FALSE
48257 FALSE FALSE FALSE TRUE FALSE
42906 FALSE FALSE FALSE FALSE FALSE
37283 FALSE FALSE FALSE FALSE FALSE
55156 FALSE FALSE FALSE FALSE FALSE
6169 FALSE TRUE FALSE FALSE FALSE
77694 FALSE FALSE FALSE FALSE FALSE
is_div_by_47 is_div_by_53 is_div_by_59 is_div_by_61 is_div_by_67
21721 FALSE FALSE FALSE FALSE FALSE
36084 FALSE FALSE FALSE FALSE FALSE
40793 FALSE FALSE FALSE FALSE FALSE
3374 FALSE FALSE FALSE FALSE FALSE
48257 FALSE FALSE FALSE FALSE FALSE
42906 FALSE FALSE FALSE FALSE FALSE
37283 FALSE FALSE FALSE FALSE FALSE
55156 FALSE FALSE FALSE FALSE FALSE
6169 FALSE FALSE FALSE FALSE FALSE
77694 FALSE FALSE FALSE FALSE FALSE
is_div_by_71 is_div_by_73 is_div_by_79 is_div_by_83 is_div_by_89
21721 FALSE FALSE FALSE FALSE FALSE
36084 FALSE FALSE FALSE FALSE FALSE
40793 FALSE FALSE FALSE FALSE FALSE
3374 FALSE FALSE FALSE FALSE FALSE
48257 FALSE FALSE FALSE FALSE FALSE
42906 FALSE <span style="color: #000000; font-

Show more