Another old posting from my previous blog.


Continuing to get acquainted with Python, I looked at the book „Introduction to Data Science“ by Joel Grus. I have it in German as an electronic book, which is unfortunately no fun to read, although the content is solid – with restrictions. The most important problem is the formatting of the Python code, which does not even work in landscape mode. The printed book uses a smaller font and lines are completely in one line. You can and should download the examples. There are also some misprints. I found one in the introduction to the Bayes formula, where the fraction is reversed. But most importantly, I have my doubts if the book is really a help for the starting data scientist. The presentation is very short. You will need a good background in math and programming, or have someone to ask if you do not understand the theory or the code. The code gets pretty advance pretty fast.

However, this is not a review of the book. But I take an example of the book to show how Python can be used from within Euler Math Toolbox. The following code reads the bible and prints the 10 most common words.

infile = open("bible.txt","r")

from collections import Counter

num_words = 10

counter = Counter(word.lower()
    for line in infile
    for word in line.strip().split()
    if word)

for word, count in counter.most_common(num_words):
    print(str(count)+"\t"+word)

infile.close()

I downloaded the file „bible.txt“ from the net and put it into the folder, where the IPython notebook resides. The output is the following.

64183	the
51379	and
34746	of
13643	to
12799	that
12559	in
10263	he
9840	shall
8987	unto
8835	for

Now, let us do this programming exercise in Euler Math Toolbox via Python.

>>> from collections import Counter
>function python testbible (filename) ...
$""" 10 most common words in the bible
$"""
$infile = open(filename,"r")
$num_words = 10
$counter = Counter(word.lower()
$    for line in infile
$    for word in line.strip().split()
$    if word)
$res = ""
$for word, count in counter.most_common(num_words):
$    res = res+str(count)+"\t"+word+"\n"
$infile.close()
$return res
$endfunction
>testbible("bible.txt")
 64183   the
 51379   and
 34746   of
 13643   to
 12799   that
 12559   in
 10263   he
 9840    shall
 8987    unto
 8835    for
 
>>> help(testbible)
 Help on function testbible in module __main__:
 
 testbible(filename)
     10 most common words in the bible

The import was done with a direct Python command. The function returns a string with the answer. But it could just as well use „eumat.dump(„….“)“ for a direct output. Note, that you do not need to indent the Python definition. That is done automatically by EMT.

The „$“-signs in the function definition allow copying and pasting the code into an EMT notebook. They are not visible in the EMT window.

Here is another example, done in IPython.

def sieve(N):
    v = [True for _ in range(N+1)]
    for k in range(2,N+1):
        if k*k>N:
            break
        if v[k]:
            m=2*k
            while m<=N:
                v[m]=False
                m=m+k
    return [k for k in range(N+1) if v[k] and k>1]

s=sieve(100_000_000)

from math import log
x=[k for k in range(1000,len(s),1000)]
y=[(s[k]/(k*log(k))-1-(log(log(k))-1)/log(k))*log(k)**2 for k in 
     range(1000,len(s),1000)]

import matplotlib as mpl
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(8,8))
ax.plot(x,y)

This code computes the primes up to 100 million and plots the actual size of the k-th prime, divided by a known asymptotic for the size of the k-th prime.

We only use Python to do the sieve. For the plot, we use Euler Math Toolbox.

>function python sieve (N) ...
$v = [True for _ in range(N+1)]
$for k in range(2,N+1):
$    if k*k>N:
$         break
$    if v[k]:
$         m=2*k
$         while m<=N:
$             v[m]=False
$             m=m+k
$return [k for k in range(N+1) if v[k] and k>1]
$endfunction
>tic; s = sieve(100 000 000); toc;
 Used 29.407 seconds
>n = length(s)
 5761455
>k = 1000:1000:n;
>y = (s[k]/(k*log(k))-1-(log(log(k))-1)/log(k))*log(k)^2;
>plot2d(k,y):

As you see, the sieve takes about 30 seconds on my computer (Core I7, 4.0GHz). Of course, it is much faster to do that in C.

>function tinyc csieve (N) ...
$ARG_DOUBLE(N);
$int n=(int)N;
$CHECK(n>10,"Need a positive integer.");
$char *v = (char *)getram((n+1)*sizeof(char));
$CHECK(v,"Out of heap space.");
$
$for (int i=0; i<n+1; i++) v[i]=1; 
$for (int k=2; k<n+1; k++)
${
$   if (k*k>n) break;
$   if (v[k])
$   {
$       int m=2*k;
$       while (m<=n)
$       {
$           v[m]=0; m+=k;
$       }
$   }
$}
$int count=0;
$for (int k=2; k<n+1; k++) count+=v[k];
$
$header *out=new_matrix(1,count);
$CHECK(out,"Out of heap space");
$double *res=matrixof(out);
$
$int j=0;
$for (int k=2; k<n+1; k++)
$   if (v[k])
$       res[j++]=k;
$
$moveresults(out);
$endfunction
>s=csieve(100)
 [2,  3,  5,  7,  11,  13,  17,  19,  23,  29,  31,  37,  41,  43,  47,
 53,  59,  61,  67,  71,  73,  79,  83,  89,  97]
>tic; s=csieve(100 000 000); toc;
 Used 1.459 seconds

The procedures to communicate from EMT to Tiny C are a bit involved. But the result is 15 times faster. The basic code is a direct translation of the Python code.

Depending on your heap space of Euler Math Toolbox, you can do more primes. If you compress the sieve using bits, you could to a lot more. But for this code and the default size, 100 million is the limit. Increasing the stack size to 2048 MB, I could do all primes up to 1 billion in 15 seconds.

Published .

Schreibe einen Kommentar

Deine E-Mail-Adresse wird nicht veröffentlicht. Erforderliche Felder sind mit * markiert