Wednesday, September 3, 2014

Getting started with Apache Spark in Python

Apache Spark is one of the promising new technologies for big data processing; it provides distributed in-memory tables and allows you to quickly compute aggregates and even complicated calculations. Spark provides interfaces for Java, Scala and Python.

Although setting a production cluster of Spark can be complicated (especially since you probably want to set up a hadoop cluster too), you can run Spark locally and, in that case, installation is a breeze (download,unpack :).

After you've installed it, running pyspark will get you a shell, and it will create a context (from which you can access most functionality) in a variable called sc.

For this simple tutorial we will load a few files containing sales data, and aggregate in a couple of different ways. You can read any kinds of files, but you need to parse them yourself, so you normally want a simple format; we will read tab-separated files, which contain a fixed number of fields, separated with tabs. Our file will contain the day, the store id, the product id and the amount sold on that day.

To load a file we use the textFile method of the context; you pass it the name of a file (or a regular expression matching more than one file)

For example, to load all of the files that start with sales we would do:
sc.textFile("sales_*.txt")

now, this gives us a collection (spark calls them RDDs) of lines; we normally want to parse each line into its fields; we can call the map method, and give it a function; this will return a collection resulting from applying this method to each of the elements of the previous one (and we can chain this calls). Since we have a tab-delimited file, we just need to call the split method on each line, and that will return an array of strings; so we can do:
sales=sc.textFile("sales_*.txt").map(lambda x:x.split('\t'))

Here we use textFile to read all files that start with sales_ and end in .txt; this would get us a collection of lines; we then apply to each line a function which splits the line by tabs ('\t'), so we now have an array of fields instead of each line; we finally store that in a variable called sales (as an aside, this doesn't actually read the files; instead, spark stores the transformations and won't actually compute it until it's needed; if you want to see the results you can use first to get the first object, or collect to get the full collection).

Once we have the data loaded, we may want to do some calculations; most of the time, we want to calculate values for a group;  something equivalent to a GROUP BY in SQL. We can use the reduceByKey method; this method takes a collection of pairs; will create groups by the first element, and then apply a reducing function to the elements in the group.

Our sales RDD does not contain pairs, so we first need to apply a map, to get pairs; the function we want to apply to the group is addition, so if we want to calculate the sales by day, we can do:
sales_by_day=sales.map(lambda x:(x[0],int(x[3])))       
   .reduceByKey(lambda x,y:x+y)

Remember that the day is the first field (so x[0]) and the amount sold is the fourth element, x[3]. We use reduceByKey to group by the first field (the day) and calculate the sum of the amount sold.

You can look at the elements in sales_by_day with:
sales_by_day.collect()

Which, for our sample data, produces 
[(u'2014-01-02', 212), (u'2014-01-01', 357)]

Many times we want to store the output into another text file; we can use saveAsTextFile; notice that we normally want to apply map again, to transform our output into a more suitable format; for example, to store it again as a text-delimited file we can do:
sales_by_day.map(lambda l: "{0}\t{1}".format(l[0],l[1])) .saveAsTextFile("sales_by_day")

Notice this will actually produce a folder called sales_by_day with a bunch of files, part-0000 etc, and SUCCESS (if it succeeds), just like hadoop would.

Sample data can be obtained from https://github.com/okaram/spark ; I plan to make this into a series; next one will be on joins, another on writing applications, and at least one more on performance.

Friday, July 4, 2014

Using Excel as your UI

At work, part of our job involves getting data from a bunch of different databases and verifying that certain relationships hold among them, or finding other kinds of problems or bugs; we usually end up pasting our results into Excel, and sending the spreadsheets around.

One of our teammates had the idea of writing a script that writes directly to Excel; I'm still amazed it didn't occur to me :) but to compensate, I figured I'd write a program to do the same; since I couldn't find any simple .net libraries to write to excel, I figured I'd try with python; found openpyxl (great simple library, although its docs are outdated), so I wrote a simple script that takes queries in JSON and produces an excel spreadsheet with the data produced by those queries.
The Json file looks like this:
Notice we can specify several sheets, and for every sheet we specify queries; for each query we specify its DBI driver (you need to have those libraries available in your system), connection string and query string.
The python code is amazingly simple:
And you call it by passing it the name of the json file, and the name of the output you want.

Monday, June 9, 2014

NodeJs tips and Tricks

I'm relearning nodejs, for fun; I'm writing here some links I've found useful (more as a reminder to myself)


Monday, May 26, 2014

Playing with nginx and postgresql

I've always used Apache as my web server (and now MS IIS :), but have wanted to play with nginx; I was looking for something to do with my Raspberry Pi, so I figured setting up nginx's postgresql module would be fun. I followed the instructions for OpenResty (a 'distribution' of nginx with a bunch of modules configured by default; not a lot of documentation, but enough to get started).

I was running raspbian (a debian derivative), so it was as simple as:
apt-get install libreadline-dev libncurses5-dev libpcre3-dev libssl-dev perl make libpq-dev
(notice libpq-dev is not in the docs, but it is needed for the postgresql module, which is not installed by default with openresty)

then download with wget (this is the latest version at the time)
wget http://openresty.org/download/ngx_openresty-1.5.12.1.tar.gz

Uncompress, configure, make and install
tar -xzf ngx_openresty-1.5.12.1.tar.gz
cd openresty-1.5.12.1
./configure --with-luajit --with-http_postgress_module && make && sudo make install

nginx gets installed into /usr/local/openresty/nginx , so we can edit the configuration file with:
sudo vim /usr/local/openresty/nginx/nginx.conf

(I was using vim; use your favorite text editor otherwise :)

And we get to the fun part ; in nginx, we use the server stanza to configure a server; inside the http section, we use upstream to specify how to connect to the database (db is just a name, which we'll use in our other directives)
upstream db {
    postgres_server 127.0.0.1 dbname=test user=test password=abc123;
}

And then inside the server section, we can define specific queries for certain locations; for example:
location /stores {
    rds_json          on;
    postgres_pass   db;
    postgres_query  "SELECT * FROM Stores";
}
Specifies that when the user visits /stores, we should return (as json) the result of executing SELECT * from Stores.

Tuesday, May 13, 2014

Getting running queries in SQL Server

Today I needed to get running queries in SQL server (you can kill them with the command KILL followed by the session id). After 'binging' it, got to the following query :). Need to study more about the metaschema in sql server.

SELECT session_id,
       request_id,
       Db_name(database_id),
       start_time,
       status,
       command,
       Substring(txt.TEXT, ( statement_start_offset / 2 ) + 1,
       ( ( CASE statement_end_offset
       WHEN -1 THEN Datalength(txt.TEXT)
       ELSE statement_end_offset
                                                                   END
           - statement_start_offset ) / 2 ) + 1) AS statement_text,
       wait_type,
       wait_time,
       blocking_session_id,
       percent_complete,
       cpu_time,
       reads,
       writes,
       logical_reads,
       row_count
FROM   sys.dm_exec_requests
       CROSS APPLY sys.Dm_exec_sql_text([sql_handle]) AS txt
WHERE  session_id <> @@SPID
       AND session_id > 50

Sunday, April 27, 2014

Automatas (finite state machines): A programmer's perspective

Finite state machines are simple and nifty; when I was teaching theory, I saw students, even good programmers, had trouble with them when presented as math, so I figure showing them as programs could make it click for some people.

From http://cg.scs.carleton.ca/~michiel/TheoryOfComputation/ (a free ToC book)

Definition 2.2.1 A finite automaton is a 5-tuple M = (Q, Σ, δ, q, F), where
1. Q is a finite set, whose elements are called states,
2. Σ is a finite set, called the alphabet; the elements of Σ are called symbols,
3. δ : Q × Σ → Q is a function, called the transition function,
4. q is an element of Q; it is called the start state,
5. F is a subset of Q; the elements of F are called accept states.

Doesn't that sound fancy ? It means we have an object with 5 fields (a 5-tuple), Q, a set of things called states, Σ, a set of symbols called the alphabet (we'll use characters as our symbols), δ, a function that takes a state and a symbol and returns a state (conceptually we are 'moving' from one state to another when we see a symbol), q, an element of Q , called the start state and F a subset of Q, the set of accepting states.

In Scala, we can represent it as:


We can then define the extended transition function, which says which state we end up at by following a string, recursively as
This means if the string is empty, we stay at the state we're at; if not, we follow the first character (by using delta), and then follow the rest of the string from there; this recursive definition maps nicely to a List in scala (or any other language with cons-style lists :), so we end up with the following code:


And the automata accepts a string if we would go from the initial state to the final state by following that string; in Scala
To use it we would first define a delta function, say:
And then we can use like this:
The full code is at https://github.com/okaram/scala/tree/master/automata, including non-deterministic automata (which I will probably get around to blog about some other time :)

Thursday, April 17, 2014

Accessing Databases with R (RStudio)

R has libraries for accessing several databases, but if you're on Windows, you'd probably prefer the RODBC package; with this, you can access almost any db through ODBC. You can create a DSN and use odbcConnect, or use odbcDriverconnect ; the problem is figuring out the connection string, especially since you need to specify the driver. For SQL Server, this works (assuming db is in localhost):


library(RODBC)
c=odbcDriverConnect('Driver=SQL Server;Server=localhost;Database=......;Trusted_Connection=True')

And then you can do odbcquery like:
res=sqlQuery(c,"SELECT * FROM ....")

And you get a data frame !

http://www.connectionstrings.com/ has odbc connection strings for many dbs