Mac OS X – List Open Ports

In the world of Linux, you could use netstat to list all open ports on your system. I tend to use the following:

netstat -aep | grep ':\*'

However, netstat on Mac OS X behaves quite differently. To list open ports on Mac OS X, you can use something along the lines of:

sudo lsof -i -P | grep -i "listen"

For example:

localhost:~$ sudo lsof -i -P | grep -i "listen" 
launchd 1 root 27u IPv6 0xf38ca75ebb725cfd 0t0 TCP localhost:631 (LISTEN)
launchd 1 root 28u IPv4 0xf38ca75ebb726c1d 0t0 TCP localhost:631 (LISTEN)
launchd 1 root 30u IPv6 0xf38ca75ebb72591d 0t0 TCP *:22 (LISTEN)
launchd 1 root 31u IPv4 0xf38ca75ebb7264cd 0t0 TCP *:22 (LISTEN)
polard 79 root 6u IPv4 0xf38ca75ebdb15c1d 0t0 TCP localhost:49152 (LISTEN)
...

XP-Dev.com: Another Milestone

So, it has been three months and a bit since the upgrade to the new platform that runs the current version of XP-Dev.com, and it has been eventful. There was the release that took a whole day, and then there were some functional releases as well. The latest release brings some really cool features to XP-Dev.com.

Another milestone has been reached, and the functionality gap has been narrowing drastically over the past three months. However, I will be brave enough to admit that the gap is still there and, at least for me, there's still a mountain to climb ahead.

Enjoy the new releases, and as usual, your feedback is appreciated. Do give a shout in the forums, or just raise a support ticket. You can contact me via this blog as well.

Asymmetric Follow, Pub/Sub and Systems Design

James Governor mentions a very interesting pattern of Web 2.0 – asymmetrical follow. In a nutshell, it's an unbalanced communication network: some nodes on the network (hubs) have far more inbound links than others. In other words, it is a situation where popular people's words/thoughts/opinions/tweets get read by the masses, but the popular people do not reciprocate.

The thing is, asymmetrical follow exists everywhere. Celebrities have a lot of inbound links (tabloids, fans, press, etc.), but they do not necessarily have a link back. Blogs are asymmetrical by nature as well – the blogger publishes a post that's read by visitors, and comments on the blog, or even pingbacks, do not necessarily get read or responded to by the poster. It's not rude or anti-social; it's a pattern, and James is right – it is core to Web 2.0. Back in Dec '07, JP mentioned that Twitter is neither a push nor a pull network, but is actually publish-subscribe.

The point of this post surrounds what James mentions in his article:

But Twitter wasn't designed for whales. It was designed for small shoals of fish. Which brings us to one of the big issues with Asymmetrical Follow – it introduces unexpected scaling problems. Twitter's architecture didn't cope all that well at first, but has performed a lot better since the message broker was re-architected using Scala (LIFT, a new web application programming framework). The technical approach that is most appropriate to support Asymmetrical Follow is well known in the world of high-scale enterprise messaging – it's called Publish And Subscribe.

Publish-Subscribe is a very common pattern in technology. Having worked in two investment banks, I have seen plenty of implementations that do the exact same thing: publishers fire data once to a middleware layer, and that middleware layer sends the data off to many subscribers. Sounds simple enough to implement, right? Well, it's not.

Designing a good, reliable, highly performant Publish-Subscribe framework is not easy. Getting the initial bits working is simple and trivial, but the problem a lot of people face is scalability. If you are looking to build a Pub/Sub layer on your own, the first thing you have to do is stop and take a reality check. It's not worth the trouble. Buy it from someone, or reuse another framework (like Twitter has done with Scala and LIFT). I am not kidding. I have seen millions of dollars go down the drain in missed opportunities, direct trading losses, etc., all due to poorly designed and implemented Pub/Sub layers.

Pub/Sub frameworks are a lot like caches (e.g. memcached), but with a twist: not only do you have to cache data, you also have to tell subscribers when that data has changed. In fact, they are closer to finite state machines than caches.
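
To make the cache-plus-notifications idea concrete, here is a minimal, single-threaded Python sketch (the Hub class and all its names are purely illustrative, not taken from any real product):

```python
from collections import defaultdict

class Hub:
    """A toy publish-subscribe hub: a cache of the latest value per topic,
    plus change notifications fanned out to subscribers."""

    def __init__(self):
        self._cache = {}                       # topic -> latest value
        self._subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)
        if topic in self._cache:               # replay the last known state
            callback(topic, self._cache[topic])

    def publish(self, topic, value):
        if self._cache.get(topic) == value:    # only state *changes* fan out
            return
        self._cache[topic] = value
        for callback in self._subscribers[topic]:
            callback(topic, value)

# Usage
hub = Hub()
seen = []
hub.subscribe("EURUSD", lambda t, v: seen.append((t, v)))
hub.publish("EURUSD", 1.29)
hub.publish("EURUSD", 1.29)   # duplicate value, suppressed
hub.publish("EURUSD", 1.30)
print(seen)                   # [('EURUSD', 1.29), ('EURUSD', 1.30)]
```

The duplicate-suppression check is what makes it feel like a finite state machine: subscribers only hear about transitions, and a late subscriber can be replayed the current state from the cache.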

There are several off-the-shelf frameworks that I have used in the past and would happily recommend.

Now, if you’re still stubborn and think you’re up for the challenge, then here are a few pointers:

Design it really, really well

Sit down with a few people and walk them through your design. Your design has to look into how memory is managed, the threading model, the communication mechanism, etc. Find as many defects as possible and don’t take it personally. Do this before you even write a single line of code.

Have a solid, clean API

I have used some really arcane APIs in the past, and oddly, some of them are provided by electronic markets (no names mentioned here). Remember, the API will be used by publishers and subscribers. The cleaner the API, the fewer bugs it will introduce in publisher and subscriber code.

Non-blocking IO

If you’re using TCP sockets as a communication layer, do use select() (non-blocking IO). You need to break away from the one-thread-per-client model. That model, while easy to code to, just does not scale at all. I have been in way too many situations where I have inherited a system that uses the one-thread-per-client model, and all of a sudden it does not work in production because they’ve just scaled from 30 connected clients to 3000. BTW, if you’re developing in Java, I highly recommend using Apache MINA to reduce the stress of writing non-blocking IO code.
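
To illustrate the event-driven alternative, here is a toy sketch using Python's selectors module (a wrapper over select()/epoll); the echo handler and the socketpair standing in for a real client connection are my own invention:

```python
import selectors
import socket

# One event loop multiplexing many sockets: the select()-style model,
# instead of dedicating a thread to each connected client.
sel = selectors.DefaultSelector()

# A connected socket pair stands in for a real client/server connection.
server_side, client_side = socket.socketpair()
server_side.setblocking(False)

def echo(sock):
    data = sock.recv(1024)
    if data:
        sock.sendall(data.upper())   # trivial "work": echo back in upper case

sel.register(server_side, selectors.EVENT_READ, echo)

client_side.sendall(b"hello")
# One iteration of the event loop: dispatch every socket that is ready.
for key, _ in sel.select(timeout=1):
    key.data(key.fileobj)            # key.data is the handler we registered

reply = client_side.recv(1024)
print(reply)                         # b'HELLO'
sel.close()
```

With thousands of clients you still have exactly one loop and one thread; the kernel tells you which sockets are ready rather than you parking a blocked thread on each one.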

Watch out for data state inconsistencies

A common approach that a lot of frameworks use is to send a snapshot followed by updates of changes.

Publishers should send the following messages upon startup:

  1. An initial message saying “This is the beginning of my initial data”
  2. The initial data itself
  3. A final message saying “This is the end of my initial data”

From then on, publishers should just send updates.

Subscribers see the mirror image of this. A call to subscribe() should result in at least the following callbacks:

  1. A callback saying “This is the beginning of the data”
  2. The initial snapshot data itself
  3. A callback saying “This is the end of the data”

From then on, subscribers should just receive updates.

Handling the subscribe() call in your framework is going to be tricky. You’ll need to be careful of locking your cache to ensure that no one updates it while you’re taking the initial data for the subscriber. Alternatively, you could create a snapshot copy of the cache, but keep an eye on your memory usage.
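
Here is one way that subscribe() sequence could look, sketched in Python with invented names (SnapshotHub and the callback "kinds" are mine); the point is that one lock guards both the snapshot and subsequent publishes:

```python
import threading

class SnapshotHub:
    """Sketch of snapshot-then-updates delivery. The lock guarantees a
    subscriber never sees an update that is missing from its snapshot."""

    def __init__(self):
        self._lock = threading.Lock()
        self._cache = {}
        self._subs = []

    def publish(self, key, value):
        with self._lock:
            self._cache[key] = value
            for cb in self._subs:
                cb("update", (key, value))

    def subscribe(self, cb):
        with self._lock:                  # block publishers during the snapshot
            cb("snapshot_begin", None)
            for item in self._cache.items():
                cb("data", item)
            cb("snapshot_end", None)
            self._subs.append(cb)         # from here on, only updates arrive

# Usage
hub = SnapshotHub()
hub.publish("a", 1)
events = []
hub.subscribe(lambda kind, payload: events.append(kind))
hub.publish("a", 2)
print(events)  # ['snapshot_begin', 'data', 'snapshot_end', 'update']
```

Holding the lock for the whole snapshot is the simple-but-slow option; the copy-the-cache alternative trades memory for a shorter lock hold time, exactly as described above.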

Correctness over performance

Don’t worry so much about reducing latency from 200ms to 20ms. Getting your implementation correct is far more important than performance. I’m not saying performance is not important, BUT you need to get it to work correctly before looking into performance.

Build a load test framework

You will definitely need one of these. There have been way too many times when I needed to reproduce a production problem related to scalability, only to find that the original authors of the system never bothered building a load test framework.
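
Even a very small harness is better than none. A sketch of the idea in Python (load_test and its parameters are invented for illustration; in a real test the target would be your framework's publish call over the wire):

```python
import threading
import time

def load_test(target, clients=8, messages_per_client=1000):
    """Drive `target` from many concurrent threads, the way many
    connected publishers would, and report the aggregate cost."""
    def worker():
        for i in range(messages_per_client):
            target("key", i)

    threads = [threading.Thread(target=worker) for _ in range(clients)]
    start = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.time() - start
    return clients * messages_per_client, elapsed

# Usage: hammer a stand-in publish function and count what arrived.
counter = []
total, elapsed = load_test(lambda key, value: counter.append(value))
print("%d messages in %.3fs" % (total, elapsed))
```

Keep the harness next to the system's code so that the "3000 clients instead of 30" scenario can be replayed on demand, not rediscovered in production.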

I hope by now you realise that designing and implementing a Publish/Subscribe framework is not trivial. Buy it off the shelf – someone out there has already gone through all of this pain for you.

I am a big believer that your system architecture should reflect the underlying business. It will not be a good fit if you try to retrofit an incorrect architecture, as you will end up having loads of problems (scalability, maintenance, etc.) in the long run. Asymmetric follow is here to stay, so in your next project think about how it is going to affect your architecture and what you need to do at the outset to get it right.

Moving Over to Nginx

Running XP-Dev.com has its set of unique problems, and it has not always been easy. I’ve always tried to run the whole infrastructure on a shoe-string budget at the same time trying not to compromise on quality.

One of the problems is hardware resources.

The truth is: Apache is a memory hog, and to keep things scalable for serving Subversion repositories, I decided to move all the PHP websites out of Apache and run them under nginx and PHP-CGI (sudo apt-get install php5-cgi). To be honest, I did not notice any difference in website performance (apache/mod_php vs nginx/fastcgi/php-cgi); the main motivation of this exercise was to limit the maximum amount of memory that my non-critical PHP websites take, and at the same time give Apache more room to grow for serving the Subversion repositories. I could have had two Apache installations with different limits (by tweaking MaxSpare*, MaxRequests* and friends), but that's an outright pain to manage. Moreover, I needed a simple webserver that could serve static content as well.

And let's not forget the users of virtual private servers (VPS) with a limited amount of memory. Nginx and PHP-CGI is a much more appropriate solution for those memory-limited configurations.

I had a look around, and it was basically down to lighttpd or nginx as a replacement to serve the PHP websites. I picked nginx as there were some odd bugs with lighttpd serving large files. The FastCGI performance is almost the same (I did not really do any scientific benchmarks). However, the part that really got me sold on these two was that they use a master-slave threading model, rather than the (out of date) one-thread/process-per-client model, which does not scale at all. Both of them are event driven, rather than "client socket" driven. BTW, this includes the awesome J2EE web container Jetty (if you use the SelectChannelConnector).

Migrating the websites across from apache to nginx/fastcgi/php-cgi was an absolute breeze and here are a few pointers that will help ease the burden.

Strategy

Just to clarify, in the apache/mod_php world, PHP files are served via the apache process itself. The strategy under nginx is to get nginx to pass on the request to another set of long running php-cgi processes that do the actual PHP processing. The response will then be passed back to nginx, which will send it back to the web browser.

Documentation

Use the English Nginx wiki extensively. There’s a lot of documentation there on configuring and tweaking nginx, especially the module reference pages. Here’s a quick and dirty howto on getting nginx+fastcgi and php-cgi working.

PHP FastCGI Start/Stop Scripts

Save yourself the trouble of writing a custom PHP FastCGI start/stop script. Install lighttpd and use its spawn-fcgi wrapper script. It's really going to save you a lot of painful hours. I wrote a simple wrapper around that script as I wanted PHP CGI to start on every server boot, or whenever I wanted a quick restart of the processes. You might want to adjust the variables pidfile and cgidir for your setup.

#!/bin/bash

me=`whoami`
if [ "$me" != "root" ]; then
        echo "Not root!"
        exit 1
fi

pidfile=/root/php.PID
pid=`cat $pidfile 2>/dev/null`
cgidir=/var/run/php-cgi
sock=$cgidir/unix.sock

[ ! -d $cgidir ] && echo creating $cgidir && mkdir $cgidir && chown www-data.www-data $cgidir

if [ "$pid" != "" ]; then
        echo Killing $pid
        kill $pid
        rm $pidfile
        sleep 1
fi

[ -f $sock ] && chown www-data.www-data $sock

/usr/bin/spawn-fcgi -f /usr/bin/php-cgi -s $sock -C 5 -P $pidfile -u www-data -g www-data

Stop serving .htaccess

Plenty of web apps out there have built-in support for Apache, and include .htaccess files in their distribution to reduce the configuration overhead for the installer. However, nginx will serve these files by default, which may be fine in most cases, but it's always good practice to deny access to them. A simple nginx config block does the trick:

location ~ /\.ht {
    deny  all;
}

Serving PHP files

To serve PHP files, nginx will pass the request to the PHP-CGI handlers.

location ~ .*\.php$ {
	fastcgi_pass   unix:/var/run/php-cgi/unix.sock;
	fastcgi_index  index.php;
	include /etc/nginx/fastcgi_params;
	fastcgi_param  SCRIPT_FILENAME  /home/rs/local/wordpress/$fastcgi_script_name;
}

Notice that I've included a /etc/nginx/fastcgi_params file above. This file contains all the regular FastCGI directives, and I've put it in a separate file to avoid too much repetition. The content of /etc/nginx/fastcgi_params is below:

fastcgi_param  QUERY_STRING       $query_string;
fastcgi_param  REQUEST_METHOD     $request_method;
fastcgi_param  CONTENT_TYPE       $content_type;
fastcgi_param  CONTENT_LENGTH     $content_length;

fastcgi_param  SCRIPT_NAME        $fastcgi_script_name;
fastcgi_param  REQUEST_URI        $request_uri;
fastcgi_param  DOCUMENT_URI       $document_uri;
fastcgi_param  DOCUMENT_ROOT      $document_root;
fastcgi_param  SERVER_PROTOCOL    $server_protocol;

fastcgi_param  GATEWAY_INTERFACE  CGI/1.1;
fastcgi_param  SERVER_SOFTWARE    nginx/$nginx_version;

fastcgi_param  REMOTE_ADDR        $remote_addr;
fastcgi_param  REMOTE_PORT        $remote_port;
fastcgi_param  SERVER_ADDR        $server_addr;
fastcgi_param  SERVER_PORT        $server_port;
fastcgi_param  SERVER_NAME        $server_name;

WordPress Rewrite

The final tip is for all the WordPress junkies out there. To get nice URLs for WordPress, you will need the following rewrite directive. If I'm not mistaken, one is given to you for Apache when you set up custom URLs via the admin screen, but not for nginx:

if (!-e $request_filename) {
    rewrite ^(.+)$ /index.php?q=$1 last;
}

And that's about it. I really do hope these tips will help someone out there. I know it would have shaved a couple of hours off my setup time had I known them beforehand.

Spring and Jetty Integration

Jetty is a pretty darn awesome J2EE web container. With amazing features like non-blocking IO, continuations and immediate integration with Cometd – I feel that it is a solid, production ready container.

I hate war files, and I hate web.xml files – there's just way too much black magic needed to get things up and running. It's nice once someone has done the dirty work and constructed the initial web.xml, but I wouldn't want to be the person who starts it all off.

Oh – another thing – I absolutely LOVE dependency injection. Using the web.xml approach, you’ll almost always have to start off a servlet of some sort to initialise various services that you’ll need – moreover the easiest way to access these services on other servlets is to use singletons, and we all know why singletons are bad!

So, I ended up using Jetty in an embedded setup, and used to write various wrappers around the configuration so that I could do most of the common things with minimal code. A good example would be setting up a bunch of contexts and a DefaultServlet for regular file serving. However, the way Jetty is written makes it really easy to use from Spring – everything is a simple bean with a bunch of setters.

To start off, let's write down the bean. Notice I've set the init-method attribute to start(). If you don't want Spring to kick off your server, just grab hold of the bean and call start() on it explicitly.

<bean name="WebServer" class="org.mortbay.jetty.Server" init-method="start">
</bean>

Then, let's add some connectors to it:

<property name="connectors">
  <list>
  <bean name="LocalSocket" class="org.mortbay.jetty.nio.SelectChannelConnector">
      <property name="host" value="localhost"/>
      <property name="port" value="8080"/>
  </bean>
  </list>
</property>

You will need some handlers (one of them will be a context handler to serve your servlets). I’ve added a logging handler so that the server logs requests in the same format as apache’s combined log.

<property name="handlers">
  <list>
    <bean class="org.mortbay.jetty.servlet.Context">
      <property name="contextPath" value="/"/>
      <property name="sessionHandler">
        <bean class="org.mortbay.jetty.servlet.SessionHandler"/>
      </property>
      <property name="resourceBase" value="/var/www"/>
      <property name="servletHandler">
        <bean class="org.mortbay.jetty.servlet.ServletHandler">
          <property name="servlets"> <!-- servlet definition -->
            <list>
            <!-- default servlet -->
            <bean class="org.mortbay.jetty.servlet.ServletHolder">
              <property name="name" value="DefaultServlet"/>
              <property name="servlet">
                <bean class="org.mortbay.jetty.servlet.DefaultServlet"/>
              </property>
              <property name="initParameters">
                <map>
                  <entry key="resourceBase" value="/var/www"/>
                </map>
              </property>
            </bean>
            </list>
          </property>
          <property name="servletMappings">
            <list><!-- servlet mapping -->
            <bean class="org.mortbay.jetty.servlet.ServletMapping">
              <property name="pathSpecs">
                <list><value>/</value></list>
              </property>
              <property name="servletName" value="DefaultServlet"/>
            </bean>
            </list>
          </property>
        </bean>
      </property>
    </bean>
    <!-- log handler -->
    <bean class="org.mortbay.jetty.handler.RequestLogHandler">
      <property name="requestLog">
        <bean class="org.mortbay.jetty.NCSARequestLog">
          <property name="append" value="true"/>
          <property name="filename" value="/var/log/jetty/request.log.yyyy_mm_dd"/>
          <property name="extended" value="true"/>
          <property name="retainDays" value="999"/>
          <property name="filenameDateFormat" value="yyyy-MM-dd"/>
        </bean>
      </property>
    </bean>
  </list>
</property>

And that's about it. If you need to add more servlets, all you have to do is add an entry to the ServletHandler's servlets and servletMappings properties.

Now, imagine if I had to get a reference to a DAO, or some other service, in my servlet – it's just going to be a matter of adding a member, exposing it via a setter and whacking the dependency into the servlet's Spring config above. All done in a nice dependency-injected way. No more overriding init() on the servlet and picking up some context attribute via some magic string.

Here’s the whole Spring config – hack away to your needs!

<bean name="WebServer" class="org.mortbay.jetty.Server" init-method="start">
<property name="connectors">
  <list>
  <bean name="LocalSocket" class="org.mortbay.jetty.nio.SelectChannelConnector">
    <property name="host" value="localhost"/>
    <property name="port" value="8080"/>
  </bean>
  </list>
</property>
<property name="handlers">
  <list>
    <bean class="org.mortbay.jetty.servlet.Context">
      <property name="contextPath" value="/"/>
      <property name="sessionHandler">
        <bean class="org.mortbay.jetty.servlet.SessionHandler"/>
      </property>
      <property name="resourceBase" value="/var/www"/>
      <property name="servletHandler">
        <bean class="org.mortbay.jetty.servlet.ServletHandler">
          <property name="servlets"> <!-- servlet definition -->
            <list>
            <!-- default servlet -->
            <bean class="org.mortbay.jetty.servlet.ServletHolder">
              <property name="name" value="DefaultServlet"/>
              <property name="servlet">
                <bean class="org.mortbay.jetty.servlet.DefaultServlet"/>
              </property>
              <property name="initParameters">
                <map>
                  <entry key="resourceBase" value="/var/www"/>
                </map>
              </property>
            </bean>
            </list>
          </property>
          <property name="servletMappings">
            <list><!-- servlet mapping -->
            <bean class="org.mortbay.jetty.servlet.ServletMapping">
              <property name="pathSpecs">
                <list><value>/</value></list>
              </property>
              <property name="servletName" value="DefaultServlet"/>
            </bean>
            </list>
          </property>
        </bean>
      </property>
    </bean>
    <!-- log handler -->
    <bean class="org.mortbay.jetty.handler.RequestLogHandler">
      <property name="requestLog">
        <bean class="org.mortbay.jetty.NCSARequestLog">
          <property name="append" value="true"/>
          <property name="filename" value="/var/log/jetty/request.log.yyyy_mm_dd"/>
          <property name="extended" value="true"/>
          <property name="retainDays" value="999"/>
          <property name="filenameDateFormat" value="yyyy-MM-dd"/>
        </bean>
      </property>
    </bean>
  </list>
</property>
</bean>

Converting PEM certificates and private keys to JKS

If there is one irritating, arcane thing about Java, it is its SSL and crypto framework. It is a mess. I remember using openssl as a library about 3-4 years ago in a project that was pretty crypto-heavy, and their library can be used by any junior developer – it's that simple to use.

However, Java's crypto framework is just absolutely irritating to use – tons of unnecessary boilerplate, and not enough self-discovery of file formats (as an example). Try to do SSL client certificate authentication from the ground up and you'll know what I mean. Knife, wrist – sound familiar?

Last night, I had to convert some PEM formatted certificates and private keys to JKS (I was getting SSL nicely configured under Jetty). I remember doing this a few years back and there were mountains of issues to jump across, and I did pull my hair out back then. Last night was no different. However, I did manage to solve it, and ended up with much less hair.

So, to save everyone else the trouble (and their hair!), I’m jotting down some notes here on how to convert a certificate and private key in PEM format into Java’s keystore and truststore in JKS format.

The Keystore

If we’re starting with PEM format, we need to convert the certificate and key to a PKCS12 file. We’ll use openssl for that:

Remember to use a password for the command below, otherwise, the Jetty converter (the following step) will barf in your face!

openssl pkcs12 -export -out cert.pkcs12 \
  -in cert.pem -inkey key.pem

Once that's done, you need to convert the PKCS12 file to a JKS. Here, I will be using a small utility bundled with Jetty called PKCS12Import. You can download the necessary library (you'll need the main jetty.jar), which can be a huge download for such a small thing, or just grab the jar from here. Run the following command, entering the password from the step above and your keystore password:

java -cp /path/to/jetty-6.1.7.jar \
  org.mortbay.jetty.security.PKCS12Import \
  cert.pkcs12 keystore.jks

The Truststore

Next, you’ll almost definitely need to import the certificate into your truststore whenever you need to do anything related to SSL.

First, export the certificate as a DER:

openssl x509 -in cert.pem -out cert.der -outform der

Then import it into the truststore:

keytool -importcert -alias mycert -file cert.der \
  -keystore truststore.jks \
  -storepass password

And that’s it! You have your key in the keystore, and your certificate in the truststore. Hope this helps some of you out there.

Free Subversion Hosting

Many people over the past few months have been asking the same questions over and over again about the services over at XP-Dev.com. I don't mind answering them with the same answers, but I think it is time to put all of these questions in one place and discuss them.

Why are you offering Subversion Hosting for free? Is it too good to be true?

Let me set something straight:

I offer it free because I really do not believe that anyone should pay for something as simple to set up and run as Subversion.

Here is the reality: I set up Apache with mod_dav_svn, mod_dav, mod_ssl and mod_auth_mysql once. Believe me: only once, and I never ever ever ever (ever!) touched it again. No, I am not kidding – only once! No tinkering needed; it just runs like Forrest Gump (no pun intended to all you Gump fans out there).

It does cost $$$ to host it, including my time to add more features. Disk space and bandwidth are getting cheaper. They are not free, but if you average the cost across the number of users on XP-Dev.com, the figure looks really, really small. It is a cost nonetheless, which I'll try to cover below.

So, we've established it does cost money. How are you covering these costs? Are you really rich?

OK. I wish I was rich, but the truth is – I am not. I could claim I was rich and lie to you all, but then I would not get any glory every time I look at my monthly bank statements.

So, where does the money come from to pay for the services ? Well, at the moment, I am paying for it. But I won’t be doing this forever.

I have got a few models to generate revenue and these models will be implemented in the next few months. I can’t reveal them to the public just yet, but rest assured that the usage of Subversion and project tracking on XP-Dev.com will always remain free. This is how I started and envisaged XP-Dev.com, and that is how it will always be.

Free Subversion Hosting and Project Tracking on XP-Dev.com is a life-time guarantee.

You're offering a free service. There's a catch to it, right? Are you selling our code to someone else?

No. Nada. No catch. I am not a petty code trader. I don't go around knocking on other people's doors saying "PHP codez $4 per line! .. $3.50 per line! .. $3.40 per line! ..". I could not be less bothered about what everyone else is coding. I have my own ideas to push forward and materialise (one of them is XP-Dev.com; there are a lot more in the pipeline).

So, your code is safe on our servers. No one other than the people you have permissioned can look at your repositories. We do have backups that run every night and are copied off-site, but they are all encrypted before leaving the server.

I put all my code on XP-Dev.com. I am a consumer of my own service. I believe that anyone who offers a service should always be their own user/client/customer. You should see your service from the customer's point of view.

If someone else looked at my code and data, I’d be really worried. I respect that tremendously and try my very best to lock down the server.

What you see is what you get – WYSIWYG. There are no catches at all. Your code and data are safe. We have a “no prying eyes” and “mind your own business” policy.

OK. So it is a genuine service that is FREE with no strings attached. Then I suppose it will have to be an overloaded, slow service?

Never! This is one of the things that comes from being a consumer of your own service. If the services do get slow, there's going to be one really noisy, angry, verbal user – me. And I'm really scared of him.

On a serious note, I'd be disappointed with myself if the service ever sank to an unacceptable quality. At the moment it's fast and quick, and I intend on keeping it that way. If it ever becomes slow, I'll be at the front of the queue shouting.

I'm not too sure if this is a good thing or a bad thing – I've only ever worked in the front office of investment banks, building real-time (well, near real-time) trading and pricing systems. They are all high-performance, scalable systems. The systems I work on can cost a trader anywhere between $100,000 and $500,000 if latency goes up a nudge above 10ms (yes, that's milliseconds!). XP-Dev.com is a testament to my experience building & architecting these crazy systems (trust me, they are crazy!). If performance degrades, it will be a major failure on my part, and I'm a really proud person :) .

It is a great service. How can I help?

This reply is a cliché. There are a few ways you can help.

If you are not a user, register now!

If you are a user, and have any problems, queries or just want to say thank you, then please tell me, or email admin@xp-dev.com. Every single non-spam email that goes there gets a reply. If you don’t get a reply in a few hours, then it’s probably SpamAssassin acting up. You should use this form instead.

If you are a user, or not even one just yet – you can help by telling your friends, mom, dad, brothers, sisters, relatives, neighbours, cats, dogs, fish and everyone else about XP-Dev.com. Digg it, Buzz it, Reddit. Do whatever. Just keep spreading the word. I really appreciate it.

If you have any other questions or concerns, please post them as comments to this blog entry, or do contact me directly.

Python and Multi-threading

It has been a few days since Python 2.6 came out, and the word on the street is that it's meant to ease the transition to Python 3000. Python 3000 is not backwards compatible with the 2.x releases. I haven't had much time on my hands to get down and dirty with the new 2.6 release, but I have had some time to read up on it.

Most people know that Python has a threading API that is pretty darn close to Java's. However, the way it has been implemented, all threads need to grab hold of the Global Interpreter Lock to ensure that only one thread at any one time can execute within the Python VM. This is to ensure that all threads have the same "view" of all variables. Apparently they tried to avoid this by making the Python VM thread safe, but it took a terrible performance hit.

Java gets around this by having a rather complex memory model within the Java VM, where each thread has its own working copy of shared variables. That's why you have to synchronize various sections of your code to ensure that threads see the same variable states. I highly recommend reading Doug Lea's article on synchronization and the Java Memory Model for anyone who wants to write very intensive multi-threaded applications in Java.

So, what are the implications of having to grab hold of the Global Interpreter Lock in Python? The problem is that it is not TRUE multi-threading. You, as the programmer and designer (you DO design your solutions first, right?), will have to plan when threads should just go to sleep and allow other threads to run. The VM will not do this for you, and one might say that it really is closer to a single-threaded VM. From past experience, I've found Python's threads to be really useful when I'm making blocking calls (e.g. grabbing a DB connection, or calling blocking APIs (yuck!)), so I can do something else in the background while the main thread is sleeping. You could get around this problem by using sub-processes, but there was no easy way to do it, and you had to add a lot of boilerplate code every single time. There was just no support for a clean, true multi-threaded interface out of a standard installation.

Now, in Python 2.6, there's a new package for creating sub-processes called multiprocessing. After a quick glance, it looks very similar to the threading API, BUT instead of running threads, it creates a child process which has its own memory and, in turn, does not need to share a Global Interpreter Lock. My own prediction is that this comes at the cost of creating a new process and some memory-space efficiency. However, you do end up with a TRUE multi-threaded application that really uses all the available processor cores on a multi-core CPU. Considering that RAM is getting cheaper and processors are getting more cores, I think this is a fair trade-off.

As always – and this applies to Java as well – writing a true multi-threaded application is not trivial, so always do your homework before you get started! In the past, I always fell back to Java for the more intensive applications I wrote, because I thought creating sub-processes in Python was too tedious. From now on, I have no excuses! The new package in Python 2.6 looks very neat and removes the need to write tons of boilerplate.

Ext3 – handling a large number of files in a directory

If you've used Linux in the past, I am pretty sure that you've heard of the Ext3 file system. It is one of the most common file system formats out there, used mainly on Linux-based systems.

I've noticed something really annoying about how it handles a large number of files in a single directory. Essentially, I have a directory with almost a million files, and I found that creating a new file in this directory took ages (in the region of tens of seconds), which is not ideal at all for my purposes.

After some reading and much research, I learnt that Ext3 stores directory indices in a flat table, and this causes much of the headache when a directory has many files. There are a couple of options.

First, restructure the directory so that it does not contain that many files. I did some tests, and on a default (untuned) Ext3 partition, each subsequent write degrades horribly past about 2000 files. So keeping each directory to within 2000 files should be fine.
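
A common way to do that restructuring (nothing Ext3-specific; all names here are invented for illustration) is to shard files into subdirectories keyed on a hash of the filename, so no single directory grows large:

```python
import hashlib
import os

def sharded_path(root, filename, levels=2):
    """Map 'avatar123.png' to something like root/ab/cd/avatar123.png,
    so each directory holds only a small, bounded slice of the files."""
    digest = hashlib.md5(filename.encode()).hexdigest()
    parts = [digest[i * 2:i * 2 + 2] for i in range(levels)]
    directory = os.path.join(root, *parts)
    os.makedirs(directory, exist_ok=True)
    return os.path.join(directory, filename)

path = sharded_path("/tmp/shards", "avatar123.png")
print(path)  # e.g. /tmp/shards/xx/yy/avatar123.png (xx/yy come from the hash)
```

With two levels of 256 buckets each, a million files averages roughly 15 per directory – comfortably under the 2000-file mark.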

Second, enable the dir_index option on the Ext3 file system. Run the following as root and you should find that things improve a lot. Do note that dir_index only applies to directories created after it is enabled (run e2fsck -fD on the unmounted partition to re-index existing directories), and that the indexing takes up more space, but then hard disk space is not too expensive nowadays:

$ sudo tune2fs -O dir_index /dev/hda1

Finally, you could just use something like ReiserFS, which stores directory contents in a balanced tree – pretty darn fast, and you don't have to muck around tweaking things.

If you've got your main partition as Ext3, and can't really afford to reformat it as ReiserFS, there is an alternative: create a blank file, format it as a ReiserFS file system and mount it using loopback.

So, let's create the file first. How big depends on how much data you need to handle; in this example, I'll just create a ~100MB file full of zeros:

$ dd if=/dev/zero of=reiser.img bs=1k count=100000

Next, format the file using ReiserFS as below. It will complain about the file ‘reiser.img’ not being a special block device (and we know that!). Just say yes and carry on.

$ mkreiserfs -f reiser.img

Finally, mount it where you would like to read/write files into it (need to do this as root):

$ sudo mount -t reiserfs -o loop reiser.img /tmp/listdir

You might need to do some chown so that your normal user can write into it. Moreover, if you need it mounted at boot, do remember to put it in /etc/fstab!

FYI, I used the Python script below to see how long it took to write new files:

import os
import time

count = 1000000
total = 0.0
for i in xrange(count):  # xrange avoids building a million-element list up front
	if i % 1000 == 0:
		print 'Creating %i' % i
	start = time.time()
	open('/tmp/listdir/%s' % i, 'w').close()  # time just the file creation
	total += (time.time() - start)
print 'Avg is %0.8f' % (total / count)