Posts Tagged ‘Python’

Calculating CPU Utilisation in Linux

Saturday, January 31st, 2009

Metrics are really useful. They let you monitor performance, and based on the metrics you’ve gathered over time, you can make informed decisions on improving it. One of these metrics is CPU utilisation.

Procfs in Linux is full of information, and CPU utilisation can be calculated from the contents of /proc/stat. Have a look at man proc for more details.

Here’s a small Python script that reads /proc/stat and prints out the CPU utilisation percentages once a second.

#!/usr/bin/python
import time

FNAME='/proc/stat'

def readBody():
    fp = open(FNAME, 'r')
    lines = []
    try:
        lines.extend([l.strip() for l in fp])
    finally:
        fp.close()
    return lines

def splitBody():
    lines = []
    lines.extend(l.split() for l in readBody())
    return lines

class CPUTime:
    user = 0
    nice = 0
    system = 0
    idle = 0
    total = 0

    def parse(self, line):
        self.user = long(line[1])
        self.nice = long(line[2])
        self.system = long(line[3])
        self.idle = long(line[4])
        self.total = float(self.user + self.nice + self.system + self.idle)

    def __repr__(self):
        return 'user=%s, nice=%s, sys=%s, idle=%s, total=%s' % (self.user, self.nice, self.system, self.idle, self.total)

    def usageUser(self):
        return self._doPercentage(self.user)

    def usageNice(self):
        return self._doPercentage(self.nice)

    def usageSystem(self):
        return self._doPercentage(self.system)

    def usageIdle(self):
        return self._doPercentage(self.idle)

    def delta(self, other):
        self.user -= other.user
        self.nice -= other.nice
        self.system -= other.system
        self.idle -= other.idle
        self.total -= other.total

    def copy(self):
        t = CPUTime()
        t.user = self.user
        t.nice = self.nice
        t.system = self.system
        t.idle = self.idle
        t.total = self.total
        return t

    def _doPercentage(self, a):
        return a / self.total * 100.0

def main():
    print 'Collecting first sample'

    first = CPUTime()
    first.parse(splitBody()[0])

    while True:
        time.sleep(1)
        second = CPUTime()
        second.parse(splitBody()[0])
        secondCopy = second.copy()
        second.delta(first)
        print 'user=%s, nice=%s, sys=%s, idle=%s' % (second.usageUser(), second.usageNice(), second.usageSystem(), second.usageIdle())
        first = secondCopy

if __name__ == '__main__':
    main()

The source code can also be downloaded from http://svn.xp-dev.com/svn/rs_scripts/trunk/get_cpu.py

Here’s an example run output:

rs@laptop:~/projects/scripts/pub$ ./get_cpu.py
Collecting first sample
user=2.5, nice=0.0, sys=0.5, idle=97.0
user=0.995024875622, nice=0.0, sys=1.49253731343, idle=97.5124378109
user=1.96078431373, nice=0.0, sys=0.490196078431, idle=97.5490196078
user=0.980392156863, nice=0.0, sys=0.980392156863, idle=98.0392156863
user=1.9512195122, nice=0.0, sys=0.975609756098, idle=97.0731707317
user=1.9801980198, nice=0.0, sys=0.990099009901, idle=97.0297029703
user=2.89855072464, nice=0.0, sys=1.44927536232, idle=95.652173913
user=2.42718446602, nice=0.0, sys=1.45631067961, idle=96.1165048544
user=6.34146341463, nice=0.0, sys=1.9512195122, idle=91.7073170732
user=1.9512195122, nice=0.0, sys=2.43902439024, idle=95.6097560976
user=3.98009950249, nice=0.497512437811, sys=1.49253731343, idle=94.0298507463
user=1.94174757282, nice=0.0, sys=0.970873786408, idle=97.0873786408
user=2.94117647059, nice=0.0, sys=0.490196078431, idle=96.568627451

Feel free to take it and do something useful with it. It’s in the public domain.
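The script above targets Python 2 (it relies on long and the print statement). As a rough Python 3 sketch of the same delta calculation, the functions below work on two raw "cpu" lines instead of reading /proc/stat directly, so they can be tried on any machine. The names parse_cpu_line and usage_between and the sample numbers are made up for illustration, and like the original it only looks at the first four counters (user, nice, system, idle), ignoring iowait and the rest.

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into integer counters."""
    fields = line.split()
    if not fields or fields[0] != 'cpu':
        raise ValueError('expected the aggregate "cpu" line')
    user, nice, system, idle = (int(value) for value in fields[1:5])
    return {'user': user, 'nice': nice, 'system': system, 'idle': idle}


def usage_between(first, second):
    """Return each counter's share (in percent) of the time between two samples."""
    deltas = {name: second[name] - first[name] for name in first}
    total = sum(deltas.values())
    if total == 0:
        # No ticks elapsed between the samples; avoid dividing by zero.
        return {name: 0.0 for name in deltas}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


# Hypothetical samples taken one second apart (values invented for illustration).
first = parse_cpu_line('cpu  100 0 50 850 0 0 0')
second = parse_cpu_line('cpu  102 0 51 947 0 0 0')
print(usage_between(first, second))
```

Working on the difference between two samples matters because the counters in /proc/stat are cumulative since boot; a single reading only gives the average since the machine started.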

Python for IntelliJ

Saturday, November 22nd, 2008

JetBrains have released the long-awaited Python plugin for IntelliJ. I took it for a test drive, and it all looks pretty much OK for a first cut. I’m sure there will be a ton of added features in the coming months. Having said that, it is quite usable at the moment. There is no debugger for now and I wouldn’t use it for complex, mature projects as yet. But kudos to them! It’s a very good start!

Source http://xkcd.com/353/

Python and Multi-threading

Tuesday, October 7th, 2008

It has been a few days since Python 2.6 came out, and the word on the street is that it’s meant to ease the transition to Python 3k. Python 3k is not backwards compatible with the 2.x releases. I haven’t had much time on my hands to get down and dirty with the new 2.6 release, but have had some time to read up on it.

Most people know that Python does have a threading API that is pretty darn close to Java’s. However, the way it has been implemented is that all threads need to grab hold of the Global Interpreter Lock to ensure that only one thread at any one time can execute within the Python VM. This is to ensure that all threads have the same “view” of all variables. Apparently they tried to avoid this by making the Python VM thread safe, but it did get a terrible performance hit.

Java tends to get around this by having a rather complex memory model within the Java VM where each thread has its own virtual memory. That’s why you have to synchronize various sections of your code to ensure that threads see the same variable states. I highly recommend reading up on Doug Lea’s article on synchronization and the Java Memory Model for anyone who wants to write very intensive multi-threaded applications in Java.

So, what are the implications of having to grab hold of the Global Interpreter Lock in Python? The problem is that it is not TRUE multi-threading. You, as the programmer and designer (you DO design your solutions first, right?), will have to plan when threads should just go to sleep and allow other threads to run. The VM will not do this for you, and one might say that it really is closer to a single-threaded VM. From past experience, I’ve found Python’s threads to be really useful when I’m making blocking calls (e.g. grabbing a DB connection, blocking APIs (yuck!)), and can do something else in the background while the main thread is sleeping. You could get around this problem by using sub-processes, but there was no easy way to do it, and you had to add a lot of boilerplate code every single time. There was just no support for a clean, true multi-threaded interface out of a standard installation.

Now, in Python 2.6, there’s a new package for creating sub-processes called multiprocessing. At a quick glance, it looks very similar to the threading API, BUT instead of running threads, it creates a child process which has its own memory and, in turn, does not need to share its Global Interpreter Lock. My own prediction is that this comes at the cost of process creation time and memory efficiency. However, you do end up with a TRUE multi-threaded application that really uses all the available processor cores on a multi-core CPU. Considering that RAM is getting cheaper, and processors are getting more cores built into them, I think this is a fair trade-off.
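To make that concrete, here is a minimal sketch of spreading a CPU-bound job over several processes with multiprocessing.Pool. It is written for Python 3 rather than the 2.6 release discussed here, and count_primes is just an invented toy workload, not anything from the standard library.

```python
import multiprocessing


def count_primes(limit):
    """CPU-bound toy workload: count the primes below limit by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def count_primes_parallel(limits, workers=None):
    """Run count_primes for each limit in a pool of worker processes."""
    # workers=None lets the pool pick one process per available CPU.
    with multiprocessing.Pool(processes=workers) as pool:
        return pool.map(count_primes, limits)


if __name__ == '__main__':
    # The guard keeps child processes from re-running this when they import the module.
    print(count_primes_parallel([5000, 10000, 15000, 20000]))
```

Each call to count_primes runs in its own interpreter process, so the workers are not competing for a single Global Interpreter Lock. The arguments and results are pickled and passed between processes, which is part of the overhead mentioned above.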

As always, and this applies to Java as well – writing a true multi-threaded application is not trivial, so always do your homework before you get started! In the past, I always had to fall back on Java for the more intensive applications that I wrote because I always thought creating sub-processes in Python was too tedious. From now on, I have no excuses! The new package in Python 2.6 looks very neat and removes the need to write tons of boilerplate.


Mersenne Prime and Python

Saturday, September 6th, 2008

A couple of weeks ago, I read the Slashdot entry on the discovery of the 45th Mersenne prime. In a nutshell, a Mersenne prime is a prime number of the form: Mn = 2^n - 1

This got me thinking – how does one calculate such a large number (the 44th Mersenne prime has 9.8 million digits)? The obvious problem in calculating this number is that the largest integer type most compilers and languages support natively is an unsigned long long, which is nowhere near big enough to hold it. At this point I recalled reading somewhere that Python has support for handling numbers of any size. I knocked up the following script and ran it on my dev “box”:

#!/usr/bin/env python
import sys

if len(sys.argv) != 2:
	print 'Need power'
	sys.exit(1)

exp = long(sys.argv[1])
print 'Exponent %s' % exp
ans = pow(2, exp)

fp = open('out.write', 'w')
try:
	fp.write(str(ans))
finally:
	fp.close()

print 'Done'

All it does is grab the exponent from a command line parameter and write the output to a file called out.write. I only cared about calculating the power (2^n), as the subtraction is trivial. The processor is a 1GHz Transmeta Crusoe with 256MB RAM. Not exactly the fastest CPU on the block, but it gets the job done as my development machine. CPU details below:

processor	: 0
vendor_id	: GenuineTMx86
cpu family	: 6
model		: 4
model name	: Transmeta(tm) Crusoe(tm) Processor TM5800
stepping	: 3
cpu MHz		: 993.322
cache size	: 512 KB
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 1
wp		: yes
flags		: fpu vme de pse tsc msr cx8 sep cmov mmx longrun lrti constant_tsc up
bogomips	: 2026.55
clflush size	: 32

In total, it took 4020 minutes (2 days 19 hours) to calculate the 44th Mersenne prime on this weakling machine:

rs@small:~/apps$ time ./dopow.py 32582657
Exponent 32582657
Done

real    4020m29.975s
user    3836m26.106s
sys     5m49.542s

And the number has 9,808,358 digits!

rs@laptop:~/mounts/petite/apps$ wc out.write
      0       1 9808358 out.write

Have a look at the file (10MB RAW), or grab the compressed version (5MB ZIP).

I am impressed by Python’s ability to handle such large numbers. Kudos to Guido and his team of brainiacs for coming up with this feature.
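As a sanity check on the digit count, the number of decimal digits can be worked out from the exponent alone, without building the number: 2^n has floor(n * log10(2)) + 1 digits, and subtracting 1 never changes the count because a power of two is never a power of ten. A small Python 3 sketch (mersenne_digits is just a name picked for this example):

```python
import math


def mersenne_digits(exponent):
    """Number of decimal digits in 2**exponent - 1, computed via logarithms."""
    return int(math.floor(exponent * math.log10(2))) + 1


# Exponent of the 44th Mersenne prime, as used in the run above.
print(mersenne_digits(32582657))
```

This runs instantly, whereas the script above has to compute the full number and convert it to a decimal string before the digits can be counted.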

IntelliJ IDEA vs Eclipse

Wednesday, May 21st, 2008

At the risk of starting a religious war between these two schools, I will be bold enough to say that when it comes to Java development, IntelliJ wins hands down, and I’ll explain my perspective below.

But, before moving on to the gist of it, let me add some context and background to my current situation. Being a geek at heart, I am always looking for even geekier ways of doing things. For example, rather than just deploying one of the few hundred blogging applications out there, I’ve written my own using Django. It’s not because I have a masochistic mind, but because I find it fun, exciting and challenging. Based on that, I can basically summarise my professional (IT and programming) needs as: geeky, fun and exciting. Everything else comes second (yes, that includes remuneration and benefits). I’ll jump on to anything that is interesting and challenging. With that out of the way, let’s continue.

I’ve been building front office trading applications for a good part of 2.5 years now (straight out of college). All of it in Java. No J2EE, just all in good ol’ plain J2SE. In a short span of time, I’ve managed to use so many Java technologies – Hibernate, Jetty, Tomcat, JBoss (we never used it as a J2EE container), Log4j, Slf4j, commons-*, etc. I’ve learnt how to write performant code, to think twice on how a JVM might compile this code, and how JIT might enhance it. I’ve learnt to plan and code very quickly as well. Summing all of these up, I need an excellent tool for programming to ensure that I spend all my time solving the problem at hand, rather than solving the problems of the tool.

I had used Eclipse in college, albeit sparingly. It was OK, and I thought that IBM did do a great job in initiating it. Then I graduated and started at my first job. The team had standardised on IntelliJ, and for the first time in my life, I started using it. I saw how team members around me were using it and I picked up some shortcuts and tips on how to code quickly by leveraging IntelliJ features. At some point I saw this one guy use IntelliJ and he was literally killing it. Till that point, I had never seen so much code being generated in such a short time, and it was useful code! Not templated rubbish and stuff that you’d add to an API “because someone might need it in the future”. He was really productive. So, I did what I think has changed the way that I code: I copied him (shamelessly)! I observed and learnt what he was doing, and why. How his thoughts flowed while coding and how he tied that in with his tool – IntelliJ. Safe to say – I learnt more about IT and programming in the first 4 months of my job than in 4 years of college.

From then on, I did not look back on Eclipse at all for Java. I even bought a personal license for IntelliJ. It was one of the most productive times in my professional career. I took ownership of the team sometime early 2007, and was advocating IntelliJ to everyone in the firm. Everyone got about doing their work, and maybe once in a while had a tiff with IntelliJ, but that’s about it. Best part of all was that it was fun coding with IntelliJ. It does what I’m expecting it to do, and quickly at that. Refactoring is a sheer pleasure as well.

Towards the end of last year, I had a bit of a problem. One of my servers had 2GB of physical RAM, and I needed to chuck in 10-20 JVMs for one of my personal projects, and that just wouldn’t scale well. FYI, a busy running JVM does take quite a lot of heap space. So, I started looking at Python as a lightweight alternative. I really like Python as a language, and am advocating its use in building low latency trading apps (tied in with the use of Psyco). The only problem was that there wasn’t an IntelliJ equivalent for Python. All the IDEs just sucked! I wanted to be able to kill an IDE and spew out just as much code as I did when programming Java with IntelliJ.

This was when I started using Eclipse heavily. I settled on PyDev with Extensions and it’s actually not too bad compared to the other Python IDEs out there. Eclipse’s nuisances aside, the plugin is pretty polished. It has code completion, some refactoring and satisfactory code browsing, and as a whole it was OK. I started to convert most of my personal Java projects to Python, though not at the level of productivity that I had desired. IntelliJ 8 is due to have good Python support (including debugging). If that’s satisfactory, I’m ditching Eclipse for my personal projects.

At this point, I was using IntelliJ at work, though I did not code much for a year. I did build prototypes, but most of my time was spent in managing the team and running the show. I used Eclipse outside work for Python. In effect, I ended up spending most of my coding time in Eclipse.

At the beginning of this year, I moved firms and am back to coding full time. At this new place, the mandate is to use Eclipse. What a shocker it was! My productivity almost halved. What took me 30 minutes to complete in IntelliJ took an hour in Eclipse, due to its clunky refactoring, code browsing, searching and interface. At this point I realised what the problem with Eclipse was: it’s just trying to satisfy too many people. It’s trying to be the mother of all IDEs and support every bloody language in the world. In my opinion, IntelliJ have perfected this – for a long time, they’ve focused primarily on Java. They’ve done an amazing job, and at the moment are just replicating their success in other languages. Being unhappy with coding Java in Eclipse, I went about the firm asking why there aren’t many IntelliJ users – the main reason: cost! Apparently, the cost to benefit ratio of IntelliJ was appalling.

I am not entirely sure how they conducted their research, but in my humble opinion, there is almost no way that you can make a statement like that. While IntelliJ costs money, you quickly recover that cost from the individual’s higher contribution levels. In the big picture, it’s an extremely small cost ($599 per enterprise license) compared with the other costs of keeping a developer seated in an office. You have overheads, admin costs and other misc things to worry about, which really makes the one-off $599 minuscule. In fact, I think Eclipse actually costs more – the time spent working with the tool, instead of the problem at hand, has an internal cost. There is no way in Eclipse to do a big refactor of a few code bases without changing perspectives (most common for me is switching between Java and Team Synchronisation). The lower levels of productivity have more costs associated with them. What about the repercussions of delivering late? Reputational risk? I can just carry on and on about the extra costs that come with Eclipse. Plenty of folks in management who decide on Eclipse over IntelliJ just don’t realise that the $0 on Eclipse is not a true $0. At the same time, the $599 for IntelliJ is not a true $599 – there are additional costs with using IntelliJ as well, but in the big picture of hiring and keeping a developer in an office, it amounts to something really small. The benefits from using IntelliJ are pretty evident – it’s so much simpler to use, and developers tend to spend more time on the problems at hand rather than on the IDE itself. How did they carry out the cost-benefit analysis? I can only assume that they took the cost of Eclipse at a flat $0, which really does not reflect reality.

So, here I am. For the past 4 months, during my day job I am using Eclipse for Java and at home using Eclipse for Python, until this morning when I added a tutorial section to XP-Dev.com (which is still in Java using the amazing Wicket framework) and fired up IntelliJ for the first time. I added some code here and there, and only took a few minutes to add the section in and deploy it out. I was so relieved and excited to use IntelliJ, that I just had to rant about it and how Eclipse has been picked as a standard Java IDE in many organisations due to costs.

I am waiting eagerly for Python support to improve in IntelliJ, and then I can start using it again for my personal projects. On the work front – I never give up without a fight. I will carry on advocating IntelliJ for the right reasons, even if it’s all about the costs. I do want to put back the fun in coding at the workplace! I do want to get the firm (or at least my immediate team members) on to higher levels of productivity.

Ext3 – handling large number of files in a directory

Saturday, May 10th, 2008

If you’ve used Linux in the past, I am pretty sure that you’ve heard of the Ext3 file system. It is one of the most common file system formats out there, used mainly on Linux based systems.

I’ve noticed something really annoying about how it handles large numbers of files in a single directory. Essentially, I have a directory with almost a million files and I found that creating a new file in this directory took ages (in the region of tens of seconds), which is not ideal at all for my purpose.

After some reading, and much research, I learnt that Ext3 stores directory entries in a flat list by default, and this causes much of the headache when a directory has many files. There are a few options.

One, restructure the directory so that it does not contain that many files. I did some tests, and on a default (untuned) Ext3 partition, each subsequent write degrades horribly past the 2000-file mark. So, keeping each directory to within 2000 files should be fine.
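One common way of doing that restructuring is to spread files over nested sub-directories chosen from a hash of the file name. The sketch below (Python 3) is only an illustration of the idea, not what I ran for the tests; sharded_path and create_sharded are invented names, and the two levels of two hex characters are an arbitrary choice that gives 65,536 buckets, or roughly 15 files per directory for a million files.

```python
import hashlib
import os


def sharded_path(root, name, levels=2, width=2):
    """Map a file name to root/<xx>/<yy>/name using the leading hex digits of its hash."""
    digest = hashlib.sha256(name.encode('utf-8')).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return os.path.join(root, *parts, name)


def create_sharded(root, name):
    """Create an empty file at its sharded location, making directories as needed."""
    path = sharded_path(root, name)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    open(path, 'w').close()
    return path


# Hypothetical root directory and file name, for illustration only.
print(sharded_path('/tmp/listdir', '123456'))
```

Because the sub-directory is derived from the name itself, a reader can recompute the path later without keeping a separate index.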

Two, enable the dir_index option on the Ext3 file system. Run the following as root and you should find that things improve a lot. Note that directories that already exist are only indexed once e2fsck -fD has been run on the unmounted file system. The index will also take up more space, but then hard disk space is not too expensive nowadays:

$ sudo tune2fs -O dir_index /dev/hda1

Finally, just use something like ReiserFS which stores directory contents in a balanced tree, which is pretty darn fast and you don’t have to muck around tweaking things.

If you’ve got your main partition as an Ext3, and can’t really afford to reformat it into ReiserFS, there might be an alternative: create a blank file and format that as a ReiserFS file system and mount it using loopback.

So, let’s create the file first. Its size depends on how much data you need to handle; in this example, I’ll just create a ~100MB file full of zeros:

$ dd if=/dev/zero of=reiser.img bs=1k count=100000

Next, format the file using ReiserFS as below. It will complain about the file ‘reiser.img’ not being a special block device (and we know that!). Just say yes and carry on.

$ mkreiserfs -f reiser.img

Finally, mount it where you would like to read/write files into it (need to do this as root):

$ sudo mount -t reiserfs -o loop reiser.img /tmp/listdir

You might need to chown the mount point so that your normal user can write into it. Moreover, if you need it mounted at boot, do remember to put it in /etc/fstab!

FYI, I used the Python script below to see how long it took to write new files:

import os
import time

count = 1000000
total = 0.0
for i in range(count):
	if i % 1000 == 0:
		print 'Creating %i' % i
	start = time.time()
	open('/tmp/listdir/%s' % i, 'w').close()
	total += (time.time() - start)
print 'Avg is %0.8f' % (total / count)

Calculating directory sizes in Python

Tuesday, April 22nd, 2008

A very simple way of getting the total size of a certain directory.

All it does is use Python’s os.path.walk and keep track of the accumulated size while walking the directory tree.

import os
def getDirectorySize(directory):
    class TotalSize:
        def __init__(self):
            self.total = 0

    def visit(totalSize, dirname, names):
        for name in names:
            absFilename = os.path.join(dirname, name)
            if os.path.isfile(absFilename):
                totalSize.total += os.path.getsize(absFilename)

    totalSize = TotalSize()
    os.path.walk(directory, visit, totalSize)
    return totalSize.total

Example usage:

print getDirectorySize('/home/rs')

And the output is:

91904975510

Which means my home directory is about 85.6 GiB (roughly 92 GB)! One thing to note is that it will double count if you have multiple symlinks that point to the same file in the same tree.
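os.path.walk only exists in Python 2. For Python 3, a rough equivalent can be sketched with os.walk. This version makes a different choice from the code above: it skips symlinks altogether and counts hard-linked files only once, which sidesteps the double counting at the price of ignoring anything reachable only through a link. The name get_directory_size is simply a restyling of the function above.

```python
import os
import stat


def get_directory_size(directory):
    """Sum the sizes of regular files under directory, ignoring symlinks."""
    total = 0
    seen = set()  # (device, inode) pairs already counted
    for dirpath, dirnames, filenames in os.walk(directory):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)  # lstat does not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if not stat.S_ISREG(info.st_mode):
                continue  # symlink, socket, device node, etc.
            key = (info.st_dev, info.st_ino)
            if key in seen:
                continue  # another hard link to a file already counted
            seen.add(key)
            total += info.st_size
    return total


if __name__ == '__main__':
    print(get_directory_size(os.path.expanduser('~')))
```

os.walk does not descend into symlinked directories by default, so the walk itself cannot loop.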

XML dict, anyone?

Thursday, April 17th, 2008

I wanted to be able to persist a Python dictionary (dict, essentially a Map in the Java world) as an XML string, and be able to reload it later from that XML and make some sense out of it.

Here’s a very simple example of how you can achieve this.

from xml.dom import minidom
from StringIO import StringIO

VALUE_ATTR = 'value'
class XmlDict(dict):
    def __init__(self, keyToXml, keyFromXml, valueToXml, valueFromXml):
        dict.__init__(self)
        self.keyToXml = keyToXml
        self.keyFromXml = keyFromXml
        self.valueToXml = valueToXml
        self.valueFromXml = valueFromXml

    def fromXml(self, xml):
        self.clear()
        doc = minidom.parseString(xml)
        root = doc.documentElement
        for child in root.childNodes:
            valueBuffer = StringIO()
            for cn in child.childNodes:
                valueBuffer.write(cn.data)
            self[self.keyFromXml(child.tagName)] = self.valueFromXml(valueBuffer.getvalue())

    def toXml(self):
        doc = minidom.getDOMImplementation().createDocument(None, 'dict', None)
        for key, value in self.iteritems():
            wordElement = doc.createElement(self.keyToXml(key))
            valueNode = doc.createTextNode(self.valueToXml(value))
            wordElement.appendChild(valueNode)
            doc.documentElement.appendChild(wordElement)
        return doc.toxml()

The example below basically maps a string to a string:

    d = XmlDict(str, str, str, str)
    d['sds'] = 'ssds'
    d['sdss'] = 'asda'
    print d.toXml()
    d2 = XmlDict(str, str, str, str)
    d2.fromXml(d.toXml())
    print d2

Output is:

<?xml version="1.0" ?><dict><sdss>asda</sdss><sds>ssds</sds></dict>
{'sdss': 'asda', 'sds': 'ssds'}

All you have to do is define your own callables that take keys/values and convert them to and from their string representations.

In fact, by having smarter callables that check the value being persisted, you could get away with chucking anything into the string value (nested XML in attributes is wrong though and should really be avoided).
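To show the callables doing some real work, here is a small Python 3 sketch of the same round trip with integer values. It uses two plain functions instead of the XmlDict class, and dict_to_xml and dict_from_xml are names invented for this example. One assumption carries over from the class above: every key must already be a valid XML element name.

```python
from xml.dom import minidom


def dict_to_xml(data, key_to_xml=str, value_to_xml=str):
    """Serialise a dict as <dict><key>value</key>...</dict>."""
    doc = minidom.getDOMImplementation().createDocument(None, 'dict', None)
    for key, value in data.items():
        element = doc.createElement(key_to_xml(key))
        element.appendChild(doc.createTextNode(value_to_xml(value)))
        doc.documentElement.appendChild(element)
    return doc.toxml()


def dict_from_xml(xml, key_from_xml=str, value_from_xml=str):
    """Rebuild a dict from XML produced by dict_to_xml."""
    root = minidom.parseString(xml).documentElement
    result = {}
    for child in root.childNodes:
        if child.nodeType != child.ELEMENT_NODE:
            continue  # ignore whitespace or comments between elements
        text = ''.join(node.data for node in child.childNodes
                       if node.nodeType == node.TEXT_NODE)
        result[key_from_xml(child.tagName)] = value_from_xml(text)
    return result


# Made-up data: string keys, integer values.
counts = {'apples': 3, 'pears': 7}
xml = dict_to_xml(counts)
print(xml)
print(dict_from_xml(xml, value_from_xml=int))
```

Passing int as value_from_xml is what turns the stored text back into numbers; with the default str, the values would come back as '3' and '7'.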

Python bindings for Berkeley DB (pybsddb)

Thursday, April 17th, 2008

Of late, I have been using the pretty fast and robust Berkeley DB (BDB) (which is now owned by Oracle!) to store some data (in fact, for this project, it’s a LOT of data – around a couple of gigs per db file, and there’s a bunch of db files). I write most of my apps outside work in Python and needed to access BDB files.

There is built-in support for BDB in any modern Python installation and it’s pretty easy to use. However, you won’t benefit from the latest additions unless you’re willing to wait for a new version of Python to be shipped out.

Well, there’s an alternative called pybsddb, and there is support for the latest BDB in the works. The installation is pretty simple:

  1. Install BDB runtime and development packages. I’m using Ubuntu, so that’s pretty straightforward:
    sudo apt-get install libdb4.5 libdb4.5-dev
  2. Download the latest file from here. I downloaded bsddb3-4.6.3.
  3. Extract it out using:
    tar xvf bsddb3-4.6.3.tar.gz
  4. cd bsddb3-4.6.3
  5. Build pybsddb:
    python setup.py build
  6. Run the tests (optional):
    python test.py
  7. Install it:
    sudo python setup.py install

Simple as that. You should be able to create DBEnv() and DB() objects and start manipulating the BDB files!

Dynamic vs Strong Typed Languages

Monday, February 18th, 2008

BTW, I’ve had a change of direction in programming languages in the past few months.

Hardcore programming for me started a while back doing a project in Java during my college days. What I mean by “hardcore” is not generating a lot of code, but instead taking a step back and evaluating a programming language for what it offers – its pros and cons. I really found it difficult to make sense out of Java – there was plenty of boilerplate code that needed to be written to do anything. A good example is having a class with read-only members (very very useful for things like configurations). There is no easy way to ensure that anyone using your class is prohibited from changing the members after construction. The quickest (and it’s not that quick at all) is to make your members private and use getters. Now, imagine a class with a couple of hundred members and nothing but vi/vim as an editor – see what I mean?

Following that, I became a big fan of PHP due to its dynamic typing. In fact, this was so liberating after strongly typed languages that I used PHP for every single thing! BTW, PHP does not solve the problem with Java mentioned above. Then I found myself in the situation where I ended up doing a lot of reflection, functions of functions, and other highly dynamic structures, which is what one would do with a dynamic language. The problem was that it was very difficult to tell what a variable held at any one point, and I ended up going back to writing more boilerplate to check types at runtime (as PHP’s compile-time checks are not so strict).

At this point I graduated and had just begun working at an Investment Bank as a full time Java developer. BTW, if anyone is wondering, the level of programming skills that you have when coming out of college is appalling compared to what is needed for building Front Office trading systems. My advice for all those students pursuing a CS/Info Sys degree: learn how to solve problems and think critically. But that’s another story.

I started using a lot of Java, and I mean A LOT – to the point where I used to have nightmares about NullPointerExceptions (this is just a joke BTW – I am sad, just not that sad). And guess what? I started falling in love with Java again. Compared to the time in college when I was complaining about boilerplate, the difference was IntelliJ (an IDE for Java). I found that IntelliJ generated all the boilerplate for you. This was an amazing step forward. Here I was, happy with all the strong typing that comes with Java and an amazing tool that did all the boilerplate. The example problem with Java above, IntelliJ solves in a few keystrokes – Alt+Insert, then select “Getters” and pick all the members. Moreover, I discovered reflection and generics in Java, which are extremely powerful in getting type-safety and some essence of a dynamic language (provided you know what you’re doing, as you can shoot yourself in the foot if you don’t!).

So, I was back to being an advocate of strongly typed languages, but on one condition – have the right tools at your disposal. Lately, I’m back to using dynamic languages. I’ve been having some very good thoughts on Python, and how it’s such a clean and refreshing approach to dynamic languages compared to PHP, without the behemoth resource usage of a full virtual machine like Java.
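As an aside on the read-only configuration example from earlier in this post: in Python that particular problem takes very little code. One possible approach, sketched here with collections.namedtuple (the Config type and its fields are invented for illustration), gives you members that can be read but not reassigned.

```python
from collections import namedtuple

# Hypothetical configuration type; the field names are made up for this example.
Config = namedtuple('Config', ['host', 'port', 'timeout'])

config = Config(host='localhost', port=8080, timeout=30)
print(config.host, config.port, config.timeout)

try:
    config.port = 9090  # fields of a namedtuple cannot be reassigned
except AttributeError as error:
    print('read-only:', error)
```

This only stops members from being rebound. If a field holds a mutable object such as a list, the list itself can still be changed.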