Posts Tagged ‘Programming’

Debugging python (multi)processing

Thursday, January 7th, 2010

My goal is to get a pdb shell in the worker processes I spawned with Process() from python-processing. The “classic” approach to spawning the pdb shell (calling pdb.set_trace() in the worker) fails miserably:

(Pdb) > /home/redduck666/dev/abj/bin/feeds.py(639)__init__()
-> self.timeout = timeout
Process Process-3:2:
Traceback (most recent call last):
  File "/var/lib/python-support/python2.5/processing/process.py", line 227, in _bootstrap
    self.run()
  File "/var/lib/python-support/python2.5/processing/process.py", line 85, in run
    self._target(*self._args, **self._kwargs)
  File "./feeds.py", line 639, in __init__
    self.timeout = timeout
  File "./feeds.py", line 639, in __init__
    self.timeout = timeout
  File "/usr/lib/python2.5/bdb.py", line 48, in trace_dispatch
    return self.dispatch_line(frame)
  File "/usr/lib/python2.5/bdb.py", line 66, in dispatch_line
    self.user_line(frame)
  File "/usr/lib/python2.5/pdb.py", line 144, in user_line
    self.interaction(frame, None)
  File "/usr/lib/python2.5/pdb.py", line 187, in interaction
    self.cmdloop()
  File "/usr/lib/python2.5/cmd.py", line 130, in cmdloop
    line = raw_input(self.prompt)
ValueError: I/O operation on closed file

The problem here is that processing closes the standard file descriptors of the processes it spawns, so a straightforward approach like that will not work. For the same reason, using sys.__std(out|in|err)__ will not work either.

The solution for me was to explicitly tell Python to use my current stdin/stdout. The ‘r+’ flag is needed as pdb needs to read from stdin.

pdb.Pdb(stdin=open('/dev/stdin', 'r+'), stdout=open('/dev/stdout', 'r+')).set_trace()

I use this on Linux; AFAIK it should work across the Unix world (and is probably horribly broken on Windows).
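For readers on present-day Python, the same idea might look like the sketch below. It is only a sketch under assumptions: make_debugger is a hypothetical helper (not part of pdb), the device paths are the ones from the one-liner above, and the streams are opened with separate "r" and "w" modes instead of 'r+'. Whether /dev/stdin is still attached to your terminal inside a child process depends on the library that spawned it, so treat the defaults as something to verify rather than a guarantee.

```python
import io
import pdb


def make_debugger(stdin=None, stdout=None):
    """Build a Pdb instance that talks to explicitly supplied streams.

    Hypothetical helper for illustration; not part of the pdb module.
    """
    # Fall back to the device files used in the post (Unix only).
    if stdin is None:
        stdin = open("/dev/stdin", "r")
    if stdout is None:
        stdout = open("/dev/stdout", "w")
    return pdb.Pdb(stdin=stdin, stdout=stdout)


# Demonstration with in-memory streams, so nothing blocks on a terminal.
fake_in, fake_out = io.StringIO("continue\n"), io.StringIO()
debugger = make_debugger(fake_in, fake_out)

# Inside a worker function you would call: make_debugger().set_trace()
```

Passing the streams in as arguments keeps the helper testable, and lets you swap in another device if /dev/stdin turns out not to be your terminal.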

AppEngine debugging tip

Saturday, August 22nd, 2009

As I explained in my last blog post, GAE does some very weird stuff with stdout, making it very difficult to print information from arbitrary points in the code. For example, even when I invoked Pdb(stdout=sys.__stdout__), I couldn’t print to my screen from a POST request. I have finally found a fix for that annoyance, which is to hack sys.stdout:

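def set_trace():
    import pdb, sys
    # Put back the real streams that GAE replaced.
    sys.stdout = sys.__stdout__
    sys.stdin = sys.__stdin__
    debugger = pdb.Pdb()
    # Start debugging in the caller's frame, not inside this helper.
    debugger.set_trace(sys._getframe().f_back)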

Routing options

Friday, July 31st, 2009

Ever wanted to do your own Google Maps-like routing implementation? As in, the user tells you where he is, and when and where he wants to go, and you tell him which buses/trains to take? Well, from the open source world I looked at two options, pgRouting and libroutez.

Libroutez gives you a C++ library with Ruby and Python bindings for working with it, while pgRouting gives you SQL stored procedures to work with. I kinda don’t like working with SQL, so my bias was towards libroutez at this point. Since pgRouting operates on the SQL level it kinda makes sense that its data is in an SQL table; libroutez on the other hand builds its own graph which it operates on. While I haven’t had any reason to think anything bad about the graph it creates, I tend to prefer existing, well tested and moderately (hi #nosql!) scalable solutions.

One important difference is the feature set. pgRouting gives you a way to find the route between two end points out of the box, and to route between two arbitrary points without major difficulties. Libroutez provides both of those plus the ability to use schedules, which is a major advantage for it.

Now let’s have a look at the things around the two projects, say the activeness of the community. I asked a question on both mailing lists: on libroutez it took 4 days to get an answer, while on the pgRouting mailing list I got the first one after 10 minutes and the second one after an hour and 10 minutes. Now don’t get me wrong, I don’t blame the developer for not answering my question, but it does show the activeness of the community, or in this case non-activeness. Indeed libroutez seems to have only one developer, while judging by the first screen pgRouting has at least 3 developers. While this per se is not an issue, it does raise concerns about the long term maintainability of the project.

Another issue is the documentation: both have rather poor docs, but pgRouting has more examples you can play with :) . The last thing worth mentioning is the ease of getting things up and running. If you have a recent enough system, libroutez is probably easier to start with. An added bonus here is that no postgres/postgis whatsoever is required. On the other hand, if you don’t have a recent system (both Debian lenny and Ubuntu 8.10 are too old) you are gonna have issues, because libboost-1.35 is too old for libroutez.

Things i hate in django

Sunday, July 26th, 2009

First a disclaimer: I use and like django, occasionally I even advocate it. IMHO its advantages greatly outweigh its disadvantages. To paraphrase brian d foy, never trust someone who can’t find things to hate in a thing he loves. Here is my attempt at explaining the things I hate about django.

My first problem is called reverse. Imagine a situation where, in say “pie.views”, you import something that has a problem; reverse will report “pie.views” as the source of the problem. On the upside, it will report the error :) . The real world example of this is trying to run the “social” project from the newsmixer source code. After you get through the initial problems it throws:

ViewDoesNotExist at /
 
Tried index in module pie.views. Error was: 'module' object has no attribute 'register'

The problem being that “pie.views” doesn’t mention “register” anywhere in its source code :) .

Reverse ain’t really that powerful either. For example, say you have URLs à la http://www.kiberpipa.org/sl/event/2009-jun-11/727/luka-princic-crosshibrid/ and the only thing in there you care about is the id, 727 in this case. Well, you’re out of luck with reverse :) .

My next major class of things I hate can be grouped under the name “bundle”. Say you like the forms library and want to use it in your non-web project?

ImportError: No module named django.utils.html

Oops! It doesn’t work without the rest of django. This is a really small example; for a much bigger one let’s have a look at one of django’s killer features, the admin interface. It doesn’t work without authentication, which in turn requires the django ORM. Those things are by themselves quite a big part of django. I’d love to see django broken down into reusable packages.

When I say that authentication requires the ORM, I mean that a User row in the SQL table has to exist, no matter what your custom auth backend does. Suppose you want to write an LDAP auth module which grants permissions based on the department people are in. The corner case here is what happens when a person changes department. To handle that case reliably, at every login you have to delete the person’s permissions and grant them again the permissions which belong to the current department :/.

DJANGO_SETTINGS_MODULE. WTF? :) The manage.py does handle it, but deployment scenarios don’t have that luxury. Is it really that hard to do some checks and try to automatically determine it? :)
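Until something like that exists, the usual workaround is to set the variable yourself at the top of the deployment entry point, before anything imports django. A minimal sketch, assuming a hypothetical settings module named mysite.settings (substitute your own):

```python
import os

# "mysite.settings" is a placeholder name, not something from this post.
# setdefault leaves the variable alone if the environment already defines it.
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "mysite.settings")

settings_module = os.environ["DJANGO_SETTINGS_MODULE"]
print("settings will be loaded from:", settings_module)
```

Using setdefault rather than a plain assignment means an administrator can still override the value from the outside.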

Having a full text search on the django docs page would make me happier; for example, searching for examples of .extra() is more difficult than it has to be :) . While we are at the django web site, having membership management (password reset/change) for trac would be helpful.

To conclude, I can say that given django’s size I actually expected to have more things to rant about.

bash WTFs

Sunday, July 19th, 2009

Here is my collection of weird bash features, stuff that doesn’t really behave the way I expected it to. In short, stuff that made me go “WTF”.

$ [[ 8 > 12 ]] && echo true
true

Wait, what? 8 is more than 12? What happens here is that bash performs a lexicographical comparison, and since the character 8 sorts after the character 1, it returns true.

$ (( 8 > 12 )) && echo true

The double parentheses cause bash to do arithmetic evaluation, inside which > is a math operator, so this time nothing is printed. Another feature of arithmetic mode is that you don’t have to use the $ to reference variables.
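The same distinction exists in most languages. As a small illustration (Python here, purely as an analogy to the two bash forms above), comparing strings and comparing numbers give opposite answers:

```python
# Strings compare character by character, like [[ 8 > 12 ]] in bash.
as_text = "8" > "12"

# Integers compare by value, like (( 8 > 12 )) in bash.
as_numbers = 8 > 12

print(as_text, as_numbers)
```

This prints True False. If you want to stay inside [[ ]] in bash, the -gt operator does the integer comparison.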

$ var='\'
$ var=`echo "$var" | sed 's/\\/\\\\/g'`
sed: -e expression #1, char 8: unterminated `s' command

Above is the naive way of trying to escape the backslashes in a variable with sed. What really happens here is that bash interprets the backslashes once before they get to sed, so what sed gets to see is 's/\/\\/g'. As far as sed is concerned the escaped slash is not a delimiter, and since the substitution expects 3 delimiters it throws the error.

$ var='\'
$ echo `echo "$var" | sed 's/\\\\/\\\\\\\\/g'`
\\
$ echo $(echo "$var" | sed 's/\\/\\\\/g')
\\

As you can see, the possible solutions to this are either to escape the backslashes one more time, or to simply use the newer form of command substitution. Why does it interpret the backslashes? Why does $(command) behave differently? Since ksh behaves the same way I’d assume this is a burden of history we have to live with, and with $(command) they are fixing it. This is a documented misfeature.

When the old-style backquote form of substitution is used, backslash retains its literal meaning except when followed by ‘$’, ‘`’, or ‘\’. .. When using the $(command) form, all characters between the parentheses make up the command; none are treated specially.

While we are at backslashes let’s try to escape the backslash with awk.

$ echo '\' | awk '{ gsub("\\\\", "replaced"); print; }'
replaced

Wait, what? One would expect that the ‘\\\\’ pattern would match ‘\\’. It turns out this is another documented misfeature: the string is parsed twice, and both times the backslashes are interpreted :| .

When using sub, gsub, or gensub, and trying to get literal backslashes and ampersands into the replacement text, you need to remember that there are several levels of escape processing going on.

First, there is the lexical level, which is when awk reads your program and builds an internal copy of it that can be executed. Then there is the runtime level, which is when awk actually scans the replacement string to determine what to generate.
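The two levels are easier to see when you can inspect the intermediate value. The sketch below uses Python instead of awk, simply because it shows the same double interpretation with a regex written as an ordinary string literal; it is an analogy, not a description of awk’s internals.

```python
import re

pattern = "\\\\"   # lexical level: the literal ends up holding two backslashes
text = "\\"        # a single backslash, like the output of echo '\'

print(len(pattern), len(text))

# Regex level: the two backslashes are read as one escaped (literal) backslash.
result = re.sub(pattern, "replaced", text)
print(result)
```

The first print shows 2 and 1, the second shows replaced: four backslashes in the source become two in the string, which the regex engine reads as one literal backslash.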

Pipe handling also deserves to be in the WTF category. My thoughts/examples are here, while the official docs explain this:

Each command in a pipeline is executed in its own subshell

$ time sleep 0.1 &> file
real	0m0.106s
user	0m0.000s
sys	0m0.008s

This is the final WTF :) . One would expect that the output of time would go to “file”. Well, wrong. This is possible because time is a shell keyword and as such can do stuff that nothing else (builtins, commands, aliases, functions) in the shell ecosystem can do. The positive effect of this kind of behavior is that you can pass time a pipeline and it will time the entire pipeline, as opposed to just the first part.

$ { time sleep 0.1; } &> file
$ cat file 
 
real	0m0.108s
user	0m0.004s
sys	0m0.000s

I can’t find this documented anywhere in the official docs; it is however documented in BashFAQ.

This concludes my list of bash WTFs, if you can think of any more please leave a comment :)


sed vs awk

Saturday, July 18th, 2009

This post attempts to answer that question once and forever! Just joking ;) , the answer of course depends on the type of task you are doing. Let’s have a look at the differences and the use cases where each of these tools is appropriate. The tools are fundamentally different: awk is a full-blown language while sed is just a tool. For the record, my personal preference is sed. :)

The typical use case for awk is column manipulation. Say you wanted to print some fields from a CSV.

$ seq 19 | tr '\n' , | awk -F, '{ print $12, $11, $9 }'
12 11 9
$ seq 19 | tr '\n' , | sed \
's/\([^,]*,\)\{8\}\([^,]*\),\([^,]*\),\([^,]*\),\([^,]*\),.*/\5 \4 \2/'
12 11 9

Awk’s -F determines what awk considers a delimiter, and $N (where N is a number) is that field in the current line. That being said, it’s quite easy to understand the awk line, while the sed one is rather cumbersome. And here is a related hint: -F takes a regex as delimiter, so to have awk split the line on either a space or a comma you would use -F'[ ,]'.

$ seq 10 | awk '{ i+=$1 } END { print i }'
55

The next use case for awk is math; sed can’t do arithmetic in a sane way. Adding up the first column is a simple task in awk with no (sane) sed alternative. Another common use case is comparing a certain column to some value (say, you want the lines where the second column is more than 10).

$ var=a
$ echo ab | sed "s/$var/b00/g"
b00b
$ echo ab | awk -v var=$var '{ gsub(var, "b00"); print; }'
b00b

Next up is the philosophical stuff: with sed you can’t really separate the input from the logic. You are essentially generating sed commands, while with awk you have fixed logic and are only telling it what variable to act on. Consider what happens if you pass any characters special to sed in $var. sed chokes, awk doesn’t care.

$ var='\'
$ echo ab | sed "s/$var/b00/"
sed: -e expression #1, char 8: unterminated `s' command
$ echo ab | awk -v var=$var '{ gsub(var, "b00"); print; }'
ab

This particular problem is easy to work around, you have to escape the backslash and slash (because it is the sed delimiter) and you’re good to go:

var=$(echo "$var" | sed 's/[\/\\]/\\&/g')

But what happens if you use a regex metacharacter (such as ‘^’ or ‘$’) in your input? This problem is really irrelevant to the sed vs awk debate, as both suffer from it. The solution of course is the same as above: escape it. Before we have a look at the example, let me explain that this kind of thing is usually wrong :) , you are putting a hack in there. A sane alternative is usually fgrep. The example below assumes you are not enabling ERE (if you are, a couple more metacharacters have to be added).

var=$(echo "$var" | sed 's/[]\/*.^$[]/\\&/g')
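For comparison, here is how the same two options (match a fixed string, or escape the input before using it as a pattern) look in a language that ships both. This is a Python sketch for illustration only; the variable names are mine.

```python
import re

var = "\\"       # a single backslash, the value that made sed choke
line = "a\\b"    # the three characters a, \ and b

# Fixed-string replacement: no pattern language involved, much like fgrep.
literal = line.replace(var, "b00")

# Regex replacement with user input: escape the input first.
escaped = re.sub(re.escape(var), "b00", line)

# A metacharacter such as "^" is handled the same way.
caret = re.sub(re.escape("^"), "b00", "a^b")

print(literal, escaped, caret)
```

All three results are ab00b. The fixed-string variant is the one to prefer whenever you don’t actually need a pattern.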

Awk, being a full-blown language, offers you considerable control over the flow of the program. In sed you have ‘b’ to do an unconditional jump, ‘t’ for a conditional jump and the unportable ‘T’ for a negated conditional jump. With awk you can branch (you have the if statement) on pretty much arbitrary conditions.

Doing anything with variables is another big no-no in sed. If you want to filter the rest of the text based on something in the text itself, there is no sane way to do that. Even simple things like getting the lines where the 1st column is contained in the fourth are painful/unmaintainable in sed. Here we use a handy sed feature where you can use back referencing even when still on the left side of the regex.

$ echo '1,2,3,11,4' | awk -F, '$4 ~ $1'
1,2,3,11,4
$ echo '1,2,3,11,4' | sed -n \
'/^\([^,]*\),\([^,]*,\)\{2\}[^,]*\1/p'
1,2,3,11,4
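To make the intent of those two one-liners explicit, here is the same filter spelled out in Python. It is a sketch, not an exact equivalent: first_in_fourth is a name I made up, and it does a plain substring test, whereas awk’s $4 ~ $1 treats the first field as a regex.

```python
def first_in_fourth(line, sep=","):
    """Return True if the 1st field occurs inside the 4th field."""
    fields = line.split(sep)
    return len(fields) >= 4 and fields[0] in fields[3]


lines = ["1,2,3,11,4", "5,2,3,11,4"]
matches = [line for line in lines if first_in_fourth(line)]
print(matches)
```

Only the first line is kept, since "1" occurs in "11" while "5" does not.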

An advantage sed has over awk is that it uses an NFA regex engine, which makes back references possible. There are hacks that allow you to get back referencing to work in awk; one is gawk’s gensub(), but that places you in the portability ghetto. There was also an awk library named awke which provided this functionality, but it seems dead now.

So in conclusion, awk is great for math inside text and for column manipulation, and I use sed for most of the other stuff. And if you have really complex stuff to do, chances are that in the long term it is more maintainable to do it in, say, Python.
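For a rough idea of what that looks like, here are the column tasks from this post redone in Python. Treat it as a sketch: the input is built in memory rather than piped in from seq, and the variable names are mine.

```python
import re

# Roughly the input of: seq 19 | tr '\n' ,
line = ",".join(str(n) for n in range(1, 20))

# awk's $12, $11, $9 are 1-based; Python lists are 0-based.
fields = line.split(",")
picked = [fields[11], fields[10], fields[8]]

# Splitting on either a space or a comma, like awk -F'[ ,]'.
mixed = re.split(r"[ ,]", "a b,c")

# Adding up the first column, like: seq 10 | awk '{ i+=$1 } END { print i }'
rows = [str(n) for n in range(1, 11)]
total = sum(int(row.split()[0]) for row in rows)

print(picked, mixed, total)
```

It is more typing than the awk one-liners, but every step has a name you can test on its own.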


Avahi thoughts

Tuesday, June 30th, 2009

First a short intro to avahi: basically it is a Zeroconf implementation for Linux. To make a long story short, through the use of multicast it is able to discover services as they appear (as well as scan the network for those services).

This is a tale of a hacker deciding to take a pydra ticket. After verifying that avahi has Python bindings I headed to their home page looking for docs. There is a “ProgrammingDocs” page, which looked like a good sign. Imagine my horror when I read on that page:

Though no real documentation about the DBUS API is available, you may browse the DBUS introspection data online

OK, so they pretty much don’t have docs; it can’t be that bad to work with it, right? Their API is too f**** complicated :-) , this is supposed to be the simplest of the “client” examples, this is the simplest of the “publisher” examples. Fortunately the publisher wrapper is very nice to work with, but the fact that it exists hints that there is something wrong with the API.

As far as the protocol goes, the first thing that struck me is that you can’t really distinguish between server and client, as both take an active role. The “publisher” advertises its service, while the “client” (or whatever you wanna call it) starts the glib event loop and waits for asynchronous callbacks to happen. If we ignore the fact that you are forced to use a certain event loop, having continuous discovery is not a bad thing: it provides you a way to do discovery even if certain involved parties are temporarily down, as well as the ability to see them as they join the network. What I’m saying is that forcing people to use an event loop is a bad thing (as it is big overkill in simple cases).

If we get to my code, I chose to make the master the one discovering nodes. The reason for this is that this way I don’t have to make the node tell the master “hey i’m alive use me” (and of course implement the appropriate extension to the protocol). So basically the master is looking for nodes; when a node is put on the network it advertises itself, and the master finds it and adds it to its Node list.
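The bookkeeping on the master side can be sketched roughly as below. To be clear about the assumptions: this is neither the pydra code nor the avahi API. The Master class, its method names, the node names and the addresses are all made up for illustration; in a real setup the two callbacks would be wired to whatever “service appeared” and “service went away” notifications the discovery library delivers from its event loop.

```python
class Master:
    """Hypothetical master that tracks nodes as discovery callbacks fire."""

    def __init__(self):
        self.nodes = {}  # node name -> (host, port)

    def on_service_found(self, name, host, port):
        # Called when a node advertises itself on the network.
        self.nodes[name] = (host, port)

    def on_service_removed(self, name):
        # Called when a node's advertisement disappears.
        self.nodes.pop(name, None)


master = Master()
master.on_service_found("node-1", "192.0.2.10", 11890)
master.on_service_found("node-2", "192.0.2.11", 11890)
master.on_service_removed("node-1")
print(sorted(master.nodes))
```

After the three calls only node-2 is left in the list, and at no point did a node have to contact the master directly.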
