A time-killing demi-hacker is a sad sight to behold

A cybersleuthing puzzle for yo' bored Friday-afternoon arses



The following is a text dump from an xterm (command-line prompt) on my workstation yesterday afternoon (with some extraneous cruft removed). It all started when a colleague of mine made an (incorrect) assertion ...:

 


 

yicky@workstation:~$ mkdir us_tmp
yicky@workstation:~$ cd us_tmp
yicky@workstation:us_tmp$ python
>>> import re
>>> s = file('/usr/share/dict/words').readlines()
>>> len(s)
234937
>>> vs = [x for x in s if x[0].islower() and x.endswith("us\n")]
>>> len(vs)
7545
>>> v = []; r = range(1,21)
>>> for i in r:
...     o = re.compile("^\w{%d}us$" % i)
...     c = 0
...     for j in vs:
...         if(o.match(j)): c+= 1
...     v.append(c)
...
>>> sum(v)
7544
>>> file('data', 'w').writelines(["%d\t%d\n" % (i,j) for i, j in zip(r, v)])
>>>
yicky@workstation:us_tmp$ cat data
1       3
2       9
3       82
4       197
5       318
6       482
7       718
8       1068
9       1204
10      1218
11      977
12      616
13      319
14      176
15      84
16      48
17      14
18      6
19      4
20      1
yicky@workstation:us_tmp$ gnuplot
gnuplot> plot 'data' with linespoints lt 3 pt 4
gnuplot> set grid
gnuplot> replot
gnuplot> set terminal postscript colour linewidth 3
Terminal type set to 'postscript'
Options are 'landscape noenhanced color colortext \
   dashed dashlength 1.0 linewidth 3.0 defaultplex \
   palfuncparam 2000,0.003 \
   butt "Helvetica" 14'
gnuplot> set output 'data.ps'
gnuplot> replot
gnuplot> exit
yicky@workstation:us_tmp$ (gimp &)
yicky@workstation:us_tmp$ ls -l
total 176
-rw-r--r--  1 yicky somegroup    123 Feb 23 15:11 data
-rw-r--r--  1 yicky somegroup  32850 Feb 23 15:14 data.jpg
-rw-r--r--  1 yicky somegroup 133611 Feb 23 15:13 data.ps
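
For anyone replaying this on a newer interpreter: the session above is Python 2, and the `file()` built-in it leans on does not exist in Python 3. A rough Python 3 sketch of the same count follows. The helper name `count_us_stems` and the tiny sample list are made up for illustration; the real run read /usr/share/dict/words.

```python
import re
from collections import Counter

def count_us_stems(words, max_stem=20):
    """Count lowercase-initial words ending in 'us', keyed by how many
    word characters come before the 'us' (1..max_stem)."""
    pattern = re.compile(r"^(\w+)us$")
    counts = Counter()
    for word in words:
        word = word.strip()              # drop the trailing newline, if any
        if not word or not word[0].islower():
            continue
        m = pattern.match(word)
        if m and len(m.group(1)) <= max_stem:
            counts[len(m.group(1))] += 1
    return [(n, counts[n]) for n in range(1, max_stem + 1)]

# Made-up sample standing in for the dictionary file.
sample = ["bus\n", "us\n", "Venus\n", "gibbous\n", "cat\n", "plus\n"]
rows = count_us_stems(sample)

# With the real file (path as in the session above):
# with open("/usr/share/dict/words") as f:
#     rows = count_us_stems(f)
```

Like the original loop, this skips the bare word "us" (nothing in front of it) and anything with more than 20 characters before the suffix. Note that `\w` in Python 3 is Unicode-aware by default, so counts could differ slightly on a dictionary containing accented words.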

 


 

Graph:

 


 

The Question:

  • What (beyond an overly-spoddish reaction to stultifying tedium) had motivated all of this in the first place? — i.e. What assertion had my colleague made, and why?

Being able to understand what was literally going on ("You grommeted x into y") isn't enough to get the full picture - it requires a small piece of Holmesian logical abduction. The people who can read the technutiae may possibly need help from those who can't.

A time-killing demi-hacker is a sad sight to behold | 13 comments (13 topical, 0 hidden) | Trackback
Your colleague by DesiredUsername (4.00 / 2) #1 Fri Feb 24, 2006 at 09:41:33 AM EST
tried to ask this and failed to do it correctly.

---
Now accepting suggestions for a new sigline


Heh. Good try, but nope. [nt] by yicky yacky (2.00 / 0) #2 Fri Feb 24, 2006 at 09:56:32 AM EST

----
15 days left ...

Answer by DullTrev (4.00 / 2) #3 Fri Feb 24, 2006 at 09:58:22 AM EST

He said you don't know how to use python or gnuplot.

Unfortunately, I don't know any python at all. I haven't a clue what you did, let alone why.

I like being a project manager, you know.


--
DFJ?


did he say something like by TPD (4.00 / 2) #4 Fri Feb 24, 2006 at 10:08:21 AM EST
there's nothing in front of us

Rock Hard Abs are just a sw-sw-swivel away!


Haha. by yicky yacky (4.00 / 2) #6 Fri Feb 24, 2006 at 10:14:13 AM EST

And I turned round and said, in my best John Major intonation, "That is fallacious: There are exactly seven thousand, five hundred and forty ........ four things in front of us".

It's pretty feckin' geeky, but not quite that bad.


----
15 days left ...

my guess by tps12 (4.00 / 3) #5 Fri Feb 24, 2006 at 10:14:08 AM EST
It was proposed that Latin nouns are adopted into English as longer alternatives to short Anglo-Saxon ones. If that's the case, you'd expect the mean length of words ending in "-us" (mostly Latinate nouns) to be somewhat larger than the average for all English words (five letters?).



Not far off at all by yicky yacky (2.00 / 0) #7 Fri Feb 24, 2006 at 10:15:45 AM EST

That's very similar to the assertion made, but no closer as to why it was made in the first place.


----
15 days left ...

Linguist! by BadDoggie (4.00 / 1) #8 Fri Feb 24, 2006 at 11:10:55 AM EST
Not calling you racisT.

woof.

OMG WE'RE FUCKED! -- duxup ?

Update: The answer by yicky yacky (4.00 / 1) #9 Fri Feb 24, 2006 at 11:36:35 AM EST

I'll have to scoot some time in the next couple of hours and I don't know how long for. This was just a bit of ephemera and I don't want to leave the answer hanging in the air all weekend so here it is, for those who want to know:

The assertion:

My colleague asserted that the frequency distribution of words ending in 'us' would have characteristics more akin to those of a sawtooth wave than a Gaussian distribution, owing to their Latin roots and their over-representation in long, florid, academic texts. I mostly disagreed, saying there may be a slight bias that way but it was still probably more Gaussian than sawtooth. In fairness, there is a slight bias in that direction, but it's still Gaussian.

This is why tps12's guess isn't completely right, as the data values don't represent the true length of the words, but the length minus two (the length of the string 'us' itself): The true values are shifted rightwards on the graph by one grid-block.
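
A rough way to put numbers on both points, using only the counts from the `data` file in the diary: shift the keys by two to get true word lengths, then look at the mean, the most common length and the skewness. This is a sketch, not a proper goodness-of-fit test; the `moments` helper is made up for the purpose. A symmetric bell has skewness near zero, while a lopsided ramp would push it well away from zero.

```python
# Counts copied from the `data` file: letters before "us" -> number of words.
stem_counts = dict(zip(range(1, 21),
                       [3, 9, 82, 197, 318, 482, 718, 1068, 1204, 1218,
                        977, 616, 319, 176, 84, 48, 14, 6, 4, 1]))

# True word length = stem length + len("us").
length_counts = {stem + 2: n for stem, n in stem_counts.items()}

def moments(hist):
    """Return (mean, variance, skewness) of a {value: count} histogram."""
    total = sum(hist.values())
    mean = sum(k * n for k, n in hist.items()) / total
    var = sum(n * (k - mean) ** 2 for k, n in hist.items()) / total
    third = sum(n * (k - mean) ** 3 for k, n in hist.items()) / total
    return mean, var, third / var ** 1.5

stem_mean, stem_var, stem_skew = moments(stem_counts)
len_mean, len_var, len_skew = moments(length_counts)

# Most common true word length.
mode_length = max(length_counts, key=length_counts.get)

print("mean length %.2f, mode %d, skewness %.3f" % (len_mean, mode_length, len_skew))
```

The shift moves the mean and the mode by exactly two letters and leaves the shape (variance, skewness) untouched, which is why the graph only needs sliding rightwards rather than redrawing.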

The reason:

My colleague was domain-name-bashing, and had contemplated the cheesy URLism of picking a word ending in 'us' so that he could buy the matching domain under the .us TLD, e.g. 'gibbous' would become http://www.gibbo.us etc. He noticed that such words seemed, as tps12 guessed, to be longer than those from other linguistic roots.
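
The word-to-domain trick itself is a one-liner. A minimal sketch, with a made-up helper name `us_domain_hack`; it only does the string surgery and says nothing about whether the result is a valid or available domain.

```python
def us_domain_hack(word):
    """Turn a word ending in 'us' into a '.us' domain candidate,
    e.g. 'gibbous' -> 'gibbo.us'. Returns None if there is nothing
    in front of the 'us' or the word does not end in 'us'."""
    word = word.strip().lower()
    if len(word) > 2 and word.endswith("us"):
        return word[:-2] + ".us"
    return None

candidate = us_domain_hack("gibbous")   # 'gibbo.us'
```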

All of this was expressed in much more long-winded and casual language than the above, but hey, it's dull enough already ...


----
15 days left ...


But more importantly: by gazbo (4.00 / 2) #10 Fri Feb 24, 2006 at 11:51:42 AM EST
I spit on your Python!  (Ooh-err):

for i in $(seq 1 21); do
  # quote the pattern so the shell hands it to egrep untouched
  echo "$i $(egrep -c "^[a-z]{$i}us$" /usr/share/dict/words)"
done

Hmm - at least I hope that's right.  I get pretty different numbers to you, but your dict is bigger than mine (man, I'm a one-man comedy show!  Or "comedian" as they call it).  Roughly 5 times the size.  That said, the numbers you came up with are way more than 5 times higher - possibly indicating that more comprehensive dictionaries have a disproportionate number of Latin words.

Or, of course, that my basically untested code is basically wrong code.
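
For what it's worth, the two approaches are not quite asking the same question, independent of dictionary size. A small sketch of the differences, using made-up test strings; whether any of this accounts for the gap in the numbers is not established here.

```python
import re

# The diary's loop used \w for the leading characters; the shell loop uses [a-z].
python_style = re.compile(r"^\w{3}us$")     # pattern shape from the diary, i = 3
egrep_style = re.compile(r"^[a-z]{3}us$")   # pattern shape from the shell loop, i = 3

# Made-up strings: \w also accepts capitals, underscores and digits.
samples = ["locus", "loCus", "lo_us", "lo1us"]
only_python = [s for s in samples
               if python_style.match(s) and not egrep_style.match(s)]

# The loop bounds differ too: range(1, 21) stops at 20, `seq 1 21` runs to 21.
python_lengths = list(range(1, 21))
shell_lengths = list(range(1, 22))
```

Here `only_python` ends up holding the three strings with a capital, an underscore or a digit in the middle, none of which the `[a-z]` version counts.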


"Engarde!" cried the larvae, huskily. - Scrymarch


You bash-ed my python! Ghetto 7 by yicky yacky (2.00 / 0) #12 Fri Feb 24, 2006 at 12:03:16 PM EST

This entendr-itude has to stop ...

Nice one. In complete honesty, I spent a couple of minutes titting around with egrep and getting odd results before realizing that, if I'd done it the long way in python, I'd have done it already.


----
15 days left ...

hm by tps12 (4.00 / 1) #11 Fri Feb 24, 2006 at 11:59:28 AM EST
Long -us words' "over-representation in academic texts" would affect the frequency distribution over a body of texts (where you're not just tracking how many different words are used, but also how many times individual words are repeated), not the distribution of words in the dictionary. There are more 11-letter -us words in the language, but if they're each much less frequently used than words like "bus," then you might see a less Gaussian distribution when measuring actual use.

Of course, actual use isn't really the issue when you're talking about domain names.
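
To make the distinction concrete, here is a toy sketch of counting by occurrence (every use in a text) versus by distinct word (dictionary-style). The corpus is invented purely for illustration and proves nothing about real usage.

```python
from collections import Counter

# Made-up toy corpus, purely illustrative.
corpus = ("the bus left and the next bus was late "
          "so the bus fare was a bonus").split()

us_tokens = [w for w in corpus if w.endswith("us")]

# Length distribution counted per occurrence in the text.
token_counts = Counter(len(w) for w in us_tokens)

# Length distribution counted per distinct word, as a dictionary would.
type_counts = Counter(len(w) for w in set(us_tokens))
```

In this toy text "bus" turns up three times and "bonus" once, so the per-occurrence histogram leans heavily towards three letters while the per-word histogram is flat.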


Yeah - that was a weak and short-handed by yicky yacky (2.00 / 0) #13 Fri Feb 24, 2006 at 12:05:33 PM EST

explanation for his conjecture, which is more to do with words as atomic non-repeating units, but you got the picture.


----
15 days left ...
