Latest Event Updates

So what is "tm-2010-03-07T00:44:11.w64"?


The meaning of files is not always clear. I found a file named "tm-2010-03-07T00:44:11.w64" in my Ubuntu home directory. It is 121 MB, I didn't put it there… and there are a few others with the same name pattern. Can I erase them? Googling "w64" suggests it is some kind of "Sonic Foundry" audio file format, while running $ file tm-2010-03-07T00:44:11.w64 just reports "data".
The Nautilus file manager has not associated the file with any program, and Rhythmbox won't open it. Further Googling identifies it as a recording made by the JACK "timemachine" program. After trying a few audio programs I finally got Ardour (ardour2) to import and play the file…
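
If you want to inspect or convert such a file programmatically, libsndfile understands the W64 format. Below is a minimal sketch using the Python soundfile bindings; the package choice and the conversion to WAV are my own illustration, not something I did at the time.

# Minimal sketch: inspect and convert a W64 file with the soundfile
# bindings to libsndfile (pip install soundfile).
import soundfile as sf

filename = "tm-2010-03-07T00:44:11.w64"

print(sf.info(filename))                # format, channels, sample rate, duration

data, samplerate = sf.read(filename)    # load the samples
sf.write("tm-recording.wav", data, samplerate)   # convert to an ordinary WAV file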


Third party cookies on Danish public radio web site


[Image: screenshot of the third-party cookie prompt on the DR web site (Dr-thirdpartycookie)]

I have recently switched on 'ask' for web cookie (HTTP cookie) handling in my web browser, and I must say I am surprised by the number of web sites that use not only their own cookies but also cookies from third parties.

Danmarks Radio (DR), the Danish publicly funded radio and television broadcaster, has no commercials on radio, on television or on the associated web site. I was therefore somewhat surprised to get a cookie for www.mybanker.dk after browsing the DR web site. It turns out that some of the financial data displayed on the DR web site come from a server in the domain "six.se", and the cookie seems to arrive with that included content. So by browsing the DR web site, data is sent to a company. A cookie is not quite an advertisement, but it must have commercial value; why else would the company go to the trouble of setting the cookie? I would think that DR is on slippery ground here.

Besides www.mybanker.dk, third-party cookies on the DR site also come from the Polish company Gemius, which presumably does web statistics for DR. Cookies from dr.adservinginternational.com and .addthis.com show up as well.
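
To get an idea of where such included third-party content comes from, one can list the external hosts referenced in a page's HTML. The sketch below uses the Python requests package and a crude src/href regular expression purely for illustration; it only sees what is in the raw HTML, not content added by JavaScript, which is another common source of third-party cookies.

# Rough sketch: list third-party hosts referenced from a page's HTML.
import re
from urllib.parse import urlparse

import requests

url = "http://www.dr.dk/"                  # example page; any URL will do
html = requests.get(url).text
first_party = urlparse(url).hostname

hosts = set()
for link in re.findall(r'(?:src|href)=["\'](https?://[^"\']+)', html):
    host = urlparse(link).hostname
    if host and host != first_party:
        hosts.add(host)

print("\n".join(sorted(hosts)))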

In a modern browser the user can control how ordinary cookies are set. Disabling them completely is usually not an option, because cookies are widely used for web service logins. A further issue is Flash cookies, also called Local Shared Objects. These cookies come from Flash content and cannot (at the moment) be controlled through the web browser's configuration, something many users are unaware of. Flash cookies can be used by advertisers to circumvent a user's deletion of web cookies.
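
Even though the browser does not show them, Flash cookies can be found on disk. Here is a small sketch for Linux; the ~/.macromedia location is the usual one, but the exact layout may differ between Flash Player versions.

# Sketch: list Flash cookies (Local Shared Objects, *.sol files) on disk.
import glob
import os

pattern = os.path.expanduser("~/.macromedia/Flash_Player/#SharedObjects/*/*")
for path in sorted(glob.glob(pattern)):
    domain = os.path.basename(path)        # directory is named after the site
    sol_files = glob.glob(os.path.join(path, "**", "*.sol"), recursive=True)
    print("%s: %d shared object(s)" % (domain, len(sol_files)))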

A 2009 study found that 98 of the top 100 web sites used ordinary cookies in a standard browsing session (with wikipedia.org and wikimedia.org as the only two exceptions) and that 54 of the sites used Flash cookies. 37 of the 100 web sites had matching web and Flash cookies from advertisement companies. Even whitehouse.gov used Flash cookies.

Social opinion Web sites


Data mining for opinions in the social Web becomes quite a bit easier if "opinionists" are quantitative about their opinions, if the opinions are gathered on a single site and if named entities are resolved. Several web sites aim to address these issues. Many companies that sell products or facilitate selling already let users rate individual products or services (e.g., Amazon, pricerunner.dk), but other web sites focus exclusively on opinions: Epinions.com comes to mind. In Denmark I know of three sites: Trustpilot, VaVirk and Spred Rygtet.

As of 2010 Spred Rygtet (Spread the Rumor) seems smallish, but it is interesting in its scope: it has the subtitle "Recommendations and warnings about everything from A to Z", and users have rated, e.g., books, movies, persons and companies with a grade between minus three and plus three. I got to the site when I heard about a TV program to be aired on national television exposing Jakob Storgaard and his fraud schemes. On Spred Rygtet ten accounts rate Jakob Storgaard with the lowest grade. Spred Rygtet seems to have been operated since 2007 by someone called Andreas. His blog has not been updated since February 2008 and Spred Rygtet has a messy layout, but at least the site is mentioned on the Danish Wikipedia.

Trustpilot has been run by Peter Holten Mühlmann since 2007 and at one point had 10 employees. It is a customer-driven rating site, and the company has attracted some venture capital. Opinions that do not relate to an actual trade on the Internet are removed from Trustpilot; e.g., comments on the so-called OrangoGate, where a company "stole" a domain name (orango.dk) from a private person, were removed from Trustpilot, according to Mühlmann because "the ratings had nothing to do with the service of the company". From a business ethics/corporate social responsibility point of view Mühlmann must be wrong: the company that stole the domain name has surely come out with a bad reputation. In 2008 Trustpilot itself had ethical problems: sending out spam emails, parasiting on a recognized Internet trust mark and violating the Nordic consumer ombudsmen's stance on Internet trade. Among its references Trustpilot counts getmore.dk, which has been under investigation for sales tax fraud. On Trustpilot's release in 2007 a commenter noted its lack of checks on the identity of the people who rate, and in 2009 the rating of getmore.dk was suspicious enough that a competitor from Proshop suggested the ratings were manipulated. I hope that Trustpilot learns from its mistakes; in their line of business, ethics is surely an issue to focus on.

VaVirk (vurderinger af virksomheder: evaluations of companies) lets users rate companies. It is run by the company NørdIt, headed by Ralf Willers. One blogger found that a nørdit Twitter account engaged in Twitter spamming, and NørdIt itself is rated on a page on VaVirk without any indication on that page that NørdIt owns VaVirk. They too need to work a bit on the ethics. VaVirk seems less internationally oriented than Trustpilot, but it is broader in scope, with evaluations not just of companies with Internet-based trade but also of, e.g., medical doctors and hairdressers.

It will be interesting to follow these sites and how they deal with the problem of rating spam.

And by the way: I cannot find an API on any of the sites, so third parties cannot easily extract information from them. Maybe someone should tell them about Web 2.0?

(This post was also posted on Responsible Blogging)

(Typos corrected 16 March 2010)

How to read all of Wikipedia in half an hour


Some may wonder how you are able to read Wikipedia from the huge dumps that are made available from http://download.wikipedia.org. The present file for the current English Wikipedia is 5.6 GB, and that is compressed! The tricks I used for the article Scientific citations in Wikipedia, published in First Monday in 2007, were bzcat, pipes and Perl. In Perl it is relatively easy to call the decompression program bzcat and capture its output "as you go", so it is not necessary to decompress the entire file into one very large file. By redefining the input record separator I iterate over each wiki page and perform some extraction on each page, in my case looking for the 'cite journal' template.

The original program, slightly edited, is listed below. As far as I remember it took around half an hour to run.

#!/usr/bin/perl -w

use open ':utf8';
use English;

# Dump name, e.g., "enwiki-20080312"
if (@ARGV) {
    $archive = $ARGV[0];
} else {
    $archive = "enwiki-20080312";
}
$filenameInput  = $archive . "-pages-articles.xml.bz2";
$filenameOutput = $archive . "-cite-journal.txt";
$filenameTitles = $archive . "-titles.txt";

# Read one wiki page at a time by using the closing page tag of the
# XML dump as the input record separator
$INPUT_RECORD_SEPARATOR = "</page>";

# Decompress on the fly with bzcat and capture its output through a pipe
open($fileInput, "bzcat $filenameInput |") || die("can't open file ($filenameInput): $!");
open($fileOutput, "> $filenameOutput") || die("can't open file ($filenameOutput): $!");
open($fileTitles, "> $filenameTitles") || die("can't open file ($filenameTitles): $!");

$pagenumber = 1;
while (<$fileInput>) {
    # Match "Cite journal" templates and the page title
    @citejournals = m/({{\s*cite journal.*?}})/sig;
    @titles = m|<title>(.*?)</title>|;
    $titles[0] =~ s/\s+/ /g;

    # Collapse consecutive whitespace and print to file
    foreach $citejournal (@citejournals) {
        $citejournal =~ s/\s+/ /g;
        print $fileOutput "$pagenumber: $citejournal\n";
        print $fileTitles "$pagenumber: $titles[0]\n";
    }
    $pagenumber++;
}

close($fileInput);
close($fileOutput);
close($fileTitles);

You can consider the code to be under the GPL. If you are a scientist and use it, I would be glad if you would also cite the associated First Monday paper. The program was also used for Clustering of scientific citations in Wikipedia.
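
If you prefer Python over Perl, the same idea – streaming decompression and page-by-page extraction – could look roughly like the sketch below. This is just an illustration, not the program used for the paper.

# Sketch: stream-decompress the dump and extract 'cite journal'
# templates page by page, without unpacking the file to disk first.
import bz2
import re

filename = "enwiki-20080312-pages-articles.xml.bz2"

cite_journal = re.compile(r"\{\{\s*cite journal.*?\}\}", re.IGNORECASE | re.DOTALL)
title_re = re.compile(r"<title>(.*?)</title>")

page, pagenumber = "", 0
with bz2.open(filename, "rt", encoding="utf-8") as dump:
    for line in dump:
        page += line
        if "</page>" in line:                 # end of a wiki page
            pagenumber += 1
            title = title_re.search(page)
            for cite in cite_journal.findall(page):
                print(pagenumber,
                      title.group(1) if title else "",
                      re.sub(r"\s+", " ", cite))
            page = ""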

Good luck with Wikipedia mining!

Installing Ubuntu on a new Acer Aspire One N450


Installing Linux distributions is not always an easy task. Back in October 2009 I wrote about my problems updating an old computer to Debian Lenny. At home I have a stationary computer that dates almost back to the previous millennium. I needed to upgrade that old and slow computer from Debian oldstable to the present Debian stable, and in the course of the upgrade I got into trouble, so the computer is now quite difficult to boot. In 2009 I could bring home my portable computer from work, a Dell D420 laptop, but since the introduction of the infamous Danish 'multimedieskat' (multimedia tax), taking the computer home now costs around 1,500 Danish crowns each year in extra tax. So I was slightly computer-less and wanted a new computer. I bought an Acer Aspire One netbook with an Intel Atom N450 1.66 GHz processor (3325.31 bogomips), 1 GB RAM, a 160 GB hard disk and a 10.1 inch screen. Several other computer manufacturers had models with almost the same configuration; only a larger 250 GB hard disk and the design seemed to distinguish the models.

In Denmark it is difficult to avoid buying Microsoft Windows when you buy a new computer; almost all computers are sold with Windows. Presently, the Danish übernerd Poul-Henning Kamp is leading a court battle trying to get a refund for an unnecessary Microsoft Windows Vista license that came with his Lenovo. The Acer Aspire One that I bought had Microsoft Windows 7 Starter ready for installation.

To see how much effort actually went into installing Linux I timed the process, and here are my notes:

  • 0:00 Unpacking, while trying to find information (i.e., googling) about Linux swap space on the hard disk. Imagining that I should use the standard 2 × the amount of RAM…
  • 0:08 Windows installation.
  • 0:18 Installing Windows 7 Starter. 'Configure System Settings'. Here it turns out that Windows is not preinstalled; instead, part of the hard disk is allocated to what in the old days would have been an installation CD.
  • 0:50 Finished installation of Microsoft Windows 7 Starter
  • 1:00 Checked that the external screen worked at 1920×1080. DR Update worked in fullscreen. Slow. Webcam works. The hard disk was sold as 160 GB; now it reports 122 GB free and 136 GB total. The Microsoft installation has taken up quite a bit.
  • 1:10 Rebooting into Ubuntu from a USB stick with a 'live CD' that I 'burnt' before Christmas. Didn't catch the BIOS, hmmm… it boots into Windows. Rebooting and pressing DEL or F2 continuously.
  • 1:26 Now booting from Ubuntu 9.10 USB stick. Wireless works. Sound out works. External display works. Not in full resolution. Looking around.
  • 1:42 Ubuntu Install. Keyboard layout set to Danish.
  • 1:45 Partitioning. The Ubuntu installer makes suggestions for partitioning. I see that part of the hard disk is taken up by the Microsoft installation. I delete everything Microsoft. Ubuntu suggests an ext4 file system and swap (so my googling for information about swap space was not that necessary). I was a bit reluctant here: I erased the entire Microsoft installation – together with the part of the hard disk that held the 'installation CD'. On my previous computers I installed Windows alongside Linux, although I almost never used Windows. Erasing Windows gives me some extra hard disk space, but with 'no way back'.
  • 2:11 Ubuntu installed and running. The external mouse is slow. In 1920×1080. Cannot close the lid and run on the external screen.
  • 2:40 Continue installation. Crash at the install of Emacs. Unstable wireless.
  • 2:56 Mouse now runs better. 212 packages for upgrade. New kernel 2.6.31-17.54.
  • 3:03 Restarting after kernel and grub upgrade.
  • 3:43 Installed cvs and ZynAddSubFX. The camera works with the Cheese program. Surprisingly the microphone works (with gnome-sound-recorder 2.28.1) even though people on the Internet have reported problems. Installed LaTeX.

So around an hour to install Windows – which I would later erase – and two hours to install (with overhead) a basic version of Ubuntu.

After these installation steps I still have issues, e.g., setting up secure shell passwords, copying cvs repositories, installing Java, setting the environment variables so Emacs and terminals know where my LaTeX style files are, and setting up the Gnome desktop with multiple workspaces and edge flip.

The installation process went better than my expectations. In the old days setting up the obscure XF86Config could take many days – at least for me.

After a few weeks of experience with the new Acer One running Ubuntu Karmic I have the following observations:

  • It runs around 5 hours on battery :-) It usually reports under 10 watts of power usage. I wonder if that is due to the new N450 processor?
  • It has no apparent problems in making desktop OpenGL effects on a 1920×1080 screen. :-)
  • It has problems showing video on an external 1920×1080 screen, e.g., YouTube in fullscreen. I guess the graphics card is not fast enough.
  • The wireless network is unstable :-( I don’t know what this is. It usually happens with large downloads, and log messages “ath9k: DMA failed to stop in 10 ms” are my only hook on this problem.
  • The power button is under the lid, not outside :-( My Dell D420 laptop has the button on the outside, so when it is attached to an external screen, keyboard and mouse I do not need to open it to switch it on.
  • No 'network off' button, which might be a problem on an airplane :-(
  • Two of the USB connectors are positioned quite close together :-( Thick USB sticks cannot sit next to each other. There is, however, one connector sitting by itself on the right side of the computer.
  • Switching screens is handled badly. Sometimes I get two black screens. Fn+F5 and 'skærmindstillinger' (screen settings) do something, but not necessarily the right thing.
  • The desktop is sometimes unstable, with Ctrl+Alt+F1 occasionally needed.

And the problems do not end there: my upper and lower edge flip between workspaces has now gone. I also tried installing Ubuntu Studio with the realtime kernel, JACK and all that, but that is a longer story.

But all in all I would say the system is reasonably OK.

Together with the netbook I bought a USB computer mouse and a USB qwerty keyboard – the cheapest I could find. The mouse is a small Logitech (M-U0017?) which is OK, though the cable is awfully short, so the computer cannot be far from the table. The keyboard is a compact Logitech 'Ultra-Flat Keyboard', and hmmm… I am not that satisfied with it, since the key layout is compact and awkward: there is no space between the keys except between the numerical keys and the rest of the keyboard. For example, the arrow keys have no space around them, so it is difficult to feel where they are; you have to look. 'Page Up' and 'Page Down' are far from where they usually are. I have been looking for another keyboard that is both reasonably compact and has a standard layout, but it seems the stores in my area all carry the same line of keyboards, either from Microsoft or Logitech, and none of them fully fit my wishes.

Backward matching with regular expressions


#!/usr/bin/python
# -*- coding: utf-8 -*-
#

s = """I wan't to do 'backward matching': In my case match a series of  numbers backwards, e.g., with "1 2 3 4" I want to pick out  "2 3 4" rather than "1 2 3" with a regular expression such as  "d+s+d+s+d+". Within Python and using re.findall() the  na??ve application gets you "1 2 3". One way would be to reverse  the string - in Python with s[::-1] - and write the regular  expression 'backwards'. However, a solution that seems less  complex simply extends the regular expression repeating the  subpattern with a 'match 0 or more time' preamble. In my case  that would read '(?:d+s+)*(d+s+d+s+d+)'. The code below  behaves properly with the string that this sentence is part of.  """  import re pattern1 = "d+s+d+s+d+" print(re.findall(pattern1, s))  pattern2 = "(?:d+s+)*(d+s+d+s+d+)" print(re.findall(pattern2, s))  #

Brede Wiki and Brede Database 2009


[Image: figure from the Frontiers in Neuroinformatics article (fninf-03-026-g010)]

I have just drafted a section for the CIMBI 2009 annual report:

We have argued for a wiki approach to database information from published neuroimaging articles [1], and we have now implemented the Brede Wiki, available from the web site http://neuro.imm.dtu.dk/wiki/

The wiki is based on MediaWiki – the software that runs Wikipedia. With extensive use of so-called MediaWiki templates, information can be structured and easily extracted [2]. The content of the wiki is focused on neuroscience information: text and data about neuroimaging studies, brain regions, topics, software, researchers, organizations, journals and events. The wiki makes extensive use of deep links to other neuroscience databases, enabling federation of content with other neuroinformatics databases. The Brede Wiki has almost 1,500 pages, e.g., describing 206 brain regions and 175 scientific papers. From the data extracted from the structured part of the Brede Wiki a small search interface has been constructed that allows searching for coordinates near a given query coordinate. The Brede Wiki also allows upload of volume files in a standardized format, thus providing an uncomplicated means for sharing result volumes from neuroimaging statistical analyses.
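
As an illustration of how information in template-structured wikitext can be extracted, here is a rough sketch; the "Paper" template and its fields are made up for the example and are not the actual Brede Wiki templates.

# Rough sketch: pull fields out of a MediaWiki template call.
# The template and field names are invented for illustration.
import re

wikitext = """
{{Paper
 | title = Lost in localization
 | journal = NeuroImage
 | year = 2009
}}
"""

body = re.search(r"\{\{Paper(.*?)\}\}", wikitext, re.DOTALL).group(1)
fields = dict(
    (key.strip(), value.strip())
    for key, value in re.findall(r"\|\s*(\w+)\s*=\s*([^|\}]*)", body)
)
print(fields)   # {'title': 'Lost in localization', 'journal': 'NeuroImage', 'year': '2009'}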

Another project of the group – the Brede Database – has now been included in the large-scale American database federation effort 'Neuroscience Information Framework'. Furthermore, some of the visualization efforts for the Brede Database were described in a recent article [3].

  1. Finn Årup Nielsen, "Lost in localization: A solution with neuroinformatics 2.0?", NeuroImage, 48:11–13, 2009.
  2. Finn Årup Nielsen, "Brede Wiki: Neuroscience data structured in a wiki", in Christoph Lange (ed.), Proceedings of the Fourth Workshop on Semantic Wikis – The Semantic Wiki Web, 6th European Semantic Web Conference, Hersonissos, Crete, Greece, June 2009.
  3. Finn Årup Nielsen, "Visualizing data mining results with the Brede tools", Frontiers in Neuroinformatics, 3:26, 2009.

The image is a figure from the CC BY licensed Frontiers in Neuroinformatics article [3].