Java program: count quantities of 5-letter combinations in human chromosome 21

Started by IchoTolot, November 06, 2016, 02:25:42 PM

Previous topic - Next topic

0 Members and 2 Guests are viewing this topic.

IchoTolot

Had a little programming homework for "Biological Databases" this week.

The task was:

- Count the quantities of all 5-letter comninations in the human chromosome 21.

There are 5 letters:   A T G C and N

Text will be as following: NNNNNNNATAGCGATCGTANNNNNNGCTAGGCNNNNNNNN........

For example:  NNNNNNTG      has "NNNNN" 2 times"NNNNT" 1 time and "NNNTG" 1 time

The input text is around 800000 lines with mostly 80 letters per line (the last one may have fewer), so the speed is critical here!

My first version would have taken 1,5 DAYS to do this!
This version here < 1 hour. Of course if this would be a marked homework which I must turn in I would seek some ways to do this even faster + improve my code further as it lacks annotation now + has english and german parts. Also the quick-sort algorithm is just a modified copy-pasted google result uptated to sort a pair of data and not just a simgle entry).

So what exactly does the java code?

It makes an array of Strings with all possible combinations (3125) of A,T,G,C,N and an equaly sized integer array with the counter to the 5 letter sequence initialised, with 0 of course.
Then a BufferedReader reads in the first line of the input text file into an input string.
The substring of the first 5 letters is compared to the string array and upon a match the counter is raised by 1.
This could be optimised by stopping the search for a match when a match is indeed found. Also "NNNNN" could be the first sequence to compare as it is by far the most common.
After raising the counter the first letter of the input string is deleted as it can go now.
When the input string has fewer than 5 letters the next line is added to the input string.
Aslo I print out a little status (in german :P) in which line the program currently is.
The loop continues until there is no more line to ad in the input text file.

After that I sort the result arrays with the modified copied quick-sort algorithm (that is also in german btw :P) and the sorted results are beeing written into an ouput text file "QuantityOfWords" by a BufferedWriter.

Simon mentioned a more robust way by only reading in single characters and count + delete when the string has reached 5 characters.

I attached the java code here + the output file for the whole human chromosom 21 (to be Opened with a text editor forgot the ending :-[).
The whole chromosome sequence is FASTA format would be ~ 45 MB so you need to find that on your own if you want it ;P.

Again this code is FAR FROM OPTIMAL and contains poor writing + mixed language. If this was sth that is marked or I must turn in there would be quite some work still to do :P




Simon

I don't have Java installed, therefore can't test.

The bookkeeping with two arrays looks slow: You compare (the found string) against every (string that you treat as an associative array index) in linear fashion, until you find an exact match. You might want binary search instead. Search the web for Java's HashMap.

Not going into code style because this is explicitly a throwaway program, written as quickly as possible. :lix-tongue:

-- Simon

Simon

I have written this in D, using an associative array for bookkeeping. See attachment for source. My program doesn't throw away comment lines, you must do that yourself. Icho's 45 MB input has one comment line in the first line, otherwise it's all data.

Icho's 45 MB input runs in 13 seconds on my 10-year-old laptop, even without optimizations enabled. :lix-evil: And I have checked the output against Icho's file, it's identical.

I have commented the source to explain D syntax that might not be obvious for Java or C programmers.

-- Simon

IchoTolot

And here we see the power of a hashtable :)

1 hr ---> 13 secs

down from 1,5 days :XD:

Colorful Arty

My Youtube channel where I let's play games with family-friendly commentary:
https://www.youtube.com/channel/UCiRPZ5j87ft_clSRLFCESQA

My Twitch channel: https://www.twitch.tv/colorfularty

My levelpack: SubLems
For New formats NeoLemmix: https://www.lemmingsforums.net/index.php?topic=4942.0
For Old formats NeoLemmix: http://www.lemmingsforums.net/index.php?topic=2787.0
For SuperLemmini: http://www.lemmingsforums.net/index.php?topic=2704.0

My levelpack: ArtLems
For New formats NeoLemmix: https://www.lemmingsforums.net/index.php?topic=4583.0

IchoTolot

Quote from: Colorful Arty on November 06, 2016, 08:04:27 PM
Are hash tables really that fast? O_O

It depens. For this they are excellent though. As each value has its own hash value and simply increase the counter in the bucket with the same hash value (as far as I understand them in my inner picture :P).

Nepster

Warning: Every language implements hash tables slightly differently, so be careful. Moreover I have only passing knowledge about java, but I am answering nevertheless :P

Quote from: Colorful Arty on November 06, 2016, 08:04:27 PM
Are hash tables really that fast? O_O
The answer is: It depends. For some purposes, they are very useful, for others less so.

So what are hash tables? Think of them as arrays, but not indexed by integers, but indexed by a key type that the programmer can specify. So in the example above, the type for the keys would be strings and whenever we find a substring "ACGCC", we would increase the counter via
MyHashTable["ACGCC"]++
without having to compute the index, where the counter for the string "ACGCC" is stored.

This has two important consequences:
1) Accessing random values via keys is extremely fast (assuming one uses a halfway decent hash function for the keys) and you have the flexibility to choose the most convenient objects as your keys. This is what hash tables are optimized for and this is exactly what we need in IchoTolot's homework: We get random substrings and want to access the corresponding variable counting the number of occurances.
2) The set of keys resp. elements of your hash table is not enumerated, so one cannot simply loop over all elements in the hash table. Even if your language supports constructs like
foreach (Element in MyHashTable) DoSomething;
it likey will first enumerate all elements and then call a usual for loop. So any action that requires looping through all elements, e.g. sorting, finding all elements satisfying a certain condition, summing up the values of all elements, ... are slower when compared to arrays (or similar types).

Colorful Arty

That's really cool! I should probably do more research into Hash Tables. We talked about them briefly in my data structures and algorithms course, but never programmed with them. They sound very useful though. Thanks!
My Youtube channel where I let's play games with family-friendly commentary:
https://www.youtube.com/channel/UCiRPZ5j87ft_clSRLFCESQA

My Twitch channel: https://www.twitch.tv/colorfularty

My levelpack: SubLems
For New formats NeoLemmix: https://www.lemmingsforums.net/index.php?topic=4942.0
For Old formats NeoLemmix: http://www.lemmingsforums.net/index.php?topic=2787.0
For SuperLemmini: http://www.lemmingsforums.net/index.php?topic=2704.0

My levelpack: ArtLems
For New formats NeoLemmix: https://www.lemmingsforums.net/index.php?topic=4583.0

Simon

Design of the main algorithm makes or breaks these programs, yeah. Nepster gave a great overview. Know the standard libraries in your languages of choice. :-)

I found When Haskell is not faster than C, Jacques Mattheij describes optimizing a FASTA tool. He knows optimizing far better than I do:

  • He reads 1 MB into memory at a time with fread, which is like my rawRead: Read until end-of-file or until the given number of chars are read. It's faster than reading from file until newline (you must look at each char) and by reading single chars (didn't test, but his intuition was that single-char-reading from files are slower).
  • He heeds section headers beginning with `>' and comments, and therefore needs some line-based processing after all. But he does this in memory on the large chunk, not with file-getting functions.
In the words of Colin Wright: "You can't make a computer do stuff faster, but you can make it do less work", and that really is the essence of optimization.

-- Simon