Reply
Old May 23rd, 2003 Top | #1
Rexy
 
Rexy's Avatar
Unregistered
Posts: n/a

Default [C++] sorting MASSIVE amounts of data

In the C++ course i'm taking, my teacher gave us a project to sort massive amounts of records. We need to sort 10,000,000 (ten mil) records. Each record consists of 9 fields, area code, prefix, line number, 5-digit zip code, month number, day number, year number, GPA, First Name, and Last Name (2 strings, 1 double, 6 ints/shorts). a sample record could be "879 722 1794 60809 05 9 2522 4.45955 Timothy Solomon". The comparison will rely on nested fields. All 9 fields will be somewhere in the nesting order/hierarchy. The first obstacle is RAM. But the records are possible to be stored on the hdd (according to my calculations the records could take up to 560mbs). So the use of an external merge sort with two scratch files is an option. The problem is i tried doing this but the output file is messed up and contains weird characters and contains weird and irrelevant numbers for some reason. Also, it takes it quite a while to just sort 10,000 records. I thought maybe anyone here could help me with making this program. Our teacher gave us access to the small program he made to create random records. It is attached (i tried putting it here but it's too long).

Can anyone please help me with this program. It needs to be very efficient and fast and i'm not that great at optimization, never liked optimizing my code.

Thanks.
  Reply With Quote
Old May 26th, 2003 Top | #2
Blitzkrieg
 
Blitzkrieg's Avatar
Unregistered
Posts: n/a

Default

I'm not 100% sure on how to help on this, but I can say this. It has to process over 500MB's of data, theres no way it can do that in a very short amount of time. Not to mention, you will probably end up having to do some kind of a bubble sort, which will result in it taking even longer.

Anyways, good luck.
  Reply With Quote
Old May 26th, 2003 Top | #3

OSNN Folding Team  
Geffy's Avatar
OSNN Veteran Addict
Joined: March 2002
Location: United Kingdom
Posts: 7,805
Reputation: 1490
Power: 217

Default

Try posting this over at http://x-istence.com/forums/ its setup for precisely this kind of question


blogtumbloglastfmflickr#rubyonrails@twitter
"I could be replaced with a very small shell script"
Geffy is offline   Reply With Quote
Old May 26th, 2003 Top | #4

OSNN Folding Team  
Zedric's Avatar
NTFS Guru
Joined: January 2002
Location: Sweden
Posts: 4,006
Reputation: 890
Power: 175

Cool

I think bubble sort would be a bad idea. There is another kind of sort (the name escapes me) that is used by taking a value and then spilt the rest of the data in two, the data lower than the value in one and higher data in the other. Repeat this on both piles until it's sorted (giving you a tree structure over time). This should save you memory since you only work with one pile at the time. This may be Quicksort but I'm unsure.


Did I help you? Please use the reputation system. Click the icon on the left!
Proud host of the OSNN.net folding sigs. Want one? Check the folding thread!
http://zedric.no-ip.com/
Zedric is offline   Reply With Quote
Old May 27th, 2003 Top | #5
 
dave holbon's Avatar
OSNN Veteran Addict
Joined: May 2002
Location: London England
Posts: 1,014
Reputation: 140
Power: 133

Default

This is actually a very good database design question (for SQL databases NOT SEQUEL which is the programming language).

Say you have 25 million records to sort for (any) data output queries, remember that queries that circumnavigate these records don’t need to search all 25 million records (if Cods theories are correct and they are not) but only the indexes that you have so carefully created, and the relationships between them. That’s why you have split your data (and databases) over many servers so as to combine their assets.

For instance you would not save 25 million postcodes with the name and address of each person, indeed there are not 25 million postcodes for 25 million people and it’s this knowledge that allows you to design your distributed database correctly.

Following this logic will lead to the surprising revelation that using postcodes or (say) surnames (optimised) as an initial search criteria is an elimination process (millions of records at a time) and if carefully crafted can provide the results in a few seconds and not weeks as is currently the case.

Designing distributed databases in this way is not easy but one thing is for sure they can’t be kept on one computer, you need at least five to ten.

This is the art of not searching unwanted items!
dave holbon is offline   Reply With Quote
Old May 27th, 2003 Top | #6

OSNN Folding Team  
Geffy's Avatar
OSNN Veteran Addict
Joined: March 2002
Location: United Kingdom
Posts: 7,805
Reputation: 1490
Power: 217

Default

you always have good stuff to post dont you Dave


blogtumbloglastfmflickr#rubyonrails@twitter
"I could be replaced with a very small shell script"
Geffy is offline   Reply With Quote
Old May 27th, 2003 Top | #7
 
X-Istence's Avatar
*
Joined: December 2001
Location: USA
Posts: 6,496
Reputation: 2808
Power: 220

Default

So basically you need to read from a file, sort the stuff and dump it back into a file?

I at the moment would have a clue on how to go about it.

Like someone allready said, it would take a lot of CPU power, and memory.

What do you need to sort it by? Names, GPA or just names and GPA and all that other stuff?.
X-Istence is offline   Reply With Quote
Old May 27th, 2003 Top | #8
 
X-Istence's Avatar
*
Joined: December 2001
Location: USA
Posts: 6,496
Reputation: 2808
Power: 220

Default

Question:

could you please compile that C++ file ( I cant, i dont have the apstring.h and another file ) and put the exe you get in a file and attach it here. I would like to download it, and test if i can get this sort of thing done.
X-Istence is offline   Reply With Quote
Old August 19th, 2003 Top | #9
 
X-Istence's Avatar
*
Joined: December 2001
Location: USA
Posts: 6,496
Reputation: 2808
Power: 220

Default

Okay, so this problem is one that i encountered at school as well, and i had the summer to thing about how to make it work.

So other C++ people help me out.

What i am thinking of is creating a vector. ( A vector is like an array but it can dynamically become bigger and smaller ) then reading the entire file into the vector ( memory ineffecient but we dont care about that ) and then use the sort() function that can be used on listboxes with random iterators.

This would cause the vector to be sorted after a few hours depending on your CPU, then after that we open a new file, and write the data from the vector straight back into a file. Since its now sorted its all good =).

Now if we had to sort by highest SAT scrore to lowest it would be more of a problem, here we just sort the entire string that is in the vector.
X-Istence is offline   Reply With Quote
Old August 22nd, 2003 Top | #10
Xrs2
 
Xrs2's Avatar
Unregistered
Posts: n/a

Default

I would use a bin sort for this...
If you are not familiar with this I will explain how it could be implemented in your situation.

Ok a bin sort there are different containers
1
2
3
4
5
6
7
So there are 7 bins, and say you would like to sort out these numbers

27
52
13

Ok sounds simple enough you just check the ones column first so they would all go in there bins like this

1
2 - 52
3 - 13
4
5
6
7 - 27

Ok now you would start from the top of this new list and and put them in order by whats in the 10's place

1 - 13
2 - 27
3
4
5 - 52
6
7

It read the 5 in 52 and said it must go in the 5th spot
it read the 2 in 27 and said it must go in the 2nd spot
it read the 1 in 13 and said it must go in the 1st spot


You can go up to any place you want with this. I have seen programs that go upto the 10 millions place
This is so very powerful because your bins can be whatever you want you just have to take the string your sorting and parse it all up in to the peices you want. I hope this may have helped, or maybe it hurt, if you have any more questions just email me at xrs2_2srx@yahoo.com, as I am not much of a member of this board, but was directed here from Bit-Quest.com
  Reply With Quote

Reply

Bookmarks

Thread Tools

Posting Rules

Similar Threads
Thread Thread Starter Forum Replies Last Post
Inordinate amounts of spam from system administrator Unleashed Windows Server Systems 1 December 11th, 2007 3:29pm
sorting mySQL jimi_81 Web Design & Coding 5 May 6th, 2005 12:36am
mp3 sorting by id3 tag Abbadon2001 Windows Desktop Systems 7 March 6th, 2005 7:10pm
I recently ordered massive amounts of computer neoterixx General Hardware 39 July 17th, 2003 1:51pm
sorting favorites bisher Windows Desktop Systems 1 August 24th, 2002 3:48pm