[C++] sorting MASSIVE amounts of data

R

Rexy

Guest
In the C++ course i'm taking, my teacher gave us a project to sort massive amounts of records. We need to sort 10,000,000 (ten mil) records. Each record consists of 9 fields, area code, prefix, line number, 5-digit zip code, month number, day number, year number, GPA, First Name, and Last Name (2 strings, 1 double, 6 ints/shorts). a sample record could be "879 722 1794 60809 05 9 2522 4.45955 Timothy Solomon". The comparison will rely on nested fields. All 9 fields will be somewhere in the nesting order/hierarchy. The first obstacle is RAM. But the records are possible to be stored on the hdd (according to my calculations the records could take up to 560mbs). So the use of an external merge sort with two scratch files is an option. The problem is i tried doing this but the output file is messed up and contains weird characters and contains weird and irrelevant numbers for some reason. Also, it takes it quite a while to just sort 10,000 records. I thought maybe anyone here could help me with making this program. Our teacher gave us access to the small program he made to create random records. It is attached (i tried putting it here but it's too long).

Can anyone please help me with this program. It needs to be very efficient and fast and i'm not that great at optimization, never liked optimizing my code.

Thanks.
 
I'm not 100% sure on how to help on this, but I can say this. It has to process over 500MB's of data, theres no way it can do that in a very short amount of time. Not to mention, you will probably end up having to do some kind of a bubble sort, which will result in it taking even longer.

Anyways, good luck.
 
I think bubble sort would be a bad idea. There is another kind of sort (the name escapes me) that is used by taking a value and then spilt the rest of the data in two, the data lower than the value in one and higher data in the other. Repeat this on both piles until it's sorted (giving you a tree structure over time). This should save you memory since you only work with one pile at the time. This may be Quicksort but I'm unsure.
 
This is actually a very good database design question (for SQL databases NOT SEQUEL which is the programming language).

Say you have 25 million records to sort for (any) data output queries, remember that queries that circumnavigate these records don’t need to search all 25 million records (if Cods theories are correct and they are not) but only the indexes that you have so carefully created, and the relationships between them. That’s why you have split your data (and databases) over many servers so as to combine their assets.

For instance you would not save 25 million postcodes with the name and address of each person, indeed there are not 25 million postcodes for 25 million people and it’s this knowledge that allows you to design your distributed database correctly.

Following this logic will lead to the surprising revelation that using postcodes or (say) surnames (optimised) as an initial search criteria is an elimination process (millions of records at a time) and if carefully crafted can provide the results in a few seconds and not weeks as is currently the case.

Designing distributed databases in this way is not easy but one thing is for sure they can’t be kept on one computer, you need at least five to ten.

This is the art of not searching unwanted items!
 
you always have good stuff to post dont you Dave
 
So basically you need to read from a file, sort the stuff and dump it back into a file?

I at the moment would have a clue on how to go about it.

Like someone allready said, it would take a lot of CPU power, and memory.

What do you need to sort it by? Names, GPA or just names and GPA and all that other stuff?.
 
Question:

could you please compile that C++ file ( I cant, i dont have the apstring.h and another file ) and put the exe you get in a file and attach it here. I would like to download it, and test if i can get this sort of thing done.
 
Okay, so this problem is one that i encountered at school as well, and i had the summer to thing about how to make it work.

So other C++ people help me out.

What i am thinking of is creating a vector. ( A vector is like an array but it can dynamically become bigger and smaller ) then reading the entire file into the vector ( memory ineffecient but we dont care about that ) and then use the sort() function that can be used on listboxes with random iterators.

This would cause the vector to be sorted after a few hours depending on your CPU, then after that we open a new file, and write the data from the vector straight back into a file. Since its now sorted its all good =).

Now if we had to sort by highest SAT scrore to lowest it would be more of a problem, here we just sort the entire string that is in the vector.
 
I would use a bin sort for this...
If you are not familiar with this I will explain how it could be implemented in your situation.

Ok a bin sort there are different containers
1
2
3
4
5
6
7
So there are 7 bins, and say you would like to sort out these numbers

27
52
13

Ok sounds simple enough you just check the ones column first so they would all go in there bins like this

1
2 - 52
3 - 13
4
5
6
7 - 27

Ok now you would start from the top of this new list and and put them in order by whats in the 10's place

1 - 13
2 - 27
3
4
5 - 52
6
7

It read the 5 in 52 and said it must go in the 5th spot
it read the 2 in 27 and said it must go in the 2nd spot
it read the 1 in 13 and said it must go in the 1st spot


You can go up to any place you want with this. I have seen programs that go upto the 10 millions place
This is so very powerful because your bins can be whatever you want you just have to take the string your sorting and parse it all up in to the peices you want. I hope this may have helped, or maybe it hurt, if you have any more questions just email me at xrs2_2srx@yahoo.com, as I am not much of a member of this board, but was directed here from Bit-Quest.com
 

Members online

No members online now.

Latest profile posts

Also Hi EP and people. I found this place again while looking through a oooollllllldddd backup. I have filled over 10TB and was looking at my collection of antiques. Any bids on the 500Mhz Win 95 fix?
Any of the SP crew still out there?
Xie wrote on Electronic Punk's profile.
Impressed you have kept this alive this long EP! So many sites have come and gone. :(

Just did some crude math and I apparently joined almost 18yrs ago, how is that possible???
hello peeps... is been some time since i last came here.
Electronic Punk wrote on Sazar's profile.
Rest in peace my friend, been trying to find you and finally did in the worst way imaginable.

Forum statistics

Threads
62,015
Messages
673,494
Members
5,621
Latest member
naeemsafi
Back