[C++] sorting MASSIVE amounts of data

Discussion in 'Web Design & Coding' started by Rexy, May 23, 2003.

  1. Rexy

    Rexy Guest

    In the C++ course i'm taking, my teacher gave us a project to sort massive amounts of records. We need to sort 10,000,000 (ten mil) records. Each record consists of 9 fields, area code, prefix, line number, 5-digit zip code, month number, day number, year number, GPA, First Name, and Last Name (2 strings, 1 double, 6 ints/shorts). a sample record could be "879 722 1794 60809 05 9 2522 4.45955 Timothy Solomon". The comparison will rely on nested fields. All 9 fields will be somewhere in the nesting order/hierarchy. The first obstacle is RAM. But the records are possible to be stored on the hdd (according to my calculations the records could take up to 560mbs). So the use of an external merge sort with two scratch files is an option. The problem is i tried doing this but the output file is messed up and contains weird characters and contains weird and irrelevant numbers for some reason. Also, it takes it quite a while to just sort 10,000 records. I thought maybe anyone here could help me with making this program. Our teacher gave us access to the small program he made to create random records. It is attached (i tried putting it here but it's too long).

    Can anyone please help me with this program. It needs to be very efficient and fast and i'm not that great at optimization, never liked optimizing my code.

    Thanks.
     
  2. Blitzkrieg

    Blitzkrieg Guest

    I'm not 100% sure on how to help on this, but I can say this. It has to process over 500MB's of data, theres no way it can do that in a very short amount of time. Not to mention, you will probably end up having to do some kind of a bubble sort, which will result in it taking even longer.

    Anyways, good luck.
     
  3. Geffy

    Geffy Moderator Folding Team

    Messages:
    7,805
    Location:
    United Kingdom
  4. Zedric

    Zedric NTFS Guru Folding Team

    Messages:
    4,006
    Location:
    Sweden
    I think bubble sort would be a bad idea. There is another kind of sort (the name escapes me) that is used by taking a value and then spilt the rest of the data in two, the data lower than the value in one and higher data in the other. Repeat this on both piles until it's sorted (giving you a tree structure over time). This should save you memory since you only work with one pile at the time. This may be Quicksort but I'm unsure.
     
  5. dave holbon

    dave holbon Moderator

    Messages:
    1,014
    Location:
    London England
    This is actually a very good database design question (for SQL databases NOT SEQUEL which is the programming language).

    Say you have 25 million records to sort for (any) data output queries, remember that queries that circumnavigate these records don’t need to search all 25 million records (if Cods theories are correct and they are not) but only the indexes that you have so carefully created, and the relationships between them. That’s why you have split your data (and databases) over many servers so as to combine their assets.

    For instance you would not save 25 million postcodes with the name and address of each person, indeed there are not 25 million postcodes for 25 million people and it’s this knowledge that allows you to design your distributed database correctly.

    Following this logic will lead to the surprising revelation that using postcodes or (say) surnames (optimised) as an initial search criteria is an elimination process (millions of records at a time) and if carefully crafted can provide the results in a few seconds and not weeks as is currently the case.

    Designing distributed databases in this way is not easy but one thing is for sure they can’t be kept on one computer, you need at least five to ten.

    This is the art of not searching unwanted items!
     
  6. Geffy

    Geffy Moderator Folding Team

    Messages:
    7,805
    Location:
    United Kingdom
    you always have good stuff to post dont you Dave
     
  7. X-Istence

    X-Istence * Political User

    Messages:
    6,498
    Location:
    USA
    So basically you need to read from a file, sort the stuff and dump it back into a file?

    I at the moment would have a clue on how to go about it.

    Like someone allready said, it would take a lot of CPU power, and memory.

    What do you need to sort it by? Names, GPA or just names and GPA and all that other stuff?.
     
  8. X-Istence

    X-Istence * Political User

    Messages:
    6,498
    Location:
    USA
    Question:

    could you please compile that C++ file ( I cant, i dont have the apstring.h and another file ) and put the exe you get in a file and attach it here. I would like to download it, and test if i can get this sort of thing done.
     
  9. X-Istence

    X-Istence * Political User

    Messages:
    6,498
    Location:
    USA
    Okay, so this problem is one that i encountered at school as well, and i had the summer to thing about how to make it work.

    So other C++ people help me out.

    What i am thinking of is creating a vector. ( A vector is like an array but it can dynamically become bigger and smaller ) then reading the entire file into the vector ( memory ineffecient but we dont care about that ) and then use the sort() function that can be used on listboxes with random iterators.

    This would cause the vector to be sorted after a few hours depending on your CPU, then after that we open a new file, and write the data from the vector straight back into a file. Since its now sorted its all good =).

    Now if we had to sort by highest SAT scrore to lowest it would be more of a problem, here we just sort the entire string that is in the vector.
     
  10. Xrs2

    Xrs2 Guest

    I would use a bin sort for this...
    If you are not familiar with this I will explain how it could be implemented in your situation.

    Ok a bin sort there are different containers
    1
    2
    3
    4
    5
    6
    7
    So there are 7 bins, and say you would like to sort out these numbers

    27
    52
    13

    Ok sounds simple enough you just check the ones column first so they would all go in there bins like this

    1
    2 - 52
    3 - 13
    4
    5
    6
    7 - 27

    Ok now you would start from the top of this new list and and put them in order by whats in the 10's place

    1 - 13
    2 - 27
    3
    4
    5 - 52
    6
    7

    It read the 5 in 52 and said it must go in the 5th spot
    it read the 2 in 27 and said it must go in the 2nd spot
    it read the 1 in 13 and said it must go in the 1st spot


    You can go up to any place you want with this. I have seen programs that go upto the 10 millions place
    This is so very powerful because your bins can be whatever you want you just have to take the string your sorting and parse it all up in to the peices you want. I hope this may have helped, or maybe it hurt, if you have any more questions just email me at xrs2_2srx@yahoo.com, as I am not much of a member of this board, but was directed here from Bit-Quest.com