What is rsort?

rsort provides a library implementation of MSD radix sort, taken from the source code for OpenBSD's libc library, and a command-line filter which sorts the standard input to the standard output.

MSD radix sort is known to be optimally fast for in-memory sorting of binary strings, for a machine model which approximates many modern machines. The rsort program is a demonstration application of libradixsort. It reads standard input into memory, applies radixsort, and prints the results to standard output.

rsort is essentially optimal, if we treat the sort algorithm as a black box. That means that, according to gprof, rsort spends 100% of its time sorting. In actual tests rsort ran an average of about 10 times faster than GNU sort overall, with a maximum of about 25 times faster.

Getting rsort

rsort was last modified on August 15, 2000. The current version is 0.10.

You can download the latest version of rsort from this site (about 25K). You can also download it from sunsite at http://sunsite.unc.edu/pub/Linux/utils/file/.

Installation

	tar xvzf rsort-VERSION.tar.gz
	cd rsort-VERSION
	make
	make setup check
    

Using rsort: Sorting Tips

rsort sorts input strings in ASCII order. It has no fancy options; that's partly why it's so fast. In practical applications rsort is up to 25 times faster than GNU sort (depending on data and on file size). Here are some sorting tips. These tips are worth keeping in mind when designing things like log-file formats, whatever sorting tool you use.

  • ASCII-order sorting is identical to numeric sorting when number fields have a fixed length. For example:
    • Four-digit years, two-digit months, etc.
    • telephone numbers
    • UNIX-style dates between 1973-03-03 and about 9:46pm on 2001-09-08, or from then until sometime in 2287, or...
    • ZIP/postal codes
    • Zero- or space-padded numbers in a fixed range
  • ASCII-order sorting is identical to by-date sorting when the date format puts larger time units before smaller ones, for example ISO-format dates, dates of form "YYYY-MM-DD", or time formatted as "HH:MM:SS.XXX".
  • To eliminate duplicates, "rsort | uniq" is faster than "sort -u"--ten times faster, on one 115K test file.
  • For case-insensitive sorting, "tr A-Z a-z | rsort" is faster than "sort -f"--ten times faster on a 114K test file.
  • GNU sort can sort on subkeys; rsort cannot. However, if columns in a file occur in order of importance, then whole-line sorting is identical to subkey sorting. On one 115K file with three columns, sorting on columns 3, 1 and 2 in that order,

    awk '{print $3, $1, $2;}' | rsort | awk '{print $2, $3, $1;}'
    

    ran five times faster than "sort +2 -3 +0 -2".

Contact

Send any patches, bugs, complaints, free beer, etc to me. Len Budney lbudney@pobox.com

 

Top


Len Budney
lbudney@pobox.com
Copyright © 1998 - 2004
Page generated: 20:12:48 21-Dec-2004