Sort::Naturally -- sort lexically, but sort numeral parts numerically |
Sort::Naturally -- sort lexically, but sort numeral parts numerically
@them = nsort(qw( foo12a foo12z foo13a foo 14 9x foo12 fooa foolio Foolio Foo12a )); print join(' ', @them), "\n";
Prints:
9x 14 foo fooa foolio Foolio foo12 foo12a Foo12a foo12z foo13a
(Or ``foo12a'' + ``Foo12a'' and ``foolio'' + ``Foolio'' and might be switched, depending on your locale.)
This module exports two functions, nsort
and ncmp
; they are used
in implementing my idea of a ``natural sorting'' algorithm. Under natural
sorting, numeric substrings are compared numerically, and other
word-characters are compared lexically.
This is the way I define natural sorting:
use locale
statement.
When comparing two strings, where a numeric substring in one
place is not up against a numeric substring in another,
the non-numeric always comes first. This is fudged by
reading pretending that the lack of a number substring has
the value -1, like so:
foo => "foo", -1 foobar => "foo", -1, "bar" foo13 => "foo", 13, foo13xyz => "foo", 13, "xyz"
That's so that ``foo'' will come before ``foo13'', which will come before ``foobar''.
The start of a string is exceptional: leading non-\W (non-word, non-digit) components are are ignored, and numbers come before letters. I define ``numeric substring'' just as sequences matching m/\d+/ -- scientific notation, commas, decimals, etc., are not seen. If your data has thousands separators in numbers (``20,000 Leagues Under The Sea'' or ``20.000 lieues sous les mers''), consider stripping them before feeding them tonsort
or
ncmp
.
This function takes a list of strings, and returns a copy of the list, sorted.
This is what most people will want to use:
@stuff = nsort(...list...);
When nsort needs to compare non-numeric substrings, it
uses Perl's lc
function in scope of a <use locale>.
And when nsort needs to lowercase things, it uses Perl's
lc
function in scope of a <use locale>. If you want nsort
to use other functions instead, you can specify them in
an arrayref as the first argument to nsort:
@stuff = nsort( [ \&string_comparator, # optional \&lowercaser_function # optional ], ...list... );
If you want to specify a string comparator but no lowercaser,
then the options list is [\&comparator, '']
or
[\&comparator]
. If you want to specify no string comparator
but a lowercaser, then the options list is
['', \&lowercaser]
.
Any comparator you specify is called as
$comparator->($left, $right)
,
and, like a normal Perl cmp
replacement, must return
-1, 0, or 1 depending on whether the left argument is stringwise
less than, equal to, or greater than the right argument.
Any lowercaser function you specify is called as
$lowercased = $lowercaser->($original)
. The routine
must not modify its $_[0]
.
Often, when sorting non-string values like this:
@objects_sorted = sort { $a->tag cmp $b->tag } @objects;
...or even in a Schwartzian transform, like this:
@strings = map $_->[0] sort { $a->[1] cmp $b->[1] } map { [$_, make_a_sort_key_from($_) ] @_ ;
...you wight want something that replaces not sort
, but cmp
.
That's what Sort::Naturally's ncmp
function is for. Call it with
the syntax ncmp($left,$right)
instead of $left cmp $right
,
but otherwise it's a fine replacement:
@objects_sorted = sort { ncmp($a->tag,$b->tag) } @objects;
@strings = map $_->[0] sort { ncmp($a->[1], $b->[1]) } map { [$_, make_a_sort_key_from($_) ] @_ ;
Just as with nsort
can take different a string-comparator
and/or lowercaser, you can do the same with ncmp
, by passing
an arrayref as the first argument:
ncmp( [ \&string_comparator, # optional \&lowercaser_function # optional ], $left, $right )
You might get string comparators from Sort::ArbBiLex.
nsort
) which I eventually
discarded in favor of my algorithm. If you are having to
naturally-sort very large data sets, and sorting is getting
ridiculously slow, you might consider trying one of those
discarded functions -- I have a feeling they might be faster on
large data sets. Benchmark them on your data and see. (Unless
you need the speed, don't bother. Hint: substitute sort
for nsort
in your code, and unless your program speeds up
drastically, it's not the sorting that's slowing things down.
But if it is nsort
that's slowing things down, consider
just:
if(@set >= SOME_VERY_BIG_NUMBER) { no locale; # vroom vroom @sorted = sort(@set); # feh, good enough } elsif(@set >= SOME_BIG_NUMBER) { use locale; @sorted = sort(@set); # feh, good enough } else { # but keep it pretty for normal cases @sorted = nsort(@set); }If you do adapt the routines in this module, email me; I'd just be interested in hearing about it. Thanks to the EFNet #perl people for encouraging this module, especially magister and a-mused.
Copyright 2001, Sean M. Burke sburke@cpan.org
, all rights
reserved. This program is free software; you can redistribute it
and/or modify it under the same terms as Perl itself.
This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.
Sean M. Burke sburke@cpan.org
Sort::Naturally -- sort lexically, but sort numeral parts numerically |