
31st December 2009, 07:12 AM
Registered User
Join Date: Nov 2007
Posts: 20

File name globbing no longer case sensitive in Fedora 12?
Some of my files and directories mysteriously disappeared and some of my shell scripts started failing after the upgrade to Fedora 12. After some debugging I found that file name globbing is no longer case sensitive in Fedora 12. That is,
rm -rf [a-z]*
now also deletes all files and directories starting with [A-Z], which explains the missing files and directories,
and
ls [a-z]*
now also lists files and directories starting with [A-Z], which caused my shell scripts to fail.
Is this a bug or a 'feature'?
Gijsbert

31st December 2009, 07:25 AM
Registered User
Join Date: Sep 2006
Posts: 52

Run shopt -u nocaseglob and see if that helps.

31st December 2009, 08:12 AM
Banned (for/from) behaving just like everybody else!
Join Date: Jul 2007
Location: Beijing, China
Posts: 1,307

This is probably a huge bug in BASH 4. Even with nocaseglob off (the default), [a-z]* matches names starting with upper-case letters (this is with LC_COLLATE=C).
---------- Post added at 04:12 PM CST ---------- Previous post was at 03:41 PM CST ----------
BTW, [[:lower:]] works as intended.
__________________
I believe in nerditarianism. I read FedoraForum for the Fedora-related posts.

31st December 2009, 08:32 AM
Registered User
Join Date: Nov 2007
Posts: 20

Meanwhile, I have found a workaround. If you set the environment variable LC_ALL to C with
export LC_ALL=C
file name globbing is case sensitive again.
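For anyone who would rather not change the locale of the whole session, here is a minimal Python sketch of the same workaround. It assumes a POSIX sh is on the PATH and sets LC_ALL=C only for the child shell that expands the glob. The helper name glob_in_c_locale and the file names are made up for illustration.

```python
import os
import subprocess
import tempfile


def glob_in_c_locale(pattern, directory):
    """Expand a shell glob in `directory` with LC_ALL=C (hypothetical helper).

    The pattern is pasted into a shell command, so only pass trusted patterns.
    """
    env = dict(os.environ, LC_ALL="C")  # override every locale category, child process only
    # printf prints one match per line; the pattern stays unquoted so the shell expands it.
    script = "printf '%s\\n' " + pattern
    result = subprocess.run(
        ["sh", "-c", script],
        cwd=directory, env=env, check=True,
        stdout=subprocess.PIPE, universal_newlines=True,
    )
    return result.stdout.splitlines()


with tempfile.TemporaryDirectory() as scratch:
    for name in ("alpha", "Beta", "gamma"):  # illustrative file names
        open(os.path.join(scratch, name), "w").close()
    matches = glob_in_c_locale("[a-z]*", scratch)
    print(matches)  # in the C locale only the lower-case names should be listed
```

The parent process keeps whatever locale it had, so sorting and messages elsewhere are unaffected.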
Gijsbert

31st December 2009, 09:17 AM
Banned (for/from) behaving just like everybody else!
Join Date: Jul 2007
Location: Beijing, China
Posts: 1,307

Oops, I made a mistake there: I forgot to export LC_COLLATE.
And it seems to me that this is no bug, but intended behavior. Since Fedora uses Unicode for most locales nowadays, care must be taken with the sorting order of characters. As the Unicode docs say ( http://unicode.org/reports/tr10/#Multi_Level_Comparison), collation order can be completely different from alphabetical (or Unicode code point) order. By assuming that collation order == code point order, we were led to the conclusion that the range [a-z] doesn't cover upper-case letters (in any charset backward-compatible with ASCII, including Unicode). However, it is the collation order that determines whether a range is valid and, if it is valid, what it does and doesn't cover.
The moral: our minds are not yet fully Unicode-compatible. In portable programs we should use range expressions sparingly and prefer POSIX character classes such as [[:lower:]].
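To see why the ordering matters, here is a small Python sketch. It does not query the system locale. Instead it assumes a simplified collating sequence in which every lower-case letter is immediately followed by its upper-case form (aAbB...zZ), which I believe is roughly how the UTF-8 locales ordered letters at the time; treat that as an assumption. The names covered, in_range, CODEPOINT_ORDER and INTERLEAVED_ORDER are hypothetical.

```python
import string

# Two orderings of the same 52 ASCII letters.
CODEPOINT_ORDER = string.ascii_uppercase + string.ascii_lowercase  # A..Z then a..z
# Assumed, simplified collation order: a A b B ... z Z
INTERLEAVED_ORDER = "".join(
    lower + upper
    for lower, upper in zip(string.ascii_lowercase, string.ascii_uppercase)
)


def in_range(ch, start, end, order):
    """True if ch lies between start and end (inclusive) in the given ordering."""
    return order.index(start) <= order.index(ch) <= order.index(end)


def covered(start, end, order):
    """All letters that the range start-end covers under `order`."""
    return "".join(c for c in order if in_range(c, start, end, order))


print(covered("a", "z", CODEPOINT_ORDER))    # the 26 lower-case letters only
print(covered("a", "z", INTERLEAVED_ORDER))  # every letter except 'Z'
print(covered("A", "Z", INTERLEAVED_ORDER))  # every letter except 'a'
```

Under the interleaved ordering a lower-case 'h' sits inside A-Z, which is the kind of result that looks like case-insensitive matching even though no case folding is involved.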

31st December 2009, 12:28 PM
Retired Community Manager -- Banned from Texas by popular demand.
Join Date: Sep 2007
Location: NYC
Posts: 8,142

Ah brilliant. Let's change the way Unix has done things for years--and better yet, let's only put the information somewhere that you probably won't look until it bites you.
And the best part of all is that it *is* documented somewhere, so we can call newcomers (and old timers) stupid for not having found it.
__________________
--
http://home.roadrunner.com/~computertaijutsu
Do NOT PM forum members with requests for technical support. Ask your questions on the forum.
"I don't know why there is the constant push to break any semblance of compatibility" --anon

31st December 2009, 01:06 PM
Banned (for/from) behaving just like everybody else!
Join Date: Jul 2007
Location: Beijing, China
Posts: 1,307

Well, we learn something new every day.
And big "old-skool" Unix fan as I am (or think I am), I have to accept that Unix culture must adapt to today's people and their needs. Unix used to be The Way, but ways change. Unix used to be ASCII-only, but these days many people (me included) simply can't live without Unicode support, so I'll have to understand it better.
And BTW, I really admire the goal of the Unicode project -- breaking cultural barriers for computing.

31st December 2009, 01:58 PM
Retired Community Manager -- Banned from Texas by popular demand.
Join Date: Sep 2007
Location: NYC
Posts: 8,142

As someone who needs Japanese, I agree. However, the case sensitivity change will, as noted in the post that started this thread, cause some unexpected results, and I haven't seen it mentioned in many places.
I agree--Unicode is a good thing. As it is, even my wife can use her netbook with Ubuntu, with ease, thanks to Unicode. Changing back and forth for her is only a little more difficult than doing it with a Mac.
My annoyance with it isn't so much the change as the lack of its mention.

31st December 2009, 07:55 PM
Registered User
Join Date: Jul 2009
Location: London,England
Posts: 1,095

If it's not a bug, it's certainly illogical, and it's not POSIX-correct. bash needs to be compiled with -DUSE_POSIX_GLOB_LIBRARY for the correct behaviour. Python is OK, for example:
Code:
$ echo h | python -c 'import glob; print(glob.glob("[A-Z]"))'
[]
$ echo h | python -c 'import glob; print(glob.glob("[a-z]"))'
['h']
bash (in Fedora 12) isn't, and neither is grep:
Code:
$ echo h | grep '[A-Z]'
h
$ echo h | grep '[a-z]'
h
which doesn't seem very sensible.
See https://bugzilla.redhat.com/show_bug.cgi?id=217359

1st January 2010, 05:05 AM
Banned (for/from) behaving just like everybody else!
Join Date: Jul 2007
Location: Beijing, China
Posts: 1,307

You can't use Python's glob module that way. The glob.glob() function takes a pathname pattern as its argument and searches the filesystem for paths matching it. Unlike grep(1), it does not filter its standard input.
Nonetheless, there are discrepancies between the globbing behavior of Fedora's BASH and Python under the same LC_COLLATE setting. I tested it under Python 2.6, and it turned out that Python's glob.glob() never includes capital letters in the range [a-z] even if LC_COLLATE is not set to C. They agree with each other only under the "POSIX" (aka "C") locale.
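The code-point behaviour of Python's matcher can be checked without touching the filesystem. A small sketch using fnmatch.fnmatchcase, the case-sensitive matcher from the standard-library fnmatch module; the sample names are made up.

```python
import fnmatch

names = ["alpha", "Beta", "gamma", "Delta"]  # illustrative names, not real files

# fnmatchcase compares characters by code point and does not consult LC_COLLATE.
lower_start = [n for n in names if fnmatch.fnmatchcase(n, "[a-z]*")]
upper_start = [n for n in names if fnmatch.fnmatchcase(n, "[A-Z]*")]

print(lower_start)  # ['alpha', 'gamma']
print(upper_start)  # ['Beta', 'Delta']
```

The result is the same whatever LC_COLLATE is set to, which matches what I observed with glob.glob().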
The problem is that Python's documentation for the glob module claims neither POSIX compliance nor respect for the LC_COLLATE environment variable, so I don't think it's a good measure of "sensibleness" when it comes to the semantics of range expressions in globs. Actually, Python is known to have problems with collation, and an example is given here ( http://www.cmlenz.net/archives/2008/...thon#collation). From the linked article:
Quote:
Unfortunately, Python does not (yet) come with support for unicode collation, and instead uses the code point comparison approach.
On the other hand, BASH's doc says it will respect the current LC_COLLATE setting as required by POSIX.
grep(1) is a slightly different beast, since its pattern is interpreted as a regular expression, not a glob. And here is a gem from the POSIX standard on regular expressions:
Quote:
In the POSIX locale, a range expression represents the set of collating elements that fall between two elements in the collation sequence, inclusive. In other locales, a range expression has unspecified behavior: strictly conforming applications shall not rely on whether the range expression is valid, or on the set of collating elements matched.
( http://www.opengroup.org/onlinepubs/...bd_chap09.html)
The phrase "unspecified behavior" does not mean that Unix apps such as BASH and grep can do whatever they want. In fact, the POSIX standard for the shell and grep requires them to honor LC_COLLATE.
Further digging into the POSIX standard reveals this:
Quote:
The description of basic regular expression bracket expressions in the Base Definitions volume of IEEE Std 1003.1-2001, Section 9.3.5, RE Bracket Expression (which is quoted above -- poster) shall also apply to the pattern bracket expression, except that the exclamation mark character ( '!' ) shall replace the circumflex character ( '^' ) in its role in a "non-matching list" in the regular expression notation.
( http://www.opengroup.org/onlinepubs/...l#tag_02_13_01)
Ahh, so range expressions in globs and range expressions in regexps are mostly identical, differing only in the syntax of negation, which is irrelevant here anyway. That explains the agreement between BASH and grep.
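As a rough illustration of how close the two notations are, Python's fnmatch.translate turns a glob into a regular expression and rewrites the '!' negation as '^'. This is Python's own translation, not the POSIX one, and it ignores LC_COLLATE. The exact regex text it emits varies between Python versions, so the sketch only checks behaviour.

```python
import fnmatch
import re

glob_pattern = "[!a-z]*"                      # glob: first character NOT in a-z
regex_text = fnmatch.translate(glob_pattern)  # regex equivalent; the bracket becomes [^a-z]
regex = re.compile(regex_text)

print(regex_text)
print(bool(regex.match("Beta")))   # True: 'B' is outside a-z by code point
print(bool(regex.match("alpha")))  # False: 'a' is inside a-z
```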
And to me, it appears that both BASH and grep are using the locale information correctly (see file /usr/share/i18n/locales/iso14651_t1_common, the one included by most xx_XX.UTF-8 locale definition files, for the gory details). So do other commonly used tools e.g. sort and ls.
So no, I don't think BASH or grep is at fault here. I don't know whether Python's range rule is POSIX-compliant or not, but it seems to be at odds with most Unix tools (shell/grep/sort/etc...)

1st January 2010, 10:45 AM
Registered User
Join Date: Jul 2009
Location: London,England
Posts: 1,095

The Python example was simplified for illustration; we could also do:
Code:
$ mkdir globtest
$ cd globtest
$ touch h
$ python -c 'import glob; print(glob.glob("[A-Z]"))'
[]
$ python -c 'import glob; print(glob.glob("[a-z]"))'
['h']
$ locale
LANG=en_GB.UTF-8
LC_CTYPE="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
...
Python's glob module uses the fnmatch function:
Quote:
The glob module finds all the pathnames matching a specified pattern according to the rules used by the Unix shell. No tilde expansion is done, but *, ?, and character ranges expressed with [] will be correctly matched. This is done by using the os.listdir() and fnmatch.fnmatch() functions in concert, and not by actually invoking a subshell. (For tilde and shell variable expansion, use os.path.expanduser() and os.path.expandvars().)
Whatever the deficiencies of Python's globbing with respect to Unicode, I think most people would expect that in all cases the range '[A-Z]' matches only A-Z, and not lower-case letters.
This non-intuitive behaviour introduced by Unicode collation is not sensible IMHO, and there are other significant issues, such as Unicode locales causing tools like grep to slow down massively (there are several bug reports).
Unicode support is nice, but you'd think they'd manage to implement it without ****ing up the basic tools performing basic tasks. And with the discrepancies among the various tools, it looks like a bit of a mess.
Last edited by Gödel; 1st January 2010 at 10:50 AM.

1st January 2010, 11:34 AM
Banned (for/from) behaving just like everybody else!
Join Date: Jul 2007
Location: Beijing, China
Posts: 1,307

The GNU tools have behaved quite consistently with each other, and to me they don't appear to be ****ed up. I'm quite happy with grep as long as it works correctly and fast enough (for me).
Different people have different ideas of what is intuitive. If you can convince the glibc developers that changing the glibc collation definitions is good for the public, go for it. Be warned! "There Be Ulrich Drepper"
And there are always the wctype(3)-compatible character classes (e.g. [[:lower:]] or [[:upper:]]). I for one find them the true solution to most regexp/glob problems where ranges are expected. YMMV.
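For scripts written in Python rather than the shell: as far as I know, fnmatch patterns do not support [[:lower:]], so a rough equivalent is to test the case property of the first character directly. A sketch with made-up names follows. Note that str.islower and str.isupper follow Unicode case properties rather than the C library's LC_CTYPE tables, so treat this as an approximation of the POSIX classes.

```python
names = ["alpha", "Beta", "gamma", "Delta", "ångström", "Ärger"]  # illustrative names


def starts_lower(name):
    """Rough stand-in for the glob [[:lower:]]*: first character is a lower-case letter."""
    return bool(name) and name[0].islower()


def starts_upper(name):
    """Rough stand-in for the glob [[:upper:]]*: first character is an upper-case letter."""
    return bool(name) and name[0].isupper()


lower_names = [n for n in names if starts_lower(n)]
upper_names = [n for n in names if starts_upper(n)]

print(lower_names)  # ['alpha', 'gamma', 'ångström']
print(upper_names)  # ['Beta', 'Delta', 'Ärger']
```

Unlike an a-z range, this also classifies letters outside ASCII, which is the point of the character classes.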

1st January 2010, 01:51 PM
Registered User
Join Date: Jul 2009
Location: London,England
Posts: 1,095

Ah ok, 'man 7 glob' explains the situation as you described (I did 'man glob' and was reading the section 3 man page):
Quote:
Character classes and Internationalization
Of course ranges were originally meant to be ASCII ranges, so that '[ -%]' stands for '[ !"#$%]' and '[a-z]' stands for "any lowercase letter". Some Unix implementations generalized this so that a range X-Y stands for the set of characters with code between the codes for X and for Y. However, this requires the user to know the character coding in use on the local system, and moreover, is not convenient if the collating sequence for the local alphabet differs from the ordering of the character codes. Therefore, POSIX extended the bracket notation greatly, both for wildcard patterns and for regular expressions. In the above we saw three types of item that can occur in a bracket expression: namely (i) the negation, (ii) explicit single characters, and (iii) ranges. POSIX specifies ranges in an internationally more useful way and adds three more types:
(iii) Ranges X-Y comprise all characters that fall between X and Y (inclusive) in the current collating sequence as defined by the LC_COLLATE category in the current locale.
(iv) Named character classes, like
[:alnum:] [:alpha:] [:blank:] [:cntrl:]
[:digit:] [:graph:] [:lower:] [:print:]
[:punct:] [:space:] [:upper:] [:xdigit:]
so that one can say '[[:lower:]]' instead of '[a-z]', and have things work in Denmark, too, where there are three letters past 'z' in the alphabet. These character classes are defined by the LC_CTYPE category in the current locale.
So maybe Python 2.6 is out of step. Still, this is a gotcha that will break many old-style scripts, so it's good to be aware of the situation.
(The performance of some tools, grep in particular, is terrible, though, unless you specify LANG=C; google "grep performance UTF-8".)

2nd January 2010, 03:12 AM
Banned (for/from) behaving just like everybody else!
Join Date: Jul 2007
Location: Beijing, China
Posts: 1,307

Unsurprisingly, the performance of locale-aware grep still beats that of locale-blind Python. The problem is especially acute with "ill-conditioned" regular expressions ( http://swtch.com/~rsc/regexp/regexp1.html). grep deals with them robustly and efficiently, while the "fancy" tools either take exponential time to complete the match or simply crash.
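The classic example from the linked article can be tried in a few lines. A hedged sketch: it builds the pattern a?^n a^n (n optional a's followed by n literal a's) and matches it against a string of n a's using Python's re, which the article lists among the backtracking engines. N is kept small on purpose so it finishes quickly; exact timings depend on the machine and the Python version.

```python
import re
import time

N = 18  # small on purpose; in a backtracking engine each increment can roughly double the work

pattern = "a?" * N + "a" * N  # a?a?...a? followed by aa...a
text = "a" * N

start = time.perf_counter()
match = re.match(pattern, text)
elapsed = time.perf_counter() - start

# The only way to match is for every optional 'a?' to match the empty string.
matched_all = match is not None and match.group(0) == text
print(matched_all)
print("N=%d took %.3f s" % (N, elapsed))
```

Raising N a few steps at a time should show the running time climbing steeply.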

2nd January 2010, 11:54 AM
Registered User
Join Date: Jul 2009
Location: London,England
Posts: 1,095

It depends on what you're doing with grep. We tested one particular task a while back: under Unicode locales, grep's word regex matching gets thrashed by alternative approaches using awk, sed, lua, python, cut, ruby and perl.
http://forums.fedoraforum.org/showpo...5&postcount=77
It was bad in F11, but in F12 grep is so slow with that example that you need to test a smaller file of 100,000 lines rather than 10,000,000 lines (script: textparse_nocache2.sh).
(rupertpupkin pointed out that under the C locale grep performance is much better.)
Code:
$ sudo Download/textparse_nocache2.sh
Checking file1
Generating 100,000 line test file, file1 ...
DONE
C ...
real 0m0.422s
user 0m0.036s
sys 0m0.022s
awk ...
real 0m0.710s
user 0m0.086s
sys 0m0.024s
sed ...
real 0m0.450s
user 0m0.153s
sys 0m0.013s
python ...
real 0m1.389s
user 0m0.182s
sys 0m0.071s
lua ...
real 0m0.530s
user 0m0.155s
sys 0m0.017s
cut ...
real 0m0.310s
user 0m0.168s
sys 0m0.008s
perl ...
real 0m0.977s
user 0m0.388s
sys 0m0.032s
grep ...
real 1m13.176s
user 1m12.641s
sys 0m0.075s
Code:
$ export LC_ALL=C
$ sudo Download/textparse_nocache2.sh
Checking file1
DONE
....
grep ...
real 0m0.661s
user 0m0.382s
sys 0m0.017s
This demonstrates that (for certain operations) grep is pretty much broken in Unicode locales.