Fedora Linux Support Community & Resources Center

  #1  
Old 31st December 2009, 07:12 AM
gwiesenekker Offline
Registered User
 
Join Date: Nov 2007
Posts: 20
File name globbing no longer case sensitive in Fedora Core 12?

Some of my files and directories were mysteriously disappearing and some of my shell scripts were failing after the upgrade to Fedora Core 12. After some debugging I found out that file name globbing is no longer case sensitive in Fedora Core 12, that is

rm -rf [a-z]*

now also trashes all files and directories starting with [A-Z], which explains the removed files and directories

and

ls [a-z]*

now also includes files and directories starting with [A-Z], which caused my shell scripts to fail.

Is this a bug or a 'feature'?

Gijsbert
  #2  
Old 31st December 2009, 07:25 AM
ghostdog74 Offline
Registered User
 
Join Date: Sep 2006
Posts: 52
do a shopt -u nocaseglob and see if it works
  #3  
Old 31st December 2009, 08:12 AM
aleph Offline
Banned (for/from) behaving just like everybody else!
 
Join Date: Jul 2007
Location: Beijing, China
Posts: 1,307
This is probably a huge bug in BASH 4. Even with nocaseglob off (default), [a-z]* matches names starting with non-lower letters (this is with LC_COLLATE=C).

---------- Post added at 04:12 PM CST ---------- Previous post was at 03:41 PM CST ----------

BTW [[:lower:]] works as intended
__________________
I believe in nerditarianism. I read FedoraForum for the Fedora-related posts.
  #4  
Old 31st December 2009, 08:32 AM
gwiesenekker Offline
Registered User
 
Join Date: Nov 2007
Posts: 20
Meanwhile I have found a workaround. If you set the environment variable LC_ALL to C

export LC_ALL=C

file name globbing works as it should.

Gijsbert
  #5  
Old 31st December 2009, 09:17 AM
aleph Offline
Banned (for/from) behaving just like everybody else!
 
Join Date: Jul 2007
Location: Beijing, China
Posts: 1,307
oops, I made a mistake there. Forgot to export LC_COLLATE

And it seems to me that this is no bug, but intended behavior. Since Fedora uses unicode for most locales nowadays, care must be taken when dealing with the sorting order of characters. As said in the Unicode docs (http://unicode.org/reports/tr10/#Multi_Level_Comparison), collation order is completely different from alphabetical (or Unicode codepoint) order. By assuming that (collating order == codepoint order), we were led to the conclusion that the range [a-z] didn't cover upper-case letters (in whatever charset backward-compatible with ASCII, including Unicode). However, it is the collation order that determines whether a range is valid, and if valid, what is/isn't covered by it.

The moral: our minds are not yet fully Unicode-compatible. In portable programs we should use range expressions sparingly and prefer POSIX character classes such as "lower".
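
To make the point about collation order concrete, here is a minimal Python sketch. The interleaved sequence a A b B ... z Z is an assumption chosen for illustration (it is not read from any real locale definition), and `chars_in_range` is a hypothetical helper, not part of any library. It only shows how the same range endpoints cover different characters once the ordering changes.

```python
import string

# Assumed, simplified collation sequence: a A b B ... z Z.
# Illustrative only; real locale definitions are far more detailed.
INTERLEAVED = "".join(
    lo + up for lo, up in zip(string.ascii_lowercase, string.ascii_uppercase)
)

# Code point order for the same letters: A..Z followed by a..z.
CODEPOINT = string.ascii_uppercase + string.ascii_lowercase


def chars_in_range(sequence, start, end):
    """Return the characters from start to end (inclusive) in the given ordering."""
    i, j = sequence.index(start), sequence.index(end)
    return sequence[i:j + 1]


if __name__ == "__main__":
    print(chars_in_range(CODEPOINT, "a", "z"))    # lower-case letters only
    print(chars_in_range(INTERLEAVED, "a", "z"))  # every letter except "Z"
    print(chars_in_range(INTERLEAVED, "A", "Z"))  # every letter except "a"
```

Under the assumed interleaved ordering, [a-z] picks up "A" through "Y" as well, which is the kind of surprise described in the first post.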
  #6  
Old 31st December 2009, 12:28 PM
scottro Offline
Retired Community Manager -- Banned from Texas by popular demand.
 
Join Date: Sep 2007
Location: NYC
Posts: 8,142
Ah brilliant. Let's change the way Unix has done things for years--and better yet, let's only put the information somewhere that you probably won't look until it bites you.

And the best part of all is that it *is* documented somewhere, so we can call newcomers (and old timers) stupid for not having found it.
__________________
--
http://home.roadrunner.com/~computertaijutsu

Do NOT PM forum members with requests for technical support. Ask your questions on the forum.


"I don't know why there is the constant push to break any semblance of compatibility" --anon
  #7  
Old 31st December 2009, 01:06 PM
aleph Offline
Banned (for/from) behaving just like everybody else!
 
Join Date: Jul 2007
Location: Beijing, China
Posts: 1,307
Well, we learn something new every day

Big "old-skool" Unix fan that I am (or think I am), I have to accept that Unix culture must adapt to today's people and their needs. Unix used to be The Way, but ways change. Unix used to be ASCII-only, but these days many people (me included) simply can't live without Unicode support, so I'll have to understand it better.

And BTW I really admire the goal of the Unicode project -- breaking cultural barriers for computing.
  #8  
Old 31st December 2009, 01:58 PM
scottro Offline
Retired Community Manager -- Banned from Texas by popular demand.
 
Join Date: Sep 2007
Location: NYC
Posts: 8,142
As someone who needs Japanese, I agree. However, the case sensitivity thing will, as noted in the post that started this thread, cause some unexpected results, and has not been mentioned in too many places that I've seen.

I agree--Unicode is a good thing. As it is, even my wife can use her netbook with Ubuntu, with ease, thanks to Unicode. Changing back and forth for her is only a little more difficult than doing it with a Mac.

My annoyance with it isn't so much the change as the lack of its mention.
  #9  
Old 31st December 2009, 07:55 PM
Gödel Offline
Registered User
 
Join Date: Jul 2009
Location: London,England
Posts: 1,095
If it's not a bug, it's certainly illogical, and it's not POSIX-correct. bash needs to be compiled with -DUSE_POSIX_GLOB_LIBRARY for the correct behaviour. Python is OK, for example:

Code:
$ echo h | python -c 'import glob; print glob.glob("[A-Z]")'
[]
$ echo h | python -c 'import glob; print glob.glob("[a-z]")'
['h']
bash (in Fedora 12) isn't, and neither is grep

Code:
$ echo h | grep '[A-Z]'
h
$ echo h | grep '[a-z]'
h
which doesn't seem very sensible

see https://bugzilla.redhat.com/show_bug.cgi?id=217359
  #10  
Old 1st January 2010, 05:05 AM
aleph Offline
Banned (for/from) behaving just like everybody else!
 
Join Date: Jul 2007
Location: Beijing, China
Posts: 1,307
You can't use Python's glob module that way. The glob.glob() function takes a pathname pattern as its argument and searches the filesystem for paths matching it. It is not intended to work like grep(1), which filters its input.

Nonetheless, there are discrepancies between the globbing behavior of Fedora's BASH and Python under the same LC_COLLATE setting. I tested it under Python 2.6, and it turned out that Python's glob.glob() never includes capital letters in the range [a-z] even if LC_COLLATE is not set to C. They agree with each other only under the "POSIX" (aka "C") locale.

The problem is that Python's documentation for the glob module claims neither POSIX compliance nor respect for the LC_COLLATE environment variable, so I don't think it's a good yardstick of "sensibleness" when dealing with the semantics of range expressions in globs. Actually, Python is known to have problems with collation, and an example was given here (http://www.cmlenz.net/archives/2008/...thon#collation). From the linked article:
Quote:
Unfortunately, Python does not (yet) come with support for unicode collation, and instead uses the code point comparison approach.
On the other hand, BASH's doc says it will respect the current LC_COLLATE setting as required by POSIX.

grep(1) is a slightly different beast, since the input pattern is interpreted as regular expressions, not globs. And this gem from the POSIX standard WRT regular expressions:
Quote:
In the POSIX locale, a range expression represents the set of collating elements that fall between two elements in the collation sequence, inclusive. In other locales, a range expression has unspecified behavior: strictly conforming applications shall not rely on whether the range expression is valid, or on the set of collating elements matched.
(http://www.opengroup.org/onlinepubs/...bd_chap09.html)
The phrase "unspecified behavior" does not mean that Unix apps such as BASH and grep can do whatever they want. In fact, the POSIX standard for the shell and grep requires them to honor LC_COLLATE.

Further digging into the POSIX standard reveals this:
Quote:
The description of basic regular expression bracket expressions in the Base Definitions volume of IEEE Std 1003.1-2001, Section 9.3.5, RE Bracket Expression (which is quoted above -- poster) shall also apply to the pattern bracket expression, except that the exclamation mark character ( '!' ) shall replace the circumflex character ( '^' ) in its role in a "non-matching list" in the regular expression notation.
(http://www.opengroup.org/onlinepubs/...l#tag_02_13_01)

Ahh, so the range expr. in globs and range expr. in regexps are mostly identical, only differing in the syntax of negation, which is irrelevant here anyway. That explains the agreement between BASH and grep.

And to me, it appears that both BASH and grep are using the locale information correctly (see file /usr/share/i18n/locales/iso14651_t1_common, the one included by most xx_XX.UTF-8 locale definition files, for the gory details). So do other commonly used tools e.g. sort and ls.

So no, I don't think BASH or grep is at fault here. I don't know whether Python's range rule is POSIX-compliant or not, but it seems to be at odds with most Unix tools (shell/grep/sort/etc...)
  #11  
Old 1st January 2010, 10:45 AM
Gödel Offline
Registered User
 
Join Date: Jul 2009
Location: London,England
Posts: 1,095
The python example was simplified for illustration, we could also do:

Code:
$ mkdir globtest
$ cd globtest
$ touch h
$ python -c 'import glob; print glob.glob("[A-Z]")'
[]
$ python -c 'import glob; print glob.glob("[a-z]")'
['h']

$ locale
LANG=en_GB.UTF-8
LC_CTYPE="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
...
Python's glob module uses the fnmatch function,

Quote:
The glob module finds all the pathnames matching a specified pattern according to the rules used by the Unix shell. No tilde expansion is done, but *, ?, and character ranges expressed with [] will be correctly matched. This is done by using the os.listdir() and fnmatch.fnmatch() functions in concert, and not by actually invoking a subshell. (For tilde and shell variable expansion, use os.path.expanduser() and os.path.expandvars().)
Whatever the deficiencies of python's globbing wrt unicode, I think most people would expect that in all cases the range match '[A-Z]' should only match A-Z, and not lowercase letters.
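
For reference, the matching step that the quoted documentation mentions can be exercised on its own, without touching the filesystem. This is a small sketch in Python 3 (the session above used Python 2); the file names in `names` are made up for illustration. `fnmatch.fnmatchcase()` is used instead of `fnmatch.fnmatch()` so that no platform-specific case normalisation gets in the way.

```python
import fnmatch

# Hypothetical file names, chosen only for illustration.
names = ["h", "H", "readme", "Makefile"]

# fnmatchcase() does not consult LC_COLLATE: the pattern is translated to a
# regular expression and the range is compared by code point, so "[a-z]"
# covers only the ASCII lower-case letters here.
lower = [n for n in names if fnmatch.fnmatchcase(n, "[a-z]*")]
upper = [n for n in names if fnmatch.fnmatchcase(n, "[A-Z]*")]

print(lower)  # ['h', 'readme']
print(upper)  # ['H', 'Makefile']
```

This matches the session above: Python's result does not change with the locale, whereas bash's range matching follows LC_COLLATE.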

This nonintuitive behaviour introduced by unicode collation is not sensible imho, and there are other significant issues like the fact that unicode locales can cause tools like grep to massively slow down (there are several bug reports).

Unicode support is nice, but you'd think they'd manage to implement it without ****ing up the basic tools, performing basic tasks. And with discrepancies amongst the various tools it looks like a bit of a mess.

Last edited by Gödel; 1st January 2010 at 10:50 AM.
  #12  
Old 1st January 2010, 11:34 AM
aleph Offline
Banned (for/from) behaving just like everybody else!
 
Join Date: Jul 2007
Location: Beijing, China
Posts: 1,307
The behavior of the GNU tools has been quite consistent, and to me they don't appear to be ****ed up. I'm quite happy with grep as long as it works correctly and fast enough (for me).

Different people have different ideas of what is intuitive. If you can convince the glibc developers that changing the glibc collation definitions is good for the public, go for it. Be warned! "There Be Ulrich Drepper"

And there are always wctype(3)-compatible character classes (e.g. [[:lower:]] or [[:upper:]]). I for one find them the true solution to most regexp/glob problems where ranges are expected. YMMV.
  #13  
Old 1st January 2010, 01:51 PM
Gödel Offline
Registered User
 
Join Date: Jul 2009
Location: London,England
Posts: 1,095
Ah ok, 'man 7 glob' explains the situation as you described (I did 'man glob' and was reading the section 3 man page):

Quote:
Character classes and Internationalization

Of course ranges were originally meant to be ASCII ranges, so that '[ -%]' stands for '[ !"#$%]' and '[a-z]' stands for "any lowercase letter". Some Unix implementations generalized this so that a range X-Y stands for the set of characters with code between the codes for X and for Y. However, this requires the user to know the character coding in use on the local system, and moreover, is not convenient if the collating sequence for the local alphabet differs from the ordering of the character codes. Therefore, POSIX extended the bracket notation greatly, both for wildcard patterns and for regular expressions. In the above we saw three types of item that can occur in a bracket expression: namely (i) the negation, (ii) explicit single characters, and (iii) ranges. POSIX specifies ranges in an internationally more useful way and adds three more types:

(iii) Ranges X-Y comprise all characters that fall between X and Y (inclusive) in the current collating sequence as defined by the LC_COLLATE category in the current locale.

(iv) Named character classes, like
[:alnum:] [:alpha:] [:blank:] [:cntrl:]
[:digit:] [:graph:] [:lower:] [:print:]
[:punct:] [:space:] [:upper:] [:xdigit:]
so that one can say '[[:lower:]]' instead of '[a-z]', and have things work in Denmark, too, where there are three letters past 'z' in the alphabet. These character classes are defined by the LC_CTYPE category in the current locale.
So maybe python 2.6 is out-of-step. Still, this is a gotcha that will break many old-style scripts, so good to be aware of the situation
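
For anyone who wants to check which side of a range a character falls on under their own locale, here is a rough Python sketch. It only approximates the rule in item (iii) by comparing `locale.strxfrm()` sort keys; it is not how bash or grep are implemented internally, and `in_collation_range` is a hypothetical helper name. The result depends entirely on the locale data installed on the system, so no expected output is shown.

```python
import locale


def in_collation_range(ch, start, end):
    """True if ch sorts between start and end (inclusive) under the active LC_COLLATE."""
    key = locale.strxfrm
    return key(start) <= key(ch) <= key(end)


if __name__ == "__main__":
    try:
        # "" selects the locale from the environment (LANG, LC_ALL, LC_COLLATE).
        locale.setlocale(locale.LC_COLLATE, "")
    except locale.Error:
        # The environment names a locale that is not installed; keep the default.
        pass
    for ch in "hH":
        print(ch, in_collation_range(ch, "a", "z"))
```

Running it once with LC_ALL=C and once with a UTF-8 locale shows whether the two orderings differ on a given system.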

(The performance of some tools, grep for example, is terrible though unless you specify LANG=C; google "grep performance UTF-8".)
  #14  
Old 2nd January 2010, 03:12 AM
aleph Offline
Banned (for/from) behaving just like everybody else!
 
Join Date: Jul 2007
Location: Beijing, China
Posts: 1,307
Unsurprisingly, the performance of locale-aware grep still beats that of locale-blind Python. The problem is especially acute with "ill-conditioned" regular expressions (http://swtch.com/~rsc/regexp/regexp1.html). grep deals with them robustly and efficiently, while the "fancy" tools either take exponential time to complete the match or simply crash.
  #15  
Old 2nd January 2010, 11:54 AM
Gödel Offline
Registered User
 
Join Date: Jul 2009
Location: London,England
Posts: 1,095
It depends what you're doing with grep. We tested one particular task a while back: under Unicode locales, grep's word regex matching gets thrashed by alternative implementations using awk, sed, lua, python, cut, ruby and perl.

http://forums.fedoraforum.org/showpo...5&postcount=77

It was bad in F11, but in F12 grep is so slow with that example that you need to test a smaller file of 100,000 lines rather than 10,000,000 lines (script: textparse_nocache2.sh).

(rupertpupkin pointed out that under C locale grep performance is much better)

Code:
$ sudo Download/textparse_nocache2.sh 
Checking file1
Generating 100,000 line test file, file1 ...
DONE

C ...
real	0m0.422s
user	0m0.036s
sys	0m0.022s

awk ...
real	0m0.710s
user	0m0.086s
sys	0m0.024s

sed ...
real	0m0.450s
user	0m0.153s
sys	0m0.013s

python ...
real	0m1.389s
user	0m0.182s
sys	0m0.071s

lua ...
real	0m0.530s
user	0m0.155s
sys	0m0.017s

cut ...
real	0m0.310s
user	0m0.168s
sys	0m0.008s

perl ...
real	0m0.977s
user	0m0.388s
sys	0m0.032s

grep ...
real	1m13.176s
user	1m12.641s
sys	0m0.075s
Code:
$ export LC_ALL=C
$ sudo Download/textparse_nocache2.sh 
Checking file1
DONE

....
grep ...
real	0m0.661s
user	0m0.382s
sys	0m0.017s
Which demonstrates that (in certain operations) grep is pretty much broken in unicode locales
Tags
lc_collate, unicode
