How to bulk-rename files with invalid encoding or bulk-replace invalid encoded characters?

13 votes

I have a Debian server where I host music for an internet radio station. I have trouble with file names and paths because many of the files have names with an invalid encoding, for example:

./music/BÃ¤ndname - Some Title - additional Info/B�ndname - 07 - This Title Is Cörtain, The EncÃding Not.mp3

Ideally, I would like to remove everything that is not a letter (A-Z, a-z), a digit (0-9), a dash (-), or an underscore (_). The result should look something like this:

./music/Bndname-SomeTitle-additionalInfo/Bndname-07-ThisTitleIsCrtain,TheEncdingNot.mp3

How can I achieve this for a large batch of files and directories?

I've seen this similar question: bulk rename (or correctly display) files with special characters

But that only fixes the encoding; I would prefer the stricter approach described above.

edited Apr 13 '17 at 12:37

Community♦

asked Jan 18 '13 at 10:49

Afri


linux batch encoding bulk


3 Answers

13 votes, accepted

You're going to run into some problems if you want to rename files and directories at the same time. Renaming just a file is easy enough, but you want to make sure the directories are also renamed. You can't simply mv Motörhead/Encöding Motorhead/Encoding, since Motorhead won't exist at the time of the call.

So, we need a depth-first traversal of all files and folders, renaming only the current file or folder (the last path component) at each step. The following works with GNU find and Bash 4.2.42 on OS X.

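#!/usr/bin/env bash
# Usage: ./rename.sh /some/path
# Prints the mv commands needed to strip every character except
# A-Z, a-z, 0-9, '/', '.', '_' and '-' from file and directory names.

# Match byte by byte, so that a-z covers ASCII letters only and
# invalidly encoded bytes are stripped as well.
export LC_ALL=C

find "$1" -depth -print0 | while IFS= read -r -d '' file; do
  d="$( dirname "$file" )"
  f="$( basename "$file" )"
  # The slash is escaped so it cannot be taken for the pattern/replacement separator.
  new="${f//[^a-zA-Z0-9\/._-]/}"
  if [ "$f" != "$new" ]      # if equal, name is already clean, so leave alone
  then
    if [ -e "$d/$new" ]
    then
      # Escaped quotes keep the variables inside one quoted string.
      echo "Notice: \"$new\" and \"$f\" both exist in $d:"
      ls -ld "$d/$new" "$d/$f"
    else
      echo mv "$file" "$d/$new"      # remove "echo" to actually rename things
    fi
  fi
done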

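The bracket expression is a shell glob pattern, not a regex. If you only want to remove the characters that Windows cannot handle, store the pattern in a variable first, bad='[\\/:*?"<>|]', and then use new="${f//$bad/}". The single quotes keep the double quote inside the pattern from ending the surrounding string.

Save this script as rename.sh and make it executable with chmod +x rename.sh. Then, call it like ./rename.sh /some/path.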

Make sure to resolve any file name collisions (“Notice” announcements).

If you're absolutely sure it does the right replacements, remove the echo from the script to actually rename things instead of just printing what it does.

To be safe, I'd recommend testing this on a small subset of files first.

Options explained

To explain what goes on here:

-depth makes find process the contents of a directory before the directory itself, so we can "roll up" everything from the end. By default, find lists a directory first and its contents afterwards (neither order is breadth-first).

-print0 ensures the find output is NUL-delimited, so we can read it with read -d '' into the file variable. Doing so helps us deal with all kinds of weird file names, including ones with spaces, and even newlines.

We'll get the directory of the file with dirname. Don't forget to always quote your variables properly, otherwise any path with spaces or globbing characters would break this script.

We'll get the actual filename (or directory name) with basename.

Then, we remove any invalid character from $f using Bash's pattern substitution, ${f//pattern/}, which deletes every match of the pattern. Invalid means anything that's not a lowercase or uppercase letter, a digit, a slash (/), a dot (.), an underscore, or a hyphen.

If $f is already clean (the cleaned name is identical to the current name), skip it. You don't want to rename it, because, on some systems, mv foo foo causes a problem.

If $new already exists in directory $d (e.g., you have files named resume and résumé in the same directory), issue a warning and leave both entries alone, so that nothing gets overwritten.

Otherwise, we finally rename the original file (or directory) to its new name.
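The character-stripping step can be tried on its own before touching any files. This is only a minimal sketch: clean_name is a hypothetical helper name, not part of the script above, and it leaves the slash out of the allowed set on the assumption that it is only ever given a basename.

```shell
#!/usr/bin/env bash
# clean_name NAME: print NAME with every character outside A-Z a-z 0-9 . _ - removed.
# The body runs in a subshell so the LC_ALL=C assignment does not leak to the caller.
clean_name() (
  LC_ALL=C                      # byte-wise matching; a-z means ASCII letters only
  name="$1"
  printf '%s\n' "${name//[^a-zA-Z0-9._-]/}"
)

clean_name 'with space'                      # prints: withspace
clean_name 'Some Title - additional Info'    # prints: SomeTitle-additionalInfo
clean_name $'B\xe4ndname - 07.mp3'           # prints: Bndname-07.mp3 (0xe4 is an invalid byte in UTF-8)
```

Setting the C locale is what makes the invalid byte in the last example disappear reliably; in a UTF-8 locale the result can depend on the Bash version and collation rules.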

Since the script only ever renames the last component of a path, and handles the deepest entries first, renaming Motörhead/Encöding to Motorhead/Encoding is done in two steps:

mv Motörhead/Encöding Motörhead/Encoding

mv Motörhead Motorhead

This ensures all replacements are done in the correct order.
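To see the ordering for yourself, here is a small sketch that compares find's default order with -depth on a throwaway directory chain. The scratch directory and the two helper functions are made up for illustration.

```shell
#!/usr/bin/env bash
# Build a scratch chain a/b/c and list it in both traversal orders.
demo_root="$(mktemp -d)"                 # hypothetical scratch directory
trap 'rm -rf "$demo_root"' EXIT          # clean up when the shell exits
mkdir -p "$demo_root/a/b/c"

# Paths are printed relative to the scratch directory, one per line.
default_order() ( cd "$demo_root" && find a )
depth_order()   ( cd "$demo_root" && find a -depth )

echo "default:"; default_order           # a, then a/b, then a/b/c
echo "-depth:";  depth_order             # a/b/c, then a/b, then a
```

With -depth, a/b/c is reported before a/b and a, so by the time a directory is renamed, everything inside it has already been handled under its old path.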

Example files and test run

Let's assume some files in a base folder called test:

test

test/Motörhead

test/Motörhead/anöther_file.mp3

test/Motörhead/Encöding

test/Randöm

test/Täst

test/Täst/Töst

test/with space

test/with-hyphen.txt

test/work

test/work/resume

test/work/résumé

test/work/schedule

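Here is the output from a run in debug mode (with the echo in front of the mv), i.e., the commands that would be called, and the collision warnings. The ö and ä turn into o and a here most likely because OS X stores file names in decomposed form (a base letter followed by a combining mark), so only the combining mark is stripped; where the letter is stored as a single precomposed character, as is usual on Linux, the whole letter is removed (Motörhead becomes Motrhead):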

mv test/Motörhead/anöther_file.mp3 test/Motörhead/another_file.mp3

mv test/Motörhead/Encöding test/Motörhead/Encoding

mv test/Motörhead test/Motorhead

mv test/Randöm test/Random

mv test/Täst/Töst test/Täst/Tost

mv test/Täst test/Tast

mv test/with space test/withspace

Notice: "resume" and "résumé" both exist in test/work:

-rw-r--r--  …  …  test/work/resume

-rw-r--r--  …  …  test/work/résumé

Notice the absence of messages for with-hyphen.txt, schedule, and test itself.
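If you would rather see all collisions up front instead of one Notice at a time, something along these lines might help. It is a rough sketch, not part of the script above: list_collisions is a hypothetical helper, it assumes names contain no newlines (sort and uniq work line by line), and it compares entries within their current, not yet renamed, parent directory.

```shell
#!/usr/bin/env bash
# list_collisions DIR: print each cleaned path that two or more entries map to.
list_collisions() (
  LC_ALL=C                               # byte-wise matching, as in the rename script
  find "$1" -depth -print0 | while IFS= read -r -d '' file; do
    d="$(dirname "$file")"
    f="$(basename "$file")"
    printf '%s\n' "$d/${f//[^a-zA-Z0-9._-]/}"
  done | sort | uniq -d                  # keep only paths that occur more than once
)

# Demo on a scratch directory (hypothetical file names):
scratch="$(mktemp -d)"
trap 'rm -rf "$scratch"' EXIT
touch "$scratch/a b.txt" "$scratch/ab.txt" "$scratch/other.txt"
list_collisions "$scratch"               # prints the path ending in /ab.txt
```

An empty result means no two names in the same directory would end up identical after cleaning.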

edited Nov 19 at 9:18

answered Jan 18 '13 at 15:44

slhck


1

You might want to add logic to handle the case where the destination of the mv already exists, which can happen (1) if you have files that are already clean (resulting in mv foo foo), or (2) if you have files with the same name except for the special characters (e.g., mv Encöding Encoding, where you already have an Encoding file in addition to Encöding).
– Scott
Jan 18 '13 at 21:00

Good idea, thanks. Any specific suggestions on what to do in that case? Granted – achieving this in a clean and sane manner is harder than it seems at first. If you have something, feel free to edit of course.
– slhck
Jan 18 '13 at 21:12

I don’t believe it makes sense to think about handling the collisions automatically –– just identify them to the user and let him handle them. I’ve edited your answer, as you suggested.
– Scott
Jan 19 '13 at 0:48

+1 for using the example with "Encöding" Too much fön!:-)
– Marcel
Mar 22 '14 at 21:25

After three years I still come back here. So useful! :-)
– Afri
Apr 16 '16 at 12:08


14 votes

I know that it's not exactly what you wanted, but if you know the original encoding, perhaps you can use convmv to change the encoding to UTF-8, which should fix most problems.

This worked for me on a folder with some invalid-encoded Polish filenames:

convmv -f cp1250 -t utf8 -r .

Note that this command doesn't actually rename anything; it only prints what it would do. Add the --notest option to really rename the files.
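If you are unsure which names are broken or what the original encoding was, you can experiment on single names first. The sketch below uses iconv rather than convmv, purely as an illustration; both helper names are made up, and the sample bytes are a hypothetical Polish word stored in cp1250.

```shell
#!/usr/bin/env bash
# is_valid_utf8 NAME: succeed if NAME is a valid UTF-8 byte sequence.
is_valid_utf8() {
  printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# to_utf8 CHARSET NAME: print NAME re-encoded from CHARSET to UTF-8.
to_utf8() {
  printf '%s' "$2" | iconv -f "$1" -t UTF-8
}

name=$'\xbf\xf3\xb3w'                    # the bytes of "żółw" in cp1250
if ! is_valid_utf8 "$name"; then
  to_utf8 CP1250 "$name"                 # prints the name as UTF-8
  echo
fi
```

If the converted name looks right for a few samples, the same source charset is a reasonable candidate for convmv's -f option.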

edited Aug 30 '13 at 19:18

answered Aug 30 '13 at 19:00

mik01aj


1

For those who have a static set (or don't have a diverse mix of charsets), the convmv option is amazingly simple and perfect. For OP, having a potential multitude of charsets, this could be merged with the other answer, since convmv seems to know when it does or doesn't encounter the correct format. By looping through the charsets, via convmv --list, one would get them properly encoded.
– user273265
Nov 11 '13 at 20:14

1

By this I mean: if one runs a Debian server, as OP does, one would certainly assume UTF-8 these days, in which case one can keep the original letters. I had a folder with some Nordic characters, and used: convmv -t utf8 --nfc -f iso-8859-1 --notest -r . The --nfc was to conform to Linux rather than OS X or so; simply typing convmv lists the (useful) options.
– user273265
Nov 11 '13 at 20:14


0 votes

I know, you asked about renaming.

But you can dodge the problem quite easily using software like MusicBrainz Picard.

It is capable of identifying music (audio fingerprinting), downloading all the necessary data (including cover images, where available) from the huge MusicBrainz database and moving the files around so that your collection can fit any pattern you like. I've been using it for years and it has always worked perfectly with anything from Cyrillic to Arabic; and of course (at least for Latin-based scripts) it can also do the conversion to ASCII.

With this approach it does not really matter how messy/badly named your collection really is, as long as the files are readable and complete.

(Did I mention it's free? Both as in free speech and as in free beer? Both the software and the database..?)

answered Oct 16 '15 at 4:45

Alois Mahdal



3 Answers
3

active

oldest

votes

3 Answers
3

active

oldest

votes

up vote
13
down vote

accepted

So, we need a depth-first traversal of all files and folders, and then rename the current file or folder only. The following works with GNU find and Bash 4.2.42 on my OS X.

#!/usr/bin/env bash

find "$1" -depth -print0 | while IFS= read -r -d '' file; do

  d="$( dirname "$file" )"

  f="$( basename "$file" )"

  new="${f//[^a-zA-Z0-9/._-]/}"

  if [ "$f" != "$new" ]      # if equal, name is already clean, so leave alone

  then

    if [ -e "$d/$new" ]

    then

      echo "Notice: "$new" and "$f" both exist in "$d":"

      ls -ld "$d/$new" "$d/$f"

    else

      echo mv "$file" "$d/$new"      # remove "echo" to actually rename things

    fi

  fi

done

You may change the regex by using new="${f//[\/:*?"<>|]/}" if you want to replace anything that Windows cannot handle.

Save this script as rename.sh, make it executable with chmod +x rename.sh. Then, call it like rename.sh /some/path.

Make sure to resolve any file name collisions (“Notice” announcements).

If you're absolutely sure it does the right replacements, remove the echo from the script to actually rename things instead of just printing what it does.

To be safe, I'd recommend testing this on a small subset of files first.

Options explained

To explain what goes on here:

-depth will ensure directories are recursed depth-first, so we can "roll up" everything from the end. Usually, find traverses differently (but not breadth-first).

-print0 ensures the find output is null-delimited, so we can read it with read -d '' into the file variable. Doing so helps us deal with all kinds of weird file names, including ones with spaces, and even newlines.

We'll get the directory of the file with dirname. Don't forget to always quote your variables properly, otherwise any path with spaces or globbing characters would break this script.

We'll get the actual filename (or directory name) with basename.

Then, we remove any invalid character from $f using Bash's string replacement capabilities. Invalid means anything that's not a lower- or uppercase letter, a digit, a slash (/), a dot (.), an underscore, or a minus-hyphen.

If $f is already clean (the cleaned name is identical to the current name), skip it.

If $new already exists in directory $d (e.g., you have files named resume and résumé in the same directory), issue a warning. You don't want to rename it, because, on some systems, mv foo foo causes a problem. Otherwise,

We finally rename the original file (or directory) to its new name

Since this will only act on the deepest hierarchy, renaming Motörhead/Encöding to Motorhead/Encoding is done in two steps:

mv Motörhead/Encöding Motörhead/Encoding

mv Motörhead Motorhead

This ensures all replacements are done in the correct order.

Example files and test run

Let's assume some files in a base folder called test:

test

test/Motörhead

test/Motörhead/anöther_file.mp3

test/Motörhead/Encöding

test/Randöm

test/Täst

test/Täst/Töst

test/with space

test/with-hyphen.txt

test/work

test/work/resume

test/work/résumé

test/work/schedule

Here is the output from a run in debug mode (with the echo in front of the mv),
i.e., the commands that would be called, and the collision warnings:

mv test/Motörhead/anöther_file.mp3 test/Motörhead/another_file.mp3

mv test/Motörhead/Encöding test/Motörhead/Encoding

mv test/Motörhead test/Motorhead

mv test/Randöm test/Random

mv test/Täst/Töst test/Täst/Tost

mv test/Täst test/Tast

mv test/with space test/withspace

Notice: "resume" and "résumé" both exist in test/work:

-rw-r—r--  …  …  test/work/resume

-rw-r—r--  …  …  test/work/résumé

Notice the absence of messages for with-hyphen.txt, schedule, and test itself.

edited Nov 19 at 9:18

answered Jan 18 '13 at 15:44

slhck

158k47436461

1

You might want to add logic to handle the case where the destination of the mv already exists, which can happen (1) if you have files that are already clean (resulting in mv foo foo), or (2) if you have files with the same name except for the special characters (e.g., mv Encöding Encoding, where you already have an Encoding file in addition to Encöding).
– Scott
Jan 18 '13 at 21:00

Good idea, thanks. Any specific suggestions on what to do in that case? Granted – achieving this in a clean and sane manner is harder than it seems at first. If you have something, feel free to edit of course.
– slhck
Jan 18 '13 at 21:12

I don’t believe it makes sense to think about handling the collisions automatically –– just identify them to the user and let him handle them. I’ve edited your answer, as you suggested.
– Scott
Jan 19 '13 at 0:48

+1 for using the example with "Encöding" Too much fön!:-)
– Marcel
Mar 22 '14 at 21:25

After three years I still come back here. so usefull! :-)
– Afri
Apr 16 '16 at 12:08

|
show 1 more comment

up vote
13
down vote

accepted

So, we need a depth-first traversal of all files and folders, and then rename the current file or folder only. The following works with GNU find and Bash 4.2.42 on my OS X.

#!/usr/bin/env bash

find "$1" -depth -print0 | while IFS= read -r -d '' file; do

  d="$( dirname "$file" )"

  f="$( basename "$file" )"

  new="${f//[^a-zA-Z0-9/._-]/}"

  if [ "$f" != "$new" ]      # if equal, name is already clean, so leave alone

  then

    if [ -e "$d/$new" ]

    then

      echo "Notice: "$new" and "$f" both exist in "$d":"

      ls -ld "$d/$new" "$d/$f"

    else

      echo mv "$file" "$d/$new"      # remove "echo" to actually rename things

    fi

  fi

done

You may change the regex by using new="${f//[\/:*?"<>|]/}" if you want to replace anything that Windows cannot handle.

Save this script as rename.sh, make it executable with chmod +x rename.sh. Then, call it like rename.sh /some/path.

Make sure to resolve any file name collisions (“Notice” announcements).

If you're absolutely sure it does the right replacements, remove the echo from the script to actually rename things instead of just printing what it does.

To be safe, I'd recommend testing this on a small subset of files first.

Options explained

To explain what goes on here:

-depth will ensure directories are recursed depth-first, so we can "roll up" everything from the end. Usually, find traverses differently (but not breadth-first).

-print0 ensures the find output is null-delimited, so we can read it with read -d '' into the file variable. Doing so helps us deal with all kinds of weird file names, including ones with spaces, and even newlines.

We'll get the directory of the file with dirname. Don't forget to always quote your variables properly, otherwise any path with spaces or globbing characters would break this script.

We'll get the actual filename (or directory name) with basename.

Then, we remove any invalid character from $f using Bash's string replacement capabilities. Invalid means anything that's not a lower- or uppercase letter, a digit, a slash (/), a dot (.), an underscore, or a minus-hyphen.

If $f is already clean (the cleaned name is identical to the current name), skip it.

If $new already exists in directory $d (e.g., you have files named resume and résumé in the same directory), issue a warning. You don't want to rename it, because, on some systems, mv foo foo causes a problem. Otherwise,

We finally rename the original file (or directory) to its new name

Since this will only act on the deepest hierarchy, renaming Motörhead/Encöding to Motorhead/Encoding is done in two steps:

mv Motörhead/Encöding Motörhead/Encoding

mv Motörhead Motorhead

This ensures all replacements are done in the correct order.

Example files and test run

Let's assume some files in a base folder called test:

test

test/Motörhead

test/Motörhead/anöther_file.mp3

test/Motörhead/Encöding

test/Randöm

test/Täst

test/Täst/Töst

test/with space

test/with-hyphen.txt

test/work

test/work/resume

test/work/résumé

test/work/schedule

Here is the output from a run in debug mode (with the echo in front of the mv),
i.e., the commands that would be called, and the collision warnings:

mv test/Motörhead/anöther_file.mp3 test/Motörhead/another_file.mp3

mv test/Motörhead/Encöding test/Motörhead/Encoding

mv test/Motörhead test/Motorhead

mv test/Randöm test/Random

mv test/Täst/Töst test/Täst/Tost

mv test/Täst test/Tast

mv test/with space test/withspace

Notice: "resume" and "résumé" both exist in test/work:

-rw-r—r--  …  …  test/work/resume

-rw-r—r--  …  …  test/work/résumé

Notice the absence of messages for with-hyphen.txt, schedule, and test itself.

edited Nov 19 at 9:18

answered Jan 18 '13 at 15:44

slhck

158k47436461

1

You might want to add logic to handle the case where the destination of the mv already exists, which can happen (1) if you have files that are already clean (resulting in mv foo foo), or (2) if you have files with the same name except for the special characters (e.g., mv Encöding Encoding, where you already have an Encoding file in addition to Encöding).
– Scott
Jan 18 '13 at 21:00

Good idea, thanks. Any specific suggestions on what to do in that case? Granted – achieving this in a clean and sane manner is harder than it seems at first. If you have something, feel free to edit of course.
– slhck
Jan 18 '13 at 21:12

I don’t believe it makes sense to think about handling the collisions automatically –– just identify them to the user and let him handle them. I’ve edited your answer, as you suggested.
– Scott
Jan 19 '13 at 0:48

+1 for using the example with "Encöding" Too much fön!:-)
– Marcel
Mar 22 '14 at 21:25

After three years I still come back here. so usefull! :-)
– Afri
Apr 16 '16 at 12:08

|
show 1 more comment

up vote
13
down vote

accepted

So, we need a depth-first traversal of all files and folders, and then rename the current file or folder only. The following works with GNU find and Bash 4.2.42 on my OS X.

#!/usr/bin/env bash

find "$1" -depth -print0 | while IFS= read -r -d '' file; do

  d="$( dirname "$file" )"

  f="$( basename "$file" )"

  new="${f//[^a-zA-Z0-9/._-]/}"

  if [ "$f" != "$new" ]      # if equal, name is already clean, so leave alone

  then

    if [ -e "$d/$new" ]

    then

      echo "Notice: "$new" and "$f" both exist in "$d":"

      ls -ld "$d/$new" "$d/$f"

    else

      echo mv "$file" "$d/$new"      # remove "echo" to actually rename things

    fi

  fi

done

You may change the regex by using new="${f//[\/:*?"<>|]/}" if you want to replace anything that Windows cannot handle.

Save this script as rename.sh, make it executable with chmod +x rename.sh. Then, call it like rename.sh /some/path.

Make sure to resolve any file name collisions (“Notice” announcements).

If you're absolutely sure it does the right replacements, remove the echo from the script to actually rename things instead of just printing what it does.

To be safe, I'd recommend testing this on a small subset of files first.

Options explained

To explain what goes on here:

-depth will ensure directories are recursed depth-first, so we can "roll up" everything from the end. Usually, find traverses differently (but not breadth-first).

-print0 ensures the find output is null-delimited, so we can read it with read -d '' into the file variable. Doing so helps us deal with all kinds of weird file names, including ones with spaces, and even newlines.

We'll get the directory of the file with dirname. Don't forget to always quote your variables properly, otherwise any path with spaces or globbing characters would break this script.

We'll get the actual filename (or directory name) with basename.

Then, we remove any invalid character from $f using Bash's string replacement capabilities. Invalid means anything that's not a lower- or uppercase letter, a digit, a slash (/), a dot (.), an underscore, or a minus-hyphen.

If $f is already clean (the cleaned name is identical to the current name), skip it.

If $new already exists in directory $d (e.g., you have files named resume and résumé in the same directory), issue a warning. You don't want to rename it, because, on some systems, mv foo foo causes a problem. Otherwise,

We finally rename the original file (or directory) to its new name

Since this will only act on the deepest hierarchy, renaming Motörhead/Encöding to Motorhead/Encoding is done in two steps:

mv Motörhead/Encöding Motörhead/Encoding

mv Motörhead Motorhead

This ensures all replacements are done in the correct order.

Example files and test run

Let's assume some files in a base folder called test:

test

test/Motörhead

test/Motörhead/anöther_file.mp3

test/Motörhead/Encöding

test/Randöm

test/Täst

test/Täst/Töst

test/with space

test/with-hyphen.txt

test/work

test/work/resume

test/work/résumé

test/work/schedule

Here is the output from a run in debug mode (with the echo in front of the mv),
i.e., the commands that would be called, and the collision warnings:

mv test/Motörhead/anöther_file.mp3 test/Motörhead/another_file.mp3

mv test/Motörhead/Encöding test/Motörhead/Encoding

mv test/Motörhead test/Motorhead

mv test/Randöm test/Random

mv test/Täst/Töst test/Täst/Tost

mv test/Täst test/Tast

mv test/with space test/withspace

Notice: "resume" and "résumé" both exist in test/work:

-rw-r—r--  …  …  test/work/resume

-rw-r—r--  …  …  test/work/résumé

Notice the absence of messages for with-hyphen.txt, schedule, and test itself.

edited Nov 19 at 9:18

answered Jan 18 '13 at 15:44

slhck

158k47436461

So, we need a depth-first traversal of all files and folders, and then rename the current file or folder only. The following works with GNU find and Bash 4.2.42 on my OS X.

#!/usr/bin/env bash

find "$1" -depth -print0 | while IFS= read -r -d '' file; do

  d="$( dirname "$file" )"

  f="$( basename "$file" )"

  new="${f//[^a-zA-Z0-9/._-]/}"

  if [ "$f" != "$new" ]      # if equal, name is already clean, so leave alone

  then

    if [ -e "$d/$new" ]

    then

      echo "Notice: "$new" and "$f" both exist in "$d":"

      ls -ld "$d/$new" "$d/$f"

    else

      echo mv "$file" "$d/$new"      # remove "echo" to actually rename things

    fi

  fi

done

You may change the regex by using new="${f//[\/:*?"<>|]/}" if you want to replace anything that Windows cannot handle.

Save this script as rename.sh, make it executable with chmod +x rename.sh. Then, call it like rename.sh /some/path.

Make sure to resolve any file name collisions (“Notice” announcements).

If you're absolutely sure it does the right replacements, remove the echo from the script to actually rename things instead of just printing what it does.

To be safe, I'd recommend testing this on a small subset of files first.

Options explained

To explain what goes on here:

-depth will ensure directories are recursed depth-first, so we can "roll up" everything from the end. Usually, find traverses differently (but not breadth-first).

-print0 ensures the find output is null-delimited, so we can read it with read -d '' into the file variable. Doing so helps us deal with all kinds of weird file names, including ones with spaces, and even newlines.

We'll get the directory of the file with dirname. Don't forget to always quote your variables properly, otherwise any path with spaces or globbing characters would break this script.

We'll get the actual filename (or directory name) with basename.

Then, we remove any invalid character from $f using Bash's string replacement capabilities. Invalid means anything that's not a lower- or uppercase letter, a digit, a slash (/), a dot (.), an underscore, or a minus-hyphen.

If $f is already clean (the cleaned name is identical to the current name), skip it.

If $new already exists in directory $d (e.g., you have files named resume and résumé in the same directory), issue a warning. You don't want to rename it, because, on some systems, mv foo foo causes a problem. Otherwise,

We finally rename the original file (or directory) to its new name

Since this will only act on the deepest hierarchy, renaming Motörhead/Encöding to Motorhead/Encoding is done in two steps:

mv Motörhead/Encöding Motörhead/Encoding

mv Motörhead Motorhead

This ensures all replacements are done in the correct order.

Example files and test run

Let's assume some files in a base folder called test:

test

test/Motörhead

test/Motörhead/anöther_file.mp3

test/Motörhead/Encöding

test/Randöm

test/Täst

test/Täst/Töst

test/with space

test/with-hyphen.txt

test/work

test/work/resume

test/work/résumé

test/work/schedule

Here is the output from a run in debug mode (with the echo in front of the mv),
i.e., the commands that would be called, and the collision warnings:

mv test/Motörhead/anöther_file.mp3 test/Motörhead/another_file.mp3

mv test/Motörhead/Encöding test/Motörhead/Encoding

mv test/Motörhead test/Motorhead

mv test/Randöm test/Random

mv test/Täst/Töst test/Täst/Tost

mv test/Täst test/Tast

mv test/with space test/withspace

Notice: "resume" and "résumé" both exist in test/work:

-rw-r—r--  …  …  test/work/resume

-rw-r—r--  …  …  test/work/résumé

Notice the absence of messages for with-hyphen.txt, schedule, and test itself.

edited Nov 19 at 9:18

answered Jan 18 '13 at 15:44

slhck

158k47436461

edited Nov 19 at 9:18

answered Jan 18 '13 at 15:44

slhck

158k47436461

answered Jan 18 '13 at 15:44

slhck

158k47436461

answered Jan 18 '13 at 15:44

slhck

158k47436461

1

You might want to add logic to handle the case where the destination of the mv already exists, which can happen (1) if you have files that are already clean (resulting in mv foo foo), or (2) if you have files with the same name except for the special characters (e.g., mv Encöding Encoding, where you already have an Encoding file in addition to Encöding).
– Scott
Jan 18 '13 at 21:00

Good idea, thanks. Any specific suggestions on what to do in that case? Granted – achieving this in a clean and sane manner is harder than it seems at first. If you have something, feel free to edit of course.
– slhck
Jan 18 '13 at 21:12

I don’t believe it makes sense to think about handling the collisions automatically –– just identify them to the user and let him handle them. I’ve edited your answer, as you suggested.
– Scott
Jan 19 '13 at 0:48

+1 for using the example with "Encöding" Too much fön!:-)
– Marcel
Mar 22 '14 at 21:25

After three years I still come back here. so usefull! :-)
– Afri
Apr 16 '16 at 12:08

up vote
14
down vote

I know that it's not exactly what you wanted, but if you know the original encoding, perhaps you can use convmv to change the encoding to UTF-8, which should fix most problems.

This worked for me on a folder with some incorrectly encoded Polish filenames:

convmv -f cp1250 -t utf8 -r .

Note that this command doesn't actually rename anything; it only shows what would be done. Add the --notest option to really rename the files.
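Before picking a source encoding for convmv, it can help to see which names are not valid UTF-8 in the first place. The following is a rough sketch, assuming an iconv binary is installed (as it is on a typical Debian system); list_non_utf8 is a hypothetical helper name, and the check only looks at each path's last component.

```shell
# list_non_utf8 [DIR]  (hypothetical helper, for illustration only)
# Prints every path under DIR whose final component is not valid UTF-8.
list_non_utf8() {
    find "${1:-.}" -depth -exec sh -c '
        for path do
            name=${path##*/}    # last path component only
            # iconv exits non-zero when the bytes are not valid UTF-8
            if ! printf "%s" "$name" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
                printf "%s\n" "$path"
            fi
        done' sh {} +
}
```

Running list_non_utf8 ./music would print the paths that are candidates for conversion; names that are already valid UTF-8 are not listed. Paths containing newlines would be printed across several lines, so treat the output as a report for a human rather than as input for another script.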

edited Aug 30 '13 at 19:18

answered Aug 30 '13 at 19:00

mik01aj

6471814

1

For those who have a static set (or don't have a diverse mix of charsets), the convmv option is amazingly simple and perfect. For the OP, who potentially has a multitude of charsets, this could be merged with the other answer, since convmv seems to know whether or not it has encountered the correct format. By looping through the charsets listed by convmv --list, one would get them properly encoded.
– user273265
Nov 11 '13 at 20:14

1

By this I mean: if one runs a Debian server, as the OP does, one would certainly assume UTF-8 these days, in which case one can keep the original letters. I had a folder with some Nordic characters and used: convmv -t utf8 --nfc -f iso-8859-1 --notest -r . (the --nfc option was there to conform to Linux rather than OS X). Simply typing convmv lists the (useful) options.
– user273265
Nov 11 '13 at 20:14

up vote
0
down vote

I know you asked about renaming.

But you can dodge the problem quite easily using software like MusicBrainz Picard.

With this approach it does not really matter how messy/badly named your collection really is, as long as the files are readable and complete.

(Did I mention it's free? Both as in free speech and as in free beer? Both the software and the database..?)

answered Oct 16 '15 at 4:45

Alois Mahdal

1,37931333
