Anyone help me fix my bash script?

Started by bugmenоt, February 27, 2017, 01:20:33 PM

bugmenоt

I want to automatically replace HTML entities with their corresponding characters in bash. For example, the string "Fremdsprachige B&uuml;cher" should become "Fremdsprachige Bücher".

The function should take any string as input and replace every HTML entity in it with the character it represents. It will use the file "html_entities.csv" as a lookup table, which looks like this:

Code (html_entities.csv)

...
upsilon;υ
Uuml;Ü
uuml;ü
weierp;℘
Xi;Ξ
...


This is the script so far:

Code (sanitize.sh)

sanitizeHTML() {
OUTSTRING="$1"
cat "html_entities.csv" | while read LINE; do
  IFS=';' read FROM TO <<< "$LINE"
  MATCH="&$FROM;"
  while [[ $OUTSTRING =~ (.*)$MATCH(.*) ]]; do
    OUTSTRING="${BASH_REMATCH[1]}$TO${BASH_REMATCH[2]}"
    echo "DURING LOOP IT DOES THIS: $OUTSTRING"
  done
done
echo "AFTER LOOP IT DOES THIS: $OUTSTRING"
}

sanitizeHTML "Fremdsprachige B&uuml;cher"


But the output is this:

DURING LOOP IT DOES THIS: Fremdsprachige Bücher
AFTER LOOP IT DOES THIS: Fremdsprachige B&uuml;cher


The replacement seems to work during the loop, but why does $OUTSTRING contain the original html entities after the loop? I'd like to solve this without grep or sed by the way.

tyrannosaurus vex

#1
In Bash, each command in a pipeline runs in a subshell, and that includes a while loop on the receiving end of a pipe. The subshell gets its own copy of the variables, so anything you set (or "change") inside the loop is lost when the loop exits. To avoid that, use process substitution and redirect it into the while loop instead of piping into it.


i put code here but it was wrong
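
For reference, a minimal sketch of the effect described above, separate from the script in question. The variable names (count, piped_count, redirected_count) are made up for the demo, and it assumes Bash with default settings. The same loop is fed once through a pipe and once through process substitution:

```bash
#!/usr/bin/env bash
# Sketch: the same loop fed two ways. "count" is just an example variable.

count=0
printf 'a\nb\nc\n' | while read -r line; do
  count=$((count + 1))      # changes a copy inside the pipeline's subshell
done
piped_count=$count
echo "after pipe:     $piped_count"

count=0
while read -r line; do
  count=$((count + 1))      # runs in the current shell
done < <(printf 'a\nb\nc\n')
redirected_count=$count
echo "after redirect: $redirected_count"
```

With default settings the first echo reports 0 and the second reports 3.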
Evil and Unfeeling Arse-Flenser From The City of the Damned.

tyrannosaurus vex

#2

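sanitizeHTML() {
  OUTSTRING="$1"
  # Read the lookup table line by line; -r keeps backslashes literal.
  while read -r LINE; do
    IFS=';' read -r FROM TO <<< "$LINE"
    MATCH="&$FROM;"
    # Quote $MATCH so it is matched literally, not as a regex.
    while [[ $OUTSTRING =~ (.*)"$MATCH"(.*) ]]; do
      OUTSTRING="${BASH_REMATCH[1]}$TO${BASH_REMATCH[2]}"
      echo "DURING LOOP IT DOES THIS: $OUTSTRING"
    done
  # Redirect instead of piping, so this loop runs in the current shell.
  done < <(cat html_entities.csv)
  echo "AFTER LOOP IT DOES THIS: $OUTSTRING"
}

sanitizeHTML "Fremdsprachige B&uuml;cher"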



explanation: just put "cat html_entities.csv" in its own process substitution and redirect it into the while loop, rather than piping it in at the beginning. Only cat runs in a separate process that way. The loop is no longer part of a pipeline, so it runs in the current shell and its changes to OUTSTRING are still there after the loop.
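
A possible variation, offered only as a sketch: since the lookup table is a plain file, the loop could read it directly with a redirection (no cat), and Bash's `${var//pattern/replacement}` expansion could do the replacing, which also stays clear of grep and sed. The function name sanitize_html_sketch, its optional second argument, and the throwaway two-row table are assumptions made for this example, not part of the original script.

```bash
#!/usr/bin/env bash
# Sketch: decode entities using a lookup file with lines like "uuml;ü".
# sanitize_html_sketch and its optional second argument are hypothetical.
sanitize_html_sketch() {
  local out=$1 table=${2:-html_entities.csv} from to
  # The loop reads from a redirection, not a pipe, so it runs in the
  # current shell and changes to "out" persist after the loop.
  while IFS=';' read -r from to; do
    [[ -n $from ]] || continue          # skip lines without an entity name
    # Replace every literal "&name;" with its character. Both parts are
    # quoted so neither is treated as a pattern or as a special "&".
    out=${out//"&${from};"/"$to"}
  done < "$table"
  printf '%s\n' "$out"
}

# Usage with a tiny throwaway table (two rows copied from the thread).
table=$(mktemp)
printf '%s\n' 'Uuml;Ü' 'uuml;ü' > "$table"
result=$(sanitize_html_sketch "Fremdsprachige B&uuml;cher" "$table")
echo "$result"
rm -f "$table"
```

Printing the result instead of setting a global keeps the function usable inside `$(...)`, at the cost of one command substitution per call.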

bugmenоt

Works like a charm! Subshells and how they behave have always been some kind of black box to me. I still don't get why a subshell is needed if the input has an unknown length. Thanks for making my script work and for the search terms I can use for further research.

tyrannosaurus vex

it's sort of black magic to me too, it's just bitten me a few times before so i remember it. as far as i can tell the length of the input isn't what matters: Bash runs every command in a pipeline in its own subshell, so a loop that gets piped into always works on a copy of the variables. when you do the 'cat' in its own process substitution and redirect it into the loop, the loop isn't part of a pipeline anymore, so it runs in the current shell and the changes stick.
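
A small sketch that may help with the "unknown length" question, assuming Bash 4.2 or newer run as a script. The variable names (seen, default_result, lastpipe_result) are made up for the demo. The input is a single, fully known line, yet the piped loop still loses its change; with the lastpipe shell option, the same pipe keeps it:

```bash
#!/usr/bin/env bash
# Sketch: one known line of input, piped into a loop.
# "seen" is just an example variable.

seen=no
echo 'one line' | while read -r line; do seen=yes; done
default_result=$seen
echo "pipe, default settings: $default_result"

# lastpipe only takes effect when job control is off (the normal case in
# a script); set +m makes sure of that. The last command of a pipeline
# then runs in the current shell instead of a subshell.
set +m
shopt -s lastpipe
seen=no
echo 'one line' | while read -r line; do seen=yes; done
lastpipe_result=$seen
echo "pipe, with lastpipe:    $lastpipe_result"
```

So the subshell seems to come from the pipe itself rather than from how much data flows through it.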