Topic: Anyone help me fix my bash script?

bugmenоt

  • Outlandish
  • ***
  • Posts: 4016
Anyone help me fix my bash script?
« on: February 27, 2017, 01:20:33 pm »
I want to automatically replace HTML entities with their corresponding characters in bash. For example, the string "Fremdsprachige B&uuml;cher" should become "Fremdsprachige Bücher".

The function should take any string as input and replace every HTML entity in it with the character it represents. It will use the file "html_entities.csv" as a lookup table, which looks like this:

Code: (html_entities.csv) [Select]
...
upsilon;υ
Uuml;Ü
uuml;ü
weierp;℘
Xi;Ξ
...

This is the script so far:

Code: (sanitize.sh) [Select]
sanitizeHTML() {
OUTSTRING="$1"
cat "html_entities.csv" | while read LINE; do
  IFS=';' read FROM TO <<< "$LINE"
  MATCH="&$FROM;"
  while [[ $OUTSTRING =~ (.*)$MATCH(.*) ]]; do
    OUTSTRING="${BASH_REMATCH[1]}$TO${BASH_REMATCH[2]}"
    echo "DURING LOOP IT DOES THIS: $OUTSTRING"
  done
done
echo "AFTER LOOP IT DOES THIS: $OUTSTRING"
}

sanitizeHTML "Fremdsprachige B&uuml;cher"

But the output is this:
Code: [Select]
DURING LOOP IT DOES THIS: Fremdsprachige Bücher
AFTER LOOP IT DOES THIS: Fremdsprachige B&uuml;cher

The replacement seems to work during the loop, but why does $OUTSTRING contain the original html entities after the loop? I'd like to solve this without grep or sed by the way.

tyrannosaurus vex

  • a gas giant of idiots
  • Deserved It
  • ****
  • Posts: 26657
Re: Anyone help me fix my bash script?
« Reply #1 on: February 27, 2017, 03:23:12 pm »
In Bash, every command in a pipeline runs in its own subshell, and that includes a while loop on the receiving end of a pipe. A subshell only gets a copy of the main script's variables, so anything you set (or "change") inside the loop is lost when the loop exits. To avoid that, feed the loop through a redirect with process substitution instead of piping into it.

Code: [Select]
i put code here but it was wrong
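
A minimal sketch of the effect, separate from the original script: the function names and the COUNT variable are made up for illustration. The same loop is fed once through a pipe and once through a redirect from a process substitution, and only the second version keeps the value it counted.

```shell
#!/usr/bin/env bash

# Hypothetical demo; count_with_pipe, count_with_redirect and COUNT
# are illustrative names, not part of sanitize.sh.

count_with_pipe() {
  local COUNT=0
  printf 'a\nb\nc\n' | while read -r LINE; do
    COUNT=$((COUNT + 1))   # changes a copy inside the pipeline's subshell
  done
  echo "$COUNT"            # the function's own COUNT was never touched
}

count_with_redirect() {
  local COUNT=0
  while read -r LINE; do
    COUNT=$((COUNT + 1))   # runs in the current shell
  done < <(printf 'a\nb\nc\n')
  echo "$COUNT"
}

echo "after pipe: $(count_with_pipe)"          # after pipe: 0
echo "after redirect: $(count_with_redirect)"  # after redirect: 3
```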
« Last Edit: February 27, 2017, 03:27:57 pm by tyrannosaurus vex »
Evil and Unfeeling Arse-Flenser From The City of the Damned.

tyrannosaurus vex

Re: Anyone help me fix my bash script?
« Reply #2 on: February 27, 2017, 03:41:37 pm »
Code: [Select]
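sanitizeHTML() {
OUTSTRING="$1"
# Read the lookup table line by line. The loop runs in the current shell
# because its input comes from a redirect, not a pipe.
while read -r LINE; do
  # Split "name;character" into FROM and TO.
  IFS=';' read -r FROM TO <<< "$LINE"
  MATCH="&$FROM;"
  # Replace every occurrence of the entity, one match per pass.
  while [[ $OUTSTRING =~ (.*)$MATCH(.*) ]]; do
    OUTSTRING="${BASH_REMATCH[1]}$TO${BASH_REMATCH[2]}"
    echo "DURING LOOP IT DOES THIS: $OUTSTRING"
  done
done < <(cat html_entities.csv)
echo "AFTER LOOP IT DOES THIS: $OUTSTRING"
}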

sanitizeHTML "Fremdsprachige B&uuml;cher"


explanation: just put "cat html_entities.csv" in its own process substitution and redirect it into the while loop, rather than piping it in at the beginning. With a pipe, bash runs the loop in a subshell, so the changes to OUTSTRING disappear when the loop ends. With a redirect, only the 'cat' runs in a separate process and the loop itself stays in the current shell, so OUTSTRING keeps its new value.
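
Since the loop only needs the file's contents, a possible simplification is to redirect the file directly and let bash's pattern substitution do the replacing, which still avoids grep and sed. This is a sketch under assumptions rather than the script from the thread: the optional second argument (a path to the lookup table) and the TABLE variable are additions for illustration.

```shell
#!/usr/bin/env bash

# Sketch: decode entities using a lookup file of "name;character" lines.
# The second argument (table path) is an assumption added for illustration;
# it defaults to the html_entities.csv file from the original post.
sanitizeHTML() {
  local OUTSTRING="$1"
  local TABLE="${2:-html_entities.csv}"
  local FROM TO
  while IFS=';' read -r FROM TO; do
    [[ -n $FROM ]] || continue          # skip blank lines
    # Replace all occurrences at once. Both sides are quoted so the
    # entity and the replacement character are taken literally.
    OUTSTRING=${OUTSTRING//"&$FROM;"/"$TO"}
  done < "$TABLE"                       # redirect, so no subshell for the loop
  printf '%s\n' "$OUTSTRING"
}

# Demo with a throwaway table so the example runs on its own.
DEMO_TABLE=$(mktemp)
printf '%s\n' 'Uuml;Ü' 'uuml;ü' > "$DEMO_TABLE"
sanitizeHTML "Fremdsprachige B&uuml;cher" "$DEMO_TABLE"   # prints: Fremdsprachige Bücher
rm -f "$DEMO_TABLE"
```

One caveat with any table-driven approach like this: each entry is applied in file order, so an input such as "&amp;lt;" can end up decoded twice if the "amp" line comes before the "lt" line.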
« Last Edit: February 27, 2017, 03:46:02 pm by tyrannosaurus vex »

bugmenоt

Re: Anyone help me fix my bash script?
« Reply #3 on: February 27, 2017, 09:20:54 pm »
Works like a charm! Subshells and how they behave have always been some kind of black box to me. I still don't get why a pipe needs a subshell when a redirect doesn't. Thanks for making my script work and for the search terms I can use for further research.

tyrannosaurus vex

Re: Anyone help me fix my bash script?
« Reply #4 on: February 27, 2017, 09:57:40 pm »
it's sort of black magic to me too, it's just bitten me a few times before so i remember it. as far as i understand it, bash starts every command in a pipeline as its own process so they can all run at the same time with pipes connecting them, and a loop on the end of a pipe gets that treatment too, so it only ever changes its own copy of the variables. how much data comes through doesn't matter. a redirect (even from a process substitution) isn't a pipeline, so the loop stays in the main shell and only the 'cat' runs in a separate process.
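
Because each stage of a pipeline runs in its own subshell, another workaround (not one used in the thread) is to keep the pipe but put the loop and everything that needs its result inside one brace group, so they share that subshell. A small sketch with made-up names:

```shell
#!/usr/bin/env bash

# Hypothetical demo; count_lines_grouped and COUNT are illustrative names.
count_lines_grouped() {
  printf 'a\nb\nc\n' | {
    COUNT=0
    while read -r LINE; do
      COUNT=$((COUNT + 1))
    done
    # Still inside the same subshell as the loop, so COUNT is visible here.
    echo "$COUNT"
  }
}

count_lines_grouped   # prints 3
```

The value still does not reach the calling shell as a variable, so this only helps when the result can be printed or used inside the group. Bash 4.2 and later also have "shopt -s lastpipe", which runs the last stage of a pipeline in the current shell when job control is off, as it normally is in scripts.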