Difference between revisions of "Perl"

From Organic Design wiki
(See also: Good blog item on Perl and UTF-8)
(Perl and UTF-8)
Line 1: Line 1:
{{stub}}
+
== Perl and UTF-8 ==
== Perl one liners ==
+
[https://blog.summercat.com/perl-and-character-encoding.html This excellent blog item on Perl's UTF-8 handling] clears up the confusion that can arise when dealing with UTF-8 in Perl. The main complexity is that Perl has its own internal character encoding, which lets it hold Unicode characters in strings and treat each one as a single character. This is useful because a Unicode character may be represented by multiple bytes in an encoding such as UTF-8. As a result there are two types of strings: those whose characters are in Perl's internal encoding, and those that are not. The latter can typically be treated as binary data ("octets"), though strictly speaking they are octets only if they come from outside the program, i.e., not from strings defined in source code.
If you type:
 
<source lang="perl">
 
perl --help
 
</source>
 
  
on the command line, Perl will provide details on its usage and command line switches. The switches we are interested in are:
+
One approach that can work is to set ''binmode'' on all filehandles, including ''stdin/stdout'', then call ''Encode::decode()'' at the point data enters the program and ''Encode::encode()'' on its way out. This avoids the complexity of I/O layers and gives you a defined boundary where translation occurs.
<pre>
 
  -d[:debugger]  run program under debugger
 
  -e program      one line of program (several -e's allowed, omit programfile)
 
  -n              assume 'while (<>) { ... }' loop around program
 
</pre>
 
  
== Using -e switch ==
+
Remember also to always use "UTF-8" rather than "utf8" as the encoding name in the ''Encode'' module: only the former produces correct UTF-8, while the latter is a Perl-specific, more permissive variant and is not strictly UTF-8.
<source lang="perl">
 
# Unix/Linux
 
perl -e 'print "It matches\n" if "Hello World" =~ /World/;'
 
# Windows
 
perl -e "print \"It matches\n\" if \"Hello World\" =~ /World/;"
 
</source>
 
* See [http://www.perlmonks.org/?node_id=945 string literals] for details on differences between single and double quotes.
 
  
== Using -de switch ==
+
See also [https://perl-begin.org/topics/files-and-directories/ Reading and writing UTF-8 examples with IO::All].
To start the debugger for a one line program, type:
 
<source lang="perl">
 
perl -de 42
 
</source>
 
  
Now type 'h' to get help on the commands available within the debugger. Commands useful for one liners in the debugger are:
 
*'x' to eval an expression in list context, and print the result.
 
*'q' to quit
 
  
 
== See also ==
 
== See also ==
Line 37: Line 14:
 
*[http://perl.com/pub/1999/10/DBI.html Basic DBI tutorial]
 
*[http://perl.com/pub/1999/10/DBI.html Basic DBI tutorial]
 
*[[Regular expressions]] and [http://perldoc.perl.org/functions/quotemeta.html Quotemeta]
 
*[[Regular expressions]] and [http://perldoc.perl.org/functions/quotemeta.html Quotemeta]
*[https://perl-begin.org/topics/files-and-directories/ Reading and writing UTF-8 examples with IO::All]
 
*[https://blog.summercat.com/perl-and-character-encoding.html Good blog item on Perl and UTF-8]
 
 
[[Category:Programming languages]]
 
[[Category:Programming languages]]

Revision as of 20:55, 9 January 2019

Perl and UTF-8

This excellent blog item on Perl's UTF-8 handling clears up the confusion that can arise when dealing with UTF-8 in Perl. The main complexity is that Perl has its own internal character encoding, which lets it hold Unicode characters in strings and treat each one as a single character. This is useful because a Unicode character may be represented by multiple bytes in an encoding such as UTF-8. As a result there are two types of strings: those whose characters are in Perl's internal encoding, and those that are not. The latter can typically be treated as binary data ("octets"), though strictly speaking they are octets only if they come from outside the program, i.e., not from strings defined in source code.
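
The difference between the two kinds of string can be seen with a minimal sketch like the following. The variable names are illustrative only; the octets are written out by hand to stand in for data arriving from outside the program.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "é" as it would arrive from outside the program: two UTF-8 octets.
my $octets = "\xC3\xA9";

# Decoding turns those octets into a single Perl character (U+00E9).
my $chars = decode('UTF-8', $octets);

my $octet_count = length $octets;   # length counts octets here
my $char_count  = length $chars;    # length counts characters here

printf "octets: %d, characters: %d\n", $octet_count, $char_count;
```

The same call to length gives different answers because only the decoded string is treated as a sequence of characters.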

One approach that can work is to set binmode on all filehandles, including stdin/stdout, then call Encode::decode() at the point data enters the program and Encode::encode() on its way out. This avoids the complexity of I/O layers and gives you a defined boundary where translation occurs.
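
A rough sketch of that boundary might look like this. The helper names read_text and write_text are hypothetical (they are not part of Encode), and in-memory filehandles stand in for real files or stdin/stdout so that the example is self-contained.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: read raw octets and decode them at the boundary.
sub read_text {
    my ($fh) = @_;
    binmode $fh;                 # raw octets, no I/O layers
    local $/;                    # slurp the whole handle
    my $octets = <$fh>;
    return decode('UTF-8', $octets);
}

# Hypothetical helper: encode characters back to octets on the way out.
sub write_text {
    my ($fh, $text) = @_;
    binmode $fh;
    print {$fh} encode('UTF-8', $text);
}

# Demo input: "naïve" as UTF-8 octets, held in an in-memory file.
my $input = "na\xC3\xAFve";
open my $in, '<', \$input or die "open: $!";
my $text = read_text($in);
close $in;

# Inside the boundary we work on characters, so reversing is safe.
my $reversed = reverse $text;

my $output = '';
open my $out, '>', \$output or die "open: $!";
write_text($out, $reversed);
close $out;
```

Reversing the undecoded octets instead would split the two-byte sequence for "ï" and produce invalid UTF-8, which is the kind of bug the boundary is meant to prevent.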

Remember also to always use "UTF-8" rather than "utf8" as the encoding name in the Encode module: only the former produces correct UTF-8, while the latter is a Perl-specific, more permissive variant and is not strictly UTF-8.
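
One way to observe the difference, offered as a sketch rather than a complete test of either encoding, is to encode a code point just beyond the Unicode range with a croaking check and see which name accepts it. The variable names are illustrative.

```perl
use strict;
use warnings;
no warnings 'utf8';   # silence "not Unicode" warnings for this demo
use Encode qw(encode FB_CROAK LEAVE_SRC);

# U+110000 lies just beyond the last Unicode code point (U+10FFFF).
my $beyond = "\x{110000}";

# Each call either returns octets or croaks; eval records which happened.
my $lax_accepts    = eval { encode('utf8',  $beyond, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
my $strict_accepts = eval { encode('UTF-8', $beyond, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;

print "utf8 accepts it:  $lax_accepts\n";
print "UTF-8 accepts it: $strict_accepts\n";
```

Without a check argument such as FB_CROAK, the strict encoding substitutes a replacement character instead of dying, so the check is what makes the difference visible here.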

See also Reading and writing UTF-8 examples with IO::All.


See also