Difference between revisions of "Perl"
== Opening pipes in Perl ==
[https://perldoc.perl.org/5.8.6/perlopentut Using open with pipes in Perl] has some good examples and explanations of the ''open'' function. One of the most useful of these constructs for us at Organic Design is to run a shell command so that anything it sends to STDOUT is available to Perl through a pipe, allowing the command's output to be processed line by line as if it were a file.

For example, here we run the ''pdftotext'' command with "-" as the output file, which tells ''pdftotext'' to write to STDOUT, and we end the command string with a pipe character to tell Perl that the command's output should be made available through a local filehandle.
<source lang="perl">
# The trailing "|" tells Perl to run the command and read from its STDOUT
open( PDF, "pdftotext -layout foo.pdf - |" ) || die "can't fork pdftotext: $!";
while( <PDF> ) {
	# Each line of text in the PDF is in $_ here
}
# Closing a pipe waits for the command to finish; its exit status is in $?
close(PDF) || warn "pdftotext exited with status $?";
</source>
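The same idea can be sketched with the list form of ''open'' and a lexical filehandle, which runs the command directly instead of passing a string to the shell. This is only a sketch: the helper name ''read_command_lines'' is hypothetical, and the list form of a pipe open is assumed to be available on your platform (it has long worked on Unix-like systems).
<source lang="perl">
use strict;
use warnings;

# Hypothetical helper: run a command without a shell and return the lines
# it wrote to STDOUT, with trailing newlines removed.
sub read_command_lines {
	my @cmd = @_;

	# '-|' means "read from the command's STDOUT"; passing the command
	# as a list avoids shell quoting problems with odd file names.
	open( my $fh, '-|', @cmd ) or die "can't run $cmd[0]: $!";
	my @lines = <$fh>;
	chomp @lines;

	# Closing a pipe waits for the command; a false return means the
	# close failed or the command exited with a non-zero status.
	close($fh) or warn "$cmd[0] exited with status $?\n";
	return @lines;
}
</source>
It could then be called as <code>my @text = read_command_lines( 'pdftotext', '-layout', 'foo.pdf', '-' );</code>, where ''foo.pdf'' is the same placeholder file name used above.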
== Perl and UTF-8 ==
[https://blog.summercat.com/perl-and-character-encoding.html This excellent blog item on Perl's UTF-8 handling] clears up the confusion that can arise when dealing with UTF-8 in Perl. The main complexity is that Perl has an internal character encoding, which allows it to hold Unicode characters in strings and treat each one as a single character. This is useful because a Unicode character may be represented by multiple bytes in an encoding such as UTF-8. It also means there are two types of strings: those holding characters in Perl's internal encoding, and those that do not. The latter can typically be treated as binary data ("octets"), though strictly speaking they are octets only if they come from outside the program, i.e. they are not strings defined in source code.

One approach that can work is to call ''binmode'' on all filehandles, including ''STDIN'' and ''STDOUT'', so that they carry raw octets. Then call ''Encode::decode()'' at the point where data enters the program, and ''Encode::encode()'' on its way out. This lets you avoid the complexity of I/O layers, and gives you a defined boundary where translation occurs.

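A minimal sketch of that boundary approach follows. The literal octet string stands in for data read from a file or socket, and the ''uc'' call is just a placeholder for whatever the program does with the text; both are assumptions made for illustration.
<source lang="perl">
use strict;
use warnings;
use Encode qw(decode encode);

# Raw octets on the standard handles: no I/O layer does any translation.
binmode(STDIN);
binmode(STDOUT);

# Stand-in for data arriving from outside: "café" as five UTF-8 octets.
my $octets = "caf\xC3\xA9";

# Inbound boundary: octets become a character string (croaks on bad input).
my $text = decode( 'UTF-8', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC() );

# Inside the program we work with characters: four of them here.
my $upper = uc $text;

# Outbound boundary: the character string becomes octets again.
my $out = encode( 'UTF-8', $upper );
print $out, "\n";
</source>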
Remember also to always use "UTF-8" rather than "utf8" as the encoding name in the ''Encode'' module: only the former is strict, standard UTF-8, while the latter is Perl's own lax variant, which accepts sequences that are not valid UTF-8.

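One way to see that the two names really select different encodings is to ask ''Encode'' which encoding object each name resolves to. This is only an illustrative check, not something the linked article prescribes.
<source lang="perl">
use strict;
use warnings;
use Encode qw(find_encoding);

# "UTF-8" resolves to the strict encoding, "utf8" to Perl's lax one.
my $strict = find_encoding('UTF-8')->name;
my $lax    = find_encoding('utf8')->name;

print "UTF-8 => $strict\n";    # prints "UTF-8 => utf-8-strict"
print "utf8  => $lax\n";       # prints "utf8  => utf8"
</source>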
See also [https://perl-begin.org/topics/files-and-directories/ Reading and writing UTF-8 examples with IO::All], [https://www.perl.com/pub/2012/04/perlunicook-standard-preamble.html/ Perl Unicode cookbook] and [http://www.modernperlbooks.com/mt/2013/01/why-unicode-normalization-matters.html Why Unicode Normalization Matters].

== See also ==
*[http://perl.com/pub/1999/10/DBI.html Basic DBI tutorial]
*[[Regular expressions]] and [http://perldoc.perl.org/functions/quotemeta.html Quotemeta]
*[https://www.evanmiller.org/a-review-of-perl-6.html A review of Perl6]
[[Category:Programming languages]][[Category:PERL]]
Latest revision as of 16:49, 22 December 2022