Perl LWP Programming: Basics and Usage Examples

Using Cookies in CGI

A major limitation of the HTTP protocol is that it is stateless: the server cannot tell whether two requests come from the same user. This is a serious inconvenience for the programmer, and cookies were introduced to make up for it.

A cookie is a small piece of data that a script asks the client's browser to store on the client's machine. The next time the client accesses the script, the browser sends the data back, which lets the script recognize the client. Cookies are therefore often used in identity verification.

The syntax of cookies

An HTTP cookie is sent in the HTTP headers, which are delivered before the document body. The syntax of the Set-Cookie header is as follows:

Set-Cookie: name=value; expires=date; path=path; domain=domain; secure

  • name=value: the name and value of the cookie to set (neither may contain ";" or ","). When several name/value pairs are used, separate them with ";", e.g. name1=value1; name2=value2; name3=value3


  • expires=date: the expiration date of the cookie, in the format expires="Wdy, DD-Mon-YYYY HH:MM:SS GMT".

  • path=path: the path the cookie applies to. If path is a directory, the cookie is valid for all files and subdirectories in that directory, e.g. path="/cgi-bin/". If path is a file, the cookie is valid only for that file, e.g. path="/cgi-bin/cookie.cgi".

  • domain=domain: the domain name the cookie is valid for, e.g. domain="www.w3cschool.cn".

  • secure: if this flag is given, the cookie is sent only over HTTPS (SSL-protected) connections.

Cookies are received through the HTTP_COOKIE environment variable, which the CGI program can read to obtain the cookie information.
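The header syntax and the HTTP_COOKIE variable described above can be sketched in plain Perl as follows. This is a minimal illustration that uses only core Perl; the helper names build_set_cookie and parse_cookies, and the cookie "user=alice", are assumptions made for the example and do not come from any library.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a Set-Cookie header line from a name, a value and optional
# attributes. The attribute names mirror the syntax described above.
# (Hypothetical helper, not part of any module.)
sub build_set_cookie {
    my ($name, $value, %attr) = @_;
    my @parts = ("$name=$value");
    for my $key (qw(expires path domain)) {
        push @parts, "$key=$attr{$key}" if defined $attr{$key};
    }
    push @parts, 'secure' if $attr{secure};
    return 'Set-Cookie: ' . join('; ', @parts);
}

# Split a cookie string (as found in $ENV{HTTP_COOKIE}) into a hash.
# (Hypothetical helper, not part of any module.)
sub parse_cookies {
    my ($raw) = @_;
    my %cookies;
    for my $pair (split /;\s*/, $raw // '') {
        my ($name, $value) = split /=/, $pair, 2;
        next unless defined $name && length $name;
        $cookies{$name} = defined $value ? $value : '';
    }
    return %cookies;
}

# Send one cookie, then echo back whatever cookies the browser returned.
my $header  = build_set_cookie('user', 'alice', path => '/cgi-bin/');
my %cookies = parse_cookies($ENV{HTTP_COOKIE});

print "$header\r\n";                          # cookie header goes first
print "Content-type: text/plain\r\n\r\n";     # blank line ends the headers
print "$_ = $cookies{$_}\n" for sort keys %cookies;
```

The Set-Cookie line has to be printed before the blank line that ends the headers; anything printed after that point is treated as the document body.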

Automating Web Tasks with LWP

Have you ever needed to check a Web document for dead links, find its title, or figure out which of its links have been updated since last Thursday? Or wanted to download the images contained in a document, or mirror an entire directory? What happens if you have to go through a proxy server or handle redirects?

Now, you could do all of this by hand with your browser, but because graphical interfaces are woefully inadequate for programmatic automation, it would be a slow and tedious process requiring more patience and less laziness* than most of us possess.

* Remember that, according to Larry Wall, the three principal virtues of a programmer are Laziness, Impatience, and Hubris.

The LWP modules (Library for WWW access in Perl) from CPAN do all of this for you, and more. For example, fetching a Web document in a script with these modules is so easy that you can write it as a one-liner. To retrieve the /perl/index.html document from www.perl.com, for instance, type the following line into your shell or command interpreter:

perl -MLWP::Simple -e "getprint 'http://www.perl.com/perl/index.html'"

Apart from the LWP::Simple module, most of the modules in the LWP suite are strongly object-oriented. Here, for example, is a tiny program that takes URLs as arguments and prints their titles:

#!/usr/local/bin/perl
use LWP;
$browser = LWP::UserAgent->new();        # create a virtual browser
$browser->agent("Mothra/126-Paladium");  # give it a name
foreach $url (@ARGV) {                   # expect URLs as arguments

  # make a GET request on the URL via the virtual browser
  $webdoc = $browser->request(HTTP::Request->new(GET => $url));

  if ($webdoc->is_success) {             # found it
    print STDOUT "$url: ", $webdoc->title, "\n";

  } else {                               # something went wrong
    print STDERR "$0: Couldn't fetch $url\n";

  }
}

As you can see, the effort spent learning Perl objects was not wasted. But remember that, like the CGI.pm module, the LWP modules hide most of the hard work.

This script works as follows. First a user agent object is created (something like an automated virtual browser). This object is used to issue requests to remote servers. We give our virtual browser a silly name just to make people's log files more interesting. Then we fetch the remote document by sending an HTTP GET request to the remote server. If the result is successful, we print the URL and the title of the document; otherwise, we whine a bit.

Here is a program that prints a sorted list of the unique links and images contained in the URLs passed as command-line arguments.

#!/usr/local/bin/perl -w
use strict;
use LWP 5.000;
use URI::URL;
use HTML::LinkExtor;

my($url, $browser, %saw);
$browser = LWP::UserAgent->new(); # create a virtual browser
foreach $url ( @ARGV ) {
  # fetch the document via the virtual browser
  my $webdoc = $browser->request(HTTP::Request->new(GET => $url));
  next unless $webdoc->is_success;
  next unless $webdoc->content_type eq 'text/html'; # can't parse GIF files

  my $base = $webdoc->base;

  # now extract all links of the form <A ...> and <IMG ...>
  foreach (HTML::LinkExtor->new->parse($webdoc->content)->eof->links) {
    my($tag, %links) = @$_;
    next unless $tag eq "a" or $tag eq "img";
    my $link;

    foreach $link (values %links) {
      $saw{ url($link,$base)->abs->as_string }++;
    }
  }
}
print join("\n",sort keys %saw), "\n";

At first glance this may look very complicated, but most of that impression comes from not yet having a clear picture of how the various objects and their methods work. We are not going to explain these here, because this book is already long enough. Fortunately, LWP comes with extensive documentation and examples.



Web Browsing

To understand the concept of CGI, let's see what happens when we click a hyperlink on a web page to browse to a particular page or URL.

  • Your browser contacts the web server using the HTTP protocol and requests the URL, i.e., the web page filename.

  • The web server checks the URL and looks for the requested file. If it finds the file, it sends it back to the browser without any further execution; otherwise it sends an error message indicating that you have requested a wrong file.

  • The web browser takes the response from the web server and displays either the content of the received file or an error message if the file was not found.

However, an HTTP server can be set up so that whenever a file in a certain directory is requested, the file is not sent back; instead it is executed as a program, and whatever that program outputs is sent back for your browser to display. This server facility is called the Common Gateway Interface, or CGI, and the programs the server executes to produce the final result are called CGI scripts. A CGI program can be a Perl script, a shell script, a C or C++ program, etc.

2.7 Are there any good Perl docs on the Internet?

Yes. In my opinion, Randal Schwartz's series of articles for Unix Review,
Perl Columns, is the best introduction to Perl, and far more interesting and
useful than the Llama and Camel books (the authors' opinions do not always
coincide with the coordinator's. Ammosov). You can read them at
http://w3.stonehenge.com:80/merlyn/UnixReview/.

Why do I consider this the best introduction to Perl? Because these are separate
short articles, each of which illustrates specific
features of Perl by writing a program for a fairly
simple task. Taken together, the articles cover practically the whole range
of Perl's capabilities, from a one-line script that
can change Ivanov to Sidorov in every file in a directory tree
to the basics of object-oriented programming and the principles
of creating your own modules and libraries.

Passing Radio Button Data to CGI Program

Radio buttons are used when exactly one option must be selected. Here is example HTML code for a form with two radio buttons:

<form action = "/cgi-bin/radiobutton.cgi" method = "POST" target = "_blank">
<input type = "radio" name = "subject" value = "maths"> Maths
<input type = "radio" name = "subject" value = "physics"> Physics
<input type = "submit" value = "Select Subject">
</form>

Below is the radiobutton.cgi script that handles the radio button input sent by the web browser.

#!/usr/bin/perl

local ($buffer, @pairs, $pair, $name, $value, %FORM);
# Read in text
$ENV{'REQUEST_METHOD'} =~ tr/a-z/A-Z/;
if ($ENV{'REQUEST_METHOD'} eq "POST") {
   read(STDIN, $buffer, $ENV{'CONTENT_LENGTH'});
} else {
   $buffer = $ENV{'QUERY_STRING'};
}
# Split information into name/value pairs
@pairs = split(/&/, $buffer);
foreach $pair (@pairs) {
   ($name, $value) = split(/=/, $pair);
   $value =~ tr/+/ /;
   $value =~ s/%(..)/pack("C", hex($1))/eg;
   $FORM{$name} = $value;
}
$subject = $FORM{subject};

print "Content-type:text/html\r\n\r\n";
print "<html>";
print "<head>";
print "<title>Radio - Fourth CGI Program</title>";
print "</head>";
print "<body>";
print "<h2> Selected Subject is $subject</h2>";
print "</body>";
print "</html>";

1;

Building Infrastructure To Support Web Scraping With Perl

Proxies act as intermediaries between the user and the website, masking your IP address and allowing you to access blocked or restricted content. With proxies, you can make multiple requests from different IP addresses with less chance of getting blocked by anti-scraping measures. This makes it easier for Perl scripts to scrape data from a large number of websites efficiently and reliably.

You’ll need reliable proxy servers to make sure your project runs smoothly, especially if you’re aiming for scalability.

Processing input

The first way to pass data is with the query string (the portion of a URI beginning with "?"). This uses the “GET” request method, and the string becomes available to the program in the QUERY_STRING environment variable (CGI programs receive their arguments as environment variables). But CGI provides a method that parses the query string into key/value pairs, so you can work with them like a hash.

The second way is a web page with a form. When you click submit, the browser sends an HTTP “POST” request, with the form input as key/value pairs in the request body. CGI handles this and makes the data available in the same way as with “GET”. Only, with “POST” the size of the input can be much larger (URLs are generally limited to about 2048 bytes by browsers).
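To make the two request methods concrete, here is a minimal sketch that reads the parameters by hand, using only core Perl. It is not the CGI module's own implementation; the helper names url_decode, parse_pairs and read_form are assumptions made for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Decode one URL-encoded component: "+" means space, %XX is one byte.
sub url_decode {
    my ($text) = @_;
    $text =~ tr/+/ /;
    $text =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;
    return $text;
}

# Turn "a=1&b=two+words" into a hash of decoded names and values.
# A repeated name keeps only its last value in this sketch.
sub parse_pairs {
    my ($raw) = @_;
    my %form;
    for my $pair (split /[&;]/, $raw // '') {
        my ($name, $value) = split /=/, $pair, 2;
        next unless defined $name && length $name;
        $form{ url_decode($name) } = url_decode($value // '');
    }
    return %form;
}

# Pick the input source according to the request method:
# POST data arrives on STDIN, GET data in QUERY_STRING.
sub read_form {
    my $method = uc($ENV{REQUEST_METHOD} // 'GET');
    my $raw = '';
    if ($method eq 'POST') {
        read(STDIN, $raw, $ENV{CONTENT_LENGTH} // 0);
    } else {
        $raw = $ENV{QUERY_STRING} // '';
    }
    return parse_pairs($raw);
}

my %form = read_form();
print "Content-type: text/plain\r\n\r\n";
print "$_ = $form{$_}\n" for sort keys %form;
```

The script can be tried from the command line without a web server by setting the variable yourself, for example QUERY_STRING='name=John+Foo' before running it.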

Before you start

The examples used in this tutorial are based on Unix servers running the Apache web server. To make the best use of this tutorial, please make sure you meet the following minimum requirements:

  1. You have an account on a Unix system;
  2. The Perl interpreter is installed on the Unix system (which is the case on most, if not all, modern Unix boxes);
  3. The Unix system you will be using has the Apache web server installed;
  4. You can publish web pages under the public_html directory (which is under your home directory) and you can run CGI scripts under the public_html folder.

If you are not clear about the above requirements or have difficulty meeting them, please contact your system administrator for help.

We will assume that you know the basics of using a Unix system. This means that, for example,

  1. You know how to log in to and log out of a Unix system over a telnet connection;
  2. You know what a shell prompt is. The most common prompt we use is the percent sign (%). Other symbols can be used, e.g., >, #, etc. In this tutorial, we will use the % sign to indicate the shell prompt;
  3. You know what the word ‘path’ refers to and what the command ‘pwd’ does;
  4. You know how to list, copy and/or delete files;
  5. You know the difference between a text file and a binary file;
  6. You know how to use a text editor on a Unix system. For example, vi, emacs and pico are the most popular text editors on a Unix system. If you do not know how to use a text editor, give pico a try. It is very easy to learn. In this tutorial, we will use pico as our text editor. If you know how to use vi or emacs, you’ve probably already got enough programming experience and can learn Perl programming on your own :-).

If the above does not make sense to you, you may want to take a look at a quick Unix tutorial (e.g. http://www.utexas.edu/cc/docs/ccrl20.html).

Further, we will assume that you are familiar with the basics of HTML page authoring. That is, if you look at a sample of HTML source, it makes sense to you.

Passing Drop Down Box Data to CGI Program

A drop-down box is used when many options are available but only one will be selected. Here is example HTML code for a form with one drop-down box:

<form action = "/cgi-bin/dropdown.cgi" method = "POST" target = "_blank">
<select name = "dropdown">
<option value = "Maths" selected>Maths</option>
<option value = "Physics">Physics</option>
</select>
<input type = "submit" value = "Submit">
</form>

Below is the dropdown.cgi script that handles the drop-down input sent by the web browser.

#!/usr/bin/perl

local ($buffer, @pairs, $pair, $name, $value, %FORM);
# Read in text
$ENV{'REQUEST_METHOD'} =~ tr/a-z/A-Z/;
if ($ENV{'REQUEST_METHOD'} eq "POST") {
   read(STDIN, $buffer, $ENV{'CONTENT_LENGTH'});
} else {
   $buffer = $ENV{'QUERY_STRING'};
}
# Split information into name/value pairs
@pairs = split(/&/, $buffer);
foreach $pair (@pairs) {
   ($name, $value) = split(/=/, $pair);
   $value =~ tr/+/ /;
   $value =~ s/%(..)/pack("C", hex($1))/eg;
   $FORM{$name} = $value;
}
$subject = $FORM{dropdown};

print "Content-type:text/html\r\n\r\n";
print "<html>";
print "<head>";
print "<title>Dropdown Box - Sixth CGI Program</title>";
print "</head>";
print "<body>";
print "<h2> Selected Subject is $subject</h2>";
print "</body>";
print "</html>";

1;

What Is Web Scraping?

Web scraping is the process of automatically extracting data, such as names, addresses, and phone numbers, from websites to save time while researching or collating information.

You can use scraping to retrieve:

  • Pricing information
  • Product descriptions
  • Contact details
  • Any other data on the web

While it can provide great insight into the online world, there are also legal considerations when using such techniques. Web scraping projects should be undertaken with care. While web scraping can offer a wide range of benefits, such as streamlining the research process or uncovering hidden insights about competitors, unauthorized scraping of content or personal data from websites can lead to legal action for copyright infringement and other unlawful activities. Always make sure the website allows scraping before you begin a project.

What is an API?

An application programming interface (API) enables two applications to communicate. It is a set of protocols, routines, and tools that allow developers to create software applications. An API also helps developers access existing services without understanding their underlying architectures.

APIs are used extensively in web development, providing access to application resources such as databases or hardware devices without requiring direct user interaction. They allow apps to be customized quickly and easily, making them a key component of modern technology ecosystems.

What is a scraping API?

A scraping API is a tool that enables developers to programmatically access and extract data from websites. This data can be used for various purposes, such as creating custom dashboards or populating a database with valuable information. A scraping API takes the raw HTML of a webpage and parses it into structured formats such as JSON or XML.

The significant advantage of using an API is that it allows the user to quickly download large amounts of data with minimal effort. Additionally, APIs provide access to more complex data sets than manual scraping and often integrate well with third-party services and applications.

First CGI Program

Here is a simple CGI script called hello.cgi. The file is kept in the /cgi-bin/ directory and has the following content. Before running your CGI program, make sure you have changed the mode of the file with the UNIX command chmod 755 hello.cgi.

#!/usr/bin/perl

print "Content-type:text/html\r\n\r\n";
print '<html>';
print '<head>';
print '<title>Hello World - First CGI Program</title>';
print '</head>';
print '<body>';
print '<h2>Hello World! This is my first CGI program</h2>';
print '</body>';
print '</html>';

1;

Now if you click a link to hello.cgi, the request goes to the web server, which searches for hello.cgi in the /cgi-bin directory, executes it, and sends whatever output it generates back to the web browser, which displays the following:

Hello World! This is my first CGI program

The hello.cgi script is a simple Perl script that writes its output to STDOUT, i.e., the screen. One important extra feature is the first line printed, Content-type:text/html\r\n\r\n. This line is sent back to the browser and specifies the content type to be displayed on the browser screen. By now you should understand the basic concept of CGI, and you can write more complicated CGI programs in Perl. Such a script can also interact with external systems, such as a database, web services, or other complex interfaces, to exchange information.

Frequently Asked Questions about Embedding Perl in Web Pages

What is Perl and why is it used in web development?

Perl is a high-level, general-purpose, interpreted, dynamic programming language. It was originally developed for text manipulation but now it is used for a wide range of tasks including system administration, web development, network programming, GUI development, and more. Perl is known for its flexibility and power, and it’s particularly good at processing text. It’s often used for CGI scripts, which are a common way to add interactivity to web pages.

How can I embed Perl in HTML?

Embedding Perl in HTML can be done using a variety of methods. One common method is to use the CGI.pm module, which allows you to create HTML forms and parse their contents. Another method is to use embedded Perl (ePerl), which allows you to embed Perl code directly into your HTML files. This can be useful for creating dynamic web pages.

What is the difference between Perl and other scripting languages like Python or Ruby?

Perl, Python, and Ruby are all powerful scripting languages, but they each have their own strengths and weaknesses. Perl is known for its flexibility and power, especially when it comes to text processing. Python is known for its simplicity and readability, making it a great choice for beginners. Ruby is known for its elegance and its strong support for object-oriented programming.

How can I execute Perl scripts on a web server?

To execute Perl scripts on a web server, you first need to ensure that the server is configured to handle Perl scripts. This usually involves setting the correct permissions on the script and placing it in the correct directory. Once this is done, you can execute the script by accessing its URL in a web browser.

What are the benefits of using Perl for web development?

Perl offers several benefits for web development. Its powerful text processing capabilities make it ideal for handling HTML, XML, and other markup languages. It also has strong support for regular expressions, which are often used in web development for pattern matching. Additionally, Perl’s flexibility allows it to easily integrate with other technologies used in web development, such as databases and web servers.

How can I learn Perl?

There are many resources available for learning Perl. The official Perl website offers a wealth of information, including tutorials and documentation. There are also many books and online courses available. One of the best ways to learn Perl is to start writing scripts and experimenting with the language.

What is CGI in Perl?

CGI stands for Common Gateway Interface. It’s a standard that allows web servers to interact with scripts or programs to create dynamic web pages. In Perl, the CGI.pm module provides a simple and convenient way to create CGI scripts.

How can I debug Perl scripts?

Perl provides several tools for debugging scripts. The Perl debugger is a powerful tool that allows you to step through your code, set breakpoints, and inspect variables. There are also several modules available on CPAN that can help, such as Devel::Trace for tracing execution and Devel::DProf for profiling.

What is CPAN in Perl?

CPAN stands for Comprehensive Perl Archive Network. It’s a repository of over 150,000 Perl modules written by programmers around the world. These modules provide solutions for a wide range of tasks, from web development to system administration.

How can I install Perl modules?

Perl modules can be installed using the CPAN module, which is included with Perl. To install a module, you simply need to run the command ‘install Module::Name’ from the CPAN shell. You can also use the cpanm tool, which provides a more streamlined interface for installing modules.

Basic concepts of Perl programming in the Unix environment

In the rest of this tutorial, we will use the percent sign ‘%’ to indicate the shell prompt.

2.1 Our first example

Now let us get our hands wet. Log in to your Unix account and change to the public_html directory. If the public_html directory does not exist yet, use the following command to create it:

% mkdir public_html

To change into the public_html directory from your home directory, use the following command:

% cd public_html

We now create a text file and save it as myfirst.cgi under the public_html directory with the following command:

% pico myfirst.cgi

Enter the three lines of the program into the editing window, then save the file and quit pico. At the shell prompt, make the program executable with the chmod command, and then run it:

% ./myfirst.cgi

What you will see on the screen is a single line of text.

2.2 A few concepts

What we have been doing so far is the following:

  • First, we create a text file and enter a few lines of text.
  • Next, we make the text file executable by using the command ‘chmod’.
  • Then, we run the script by issuing the command ‘./myfirst.cgi’.

In the first Perl program we have just created, the first line tells the Unix system that this is a program written in Perl. Very often, we just refer to this kind of program as a Perl script.

Technically speaking, this line specifies the location of the Perl interpreter on the Unix system, which is in the /usr/local/bin directory. Sometimes, a system administrator will install the Perl interpreter under the /usr/bin directory, in which case we modify the line accordingly. If you don’t know where the Perl interpreter is, try a command that locates programs, such as ‘whereis perl’, and use the path it reports as the path to the Perl interpreter in the program.

The second line assigns a value to a variable. Here the variable is $mystring and its value is everything contained in the double quotation marks (except the quotation marks themselves). In Perl, a scalar variable is denoted with the dollar sign ($).

The third line prints out the value of the variable on the standard output (e.g., the screen). Standard output is the default place where a program dumps its output if no other location is specified.
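A three-line script consistent with this description might look like the sketch below. Only the variable name $mystring and the interpreter path come from the text; the string "Hello, world!" is an assumption chosen for illustration.

```perl
#!/usr/local/bin/perl
$mystring = "Hello, world!\n";   # line 2: assign a string (the text itself is an assumption)
print $mystring;                 # line 3: print the value to standard output
```

If the interpreter lives elsewhere on your system, for example in /usr/bin, only the first line needs to change.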

2.3 A more interesting example

The first script we have tried is not very interesting. It just prints out a line of dummy text. Let us try a more interesting example that is INTERACTIVE.

We edit the myfirst.cgi file so that it asks for your name, reads it from the keyboard, and prints a greeting. Please pay attention to spelling and letter case when you type the program in.

Run the program the same way as before, with the command ‘./myfirst.cgi’. You will be prompted to enter your name. Once you hit the Return key on your keyboard, you will see a greeting message that includes the name you entered, e.g., ‘John Foo’.

Repeat running the program several times, entering different text each time. Pay attention to what has changed.

What you have done is enter a (text) string on the standard input (e.g., the keyboard), and some output (e.g., the greeting message) was printed on the standard output (e.g., the screen). Standard input is where the input to a program comes from. A string is simply a sequence of symbols such as letters, digits or even punctuation marks.

By now you should have noticed that in a Perl script,

  • Every statement is terminated by a semi-colon (;) (the first line is not a statement);
  • A string is quoted with double quotation marks;
  • A scalar variable is always denoted with a dollar sign ($).
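An interactive script of the kind described in this section could be sketched as follows. The prompt text, the greeting wording and the helper name greeting are assumptions made for this example.

```perl
#!/usr/local/bin/perl
use strict;
use warnings;

# Build the greeting for a given name (the wording is an assumption).
sub greeting {
    my ($name) = @_;
    return "Hello, $name!\n";
}

print "Please enter your name: ";
my $name = <STDIN>;                # read one line from standard input
$name = '' unless defined $name;   # no input at all (end of file)
chomp $name;                       # drop the trailing newline
print greeting($name);             # write the greeting to standard output
```

The chomp call matters: the line read from standard input still ends with the newline produced by the Return key, and without chomp that newline would end up in the middle of the greeting.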

2.2 Where can I find the standard documentation on a particular Perl feature?

Perl comes with a complete set of documentation and a set of programs
for converting it to various formats. To read up on a particular
feature of Perl, you usually type “perldoc perlfeature”
or “man perlfeature”. The basic set of “features” is:

  • Basics perldata, perlvar, perlsyn, perlop, perlsub
  • Execution perlrun, perldebug
  • Functions perlfunc
  • Objects perlref, perlmod, perlobj, perltie
  • Data Structures perlref, perllol, perldsc
  • Modules perlmod, perlmodlib, perlsub
  • Regexps perlre, perlfunc, perlop, perllocale
  • Moving to perl5 perltrap, perl
  • Linking w/C perlxstut, perlxs, perlcall, perlguts, perlembed
  • Various
    http://www.perl.com/CPAN/doc/FMTEYEWTK/index.html
    (not a man-page but still useful)
  • perl Perl overview
  • perldelta What is new in the latest version of Perl
  • perlfaq FAQ
  • perltoc Detailed table of contents for all the documentation
  • perldata Data types
  • perlsyn Language syntax
  • perlop Arithmetic, logical and string
    operators and their precedence
  • perlre Regular expressions (text processing and searching)
  • perlrun Command-line options
  • perlfunc Built-in functions
  • perlvar Special variables
  • perlsub How to write your own functions (subroutines)
  • perlmod How modules are organized and how they work
  • perlmodlib Modules: creating your own libraries
  • perlmodinstall Finding and installing modules and libraries from CPAN
  • perlform “Formats”, or templates for output data
  • perllocale Internationalization support
  • perlref References and pointers to data
  • perldsc Introduction to data structures
  • perllol Data structures: arrays and lists
  • perltoot Introduction to object-oriented
    programming
  • perlobj Objects in Perl
  • perltie Tying objects to ordinary variables
  • perlbot Perl OO tricks and examples
  • perlipc Interprocess communication: pipes, sockets,
    signals, etc.
  • perldebug Debugging programs
  • perldiag Error messages
  • perlsec Security issues
  • perltrap Possible pitfalls and traps
  • perlport How to write portable programs
  • perlstyle Perl programming style
  • perlpod The standard documentation format and documentation
    embedded in program source code
  • perlbook About Perl books

    (for the really hardcore)

  • perlembed Ways of embedding Perl programs in C/C++ programs
  • perlapio Perl's own API used in the Perl source code
  • perlxs XS: programming Perl libraries
    used together with C libraries
  • perlxstut XS tutorial
  • perlguts Perl internal functions for developers
  • perlcall Conventions for calling Perl functions from C
  • perlhist History and full list of all Perl versions

Using ready-made scripts

If the CGI script is already written (a standard module is being used), it must be copied to the hosting server before use. This is done with the file manager in the control panel or over FTP, for example with the FileZilla program. The location is up to the user; the program can be called regardless of the directory or subdirectory name, but it is advisable to follow certain conventions.

For instance, every script must specify the path to the interpreter of the language it is written in:

  1. Perl: /usr/bin/perl.
  2. Python: /usr/local/bin/python.

If this path is wrong, the code will not be executed. The same happens if the software accesses a MySQL database: the user has to enter the path to it, the login, and the password in the program. If the password changes, the CGI script must be updated, otherwise it will stop working. The location of the files in popular CMSs can be found out from technical support or from the product documentation.

Writing a simple CGI script

When working on Windows, you will need a suitable program for writing the code. For example, the specialized text editor Notepad++ will do (the standard Notepad is not suitable for this purpose). The code itself is built around environment variables and input/output streams. In essence, CGI scripts are handlers for individual requests rather than "full-fledged" programs.

In more detail:

  1. Standard input (stdin): the script receives data from the keyboard, a socket, a local (or remote) file, or the output of the main program.
  2. Environment variables: variables needed while the script's code runs. They are set by the user or by the server.
  3. Standard output (stdout): results are printed to the screen, saved to a file, or passed to a socket, the input stream of another program, or a printer.

The main thing is to avoid invoking the shell (SHELL), which reduces the security of the site. As an example of a very simple CGI script, here is code that prints the current date, along with the HTML that calls the program (from any part of the page, even in several places at once).

#!/bin/sh

echo Content-type: text/html
echo
echo "<h2>Today is "
date
echo "</h2>"

Using a special widget installed in the CMS, or by editing the template by hand, the following code is inserted at the desired place in the HTML:

<a href="/cgi-bin/examples/today.cgi">

The example above avoids a typical beginner's mistake: omitting the line that declares the type of the output (the Content-type: text/html line). That line is followed by an empty line, which signals that the body of the response comes next.
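For comparison, the same today.cgi behavior can be sketched in Perl without calling the shell at all. The helper name today_page is an assumption made for this example, and the date is printed in the default format of Perl's localtime rather than that of the date command.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the complete CGI response: the Content-type header,
# the blank line that ends the headers, then the HTML body.
sub today_page {
    my $now = scalar localtime;   # current date and time as a string
    return "Content-type: text/html\r\n\r\n"
         . "<h2>Today is $now</h2>\n";
}

print today_page();
```

Because the date comes from a built-in function, no external command is run and nothing is passed through a shell.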

Applications of Web Scraping

We’ve already discussed some benefits of scraping earlier, but the list wasn’t exhaustive.

Web scrapers can make it possible to plug the gaps in API response data and retrieve data otherwise not included by the API maintainer. For example, the Genius API doesn’t send lyrics as part of its API response, which this article will show you how to fix with scraping.

Information collection is a huge part of how scraping can fill these gaps. Some companies scrape their competitors’ websites to make sure they are offering good prices for their products and not being undercut. Or they may scrape multiple review pages that review their products and aggregate them into one shared document for easier viewing.

The applications of scraping are almost endless, as the insights and value a company can derive from retrieving data from (almost) anywhere on the internet is huge.

cURL

The simplest demonstration of scraping is a cURL request to a website. The server response is essentially composed of the markup of the Genius landing page and the accompanying response headers.

Alongside the HTML, we also have the HTTP headers that genius.com responded with, things like content-type, cookies, and caching.

The content type, alternatively referred to as MIME type, is the response data format, which is HTML in this example.

The HTTP cache-control header is a set of directives for how the response should be cached. In this response, the directives indicate the HTML can be cached for 180 seconds.

Cookies are short strings that contain data sent from server to client in the Set-Cookie header. The server header, like the user-agent, identifies the server. If you want to know more about the portions of metadata in this response that are not discussed in detail here, you can visit MDN.
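The headers discussed above are plain "Name: value" lines, so a scraper can pick them apart with a few lines of Perl. The sketch below uses only core Perl; the helper names parse_headers and max_age are assumptions, and the sample response is made up for illustration (it is not the actual genius.com response).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse "Name: value" header lines into a hash keyed by lowercase name.
# Repeated headers (such as Set-Cookie) are collected into a list.
sub parse_headers {
    my ($raw) = @_;
    my %headers;
    for my $line (split /\r?\n/, $raw) {
        last if $line eq '';                          # blank line ends the headers
        next unless $line =~ /^([^:\s]+):\s*(.*)$/;   # skips the status line
        push @{ $headers{ lc $1 } }, $2;
    }
    return %headers;
}

# Extract max-age (in seconds) from a Cache-Control value, if present.
sub max_age {
    my ($cache_control) = @_;
    return $cache_control =~ /\bmax-age=(\d+)/ ? $1 : undef;
}

# Made-up sample response, for illustration only.
my $sample = join "\r\n",
    'HTTP/1.1 200 OK',
    'Content-Type: text/html; charset=utf-8',
    'Cache-Control: public, max-age=180',
    'Set-Cookie: session=abc123; path=/',
    '',
    '<html>...</html>';

my %h = parse_headers($sample);
print "type:    $h{'content-type'}[0]\n";
print "max-age: ", max_age($h{'cache-control'}[0]), "\n";
print "cookies: ", scalar @{ $h{'set-cookie'} }, "\n";
```

Header names are lowercased before being stored because HTTP header names are case-insensitive, so Content-Type and content-type end up under the same key.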

Python vs. Perl for Web Scraping

Python offers an intuitive and concise syntax, making it easier to learn. But Perl has efficient code, which makes it faster in terms of runtime performance.

Python is often used for small-scale projects due to its user-friendly syntax, while Perl is better suited for large-scale tasks as it can scale quickly and efficiently. Both languages offer modules for interfacing with HTML and XML structures, allowing developers to easily manipulate web page elements.

However, Perl’s CPAN library may provide more flexibility when dealing with complex scraping tasks than Python’s standard libraries. Ultimately, both languages have their advantages when used for web scraping, depending on the complexity of the task.
