Documentation


Introduction

pymeta.org/memoria is closely tied to the site Fundação Biblioteca Nacional, the Brazilian Digital Library (BDL). The BDL enables searches for words across a wide array of periodicals and time periods. Search results are displayed as page images with the search term highlighted in green.

pymeta.org/memoria automates a BDL search (via: Capture Search) as if an invisible web user:

  1. went to the BDL site
  2. clicked on the third Locality tab
  3. entered an area
  4. defaulted to include all periodicals and time periods for that area
  5. pressed Search
  6. waited for a display of area documents
  7. entered the search term and pressed Search at the top of the documents page

When BDL has finished the search, further scraping activities are performed to capture doc and edition information. Database records are created to persist the results. The entire process (search entry, information scraping, and record creation) takes place in a background task. When the process finishes, the user may work with the captured results, which include URLs to the match pages.

The records (via: Make Selections) permit filtering by year, by document, and by other more detailed criteria. It is possible via Set Display to customize the display of the tabular information shown.

Four types of records are created in a search capture:

  • search: search criteria
  • doc: information about the periodicals containing matches
  • edn: information about the periodicals' editions containing matches
  • page: information about the match pages themselves

It is the page records that contain URLs to the match pages.

The records created in a "search capture" form a hierarchy. A single search record is at the top, containing the user's criteria for a search: area, search term, and optionally a year-range. It also contains a count of the total number of page images found that contain at least one match for the search term. The next level in the hierarchy consists of doc records, each associated with the "parent" search record and containing information about a periodical that has matched page images. Below docs are year-edition records, each associated with a parent doc record. At the lowest level are the page records, each associated with a parent edition record.

Although it is the list of page records with their URLs to the match page images that are likely to be of most interest, the doc and edition records can also be useful, either for display or filtering purposes. For example, the user may restrict the list of doc records to one or several of interest, and this has a cascading effect, forming a new hierarchy with sublists of edition and page records determined by the shortened list of doc records.
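The cascading effect can be pictured with a small sketch. This is not the site's implementation: the class names, the cascade function, and the sample values are hypothetical, and only the field names (albl, sterm, dlbl, eyr, elbl, mnum, purl) are taken from the column descriptions in this document.

```python
from dataclasses import dataclass, field

@dataclass
class Page:
    mnum: int
    purl: str = ""          # blank when the URL is not yet known

@dataclass
class Edition:
    eyr: int
    elbl: str
    pages: list = field(default_factory=list)

@dataclass
class Doc:
    dlbl: str
    editions: list = field(default_factory=list)

@dataclass
class Search:
    albl: str
    sterm: str
    docs: list = field(default_factory=list)

def cascade(search, keep_dlbls):
    """Restrict docs to keep_dlbls and derive the edition and page sublists."""
    docs = [d for d in search.docs if d.dlbl in keep_dlbls]
    editions = [e for d in docs for e in d.editions]
    pages = [p for e in editions for p in e.pages]
    return docs, editions, pages

# Made-up sample data, for illustration only.
search = Search("AL", "mameluco", docs=[
    Doc("111111", [Edition(1897, "00022", [Page(1)])]),
    Doc("222222", [Edition(1927, "00002", [Page(1), Page(2)]),
                   Edition(1930, "00010", [Page(3)])]),
])

docs, editions, pages = cascade(search, {"222222"})
print(len(docs), len(editions), len(pages))  # 1 2 3
```

Restricting the doc list to a single doc shortens the edition and page lists without any filter being defined at those levels.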

Table of Contents

Signin/Login

In order to use the pymeta.org functionality, a user is required to sign in and enter an email and password. This allows database records created for a BDL search to be associated with the user who created them, and (s)he can then create or delete them without affecting other users.

In the event that the pymeta.org password is forgotten, there is a facility for resetting the password upon submission of the email address supplied during signin.

Capture Search

This presents a simple form. Some entries are self-explanatory:

  • a dropdown element to choose the area
  • an input box for the search term or short double-quoted search phrase
  • an input box for the optional entry of a year-range
  • A button to start the background search

The first dropbox allows the choice of "chrome" or "firefox." To understand it, a brief explanation of the mechanics of the background search is required.

pymeta.org uses a tool called Selenium that can simulate a user interacting with a web site. Selenium in turn uses a web driver which for our purposes can be either of type "chrome" or "firefox." These are specialized browsers that run "headlessly" in our application; that is, they do not display the Biblioteca web page involved in the interactions. Usually there is little difference between the two, but "chrome" is somewhat faster and "firefox" somewhat more reliable. The web drivers are programs on the server and independent from the type of "real" browser being used to access pymeta.org. Selenium together with the web driver are said to be "scraping" information from the BDL site.

After the Search button is pressed a task with the search particulars is queued for background execution and the pymeta.org page changes to one labelled "Search Capture Process." There are 2 queues accepting tasks. If other users are submitting tasks your task may wait in a queue until preceding tasks finish. This may entail a noticeable wait; search captures in the area "RJ" for example can require close to 15 minutes (!) to complete once they begin executing. The purpose of the queues is to offload time-consuming work from the direct purview of the pymeta.org web server, which otherwise might time-out before a search capture finishes.

A "feedback" page appears immediately after the Start Search Capture Process button is pressed. A message at the top of the page indicates that the task has been queued. This message is static, but when the process gets its turn to execute, process activity is reported as it occurs in the box beneath Feedback (from running process). Here is sample output produced for a successful search in area "AL" for term "mameluco":

    2020-06-11 20:03:16 - capture_search: START
    2020-06-11 20:03:16 - (Using webdriver: chrome)
    2020-06-11 20:03:16 - Enter search criteria: AL, mameluco...
    2020-06-11 20:03:32 - OK: search criteria entered
    2020-06-11 20:03:32 - Start Biblioteca search...
    2020-06-11 20:03:36 - OK: Biblioteca search completed
    2020-06-11 20:03:36 - Collect doc info...
    2020-06-11 20:03:36 - OK
    2020-06-11 20:03:36 - Collect edition info...
    2020-06-11 20:03:39 - OK
    2020-06-11 20:03:39 - Create database search, doc, edition and page records...
    2020-06-11 20:03:40 - OK
    2020-06-11 20:03:40 - capture_search SUCCESS: with AL-mameluco

TIP: AL-mameluco is a useful trial search because it only results in a few records and finishes quickly. (The search can be deleted easily from the Selections page.)

Unfortunately, successful completion of a search capture task is not guaranteed. In case of a failure, the feedback messages will end before the line indicating "success." If there is a failure, use the back button to return to the entry form and re-try the capture, perhaps with the other type of webdriver. The most likely failure point is during search entry. The most time-consuming part of the capture is when edition information is being collected, although if it reaches that point it is likely to finish successfully.
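If the feedback text is copied out of the page, the success check described above can be automated. The sketch below assumes only the line format shown in the sample output (a timestamp, " - ", then a message); the function name summarize_feedback is hypothetical and is not part of pymeta.org.

```python
from datetime import datetime

def summarize_feedback(lines):
    """Return (succeeded, elapsed_seconds) for a list of feedback lines."""
    stamps = []
    succeeded = False
    for line in lines:
        stamp, sep, message = line.strip().partition(" - ")
        if not sep:
            continue  # skip lines without the "timestamp - message" shape
        stamps.append(datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"))
        if message.startswith("capture_search SUCCESS"):
            succeeded = True
    elapsed = (stamps[-1] - stamps[0]).total_seconds() if stamps else 0.0
    return succeeded, elapsed

sample = [
    "2020-06-11 20:03:16 - capture_search: START",
    "2020-06-11 20:03:32 - OK: search criteria entered",
    "2020-06-11 20:03:40 - capture_search SUCCESS: with AL-mameluco",
]
print(summarize_feedback(sample))  # (True, 24.0)
```

A run whose messages stop before the SUCCESS line yields False in the first position, which is the signal to retry the capture.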

It is an aggravating reality of these scraping tasks that failures will occur and could even convince a first-time user of the site that fundamental capabilities are broken. The reasons for the failures are hard to pin down, and seem to be caused by activity on the BDL site that interferes with the operation of the web driver. Every effort has been made to make the tasks as reliable as possible, but if you retry tasks and still get failures, the best bet is to resume some time later. You will have the best outcomes if you run the capture search at a time when the BDL site is less busy.

For slow jobs when the feedback hasn't started or seems to have stopped, there is a Check State... button on the page that indicates whether the capture process is still queued or still active in the background. Note, however, that the button no longer produces useful output once approximately 10 minutes have passed since the task terminated.

Make Selections

page organization

NOTE: If this page sits idle in a browser tab for some time, it can go into an unresponsive state. Also, you may get a surprising message saying that the CSRF token has expired. In these cases, either refresh the page or click on the Memoria navigation tab and return with a click on the Make Selections tab. (The CSRF token is an internal mechanism to guard against "cross site request forgery" and won't be discussed here; a transient message about expiration is not a cause for concern.)

The Make Selections page has an organization that reflects the 4 types of records (search, doc, edition, page) that are created by a search capture. Actions that may be performed on the record types are contained within four grey bands. The actions are filtering and viewing. At the top of the page are buttons that toggle the visibility of the middle doc and edition bands. (Hiding the doc and edition bands might be a useful display simplification in cases where only the search and page records are of interest.)

The upper greyed-band with 2 columns involves search selections. You may:

  • (column 1) Create a New search to add to the dropdown box or select a search you've previously captured, either to work with it (press Refresh) or to Delete it.
  • (column 2) View the search record

Having chosen a search capture, you can move on to the next row with doc selections. Similar capabilities are available in the two columns.

  • (column 1) Accept all in the Doc Selections dropdown, or choose a more restrictive filter that was previously created with the Add button. (An explanation of the Add button functionality for Doc, Edition and Page selections follows later.) Refresh or delete the selection (all can't be deleted).
  • (column 2) Get a count of the doc records associated with the search, as filtered by the column 1 filter, or view them. Alternatively, request a download of the records in CSV (comma-separated-values) format, suitable for display in a spreadsheet. An input field is presented in the doc, edition and page bands that optionally permits entering a range of records to limit record output in the view or download.

And likewise going down, there are corresponding capabilities for edition and page records.

Remember that the 4 types of records form a strict hierarchy. If a filter is defined at the doc level, that directly impacts the child edition and page records. The choice all is supplied as the description for "no filter" at a level so when a search capture is first made, all will be the only dropdown choice at the doc, edition and page levels. As noted, the all "filter" can't be removed. Also, realize that "all" can refer to different lists depending on whether a filter is in effect at a level above. For example, if a filter has been applied to limit doc records to those with a publication range overlapping 1900-1920 (say), the edition selection "all" now means all edition records that are found in docs so filtered. Adding a new filter at a level automatically creates all filters in the levels below.

Record views are restricted to at most 2000 at a time, although csv-downloads are not limited by count. In order to view a large number of records at some level, the record range can be used to specify counts of 1-2000, 2001-4000, etc. as necessary. Note that record range does not act as a filter; it simply results in a table display with the requested range of records for the type of records in its band.
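The record ranges needed to step through a large result can be computed ahead of time. This is a minimal sketch: the helper name record_ranges is hypothetical, and the only facts taken from the text are the 2000-record view limit and the hyphenated, inclusive range format.

```python
def record_ranges(total, limit=2000):
    """Return inclusive (first, last) record ranges that cover 1..total."""
    return [(start, min(start + limit - 1, total))
            for start in range(1, total + 1, limit)]

for first, last in record_ranges(4500):
    print(f"{first}-{last}")
# 1-2000
# 2001-4000
# 4001-4500
```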

Adding filters with the Add button

For docs, editions, and pages a filter may be created that restricts the selection of doc, edition and page records. When a new search capture is created, default filters of all are established at the three levels below search. As the name suggests, all means "no restriction." "All" doc records means: a record for every periodical containing match pages for the entered area, search term, and year range. "All" edition records means: a record for every year-edition containing match pages that is associated with a selected doc record. And likewise "all" page records should be interpreted to mean every page record identifying a match page, with a parent in a selected edition record. The format of possible filters that may be added at each level is described briefly on the form produced when the Add button is pressed. Behind the scenes, the short filter representations are translated into the query language understood by the PostgreSQL database in use.

When a filter expresses a range it is inclusive. Single records can be expressed without a hyphenated range, e.g., Y1925. For Docs and Pages, compose a filter in one of the formats. For Editions, compose a filter using a format in one or both categories.

  • Doc Selection Filters:
    • Laaaaaa: only records for the single doc with dlbl aaaaaa
    • Yyyyy-yyyy: only doc records for docs whose years of publication overlap the year range yyyy-yyyy

  • Edition Selection Filters (these filters can be composed from either or both categories):
    • category 1
      • Yyyyy-yyyy: only edition records for editions with issue year in the year range yyyy-yyyy. Note that this selector is much more precise than the year range selector for docs.
    • category 2
      • F0 or F1: only edition records that are "unfilled" or "filled", respectively. Unfilled means that the URL to one or more match pages in the edition is presently unknown (empcnt > urlcnt: the edition's match page count is greater than its match URL count). Filled means that every match page URL is known (empcnt = urlcnt for the edition record). The items empcnt and urlcnt are display column names; these will be discussed later. 0 represents false and 1 represents true.
      • M0 or M1: only edition records that have one match page or more than one match page, respectively. "M" is for "multiple" and 0 stands for false, 1 for true.

        If a "double" filter is built with a selector from each category, the category 1 part comes first and the two parts are concatenated with no spaces, e.g., Y1900-1920F0.

  • Page Selection Filter (only one filter is available for page records):
    • U: only page records for which the URL to the match page is known (i.e., purl/plink have values)
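The filter formats above are regular enough to be checked mechanically. The following sketch parses doc and edition filter strings into plain dictionaries and tests the doc year-overlap rule. It does not reproduce the site's own parser or its PostgreSQL translation, neither of which is shown in this document; the function names, the dictionary keys, and the assumption that years have four digits and doc labels are alphanumeric are all illustrative.

```python
import re

YEAR = r"Y(?P<lo>\d{4})(?:-(?P<hi>\d{4}))?"
DOC_RE = re.compile(rf"^(?:L(?P<dlbl>\w+)|{YEAR})$")
EDN_RE = re.compile(rf"^(?:{YEAR})?(?P<flag>[FM][01])?$")

def _years(match):
    """Return an inclusive (lo, hi) pair; a single year becomes (y, y)."""
    if match.group("lo") is None:
        return None
    lo = int(match.group("lo"))
    hi = int(match.group("hi") or lo)
    return (lo, hi)

def parse_doc_filter(text):
    if text == "all":
        return {}                      # "all" means no restriction
    match = DOC_RE.match(text)
    if not match:
        raise ValueError(f"not a doc filter: {text!r}")
    if match.group("dlbl"):
        return {"dlbl": match.group("dlbl")}
    return {"years": _years(match)}

def parse_edition_filter(text):
    if text == "all":
        return {}
    match = EDN_RE.match(text)
    if not text or not match:
        raise ValueError(f"not an edition filter: {text!r}")
    parsed = {}
    if _years(match):
        parsed["years"] = _years(match)        # category 1
    if match.group("flag"):
        kind, bit = match.group("flag")        # category 2
        parsed["filled" if kind == "F" else "multiple"] = (bit == "1")
    return parsed

def doc_overlaps(dyrbeg, dyrend, years):
    """True when a doc's publication years overlap the inclusive range."""
    lo, hi = years
    return dyrbeg <= hi and dyrend >= lo

print(parse_doc_filter("Y1925"))             # {'years': (1925, 1925)}
print(parse_edition_filter("Y1900-1920F0"))  # {'years': (1900, 1920), 'filled': False}
print(doc_overlaps(1890, 1905, (1900, 1920)))  # True
```

Note the difference in meaning between the two year selectors: for docs the range is tested for overlap with the publication span, while for editions the issue year itself must fall inside the range.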

Explanation: the "M" and "F" edition filters

When a search capture completes it has an accurate count of the number of match pages that exist. However, not all urls to those match pages may be defined. In particular, whenever an edition-year (represented by an edition record) contains more than one match page, only the url to the first match page is captured initially. A page record is created for every match page for a year-edition, so page records representing second and subsequent match pages in the year-edition contain fields purl and plink that are blank.

The capture search process gathers as much information, including URLs to the match pages, as quickly as it can. However the BDL site is "stingy" with respect to sharing URLs after the first in a year-edition, so the initial capture process only gathers what can be gathered quickly. Capturing missing URLs involves waiting for the "spinner" in each case and is time-consuming. See Capture Missing Page URLs for a way to capture the missing URLs and persist them in the page records.

The search capture "AL-mameluco" can serve as an example. When the search first completes, the selections page shows counts of: 4 doc records, 8 edition records and 11 page records. When the search record is viewed, the field smpcnt ("search-match-page-count") shows 11, the total number of match pages found. When the 11 page records are viewed, the plink and purl fields are empty for 3 records. Looking in the empcnt ("edition-match-page-count") column, every time the value is 2 there will be 2 page records, each with the same eyr and elbl values. For example, there are two page records for the year-edition 1927-00002 but only the URL in the first record is defined.
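The arithmetic behind these counts can be checked with a short sketch. The per-edition breakdown used here (three editions with empcnt 2, five with empcnt 1) is an assumption chosen to be consistent with the totals quoted above; the document does not list the individual edition records, and the helper names are hypothetical.

```python
def is_filled(empcnt, urlcnt):
    """An edition is "filled" when every match page URL is known."""
    return empcnt == urlcnt

def missing_urls(editions):
    """Total number of match page URLs still unknown."""
    return sum(empcnt - urlcnt for empcnt, urlcnt in editions)

# (empcnt, urlcnt) pairs right after a capture: urlcnt is 1 for every edition.
# Assumed breakdown, consistent with 8 editions and 11 match pages.
editions = [(2, 1)] * 3 + [(1, 1)] * 5

total_pages = sum(empcnt for empcnt, _ in editions)
unfilled = [e for e in editions if not is_filled(*e)]

print(len(editions), total_pages)            # 8 11
print(len(unfilled), missing_urls(editions))  # 3 3
```

Under this breakdown, an F0 filter would select the three unfilled editions, and the same three would be selected by M1.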

Viewing Selected Records

(Note: if the display unexpectedly shows an empty table even though the record count is not 0, simply refresh the page.) There are several things to note about the display of records. To change the number of rows per page for a single viewing session, make a selection in the applicable dropbox. Click on the Persist button afterwards to maintain this choice for all viewing sessions. The Page control moves the view forward or back through the record displays. To go to a particular page of the displays, enter the page number in the input box between the "back" and "forward" controls and then press enter. In cases where there are thousands of records in the search capture, keep in mind that the maximum number of rows that can be displayed at one time is 2000. The record fields that may be displayed can be customized; see Set Display. If the mouse pointer is allowed to rest on a column head for a moment, a tool-tip is shown that translates the name into something more meaningful.

Downloading Selected Records as a CSV file

As with viewing selected records the procedure for downloading selected records as a CSV (comma-separated-values) file is straightforward. Note that there is no limit as to the number of records in the file requested for download. After the Download button is pressed a page appears that permits changing the name of the file to be downloaded. The default name is composed by joining the operative selection filters with underscores: e.g. "AL-mameluco_all_all_all."
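The default name can be reproduced as follows. This is only an illustration of the joining rule stated above; the function name is hypothetical, and whether the site appends a file extension is not stated here, so none is added.

```python
def default_csv_name(search, doc="all", edition="all", page="all"):
    """Join the operative selection filters with underscores."""
    return "_".join([search, doc, edition, page])

print(default_csv_name("AL-mameluco"))  # AL-mameluco_all_all_all
print(default_csv_name("AL-mameluco", edition="Y1900-1920F0"))
# AL-mameluco_all_Y1900-1920F0_all
```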

Set Display

set row ordering

This section allows you to change the ordering of rows in various ways. When the Default radio button is checked, rows are returned in the way that the BDL orders them: documents with the most match pages come first, and ties are broken by sorting rows alphabetically on the doc label.

For the next paragraph, recall the column interpretations:

  • eyr: edition year
  • dlbl: doc label, short identifier for the doc
  • ddesc: doc description, long name

Clicking on the By year/doc radio button opens a view with two dropdown boxes labelled "Primary" and "Secondary." The primary selections are "eyr", "dlbl" and "ddesc", in pairs that are capitalized or not. Uncapitalized means "sort in ascending order" while capitalized means "sort in descending order" (mnemonic: "biggest first"). Once the primary row ordering is set, the secondary row ordering comes into play and serves as a tie-breaker for dealing with groups of rows that all happen to end up with the same primary value.

    For example, if the sort order is: primary="eyr" and secondary="ddesc", rows are sorted so that those with the earliest edition year values come first and when two rows happen to have the same eyr value, the row with the earlier (alphabetic) ddesc comes before the other. Using the same kind of nomenclature, the Default ordering can be described as "Dmpcnt-dlbl", that is, the primary sort is by dmpcnt values (descending) and the secondary is by dlbl values (ascending).
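The primary/secondary naming convention can be illustrated with a short sketch. It is not the site's code (the ordering is presumably done by the database); the function name sort_rows and the sample rows are made up, and only the convention "capitalized means descending" comes from the text above.

```python
def sort_rows(rows, primary, secondary):
    """Sort dict rows; a capitalized key name means descending order."""
    ordered = list(rows)
    # Python's sort is stable, so sorting by the tie-breaker first and the
    # primary key second yields the combined primary/secondary ordering.
    for key in (secondary, primary):
        ordered.sort(key=lambda row, k=key.lower(): row[k],
                     reverse=key[0].isupper())
    return ordered

# Made-up rows for illustration.
rows = [
    {"eyr": 1927, "dlbl": "b", "ddesc": "Beta", "dmpcnt": 2},
    {"eyr": 1897, "dlbl": "a", "ddesc": "Alpha", "dmpcnt": 5},
    {"eyr": 1927, "dlbl": "a", "ddesc": "Alpha", "dmpcnt": 5},
]

print([(r["eyr"], r["ddesc"]) for r in sort_rows(rows, "eyr", "ddesc")])
# [(1897, 'Alpha'), (1927, 'Alpha'), (1927, 'Beta')]
print([(r["eyr"], r["ddesc"]) for r in sort_rows(rows, "Eyr", "ddesc")])
# [(1927, 'Alpha'), (1927, 'Beta'), (1897, 'Alpha')]
```

In the second call the capitalized "Eyr" puts the latest edition year first, while the uncapitalized secondary key still orders ties alphabetically.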

    select display columns

    In each of the 4 parts left-to-right are multiple-choice dropboxes for selecting desired columns for the view or download of Search, Doc, Edition or Page selections, respectively. Hold down the Control key (Mac users: the Cmd key) when clicking on a dropbox option in order to toggle that single choice on or off. In this one-at-a-time manner, highlight the set of options for columns you want in the display. After your changes be sure to press the Persist button to cause them to be remembered. Clicking on the Set Defaults button makes a reasonable selection for you. Default choices are marked with an asterisk in the dropboxes.

    In this section the fields are listed for each record type and brief explanations are provided. The doc, edition and page displays are not restricted to fields of their own record type.

    • search records
      • sid: search record id
      • albl: area label
      • sterm: search term
      • created: date and time when the record was created
      • smpcnt: search match page count
    • doc records
      • recno: record number. This column may be chosen for doc, edition or page results. It works the same way in each case: records are numbered from 1 to the value of total count, and if a record range is specified, the record numbers reflect it.
      • did: doc record id
      • dlbl: doc label
      • ddesc: doc description
      • dyrbeg: year when publication began
      • dyrend: year when publication ended
      • dmpcnt: doc match page count
    • edition records
      • recno: record number
      • eid: edition record id
      • elbl: edition label
      • eyr: edition year
      • empcnt: edition match page count
      • urlcnt: edition url count. This value is always 1 when a search capture completes. If the value of empcnt is greater than 1 then the edition record is "unfilled" and the missing edition page URLs must be captured by other means.
    • page records
      • recno: record number
      • pid: page record id
      • libpid: BDL page id. Knowing the libpid of a match page makes it easy to create a URL to the page, but BDL does not provide the value in search results. It must be scraped out of the result page source.
      • mnum: match number. The BDL numbers the match pages within each document that has matches. If the doc with label dlbl has a doc match page count of 25 (say), then the 25 page records will have match number values running from 1 to 25.
      • purl: URL to the match page showing the green-highlighted match(es).
      • plink: not actually a page record field; rather it is a constructed field, an HTML link that uses the purl value and that has text consisting of the hyphenated dlbl and mnum values. Note that it is very convenient in record views, but HTML links in a CSV download can be problematic, so purl is probably a better choice there.
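As an illustration of how a plink value relates to purl, dlbl and mnum, here is a minimal sketch. The function name make_plink is hypothetical, the exact markup the site generates is not shown in this document, and the mnum value 1 in the example is made up.

```python
from html import escape

def make_plink(dlbl, mnum, purl):
    """Build an HTML link whose text is the hyphenated dlbl and mnum values."""
    text = f"{dlbl}-{mnum}"
    # escape() protects characters such as & that may occur in the URL.
    return f'<a href="{escape(purl)}">{escape(text)}</a>'

link = make_plink("260959", 1,
                  "http://memoria.bn.br/docreader/260959/5294?pesq=mameluco")
print(link)
# <a href="http://memoria.bn.br/docreader/260959/5294?pesq=mameluco">260959-1</a>
```

The angle brackets and quotes in such a value are what make it awkward in a spreadsheet, which is why purl is the safer column for CSV downloads.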

    Capture Missing Page URLs

    When a Page view is requested, a link to this page appears above the tabular display. Clicking on it brings up a very simple form allowing you to choose either the chrome or firefox webdriver. Pressing a start button initiates a "scraping" process that locates missing match page URLs and updates the page records with the values. Finding the missing URLs is time-consuming (think: the "wait spinner" appears before each URL can be captured... something you can't see because the scraping is "headless"). For that reason a maximum of 100 edition records containing missing URLs is processed. It's a good idea to set a record range in the view to create a selection that limits the number of editions to be processed to 10 or 20 at a time.

    If the process finishes and there are still missing URLs, you can use the back button to return to the simple form and restart the scraping process. This time a maximum of 100 edition records will again be processed, including any records for which there was a failure on the previous round. (Repeat as necessary...)
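The repeat-until-done procedure can be modelled as a simple loop. This sketch only simulates the rounds; it does not drive the site. The names run_rounds and capture, and the max_rounds safety cap, are hypothetical, while the limit of 100 editions per round comes from the text above.

```python
def run_rounds(unfilled, capture, limit=100, max_rounds=10):
    """Process unfilled edition ids in rounds of at most `limit` each.

    `capture` is a callable returning True when an edition's missing URLs
    were captured. Failures are retried at the start of the next round.
    """
    pending = list(unfilled)
    rounds = 0
    while pending and rounds < max_rounds:
        batch, rest = pending[:limit], pending[limit:]
        failed = [eid for eid in batch if not capture(eid)]
        pending = failed + rest
        rounds += 1
    return rounds, pending

# Simulated capture: edition 5 fails on its first attempt only.
attempts = {}
def flaky_capture(eid):
    attempts[eid] = attempts.get(eid, 0) + 1
    return not (eid == 5 and attempts[eid] == 1)

rounds, pending = run_rounds(range(250), flaky_capture)
print(rounds, pending, attempts[5])  # 3 [] 2
```

With 250 unfilled editions and one transient failure, three rounds are enough, and the failed edition is simply picked up again in the second round.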

    The URLs to match pages take two forms: long and short. When a search capture completes, a page URL (purl) is in the long form. Some characters that are problematic in URLs are "url encoded": space is given as %20 and "\" as %5C. So the long form:

        "http://memoria.bn.br/DocReader/DocReader.aspx?bib=260959&pasta=ano%201897%5Cedicao%2000022&pesq=mameluco"
    
    represents:
    
        "http://memoria.bn.br/DocReader/DocReader.aspx?bib=260959&pasta=ano 1897\edicao 00022&pesq=mameluco".
    
    In this format, "bib" names the doc label (dlbl), "pasta" gives the year and edition label (eyr, elbl), and "pesq" gives the search term (sterm).

    By contrast, the short form leading to the same match page looks like:
        http://memoria.bn.br/docreader/260959/5294?pesq=mameluco
    
    in which "260959" is the doc label, "5294" is the library's page id (libpid) and "pesq" gives the search term.

    The short form of the URL is preferable but the BDL requires some programming effort to retrieve it. Each page image has the Biblioteca libpid value embedded in the source HTML, and that is how the "capture missing urls" process finds and extracts it. Note that for every edition that has multiple match pages (i.e., empcnt is greater than 1), when the capture process runs it converts the known first match page URL from long form into short form and returns the initially missing subsequent URLs in short form also.
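The two URL forms can be taken apart and rebuilt with Python's standard library. This is a sketch rather than the site's code: the function names are hypothetical, and the libpid value must already be known (it cannot be derived from the long form alone).

```python
from urllib.parse import urlsplit, parse_qs, quote

def parse_long_url(purl):
    """Extract dlbl, eyr, elbl and sterm from a long-form match page URL."""
    query = parse_qs(urlsplit(purl).query)   # parse_qs also url-decodes
    year_part, edition_part = query["pasta"][0].split("\\")
    return {
        "dlbl": query["bib"][0],
        "eyr": year_part.split()[-1],        # "ano 1897" -> "1897"
        "elbl": edition_part.split()[-1],    # "edicao 00022" -> "00022"
        "sterm": query["pesq"][0],
    }

def short_url(dlbl, libpid, sterm):
    """Compose the short form from the doc label, BDL page id and term."""
    return f"http://memoria.bn.br/docreader/{dlbl}/{libpid}?pesq={quote(sterm)}"

long_form = ("http://memoria.bn.br/DocReader/DocReader.aspx"
             "?bib=260959&pasta=ano%201897%5Cedicao%2000022&pesq=mameluco")
parts = parse_long_url(long_form)
print(parts)
# {'dlbl': '260959', 'eyr': '1897', 'elbl': '00022', 'sterm': 'mameluco'}
print(short_url(parts["dlbl"], 5294, parts["sterm"]))
# http://memoria.bn.br/docreader/260959/5294?pesq=mameluco
```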
