ProgrammableWeb - Node.js

How to Turn Existing Web Pages Into RESTful APIs With Import.io

APIs are all the rage these days, and with good reason. Gone are the days when people interacted with data on proprietary backends solely through desktop applications, whether web-based or native. These days, more often than not, the primary computing device is a cell phone or tablet. The apps on those mobile devices might be browser-based or native, and many consume data from more than one source. Data comes from everywhere. How do applications access this landscape of distributed data sources? Typically via an API.

However, there is still a lot of data out there on web pages that's not accessible via API (i.e., the website operator doesn't offer one), including data that's of interest to developers, business analysts, report writers, and other parties. And even when the operator of a site does offer an API, accessing it requires a degree of programming know-how that most business folks don't have. Yet they still need that data. What's to be done?

This is where web page extraction technology comes into play.

The Case for Using Web Page Adapters to Create Structured Data

Web pages have been publishing structured data since the time when JavaServer Pages (JSP), Active Server Pages (ASP), and Adobe's ColdFusion showed up on the scene. Using web pages to format and display data retrieved from a database was an inevitability, particularly when it came to publishing merchandise catalogs online.

However, as the web has matured, data-driven pages have become meaningful well beyond their consumption by human eyes. There is analytic inspection going on behind the scenes — think web crawlers and Search Engine Optimization (SEO) engines. People want that data for more than viewing purposes. Web page data is useful in terms of determining other things (e.g., context) provided you can get at it.

For example, imagine that you want to know the dentists in the Los Angeles area within a certain zip code, and you want to know the number of reviews for each of the dentists in the list. Typically you would go to a rating web page such as Yelp, do a search on dentists according to city, and view the results. (Please see Figure 1.)

Figure 1: Many web pages have data that can be useful when applied to an abstract structure.

Using a Yelp page visually is fine if all you want to do is view the results. But what if you want to load that data into a spreadsheet, such as the one shown next in Figure 2?

Figure 2: Once data is extracted into an abstract structure, it can be applied to a spreadsheet.

Or what if you want to create a comparative graph of that data (see Figure 3)? To meet your graphing need without the help of tooling, you're in for a lot of cutting and pasting at best or, more likely, a lot of manual keyboard entry.

Figure 3: Spreadsheet data extracted from a web page can be easily converted to a chart.

And, finally, what if you want to make that data available as JavaScript Object Notation (JSON) via API as shown in Figure 4? Well, as we say in the trade, fuggetaboutit.

Figure 4: A good extraction technology can convert data in a web page to JSON.

Getting structured data out of web pages — often referred to as "web scraping" — is a real need, particularly for people whose job it is to prepare and analyze the information that's presently available in web pages. Meeting this need is right up the alley of a data extraction tool such as Import.io.


Import.io allows you to ingest data in a web page and convert it into structured data that you can use in a spreadsheet or express in JSON. In fact, the scenario described above, in which data was extracted from a Yelp page of dentists and converted to a Google spreadsheet and an array of JSON objects, was done with Import.io. (The graph was made using Google Sheets' chart-insertion feature.)

Import.io has a lot to offer when it comes to data extraction. In this tutorial, you'll learn:

  • How Import.io can be "trained" to determine various fields of data on a web page.
  • How to organize fields you extract into a consistent data structure.
  • How to make these data structures available for display in Google Sheets and within a structured JSON object, which can in turn be made available as an API.
  • How, in an advanced section at the end of this piece, to aggregate data from a variety of similarly structured websites into a single, normalized data structure using the Import.io API, and then consume that API with a custom application coded especially for this article.

To get the full benefit from reading the piece, I'm assuming that you know how to work with structured data and tables. I'm also assuming you have a working knowledge of Google Sheets and that you understand JSON.

With regard to the last section that discusses using the Import.io API and the article's custom application, you should also have an understanding of how to read server-side JavaScript (aka Node.js), understand how APIs work in general, and have some understanding of how an API is implemented in Node.js.

Let's start by taking a look at Import.io.

Understanding Import.io

As mentioned previously, Import.io is a service that allows you to convert the information found on web pages into structured data. The service works on a try-before-you-buy basis. To get started, you'll create an account and then access it by way of the Import.io Dashboard. (Figure 5, next, shows the Import.io home page. Notice the Dashboard button in the upper right of the web page.)

Figure 5: Import.io is a platform that allows data to be extracted from a web page in a structured manner.

The Dashboard is the place where you'll do the work of creating, configuring, and editing extractors. An extractor is the technology that allows you to scrape the information off a web page and into a data structure that meets your need.

Understanding Extractors

As just described, an extractor is an Import.io technology that has the ability to inspect a web page and determine the underlying data structures within the page. In most cases, an Import.io extractor can infer a web page's data structures automatically. (See Figure 6.)

Figure 6: An Import.io extractor can determine the data structures within a web page.

However, there are times you'll want to train and configure an extractor to structure data in a way that's special to a given need. Customizing an Import.io extractor is a pretty straightforward process that we'll cover soon. But first, let's start using Import.io to create a simple extractor. Then, once the extractor is created, you'll customize the columns to better describe the data at hand. Then, you'll use the Export to Google Sheets feature to display the extracted data.

Creating an Extractor

You're going to create an extractor that describes a list of jobs from a jobs site. You're going to create the extractor against the following URL (which lists a bunch of jobs): http://www.jobserve.com/us/en/JobListing.aspx?shid=414ABF05F8664A7B5A

Figure 7 shows the web page associated with the URL. Just by looking at the user interface and scanning from one job listing to the next, you can assume that a fairly standard data structure lives beneath the surface:

Figure 7: A web page well suited for Import.io extraction.

To create an extractor, go to the Import.io dashboard and click the New Extractor button, located on the upper-left side as shown next in Figure 8:

Figure 8: You create a new extractor by clicking the New Extractor button in the Import.io Dashboard.

Clicking the button presents a dialog into which you enter the URL of the web page from which you want to extract data. Enter the URL and then click the Go button on the lower right of the dialog. (Please see Figure 9, next.)

Figure 9: Enter the URL and click the button labeled Go.

After you click Go, Import.io goes out to the URL and analyzes the web page to determine the data that's on it. Import.io will also infer the structure of that data. While all this is going on, you're presented with the message shown next in Figure 10.

Figure 10: Import.io technology does a lot of work behind the scenes figuring out the data in a web page and how to structure that information.

Upon finishing the web page analysis, Import.io presents the data extracted from the web page in tabular format, as shown next in Figure 11.

Figure 11: Import.io has automation that inspects a web page to determine the data structures within the page.

The extraction intelligence in Import.io will discover the data fields that exist on the web page being inspected, as you've seen in Figure 11. However, just because Import.io retrieves all the data on the page, it does not follow that all of the extracted data is necessary. In addition, you might find that the names Import.io associates with each field in the extraction are not to your liking. Don't fret. Import.io allows you to customize field names according to your need, and you can format the data displayed in each field to meet your needs as well.

Customizing an Extractor

As you've seen in Figure 11 previously, Import.io structured all the data on the Job Listings web page and displayed its findings accordingly. In fact, Import.io is displaying more data than you need. Let's say that the only information of interest on the web page is Job Title, Location, and Description. Can you adjust the extractor to be that specific? Absolutely.


The first thing you need to do is put the extractor into edit mode. You'll do this by selecting the extractor you want to edit in the menu bar on the left side of the Import.io Dashboard as shown next in Figure 12.

Figure 12: Import.io has a feature that allows you to edit an extractor's columns.

Then, once selected, click the Edit button on the upper right of the Dashboard. (See Figure 12.)

Adding a Custom Column

Once the extractor is in edit mode, you'll proceed with the customization. The Edit page displays all of the columns that the extractor discovered automatically. You're going to delete all the columns and replace them with columns for Title, Description, City, and State. To delete all the columns, click the Delete all columns button as shown in Figure 13, below.

Figure 13: Deleting all the columns from an extractor allows you to create a fresh customization.

Clicking the Delete all columns button removes all columns from the UI but leaves behind a default column labeled New column, as shown next in Figure 14. This column is not bound to any data, nor is the label New column of any particular use, so you need to do two things. First, you'll give the column a useful name. Then, you'll bind the column to data from the target web page.

To rename the column, click the down arrow icon on the right side of the column. Clicking the down arrow displays a context menu that contains the Rename column item. (Please see Figure 14, next.)

Figure 14: Click the down arrow icon to display the edit choices for a given column.

Clicking the Rename column context menu item puts the column header into edit mode. Once in edit mode, enter a new value for the column header. In this case you'll enter the column header value, Title. (Please see Figure 15, callout (1), next.)

Now you need to bind some data to the column. You'll do this by clicking the field on the web page that contains the data of interest, starting with the first occurrence of that field on the page. After you click the first occurrence, click the next occurrence, as shown in Figure 15, callout (2) below. Clicking the same field in sequence trains the extractor to identify all occurrences of the field on the page. You can tell that the extractor has been properly trained to identify a field's pattern by looking at the column dialog on the right side of the web page, as shown in Figure 15, callout (3). Import.io automatically fills the column dialog with data the extractor has been trained to identify.

Figure 15: Import.io allows you to train an extractor to identify information on a web page as fields to be bound to a column of structured data.

Having trained the extractor to identify data for the Title column, you'll use the same training technique to create columns for Description, City, and State.

Take a look at the video for creating a custom column
We've made a video that you can view on YouTube that walks you through the details of creating a custom column in Import.io. You can view the video here.

Formatting Data in a Custom Column with Regular Expressions

You'll have occasions when a column displays information that's really two pieces of data. The most common example is the city, state string value, in which a single string implicitly contains two data fields, City and State. Figure 16 (next) shows that an extractor has been trained to identify a field that contains both City and State information separated by a comma. As you can see in Figure 16, when it comes to having valid data in a City column, the value Boston, MA will not suffice. All you want is the value Boston.

Fortunately, Import.io allows you to display a column's data according to a regular expression you specify. Thus, to satisfy the previous example, you can separate City from State and show only city values. Let's take a look at how to do this.

As shown in Figure 16 callout (1), you click the down arrow icon on the right side of the City column label. Clicking the icon displays a context menu that has the Set regular expression item.

Figure 16: Select [Set regular expression] (1) to apply a regular expression to a column's data.

Click the Set regular expression context menu item, as shown in Figure 16 callout (1). Clicking the context menu item displays the Set regular expression dialog box shown next in Figure 17.

Enter the regular expression "^(.*?)," as shown below in Figure 17, callout (1). That particular regular expression means "capture all the characters from the beginning of the line up to the first comma encountered." In the Replace text field, enter $1. This tells Import.io to replace the value in each row of the column with the group of characters captured by the regular expression. In this case the group is identified as $1, the first group. You can see by looking in the column dialog on the right (Figure 17, callout (2)) that the regular expression is working as desired.

Figure 17: The dialog, [Set regular expression] allows you to apply a regular expression to all data in a column.

You'll take the same approach to editing the State column. The extractor for the State column has been trained to identify the City, State sequence of characters, but you're only interested in the characters relevant to State. So you'll click the down arrow icon in the State column, as you did with the City column, and select Set regular expression. This time, though, you'll set a regular expression that starts at the first comma encountered and captures all the characters after that comma.

Figure 18 callout (1), next, shows that you entered the regular expression ",\s(.*)" and replaced the column's values with the captured characters accordingly. Figure 18 callout (2) shows the result of applying the regular expression.

Figure 18: Applying the regular expression.
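To see how the City and State expressions behave, here is a quick sketch in plain JavaScript. Import.io's dialog replaces each cell's value with the captured group ($1); the sketch approximates those semantics with a full-match replace, so it's an illustration rather than Import.io's actual engine:

```javascript
// Mimic the two Set regular expression rules in plain JavaScript.
const raw = 'Boston, MA';

// City column: "^(.*?)," captures everything before the first comma.
const city = raw.replace(/^(.*?),.*$/, '$1');

// State column: ",\s(.*)" captures everything after the comma and space.
const state = raw.replace(/^.*?,\s(.*)$/, '$1');

console.log(city);  // Boston
console.log(state); // MA
```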

Adding Default Values

Defining a default value for a column ensures that when no data is present, a character or number you provide appears rather than a blank. Set a default by clicking the column's Set default value context menu item. (Remember, you access a column's context menu by clicking the down arrow icon, as shown in Figure 19 callout (1).)

Figure 19: You apply a default value for a column by selecting Set default value from the Edit menu.

Clicking Set default value in a column's context menu shows the Enter default value dialog. (See Figure 20, next.) You enter the value you want to use as the default in the Default value text field. In this case you'll enter the string, Unknown. (Please see Figure 20, callout (1).) Then click the Save and Close button.


Now, every time a blank value is encountered, the extractor displays the value Unknown in the State column, as shown next in Figure 20 callout (2).

Figure 20: Setting a default value for a column avoids having rows of blank data.

Let's review the process for editing an extractor.

You've selected an extractor and set it into edit mode using the Import.io Dashboard. You deleted all the columns from the extractor and added new ones for Title, Description, City, and State. You trained the extractor to identify values for each of those columns. You applied regular expressions to the values in the City and State columns so each shows only the data relevant to that column. Also, you provided the default value Unknown in both the City and State columns.

Now you'll run the extractor and display the result in a Google spreadsheet.

Creating a Simple Report for Google Sheets Using Import.io

Import.io allows you to render an extractor's data in a variety of formats. One format is a directive you insert into a Google spreadsheet cell that will populate the sheet with the extractor's data in tabular format.

To get the directive, go to the Dashboard for your Import.io account. Select your extractor from the menu on the left side of the Dashboard web page. At the top of the Dashboard you'll see the Integrate button, as shown next in Figure 21 callout (1). Click Integrate to display the variety of rendering formats available for the selected extractor. Copy the directive from the Google Sheets text area, which is on the lower part of the Integrate page as shown in Figure 21 callout (2).

Figure 21: Import.io allows you to display data in a number of ways (JSON, CSV, and in Google Sheets).

Open a new spreadsheet in Google Sheets. Then in the first cell, A1, paste the directive text copied from Import.io as shown next in Figure 22.

Figure 22: Import.io creates an IMPORTDATA directive that will insert an extractor's data as rows and columns in a Google Sheet.
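For reference, the directive is an ordinary Google Sheets IMPORTDATA formula wrapping an Import.io data URL. The exact URL varies by extractor and account, so always copy yours from the Integrate page rather than constructing it by hand; it follows a pattern roughly like this (the path and placeholders here are illustrative):

```
=IMPORTDATA("https://data.import.io/extractor/<your-extractor-id>/csv/latest?_apikey=<your-api-key>")
```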

Hit the Enter key to invoke the pasted directive. The spreadsheet will populate with the data from your extractor in tabular format. (Please see Figure 23.)

Figure 23: The IMPORTDATA directive inserts data according to all the fields associated with the given extractor.

The IMPORTDATA directive constructed by Import.io will dump all columns associated with the related extractor, even the "hidden" ones you didn't define when you created the extractor. This is OK. If you want the Google Sheet to conform to your custom specification, you can simply hide the extra columns. Do not delete any columns; just hide them. Import.io's internals require that all columns imported via the extractor's Google Sheets directive be available in the spreadsheet, but not all need to be visible. (See Figure 24.)

Figure 24: You hide unnecessary columns to get the Google Sheet to conform to your display specification.

Once you have your extractor's data in a Google Sheet, you can do things such as create charts (as we did earlier in the article, up at Figure 3).

Referring back to Figure 21, in addition to making extractor data available to Google Sheets, Import.io can emit your extractor's data in JSON format (the third of the four options displayed in Figure 21). Using Import.io's export-as-JSON feature in conjunction with the Import.io API makes it easy to apply Import.io to your application development needs. Those of us who use APIs as the mainstay of our programming activity will really appreciate this.

Let's take a look.

Creating an Aggregation Report Using the Import.io API

Referring back to Figure 21, you can see that another option is the Live Query API. In other words, Import.io provides a REST API that allows you to do everything that's possible through the target web page's UI and more. For example, you can work with the Import.io API to aggregate information from many extractors (each connected to a different website) into a single feed, which is shown next. The way you'll do this is to create a Job Listings website named Job-Track-O-Tron. The website will allow a user to view a list of extractors from which to choose. Then, once he or she chooses an extractor, the user can click to show the job details from a list of jobs that are particular to the extractor. A user can also view an aggregated list of the jobs from all the extractors. (Please see Figure 25.)

Figure 25: The sample application, Job-Track-O-Tron, gets job lists and job details using the Import.io API.

Take a look at the video for working with the Proxy-Client application
We've created a short, 4-minute video that describes the concepts behind the architecture and components of the Proxy-Client application. You can view the video here.

Getting an API Key

In order to work with the Import.io API, you need to have an API Key. You're assigned an API Key when you create an account with Import.io. Your API Key is displayed in your account profile, as shown next in Figure 26.

Figure 26: Your Import.io API Key is displayed in your account profile.

Once you have an API key in hand, you have complete access to Import.io's REST API. You'll use the API key every time you make a call to an Import.io endpoint.
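As an illustration of what carrying the key on every call looks like in code, here's a small sketch that builds a request URL with the key as a query parameter. The endpoint path shown is an assumption for illustration; consult the Import.io documentation for the actual endpoints available to your account:

```javascript
// Build an Import.io request URL that carries the API key as the
// _apikey query parameter. The path below is illustrative only.
function buildExtractorUrl(extractorId, apiKey) {
    return 'https://data.import.io/extractor/' +
        encodeURIComponent(extractorId) +
        '/json/latest?_apikey=' + encodeURIComponent(apiKey);
}

console.log(buildExtractorUrl('abc-123', 'MY_KEY'));
// https://data.import.io/extractor/abc-123/json/latest?_apikey=MY_KEY
```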

Working with the API

All data used by the Job-Track-O-Tron website resides within Import.io. As mentioned previously, the purpose of the application is to list all the Job Site extractors you've created in Import.io and then to let you drill into the details of a site's job listings. The sample application will also allow you to list the data from all of the extractors at once.

To gain a full understanding of how Job-Track-O-Tron works you need to understand the parts that make up the entire application.


Job-Track-O-Tron has three components: the Import.io API, the Proxy-Adapter API, and the Proxy Clients. Table 1 shown next describes each component of the application.

  • Import.io API: Emits a list of extractors and provides the JSON data for each extractor based on its unique identifier. It requires the Import.io API key for access.
  • Proxy-Adapter API: Gets extractor information from the Import.io API and transforms the JSON returned from the Import.io API into a simpler JSON format that's easier for the Proxy-Client to consume.
  • Proxy-Client: Publishes the Single Page Application (SPA), written in Angular, that displays data retrieved from the Proxy-Adapter API.

Table 1: The components that make up the Job-Track-O-Tron application.


Figure 27 illustrates how the components described in Table 1 interact with each other. Notice that your Proxy-Adapter API uses data retrieved from the Import.io API. Also, notice that the Proxy-Adapter API publishes an endpoint that's used by the Proxy-Client. The purpose of the Proxy-Client is to publish a SPA that displays information retrieved by the Proxy-Adapter API.

Figure 27: This article's sample application is made up of three components.

How Aggregation Is Implemented

The purpose of the Proxy-Adapter API is to encapsulate your API Key and use it to access your extractor data that resides in Import.io. So, one key function of the Proxy-Adapter API is that it retrieves extractor data via the Import.io API so that you can present the list of extractors for the end user to pick from (as a means of filtering). Thus, you can think of the Proxy-Adapter API as a "pass-through" to the Import.io API. However, the Proxy-Adapter API provides another service: it can aggregate data from all your extractors into a single feed. In this case, the aggregation is a list of all jobs listed in all the extractors.

The Proxy-Adapter API implements the aggregation of job listings into a master list by applying a naming convention to the name (description) of an extractor. (When you name an extractor in the Import.io website, you're actually setting the value of the extractor's description attribute in terms of the JSON emitted by the API.) In other words, instead of having the Proxy-Adapter API aggregate the data from every extractor under your Import.io account, you add a special suffix to an extractor's name (description) as a way to tell the Proxy-Adapter API to "aggregate only those with the suffix." The special suffix you use is x-aggregate. This suffix is meaningful only to the Proxy-Adapter API.
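The suffix check itself reduces to a one-line regular expression anchored at the end of the description string. As a quick illustration in JavaScript (the function name here is just for the sketch):

```javascript
// Returns true only when the description ends with the x-aggregate suffix.
const isAggregated = (description) => /(x-aggregate)$/.test(description);

console.log(isAggregated('dice.com x-aggregate')); // true
console.log(isAggregated('dice.com'));             // false
console.log(isAggregated('x-aggregate dice.com')); // false (suffix must be at the end)
```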

For example, let's say you have an extractor that is labeled dice.com as shown in Figure 28.

Figure 28: Add the x-aggregate suffix to those Job Sites that you want aggregated into the master Job List.

If you want the information in that extractor to be aggregated into the master Job List, you'll change the description of the extractor to dice.com x-aggregate. When it comes time to create an array (list) of all jobs, intelligence in the Proxy-Adapter API will get all of your extractors by making a request to the Import.io API. Then, the Proxy-Adapter API code will apply a regular expression to the description attribute of each JSON object in the array of extractor data. If an extractor's description ends with the characters x-aggregate, the Proxy-Adapter API makes a call back to the Import.io API to get the data associated with that extractor, and the extractor's data is added to the master list of all jobs. If you do not add the x-aggregate suffix to the extractor name, the Proxy-Adapter will not add the extractor's data to the master job list. Listing 1 (next) shows the Node.js code that implements this logic.

const _ = require('lodash');

// getExtractors and getExtractor are defined elsewhere in this module;
// both return Promises that resolve with data retrieved from the Import.io API.
module.exports.aggregateExtractors = function aggregateExtractors(refresh) {
    return getExtractors(null, refresh)
        .then(result => {
            // Keep only the extractors whose description ends with "x-aggregate"
            const validObjs = _.filter(result, r => r.description.match(/(x-aggregate)$/));

            // Kick off a request for each eligible extractor's data
            const getters = _.map(validObjs, r => getExtractor(r.id));

            return Promise.all(getters);
        })
        .then(result => {
            // Combine the per-extractor arrays into a single master list
            return _.flatten(result);
        });
};

Listing 1: The Proxy-Adapter API, written in Node.js, inspects an extractor's description attribute for the x-aggregate suffix, thus identifying the extractor as part of the master Job List.

Working with the Proxy-Adapter API

As mentioned above, the Proxy-Adapter API works with the Import.io API to get extractor information. The Proxy-Adapter API publishes a single endpoint, /api, that allows you to get a list of all your extractors, a single extractor, or an aggregation of the data in all eligible extractors.

The /api endpoint provides a variety of parameter options that allow you to retrieve the information you need. Table 2 describes these various options.

  • /api: Returns a JSON array that describes all extractors according to the given API key.
  • /api/:id: Returns the extractor JSON, which includes listing data and descriptive metadata about the extractor, according to the unique identifier of the given extractor.
  • /api/:id?refresh=true: Makes the Proxy-Adapter API force Import.io to refresh an extractor's data before returning it.
  • /api?aggregation=true: Returns an array of JSON objects that aggregates all data from extractors named with the x-aggregate suffix.

Table 2: The various parameter options for using the Proxy-Adapter.
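One way to picture how those parameterizations map onto handler calls is the following sketch. This is not the article's actual source: the handler names mirror Listing 1, and in practice Express (or any HTTP framework) would supply the real path and query parsing:

```javascript
// Dispatch a request path and query object (as an HTTP framework would
// provide them) to the appropriate Proxy-Adapter handler. Hypothetical
// sketch; handler names mirror those used in Listing 1.
function routeRequest(path, query, handlers) {
    if (path === '/api' && query.aggregation === 'true') {
        return handlers.aggregateExtractors();
    }
    const match = path.match(/^\/api\/([^/?]+)$/);
    if (match) {
        return handlers.getExtractor(match[1], query.refresh === 'true');
    }
    return handlers.getExtractors();
}

// Stub handlers standing in for the real Import.io calls.
const handlers = {
    getExtractors: () => 'all-extractors',
    getExtractor: (id, refresh) => 'extractor:' + id + ':refresh=' + refresh,
    aggregateExtractors: () => 'aggregated-jobs'
};

console.log(routeRequest('/api', {}, handlers));                      // all-extractors
console.log(routeRequest('/api/123', { refresh: 'true' }, handlers)); // extractor:123:refresh=true
console.log(routeRequest('/api', { aggregation: 'true' }, handlers)); // aggregated-jobs
```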


The Proxy-Client uses the Proxy-Adapter API exclusively to get the data it needs to display.

Listing 2 (next) shows an example of JSON from an array of All Job Listings. All Job Listings is an aggregation of all jobs from all extractors named with the suffix x-aggregate.

{
    "Title": [{
        "text": "Senior Design Engineer",
        "href": "http://www.jobserve.com/us/en/search-jobs-in-Lake-Forest,-California,-USA/SENIOR-DESIGN-ENGINEER-9843F7E77995B5FC/?utm_source=50&utm_medium=job&utm_content=1&utm_campaign=JobSearchLanding"
    }],
    "City": [{ "text": "Lake Forest" }],
    "State": [{ "text": "California" }],
    "Description": [{ "text": "A very exciting worldwide prosthetic solutions medical device company that is looking for a Sr. Design Engineer with a mechanical engineering background. This will be a 8-month contract to hire position on site in Orange County. This person must be an..." }]
}

Listing 2: A row of custom extractor data in JSON format.
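Notice that each column in the row arrives as an array of objects carrying a text property (and, for links, an href property). A client will usually want to flatten that shape into plain key/value pairs before display. The helper below is a suggested sketch for doing so, not code from the sample app:

```javascript
// Flatten one Import.io row (as in Listing 2) into a plain object,
// keeping the first value of each column. When a value carries an
// href, store it under "<Column>Url" alongside the text.
function flattenRow(row) {
  const flat = {};
  for (const [column, values] of Object.entries(row)) {
    if (Array.isArray(values) && values.length > 0) {
      flat[column] = values[0].text;
      if (values[0].href) flat[`${column}Url`] = values[0].href;
    }
  }
  return flat;
}

const row = {
  Title: [{ text: 'Senior Design Engineer', href: 'http://www.jobserve.com/...' }],
  City: [{ text: 'Lake Forest' }],
  State: [{ text: 'California' }]
};
console.log(flattenRow(row));
// A flat object with Title, TitleUrl, City, and State keys.
```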

Making the Import.io API Key Known to the Proxy-Adapter API

As mentioned above, one responsibility of the Proxy-Adapter API is to encapsulate your API key and keep it out of public view. Because the key is treated as a secret, you want to minimize the chance that an attacker who gains access to the Proxy-Adapter API's source code can read the key in clear text. Only the Proxy-Adapter API knows your API key, and only the Proxy-Adapter passes that information on to the Import.io API.

You make your Import.io key known only to the Proxy-Adapter API by setting an environment variable named IMPORT_IO_KEY in the server process that runs the Proxy-Adapter API code. Thus, the environment variable IMPORT_IO_KEY is associated with your particular API key. You set the environment variable according to the method prescribed by the operating system of the computer running the Proxy-Adapter API.
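In Node.js, the server reads that environment variable through process.env. A minimal sketch of a fail-fast read follows; the actual Proxy-Adapter code may organize this differently:

```javascript
// Read the Import.io API key from the environment and fail fast when
// it is missing, so the server never starts without credentials.
// Accepting the env object as a parameter makes the helper testable.
function getApiKey(env = process.env) {
  const key = env.IMPORT_IO_KEY;
  if (!key) {
    throw new Error('IMPORT_IO_KEY is not set; refusing to start.');
  }
  return key;
}
```

Failing at startup, rather than on the first request, keeps a misconfigured server from silently returning errors to every client.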

Applying Aggregation

Having the ability to aggregate all data from a variety of extractors into a single list is a powerful feature of the Proxy-Adapter API; however, there is a limitation. Every extractor that you want to aggregate into a master list must adhere to the same data structure. In other words, if you want your aggregation to publish the attributes Title, Description, City, and State of all job listings, then each extractor must be configured to have the columns Title, Description, City, and State. (Please see Figure 29.)

Figure 29: You need to adhere to a standard set of columns if you want to aggregate data from a variety of extractors.


Nothing bad will happen if an extractor has additional columns; however, a consistent set of columns makes things a lot easier when it comes time to have a client consume the aggregation of all the extractors.
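The column-consistency requirement can be enforced defensively when merging rows. The sketch below uses the article's job-listing schema (Title, Description, City, State); the validation helper itself is a suggestion, not part of the published Proxy-Adapter code:

```javascript
// The standard column set every aggregated extractor must expose,
// per the job-listing schema described in the article.
const REQUIRED_COLUMNS = ['Title', 'Description', 'City', 'State'];

// True when a row carries every required column.
function hasRequiredColumns(row, required = REQUIRED_COLUMNS) {
  return required.every((column) => column in row);
}

// Merge rows from many extractors into one master list, silently
// dropping any row that lacks the standard columns. extractorResults
// is an array of row arrays, one per extractor.
function aggregateRows(extractorResults, required = REQUIRED_COLUMNS) {
  return extractorResults
    .flat()
    .filter((row) => hasRequiredColumns(row, required));
}
```

Dropping nonconforming rows (rather than throwing) keeps one misconfigured extractor from taking down the whole master list; logging the dropped rows would make the misconfiguration visible.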

Let's move on to the Proxy Client.

Working with the Proxy Client

The purpose of the Proxy Client is to publish the SPA that displays the various lists of information that you can get from the Proxy-Adapter. As mentioned above, the Proxy-Client (source code) is written in Angular.

The Proxy Client is the front end to the system, so when you add an extractor using the Dashboard in Import.io, that information will appear automatically in the Proxy Client, provided that both the Proxy Client website and the Proxy-Adapter API are up and running. And when you apply the x-aggregate suffix to an extractor's name, that extractor's data will show up when it comes time to "Show All Jobs." (Please see Figure 30.)
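That suffix convention amounts to a simple string check. The sketch below assumes the extractor JSON exposes a name field; as Listing 1 notes, the published code inspects the description attribute, so adjust the field to match how you configured your extractors:

```javascript
// True when an extractor belongs to the master Job List, i.e., when
// its name carries the x-aggregate suffix. Whether the suffix lives
// on the name or the description attribute depends on your setup;
// this sketch assumes a "name" field.
function isAggregateExtractor(extractor) {
  return typeof extractor.name === 'string' &&
         extractor.name.endsWith('x-aggregate');
}
```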

Figure 30: The Proxy-Client uses an SPA written in Angular to display extractor data.


Putting It All Together

We've covered a lot in this article. I showed you how to use Import.io to extract data from a web page automatically using an extractor. I showed you how to use Import.io's edit feature to customize the columns in an extractor and how to train those columns as part of the editing process. Also, I showed you how to customize the display of data in a column using a regular expression and a default value.

I showed you the sample application we created for this article, Job-Track-O-Tron. I gave you an introductory explanation of the components of the application. I explained how the Proxy-Adapter API interacts with the Import.io API to get extractor data. Also, I explained at a conceptual level how the Proxy-Adapter API identifies extractors that are to be used for aggregation. Finally, I gave you an overview of the Proxy-Client component and how it uses an SPA written in Angular to display the Job Listing information contained in the various extractors.

Granted, we covered a lot, but we only scratched the surface. The Import.io service is itself a very big product. It has many capabilities that require time and attention to master, but each time you learn something new, you'll see an immediate benefit. It took me only 10 minutes to learn to generate my first extractor automatically; learning customization took a little longer. The company offers many learning aids, as well as sales and support staff who will go to great lengths to get you up and running with the product.

As far as the sample app, Job-Track-O-Tron, goes, you need to learn a lot of little details to gain a full understanding of the code and to use the example to your benefit. There is not a lot of code in play, but despite the brevity, the code does a lot. Take a look at the code. The documentation is informative and provides a good example of how to work with the Import.io API.

Also, take a look at the screencasts provided here that show you how to create an extractor and then customize it to suit a special need. In this case, you customized extractors to make them compatible for use in the sample job listing application. Of course, you can extend the customization technique to meet your own needs.

Import.io fills a real need in the world of modern data collection. The UI is solid and the API is broad. For those of us who work with APIs for a living, the documentation is informative. You'll have to experiment with the product to get the hang of things. Also, remember that Import.io does a lot of inspection, inferencing, and calculation behind the scenes, so you'll want to be aware of latency and performance constraints as you get accustomed to using the API. Still, overall Import.io is a solid tool for data extraction, and the fact that it can be used in a variety of ways to produce data in a variety of formats only enhances its power. So the next time you find a page that has data you want but can't use in a structured manner, consider Import.io.

As we completed our testing, one major caveat of Import.io revealed itself: it had no built-in way to deal with paginated data. In other words, if a job site can display only 15 jobs at a time out of a total of 500 job listings (with the remaining 485 listings viewable on subsequent pages), Import.io cannot detect the pagination and extract the data from those subsequent pages. The problem revealed itself in two scenarios: the first involving websites where each page in the paginated series has a different URL, and the second involving sites where the URL stays the same as the user moves from page to page.

One brute-force solution to the problem is to exercise the target site's option to display all the listings on one page, but sites that offer such an option are few and far between. After we suggested to Import.io that the ability to deal with paginated data was the most significant opportunity for improvement, the company updated the service in a way that deals with the first scenario. The feature worked as advertised in some light testing by ProgrammableWeb, but further testing is required to ensure that it works across a wide variety of sites.

According to Import.io product manager Nicolas Mondada, the company apparently plans to address the second scenario with an update to the service that's due in early July.

Getting the Code

You can get the code for the Proxy-Adapter API and the Proxy Client on GitHub:
