Migrate HTML the easy way

Since 1997, MIT's Cultura has brought students from two different parts of the world together in a series of online exchanges that help each group understand the other's culture. Students respond anonymously to thought-provoking prompts in their own languages and then discuss the two classes' paired responses bilingually.

Created by a French language class at MIT as an exchange between American students and French students, the project grew to include more than 30 schools and eight languages. A pioneer in international collaborative learning, Cultura also pioneered sharing the learning online.

Unfortunately, by 2014, most of Cultura's 18 years' worth of archives were no longer online. To get them back on the web, Agaric used the Migrate module to bring their collection of HTML files into Drupal.

A common approach for migrating from a list of files, where each file will become a node in Drupal, is to use MigrateSourceList as the source. It needs an instance of MigrateList and an instance of MigrateItem, which represent the collection and the individual item respectively.

The Migrate module provides the class MigrateItemXML for importing content from XML files, but our input happens to be HTML from the late 1990s and early 2000s. Luckily, libxml, which powers PHP's XML support, can also parse HTML. Hence it does not take much work to create a subclass of MigrateItemXML that works with HTML files. The only method we needed to override is MigrateItemXML::loadXmlUrl(), which is expected to return an instance of SimpleXMLElement.

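class MigrateItemHTML extends MigrateItemXML {
  /**
   * Parses the HTML file at $item_url and returns it as a SimpleXMLElement.
   */
  protected function loadXmlUrl($item_url) {
    // DOMDocument's HTML parser tolerates markup that is not well-formed XML.
    $dom = new DOMDocument();
    $dom->loadHTMLFile($item_url);
    return simplexml_import_dom($dom);
  }
}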
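
The conversion inside loadXmlUrl() can be tried outside of Drupal. The following is a minimal sketch rather than code from the Cultura migration: the sample markup is made up, it uses DOMDocument::loadHTML() on a string instead of loadHTMLFile() so that it needs no file on disk, and the call to libxml_use_internal_errors() is an addition that keeps parser complaints about sloppy markup from being emitted as PHP warnings.

```php
<?php
// Hypothetical sample markup in the loose style of late-1990s HTML:
// uppercase tags, an unquoted attribute, unclosed <P> and <LI> elements.
$html = <<<HTML
<HTML>
<HEAD><TITLE>Sample exchange</TITLE></HEAD>
<BODY BGCOLOR=white>
<H1>Word associations</H1>
<P>First paragraph
<P>Second paragraph
<UL><LI>one<LI>two</UL>
</BODY>
</HTML>
HTML;

// Collect parser complaints internally instead of raising PHP warnings.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Same conversion as in loadXmlUrl(): DOM tree to SimpleXMLElement.
$xml = simplexml_import_dom($dom);

// The HTML parser lowercases element names, so the lookups use lowercase.
$title = (string) $xml->head->title;
$paragraphs = $xml->xpath('//p');
$items = $xml->xpath('//li');

echo $title, "\n";
echo count($paragraphs), " paragraphs, ", count($items), " list items\n";
```

Because the result is an ordinary SimpleXMLElement, XPath expressions can address the parsed HTML in the same way they would address an XML source.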

This class can now be used to set up the source of a migration:

abstract class CulturaMigration extends XMLMigration {
  public function __construct($arguments) {
    // ...
    $base_dir = DRUPAL_ROOT . '/../archives';
    $directories = array(
      "$base_dir/{$arguments['directory']}",
    );
    $file_mask = '/(.*\.htm$|.*\.html$)/i';
    $list = new MigrateListFiles($directories, $base_dir, $file_mask);
    $item = new MigrateItemHTML($base_dir . ':id');
    $this->source = new MigrateSourceList($list, $item);
    // ...
  }
  // ...
}
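
To make the file mask and the ':id' placeholder in the constructor above more concrete, here is a small standalone sketch. It is an illustration under assumptions, not Migrate's own code: the file names and the /var/www/archives base directory are hypothetical, and it assumes that each ID is the file path with the base directory stripped and that the item URL is built by substituting that ID for ':id'.

```php
<?php
// The file mask used in the migration constructor.
$file_mask = '/(.*\.htm$|.*\.html$)/i';

// Hypothetical file IDs (paths relative to the base directory).
$candidates = array(
  '/1997/index.HTM',
  '/2003/answers.html',
  '/2003/logo.gif',
  '/notes.html.bak',
);

// Keep only the names that end in .htm or .html, ignoring case.
$matched = array();
foreach ($candidates as $path) {
  if (preg_match($file_mask, $path)) {
    $matched[] = $path;
  }
}

// Assumption: the item URL comes from replacing ':id' with the file's ID.
$base_dir = '/var/www/archives';
$item_url_pattern = $base_dir . ':id';
$urls = array();
foreach ($matched as $id) {
  $urls[] = str_replace(':id', $id, $item_url_pattern);
}

print_r($urls);
```

In this sketch the image and the .bak file are filtered out, and the two remaining IDs expand to full paths under the base directory, which is the kind of value loadXmlUrl() would receive as $item_url.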

Through the archives we can learn many interesting things, such as that some students at MIT literally don't know the meaning of solidarity.
