Using Merlin to Migrate Your Site to Sitecore

Vernon June 14, 2023

In a previous post, I shared my journey with Sitecore Data Exchange Framework. I was excited to use it to create a reusable importer process, but in the end I decided it was not the right tool for me and used PowerShell instead. While researching site migration tools, I discovered “Merlin”. Merlin uses YAML configuration files to crawl a site and create structured representations (JSON) of its pages. It would be easy to create a PowerShell script to import this data, and that would be a very fast way to create the basic structure of your site: items in the content tree, the correct hierarchy of pages, page titles and metadata, and basic page content. I’ll use the blogs.perficient.com site to show some examples.

Setup

Merlin is a PHP-based tool. Setup was super easy! I used Chocolatey to install PHP and Composer.

choco install php
choco install composer

I downloaded the tool from GitHub. Under the assets section, click “merlin-framework”. This file is a PHP archive (phar) and runs from the command line or PowerShell with the php.exe command.

Crawl

Merlin includes a crawler that can be used to create a list of URLs for the generate feature. The crawler can be configured (https://salsadigitalauorg.github.io/merlin-framework/docs/crawler) to cache results, include or exclude specific URLs by regex pattern, follow redirects, and ignore the robots.txt file.

To run the crawl command, navigate to the directory where you downloaded the tool. Use the php executable to run the tool. The -c option specifies your crawler config. The -o option specifies where to put the output files.

PS > C:\tools\php81\php.exe merlin-framework crawl -c .\prft\crawler_blogs_merlin.yml -o .\output\prft

You can use multiple crawl configs to crawl different sections of your site. The entity_type option sets the name of the output file: “blogs_site_structure” creates a URL list file called “crawled-urls-blogs_site_structure_default.yml”. This makes it easier to break the site into groups of pages that have similar DOM layouts (e.g. blogs, news, products) for the generate command.

In this example, I limited my results to 50 (maximum_total) and restricted the crawl to specific directories. You can use the crawler_include and include options to limit which URLs end up in the output file.

---
    domain: 
    entity_type: blogs_site_structure    
        
    options:        
        cache_enabled: true
        delay: 500
        maximum_total: 50
        urls:
            - /2023/06/            
            - /2023/05/
            - /2023/04/
            - /2023/03/
            - /2023/02/
            - /2023/01/
            - /2022/12/            
            - /2022/11/
            - /2022/10/
            - /2022/09/
            - /2022/08/
            - /2022/07/
            - /2022/06/
        crawler_include:
            - "~/202[23]/\d{2}/\d{2}/.*~"
        include:
            - "~/202[23]/\d{2}/\d{2}/.*~"

Below is an excerpt from my output file. The include option kept /2023/06/ and similar month-level pages out of the output.

---
urls:
  - /2023/06/11/install-docker-on-an-amazon-ec2-instance-using-the-yum-package-manager/
  - /2023/06/10/introduction-to-terraform-day-1/
  - /2023/06/09/unlocking-digital-accessibility-exploring-the-power-of-cognitive-assistive-technologies-2-2/
  - /2023/06/09/unlocking-digital-accessibility-exploring-the-power-of-cognitive-assistive-technologies-2/
  - /2023/06/09/oci-gen-2-refresh-token-setup-troubleshooting/
  - /2023/06/09/what-if-college-was-just-a-pit-stop-an-interview-with-luca-ranzani/
  - /2023/06/09/the-ev-leadership-of-luca-ranzini-proves-automotive-has-a-bright-future/
  - /2023/06/09/unleashing-creativity-through-constraints/
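
To see why the month-level pages drop out, here is a minimal Python sketch of the include filter. It is not Merlin's code: Merlin evaluates the pattern with PHP's regex engine (the ~ characters are PHP delimiters), and this sketch assumes a URL is kept when the pattern matches anywhere in its path. The function name filter_urls is hypothetical.

```python
import re

# The pattern from the crawler config, without the PHP "~" delimiters.
INCLUDE_PATTERN = re.compile(r"/202[23]/\d{2}/\d{2}/.*")


def filter_urls(urls):
    """Keep only URLs that match the include pattern (assumed: match anywhere)."""
    return [url for url in urls if INCLUDE_PATTERN.search(url)]


candidates = [
    "/2023/06/",  # month-level page: only one two-digit segment after the year
    "/2023/06/10/introduction-to-terraform-day-1/",
    "/2022/12/",
    "/2023/06/09/unleashing-creativity-through-constraints/",
]

kept = filter_urls(candidates)
for url in kept:
    print(url)
```

Only the two post URLs survive, because the pattern requires a year, a month, and a day segment before the slug.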

Generate

To generate your structured data, you need a list of URLs to process and a list of field mappings. The generator can be configured to cache results, read from the cache to make future processing faster, use CSS or XPath selectors to map fields to the DOM, and perform post-processing on the data.

To run the generate command, navigate to the directory where you downloaded the tool. Use the php executable to run the tool. The -c option specifies your generate config. The -o option specifies where to put the output files.

PS > C:\tools\php81\php.exe merlin-framework generate -c .\prft\blogs_merlin.yml -o .\output\prft

Merlin defines several special data types to make mapping easy.

alias – The url of the page
link – Reads the href attribute and text of an anchor tag
long_text – Reads multiline text and rich text fields
meta – Reads the content attribute of a meta tag
media – Generates a separate file for media items grouped by the specified type
static_value – Writes a fixed string to the output exactly as entered
taxonomy_term – Generates a separate file for taxonomy terms grouped by the specified type
text – Reads a single line text field

The text and long_text types have extra processors that can be applied to the result.

nl2br – Changes new lines to br tags
remove_empty_tags – Removes any empty tags (ie <p></p>)
replace – Regular expression based string replacement (NOTE: this uses PHP’s preg_replace function, which behaves differently from the regular expression engine in .NET)
strip_tags – Removes tags not in the allowed_tags list
whitespace – Removes extra whitespace characters
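
To get a feel for what three of these processors do to a field value, here is a rough Python approximation. These are not Merlin's implementations: the function bodies are my assumptions based only on the one-line descriptions above (nl2br is assumed to behave like PHP's nl2br), so treat the exact output as illustrative.

```python
import re


def nl2br(text):
    """Insert a <br /> before each newline (assumed to mirror PHP's nl2br)."""
    return re.sub(r"(\r\n|\n|\r)", r"<br />\1", text)


def remove_empty_tags(html):
    """Drop elements with no content, such as <p></p> or <span> </span>."""
    pattern = re.compile(r"<(\w+)[^>]*>\s*</\1>")
    previous = None
    while previous != html:  # repeat so newly emptied parents are removed too
        previous = html
        html = pattern.sub("", html)
    return html


def whitespace(text):
    """Collapse runs of whitespace to a single space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


print(nl2br("line one\nline two"))
print(remove_empty_tags("<p>Hi</p><p></p><div><span> </span></div>"))
print(whitespace("  two   words \n"))
```

Processors run in the order they are listed in the config, so it is worth checking the intermediate result when you chain several of them on one field.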

Compare the following config to the source of this page and see if you can match the fields to the HTML source.

---
    domain: 
    
    urls:
     - /2023/05/30/perficient-included-in-two-commerce-focused-idc-market-glances/
     - /2023/05/16/getting-to-know-sitecore-search-part-4/

    urls_file:
      - "..\output\prft\effective-urls-blogs_site_structure_default.yml" #relative to the location of this file
        
    fetch_options:  
      delay: 500
      ignore_ssl_errors: true      

    entity_type: blogs
    mappings:
      - field: url
        type: alias
        
      - field: sitecore_root_path
        type: static_value
        options:
          value: "/sitecore/content/tenant/site/home/blogs"
          
      - field: sitecore_template_id
        type: static_value
        options:
          value: "{B69EDBAD-2FDF-4120-B13C-CDDF4B127B9F}"
          
      - field: meta_title
        type: text
        selector: //title            
        
      - field: meta_keywords
        type: meta
        options:
          value: keywords
          attr: name
          
      - field: meta_description
        type: meta
        options:
          value: description
          attr: name
          
      - field: meta_og_title
        type: meta
        options:
          value: og:title
          attr: property
          
      - field: meta_og_description
        type: meta
        options:
          value: og:description
          attr: property          
          
      - field: meta_og_sitename
        type: meta
        options:
          value: og:site_name
          attr: property    
          
      - field: meta_og_type
        type: meta
        options:
          value: og:type
          attr: property
          
      - field: meta_og_image
        type: meta
        options:
          value: og:image
          attr: property      
          
      - field: meta_published_time
        type: meta
        options:
          value: article:published_time
          attr: property      
          
      - field: featured_image
        type: media
        selector: div.story-two-header-content-img img
        options:
          file: src          
          alt: alt
          type: featured_images
          
      - field: title
        selector: h1:first-of-type
        type: text
        processors:
          - processor: nl2br
          
      - field: primary_category
        selector: p.eyebrow-header-eyebrow
        type: text
        
      - field: author
        selector: h4.byline span.author a
        type: text
        
      - field: ****
        selector: h4.byline span.****
        type: text
        
      - field: content
        selector: div.entry
        type: long_text
        processors:
          - processor: nl2br
          - processor: remove_empty_tags
          - processor: whitespace

      - field: content_images
        type: media
        selector: div.entry img
        options:
          file: src          
          alt: alt
          type: content_images
          
      - field: author_page
        type: link
        selector: div.author-avatar-and-name-avatar a
        options:
            link: href            
          
      - field: author_image
        type: media
        selector: div.author-avatar-and-name-avatar img
        options:
            file: src
            alt: alt
            type: author_images
        
      - field: author_bio
        selector: div.author-avatar-and-name-description p:first-of-type
        type: text
        processors:
            - processor: replace
              pattern: "More from this Author"
      
      - field: categories        
        selector: //div[@class="widget"]//ul/li  #Taxonomy_term only works with xpath selector        
        type: taxonomy_term
        vocab: category        
        children:
            - field: uuid
              type: uuid              
              selector: a
            - field: name
              type: text
              selector: a
            
      - field: tags
        selector: div.tags-author-info a
        type: text

The output file is in JSON format, with your field names as the keys and the content matched by your selectors as the values. I included two fields that make it easier to import this data into Sitecore.

sitecore_root_path – A static_value that contains the path to use as the parent when creating the new page in Sitecore
sitecore_template_id – A static_value that contains the id of the template to use when creating the new page in Sitecore

[
    {
        "url": "/2023/05/16/getting-to-know-sitecore-search-part-4/",
        "sitecore_root_path": "/sitecore/content/tenant/site/home/blogs",
        "sitecore_template_id": "{B69EDBAD-2FDF-4120-B13C-CDDF4B127B9F}",
        "meta_title": "Getting to know Sitecore Search – Part 4 / Blogs / Perficient",
        "meta_description": "Dig into the weeds of managing your search sources. Learn about triggers, extractors, attributes, excluding urls, and scan frequency.",
        "meta_og_title": "Getting to know Sitecore Search – Part 4 / Blogs / Perficient",
        "meta_og_description": "Dig into the weeds of managing your search sources. Learn about triggers, extractors, attributes, excluding urls, and scan frequency.",
        "meta_og_sitename": "Perficient Blogs",
        "meta_og_type": "article",
        "meta_og_image": "/files/forest-simon-TX0ufDSCV4-unsplash-scaled.jpg",
        "meta_published_time": "2023-05-16T13:30:04+00:00",
        "featured_image": [
            "55789139-9512-33f6-bf85-cd1897eb36fa"
        ],
        "title": "Getting to know Sitecore Search – Part 4",
        "primary_category": "Sitecore",
        "author": "Eric Sanner",
        "****": "May 16th, 2023",
        "content": {
            "format": "rich_text",
            "value": "<div id=\"bsf_rt_marker\"><p>Welcome back to getting to know Sitecore search</p>shortened for brevity<p>In the next post, we’ll build a simple UI and connect to the api to get our first real results!</p></div>"
        },
        "content_images": [
            "b1b335ec-95a7-3609-89d0-65805abd3c68",
            "2eeacd44-f2d4-337c-86eb-ee1249bcf10b",
            "974e0ccb-ce39-3fc2-bfd7-299dbfd1f9b1",
            "379645ef-b0d8-3fa9-8088-49974bc4312e"
        ],
        "author_page": [
            {
                "link": "/author/esanner/",
                "text": ""
            }
        ],
        "author_image": [
            "43d4f615-da07-3a01-ad1f-5f53d4b11835"
        ],
        "author_bio": "",
        "categories": [
            "205dd6d7-887c-3501-b20a-3a2137437a47",
            "00ba851e-b649-30d3-902a-3a32d230110f",
            "8492492a-25c4-3c45-95a1-59c3d6b59620"
        ],
        "tags": [
            "Sitecore",
            "Sitecore.Search"
        ]
    },
    {
        "url": "/2023/05/23/the-dialogue-element-modals-made-simple/",
        "sitecore_root_path": "/sitecore/content/tenant/site/home/blogs",
        "sitecore_template_id": "{B69EDBAD-2FDF-4120-B13C-CDDF4B127B9F}",
        "meta_title": "The Dialogue Element: Modals Made Simple / Blogs / Perficient",
        "meta_description": "The new dialogue element makes modals simple. Learn to create user-friendly modals with ease using the new HTML dialogue element.",
        "meta_og_title": "The Dialogue Element: Modals Made Simple / Blogs / Perficient",
        "meta_og_description": "The new dialogue element makes modals simple. Learn to create user-friendly modals with ease using the new HTML dialogue element.",
        "meta_og_sitename": "Perficient Blogs",
        "meta_og_type": "article",
        "meta_og_image": "/files/Group-of-People-Holding-Speech-bubbles-scaled.jpg",
        "meta_published_time": "2023-05-23T13:17:20+00:00",
        "featured_image": [
            "f8a33d23-893e-3625-8665-76bf657c8e72"
        ],
        "title": "The Dialogue Element: Modals Made Simple",
        "primary_category": "Accessibility",
        "author": "Drew Taylor",
        "****": "May 23rd, 2023",
        "content": {
            "format": "rich_text",
            "value": "<div id=\"bsf_rt_marker\"><p>Any front-end developer has likely experienced the pain of covering an exhaustive list of accessibility and UI edge cases while implementing modals. Well guess what? Not any longer.</p>shortened for brevity<p>The dialog is just another HTML element. Style it with CSS just as any other HTML element.</p></div>"
        },
        "content_images": [
            "add188a4-2d87-32ae-9a63-77bbfbaef6fb"
        ],
        "author_page": [
            {
                "link": "/author/drewtaylor/",
                "text": ""
            }
        ],
        "author_image": [
            "3cbd7fd0-687e-3856-b212-38848e781de0"
        ],
        "author_bio": "Drew is a technical consultant at Perficient. He enjoys writing code and books, talking AI, and advocating accessibility.",
        "categories": [
            "f47d9f1b-dc4f-31fd-b896-3fa00fe4d304"
        ],
        "tags": [
            "accessibility",
            "AI",
            "modal",
            "UX"
        ]
    }
]
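
Here is a minimal sketch of how an import script might turn each record into the values it needs to create a page. It is written in Python to keep it short, and the same logic would port to a PowerShell import script. The function name plan_item, the keys it returns, and the choice to use the last URL segment as the item name are all my assumptions, and the actual item creation in Sitecore is not shown.

```python
import json

# A trimmed record shaped like the generate output above.
SAMPLE_OUTPUT = """
[
    {
        "url": "/2023/05/16/getting-to-know-sitecore-search-part-4/",
        "sitecore_root_path": "/sitecore/content/tenant/site/home/blogs",
        "sitecore_template_id": "{B69EDBAD-2FDF-4120-B13C-CDDF4B127B9F}",
        "title": "Getting to know Sitecore Search – Part 4"
    }
]
"""


def plan_item(record):
    """Turn one Merlin record into the values an import script would need."""
    # Assumption: the last URL segment (the slug) becomes the item name.
    slug = record["url"].strip("/").split("/")[-1]
    return {
        "name": slug,
        "parent_path": record["sitecore_root_path"],
        "item_path": record["sitecore_root_path"] + "/" + slug,
        "template_id": record["sitecore_template_id"],
        "title": record["title"],
    }


plans = [plan_item(record) for record in json.loads(SAMPLE_OUTPUT)]
for plan in plans:
    print(plan["item_path"], plan["template_id"])
```

Keeping the planning step separate from the step that writes to Sitecore makes it easy to review the planned paths before anything is created.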

The taxonomy_term field type creates a GUID value for each category and outputs the mapping in a separate file. These items could be imported into Sitecore first so the content can be mapped to the correct category.

{
    "data": [
        {
            "uuid": "205dd6d7-887c-3501-b20a-3a2137437a47",
            "name": "Technical"
        },
        {
            "uuid": "00ba851e-b649-30d3-902a-3a32d230110f",
            "name": "Development"
        },
        {
            "uuid": "8492492a-25c4-3c45-95a1-59c3d6b59620",
            "name": "Sitecore"
        },
        {
            "uuid": "f47d9f1b-dc4f-31fd-b896-3fa00fe4d304",
            "name": "Accessibility"
        }
    ]
}
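
As a sketch of that mapping step, the Python below joins a page record's category ids to the names in the taxonomy file. The data is copied from the output above; the helper names build_lookup and resolve_categories are hypothetical, and skipping ids that are missing from the taxonomy file is my choice, not something Merlin prescribes.

```python
import json

# Trimmed copies of the taxonomy file and one page record from the output above.
TAXONOMY_JSON = """
{"data": [
    {"uuid": "205dd6d7-887c-3501-b20a-3a2137437a47", "name": "Technical"},
    {"uuid": "00ba851e-b649-30d3-902a-3a32d230110f", "name": "Development"},
    {"uuid": "8492492a-25c4-3c45-95a1-59c3d6b59620", "name": "Sitecore"},
    {"uuid": "f47d9f1b-dc4f-31fd-b896-3fa00fe4d304", "name": "Accessibility"}
]}
"""

PAGE = {
    "url": "/2023/05/16/getting-to-know-sitecore-search-part-4/",
    "categories": [
        "205dd6d7-887c-3501-b20a-3a2137437a47",
        "00ba851e-b649-30d3-902a-3a32d230110f",
        "8492492a-25c4-3c45-95a1-59c3d6b59620",
    ],
}


def build_lookup(taxonomy):
    """Map each taxonomy uuid to its display name."""
    return {term["uuid"]: term["name"] for term in taxonomy["data"]}


def resolve_categories(page, lookup):
    """Return category names for a page, skipping ids missing from the lookup."""
    return [lookup[uuid] for uuid in page["categories"] if uuid in lookup]


lookup = build_lookup(json.loads(TAXONOMY_JSON))
names = resolve_categories(PAGE, lookup)
print(names)
```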

Conclusion

Merlin is a really neat tool! I’m excited to use it next time I’m on a site migration project. I believe it has the potential to save tons of time migrating content. How happy would content authors be to not have to create each new page and then set the name, display name, title, and metadata manually? The configuration files can take some effort to tweak as you add more fields and adjust to get exactly the right output. The caching feature helps avoid overloading the source website with requests. Once you have an idea of how the field types work, it becomes easier to create new configurations.
