Morningstar has 0 outages in first year of re-platforming

Morningstar.com’s technical team got a 2018 mandate to rebuild the site with the dual goals of extreme speed and extreme stability—within a seven-month timeframe. They seized the opportunity to create the simple, transparent system they’d long desired. 

As Morningstar.com Senior Software Engineer Clay Gregory explains, “Our No. 1 priority was to make a simple system. We are a website that serves editorial content—we serve tools that individual investors use to understand the markets and their own investments. Because we’re not doing things with heavy technical challenges, this mandate gave us the opportunity to get back to the basics and do things right—including getting visibility into how the system operates so that we can make changes rapidly with a full understanding of their impacts.”

At the time, the Morningstar.com application relied primarily on an enterprise content management system, along with backend services hosted in a Java application, to deliver content to users on morningstar.com. That architecture had experienced several disruptive outages and suffered from major drawbacks, including a lack of scalability for spikes in traffic, an inability to quickly pinpoint and debug performance issues, and slow launch velocity for new products and features. Given the non-negotiable requirements of extreme speed and stability, along with the spiky workload patterns inherent to the investment market, AWS Lambda was a natural fit.

Freed from the constraints of their legacy system, the Morningstar.com team chose to re-platform to a serverless architecture with AWS Lambda, API Gateway, S3, and CloudFront connected to existing applications and internal services to provide the speed, efficiency, and cost savings the company needed.

"We were surprised at how easy and affordable it wound up being to move to AWS Lambda,” says Morningstar Senior Software Engineer Zach Erdmann. “Coming from managed services and from this huge deployable enterprise application, it was just a thrill to get to something that provided such fast deployment times. We have a full CI pipeline up, and seeing a change go from Dev to be a production candidate in 20 minutes was huge for us.”

0 outages in 8 months of re-platforming
20 minutes from dev to production with AWS Lambda

Revamping an archeological enterprise stack

Like most large enterprises with existing applications, Morningstar wasn’t starting from a clean slate in terms of architecture. With a mix of existing Java and .NET services and previous business APIs, building a greenfield application within seven months to overhaul the entire legacy system wasn’t an option. Gregory explains, “As a public-facing company with a lot of history, we sometimes keep stuff around because people are still using it—even though it’s not a high enough priority to rebuild immediately. So, you end up with a sort of geological stack of languages and applications that you've got to keep alive. AWS Lambda became the glue that pulls all this stuff together.”

Instead, Gregory and Erdmann worked with their teams to identify and decouple components that could be rebuilt or replaced with AWS Lambda functions. They focused on writing functions that serve as the connective tissue between Morningstar’s existing and net-new services, providing the business logic that ties together the user experience, market and user data, and content.

As part of the new architecture, Morningstar uses CloudFront as the delivery layer in front of several origins: S3 buckets storing static site assets, a CMS-supported corporate marketing site, a variety of internal services, and Lambda functions behind API Gateway that deliver dynamic content and user event data. With a multi-tier cache by default, CloudFront uses regional edge caches to improve latency and reduce the load on origin servers for Morningstar. On the backend, API Gateway serves as a crucial scaling component that allows Morningstar to process and route concurrent API calls to trigger Lambda functions at scale.
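To make the dynamic-content path concrete, here is a minimal sketch of a Lambda function sitting behind API Gateway. It is not Morningstar's code. It assumes API Gateway's Lambda proxy integration, and the route parameter `ticker`, the `_QUOTES` lookup table, and the 60-second cache lifetime are all hypothetical stand-ins chosen for illustration.

```python
import json

# Hypothetical in-memory stand-in for an upstream data service (assumption).
_QUOTES = {"EXMPL": {"ticker": "EXMPL", "name": "Example Fund"}}


def _response(status, payload, max_age=0):
    """Build a response in the shape the Lambda proxy integration expects."""
    cache_control = f"public, max-age={max_age}" if max_age else "no-store"
    return {
        "statusCode": status,
        "headers": {
            "Content-Type": "application/json",
            # A CDN such as CloudFront can honor this header when caching.
            "Cache-Control": cache_control,
        },
        "body": json.dumps(payload),
    }


def handler(event, context):
    """Look up the requested ticker and return it as JSON."""
    # pathParameters may be absent or None when the route has no parameters.
    params = event.get("pathParameters") or {}
    ticker = params.get("ticker", "").upper()
    quote = _QUOTES.get(ticker)
    if quote is None:
        return _response(404, {"error": f"unknown ticker: {ticker}"})
    return _response(200, quote, max_age=60)
```

In a setup like this, successful responses carry a cache lifetime so that the edge can absorb repeated requests, while errors are marked as uncacheable so they are not served again from cache.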


Making monitoring a priority for serverless

While the company had used various monitoring tools in the past to periodically verify that the system was operating and healthy, the process was more of an afterthought. The stack itself was also extremely opaque, meaning no one really understood how the system was operating end-to-end.

Erdmann explains, “When we were on the previous solution, we had monitoring, but it was secondary. It wasn't part of our practice, and it certainly wasn't intentional architecture. So, when we began re-platforming Morningstar.com, we knew that we wanted to instrument everything so that we could understand our application from top to bottom.” As the front door to the company’s offerings, Morningstar.com consumes data, tools, and services not just from other parts of Morningstar, but also from third-party services—which means that although a problem might be surfaced by a Morningstar.com user, it could originate much further upstream.

“New Relic was attractive to us because organizationally we have a very mixed environment,” says Gregory. “Our team is on the cloud, but not all parts of the company are, and we need that institutional knowledge. For that reason, it’s very important that we can plug our monitoring into—and get visibility out of—whatever kind of platform our colleagues and partners are using.”

Using New Relic, the Morningstar.com team can visualize, trace, and alert on their Lambda functions, and drill down to the invocation level for fast debugging when something isn’t behaving as expected. However, with an environment that includes a mix of applications, runtimes, and business APIs, Morningstar.com needed visibility across the entire ecosystem—not just the serverless components.
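As a rough illustration of what invocation-level visibility involves, the sketch below wraps a Lambda-style handler so that every invocation emits one structured record with its outcome and duration. This is a generic pattern, not New Relic's agent or API. The decorator name `instrumented` and the record fields are assumptions made for this example, and a monitoring product would collect far more than this.

```python
import functools
import json
import time


def instrumented(emit=print):
    """Wrap a Lambda-style handler and emit one JSON record per invocation.

    `emit` is a hypothetical sink; by default the record goes to stdout.
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(event, context):
            start = time.perf_counter()
            record = {
                "function": fn.__name__,
                # The Lambda context object exposes aws_request_id;
                # fall back to None when no real context is supplied.
                "request_id": getattr(context, "aws_request_id", None),
                "outcome": "ok",
            }
            try:
                return fn(event, context)
            except Exception as exc:
                record["outcome"] = "error"
                record["error_type"] = type(exc).__name__
                raise  # never swallow the failure, only record it
            finally:
                elapsed = time.perf_counter() - start
                record["duration_ms"] = round(elapsed * 1000, 3)
                emit(json.dumps(record))
        return wrapper
    return decorator
```

Because each record carries a request ID and an error type, a spike in latency or failures can be traced back to individual invocations rather than to an aggregate average.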

“One of the huge successes of this re-platforming—and one of the reasons we’re doubling down on our relationship with New Relic company wide—is that now we have a big monitor displaying everything that's happening at Morningstar.com,” says Erdmann. “If there’s a blip in performance, we can tell within seconds where it’s coming from—a huge turnaround from a year-and-a-half ago when it would have taken days, if not weeks, to determine the root of the problem. Now we can tell immediately when the ground starts to shift.”

Eliminating downtime

In addition to drastically reducing the time it takes to pinpoint and resolve issues, Morningstar.com has been able to nearly eliminate downtime for the site. While Morningstar.com used to experience an average of one major outage per month, the site has experienced zero outages in the eight months since re-platforming.

“Each month, we hold an enterprise-wide incident review session, where the leaders of every team within the company present and review the major outages they’ve experienced over the last 30 days,” says Gregory. “Since re-platforming Morningstar.com and deploying New Relic throughout, not only have we not had any major incidents to report, but that very fact has become newsworthy, with our CTO actually calling out our success.”

It’s not as if Morningstar.com never has issues, says Gregory. It’s just that now his team can spot and resolve them (or reroute them to the proper team for resolution) long before they impact users or bring down the site. Being able to depend on this type of monitoring has freed up the Morningstar.com engineering team to release new features faster and to focus on improving users’ experience on the site.

“We are a small team,” says Gregory. “And we can’t shirk on our duties to deliver product features. Thus, the less time we spend solving engineering problems, the more time we’re able to focus on our primary function, which is delivering products and meaningful results to users. We're not in the business of maintaining infrastructure, and with AWS and New Relic monitoring, we ensure that this remains the case.”

Releasing features at 5 p.m.

Confident that they can detect and fix issues before users are affected, and backed by a new CI/CD pipeline, the Morningstar.com team has considerably reduced launch stress and increased developer velocity.

“Before we re-platformed and deployed New Relic monitoring throughout the site, QA would spend a week reviewing code changes before we would timidly push a release out the door,” says Gregory. “And even then, we would only do so between 9-11 p.m. when nobody was on the website. Contrast that with the present, when we release at 5 p.m. and go home 10 minutes later because we’re confident that nothing will go wrong, and you can see how far we’ve come.”

In the eight months that the re-platformed site has been live, the team hasn’t had to roll back any deployments. Instead, says Gregory, “If something is going haywire, we’re going to roll forward. We're going to trust our process. We’re going to trust our monitoring. And we're going to restore service by fixing the problem, not by rolling back. Thanks to New Relic, we can move forward through a problem rather than backward, and that feels really good.”

Driving a new culture of observability

For both Gregory and Erdmann, one of the most surprising and gratifying aspects of the re-platforming and instrumented monitoring has been an increase in cross-organizational communication between engineering teams and business teams. With customer and usage-related metrics included in their application monitoring, business team members are increasingly interested in how application health relates to business health. Explains Erdmann, “Our New Relic monitoring creates a place where it's not just the engineers who go look at the numbers; everybody does. The dashboard has become a thing that everybody watches and benefits from.”