If you are a NuGet.org user, you may have noticed that stability has recently been an issue. NuGet is crucial to .NET development, and we collectively lose our minds when it isn't available. In a recent post, the NuGet team shed some light on the issues:
For the last few weeks we have experienced a few hours of downtime a few times a week... Every few days/weeks we would see the search service become unusable, each of the servers running the service will start failing to a point that the machine was not accessible anymore, but will recover within a short few minutes. Unfortunately once in a while that will happen in a loop, and we will see our machines recycle.
What a wicked game to play, to make me feel this way. What a wicked thing to do, to make me dream of you.
In the post mentioned above, we get a high-level perspective of the architecture running our beloved NuGet. I decided to recreate it here.
You may notice the current infrastructure is dependent on Lucene. While you can manage your Lucene search indexes directly using [Lucene.NET](https://www.nuget.org/packages/Lucene.Net/), I wouldn't recommend it, especially since the backend in this case is being written to Azure blob storage, which isn't designed to be a quick and responsive read/write store. On top of all that, failover seems to be a custom-rolled solution by the NuGet team.
We made a change and now we will actively work with both data centers regularly, preferring the healthier search services. This was deployed on June 4th, which worked great in the June 5th outage, well until the other region went down as well.
I'm sure there is more to the NuGet architecture than I am privy to, but I would bet many of the development decisions are designed and built around Lucene's low-level APIs.
I made the following comment on the blog post:
Seems like Lucene into Azure Storage is your bottleneck / unstable bit. Why not move to a service that is still Lucene based but more scalable. I'm thinking ElasticSearch or SOLR. That way you wouldn't have to manage Lucene indexes at such a low level, get the benefits you've learned from setting up your Lucene Analyzers, and scale more easily.
-- Khalid Abuhakmeh
Yishai Galatzer responded to my observation with the following:
it is something we are looking at once things stabilize a bit more. We are trying to make small stabilizing changes rather than large disruptive changes at the moment. I personally don't think Lucene is the only issue, and making large changes tend to take longer and obscure other issues that we can address in the short term.
We still see many gains to be made with small incremental changes that we hope will move NuGet.org to a stable and more responsive place. We will then look at more fundamental improvements.
-- Yishai Galatzer
That makes complete sense, and I respect the decision. That said, it got me thinking about how I would migrate to a more understandable and stable architecture.
Step 1 : Parallel Experiment
Since the team is invested in Lucene as a search mechanism, I would look at the other Lucene-based search products in the wild. The latest search hotness (sizzle noises) is [Elasticsearch](https://www.elastic.co/). It is a REST wrapper around Lucene, built with friendliness in mind.
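To give a feel for that REST friendliness, here is a minimal sketch of what a package search request body could look like. The index name, field names, and boost values are my own assumptions for illustration; they are not NuGet.org's actual schema.

```python
import json

def build_search_body(term, page=0, page_size=30):
    """Build the JSON body for a hypothetical GET /packages/_search.

    Uses a multi_match query; "id^2" boosts matches on the package id
    over matches in the description or tags.
    """
    return {
        "from": page * page_size,
        "size": page_size,
        "query": {
            "multi_match": {
                "query": term,
                "fields": ["id^2", "description", "tags"],
            }
        },
    }

print(json.dumps(build_search_body("json serializer"), indent=2))
```

Compare that to hand-managing Lucene index readers and writers: the whole search interaction collapses into plain JSON over HTTP.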
Seeing as the NuGet infrastructure already exists, we should use what we have. Let's look at our first round of changes, which are really more like additions.
We leave our existing infrastructure intact, but add jobs that push the same data that goes to our Lucene indexes into Elasticsearch as well. Additionally, we set up a fork of NuGet that uses Elasticsearch instead of Lucene. We can use Elasticsearch's clustering to help distribute the load on search, something it was designed to do from the start.
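A rough sketch of what that dual-write job might do: for each package that gets fed to the Lucene indexes, also build a PUT request against Elasticsearch's document API. The host, index name ("packages"), and document shape are illustrative assumptions on my part.

```python
import json

# Assumed Elasticsearch endpoint; in reality this would point at the cluster.
ES_URL = "http://localhost:9200"

def build_index_request(package):
    """Build the URL and JSON payload for indexing one package document.

    Keying the document by id + version makes the PUT idempotent, so the
    job can safely re-run over the same feed after a failure.
    """
    doc_id = f"{package['id']}.{package['version']}"
    url = f"{ES_URL}/packages/_doc/{doc_id}"
    return url, json.dumps(package)

url, payload = build_index_request(
    {"id": "Newtonsoft.Json", "version": "6.0.3", "tags": ["json"]}
)
# The real job would PUT `payload` to `url` with any HTTP client.
print(url)
```

Because writes are plain HTTP, the job needs no embedded Lucene library at all, and Elasticsearch handles persisting and replicating the index itself.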
We now have two versions of NuGet.org and let users choose to join the "Beta".
Step 2 : Stabilization
As the new "Beta" stabilizes from community testing, we can see where we might want to make the official switch. We also think about what we might want to eliminate.
Four resources are now slated to be removed from our infrastructure.
Step 3 : The "New" Infrastructure
Our new version of NuGet.org is now out of "Beta" and in the wild. We have simplified our architecture and lean on our technology choices. Here is what our simplified architecture now looks like.
I want to note that we would use Elasticsearch's built-in replication and clustering mechanisms to handle our search backend. While the graph shows one node, I would assume this cluster could have many instances. I also noticed in the original architecture that count reports are precomputed by a job in the VM. We would no longer need to do that in the new architecture, because we can utilize the faceting already present in Elasticsearch.
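For instance, the count-report job could be replaced by a terms aggregation (Elasticsearch's successor to facets), computed at query time instead of in a batch. The field name here is an assumption for illustration, not the real NuGet schema.

```python
import json

def build_count_report_body():
    """Build a request body that counts packages per target framework.

    "size": 0 skips returning the matching documents themselves; we only
    want the per-term buckets that Elasticsearch computes on the fly.
    """
    return {
        "size": 0,
        "aggs": {
            "per_framework": {
                "terms": {"field": "targetFramework", "size": 50}
            }
        },
    }

print(json.dumps(build_count_report_body(), indent=2))
```

POSTing that body to the search endpoint returns the counts directly, so there is no precomputation job to schedule, monitor, or recover.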
NuGet.org was designed to be hosted locally in addition to being hosted in Azure. That may be the reasoning behind using an embeddable library like Lucene: the team wanted anyone to be able to run their own NuGet instance. At this point, though, the team has a greater obligation to the community to produce a more stable centralized NuGet.org. My proposed infrastructure doesn't exclude self-hosting scenarios, but it does require a little more lift from those looking to run their own package repository.

I've also made the effort to propose a solution that can be built in parallel to the existing NuGet.org architecture, while purposely taking advantage of the design and technology decisions already present in the preexisting solution. With that in mind, I am a fan of the NuGet team, and I am really just brainstorming with high-level knowledge of their infrastructure. There is a good chance that I have it all wrong :).