OpenTelemetry & Nomad with Luiz Aoqui of HashiCorp


About our guest:

Luiz is a Toronto-based senior software engineer at HashiCorp working with distributed systems on the Nomad workload orchestrator. Before that, Luiz was a full-stack and DevOps engineer at IBM, where he led a team that built and managed a SaaS e-learning platform.


Transcript:

ADRIANA: Welcome to On-Call Me Maybe. I am your host, Adriana Villela, joined by...

TED: @tedsuo on the internet, Ted in real life.

ADRIANA: Awesome. Today we have...

LUIZ: Hi. I'm Luiz.

ADRIANA: So, Luiz, tell us a little bit about yourself.

LUIZ: Sure. My name is Luiz. I'm an engineer at HashiCorp working on a project called Nomad, which is our orchestrator solution. Yeah, I've been working on the project for almost three years now. And before that, I was in that developer operations space, meaning that my team was not large enough to have an ops team, so all developers had to do a little bit of [chuckles] operations and everything. So yeah, that's how I got involved in this space. And then eventually, I got the opportunity to work at HashiCorp on its tools.

ADRIANA: Cool. That's awesome. And it's funny how you and I met because I think we met on Twitter [chuckles]; if I'm not mistaken, though, it was my post about my explorations of Nomad...because last year I was a total Nomad noob. I was at Tucows last year running a HashiCorp team, all things HashiCorp, just about. So I was like, oh, shoot, I better learn how this stuff works. I guess that's how we met. Now we follow each other on Twitter, which is awesome. And I guess we also have...there's the additional HashiCorp connection because I think, Ted, you said you worked at HashiCorp at some point, right?

TED: Ah, I did not actually work at HashiCorp. But when I was interviewing, when I was looking for my last job, it came down to either HashiCorp or Lightstep. Both were really interesting to me. I like the idea of bootstrapping up the OpenTracing project, which is why I went with Lightstep. But I've always enjoyed HashiCorp's approach to engineering and product development.

And Nomad actually was the project I was most interested in because, in my last job, I was working on container scheduling at Pivotal on a project called Cloud Foundry. So I really enjoyed the domain space of scheduling and building that part of a distributed operating system. I thought it was really cool.

ADRIANA: Yeah, and I have to say, coming from a Kubernetes background and being thrust into Nomad, I was like, oh, man, this is so much easier to get started. [laughs] My mind was blown right away. It was awesome. It was awesome. I can understand why HashiCorp has such a huge fan base; like, people fan over this stuff big time. [laughs]

LUIZ: Yeah, it was interesting coming from...because when I joined, my first contact with HashiCorp was initially Vagrant. I think everyone goes through the Vagrant stage and then Terraform. And I only learned about Nomad when I interviewed. So I didn't know about it before. I used to work at IBM, and my team was using Rancher at the time, Rancher 1.x, so before the Kubernetes migration.

Afterward, when Rancher 2.0 came out, everything was Kubernetes. So we were like, oh, we might as well use Kubernetes. And then IBM bought Red Hat, which meant that everything became OpenShift. So we had to migrate for a third time. So I kind of dabbled in all the different tools out there. And at the time, we were using Terraform for basically everything in our infrastructure. And the most fun part of my day kind of became just playing with Terraform, to the point of, okay, maybe I should be working for them since that's where I have the most fun.

But during the interview process, it was kind of a generic position, like a generic systems engineer role, not for a specific team or product. And then, during the interview process, the Nomad team liked my background and picked me. I was like, okay, let's learn about what Nomad is. The first time, I just got started, like, nomad agent -dev, and then you have an environment up and running, ready to use. I'm like, okay, that's very different from what I'm used to.

ADRIANA: Yeah, I know. That was kind of my, oh my God, it's the same binary that does everything? What?

TED: Luiz, would you mind, since the audience may also not be super familiar with Nomad, briefly describing what it does and giving a basic architectural overview?

LUIZ: Yeah, sure. That makes sense. So Nomad is a workload orchestrator. So what that means is that it will grab any sort of tasks that you may have, that you want to do, and any type of infrastructure that you have, and it's going to distribute and schedule those tasks into the cloud. So it will kind of, in some sense, abstract away your cluster.

You have thousands of machines running. You don't actually care where things run. You just give it a specification, and Nomad will figure out the best place to run and to keep running. So if a machine dies, Nomad will reschedule things and make sure that your specification is always real and it's always as you defined. That's the role of the orchestrator.

I guess where Nomad is different is that Nomad is very focused on that task. If you come from a Kubernetes background, you may be aware that Kubernetes does that, plus a lot of other stuff. So it's more feature-complete but also more complex in that sense. There's more to learn about. There's more to understand. And there's a lot more going on when you're just like, I just need a container running; I don't care where or how. So Nomad is very focused on that one task, one job of scheduling things onto other things.

Another difference is that since it's focused on the scheduling part, it is generic in terms of the workload. So nowadays, the most common workloads are containers, so you can run containers everywhere, but Nomad is not restricted to only that; you can run JAR files; you can run QEMU VMs. You can run Podman containers, whatever.

So Nomad has this flexibility in terms of what you want to run and what type of workloads as well, so you can run batch jobs, or services that are always running, or dispatch jobs. So there are all sorts of different use cases that you can handle with Nomad. They are sort of built into the core of Nomad. So there's no need for external tools or extra coordination to support things like rolling upgrades, or blue/green deployments, or things like that.
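
For readers who want to see the shape of this, here is a minimal sketch, assuming a local dev agent (nomad agent -dev) and using Nomad's official Go API client, github.com/hashicorp/nomad/api, of registering a job that runs one Docker container; the job, group, task, and image names are placeholders:

    package main

    import (
        "fmt"
        "log"

        "github.com/hashicorp/nomad/api"
    )

    func main() {
        // Connect to the local agent (http://127.0.0.1:4646 by default).
        client, err := api.NewClient(api.DefaultConfig())
        if err != nil {
            log.Fatal(err)
        }

        // Nomad's job > group > task hierarchy: here, one task running a
        // Docker container.
        task := api.NewTask("cache", "docker")
        task.SetConfig("image", "redis:7")

        group := api.NewTaskGroup("cache", 1)
        group.AddTask(task)

        job := api.NewServiceJob("example", "example", "global", 50)
        job.AddTaskGroup(group)

        // Register the job; Nomad decides where it runs and keeps it running.
        resp, _, err := client.Jobs().Register(job, nil)
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println("evaluation ID:", resp.EvalID)
    }

Swapping the driver string from docker to exec, java, or qemu is the workload flexibility being described; the job structure stays the same.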

ADRIANA: Cool. One of the reasons why we asked you to join us today is that, I guess, a few weeks ago, we were chatting, and you mentioned that you were looking at the possibility of instrumenting the Nomad code with OpenTelemetry, which totally piqued my interest. And as OpenTelemetry lovers, we're like, yay, this is great. So why don't you tell us a little bit more about that aspect that you've been exploring?

LUIZ: Sure, yeah. It has been an area of interest for a long time. As I mentioned earlier, I used to be this developer that does operations as well. And at the time, I didn't have a good sense of what it means to instrument an application, what it means to monitor things.

My manager did a very good job of getting us large screen TVs with great dashboards and all of that. But it was always like, it doesn't matter how many metrics we have, there's always a problem happening. And every time there was a problem, we didn't know what to do exactly. [chuckles] It was always like, we had the information that we had, but that was never enough to actually solve problems.

So now, looking back and then learning about OpenTelemetry and this idea of observability and what it means to have an observable system, it all resonates with me very well because that's the stuff that I wish I had before, and the way things should have been done in the past to understand what's going on when an outage happens and causes some problems, and things like that.

And so now, switching back from this developer operator perspective to this tool builder perspective, I was looking into ways to make past Luiz’s life easier, like, what the tools that I'm developing today could have done to make my life better in the past.

And I think observability is one of the major things that we can improve. Because, as I mentioned in the description, there is this promise of, like, oh, I don't care how my container runs; I just want it to run. And that works 99% of the time, but that 1% when it doesn't work, the tool is completely opaque to you. You can have logs and metrics, but that's not enough to really understand what's going on and what's wrong.

So that's where I started becoming interested in this space: how can I make Nomad more transparent and more understandable to people that are using it? And kind of give that internal view of, like, okay, this action triggered these internal operations that generated these internal objects that eventually become your container.

And so this piece of observability, OpenTelemetry, all of that fits very well into the narrative. When we talk about Nomad users, we usually talk about two different personas. So we have the developer persona, which is the group of people that are writing code. They're generating, let's say, a Docker image, and they want to run the Docker image somewhere. And then there's the operator persona, which are the people that are managing the infrastructure, starting the VMs, installing stuff. They're more on the infrastructure-management part of things.

When looking into making Nomad more observable, I started looking at these two different personas. What can you offer each of them? So what the developer cares about, what an operator cares about. And so I started looking into what I can provide to each of them. And then it all comes back to telemetry and what kind of data is more relevant. And so, yeah, from the user perspective, I want to allow them to understand better what's happening when they run some command or do some operation.

But also, a little bit more selfishly, I wanted to make my own life easier. So when people file bugs, or there's a support ticket, you're always in this situation where one side of the issue is able to collect data but doesn't necessarily know what data to collect. And then there's the other side of the issue, which is us: we know what data we need, but we don't have the means to collect it. So there's always this back and forth of, look at this metric, what does it say? Give me this log; what kind of information is there?

So my hope is that having this common language of telemetry and traces and spans and all of that will give us a more unified conversation and just make our lives easier in terms of supporting our users, and also reduce their load as well. Like, if people understand what's going on, they may not need as much from us to explain what's happening when solving issues. So there's the user side of things, and there's also sort of the selfish, making-our-lives-easier perspective as well.

ADRIANA: I love that, yeah, because there's nothing more frustrating than getting a user ticket where they're like, blah, blah, blah, it's not working. You're like, oh my God, I don't even know where to start. It's this terrifying moment.

TED: It's really funny. What you're describing, Luiz, is actually specifically what got me into distributed tracing, and OpenTracing, and OpenTelemetry, and all of that: having other people operate my software, which in this case happened to be literally the same kind of software, a scheduling system that they're running workloads on. And then they're saying there's a problem.

And we need data from them, and the data that we could get without distributed tracing was just kind of a nightmare to dig through because it's all or nothing. I can't be like, just give me these logs; there's something there. It's just like, well, give me a dump of everything off of all 200 machines that you're running, and I'll pore through it over here. And that was just really, really tedious.

LUIZ: Yeah, there are a lot of ugly, bad scripts, lots of jq happening to try to correlate all the different logs and all the different metrics from different machines that are in different time zones. So it's just sort of this mess. And living in this space of building tools, where you don't have a SaaS product or control over the environment things are running in, is really challenging because we know nothing about where they're running. We know nothing about their environment. We know nothing about how they're running.

So we need to keep probing and asking for more information. And then every question generates five other questions just because, in this space where you're shipping tools, you may not necessarily have all the information that you need. We actually purposely don't collect any data from our users. So yeah, that's definitely challenging. And getting the right information that we need to help takes a lot of back and forth.

ADRIANA: Yeah, and it's challenging, too, because sometimes you get the bug report of, like, this isn't working. But tell me more, what specifically is not working? You have to keep digging and digging and digging. And the more information you have, the better; if you have that information in the form of telemetry data, even more awesome, because it makes your life easier.

But I also like the point that you made about instrumenting Nomad to cater to both the developer who's deploying their containers to Nomad and having to manage their workloads and the person who's actually operating a Nomad cluster. Because I've been in a position where, on my former team, something would go down, [laughs] and it's like the mad dash. And time is ticking.

Because if something's gone down and it's part of critical infrastructure, [laughs] you really want to get to the bottom of that as quickly as possible and hopefully avoid executives breathing down your neck. And so if you have that kind of information, it just makes life so much easier in general.

LUIZ: Yeah, when you need the data the most is usually the most critical time because, like, your system is down. You're having customers asking, "What's going on?" So that's the moment that you need to be the most precise but also the hardest moment to find data because usually, the system is not behaving as expected. So with only logs and metrics to guide you, you rely on your past self to have made good decisions about what to log and when, which doesn't always work well.

And sometimes, we have to ship a new custom binary for a customer with only one line of code changed to log something more specific. And then it's like, deploy this binary, and let's see what happens. So these situations where you have no idea what's going on and need to get more log lines are not a great place to be. Because it's extremely time-consuming to deploy a new binary, wait for the problem to happen again, and hope that that log line will give us some new information that will answer the question. So yeah, this traditional way of doing monitoring is pretty limited.

TED: So, Luiz, I'm curious. I feel like there are probably two basic scenarios that come out of having to observe Nomad and deal with the problem; one would be Nomad has some bug or is scheduling things in a way that isn't what's expected. The other would be that the workload itself has a problem. There's something wrong with the workload, or the way that workload has been configured in Nomad or the way it's been resourced might be incorrect. And so it's thrashing or having some problem just due to the nature of its environment. And I'm wondering how would someone start down that path of trying to understand which situation they're in?

LUIZ: Yeah, so definitely, there are two situations. And normally, you would start with the general health. So when you want to run something, you write what's called a job file. In that job file, you describe how your application is organized. So we have a hierarchy: the job, the group, and the task. And the task is, like, the lowest-level unit. So if you have a container, your container is a task.

And then, Nomad creates these things called allocations, which are sets of tasks. So you can have two containers in the same allocation. So normally, when you have a problem with what you're running, like your application has a problem, it's not starting, the allocation is going to be unhealthy. And Nomad will tell you; like, you have this concept of a deployment, so your deployment is going to fail because your allocation is never going to become healthy.

In those situations, you want to look at the application logs and at the task events. When your allocation is starting, Nomad does a lot of operations beforehand. So it downloads files, it downloads the images. It will mount volumes. All of these are events that can happen in your task. And if you have something wrong in your job file, you probably are not even going to start the container. You're going to see Nomad failing to start the application. And then there's an event saying, "Oh, your image doesn't exist," for example. So the first place to look is the task events.

And then if the task actually starts and it runs, but it dies very quickly, that's where you look at the application logs, the task logs, and things like that to see why your container is starting but dying off very quickly. So that's usually a bug in the application, or you forgot to point to your database or something like that. So those are application-specific errors. That's normally where you look, and those are usually caused by application bugs.
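
As a rough illustration of the task events being described, here is a hedged sketch using the same Go API client; the allocation ID is a placeholder you would take from nomad job status:

    package main

    import (
        "fmt"
        "log"

        "github.com/hashicorp/nomad/api"
    )

    func main() {
        client, err := api.NewClient(api.DefaultConfig())
        if err != nil {
            log.Fatal(err)
        }

        // Look up one allocation; "a1b2c3d4" is a placeholder ID.
        alloc, _, err := client.Allocations().Info("a1b2c3d4", nil)
        if err != nil {
            log.Fatal(err)
        }

        // Each task carries a state plus the events Nomad emitted while
        // setting it up: pulling the image, mounting volumes, start errors.
        for name, state := range alloc.TaskStates {
            fmt.Printf("task %s: state=%s failed=%v\n", name, state.State, state.Failed)
            for _, ev := range state.Events {
                fmt.Printf("  %s: %s\n", ev.Type, ev.DisplayMessage)
            }
        }
    }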

For Nomad bugs, you usually see your deployments not making progress. So you want to run something, but it just keeps spinning. Nomad never actually creates those allocations, never actually starts these containers. Or you want to run something, and then Nomad just says, "Okay," and then it never does anything. So normally, when nothing happens, that's where there's something wrong with Nomad itself. And those are the situations where you need more observability to understand what's happening.

Right now, the best place to look is the logs, both on the servers and on the clients...so when you have a Nomad cluster, there are two sorts of roles that your machines are assuming. They're either a server or a client. The servers are the machines that hold the global state. The server knows all the details about what's running, where, and how many machines you have in your cluster, and all of that metadata of your infrastructure.

And then the clients are the machines that are actually running workloads. They are starting containers. They're running binaries. So they are the things that run whatever you need to run. And so when something's wrong, you need to look at those two sides of things. So normally, the servers are going to pick where to run stuff. So that's where you go to if your application is never starting. Like, you need to look at the server logs to see, okay, did the servers receive the request? Were the servers able to pick a client to run? And these sorts of things.

Once the server picks a client to run on, then the clients will talk to the servers to figure out what state they're supposed to be in. And that's where they will start, you know; they might see a request, and it goes, okay, I need to run a container with this image, with this much resource, and whatever. And then they will start from that specification.

So if your workload has been scheduled but not running, then you might want to look at the client logs to see why it didn't run. So it could be something like your Docker daemon is misconfigured, or you forgot to pass the Docker Hub credential or something like that. So that will be on the client side of the equation. So that's the client logs that you will look for to get more information.
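
For the "deployment keeps spinning" case mentioned above, the servers attach placement failures to the evaluation itself. A small sketch, again with the Go client, of listing evaluations and printing those failures; the output formatting is illustrative:

    package main

    import (
        "fmt"
        "log"

        "github.com/hashicorp/nomad/api"
    )

    func main() {
        client, err := api.NewClient(api.DefaultConfig())
        if err != nil {
            log.Fatal(err)
        }

        // FailedTGAllocs records, per task group, why placement failed
        // (e.g. no client with enough resources or a matching constraint).
        evals, _, err := client.Evaluations().List(nil)
        if err != nil {
            log.Fatal(err)
        }
        for _, ev := range evals {
            for group, metrics := range ev.FailedTGAllocs {
                fmt.Printf("eval %s, group %s: evaluated=%d exhausted=%d constraints=%v\n",
                    ev.ID, group, metrics.NodesEvaluated, metrics.NodesExhausted,
                    metrics.ConstraintFiltered)
            }
        }
    }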

ADRIANA: So you're basically doing a lot of log chasing.

LUIZ: Exactly, yeah. So there are all these different places that you need to look depending on the symptoms that you're observing. So from the outside, you don't know what's happening. You just know what you're observing, like, my application is not running, and that can mean several things. So you need to drill down, and instead of figuring out the problem, you need to figure out what's not the problem first.

So it’s like you're excluding options rather than finding the exact problem. It's very time-consuming. It's like, okay, is my job file correct? Is my client configured correctly? There's all of this checklist that I have to go through, which, again, is very time-consuming and hard to do.

TED: Yeah, that's the crazy life of operating these complicated systems. So, Luiz, since we've got another container scheduling nerd on the call, I'm curious, does Nomad schedule stateful workloads?

LUIZ: Yeah. So we also have the capability of doing stateful workloads. We have two types of volumes that you can use. One of them is called host volumes, which is basically just a path on the machine that you can set in the client configuration. So that will just be files that are going to be written to the disk on that specific machine. That's more for when you have a database and you want to have good performance. So you have a big disk on that specific machine, and you create a host volume.

And by configuring the client that way, Nomad is aware of where the volume is located. So on your job, you don't actually have to specify run on that machine because it has a big disk. You just say I need this volume, and Nomad will make sure that your database only runs on that machine that has the big disk, the fast disk. So it's sort of metadata-driven scheduling.

The other option that we have is CSI, the Container Storage Interface that is used in Kubernetes, but it's a generic specification. So Nomad also implements the same spec "in theory," and I say that in air quotes. In theory, you could use the same plugins that you would use in a Kubernetes cluster to run and create volumes in Nomad as well. And it works the same way. So you have to run the CSI plugin.

And then we have two commands. So one is volume create, so you can create volumes on the fly. If you already have a volume, you can register it in Nomad, and then in your job, you just specify, I need this volume. And then, you can specify things like the size of the disk, their capabilities, what they call topologies, like the location of the disk. For all the things that CSI does, Nomad will follow the spec.

And I put "any plugin will work" in quotes because some plugins are kind of Kubernetes-specific in their implementation. They talk to the Kubernetes API, so it's not 100%. If the plugin doesn't follow the spec to the T, it may not work with Nomad. But we try to implement the interface, and then the plugins are outside of our control. But we do have that capability as well.
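
To make the "I just say I need this volume" idea concrete, here is a hedged sketch of a group-level volume request and a task-level mount via the Go client; the volume name, source, image, and mount path are placeholders, and the field names are assumed from the api package:

    package main

    import "github.com/hashicorp/nomad/api"

    // ptr is a small helper because the mount fields are pointers.
    func ptr[T any](v T) *T { return &v }

    func main() {
        // Group-level request: "I need this volume." Nomad only places
        // this group on a client that can satisfy it.
        group := api.NewTaskGroup("db", 1)
        group.Volumes = map[string]*api.VolumeRequest{
            "data": {
                Name:   "data",
                Type:   "host", // "csi" for a CSI-managed volume instead
                Source: "fast-disk",
            },
        }

        // Task-level mount of that volume into the container.
        task := api.NewTask("postgres", "docker")
        task.SetConfig("image", "postgres:16")
        task.VolumeMounts = []*api.VolumeMount{{
            Volume:      ptr("data"),
            Destination: ptr("/var/lib/postgresql/data"),
        }}
        group.AddTask(task)

        _ = group // attach to a job and register it as in the earlier sketch
    }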

TED: Cool. That was actually...creating CSI was the last thing I did in that space before I flew away to observability land. But I'm curious, so one thing that came up looking at stateful scheduling is one, I love the way you're doing it by having it be resource-driven where you just say I need these things, and the scheduler figures it out.

The more we can go towards plug-and-play like a dynamic linking driver model, the way desktop operating systems do it, the better. I think that's one of the things that's hard about scheduling containers these days is you end up with Turing complete YAML pre-processing just to resolve all of your dependencies. So I'm glad to see it going that way.

But one aspect of that that I always found intriguing was there are actually two layers of scheduling. So you have your container scheduling system, or whatever it is that Nomad is controlling. But if you are running containers, those containers then are running on top of virtual machines that themselves have to be scheduled and replaced for various reasons, like the operating system needs to be upgraded or something like that.

And if you're running stateful workloads in particular, but, I mean, in general, just if you're going to roll those virtual machines, it seems like that would work better if there was some understanding between the two schedulers as to which machines to roll so that you don't end up taking down too many nodes in a consistent database system, for example, or taking too many ephemeral apps offline at the same time. And I'm curious, have you guys dug into that problem yet over at Nomad?

LUIZ: We haven't done anything specific in that sense. We do have this notion of rolling upgrades. So when you update, we don't necessarily update everything at once. You can control how fast or how slow you want to go with the upgrades. And then you can also have what we call the canary deployments. So you create a copy of your application, and then you pass that, and then you approve or reject that upgrade.

For stateful workloads specifically, we don't have any particular functionality around that. So it's up to you to understand what type of data is running where. That's where host volumes sort of become tricky because now you have to do that accounting of, like, okay, this machine has this type of data. But the sort of, quote, unquote "simplified" approach of host volumes allows you to set up your storage in any way that you like. So, for example, you could have an NFS-backed host volume. So even though to Nomad it looks like a path on that machine, it's actually backed by network storage or something of that sort.

So it becomes simpler in that sense of, as long as the volume is mounted on your clients, Nomad will be able to write to that place. So it gives you flexibility, especially on-prem, where you may not have things like EBS, these sorts of dynamically provisioned volumes, because it's much simpler to manage. Because then it's the way you have been doing things so far: NFS, Gluster, Ceph, whatever storage you may need. And then for Nomad, you just give it the path where the volume is mounted, and Nomad does the proper thing. So those are the approaches we take.
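
As a side note on the rolling upgrades and canaries mentioned a moment ago, here is a short sketch of what those controls look like on a task group through the Go client; the numbers are illustrative, not defaults:

    package main

    import "github.com/hashicorp/nomad/api"

    func ptr[T any](v T) *T { return &v }

    func main() {
        group := api.NewTaskGroup("web", 3)
        group.Update = &api.UpdateStrategy{
            MaxParallel: ptr(1),     // roll one allocation at a time
            Canary:      ptr(1),     // start one canary copy alongside the old version
            AutoPromote: ptr(false), // wait for an explicit promote (or reject)
        }
        _ = group
    }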

And I think it's the same for networking. So when people talk about the complexity of Kubernetes, you always have to define complexity because it means different things to different people in different situations. To me, the complexity of Kubernetes comes just in terms of the amount of concepts that you need to learn to get started. You need to learn what a pod is. You need to learn what a deployment is. You need to learn what a volume claim is. And there are all these different sets of concepts that are interlinked and intertwined, and you need to work with them.

And the pod is probably the first one that people encounter, like, what's a pod? And then, you need to read all this documentation. And the pod has a virtual IP, like, it sort of cascades. But in Nomad, we try to take a simpler approach, like, your machine has ports. You ask for a port, we'll give you a port on that machine IP, and that's it. We don't try to create complexity if we don't have to; even though it might simplify some stuff, it will create this mental load and more concepts. So we try to avoid that.

And that's where the host volume comes in. It's like, Nomad doesn't care what you're doing. As long as the path that you give Nomad has data in some shape or form, that's what Nomad will use. So that's the general approach that we take. But also, we understand that CSI has a lot of benefits as well, especially in the cloud. And again, in the networking space, CNI has a lot of benefits as well. So we do support those, but we try to focus on the simplicity-first approach to things.

ADRIANA: And I think that's what makes the learning curve for Nomad so nice in general. Like for me, because I was already familiar with all these complex Kubernetes concepts, it was so much easier to translate them in my head to Nomad land, but in some ways, also because it's got fewer moving parts, it was simpler in my brain, right?

TED: Yeah, I'm curious if that...so one thing I've always felt about these systems, especially Kubernetes but any of these container systems, is they're recreating the control plane that exists at a level beneath them. So you have all of the networking that you're setting up with your cloud provider, for example, and then you're deploying Kubernetes on top of that network, and then you're setting up networking at the Kubernetes layer. And so I'm curious, in Nomad, is part of the simplicity just that you're leveraging those lower-level building blocks? Is that where it comes from?

LUIZ: Yeah, we try to do as little work as possible. In the networking space, you can go set up whatever CNI plugin you want. And then that's on you to make sure that your setup is correct, but you don't have to if you don't want. If you only have machines with IPs, we'll give you a port, and then you can use that port to communicate with your application.

And then we focus back on our goal, which is to orchestrate workloads; it's not to create a service mesh. It is not to create these complex multi-concept systems. We just want to run stuff on machines, and then how you configure your underlying storage or underlying networking is more or less up to you. We only care about: give us a set of machines that can talk to each other, and things will more or less just work because we use these underlying concepts that are already there.

So we try to avoid adding complexity if we can, if there are already tools that do the job. So, for example, when you run a binary in Nomad on a Linux system, we isolate the binary using all the tools that the Linux kernel provides, like chroots and cgroups and network namespaces. So we try to leverage as much of what's already there instead of trying to create these new concepts on top of things.

TED: Awesome.

ADRIANA: I was curious, you know, you explained your motivation for instrumenting Nomad to get better insight into what was going on for you as a developer, for the operators of Nomad, and for the developers deploying stuff to Nomad. Now, how did you get into OpenTelemetry? How did you become aware of it?

LUIZ: Good question. I don't exactly remember. I think it started when I was looking for better ways to understand, to provide better tools to our users. I landed in this space of observability and the players in that market. And it's just sort of a nascent ecosystem that is also pretty vibrant, with all the different vendors, all these different people involved from different companies. And I think I watched a demo. I want to say it was at KubeCon, but I don't remember.

But that was the first time I saw this idea of traces and spans and this notion of thinking more in terms of events rather than metrics and logs and these traditional ways of doing things. And that became very clear because that's kind of how I think when I'm developing Nomad. It's like, okay, I'm thinking about the different components inside Nomad and what types of messages they exchange and what types of events happen inside Nomad.

And so that, to me, maps very well to the concepts that OpenTelemetry has. So that's when I started looking into how to bring those capabilities to become native in Nomad. And last year, we had a team hackathon. And so that was the project that I picked. Like, how do I make the core of Nomad generate these spans, these traces, these events, so that as long as you provide an OpenTelemetry endpoint, Nomad will generate this data for you automatically?

So yeah, so that's sort of where I learned about this and where I got involved in this work. Unfortunately, since then, I haven't made a lot of progress. It's been my side project, and instrumenting Nomad is my own personal goal. So it's been slow but pretty fun to do.
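
To be clear, Nomad does not ship this today; what follows is a minimal sketch of the kind of instrumentation being described, using the standard OpenTelemetry Go SDK, with an illustrative span name and attribute rather than anything from Nomad's codebase:

    package main

    import (
        "context"
        "log"

        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/attribute"
        "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
        sdktrace "go.opentelemetry.io/otel/sdk/trace"
    )

    func main() {
        ctx := context.Background()

        // Export spans to whatever OTLP endpoint the operator configures
        // (localhost:4317 by default).
        exp, err := otlptracegrpc.New(ctx)
        if err != nil {
            log.Fatal(err)
        }
        tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
        defer tp.Shutdown(ctx)
        otel.SetTracerProvider(tp)

        // Inside the scheduler, each internal operation would open a span,
        // turning a job submission into a tree of traceable events.
        tracer := otel.Tracer("nomad")
        _, span := tracer.Start(ctx, "eval.process")
        span.SetAttributes(attribute.String("job.id", "example"))
        // ...plan, placement, and allocation creation would be child spans...
        span.End()
    }

The operator-facing piece is just the OTLP endpoint: point the exporter at a Collector, and every internal operation wrapped in a span becomes visible in whatever backend sits behind it.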

ADRIANA: Well, I hope folks listening in on this will be inspired by the work that you're doing, and maybe it'll get people clamoring for having that little extra bit of observability into Nomad. So for those listening, if you want this, [laughs] make sure you let us know. [laughs]

LUIZ: Yeah, and I am looking forward to seeing what kind of results we get out of it. Because being such a new tool, in such a new space, I think there are a lot of open questions, like, a lot of things to be figured out. I think there's a lot of perceived value, but it's not always realized. So I think we need to get it out there and see what happens. So, yeah, that's what I'm trying to do, just get something out and then see how people use it, what kind of value it provides, and whether people like it or not.

ADRIANA: The other thing, too, is that you touch upon a use case that's probably more common than what we would, I guess, advertise for, which is we want people to instrument as they code. But what if you don't have instrumentation to begin with? You're in the situation where you've got this product. It needs to be instrumented. So you're going through the process of instrumenting key areas, which I think is a realistic scenario that we’ll encounter in the outside world, right?

LUIZ: Yeah. And as I go, I sort of learn about Nomad myself. Even though I've been working on this product for years, there are areas of code that I've never touched before that I don't know exactly how they work. So going back to that selfish goal, it also helps me to understand my own tool better. It helps me, like, the way I implement guides me to understand what's going on as well.

ADRIANA: That's awesome. So you're going to be like the Nomad pro as a result of this little exercise. [laughter]

LUIZ: I hope someday I'll get there. You're working on some code, and then you see a new file that you never touched before, and then you do a git blame, and it's, like, Armon committing stuff six years ago, and no one has touched the code since the original commit.

ADRIANA: [laughs]

TED: Awesome. Well, it's been great talking to you, Luiz. Hopefully, we can have a follow-up conversation sometime once this is all out there and hear a report on how it went.

LUIZ: Nice, yeah.

ADRIANA: Yeah, that will be awesome. Well, thanks for joining us. And signing off from On-Call Me Maybe, I'm Adriana Villela.

TED: I'm Ted Young.

ADRIANA: And keep on keeping on.

TED: Aloha.

  continue reading

31 επεισόδια

Artwork
iconΜοίρασέ το
 

Αρχειοθετημένη σειρά ("Ανενεργό feed" status)

When? This feed was archived on November 30, 2023 00:38 (5M ago). Last successful fetch was on September 18, 2023 15:43 (7M ago)

Why? Ανενεργό feed status. Οι διακομιστές μας δεν ήταν σε θέση να ανακτήσουν ένα έγκυρο podcast feed για μια παρατεταμένη περίοδο.

What now? You might be able to find a more up-to-date version using the search function. This series will no longer be checked for updates. If you believe this to be in error, please check if the publisher's feed link below is valid and contact support to request the feed be restored or if you have any other concerns about this.

Manage episode 342986734 series 2565214
Το περιεχόμενο παρέχεται από το Austin Parker, Ana Margarita Medina, and Adriana Villela. Όλο το περιεχόμενο podcast, συμπεριλαμβανομένων των επεισοδίων, των γραφικών και των περιγραφών podcast, μεταφορτώνεται και παρέχεται απευθείας από τον Austin Parker, Ana Margarita Medina, and Adriana Villela ή τον συνεργάτη της πλατφόρμας podcast. Εάν πιστεύετε ότι κάποιος χρησιμοποιεί το έργο σας που προστατεύεται από πνευματικά δικαιώματα χωρίς την άδειά σας, μπορείτε να ακολουθήσετε τη διαδικασία που περιγράφεται εδώ https://el.player.fm/legal.

About our guest:

Luiz is a Toronto-based senior software engineer at HashiCorp working with distributed systems on the Nomad workload orchestrator. Before that, Luiz was a full stack and DevOps engineer at IBM, leading a team that builds and manages a SaaS e-learning platform.

Find our guest on:

Find us on:

Show Links:

Transcript:

ADRIANA: Welcome to On-Call Me Maybe. I am your host, Adriana Villela, joined by...

TED: @tedsuo on the internet, Ted in real life.

ADRIANA: Awesome. Today we have...

LUIZ: Hi. I'm Luiz.

ADRIANA: So, Luiz, tell us a little bit about yourself.

LUIZ: Sure. My name is Luiz. I'm an engineer at the HashiCorp working in a project called Nomad, which is our orchestrator solution. Yeah, I've been working on the project for almost three years now. And before that, I was in that developer operations space, meaning that my team was not large enough to have an ops team, so all developers had to do a little bit of [chuckles] operations and everything. So yeah, that's how I got involved in this space. And then eventually, I got the opportunity to work at HashiCorp and its tools.

ADRIANA: Cool. That's awesome. And it's funny how you and I met because I think we met on Twitter [chuckles]; if I'm not mistaken, though, I think my post on HashiCorp my explorations of Nomad...because last year I was a total Nomad noob. I was at Tucows last year running a HashiCorp team, all things Hashicorp just about. So I was like, oh, shoot, I better learn how this stuff works. I guess that's how we met. Now we follow each other on Twitter, which is awesome. And I guess we also have...there's the additional HashiCorp connection because I think, Ted, you said you worked at HashiCorp at some point, right?

TED: Ah, I did not actually work at HashiCorp. But when I was interviewing, when I was looking for my last job, it came down to either HashiCorp or Lightstep. Both were really interesting to me. I like the idea of bootstrapping up the OpenTracing project, which is why I went with Lightstep. But I've always enjoyed HashiCorp's approach to engineering and product development.

And Nomad actually was the project I was most interested in because, in my last job, I was working on container scheduling at Pivotal on a project called Cloud Foundry. So I really enjoyed the domain space of scheduling and building that part of a distributed operating system. I thought it was really cool.

ADRIANA: Yeah, and I have to say, coming from a Kubernetes background and being thrust into Nomad, I was like, oh, man, this is so much easier to get started. [laughs] My mind was blown right away. It was awesome. It was awesome. I can understand why HashiCorp has such a huge fan base; like, people fan over this stuff big time. [laughs]

LUIZ: Yeah, it was interesting coming from...because when I joined, my first contact with HashiCorp was initially Vagrant. I think everyone goes through the Vagrant status and then Terraform. And I only learned about Nomad when I interviewed. So I didn't know about it before I was...I used to work at IBM, and my team was using Rancher at the time, Rancher 1.x, so like a Kubernetes migration.

Afterward, when Rancher 2.0 came out, everything was Kubernetes. So we were like, oh, we might as well use Kubernetes. And then IBM bought Red Hat, which meant that everything became OpenShift. So we had to migrate for a third time. So I kind of dabble in all the different tools out there. And at the time, we were using Terraform for basically everything in our infrastructures. And it kind of became the most fun I had in the day was just playing with Terraform to the point of, okay, maybe I should be working for them since that's where I have the most fun.

But during the interview process, it was kind of a generic position, so it was like a systems engineer generic, not a specific team or product. And then, during the interview process, the Nomad team liked my background and picked me. I was like, okay, let's learn about what Nomad is. The first time I just got started, like, Nomad agent-dev, and then you have an environment up and running, ready to use. I'm like, okay, that's very different than what I'm used to.

ADRIANA: Yeah, I know. That was kind of my, oh my God, it's the same binary that does everything? What?

TED: Luiz, would you mind just since the audience may also not be super familiar with Nomad, just briefly describe what it does and a basic architectural overview.

LUIZ: Yeah, sure. That makes sense. So Nomad is a workload orchestrator. So what that means is that it will grab any sort of tasks that you may have, that you want to do, and any type of infrastructure that you have, and it's going to distribute and schedule those tasks into the cloud. So it will kind of, in some sense, abstract away your cluster.

You have thousands of machines running. You don't actually care where things run. You just give it a specification, and Nomad will figure out the best place to run and to keep running. So if a machine dies, Nomad will reschedule things and make sure that your specification is always real and it's always as you defined. That's the role of the orchestrator.

I guess where Nomad is different is that Nomad is very focused on that task. If you come from a Kubernetes background, you may be aware that Kubernetes does that, plus a lot of other stuff. So it's more feature-complete but also more complex in that sense. There's more to learn about. There's more to understand. And there's a lot more going on when you're just like, I just need a container running; I don't care where or how. So Nomad is very focused on that one task, one job of scheduling things on to other things.

Another difference is that since it's focused on the scheduling part, it is generic in terms of the workload. So nowadays, the most common workload are containers, so you can run containers everywhere, but Nomad is not restricted to only that; you can run JAR files; you can run QEMU VMs. You can run Podman containers, whatever.

So Nomad has this flexibility in terms of what you want to run and what type of workloads as well, so you can run batch jobs or services that are always running or dispatch jobs. So there are all sorts of different use cases that you can do with Nomad. They are sort of built-in into the core of Nomad. So there's no need for external tools or extra coordination to support things like rolling upgrades, or blue/green deployments, or things like that.

ADRIANA: Cool. One of the reasons why we asked you to join us today is that, I guess, a few weeks ago, we were chatting, and you mentioned that you were looking at the possibility of instrumenting the Nomad code with OpenTelemetry, which totally piqued my interest. And as OpenTelemetry lovers, we're like, yay, this is great. So why don't you tell us a little bit more about that aspect that you've been exploring?

LUIZ: Sure, yeah. It has been an area of interest for a long time. As I mentioned earlier, I used to be this developer that does operations as well. And at the time, I didn't have a good sense of what it means to instrument an application, what it means to monitor things.

My manager did a very good job of getting us large screen TVs with great dashboards and all of that. But they were always, like, doesn't matter how many metrics we have, there's always a problem happening. And every time there was a problem, we didn't know what to do exactly. [chuckles] It was always like, we had the information that we had, but that was never enough to actually solve problems.

So now, looking back and then learning about OpenTelemetry and this idea of observability and what it means to have an observable system, it all resonates with me very well because that's the stuff that I wish I had before, and then that's the way things should have been done in the past to understand when an outage happens causing some problems, and things like that.

And so now, switching back from this developer operator perspective to this tool builder perspective, I was looking into ways to make past Luiz’s life easier, like, what the tools that I'm developing today could have done to make my life better in the past.

And I think observability is one of the major things that we can improve. Because, as I mentioned the description, there is this promise of, like, oh, I don't care how my container runs; I just want it to run. And that works 99% of the time, but that 1% when it doesn't work, that too is completely opaque to you. You can have logs and metrics, but that's not enough to really understand what's going on and what's wrong.

So that's where I started becoming interested in this space is just like; how can I make Nomad more transparent and more understandable to people that are using it? And to kind of give that internal view of like, okay, this action triggered these internal operations that generated these internal objects that eventually becomes your container.

And so this piece of observability, OpenTelemetry, all of that fits very well into the narrative, both in terms of when we talk about Nomad users, we usually talk about two different personas. So we have the developer persona, which is the group of people that are writing code. They're generating, let's say, a Docker image, and they want to run the Docker image somewhere. And then there's the operator persona, which are the people that are managing the infrastructure, starting the VMs, installing stuff. They're more like managing the infrastructure part of things.

When looking into making Nomad more observable, I started looking at these two different personas. What can you offer for all of them? So what the developer cares about, what an operator cares about. And so I started looking into what can I provide to each of them? And then it all comes back to telemetry and what kind of data is more relevant. And so, yeah, from the user perspective, I want to allow them to understand what's going on better, like, what's happening when they run some command or when they do some operation what's going on.

But also a little bit more selfish, I also wanted to make my life easier. So when people file bugs, or there's a support ticket, you're always in this situation where one side of the issue is able to collect data but don't necessarily know what data to collect. And then there's the other side of the issue which is us, which is we know what data we need, but we don't have the means to collect it. So there's always this back and forth between look at this metric, what does it say? Give me this log; what kind of information is there?

So my hope is that having this common language of telemetry and traces and spans and all of that will give us a more unified conversation and just make our lives easier in terms of supporting our users and also reducing their load as well. Like, if people understand what's going on, they may not need as much to explain what's happening in solving issues. There's the user side of things, and there's also sort of the selfish making our lives easier as well perspective.

ADRIANA: I love that, yeah, because there's nothing more frustrating like you get a user ticket, and they're like, blah, blah, blah, it's not working. You're like, oh my God, I don't even know where to start. It's like this terrifying moment.

TED: It's really funny. That's actually what you're describing, Luiz, is specifically what got me into distributed tracing, and OpenTracing, and OpenTelemetry, and all of that was having other people operate my software, which in this case happened to be literally the same kind of software, a scheduling system that they're running workloads on. And then they're saying there's a problem.

And we need data from them, and the data that we could get without distributed tracing was just kind of a nightmare to dig through because it's like all or nothing. I can't be like, just give me these logs, there's something. It's just like, well, give me a dump of everything off of all 200 machines that you're running, and I'll pore through it over here. And that was just really, really tedious.

LUIZ: Yeah, there are a lot of ugly, bad scripts, lots of jq happening to try to correlate all the different logs and all different metrics from different machines that are in different time zones. So it's just sort of this mess. And living in this space of building tools where you're not actually having a SaaS product or you have control over the environment that is running is really challenging because we know nothing about where they're running. We know nothing about their environment. We know nothing about how they're running.

So we need to keep probing and asking for more information. And then every question generates five other questions just because in this space where you're shipping tools; you may not necessarily have all the information that we need. We actually purposely don't collect any data from our users. So yeah, that's definitely challenging. And getting the right information that we need to help takes a lot of back and forth.

ADRIANA: Yeah, and it's challenging, too, because sometimes you get the bug report of, like, this isn't working. But tell me more, what specifically is not working? You have to keep digging and digging and digging. And the more information you have, the better if you have that information in the form of telemetry data, even more awesome because it makes your life easier.

But I also like the point that you made about instrumenting Nomad to cater to both the developer who's deploying their containers to Nomad and having to manage their workloads and the person who's actually operating a Nomad cluster. Because I've been in a position where my former team something would go down, [laughs] and it's like the mad dash. And time is ticking.

Because if something's gone down and it's part of critical infrastructure, [laughs] you really want to get to the bottom of that as quickly as possible and hopefully avoid executives breathing down your neck. And so if you have that kind of information, it just makes life so much easier in general.

LUIZ: Yeah, when you need the data the most is usually the most critical time because you're like, your system is down. You're having customers asking like, "What's going on?" So that's the moment that you need to be more precise but also the hardest moment to find data because usually, the system is not behaving as expected. So with only logs and metrics to guide you, you rely on your past self to have made good decisions about what to log and when which doesn't always work well.

And sometimes, we have to ship a new custom binary for a customer with only one line of code to log something more specific. And then it's like, deploy this binary, and let's see what happens. So these situations where you have no idea what's going on and we need to get more log lines is not a great place to be. Because it's extremely time-consuming to deploy new binary, wait for the problem to happen again, hope that that log line will give us some new information that will answer the question. So yeah, this traditional way of doing monitoring is pretty limited.

TED: So, Luiz, I'm curious. I feel like there are probably two basic scenarios that come out of having to observe Nomad and deal with the problem; one would be Nomad has some bug or is scheduling things in a way that isn't what's expected. The other would be that the workload itself has a problem. There's something wrong with the workload, or the way that workload has been configured in Nomad or the way it's been resourced might be incorrect. And so it's thrashing or having some problem just due to the nature of its environment. And I'm wondering how would someone start down that path of trying to understand which situation they're in?

LUIZ: Yeah, so definitely, there are two situations. And normally, you would start with the general health. So when you want to run something, you write what's called a job file. In that job file, you describe how your application is organized. So we have a hierarchy, so the job, the group, another task. And the task is like the low-level location. So if you have a container, your container is a task.

And then, Nomad creates these things called allocations, which is a set of tasks. So you can have two containers in the same allocation. So normally, when you have a problem on what you're running, so like your application has a problem, it's not starting, the allocation is going to be unhealthy. And Nomad will tell you, like, you have this concept of deployment, so your deployment is going to fail because your location is never going to become healthy.

In those situations, you want to look at the application logs and at the application events. When your allocation is starting, Nomad does a lot of operations beforehand. So it downloads files, it downloads the images. It will mount volumes. All of these are events that can happen in your task. And if you have something wrong on your job file, you probably are not even going to start the container. You're going to see Nomad failing to start the application. And then there's an event saying, "Oh, your image doesn't exist," for example. That's the first place to look at is the task events.

And then if the task actually starts and it runs, but it dies up very quickly, so that's where you look for the application logs, the task logs, and things like that to see why your container is starting, but it’s dying off very quickly. So that's usually a bug in the application, or you forgot to point to your database or something like that. So that's application-specific errors. So those are normally where you look for, and those are usually caused by application bugs.

For Nomad bugs, you usually see your deployments not making progress. So you want to run something, but it just keeps spinning. Nomad never actually creates those allocations, never actually starts these containers. Or you want to run something, and then Nomad just says, "Okay," and then it never does anything. So normally, when nothing happens, that's where there's something wrong with Nomad itself. And those are the situations where you need more observability to understand what's happening.

Right now, the best place to look at for logs both in the server, in the clients...so when you have a Nomad cluster, you have these two sorts of roles that your machines are answering. They're either a server or a client. The servers are the machines that hold the global state. The server knows all the details about what's running, where, and how many machines you have in your cluster and all of that metadata of your infrastructure.

And then the clients are the machines that are actually running workloads. They are starting containers. They're running binaries. So they are the things that run whatever you need to run. And so when something's wrong, you need to look at those two sides of things. So normally, the servers are going to pick where to run stuff. So that's where you go to if your application is never starting. Like, you need to look at the server logs to see, okay, did the servers receive the request? Were the servers able to pick a client to run? And these sorts of things.

Once the server picks a client to run on, the client talks to the servers to figure out what state it's supposed to be in. That's where it will see a request and go, okay, I need to run a container with this image, with this much resource, and whatever else. And then it will start that specification.

So if your workload has been scheduled but isn't running, you might want to look at the client logs to see why it didn't run. It could be something like your Docker daemon is misconfigured, or you forgot to pass the Docker Hub credentials or something like that. That will be on the client side of the equation, so it's the client logs that you look at to get more information.
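One way to chase both sides is to stream logs from a specific agent; the IDs below are placeholders:

```shell
# Scheduling decisions live on the servers
nomad monitor -log-level=DEBUG -server-id=<server-id>

# Task startup problems live on the client that was picked
nomad monitor -log-level=DEBUG -node-id=<node-id>
```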

ADRIANA: So you're basically doing a lot of log chasing.

LUIZ: Exactly, yeah. There are all these different places that you need to look, depending on the symptoms that you're observing. From the outside, you don't know what's happening; you just know what you're observing, like, my application is not running, and that can mean several things. So you need to drill down, and instead of figuring out the problem directly, you need to figure out what's not the problem first.

So you're excluding options rather than finding the exact problem, and it's very time-consuming. It's like, okay, is my job file correct? Is my client configured correctly? There's this whole checklist that I have to go through, which, again, is very time-consuming and hard to do.

TED: Yeah, that's the crazy life of operating these complicated systems. So, Luiz, since we've got another container scheduling nerd on the call, I'm curious, does Nomad schedule stateful workloads?

LUIZ: Yeah. We also have the capability of running stateful workloads. We have two types of volumes that you can use. One of them is called host volumes, which is basically just a path on the machine that you set in the client configuration. So that will just be data written to the disk on that specific machine. That's more for when you have a database and you want good performance: you have a big disk on that specific machine, so you create a host volume.

And by configuring the client that way, Nomad is aware of where the volume is located. So in your job, you don't actually have to say, "Run on that machine because it has the big disk." You just say, "I need this volume," and Nomad will make sure that your database only runs on the machine that has the big, fast disk. So it's sort of metadata-driven scheduling.
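A minimal sketch of what that looks like on both sides; the volume name, path, and image are illustrative:

```hcl
# On the client agent: declare the volume and where it lives on disk.
client {
  host_volume "pg-data" {
    path      = "/srv/postgres" # could just as well be an NFS mount point
    read_only = false
  }
}
```

```hcl
# In the job: ask for the volume by name; Nomad handles the placement.
group "db" {
  volume "pg-data" {
    type   = "host"
    source = "pg-data"
  }

  task "postgres" {
    driver = "docker"

    config {
      image = "postgres:16"
    }

    volume_mount {
      volume      = "pg-data"
      destination = "/var/lib/postgresql/data"
    }
  }
}
```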

The other option we have is CSI, the Container Storage Interface that Kubernetes uses, but it's a generic specification. So Nomad also implements the same spec, and "in theory," and I say that in air quotes, you could use the same plugins that you would use in a Kubernetes cluster to run and create volumes in Nomad as well. And it works the same way: you have to run the CSI plugin.

And then we have two commands. One is volume create, so you can create volumes on the fly. If you already have a volume, you can register it in Nomad, and then in your job, you just specify, "I need this volume." And you can specify things like the size of the disk, its capabilities, what they call topologies, like the location of the disk. All the things that CSI specifies, Nomad follows in the spec.

And I put "any plugin will work" in quotes because some plugins are kind of Kubernetes-specific in their implementation. They talk to the Kubernetes API, so it's not 100%: if the plugin doesn't follow the spec to a T, it may not work with Nomad. We try to implement the interface, but the plugins are outside of our control. But we do have that capability as well.
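A sketch of what a CSI volume specification might look like, registered with `nomad volume create volume.hcl`; the IDs, plugin name, and sizes are illustrative:

```hcl
# volume.hcl
id        = "mysql-data"
name      = "mysql-data"
type      = "csi"
plugin_id = "aws-ebs0" # must match a CSI plugin already running as a job

capacity_min = "10GiB"
capacity_max = "20GiB"

capability {
  access_mode     = "single-node-writer"
  attachment_mode = "file-system"
}
```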

TED: Cool. Creating CSI was actually the last thing I did in that space before I flew away to observability land. But I'm curious. One thing that came up looking at stateful scheduling: I love the way you're doing it by having it be resource-driven, where you just say, "I need these things," and the scheduler figures it out.

The more we can go towards plug-and-play, like the dynamic-linking driver model that desktop operating systems use, the better. I think that's one of the things that's hard about scheduling containers these days: you end up with Turing-complete YAML pre-processing just to resolve all of your dependencies. So I'm glad to see it going that way.

But one aspect of that I always found intriguing is that there are actually two layers of scheduling. You have your container scheduling system, or whatever it is that Nomad is controlling. But if you're running containers, those containers are running on top of virtual machines that themselves have to be scheduled and replaced for various reasons, like the operating system needs to be upgraded or something like that.

And if you're running stateful workloads in particular, but really in general, if you're going to roll those virtual machines, it seems like that would work better if there was some understanding between the two schedulers as to which machines to roll, so that you don't end up taking down too many nodes in a consistent database system, for example, or taking too many ephemeral apps offline at the same time. And I'm curious, have you dug into that problem yet over at Nomad?

LUIZ: We haven't done anything specific in that sense. We do have this notion of rolling upgrades: when you update, we don't necessarily update everything at once. You can control how fast or how slow you want to go with the upgrades. And then you can also have what we call canary deployments, where you create a copy of your application alongside the old version, and then you approve or reject that upgrade.
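Both knobs live in the job's update stanza; a small sketch, with illustrative values:

```hcl
update {
  max_parallel     = 1     # roll one allocation at a time
  canary           = 1     # stand up one canary copy of the new version first
  auto_promote     = false # a human approves or rejects the canary
  min_healthy_time = "30s" # how long an allocation must stay healthy
}
```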

For stateful workloads specifically, we don't have any particular functionality around that. So it's up to you to understand what type of data is running where. That's where host volumes sort of become tricky, because now you have to do that accounting of, okay, this machine has this type of data. But the, quote, unquote, "simplified" approach of host volumes allows you to set up your storage in any way that you like. So, for example, you could have an NFS-backed host volume: even though to Nomad it looks like a path on that machine, it's actually backed by network storage or something like that.

So it becomes simpler in the sense that, as long as the volume is mounted on your clients, Nomad will be able to write to that place. It gives you flexibility, especially on-prem, where you may not have things like EBS, these sorts of dynamically provisioned volumes, because it's much simpler to manage. You keep doing what you've been doing so far: NFS, Gluster, Ceph, whatever storage you may need. And then for Nomad, you just give it the path where the volume is mounted, and Nomad does the proper thing. So those are the approaches we take.

And I think it's the same for networking. When people talk about the complexity of Kubernetes, you always have to define complexity, because it means different things to different people in different situations. To me, the complexity of Kubernetes comes just from the number of concepts that you need to learn to get started. You need to learn what a pod is. You need to learn what a deployment is. You need to learn what a volume claim is. There are all these different sets of concepts that are interlinked and intertwined, and you need to work with them.

And the pod is probably the first one that people encounter. Like, what's a pod? And then you need to read all this documentation. And the pod has a virtual IP; it sort of cascades. But in Nomad, we try to take a simpler approach: your machine has ports. You ask for a port, we give you a port on that machine's IP, and that's it. We don't try to create complexity if we don't have to; even though it might simplify some stuff, it creates mental load and more concepts. So we try to avoid that.
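That port model shows up directly in the job file's network stanza; a small sketch, with illustrative names:

```hcl
group "api" {
  network {
    port "http" {
      to = 8080 # container port; Nomad assigns a free port on the host IP
    }
  }

  task "server" {
    driver = "docker"

    config {
      image = "example/api:1.0"
      ports = ["http"] # map the allocated port into the container
    }
  }
}
```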

And that's where the host volume comes in. Nomad doesn't care what you're doing; as long as the path that you give Nomad has data in some shape or form, that's what Nomad will use. That's the general approach we take. But we also understand that CSI has a lot of benefits, especially in the cloud. And again, in the networking space, CNI has a lot of benefits as well. So we do support those, but we try to focus on a simplicity-first approach to things.

ADRIANA: And I think that's what makes the learning curve for Nomad so nice in general. For me, because I was already familiar with all these complex Kubernetes concepts, it was so much easier to translate them in my head to Nomad land. But in some ways, also, because it's got fewer moving parts, it simplified things in my brain, right?

TED: Yeah, I'm curious about that. One thing I've always felt about these systems, especially Kubernetes but any of these container systems, is that they're recreating the control plane that already exists at a level beneath them. You have all of the networking that you're setting up with your cloud provider, for example, and then you're deploying Kubernetes on top of that network, and then you're setting up networking again at the Kubernetes layer. So I'm curious, in Nomad, is part of the simplicity just that you're leveraging those lower-level building blocks? Is that where it comes from?

LUIZ: Yeah, we try to do as little work as possible. In the networking space, you can go set up whatever CNI plugin you want, and then it's on you to make sure that your setup is correct. But you don't have to if you don't want to. If you only have machines with IPs, we'll give you a port, and then you can use that port to communicate with your application.

And then we focus back on our goal, which is to orchestrate workloads. It's not to create a service mesh; it's not to create these complex multi-concept systems. We just want to run stuff on machines, and how you configure your underlying storage or underlying networking is more or less up to you. We only care about this: give us a set of machines that can talk to each other, and things will more or less just work, because we use these underlying concepts that are already there.

So we try to avoid adding complexity if we can, if there are already tools that do the job. For example, when you run a binary in Nomad on a Linux system, we isolate the binary using the tools that the Linux kernel provides, like chroots and cgroups and network namespaces. So we try to leverage as much of what's already there as possible, instead of trying to create new concepts on top of things.
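For instance, a task using Nomad's exec driver leans entirely on those kernel primitives; a minimal sketch, with an illustrative binary path, assuming the group's network stanza defines a port labeled "http":

```hcl
task "my-app" {
  driver = "exec" # isolated via chroot, cgroups, and namespaces on Linux

  config {
    command = "/usr/local/bin/my-app"
    args    = ["-port", "${NOMAD_PORT_http}"] # port value injected by Nomad
  }
}
```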

TED: Awesome.

ADRIANA: I was curious. You explained your motivation for instrumenting Nomad: to get better insight into what's going on for you as a developer, for the operators of Nomad, and for the developers deploying stuff to Nomad. Now, how did you get into OpenTelemetry? How did you become aware of it?

LUIZ: Good question. I don't exactly remember. I think it started when I was looking for better ways to understand things and to provide better tools to our users. I landed in this space of observability and the players in that market. It's sort of a nascent ecosystem that is also pretty vibrant, with all the different vendors and all these different people involved from different companies. And I think I watched a demo. I want to say it was at KubeCon, but I don't remember.

But that was the first time I saw this idea of traces and spans, and this notion of thinking more in terms of events rather than metrics and logs and the traditional ways of doing things. And that clicked, because that's kind of how I think when I'm developing Nomad. I'm thinking about the different components inside Nomad, what types of messages they exchange, and what types of events happen inside Nomad.

And so that, to me, maps very well to the concepts that OpenTelemetry has. That's when I started looking into how to bring those capabilities natively into Nomad. And last year, we had a team hackathon, and that was the project I picked: how do I make the core of Nomad generate these spans, these traces, these events, so that as long as you provide an OpenTelemetry endpoint, Nomad will generate this data for you automatically?
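For a sense of what that kind of work involves, here is a hypothetical sketch using the OpenTelemetry Go SDK, since Nomad is written in Go. The function, span names, and attributes are invented for illustration; this is not Nomad's actual internal code:

```go
package scheduler

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

// Named tracer for this component; the name is illustrative.
var tracer = otel.Tracer("nomad/scheduler")

// processEvaluation is a hypothetical scheduler step, wrapped in a span so
// each evaluation shows up as a trace event with its IDs attached.
func processEvaluation(ctx context.Context, evalID, jobID string) error {
	ctx, span := tracer.Start(ctx, "scheduler.process_evaluation")
	defer span.End()

	span.SetAttributes(
		attribute.String("nomad.eval_id", evalID),
		attribute.String("nomad.job_id", jobID),
	)

	// ...pick a client node, create allocations, and so on; each step can
	// start its own child span from ctx...
	_ = ctx
	return nil
}
```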

So yeah, that's where I learned about this and how I got involved in this work. Unfortunately, since then, I haven't made a lot of progress. It's been my side project and my own personal goal, instrumenting Nomad. It's been slow, but it's pretty fun to do.

ADRIANA: Well, I hope folks listening to this will be inspired by the work that you're doing, and maybe it'll get people clamoring for that little extra bit of observability into Nomad. So for those listening, if you want this, [laughs] make sure you let us know. [laughs]

LUIZ: Yeah, and I'm looking forward to seeing what kind of results we get out of it. Because it's such a new tool in such a new space, I think there are a lot of open questions, a lot of things to be figured out. I think there's a lot of perceived value, but it's not always realized. So I think we need to get it out there and see what happens. That's what I'm trying to do: just get something out and then see how people use it, what kind of value it provides, and whether people like it or not.

ADRIANA: The other thing, too, is that you touch on a use case that's probably more common than what we would, I guess, advertise, which is: we want people to instrument as they code. But what if you don't have instrumentation to begin with? You're in the situation where you've got this product, and it needs to be instrumented. So you're going through the process of instrumenting key areas, which I think is a realistic scenario that we'll encounter in the outside world, right?

LUIZ: Yeah. And as I go, I learn about Nomad myself. Even though I've been working on this product for years, there are areas of the code that I've never touched before and that I don't know exactly how they work. So, going back to that selfish goal, it also helps me understand my own tool better. The way I implement the instrumentation guides me to understand what's going on as well.

ADRIANA: That's awesome. So you're going to be like the Nomad pro as a result of this little exercise. [laughter]

LUIZ: I hope someday I'll get there. You're working on some code, and then you see a file that you've never touched before, and you do a git blame, and it's Armon committing stuff six years ago, and no one has touched the code since the original commit.

ADRIANA: [laughs]

TED: Awesome. Well, it's been great talking to you, Luiz. Hopefully, we can have a follow-up conversation sometime once this is all out there and hear a report on how it went.

LUIZ: Nice, yeah.

ADRIANA: Yeah, that will be awesome. Well, thanks for joining us. And signing off from On-Call Me Maybe, I'm Adriana Villela.

TED: I'm Ted Young.

ADRIANA: And keep on keeping on.

TED: Aloha.
