<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Research-and-Learning</title><link>https://jwheel.org/tags/research-and-learning/</link><description>Homepage of Justin Wheeler, an Open Source contributor and Free Software advocate from Georgia, USA.</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><managingEditor>Justin Wheeler</managingEditor><lastBuildDate>Tue, 25 Oct 2022 00:00:00 +0000</lastBuildDate><atom:link href="https://jwheel.org/rss/tags/research-and-learning/index.xml" rel="self" type="application/rss+xml"/><item><title>CHAOSS DEI Review: Midyear reflection</title><link>https://jwheel.org/blog/2022/10/chaoss-dei-review-reflection/</link><pubDate>Tue, 25 Oct 2022 00:00:00 +0000</pubDate><guid>https://jwheel.org/blog/2022/10/chaoss-dei-review-reflection/</guid><description><![CDATA[<p>Since February 2021, the CHAOSS Project has been conducting a funded, long-term review of its governance, practices, and processes in a diversity, equity, and inclusion (D.E.I.) &ldquo;audit.&rdquo; I originally joined as an internal community liaison and helped to identify a team of D.E.I. practitioners external to the CHAOSS Project to support this work. Thanks to the support of the Ford Foundation, we are slowly approaching the two-year anniversary of when this work began.</p>
<p>My brief readout is a guided reflection using questions shared by Matt Germonprez. This reflects my review of our work as a team to date and also shares some of my hopeful outlooks for what our amazing team can accomplish together. This readout will cover <strong>(1)</strong> our accomplishments as a team, <strong>(2)</strong> what was expected and surprising, and <strong>(3)</strong> what we could change in the next year.</p>

<h2 id="chaoss-accomplishments--learnings">CHAOSS accomplishments &amp; learnings&nbsp;<a class="hanchor" href="#chaoss-accomplishments--learnings" aria-label="Anchor link for: CHAOSS accomplishments &amp; learnings">🔗</a></h2>
<p>Three achievements and aspirations stand out over the past year:</p>
<ol>
<li>Established process management and a team workflow.</li>
<li>Created a small but active Community of Practice (CoP).</li>
<li>Began sharing our results with CHAOSS and the Open ecosystem.</li>
</ol>

<h3 id="processes--workflow">Processes &amp; workflow&nbsp;<a class="hanchor" href="#processes--workflow" aria-label="Anchor link for: Processes &amp; workflow">🔗</a></h3>
<p>
<figure>
  <img src="/blog/2022/10/jonny-gios-4AT3mZMuFuI-unsplash.jpg" alt="A metalworker is working at an anvil. A red-hot iron rod is on the anvil, and a person uses a hammer to shape and mold the hot iron into a hooked shape." loading="lazy">
  <figcaption>We had to forge our own practices that worked best for our group. Photo by Jonny Gios (<a href="https://unsplash.com/@supergios?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText" class="bare">https://unsplash.com/@supergios?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText</a>) on Unsplash (<a href="https://unsplash.com/s/photos/forge?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText" class="bare">https://unsplash.com/s/photos/forge?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText</a>).</figcaption>
</figure>
</p>
<p>For direct participants of the team, the Ford Foundation funding did not come with strict requirements or success metrics. As we assembled our team, we were given discretion over how best to conduct a D.E.I. review for the project. This allowed creative freedom to figure out what would work best for CHAOSS. Additionally, I could not identify a straightforward way to discover other Open communities and projects doing our kind of work. Since there were few known successful models to follow, we combined our shared experiences across multiple Open communities to build our team, identify main areas of focus, and engage the community around our efforts.</p>
<p>This is an achievement because we collectively created an active group that makes incremental, positive changes to CHAOSS. This is a model we could share with other projects so that others can learn from our experiences.</p>

<h3 id="community-of-practice">Community of Practice&nbsp;<a class="hanchor" href="#community-of-practice" aria-label="Anchor link for: Community of Practice">🔗</a></h3>
<p>Our team is a small but engaged group of D.E.I. practitioners. We share a connection through our ongoing review of the CHAOSS Project, but we also give and take from our own personal experiences outside of CHAOSS. Our group regularly meets and discusses complex, difficult issues that are both (a) not easy to discuss openly and (b) applicable to many communities beyond only CHAOSS. Our team meetings are a safe space that promotes honest and constructive discussion centered on diversity, equity, and inclusion. In addition to our recommendations and direct efforts with CHAOSS, I often reflect on our conversations as a team when working with other Open communities. An example of this is how we built a list of questions to get a &ldquo;pulse&rdquo; from the community on their feelings about CHAOSS.</p>

<h3 id="sharing-results-with-chaoss-and-beyond">Sharing results with CHAOSS and beyond&nbsp;<a class="hanchor" href="#sharing-results-with-chaoss-and-beyond" aria-label="Anchor link for: Sharing results with CHAOSS and beyond">🔗</a></h3>
<p>This is aspirational and not yet fully realized. Our team has collected a solid portfolio of stories and experiences that other communities would benefit from learning. I consider this a current achievement because while our work does specifically look at CHAOSS, we also often reflect on how a topic of interest might look in other communities more generally. When the time comes to package our findings, I believe we are setting ourselves up for easier messaging and outreach opportunities in the future.</p>

<h2 id="according-to-expectations">According to expectations&nbsp;<a class="hanchor" href="#according-to-expectations" aria-label="Anchor link for: According to expectations">🔗</a></h2>
<p>While I have worked in Open Source D.E.I. communities since 2015, I have never conducted an applied research review for community D.E.I. before. I did not come into this with strong immediate expectations because the work would inevitably reflect the backgrounds and strengths of the team we would assemble. However, I did have specific hopes for what this work would realize.</p>

<h3 id="as-expected">As expected&nbsp;<a class="hanchor" href="#as-expected" aria-label="Anchor link for: As expected">🔗</a></h3>
<ul>
<li><strong>Data-driven approach</strong>: We began this work without a strong representation of the state of CHAOSS. What do contributors think about the project? While data is not a panacea, we gravitated to a community survey early on because we needed to understand the community experience better before making serious suggestions.</li>
<li><strong>Time zones are hard</strong>: Our team was spread out across North America, Africa, LATAM, and Europe. Additionally, the work with CHAOSS was also a part-time venture for most of us, in addition to primary employment. Calendars and schedules are hard to get right. Since our team&rsquo;s organization was ad-hoc, momentum would occasionally slow for some periods.</li>
<li><strong>We have an amazing team!</strong> I expected great things once we identified our roster. We have also had more amazing people join us over time and add new passion and insight to our focus as a group.</li>
</ul>

<h3 id="surprises">Surprises&nbsp;<a class="hanchor" href="#surprises" aria-label="Anchor link for: Surprises">🔗</a></h3>
<ul>
<li><strong>Documenting our impact is not always intuitive</strong>: While we have done internal storytelling work within the CHAOSS Project, we do not have a good record of our achievements to date. Our linear progression does not lend itself easily to self-reflection and recalibration. Although much of our focus is on the CHAOSS community survey and CHAOSS Africa, we also facilitated several other notable achievements in the project in the last year. See the following examples:
<ul>
<li>Supporting the establishment of a Code of Conduct Committee.</li>
<li>Community office hours for newcomers.</li>
<li>Improved, peer-to-peer onboarding experience in CHAOSS.</li>
<li>Increased efforts in CHAOSS mentored projects (e.g. Outreachy and GSoC).</li>
<li>Recommending changes to the project and community, like broader localization to Chinese &amp; Spanish and establishing a D.E.I. council.</li>
</ul>
</li>
<li><strong>Losing and regaining steam on the survey</strong>: Although the community pulse survey was one of the earliest tasks identified in our work, launching a first survey proved to take a lot of resources from the team. We briefly stalled out on the survey effort while focused on other areas (as listed above). While our team was able to achieve many smaller victories for CHAOSS with low-hanging fruit, it took a sustained focus and a slowdown on new topics to achieve larger contributions like the community pulse survey.</li>
</ul>

<h2 id="changes-for-the-chaoss-team-next-year">Changes for the CHAOSS team next year&nbsp;<a class="hanchor" href="#changes-for-the-chaoss-team-next-year" aria-label="Anchor link for: Changes for the CHAOSS team next year">🔗</a></h2>
<p>Looking ahead to 2023, I hope to strengthen our efforts as a team in these areas:</p>
<ol>
<li>Packaging our work</li>
<li>Dissemination of our work</li>
</ol>
<p>
<figure>
  <img src="/blog/2022/10/christophe-rollando-uOi-nHgMR5o-unsplash.jpg" alt="Large, gold-colored balloons spell out 2023. Several other silver-colored objects surround the gold letters, like star-shaped balloons, tree ornaments, and card-stock stars." loading="lazy">
  <figcaption>Photo by Christophe Rollando (<a href="https://unsplash.com/@chrisrolls?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText" class="bare">https://unsplash.com/@chrisrolls?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText</a>) on Unsplash (<a href="https://unsplash.com/s/photos/2023?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText" class="bare">https://unsplash.com/s/photos/2023?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText</a>).</figcaption>
</figure>
</p>

<h3 id="packaging">Packaging&nbsp;<a class="hanchor" href="#packaging" aria-label="Anchor link for: Packaging">🔗</a></h3>
<p>Our work stream was linearly ordered and we took a forward-looking approach. Now is a good time to look back and reflect on our results to date. What are our key findings and observations? What suggestions will we make to CHAOSS? How could other communities learn from our experience running this review? One task for us as a team is to identify key messages and themes so that dissemination into broader domains is possible.</p>

<h3 id="dissemination">Dissemination&nbsp;<a class="hanchor" href="#dissemination" aria-label="Anchor link for: Dissemination">🔗</a></h3>
<p>Once we package our work, notes, and reflections, we should take an active approach to disseminating and sharing our work. This includes both the CHAOSS Project and a more general audience. For the CHAOSS Project, this could be a written report, presentations to the CHAOSS board, speaking at <a href="https://jwfblog.wpenginepowered.com/tag/chaosscon/">CHAOSScon</a>, and outreach to the multiple Working Groups. For a general audience, this could include speaking at industry conferences, sharing our work with other Communities of Practice, social media, or other ways of promoting our deliverables.</p>]]></description></item><item><title>How Mozilla Open Source Archetypes influence UNICEF Open Source Mentorship</title><link>https://jwheel.org/blog/2020/11/open-source-archetypes-unicef-open-source/</link><pubDate>Tue, 10 Nov 2020 00:00:00 +0000</pubDate><guid>https://jwheel.org/blog/2020/11/open-source-archetypes-unicef-open-source/</guid><description><![CDATA[<p>In May 2018, Mozilla and Open Tech Strategies released a 40-page report titled, &ldquo;<em>Open Source Archetypes</em>&rdquo;. This blog post is a recap of how this report influences the Open Source Mentorship programme I lead at the UNICEF Innovation Fund.</p>
<p>I joined the UNICEF Innovation team in June 2020, although this is <a href="https://jwfblog.wpenginepowered.com/2018/02/unicef-internship/">not the first time</a> I have worked with UNICEF Innovation. I have had <a href="https://www.unicef.org/innovation/stories/unicefs-open-source-approach-innovation">some opportunity</a> to write about Open Source, but my personal blog has been quiet! So, this felt like the right opportunity to talk about what I am up to these days.</p>
<p>The <em>Open Source Archetypes</em> report (<em>below</em>) provides nine archetypes common among Open Source projects and communities. These archetypes provide a common language and perspective to think about how to capture the most value of Open Source in various contexts.</p>
<p><a href="/docs/Open-Source-Archetypes-Mozilla-Open-Tech-Strategies-May-2018.pdf">Download: Open Source Archetypes (May 2018)</a></p>
<p>This article covers the following topics:</p>
<ol>
<li>How <em>Open Source Archetypes</em> align with my experience</li>
<li>How I use <em>Open Source Archetypes</em> at UNICEF</li>
<li>Unanswered questions</li>
</ol>

<h2 id="how-open-source-archetypes-align-with-my-experience">How <em>Open Source Archetypes</em> align with my experience&nbsp;<a class="hanchor" href="#how-open-source-archetypes-align-with-my-experience" aria-label="Anchor link for: How Open Source Archetypes align with my experience">🔗</a></h2>
<p>The <em>Open Source Archetypes</em> report is useful to me because it aligns with my own experiences and encounters with common Free and Open Source Software projects. An advantage of taking my alma mater&rsquo;s <a href="https://www.rit.edu/study/free-and-open-source-software-and-free-culture-minor">Free and Open Source Software and Free Culture Minor</a> is experiencing what real Open Source projects are like long before I entered the industry. The projects and organizations I contributed to and interacted with all ran their projects in one of the nine models identified in the report.</p>
<p>The <em>Open Source Archetypes</em> report speaks to my personal experience either using or contributing to projects like <a href="https://jwheel.org/#fedora">Fedora</a>, <a href="https://github.com/kubernetes/minikube/commits?author=justwheel">Kubernetes</a>, <a href="https://www.spigotmc.org/threads/its-been-an-amazing-three-years.185023/">SpigotMC</a>, <a href="https://musicbrainz.org/user/jflory/edits">MusicBrainz</a>, and various independent projects. <strong>The value of Open Source for any project is in meeting the goals of the intended audience.</strong> By itself, &ldquo;Open Source&rdquo; is a broad term, even if it does have a <a href="https://opensource.org/osd-annotated">legal definition</a>. My experiences taught me the importance of how different Open Source projects meet the needs of different audiences, or even different combinations and balances of audiences. The <em>Open Source Archetypes</em> report creates language for something I previously only understood through direct experience.</p>
<p>When I first read the report earlier in 2020, I knew it was relevant to my work. But how could I begin to integrate it into the Open Source Mentorship programme I manage for the UNICEF Innovation Fund?</p>

<h2 id="how-i-use-open-source-archetypes-at-unicef">How I use <em>Open Source Archetypes</em> at UNICEF&nbsp;<a class="hanchor" href="#how-i-use-open-source-archetypes-at-unicef" aria-label="Anchor link for: How I use Open Source Archetypes at UNICEF">🔗</a></h2>
<p>The <a href="https://unicefinnovationfund.org/">UNICEF Innovation Fund</a> provides early stage funding and support to frontier technology solutions that benefit children and the world. Most teams in the Innovation Fund are from countries where UNICEF has an <a href="https://www.unicef.org/about/execboard/files/CPDs_ending_in_2021-EN-2020.10.05.pdf">ongoing country programme</a>.</p>
<p>A requirement for solutions we fund is that they must be Open Source. I have seen many different types of projects and business models since I started working as a <a href="https://jwheel.org/#librecorps">part-time consultant</a> for UNICEF in 2018. As exciting as this is, it was challenging to understand the best way of supporting each team and their Open Source projects. Each team and project had differences unrelated to their source code, but closely tied to their business models and impact they wanted to have through their work.</p>
<p>So, the <em>Open Source Archetypes</em> report gave me language. It gave me examples and explanations of how Open Source can work, which I could share with teams who had little to no prior experience of Working Open. I take the unique context and details I understand about each team I work with, and contextualize what they are doing compared to the different models in the report.</p>
<p>The feedback I have received so far on the report from the 15+ teams I currently work with is mostly positive. Some teams exclaimed that this report was what they wished they could have read months earlier because it resolved many of their doubts. Others were more overwhelmed and needed extra time to read and review.</p>
<p>For my role as a mentor, the Open Source Archetypes report gives me cues for how to best support and direct each team I work with. The task of building an Open Source community or participating in an existing one is not a small task. Whether it is documentation, project management, quality assurance and testing, or community engagement, I have yet to see any small team accomplish all of these things at once. So, identifying which archetype a team best identifies with gives me a cue to guide the teams on their path forward. It gives me context for how to make Open Source something that works for them instead of against them.</p>

<h2 id="unanswered-questions">Unanswered questions&nbsp;<a class="hanchor" href="#unanswered-questions" aria-label="Anchor link for: Unanswered questions">🔗</a></h2>
<p>I have great appreciation and gratitude for the folks at Mozilla and Open Tech Strategies who compiled this report. But it was written over two years ago, and like all things in life, things can change. So, while I look comfortably from the position of hindsight, there are some critiques and missing components to the <em>Open Source Archetypes</em> report.</p>
<p>My unanswered questions are below.</p>

<h3 id="does-the-linux-kernel-and-subsequently-linux-distributions-represent-another-unwritten-archetype">Does the Linux kernel (and subsequently, Linux distributions) represent another unwritten archetype?&nbsp;<a class="hanchor" href="#does-the-linux-kernel-and-subsequently-linux-distributions-represent-another-unwritten-archetype" aria-label="Anchor link for: Does the Linux kernel (and subsequently, Linux distributions) represent another unwritten archetype?">🔗</a></h3>
<p>The report explicitly avoided using the Linux kernel as the basis for any archetype:</p>
<blockquote>
<p>In some ways the Linux kernel project could be considered “Wide Open”. However, both technically and culturally, Linux kernel development is sui generis and we have deliberately avoided using it as the basis for any archetype.</p>
<p><em>Open Source Archetypes</em>, Page 17</p>
</blockquote>
<p>Contextualizing a project like Linux is hard. There is a lot of history to a project that first launched over email in 1991. There are many &ldquo;yes, but&rdquo;s about decisions made 10 or even 25 years ago that would not replay the same way in 2020.</p>
<p>Yet this is important work. Linux represents not just the kernel, but also large, decentralized sub-units of other systems that integrate the kernel in order to make it useful (e.g. Ubuntu, Fedora, Debian, Arch Linux, you name it). These sub-communities include large entities and corporations, spanning multiple countries and organizations of various sizes.</p>
<p>The Linux kernel communities are worthy of a deeper look, possibly in order to define a new archetype.</p>

<h3 id="how-can-open-source-archetypes-better-fit-the-socialhumanitarian-sector">How can Open Source Archetypes better fit the social/humanitarian sector?&nbsp;<a class="hanchor" href="#how-can-open-source-archetypes-better-fit-the-socialhumanitarian-sector" aria-label="Anchor link for: How can Open Source Archetypes better fit the social/humanitarian sector?">🔗</a></h3>
<p>The archetypes shared in the report largely focus on business sustainability. In other words, the report is biased towards Mozilla&rsquo;s interest in funding the research in order to better understand how to support a commercially-successful Open Source project. To me, there seems to be a gap in models that often work for Open Source projects like <a href="https://ureport.in/about/">U-Report</a> and <a href="https://www.ushahidi.com/about">Ushahidi</a>.</p>
<p>This is an area of interest to me, and likely others in the UN and NGO space. The report could do more to address these kinds of projects.</p>

<h2 id="how-would-you-teach-open-source">How would you teach Open Source?&nbsp;<a class="hanchor" href="#how-would-you-teach-open-source" aria-label="Anchor link for: How would you teach Open Source?">🔗</a></h2>
<p>To conclude, the Open Source Archetypes report is an invaluable tool that provides me with language and context for teaching others about Free and Open Source Software.</p>
<p>How would you teach Open Source? What models, research, or tools would you use to inform an Open Source mentorship or education programme? Share your thoughts below in the comments!</p>]]></description></item><item><title>TeleIRC v2.0.0: March 2020 progress update</title><link>https://jwheel.org/blog/2020/03/teleirc-v2-0-0-march-2020-progress-update/</link><pubDate>Thu, 19 Mar 2020 00:00:00 +0000</pubDate><guid>https://jwheel.org/blog/2020/03/teleirc-v2-0-0-march-2020-progress-update/</guid><description><![CDATA[<p>Since September 2019, the <a href="https://ritlug.com/">RITlug</a> TeleIRC team has been hard at work on the <a href="https://github.com/RITlug/teleirc/milestone/8">v2.0.0 release</a> of TeleIRC. This blog post is a short update on what is coming in TeleIRC v2.0.0, our progress so far, and when to expect the next major release.</p>

<h2 id="whats-coming-in-teleirc-v200">What&rsquo;s coming in TeleIRC v2.0.0?&nbsp;<a class="hanchor" href="#whats-coming-in-teleirc-v200" aria-label="Anchor link for: What&rsquo;s coming in TeleIRC v2.0.0?">🔗</a></h2>
<p>TeleIRC v2.0.0 is a complete rewrite of TeleIRC. The team is migrating the code base <a href="https://github.com/RITlug/teleirc/issues/163">from NodeJS to Go</a>. In September 2019, the team began scoping the requirements and how to approach this large task. TeleIRC v2.0.0 does not add new features, but aims to have feature parity with the v1.x.x version of TeleIRC.</p>
<p>You might be asking, why bother with a total rewrite? What does this actually accomplish for the project? To answer this question, some historical context is needed!</p>

<h3 id="teleirc-v100-was-an-experiment">TeleIRC v1.0.0 was an experiment.&nbsp;<a class="hanchor" href="#teleirc-v100-was-an-experiment" aria-label="Anchor link for: TeleIRC v1.0.0 was an experiment.">🔗</a></h3>
<p><a href="https://github.com/RITlug/teleirc/releases/tag/v1.0.0">TeleIRC v1.0.0</a> was originally created and released in September 2016 by RIT alum <a href="https://github.com/repkam09">Mark Repka</a>. Mark created TeleIRC as a cool project for the RIT Linux Users Group (RITlug) when he was a student and vice president of RITlug. The project was written in hackathon spirit: to prove that something that was not yet common wasn&rsquo;t that hard to do.</p>
<p>Fast forward to today: TeleIRC ended up being pretty popular! So did chat bridges (Matterbridge, Matrix/Riot, etc.) as a whole. The <a href="https://docs.fedoraproject.org/en-US/project/">Fedora Project</a> is one of our largest users, with a dedicated <a href="https://docs.fedoraproject.org/en-US/teleirc-sig/">Special Interest Group</a> to manage the bots. The <a href="https://www.libreoffice.org/about-us/who-are-we/">LibreOffice community</a> is another one of our biggest users. Several international communities also adopted TeleIRC to make their chat rooms more accessible to a new generation of open source fans. Some example users are Linux and BSD user groups and hackerspaces in Argentina, Albania, and across Asia. You can see the <a href="https://docs.teleirc.com/en/latest/about/who-uses-teleirc/">full list of TeleIRC users</a> for yourself.</p>
<p>TeleIRC has grown in a way we never thought it would. Which is awesome! But the project was not originally designed to grow or scale the way it has. Additionally, by being at a university, contributors come and go as students graduate and move on to industry. We also have to think about how to maintain TeleIRC beyond the typical student life-cycle common in the academic world.</p>

<h3 id="lets-approach-teleirc-v200-as-engineers">Let&rsquo;s approach TeleIRC v2.0.0 as engineers.&nbsp;<a class="hanchor" href="#lets-approach-teleirc-v200-as-engineers" aria-label="Anchor link for: Let&rsquo;s approach TeleIRC v2.0.0 as engineers.">🔗</a></h3>
<p>A full rewrite allows us to fully leverage our knowledge as software engineers. In 2020, we know TeleIRC has a large user community and is an important part of how many open source communities communicate. We also know that breaking code into smaller, more modular pieces makes it easier to maintain and to bring in new contributors. A full rewrite allows us to apply the lessons the team has learned over the years, in a way that incremental feature releases do not allow.</p>
<p>A few areas are in clear focus for the TeleIRC v2.0.0 rewrite:</p>
<ol>
<li>Write clean, simple code that is easy to understand</li>
<li>Test the code so it is easy to tell when things are working and when they aren&rsquo;t</li>
<li>Think about how to bring in new contributors to continue the project in the future</li>
</ol>
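<p>As a sketch of goals (1) and (2), the snippet below shows the kind of small, single-purpose unit that is easy to test with Go&rsquo;s idiomatic table-driven style. The <code>formatMessage</code> helper and its output format are hypothetical examples, not TeleIRC&rsquo;s actual API.</p>

```go
package main

import "fmt"

// formatMessage prefixes a relayed chat line with its sender's name.
// Hypothetical helper: it only illustrates a unit small enough to test
// in isolation, one of the goals of the v2.0.0 rewrite.
func formatMessage(sender, text string) string {
	return fmt.Sprintf("<%s> %s", sender, text)
}

func main() {
	// Table-driven cases, the idiomatic Go pattern the team can reuse
	// in *_test.go files once each piece of the bridge is this small.
	cases := []struct {
		sender, text, want string
	}{
		{"alice", "hello", "<alice> hello"},
		{"bob", "o/", "<bob> o/"},
	}
	for _, c := range cases {
		got := formatMessage(c.sender, c.text)
		fmt.Printf("%q -> %q (ok: %v)\n", c.text, got, got == c.want)
	}
}
```

<p>In a real test file, each table entry becomes one assertion, so a failure points directly at the broken case.</p>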
<p>But maybe you are also asking, why the jump to Go?</p>

<h3 id="a-go-rewrite-distinguishes-our-project">A Go rewrite distinguishes our project.&nbsp;<a class="hanchor" href="#a-go-rewrite-distinguishes-our-project" aria-label="Anchor link for: A Go rewrite distinguishes our project.">🔗</a></h3>
<p>When Mark and I launched the project in 2016, we didn&rsquo;t look around to see if anything else like RITlug&rsquo;s TeleIRC already existed. Turns out, there was <a href="https://github.com/FruitieX/teleirc">another NodeJS project</a> with the same name. Skip forward a few years, and there are also projects like <a href="https://github.com/42wim/matterbridge">Matterbridge</a>, <a href="https://github.com/sfan5/pytgbridge">pytgbridge</a>, and <a href="https://github.com/xypiie/teleirc">other implementations</a>. So, with all this commotion out there these days, why bother with our version of yet another chat bridge?</p>
<p>First, there is one design principle distinguishing our project from others like it: do one thing and do it well. Matterbridge is an excellent tool, and we even use it in conjunction with TeleIRC at our university. However, it is a complex tool with many features and options. For some people, this is a non-issue. But the TeleIRC team likes to think there is beauty in simplicity. Instead of offering a tool with the most features and configuration options, we aspire to do a single thing and to do it really well: connect Telegram groups and IRC channels together.</p>
<p>Second, although the FruitieX/teleirc project is archived today, it was once the biggest alternative to our project, also written in NodeJS. When we decided to launch TeleIRC v2.0.0 development, it had a larger community and user base than ours. So instead of offering a &ldquo;similar but different&rdquo; NodeJS project, we would be the first Telegram-IRC bridge written in Go. (Yes, Matterbridge is also written in Go, but see the above paragraph.)</p>
<p>Third… many of the existing maintainers of TeleIRC simply wanted an excuse to learn Go. It is an opportunity to expand our knowledge, experience, and skills, especially since we are students preparing to enter the industry.</p>

<h3 id="go-has-a-better-story-for-kubernetes--openshift">Go has a better story for Kubernetes / OpenShift.&nbsp;<a class="hanchor" href="#go-has-a-better-story-for-kubernetes--openshift" aria-label="Anchor link for: Go has a better story for Kubernetes / OpenShift.">🔗</a></h3>
<p>Finally, we are carefully considering the needs of one of our biggest downstream users: the <strong>Fedora Project</strong>. Several TeleIRC developers also support Fedora&rsquo;s TeleIRC SIG. Recently, the Fedora Infrastructure team launched an OpenShift instance for the Fedora community, called <a href="https://fedoraproject.org/wiki/Infrastructure/Communishift">Communishift</a>. All existing infrastructure in Fedora is gradually moving from virtual machines or OpenStack to OpenShift. To support this migration, we want to make a Go-based TeleIRC as easy to deploy in OpenShift as possible.</p>
<p>And fortunately, Go has a great story in the container orchestration world. Kubernetes and OpenShift are also Go-based projects. Go is the dominant language of this ecosystem. Its excellent performance in the niche of networking makes it a great choice for what TeleIRC does.</p>
<p>Now that you know more about the &ldquo;why is this happening,&rdquo; let&rsquo;s talk about where things are and what you can expect!</p>

<h2 id="teleirc-v200-progress-so-far">TeleIRC v2.0.0: Progress so far&nbsp;<a class="hanchor" href="#teleirc-v200-progress-so-far" aria-label="Anchor link for: TeleIRC v2.0.0: Progress so far">🔗</a></h2>
<p><strong>TeleIRC v2.0.0 is approximately 76% complete</strong>. All progress is tracked in the <a href="https://github.com/RITlug/teleirc/milestone/8">v2.0.0 milestone</a> on GitHub. <a href="https://github.com/RITlug/teleirc/milestone/8?closed=1">46 issues and pull requests were closed</a> since we began in September 2019. At publishing time, about 16 more issues and pull requests are left before we cut the v2.0.0 release.</p>
<p>Earlier in 2019, the maintainer team consisted of <a href="https://github.com/justwheel">Justin Wheeler</a>, <a href="https://github.com/Tjzabel">Tim Zabel</a>, <a href="https://github.com/xforever1313">Seth Hendrick</a>, <a href="https://github.com/thenaterhood">Nate Levesque</a>, <a href="https://github.com/nic-hartley">Nic Hartley</a>, and <a href="https://github.com/robbyoconnor">Robby O&rsquo;Connor</a>. Now joining the committer group, we are happy to welcome <strong><a href="https://github.com/Zedjones">Nicholas Jones</a>, <a href="https://github.com/10eMyrT">Kevin Assogba</a>, and <a href="https://github.com/kennedy">Kennedy Kong</a></strong> to the team. The current core group of maintainers for v2.0.0 are Justin, Tim, Nicholas, Kevin, and Kennedy.</p>

<h2 id="when-to-expect-teleirc-v200">When to expect TeleIRC v2.0.0&nbsp;<a class="hanchor" href="#when-to-expect-teleirc-v200" aria-label="Anchor link for: When to expect TeleIRC v2.0.0">🔗</a></h2>
<p>TeleIRC v2.0.0 is targeted for a release date of <strong>Friday, May 15th, 2020</strong>. At that point, we expect to have full feature parity with the v1.x.x version. We will then recommend that all existing users upgrade to the latest release.</p>
<p>In the meanwhile, the team is getting ready to <a href="https://github.com/RITlug/teleirc/issues/265">cut a v2.0.0-pre1 release</a>, our first &ldquo;pre-release&rdquo; of the Go port. We expect this release to be available on our <em><a href="https://github.com/RITlug/teleirc/releases">Releases</a></em> page by Saturday, March 28th. Along with the v2.0.0-pre1 release, there are a few other details to note:</p>
<ol>
<li><a href="https://github.com/RITlug/teleirc/milestone/9?closed=1">TeleIRC v1.5.0</a>, the final release of the NodeJS implementation, will be published.</li>
<li>No further contributions will be accepted to the NodeJS version.</li>
<li>The <code>master</code> branch in git will track the latest Go version of TeleIRC.</li>
</ol>
<p>Once the v2.0.0-pre1 release is available, we want your help to take it for a test drive! If TeleIRC is critical infrastructure for you, we do not recommend upgrading yet, since the pre-release does not have full feature parity. But your early feedback can help shape the next release while it is in active development.</p>

<h2 id="get-involved-with-teleirc">Get involved with TeleIRC!&nbsp;<a class="hanchor" href="#get-involved-with-teleirc" aria-label="Anchor link for: Get involved with TeleIRC!">🔗</a></h2>
<p>You can be a part of the upcoming TeleIRC v2.0.0 release. We&rsquo;d love your help! There is no formal commitment required to contribute, although we ask that you participate through a single sprint cycle.</p>
<p>Read our <a href="https://docs.teleirc.com/en/latest/dev/contributing/"><em>Contributing guidelines</em></a> on how to get started with TeleIRC. <a href="https://rit.bluejeans.com/564315135">Virtual developer meetings</a> take place every Saturday at 15:00 US EDT, so anyone can join and participate.</p>
<p>Come say hello in our developer chat rooms, either on <a href="https://webchat.freenode.net/#ritlug-teleirc">IRC</a> or in <a href="https://t.me/teleirc">Telegram</a>!</p>
<hr>
<p><em><a href="https://unsplash.com/photos/guiQYiRxkZY">Background photo</a> by <a href="https://unsplash.com/@epicantus">Daria Nepriakhina</a> on <a href="https://unsplash.com/">Unsplash</a>.</em></p>]]></description></item><item><title>HPC workloads in containers: Comparison of container run-times</title><link>https://jwheel.org/blog/2019/08/hpc-workloads-containers/</link><pubDate>Tue, 20 Aug 2019 00:00:00 +0000</pubDate><guid>https://jwheel.org/blog/2019/08/hpc-workloads-containers/</guid><description><![CDATA[<p>Recently, I worked on an interesting project to evaluate different container run-times for high-performance computing (HPC) clusters. HPC clusters are what we once knew as <a href="https://en.wikipedia.org/wiki/Supercomputer">supercomputers</a>. Today, instead of giant mainframes, they are hundreds, thousands, or tens of thousands of <a href="https://en.wikipedia.org/wiki/Massively_parallel">massively parallel</a> systems. Since performance is critical, virtualization with tools like virtual machines or Docker containers was not realistic. The overhead was too much compared to bare metal.</p>
<p>However, the times are a-changing! <a href="https://jwfblog.wpenginepowered.com/tag/containers/">Containers</a> are entering as real players in the HPC space. Previously, containers were brushed off as incompatible with most HPC workflows. Now, several open source projects are emerging with unique approaches to enabling containers for HPC workloads. This blog post evaluates four container run-times in an HPC context, as they stand in July 2019:</p>
<ul>
<li>Charliecloud</li>
<li>Shifter</li>
<li>Singularity</li>
<li>Podman</li>
</ul>

<h2 id="research-requirements">Research requirements&nbsp;<a class="hanchor" href="#research-requirements" aria-label="Anchor link for: Research requirements">🔗</a></h2>
<p>My research focused on a specific set of requirements. To receive a favorable review, a container run-time needed to meet three basic requirements:</p>
<ul>
<li>Support CentOS/RHEL 7.5+</li>
<li>Compatibility with <a href="https://en.wikipedia.org/wiki/Univa_Grid_Engine">Univa GridEngine</a></li>
<li>Support for very large numbers of users</li>
</ul>
<p>Obviously there are security concerns with the third requirement. This is one reason containers have not made a strong showing in the HPC world yet. With the Docker security model, root access is a requirement to build and run containers. In a production HPC environment where users do not trust other users, this is a hard blocker.</p>
<p>Other HPC environments may differ. If you are an HPC administrator considering containers in your own environment, weigh my requirements against yours. My research was framed exclusively through these three requirements.</p>

<h2 id="charliecloud">Charliecloud&nbsp;<a class="hanchor" href="#charliecloud" aria-label="Anchor link for: Charliecloud">🔗</a></h2>
<p><a href="https://github.com/hpc/charliecloud">Charliecloud</a> is an open source project based on a user-defined software stack (UDSS). Like other unprivileged container run-times, it uses Linux user namespaces to run containers without root. It is designed to be as minimal and lightweight as possible, to the point of not adding features that could conflict with any specific use cases. This can be a positive or a negative, depending on how complex your environment is.</p>
<p>However, I abandoned my research on Charliecloud early on after reading this <a href="https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0177459">PLOS research paper</a>:</p>
<blockquote>
<p>The software makes use of kernel namespaces that are not deemed stable by multiple prominent distributions of Linux (e.g. <strong>no versions of Red Hat Enterprise Linux or compatibles support it</strong>), and may not be included in these distributions for the foreseeable future.</p>
<p>The software is emphasized for its simplicity and being less than 500 lines of code, and this is an indication of having a lack of user-driven features. The containers are not truly portable because they must be extracted from Docker and configured by an external C executable before running, and even after this step, all file ownership and permissions are dependent on the user running the workflow.</p>
<p><a href="https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0177459">Singularity: Scientific containers for mobility of compute</a>, May 2017 (Gregory M. Kurtzer, Vanessa Sochat, Michael W. Bauer)</p>
</blockquote>
<p>However, it is worth noting this paper was written in support of Singularity, by the Singularity project lead and others from the Singularity open source community. If you are conducting your own independent research, consider looking closer at Charliecloud: at the time of writing it is still actively developed, while the research paper dates to May 2017.</p>
<p><em>Edit</em>: This situation already changed and Charliecloud is probably worth a deeper look:</p>
<blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">This is a fantastic write up, one thing to mention is that abandoning the CharlieCloud research based solely on lack of support of the user kernel namespace is no longer a blocker. For example, PodMan now uses the same technology and it was released in RHEL8.</p>&mdash; Apptainer (formerly Singularity) (@SingularityApp) <a href="https://twitter.com/SingularityApp/status/1163846727700344834?ref_src=twsrc%5Etfw">August 20, 2019</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>



<h2 id="shifter">Shifter&nbsp;<a class="hanchor" href="#shifter" aria-label="Anchor link for: Shifter">🔗</a></h2>
<p><a href="https://github.com/NERSC/shifter">Shifter</a> is another container run-time implementation focused on HPC users. At time of writing, it is almost exclusively backed by the <a href="https://www.nersc.gov/">National Energy Research Scientific Computing Center</a> and <a href="https://www.cray.com/">Cray</a>. Most documented use cases use <a href="https://slurm.schedmd.com/">Slurm</a> for cluster management / job scheduling. Instead of a Docker/OCI format, it uses its own Shifter-specific format, which is backwards-compatible with Docker container images. It requires hosting a registry service and a <strong>Shifter Image Gateway</strong>.</p>
<p>The Shifter Image Gateway is a REST interface implemented with <a href="https://palletsprojects.com/p/flask/">Python Flask</a>. It pulls images from the registry service and converts them to the Shifter image format. MPI integration is supported but its implementation is MPICH-centric.</p>
<p>The downside to Shifter is its lack of community. Few organizations other than NERSC and Cray appear to support Shifter. <a href="https://github.com/NERSC/shifter/tree/master/doc">Documentation exists</a>, but at writing time (July 2019), the last significant contribution to it was in April 2018. Some bugs and feature requests are triaged, but there is not much of a maintainer presence in these issues. Most follow-up discussion on new issues comes from a handful of outside contributors without commit access.</p>
<p>Additionally, there are several signs of stagnant development, such as <a href="https://github.com/NERSC/shifter/pull/172">NERSC/shifter#172</a> to add better MPI integration, a PR that has stalled since it was first opened in April 2017. Furthermore, there is a high bus factor: most contributions and pull requests come from the same two developers, indicating low engagement from the wider HPC community. Code is <a href="https://travis-ci.org/NERSC/shifter">regularly tested</a>, but integration tests <a href="https://travis-ci.org/NERSC/shifter/jobs/541868408#L880-L969">only exist for Slurm</a>. For more details, check out the <a href="https://github.com/NERSC/shifter/pulse">GitHub project pulse</a>.</p>
<p>A detail worth noting: Shifter was one of the first real container run-times for HPC. A former Shifter collaborator branched off to start Singularity (and eventually a for-profit company, Sylabs, to support it). This history invites personal bias when evaluating Shifter and Singularity, especially if you are not a newcomer to the HPC community.</p>

<h2 id="singularity">Singularity&nbsp;<a class="hanchor" href="#singularity" aria-label="Anchor link for: Singularity">🔗</a></h2>
<p><a href="https://sylabs.io/singularity/">Singularity</a> is the third and last HPC-specific player in the container run-time world. The vendor is <a href="https://sylabs.io/about-us/mission">Sylabs Inc</a>. There are a few different factors that make Singularity interesting, and in my opinion, the most promising HPC container implementation.</p>

<h3 id="general-overview">General overview&nbsp;<a class="hanchor" href="#general-overview" aria-label="Anchor link for: General overview">🔗</a></h3>
<p>Singularity v3.x.x is written almost entirely in Golang. It supports two image formats: Docker/OCI and Singularity&rsquo;s native Single Image Format (SIF). As of September 2018, there are an estimated 25,000+ systems running Singularity, including users like <a href="https://www.tacc.utexas.edu/">TACC</a>, <a href="https://www.sdsc.edu/">San Diego Supercomputer Center</a>, and <a href="https://www.ornl.gov/">Oak Ridge National Laboratory</a>. Additionally, Univa <a href="http://www.univa.com/about/news/press_2018/07312018.php">announced a partnership</a> with Sylabs in July 2018 to bring Singularity workflows to Univa GridEngine.</p>
<p>Sylabs offers Singularity (free and open source) and SingularityPRO (paid and proprietary). The commercial version comes with a support contract and long-term support for some releases (among other things).</p>
<p>Admin/root access is not required to run Singularity containers, and no additional configuration is needed to do this out of the box. Containers run under the Linux user ID that launches them (see <em><a href="https://sylabs.io/guides/2.6/user-guide/introduction.html#security-and-privilege-escalation">Security and privilege escalation</a></em>).</p>
<p>At a quick glance, Sylabs developers appear to be <a href="https://github.com/sylabs">actively engaged</a> in the Kubernetes development community, particularly around Red Hat technology. They also seem to keep their promises: in early 2018, blog posts made ambitious feature promises for the then-upcoming v3.0.0 release at the end of the year. Near the end of 2018, the release was delivered on time with most, if not all, of the promised functionality.</p>

<h3 id="image-formats">Image formats&nbsp;<a class="hanchor" href="#image-formats" aria-label="Anchor link for: Image formats">🔗</a></h3>
<p>The Singularity Image Format (SIF) is a single-image format (i.e. no layers involved). This was a design decision made specifically for HPC workloads. A SIF is treated like a binary executable by the Linux user. Additionally, it is possible to create SIFs using the <a href="https://sylabs.io/guides/3.3/user-guide/definition_files.html#sections">Definition File</a> spec.</p>
<p>However, Singularity is also compatible with Docker/OCI images and OCI is given <a href="https://github.com/sylabs/singularity/labels/OCI">active development focus</a> by upstream Singularity. Docker/OCI images are converted on-the-fly to a SIF. Docker/OCI images can be used locally or pulled from a remote registry like Docker Hub or <a href="https://www.openshift.com/products/quay">Quay</a>. To the user, if using a Docker/OCI image, the conversion is seamless and does not require additional configuration to use.</p>
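<p>For example, pulling a Docker Hub image with the Singularity 3.x CLI converts it to a SIF on the fly (the image name here is only an illustration):</p>
<pre tabindex="0"><code>$ singularity pull docker://alpine:latest   # writes alpine_latest.sif
$ singularity exec alpine_latest.sif cat /etc/os-release
</code></pre>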
<p>See <a href="https://web.archive.org/web/20190726223349/https://archive.sylabs.io/2018/03/sif-containing-your-containers/">this Sylabs blog post</a> for a deeper dive on how SIFs were designed.</p>

<h3 id="flexible-configuration">Flexible configuration&nbsp;<a class="hanchor" href="#flexible-configuration" aria-label="Anchor link for: Flexible configuration">🔗</a></h3>
<p>Singularity (uniquely?) offers advanced configuration options for HPC administrators. Some highlights are detailed here:</p>
<ul>
<li><strong>Controlling bind mounts</strong>:
<ul>
<li><code>mount dev = minimal</code>: Only binds <code>null</code>, <code>zero</code>, <code>random</code>, <code>urandom</code>, and <code>shm</code> into container</li>
<li><code>mount home = {yes,no}</code>, <code>mount tmp = {yes,no}</code>: Choose to enable or disable these bind mounts globally</li>
<li><code>bind path = &quot;&quot;</code>: Bind specific paths into containers by default</li>
<li><code>user bind control = {yes,no}</code>: Allow users to include their own bind mount paths or limit it to an admin-approved set of paths (above)</li>
</ul>
</li>
<li><strong>Controlling containers</strong>:
<ul>
<li><code>limit container paths =</code>: Possible to limit SIFs provided at a specific path and nowhere else</li>
</ul>
</li>
</ul>
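<p>Put together, a <code>singularity.conf</code> restricting users to an admin-approved set of mounts might look like this sketch (keys taken from the list above; the paths and values are hypothetical and defaults vary by Singularity version):</p>
<pre tabindex="0"><code># singularity.conf (illustrative excerpt)
mount dev = minimal
mount home = no
mount tmp = yes
bind path = /scratch
bind path = /opt/shared-data
user bind control = no
limit container paths = /opt/approved-sifs
</code></pre>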

<h3 id="hpc-community-engagement">HPC community engagement&nbsp;<a class="hanchor" href="#hpc-community-engagement" aria-label="Anchor link for: HPC community engagement">🔗</a></h3>
<p>These notes only apply to Singularity free, not the proprietary SingularityPRO product.</p>
<p>The signals from their open source community engagement are positive and strong. They appear authentic and genuine to an <a href="https://sylabs.io/resources/community">open source commitment</a> (i.e. not <a href="https://blogs.gnome.org/bolsh/2010/07/19/rotten-to-the-open-core/">open-core business model</a>). This is demonstrated in a few ways:</p>
<p>First, they have <a href="https://sylabs.io/guides/3.3/user-guide/">thorough user documentation</a>, intended for end-users in HPC environments using Singularity. They have a less thorough but still useful <a href="https://sylabs.io/guides/3.2/admin-guide/">admin documentation</a>.</p>
<p>Second, all issues are triaged quickly and get feedback from core developers or outside contributors at a consistent pace. Pull requests don&rsquo;t stagnate either: the oldest PR is less than six months old.</p>
<p>Third, code is regularly tested (<a href="https://travis-ci.org/sylabs/singularity">1</a>, <a href="https://circleci.com/gh/sylabs/singularity/tree/master">2</a>). The code generally follows <a href="https://goreportcard.com/report/github.com/sylabs/singularity">best practices</a> (i.e. it is not atrocious to work with).</p>
<p>Fourth, there are also a handful of active contributors (both developers and in the community support channels) who come from outside of Sylabs, which indicates more engagement by a wider audience of people.</p>
<p>For more statistics, check out the <a href="https://github.com/sylabs/singularity/pulse">GitHub project pulse</a>.</p>

<h2 id="podman">Podman&nbsp;<a class="hanchor" href="#podman" aria-label="Anchor link for: Podman">🔗</a></h2>
<p><em>tl;dr</em>: Podman is an underdog that shows promise, but likely needs another year or two for most HPC use cases.</p>
<p><a href="https://podman.io/">Podman</a> is a container run-time developed by Red Hat. Its primary goal is to be a drop-in replacement for Docker. While it is not explicitly designed with HPC use cases in mind, it is intended as a lightweight &ldquo;wrapper&rdquo; to run containers without the overhead of the full Docker daemon. Furthermore, the Podman development team has recently been looking into better support for HPC use cases.</p>
<p>Podman currently falls short for HPC use cases for a few reasons:</p>
<ol>
<li><a href="https://github.com/containers/libpod/issues/3478">Missing support for parallel filesystems</a> (e.g. <a href="https://en.wikipedia.org/wiki/IBM_Spectrum_Scale">IBM Spectrum Scale</a>)</li>
<li>Rootless Podman was designed to <a href="https://github.com/containers/libpod/blob/master/rootless.md">use kernel user namespaces</a> which is <a href="https://github.com/containers/libpod/issues/3561">not compatible with most parallel filesystems</a> (might change in a year or two)</li>
<li><a href="https://github.com/containers/libpod/issues/3587">Not yet possible to set system site policy defaults</a></li>
<li><a href="https://github.com/containers/libpod/issues/3589">Pulling Docker/OCI images requires multiple subuids/subgids</a> (might change in a year or two)</li>
</ol>
<p>Where Podman does shine is in providing a way to run <strong><em>and</em></strong> build containers without root access or <code>setuid</code>.</p>
<p>The challenges Podman faces in running OCI containers in an HPC environment are the same ones Singularity faces in building SIF images without root in that environment: <strong>mapping UIDs to subuids/subgids on the compute nodes</strong>. More interestingly, <strong><a href="https://buildah.io/">Buildah</a></strong> offers a promising way for users to build Docker/OCI container images entirely without root. It is plausible to use Buildah as the container image delivery mechanism and swap out the container run-time implementation (Podman vs. Singularity) depending on specific needs and requirements.</p>
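<p>For context, the subordinate ID ranges in question live in <code>/etc/subuid</code> and <code>/etc/subgid</code> on each node. A minimal sketch, with hypothetical users and ranges:</p>
<pre tabindex="0"><code># /etc/subuid (same format as /etc/subgid): user:start:count
alice:100000:65536
bob:165536:65536
</code></pre>
<p>Rootless run-times map container UIDs/GIDs into these per-user ranges, which is why every compute node a job can land on needs a consistent mapping.</p>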

<h2 id="what-do-you-think">What do you think?&nbsp;<a class="hanchor" href="#what-do-you-think" aria-label="Anchor link for: What do you think?">🔗</a></h2>
<p>I hope other folks out there in the HPC world find this preliminary research useful. Do you agree or disagree with any parts of this write-up? Is something out-of-date? Drop a comment down below.</p>]]></description></item><item><title>Introducing InfluxDB: Time-series database stack</title><link>https://jwheel.org/blog/2017/08/influxdb-time-series-database/</link><pubDate>Tue, 15 Aug 2017 00:00:00 +0000</pubDate><guid>https://jwheel.org/blog/2017/08/influxdb-time-series-database/</guid><description><![CDATA[<p><a href="https://opensource.com/article/17/8/influxdb-time-series-database-stack"><em>Article originally published on Opensource.com.</em></a></p>
<hr>
<p>The needs and demands of infrastructure environments change every year. With time, systems become more complex and involved. But when infrastructure grows and becomes more complex, it&rsquo;s meaningless if we don&rsquo;t understand it and what&rsquo;s happening in our environment. This is why monitoring tools and software are often used in these environments, so operators and administrators can see problems and fix them in real time. But what if we want to predict problems before they happen? Collecting metrics and data about our environment gives us a window into how our infrastructure is performing and lets us make predictions based on data. When we know and understand what&rsquo;s happening, we can prevent problems before they happen.</p>
<p>But how do we collect and store this data? For example, if we want to collect data on the CPU usage of 100 machines every ten seconds, we&rsquo;re generating a lot of data. On top of that, what if each machine is running fifteen containers? What if you want to generate data about each of those individual containers too? What about per process? This is where time-series data becomes helpful. Time-series databases store time-series data. But what does that mean? We&rsquo;ll explain all of this and more and introduce you to InfluxDB, an open source time-series database. By the end of this article, you will understand…</p>
<ul>
<li>What time-series data / databases are</li>
<li>Quick introduction to InfluxDB and the TICK stack</li>
<li>How to install InfluxDB and other tools</li>
</ul>

<h2 id="introducing-time-series-concepts">Introducing time-series concepts&nbsp;<a class="hanchor" href="#introducing-time-series-concepts" aria-label="Anchor link for: Introducing time-series concepts">🔗</a></h2>
<p>
<figure>
  <img src="/blog/2017/07/rbdms-table-example.gif" alt="Example of table, or how a RDBMS like MySQL stores data" loading="lazy">
  <figcaption>Example of table, or how a RDBMS like MySQL stores data. Image from DevShed (<a href="http://www.devshed.com/c/a/php/using-the-active-record-pattern-with-php-and-mysql/" class="bare">http://www.devshed.com/c/a/php/using-the-active-record-pattern-with-php-and-mysql/</a>).</figcaption>
</figure>
</p>
<p>If you&rsquo;re familiar with relational database management software (RDBMS), like MySQL, <a href="http://www.informit.com/articles/article.aspx?p=377067&amp;seqNum=3">tables, columns, and primary keys</a> are familiar terms. Everything is like a spreadsheet, with columns and rows. Some data might be unique, other parts might be the same as other rows. RDBMSs like MySQL are widely used and are great for <strong>reliable transactions</strong> that follow <a href="https://en.wikipedia.org/wiki/ACID">ACID</a> (Atomicity, Consistency, Isolation, Durability) compliance.</p>
<p>With relational database software, you&rsquo;re usually working with data that you could model in a table. You might update certain data by overwriting and replacing it. But what if you&rsquo;re collecting data on something that generates a lot of data and you want to watch it change over time? Take a self-driving car. The car is constantly collecting information about its environment. It takes this data and analyzes changes over time to behave correctly. The amount of data might be tens of gigabytes an hour. While you could use a relational database to collect this data, it&rsquo;s not built for this. When it comes to scaling and usability of the data you&rsquo;re collecting, an RDBMS isn&rsquo;t the best tool for the job.</p>

<h4 id="why-time-series-is-a-good-fit">Why time-series is a good fit&nbsp;<a class="hanchor" href="#why-time-series-is-a-good-fit" aria-label="Anchor link for: Why time-series is a good fit">🔗</a></h4>
<p>And this is where time-series data makes sense. Let&rsquo;s say you&rsquo;re collecting data about city traffic, temperature readings from farming equipment, or the production rate of an assembly line. Instead of going into a table with rows and columns, imagine pushing rows of data that are uniquely sorted by a timestamp. This visual might help:</p>
<p>
<figure>
  <img src="/blog/2017/07/picture-the-cloud.gif" alt="Imagine rows and rows of data, uniquely sorted by timestamps" loading="lazy">
  <figcaption>Imagine rows and rows of data, uniquely sorted by timestamps. Image from Timescale (<a href="https://blog.timescale.com/what-the-heck-is-time-series-data-and-why-do-i-need-a-time-series-database-dcf3b1b18563" class="bare">https://blog.timescale.com/what-the-heck-is-time-series-data-and-why-do-i-need-a-time-series-database-dcf3b1b18563</a>).</figcaption>
</figure>
</p>
<p>Having the data in this format makes it easier to track and watch change over time. When data accumulates, you can see how something behaved in the past, how it&rsquo;s behaving now, and how it might behave in the future. Your options to make smarter data decisions expand!</p>
<p>Curious how the data is stored and formatted? It depends on the time-series database (TSDB) you use. InfluxDB stores the data in the <a href="https://docs.influxdata.com/influxdb/v1.3/write_protocols/line_protocol_tutorial/">Line Protocol</a> format. <a href="https://docs.influxdata.com/influxdb/v1.3/tools/api/#query">Queries</a> return the data in JSON.</p>
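<p>As a rough sketch of the Line Protocol shape (a point is a measurement, optional tags, fields, and a nanosecond timestamp), here is a hypothetical helper; real Line Protocol has richer type and escaping rules than this shows:</p>

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Assemble one simplified InfluxDB Line Protocol point.

    Real Line Protocol also escapes commas/spaces and distinguishes
    field value types (integers, strings, booleans); this sketch does not.
    """
    # Tags and fields are sorted only so output is deterministic.
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_str} {field_str} {timestamp_ns}"

point = to_line_protocol(
    "cpu",
    {"host": "server01", "region": "us-west"},
    {"usage_idle": 92.6},
    1465839830100400200,
)
print(point)  # cpu,host=server01,region=us-west usage_idle=92.6 1465839830100400200
```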
<p>
<figure>
  <img src="/blog/2017/07/influxdb-data-format.jpg" alt="How InfluxDB stores time-series data in Line Protocol" loading="lazy">
  <figcaption>How InfluxDB stores time-series data in Line Protocol (<a href="https://docs.influxdata.com/influxdb/v1.3/write_protocols/line_protocol_tutorial/" class="bare">https://docs.influxdata.com/influxdb/v1.3/write_protocols/line_protocol_tutorial/</a>). Image from Roberto Gaudenzi (<a href="https://www.slideshare.net/RobertoGaudenzi1/introduction-to-influx-db" class="bare">https://www.slideshare.net/RobertoGaudenzi1/introduction-to-influx-db</a>).</figcaption>
</figure>
</p>
<p>If you&rsquo;re still confused or trying to understand time-series data or why you would want to use it over another solution, you can read an excellent, in-depth explanation from <a href="https://blog.timescale.com/what-the-heck-is-time-series-data-and-why-do-i-need-a-time-series-database-dcf3b1b18563">Timescale&rsquo;s blog</a> or <a href="https://www.influxdata.com/modern-time-series-platform/">InfluxData&rsquo;s blog</a>.</p>

<h2 id="influxdb-a-time-series-database">InfluxDB: A time-series database&nbsp;<a class="hanchor" href="#influxdb-a-time-series-database" aria-label="Anchor link for: InfluxDB: A time-series database">🔗</a></h2>
<p><a href="https://www.influxdata.com/time-series-platform/influxdb/">InfluxDB</a> is an open source time-series database software developed by <a href="https://www.influxdata.com/">InfluxData</a>. It&rsquo;s written in Go (a compiled language), which means you can start using it without installing any dependencies. It supports multiple data ingestion protocols, such as <a href="https://www.influxdata.com/time-series-platform/telegraf/">Telegraf</a> (also from InfluxData), <a href="https://graphiteapp.org/">Graphite</a>, <a href="https://collectd.org/">collectd</a>, and <a href="http://opentsdb.net/">OpenTSDB</a>. This leaves you with flexible options for how you want to collect data and where you&rsquo;re pulling it from. It&rsquo;s also one of the <a href="https://db-engines.com/en/ranking/time&#43;series&#43;dbms">fastest-growing</a> time-series database software available. You can find the source code for InfluxDB on <a href="https://github.com/influxdata/influxdb">GitHub</a>.</p>
<p>This article will focus on three tools in InfluxData&rsquo;s TICK stack that show how you can build a time-series database and begin collecting and processing data.</p>

<h4 id="tick-stack">TICK stack&nbsp;<a class="hanchor" href="#tick-stack" aria-label="Anchor link for: TICK stack">🔗</a></h4>
<p>InfluxData creates a platform based on four open source projects that work and play well with each other for time-series data. When used together, you can collect, store, process, and view the data easily. The four pieces of the platform are known as the <a href="https://www.influxdata.com/time-series-platform/">TICK stack</a>. This stands for…</p>
<ul>
<li><strong><em>T</em>elegraf</strong>: Plugin-driven server agent for collecting / reporting metrics</li>
<li><strong><em>I</em>nfluxDB</strong>: Scalable data store for metrics, events, and real-time analytics</li>
<li><strong><em>C</em>hronograf</strong>: Monitoring / visualization UI for the TICK stack (not covered in this article)</li>
<li><strong><em>K</em>apacitor</strong>: Framework for processing, monitoring, and alerting on time-series data</li>
</ul>
<p>These tools work and integrate well with the other pieces by design. However, it&rsquo;s also easy to substitute one piece out for another tool of your choice. For this article, we&rsquo;ll explore three parts of the TICK stack: InfluxDB, Telegraf, and Kapacitor.</p>
<p>
<figure>
  <img src="/blog/2017/07/tick-stack-diagram.png" alt="Diagram of how the different components of the InfluxDB TICK stack connect with each other" loading="lazy">
  <figcaption>Diagram of how the different components of the TICK stack connect with each other. From influxdata.com (<a href="https://www.influxdata.com/time-series-platform/" class="bare">https://www.influxdata.com/time-series-platform/</a>).</figcaption>
</figure>
</p>

<h4 id="influxdb"><a href="https://docs.influxdata.com/influxdb/">InfluxDB</a>&nbsp;<a class="hanchor" href="#influxdb" aria-label="Anchor link for: InfluxDB">🔗</a></h4>
<p>As mentioned before, InfluxDB is the time-series database (TSDB) of the TICK stack. Data collected from your environment is stored into InfluxDB. There are a few things that stand out about InfluxDB from other time-series databases.</p>

<h6 id="emphasis-on-performance">Emphasis on performance&nbsp;<a class="hanchor" href="#emphasis-on-performance" aria-label="Anchor link for: Emphasis on performance">🔗</a></h6>
<p>InfluxDB is designed with performance as one of the top priorities. This allows you to use data quickly and easily, even under heavy loads. To do this, InfluxDB focuses on quickly ingesting the data and using compression to keep it manageable. To query and write data, it uses an HTTP(S) API.</p>
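<p>Assuming a local InfluxDB 1.x instance on its default port 8086 with a database named <code>mydb</code> (both assumptions), a write and a query through the HTTP API look roughly like this:</p>
<pre tabindex="0"><code># Write one point
curl -XPOST 'http://localhost:8086/write?db=mydb' \
  --data-binary 'cpu_load,host=server01 value=0.64'

# Query it back (results return as JSON)
curl -G 'http://localhost:8086/query?db=mydb' \
  --data-urlencode 'q=SELECT "value" FROM "cpu_load"'
</code></pre>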
<p>The performance notes are noteworthy considering the amount of data InfluxDB is capable of handling: it can ingest up to a million points of data per second, with timestamp precision down to the nanosecond.</p>

<h6 id="sql-like-queries">SQL-like queries&nbsp;<a class="hanchor" href="#sql-like-queries" aria-label="Anchor link for: SQL-like queries">🔗</a></h6>
<p>If you&rsquo;re familiar with SQL-like syntax, querying data from InfluxDB will feel familiar. It uses its own SQL-like syntax, <a href="https://docs.influxdata.com/influxdb/v1.3/query_language">InfluxQL</a>, for queries. As an example, imagine you&rsquo;re collecting data on used disk space on a machine. If you wanted to see that data, you could write a query that might look like this.</p>
<pre tabindex="0"><code>SELECT mean(diskspace_used) AS mean_disk_used
FROM disk_stats
WHERE time &gt;= now() - 90d
GROUP BY time(10d)
</code></pre><p>If you&rsquo;re familiar with SQL syntax, this won&rsquo;t feel too different. The above statement pulls the mean values of used disk space from the last three months (90 days) and groups them into ten-day buckets.</p>

<h6 id="downsampling--data-retention">Downsampling / data retention&nbsp;<a class="hanchor" href="#downsampling--data-retention" aria-label="Anchor link for: Downsampling / data retention">🔗</a></h6>
<p>When working with large amounts of data, storage becomes a concern. Over time, data can accumulate to huge sizes. With InfluxDB, you can <strong>downsample</strong> data into less precise but smaller metrics that you can store for longer periods of time. <strong>Data retention policies</strong> enable you to do this.</p>
<p>For example, pretend you have sensors collecting data on RAM usage across a number of machines. You might collect metrics on the amount of memory in use by multiple users, the system, cached memory, and more. While it might make sense to hang on to that data for thirty days to watch what&rsquo;s happening, after thirty days you might not need that precision. Instead, you might only want the ratio of total memory to memory in use. Using data retention policies, you can tell InfluxDB to keep the precise data for all the different usages for thirty days. After thirty days, you can average the data into something less precise, and hold on to it for six months, forever, or however long you like. This compromise balances keeping historical data against reducing disk usage.</p>
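<p>In InfluxQL, that setup might look roughly like the following (the database, policy, and measurement names are hypothetical; syntax per InfluxDB 1.x):</p>
<pre tabindex="0"><code>-- Keep raw memory metrics for 30 days (the default policy)
CREATE RETENTION POLICY "raw_30d" ON "telemetry" DURATION 30d REPLICATION 1 DEFAULT

-- Keep downsampled data for roughly six months
CREATE RETENTION POLICY "daily_6mo" ON "telemetry" DURATION 26w REPLICATION 1

-- Continuously average raw points into the longer-lived policy
CREATE CONTINUOUS QUERY "cq_mem_daily" ON "telemetry" BEGIN
  SELECT mean("used_percent") AS "used_percent"
  INTO "daily_6mo"."mem"
  FROM "mem"
  GROUP BY time(1d)
END
</code></pre>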

<h4 id="telegraf"><a href="https://docs.influxdata.com/telegraf/">Telegraf</a>&nbsp;<a class="hanchor" href="#telegraf" aria-label="Anchor link for: Telegraf">🔗</a></h4>
<p>If InfluxDB is where all of your data is going, you need a way to collect and gather the data first. Telegraf is a metric collection daemon that gathers various metrics from system components, IoT sensors, and more. It&rsquo;s <a href="https://github.com/influxdata/telegraf">open source</a> and written completely in Go. Like InfluxDB, Telegraf is also written by the InfluxData team and is built to work with InfluxDB. It also includes support for different databases, such as MySQL / MariaDB, MongoDB, Redis, and more. You can read more about it on <a href="https://www.influxdata.com/time-series-platform/telegraf/">InfluxData&rsquo;s website</a>.</p>
<p>Telegraf is modular and heavily based on plugins. This means that Telegraf is either lean and minimal or as full and complex as you need it. Out of the box, it supports over a hundred plugins for various input sources. This includes Apache, Ceph, Docker, IPTables, Kubernetes, NGINX, and Varnish, just to name a few. You can see all the plugins, including processing and output plugins in their <a href="https://github.com/influxdata/telegraf#input-plugins">README</a>.</p>
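<p>Telegraf&rsquo;s configuration lives in a single file, usually <em>/etc/telegraf/telegraf.conf</em>, where you enable only the plugins you want. A minimal sketch that collects CPU and memory metrics and ships them to a local InfluxDB might look like this (the database name is an assumption for illustration):</p>
<pre tabindex="0"><code>[[outputs.influxdb]]
  urls = ["http://localhost:8086"]
  database = "telegraf_metrics"

[[inputs.cpu]]
  percpu = true
  totalcpu = true

[[inputs.mem]]
</code></pre><p>Adding another input source is just a matter of adding another <em>[[inputs.*]]</em> section, which is what keeps Telegraf as lean or as full as you need.</p>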
<p>Even if you&rsquo;re not using InfluxDB as a data store, you may find Telegraf useful as a way to collect this data and information about your systems or sensors.</p>

<h4 id="kapacitor"><a href="https://docs.influxdata.com/kapacitor/">Kapacitor</a>&nbsp;<a class="hanchor" href="#kapacitor" aria-label="Anchor link for: Kapacitor">🔗</a></h4>
<p>Now we have a way to collect and store our data. But what about doing things with it? Kapacitor is the piece of the stack that lets you process and work with the data in a few different ways. It supports both stream and batch data. Stream data means you can actively work and shape the data in real-time, even before it makes it to your data store. Batch data means you retroactively perform actions on samples, or batches, of the data.</p>
<p>One of the biggest pluses for Kapacitor is that it enables you to have real-time alerts for events happening in your environment. CPU usage overloading or temperatures too high? You can set up several different alert systems, including but not limited to email, triggering a command, Slack, HipChat, OpsGenie, and many more. You can see the full list in the <a href="https://docs.influxdata.com/kapacitor/v1.3/nodes/alert_node/">documentation</a>.</p>
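<p>Alerts are defined in Kapacitor&rsquo;s own scripting language, TICKscript. A minimal sketch of a CPU alert on streaming data (assuming Telegraf&rsquo;s default <em>cpu</em> measurement is feeding Kapacitor) might look like this:</p>
<pre tabindex="0"><code>stream
    |from()
        .measurement('cpu')
    |alert()
        .crit(lambda: "usage_idle" &lt; 10)
        .log('/tmp/cpu_alert.log')
</code></pre><p>This fires a critical alert whenever idle CPU drops below ten percent; swapping <em>.log()</em> for <em>.email()</em> or <em>.slack()</em> changes where the alert goes.</p>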
<p>Like the previous tools, Kapacitor is also <a href="https://github.com/influxdata/kapacitor">open source</a> and you can read more about the project in their <a href="https://github.com/influxdata/kapacitor/blob/master/README.md">README</a>.</p>

<h2 id="installing-the-tick-stack">Installing the TICK stack&nbsp;<a class="hanchor" href="#installing-the-tick-stack" aria-label="Anchor link for: Installing the TICK stack">🔗</a></h2>
<p>Packages are available for nearly every distribution. You can install these packages from the command line. Use the instructions for your distribution.</p>

<h4 id="fedora">Fedora&nbsp;<a class="hanchor" href="#fedora" aria-label="Anchor link for: Fedora">🔗</a></h4>
<pre tabindex="0"><code>sudo dnf install https://dl.influxdata.com/influxdb/releases/influxdb-1.3.1.x86_64.rpm \
https://dl.influxdata.com/telegraf/releases/telegraf-1.3.4-1.x86_64.rpm \
https://dl.influxdata.com/kapacitor/releases/kapacitor-1.3.1.x86_64.rpm
</code></pre>
<h4 id="centos-7--rhel-7">CentOS 7 / RHEL 7&nbsp;<a class="hanchor" href="#centos-7--rhel-7" aria-label="Anchor link for: CentOS 7 / RHEL 7">🔗</a></h4>
<pre tabindex="0"><code>sudo yum install https://dl.influxdata.com/influxdb/releases/influxdb-1.3.1.x86_64.rpm \
https://dl.influxdata.com/telegraf/releases/telegraf-1.3.4-1.x86_64.rpm \
https://dl.influxdata.com/kapacitor/releases/kapacitor-1.3.1.x86_64.rpm
</code></pre>
<h4 id="ubuntu--debian">Ubuntu / Debian&nbsp;<a class="hanchor" href="#ubuntu--debian" aria-label="Anchor link for: Ubuntu / Debian">🔗</a></h4>
<pre tabindex="0"><code>wget https://dl.influxdata.com/influxdb/releases/influxdb_1.3.1_amd64.deb \
https://dl.influxdata.com/telegraf/releases/telegraf_1.3.4-1_amd64.deb \
https://dl.influxdata.com/kapacitor/releases/kapacitor_1.3.1_amd64.deb
sudo dpkg -i influxdb_1.3.1_amd64.deb telegraf_1.3.4-1_amd64.deb kapacitor_1.3.1_amd64.deb
</code></pre>
<h4 id="other-distributions">Other distributions&nbsp;<a class="hanchor" href="#other-distributions" aria-label="Anchor link for: Other distributions">🔗</a></h4>
<p>For help with other distributions, see the <a href="https://portal.influxdata.com/downloads">Downloads</a> page.</p>

<h2 id="see-the-data-be-the-data">See the data, be the data&nbsp;<a class="hanchor" href="#see-the-data-be-the-data" aria-label="Anchor link for: See the data, be the data">🔗</a></h2>
<p>Now that you have the tools installed, you can experiment with them. There&rsquo;s plenty of upstream documentation on all three projects. You can find the docs here:</p>
<ul>
<li><a href="https://docs.influxdata.com/influxdb/">InfluxDB documentation</a></li>
<li><a href="https://docs.influxdata.com/telegraf/">Telegraf documentation</a></li>
<li><a href="https://docs.influxdata.com/kapacitor/">Kapacitor documentation</a></li>
</ul>
<p>Additionally, for more help, you can visit the <a href="https://community.influxdata.com/">InfluxData community forums</a>. Happy hacking!</p>]]></description></item><item><title>Introduction to Kubernetes with Fedora</title><link>https://jwheel.org/blog/2017/07/introduction-kubernetes-fedora/</link><pubDate>Mon, 03 Jul 2017 00:00:00 +0000</pubDate><guid>https://jwheel.org/blog/2017/07/introduction-kubernetes-fedora/</guid><description><![CDATA[<p><em><strong>This article was originally published <a href="https://fedoramagazine.org/introduction-kubernetes-fedora/">on the Fedora Magazine</a>.</strong></em></p>
<hr>
<p><em>This article is part of a short series that introduces Kubernetes. This beginner-oriented series covers some higher level concepts and gives examples of using Kubernetes on Fedora.</em></p>
<hr>
<p>The information technology world changes daily, and the demands of building scalable infrastructure become more important. Containers aren&rsquo;t anything new these days, and have various uses and implementations. But what about building scalable, containerized applications? By themselves, Docker and other tools don&rsquo;t quite cut it when it comes to building the infrastructure to support containers. How do you deploy, scale, and manage containerized applications in your infrastructure? This is where tools such as Kubernetes come in. <a href="https://kubernetes.io/">Kubernetes</a> is an open source system that automates deployment, scaling, and management of containerized applications. Kubernetes was originally developed by Google before being donated to the <a href="https://en.wikipedia.org/wiki/Linux_Foundation#Cloud_Native_Computing_Foundation">Cloud Native Computing Foundation</a>, a project of the <a href="https://www.linuxfoundation.org/">Linux Foundation</a>. This article gives a quick precursor to what Kubernetes is and what some of the buzzwords really mean.</p>

<h2 id="what-is-kubernetes">What is Kubernetes?&nbsp;<a class="hanchor" href="#what-is-kubernetes" aria-label="Anchor link for: What is Kubernetes?">🔗</a></h2>
<p>Kubernetes simplifies and automates the process of deploying containerized applications at scale. Just like Ansible <a href="https://fedoramagazine.org/using-ansible-provision-vagrant-boxes/">orchestrates software</a>, Kubernetes orchestrates deploying infrastructure that supports the software. There are various &ldquo;layers of the cake&rdquo; that make Kubernetes a strong solution for building resilient infrastructure. It also assists with making systems that can grow at scale. If your application has increasing demands such as higher traffic, Kubernetes helps grow your environment to support increasing demands. This is one reason why Kubernetes is helpful for building long-term solutions for complex problems (even if it&rsquo;s not complex… yet).</p>
<p>
<figure>
  <img src="https://cdn.fedoramagazine.org/wp-content/uploads/2017/06/kubernetes-high-level-design.jpg" alt="Kubernetes: The high level design" loading="lazy">
  <figcaption>Kubernetes: The high level design. Daniel Smith, Robert Bailey, Kit Merker (<a href="https://www.slideshare.net/RohitJnagal/kubernetes-intro-public-kubernetes-meetup-4212015" class="bare">https://www.slideshare.net/RohitJnagal/kubernetes-intro-public-kubernetes-meetup-4212015</a>).</figcaption>
</figure>
</p>
<p>At a high level overview, imagine three different layers.</p>
<ul>
<li><strong>Users</strong>: People who deploy or create containerized applications to run in your infrastructure</li>
<li><strong>Master(s)</strong>: Manages and schedules your software across various other machines, for example in a clustered computing environment</li>
<li><strong>Nodes</strong>: Various machines that run the application; each node runs an agent called the <em>kubelet</em></li>
</ul>
<p>These three layers are orchestrated and automated by Kubernetes. One of the key pieces of the master (not included in the visual) is <strong>etcd</strong>. etcd is a lightweight and distributed key/value store that holds configuration data. Each node&rsquo;s kubelet can access this data in etcd through an HTTP/JSON API interface. The components of communication between master and node, such as etcd, are explained <a href="https://kubernetes.io/docs/concepts/architecture/master-node-communication/">in the official documentation</a>.</p>
<p>Another important detail not shown in the diagram is that you might have many masters. In a high-availability (HA) set-up, you can keep your infrastructure resilient by having multiple masters in case one happens to go down.</p>

<h2 id="terminology">Terminology&nbsp;<a class="hanchor" href="#terminology" aria-label="Anchor link for: Terminology">🔗</a></h2>
<p>It&rsquo;s important to understand the concepts of Kubernetes before you start to play around with it. There are many core concepts in Kubernetes, such as services, volumes, secrets, daemon sets, and jobs. However, this article explains four that are helpful for the next exercise of building a mini Kubernetes cluster. The four concepts are <em>pods</em>, <em>labels</em>, <em>replica sets</em>, and <em>deployments</em>.</p>

<h4 id="pods"><a href="https://kubernetes.io/docs/concepts/workloads/pods/pod/">Pods</a>&nbsp;<a class="hanchor" href="#pods" aria-label="Anchor link for: Pods">🔗</a></h4>
<p>If you imagine Kubernetes as a Lego® castle, pods are the smallest block you can pick out. By themselves, they are the smallest unit you can deploy. The containers of an application fit into a pod. The pod can be one container, but it can also be as many as needed. Containers in a pod are unique since they share Linux namespaces and aren&rsquo;t isolated from each other. In a world before containers, this would be similar to running an application on the same host machine.</p>
<p>Because they share the same namespaces, all the containers in a pod:</p>
<ul>
<li>Share an IP address</li>
<li>Share port space</li>
<li>Find each other over <em>localhost</em></li>
<li>Communicate over the IPC namespace</li>
<li>Have access to shared volumes</li>
</ul>
<p>But what&rsquo;s the point of having pods? The main purpose of pods is to have groups of &ldquo;helping&rdquo; containers on the same namespace (co-located) and integrated together (co-managed) along with the main application container. Some examples might be logging or monitoring tools that check the health of your application, or backup tools that act when certain data changes.</p>
<p>In the big picture, containers in a single pod are always scheduled together too. However, Kubernetes doesn&rsquo;t automatically reschedule them to a new node if the node dies (more on this later).</p>
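<p>To make pods concrete, here&rsquo;s a minimal sketch of a pod manifest with a main application container and a &ldquo;helping&rdquo; sidecar container &ndash; the names and images are made up for illustration:</p>
<pre tabindex="0"><code>apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
    - name: web
      image: nginx
      ports:
        - containerPort: 80
    - name: log-helper
      image: busybox
      command: ["sh", "-c", "tail -f /dev/null"]
</code></pre><p>Both containers land on the same node, share an IP address, and can reach each other over <em>localhost</em>.</p>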

<h4 id="labels"><a href="https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/">Labels</a>&nbsp;<a class="hanchor" href="#labels" aria-label="Anchor link for: Labels">🔗</a></h4>
<p>Labels are a simple but important concept in Kubernetes. Labels are key/value pairs attached to <em>objects</em> in Kubernetes, like pods. They let you specify unique attributes of objects that actually mean something to humans. You can attach them when you create an object, and modify or add them later. Labels help you organize and select different sets of objects to interact with when performing actions inside of Kubernetes. For example, you can identify:</p>
<ul>
<li><strong>Software releases</strong>: Alpha, beta, stable</li>
<li><strong>Environments</strong>: Development, production</li>
<li><strong>Tiers</strong>: Front-end, back-end</li>
</ul>
<p>Labels are as flexible as you need them to be, and this list isn&rsquo;t comprehensive. Be creative when thinking of how to apply them.</p>
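<p>As a quick sketch, labels live in an object&rsquo;s metadata, and you can select on them with <em>kubectl</em> (the label keys and values here are illustrative):</p>
<pre tabindex="0"><code>metadata:
  name: my-app
  labels:
    environment: production
    tier: front-end
</code></pre><p>Then, to list only the production front-end pods:</p>
<pre tabindex="0"><code>kubectl get pods -l environment=production,tier=front-end
</code></pre>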

<h4 id="replica-sets"><a href="https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/">Replica sets</a>&nbsp;<a class="hanchor" href="#replica-sets" aria-label="Anchor link for: Replica sets">🔗</a></h4>
<p>Replica sets are where some of the magic begins to happen with automatic scheduling or rescheduling. Replica sets ensure that a number of pod instances (called <em>replicas</em>) are running at any moment. If your web application needs to constantly have four pods in the front-end and two in the back-end, replica sets are your insurance that those numbers are always maintained. This also makes Kubernetes great for scaling. If you need to scale up or down, change the number of replicas.</p>
<p>When reading about replica sets, you might also see <em>replication controllers</em>. They are somewhat interchangeable, but replication controllers are older, semi-deprecated, and less powerful than replica sets. The main difference is that sets work with more advanced set-based selectors &ndash; which goes back to labels. Ideally, you won&rsquo;t have to worry about this much today.</p>
<p>Even though replica sets are where the scheduling magic happens to help make your infrastructure resilient, you won&rsquo;t actually interact with them much. Replica sets are managed by deployments, so it&rsquo;s unusual to directly create or manipulate replica sets. And guess what&rsquo;s next?</p>

<h4 id="deployments"><a href="https://kubernetes.io/docs/concepts/workloads/controllers/deployment/">Deployments</a>&nbsp;<a class="hanchor" href="#deployments" aria-label="Anchor link for: Deployments">🔗</a></h4>
<p>Deployments are another important concept inside of Kubernetes. Deployments are a declarative way to deploy and manage software. If you&rsquo;re familiar with Ansible, you can compare deployments to the playbooks of Ansible. If you&rsquo;re building your infrastructure out, you want to make sure it is easily reproducible without much manual work. Deployments are the way to do this.</p>
<p>Deployments offer functionality such as revision history, so it&rsquo;s always easy to rollback changes if something doesn&rsquo;t work out. They also manage any updates you push out to your application, and if something isn&rsquo;t working, it will stop rolling out your update and revert back to the last working state. Deployments follow the mathematical property of <a href="https://en.wikipedia.org/wiki/Idempotence">idempotence</a>, which means you define your specs once and use them many times to get the same result.</p>
<p>Deployments also get into imperative and declarative ways to build infrastructure, but this explanation is a quick, fly-by overview. You can read more <a href="https://kubernetes.io/docs/concepts/workloads/controllers/deployment/">detailed information</a> in the official documentation.</p>
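<p>To tie these concepts together, here&rsquo;s a minimal sketch of a deployment that keeps four replicas of a labeled front-end pod running &ndash; the names and image are made up for illustration:</p>
<pre tabindex="0"><code>apiVersion: apps/v1
kind: Deployment
metadata:
  name: front-end
spec:
  replicas: 4
  selector:
    matchLabels:
      tier: front-end
  template:
    metadata:
      labels:
        tier: front-end
    spec:
      containers:
        - name: web
          image: nginx
</code></pre><p>Note how the deployment uses a label selector to find its pods, and the <em>replicas</em> field is what the underlying replica set maintains for you.</p>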

<h2 id="installing-on-fedora">Installing on Fedora&nbsp;<a class="hanchor" href="#installing-on-fedora" aria-label="Anchor link for: Installing on Fedora">🔗</a></h2>
<p>If you want to start playing with Kubernetes, install it and some useful tools from the Fedora repositories.</p>
<pre tabindex="0"><code>sudo dnf install kubernetes
</code></pre><p>This command provides the bare minimum needed to get started. You can also install other cool tools like <em>cockpit-kubernetes</em> (integration with <a href="http://cockpit-project.org/">Cockpit</a>) and <em>kubernetes-ansible</em> (provisioning Kubernetes with <a href="https://www.ansible.com/">Ansible</a> playbooks and roles).</p>

<h2 id="learn-more-about-kubernetes">Learn more about Kubernetes&nbsp;<a class="hanchor" href="#learn-more-about-kubernetes" aria-label="Anchor link for: Learn more about Kubernetes">🔗</a></h2>
<p>If you want to read more about Kubernetes or want to explore the concepts more, there&rsquo;s plenty of great information online. The <a href="https://kubernetes.io/docs/home/">documentation</a> provided by Kubernetes is fantastic, but there are also other helpful guides from <a href="https://www.digitalocean.com/community/tutorials/an-introduction-to-kubernetes">DigitalOcean</a> and <a href="https://blog.giantswarm.io/understanding-basic-kubernetes-concepts-i-introduction-to-pods-labels-replicas/">Giant Swarm</a>. The next article in the series will explore building a mini Kubernetes cluster on your own computer to see how it really works.</p>
<p>Questions, Kubernetes stories, or tips for beginners? Add your comments below.</p>]]></description></item><item><title>GSoC 2016 Weekly Rundown: Breaking down WordPress networks</title><link>https://jwheel.org/blog/2016/07/gsoc-2016-wordpress-networks/</link><pubDate>Sat, 02 Jul 2016 00:00:00 +0000</pubDate><guid>https://jwheel.org/blog/2016/07/gsoc-2016-wordpress-networks/</guid><description><![CDATA[<p>This week, with an <a href="https://pagure.io/jflory7-ansible/blob/master/f/playbooks/deliverables">initial playbook</a> for creating a WordPress installation in place (albeit needing polish), my next focus was the idea of creating a WordPress <a href="https://codex.wordpress.org/Create_A_Network">multi-site network</a>. A multi-site network would offer the benefit of only having to keep up a single base installation, with new sites extending from the same core of WordPress. Before making further refinements to the playbook, I wanted to investigate whether a WordPress network would be the best fit for Fedora.</p>

<h2 id="background-for-fedora">Background for Fedora&nbsp;<a class="hanchor" href="#background-for-fedora" aria-label="Anchor link for: Background for Fedora">🔗</a></h2>
<p>Understanding the background context for how WordPress fits into the needs of Fedora is important. There are two sites powered by WordPress within Fedora: the <a href="https://communityblog.fedoraproject.org/">Community Blog</a> and the <a href="https://fedoramagazine.org/">Fedora Magazine</a>. Each site uses a different domain (<a href="https://communityblog.fedoraproject.org/">communityblog.fedoraproject.org</a> and <a href="https://fedoramagazine.org/">fedoramagazine.org</a>, respectively).</p>
<p>At the moment, there are not any plans to set up or offer a blog-hosting service to contributors (and for good reason). The only two websites that would receive the benefits of a multi-site network would be the Community Blog and the Magazine. For now, the intended scale of expanding WordPress into Fedora is to these two platforms.</p>

<h2 id="setting-up-the-wordpress-network">Setting up the WordPress network&nbsp;<a class="hanchor" href="#setting-up-the-wordpress-network" aria-label="Anchor link for: Setting up the WordPress network">🔗</a></h2>
<p>To test the possibilities of using a network for our needs, I used a development CentOS 7 machine for my project testing purposes. There are some <a href="https://codex.wordpress.org/Before_You_Create_A_Network">guidelines</a> on creating networks that are worth reading before proceeding. After reading them, it was clear the approach to take was the domain method. I moved on to the <a href="https://codex.wordpress.org/Create_A_Network">installation guide</a> on the development machine.<a href="/blog/2016/07/GSoC-2016-Adding-sites-to-WordPress-network.png">
<figure>
  <img src="/blog/2016/07/GSoC-2016-Adding-sites-to-WordPress-network.png" alt="GSoC 2016 - Adding sites to WordPress network" loading="lazy">
</figure>
</a></p>
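<p>For reference, enabling a multi-site network hinges on a handful of constants in <em>wp-config.php</em>. A sketch for a sub-domain installation looks like this &ndash; the domain value is illustrative, not Fedora&rsquo;s actual configuration:</p>
<pre tabindex="0"><code>/* Enables the Network Setup screen under Tools */
define('WP_ALLOW_MULTISITE', true);

/* Added after running the network setup */
define('MULTISITE', true);
define('SUBDOMAIN_INSTALL', true);
define('DOMAIN_CURRENT_SITE', 'example.com');
define('PATH_CURRENT_SITE', '/');
define('SITE_ID_CURRENT_SITE', 1);
define('BLOG_ID_CURRENT_SITE', 1);
</code></pre>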
<p>I wanted to document the process I was following for the multi-site network, so I created a <a href="https://github.com/jflory7/logbook/blob/master/logs/gsoc/notes/multisite.md">short log file</a> of my observations and information I found as I proceeded.</p>
<p>One of the time burners of this section was picking up Apache again. A few years ago, I switched my own personal web servers to <a href="http://nginx.com/">nginx</a> from Apache. Fedora&rsquo;s infrastructure <a href="https://infrastructure.fedoraproject.org/cgit/ansible.git/tree/roles/apache">uses Apache</a> for its web servers. It took me a little longer than I had hoped to get familiar with it again, mostly with virtual hosts and SELinux contexts for WordPress media uploads. Despite the extra time it took with Apache, I feel like this will save me time later when I am working on polishing the final deliverable or working with the Apache roles available.</p>
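<p>The SELinux piece boiled down to giving Apache write access to the uploads directory. Commands along these lines cover that case (the path is an assumption for illustration):</p>
<pre tabindex="0"><code># Label the uploads directory so Apache can write to it
sudo semanage fcontext -a -t httpd_sys_rw_content_t "/var/www/wordpress/wp-content/uploads(/.*)?"
# Apply the new context to existing files
sudo restorecon -Rv /var/www/wordpress/wp-content/uploads
</code></pre>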
<p>In addition to this, I also picked out the dependencies for WordPress, such as the PHP packages needed and setting up a MariaDB database. After a while, I was able to get the WordPress network established and running on the development machine. It was convenient having a testable interface at my fingertips to work with.</p>

<h2 id="wordpress-network-conclusion">WordPress network: Conclusion?&nbsp;<a class="hanchor" href="#wordpress-network-conclusion" aria-label="Anchor link for: WordPress network: Conclusion?">🔗</a></h2>
<p>At the end of my testing and poking around, it appeared to me that there would not be an <em>easy</em> solution to using a WordPress network for Fedora. The network worked best when set up to use wildcard sub-domains, which wouldn&rsquo;t be a plausible solution for us because of the two different domains. There were more manual ways of doing it (i.e. not in the WordPress interface) with Apache virtual hosts. However, I felt it would be easier to write one playbook that handles a single WordPress installation and can be run for both sites separately (or for new sites).</p>
<p>Given that the factor of scale is two websites, I think maintaining two separate WordPress installations will be the easier method, saving time and keeping things efficient.</p>

<h2 id="this-weeks-challenges">This week&rsquo;s challenges&nbsp;<a class="hanchor" href="#this-weeks-challenges" aria-label="Anchor link for: This week&rsquo;s challenges">🔗</a></h2>
<p>This week had a late start for me on Wednesday due to traveling on a <a href="https://apps.fedoraproject.org/calendar/meeting/4373/">short vacation</a> with my family from Sunday to Tuesday. Coming back from the trip, I also have a new palette of responsibilities that I am assisting with in <a href="https://fedoraproject.org/wiki/CommOps">Community Operations</a> and <a href="https://fedoraproject.org/wiki/Marketing">Marketing</a>, following <a href="https://lists.fedoraproject.org/archives/list/commops@lists.fedoraproject.org/thread/CG5JS4DQ3G2TVA5YZX7LBOSXVNCUPTIB/">decause&rsquo;s departure</a> from Red Hat. I&rsquo;m still working on finding a healthy balance of time and focus between other important tasks I am responsible for and my project work.</p>
<p>I&rsquo;m hoping that having a full week will allow me to make further progress and continue to overcome some of the challenges that have arisen in the past few weeks.</p>

<h2 id="next-weeks-goals">Next week&rsquo;s goals&nbsp;<a class="hanchor" href="#next-weeks-goals" aria-label="Anchor link for: Next week&rsquo;s goals">🔗</a></h2>
<p>For next week, I&rsquo;m planning on focusing on my existing product and making it feel and run more like a &ldquo;Fedora playbook&rdquo;. I mostly want to avoid unnecessary effort and stay consistent by tapping into the <a href="https://infrastructure.fedoraproject.org/cgit/ansible.git/tree/roles">existing Ansible roles</a> in Fedora Infrastructure. This would make setting up an Apache web server, MySQL database, and a few other tasks more automated. It also keeps the tasks and organization consistent, since those roles are already in use across Fedora&rsquo;s infrastructure.</p>
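<p>As a sketch, reusing those roles could look something like this in a playbook &ndash; the role names below are hypothetical stand-ins, not the actual role names in Fedora&rsquo;s Ansible repository:</p>
<pre tabindex="0"><code>---
- name: Deploy a WordPress site
  hosts: wordpress_servers
  roles:
    - apache          # web server role from Fedora Infrastructure
    - mariadb_server  # database role (hypothetical name)
    - wordpress       # this project's own role for the WordPress install
</code></pre>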
<p>By next Friday, the plan is to have a more idempotent product that runs effectively and as expected in my development server. Beyond that, the next step would be to work on getting my site into a staging instance.</p>]]></description></item></channel></rss>