Data

Scrub gently: On data scrubbing in a community survey.

Tue, 15 Nov 2022 00:00:00 +0000

Recently, a data concern emerged for my team with the CHAOSS Project while I was working on a project to run a community survey. This community had never run a survey before, and it was the first notable event where the project made an explicit, structured ask for feedback from the community. As a result, this first experience was also a calibration event, so we could guide this kind of work in future years.

Survey says: What?

At some point after we opened the survey, a question emerged about how to handle an unruly response. In the ongoing responses, our data manager noticed one response that was objectively harmful. The person was strongly against the D.E.I. initiative that organized the survey. The response was written in a hostile tone, made insulting and derogatory comments about groups of people, and was entirely opposed to the project spending any time and resources on diversity, equity, and inclusion. The question put to our group was whether we would include this response in the published data, or whether we would omit it.

There were two perspectives. Some elected to remove this response from the final report and any published data. Others felt it was important to wait and see if this response would become a pattern as we ran the survey. I found myself in the second group that felt it was important to wait and see first. I want to unpack this rationale, both for future me and perhaps someone else reading.

On discarding the survey response

There were good points about removing the harmful response.

First, the response used harmful language and was likely triggering. This particular response included angry rhetoric that was reflective, to a degree, of the social and political “climate” of our world today. Including the response in our final reporting could also give it a platform, which would arguably be a harmful act. It would validate that input as acceptable. Our group was not in disagreement that the response was harmful and not behavior the community should tolerate.

Second, the response did not provide actionable insight or useful asks to the project and community. It was written in an aggressive, angry tone towards the reader and did not offer workable suggestions other than ending and divesting from all D.E.I. work immediately. Given this was not an acceptable option, there wasn’t much there for us to learn or understand about CHAOSS from this individual response. So, why include or save this response?

There is an option to ignore feedback by intentionally discarding it, but what if the individual feedback represents a larger trend?

What is community culture?

It is important to be aware of threats to community culture. What is community culture? My improvised definition is any organizational culture oriented towards the care, well-being, and thriving of others (including the self) within a single, shared community environment. Regardless of other values and goals in a project, the shared culture of the project can either lean towards a collective, communal-oriented approach or an independent, individual-oriented approach. The communal approach that prioritizes the well-being of all instead of a privileged view could also be considered as community culture. Many traditional “Open” projects skew toward a strong community culture.

On monitoring survey responses for a pattern

Coming back to the survey response, what if omitting the data leaves holes in the story of your community? If there is not just one, but several of these kinds of responses, what does that say about the community culture? Is there already a strong community culture, or are there resistance and challenges to building a more cooperative, caring environment? There is real work to do at both ends of the spectrum, but what that work might look like depends on which side you are on.

I posit that omitting the “unhappy” or harmful responses can create a dangerous blind spot to toxicity within a community culture. When it comes to direct, interpersonal interactions with others (e.g. meetings, emails, chats, etc.), stewards of the community culture need to take direct action against visible challenges and threats to the community culture. If someone starts swearing at someone in a meeting, that is a hard-to-miss action. It is visible, and anyone could observe it or even record it.

In anonymous surveys, you might find a more subtle layer of the community culture than what is shown by the actions of a small few. There can be greater trust that someone’s comments will not be tied back to their identity, so some responders may feel emboldened with their words and true opinions.

The point is that, especially in larger communities, it is worth noting negative and harmful responses and not totally ignoring them. Communities that organize in more decentralized ways will always have supporters, users, and contributors from both the core and the periphery. The core project membership may not interact or engage often with the periphery, so there can be a blind spot to parts of the project that identify with the community but are a few degrees removed from the inner ring of the project community.

Noting whether something is indicative of a larger pattern is important. If your community has a ton of jerks, you need to know that your community is full of jerks so that you don’t waste time persuading people otherwise, when the lived experience is very different.

In the original conversation with the CHAOSS Project team, this data scrubbing question emerged in the process of running the survey instead of after the data collection concluded. The survey later closed and our data manager confirmed that the flagged response from earlier was the only one of its kind. As a group, we then felt more confident in discarding that one outlier as an anomaly since the survey was open to the general public.
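For future me, here is a minimal Python sketch of the wait-and-see approach: keep every response in the raw data, count how many the data manager flagged, and only then decide what to publish. The `flagged` field, the function names, and the sample responses are all hypothetical; none of this comes from the actual CHAOSS survey data or tooling.

```python
from typing import Dict, List, Tuple


def summarize_flagged(responses: List[Dict]) -> Tuple[int, float]:
    """Return the count and share of responses marked as harmful."""
    total = len(responses)
    flagged = sum(1 for r in responses if r.get("flagged", False))
    share = flagged / total if total else 0.0
    return flagged, share


def publishable(responses: List[Dict]) -> List[Dict]:
    """Responses to publish: flagged ones are withheld, not deleted from the raw data."""
    return [r for r in responses if not r.get("flagged", False)]


# Hypothetical sample: one flagged response out of four.
sample = [
    {"id": 1, "text": "More onboarding docs, please."},
    {"id": 2, "text": "(hostile response)", "flagged": True},
    {"id": 3, "text": "Office hours help a lot."},
    {"id": 4, "text": "Meetings are hard to attend from my time zone."},
]

count, share = summarize_flagged(sample)
print(f"{count} flagged of {len(sample)} ({share:.0%})")
```

Whether a given count is an anomaly or a pattern is still a judgment call for the group; the sketch only makes sure the number is known before anything is discarded.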

Feature photo by JESHOOTS.COM on Unsplash. Modified by Justin Wheeler.

CHAOSS DEI Review: Midyear reflection

Tue, 25 Oct 2022 00:00:00 +0000

Since February 2021, the CHAOSS Project has been conducting a funded, long-term review of its governance, practices, and processes in a diversity, equity, and inclusion (D.E.I.) “audit.” I originally joined as an internal community liaison and initially helped to identify a team of D.E.I. practitioners external to the CHAOSS Project to support this work. Thanks to the support of the Ford Foundation, we are slowly approaching the two-year anniversary of when this work began.

My brief readout is a guided reflection using questions shared by Matt Germonprez. This reflects my review of our work as a team to date and also shares some of my hopeful outlooks for what our amazing team can accomplish together. This readout will cover (1) our accomplishments as a team, (2) what was expected and surprising, and (3) what we could change in the next year.

CHAOSS accomplishments & learnings

Three achievements and aspirations stand out over the past year:

Established process management and a team workflow.
Created a small but active Community of Practice (CoP).
Sharing our results with CHAOSS and the Open ecosystem.

Processes & workflow

We had to forge our own practices that worked best for our group. Photo by Jonny Gios (https://unsplash.com/@supergios?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) on Unsplash (https://unsplash.com/s/photos/forge?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText).

For direct participants of the team, the Ford Foundation funding did not come with strict requirements or success metrics. As we assembled our team, we were given the discretion of how to conduct a D.E.I. review for the project and determine the best course of doing that. This allowed for creative freedom to figure out what would work best for CHAOSS. Additionally, I could not identify a straightforward way to discover other Open communities and projects doing our kind of work. Since there were also not many other known successful models to follow, we combined our shared experiences across multiple Open communities to build our team, identify main areas of focus, and engage the community around our efforts.

This is an achievement because we collectively created an active group that makes incremental, positive changes to CHAOSS. This is a model we could share with other projects so that others can learn from our experiences.

Community of Practice

Our team is a small but engaged group of D.E.I. practitioners. We share a connection through our ongoing review of the CHAOSS Project, but we also give and take from our own personal experiences outside of CHAOSS. Our group regularly meets and discusses complex, difficult issues that are both (a) not easy to discuss openly and (b) applicable to many communities beyond only CHAOSS. Our team meetings are a safe space that promotes honest and constructive discussion centered on diversity, equity, and inclusion. In addition to our recommendations and direct efforts with CHAOSS, I often reflect on our conversations as a team when working with other Open communities. An example of this is how we built a list of questions to get a “pulse” from the community on their feelings about CHAOSS.

This is aspirational and not yet fully realized. Our team has collected a solid portfolio of stories and experiences that other communities would stand to benefit from. I consider this a current achievement because while our work does specifically look at CHAOSS, we also often reflect from a general perspective on how a topic of interest might look in other communities. When the time comes to package our findings, I believe we are setting ourselves up for easier messaging and outreach opportunities in the future.

According to expectations

While I have worked in Open Source D.E.I. communities since 2015, I have never conducted an applied research review for community D.E.I. before. I did not come into this with strong immediate expectations because it would inevitably reflect the backgrounds and strengths of the team we would assemble. However, I did have specific hopes for what this work would realize.

As expected

Data-driven approach: We began this work without a strong representation of the state of CHAOSS. What do contributors think about the project? While data is not a universal panacea, we gravitated to a community survey early on because we needed to understand the community experience better first before making serious suggestions.
Time zones are hard: Our team was spread out across North America, Africa, LATAM, and Europe. Additionally, the work with CHAOSS was also a part-time venture for most of us, in addition to primary employment. Calendars and schedules are hard to get right. Since our team’s organization was ad-hoc, momentum would occasionally slow for some periods.
We have an amazing team! I expected great things once we identified our roster. We have also had more amazing people join us over time and add new passion and insight to our focus as a group.

Surprises

Documenting our impact is not always intuitive: While we have done internal storytelling work within the CHAOSS Project, we do not have a good record of our achievements to date. Our linear progression does not lend itself easily to self-reflection and recalibration. Although much of our focus is on the CHAOSS community survey and CHAOSS Africa, we also facilitated several other notable achievements in the project in the last year. See the following examples:
- Supporting the establishment of a Code of Conduct Committee.
- Community office hours for newcomers.
- Improved, peer-to-peer onboarding experience in CHAOSS.
- Increased efforts in CHAOSS mentored projects (e.g. Outreachy and GSoC).
- Recommending changes to the project and community, like broader localization to Chinese & Spanish and establishing a D.E.I. council.
Losing and regaining steam on the survey: Although the community pulse survey was one of the earliest tasks identified in our work, launching a first survey proved to take a lot of resources from the team. We briefly stalled out on the survey effort while focused on other areas (like those listed above). While our team was able to achieve many smaller victories for CHAOSS with low-hanging fruit, it took a sustained focus and a slowdown on new topics to achieve larger contributions like the community pulse survey.

Changes for the CHAOSS team next year

Looking ahead to 2023, I hope to strengthen our efforts as a team in these areas:

Packaging our work
Dissemination of our work

Photo by Christophe Rollando (https://unsplash.com/@chrisrolls?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) on Unsplash (https://unsplash.com/s/photos/2023?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText).

Packaging

Our work stream was linearly ordered and we took a forward-looking approach. Now is a good time to look back and reflect on our results to date. What are our key findings and observations? What suggestions will we make to CHAOSS? How could other communities learn from our experience running this review? One task for us as a team is to identify key messages and themes so that dissemination into broader domains is possible.

Dissemination

Once we package our work, notes, and reflections, we should take an active approach to disseminating and sharing our work. This includes both the CHAOSS Project and a more general audience. For the CHAOSS Project, this could be a written report, presentations to the CHAOSS board, speaking at CHAOSScon, and outreach to the multiple Working Groups. For a general audience, this could include speaking at industry conferences, sharing our work with other Communities of Practice, social media, or other ways of promoting our deliverables.

4 metrics to measure sustainable open source investments.

Fri, 31 Dec 2021 00:00:00 +0000

How do we understand value when we talk about sustainability? What does investing in open source mean? The meaning is different for many people because of an implicit understanding of what open source means.

This post is a reflection on the past year in my work with the UNICEF Venture Fund. We integrated new open source tools to capture metrics and data about open source repositories connected to UNICEF portfolio companies and created a shortlist of key metrics that map to business sustainability metrics. Now, we are better positioned to look back on past, current, and upcoming portfolio companies and mentor support programs.

As we move into 2022, this post covers my current thinking on these points:

Defining investments.
How do these investments impact sustainability?
CHAOSS metrics as an open source tool for an investment lens on sustainability.
What next?

Defining investments.

When we talk about investing in open source, what do we mean? What are the known inputs? What are the expected outputs? “Investments” and “investing” are broad terms. Investments typically mean sizeable financial injections of support and growth, but can also be non-financial. Investments can also take the form of both time and energy (i.e. electricity and digital infrastructure).

The UNICEF Venture Fund provides equity-free funding for start-up companies building open source solutions of interest to UNICEF. All the start-up companies are registered companies in UNICEF program countries. As part of the Venture Fund’s location in the Office of Innovation, it is also a vehicle for UNICEF to explore frontier technology areas through the investments. When a start-up company is receiving investment from UNICEF, the company receives both funding and tailored mentorship about business and open technology.

A question I want to answer is, what is the impact of the received funding plus guided mentorship? How does this approach enable the companies to be successful after graduating? What discoveries or knowledge could be shared with others to assist the development of their own open programs?

To summarize, an investment can be financial or non-financial. Financial investments include direct funding, grants, venture capital, fellowships, or any other exchange of capital. Non-financial investments include time spent in coaching sessions, personalized content for companies, and shared digital infrastructure. Neither list is exhaustive.

How do these investments impact sustainability?

Bitergia Cauldron.io (https://cauldron.io)

Data makes introspection easier. Bitergia’s Cauldron.io was a champion tool for kickstarting an open source metrics strategy for the UNICEF Venture Fund. Its introduction as a tool opened up a wider span of data to look at. There are new opportunities to ask questions and explore growth, scale, and sustainability.

To come to a conclusion on sustainability impact, we need streamlined data to test a thesis. The Venture Fund team improved internal processes for how metrics are collected from portfolio companies. The team is unifying behind fewer tools and methods to ensure we see the same data and have the same view of the data points we measure. This also provides a fresh opportunity to review how we measure open source impact across portfolio companies. Many have dashboards on Cauldron.io, but data needs a storyteller to give it meaning. So, the next step is to ask questions with this new data and frame a thesis to measure and test the sustainability of Venture Fund investments into open source.

Many have traveled before me on the same trail of thought. I started first with the Community Health Analytics Open Source Software (CHAOSS) project and its metrics releases. This served as the initial point of brainstorming to frame questions and different scenarios of risk, evolution, DEI, and value.

CHAOSS metrics as an open source tool for an investment lens on sustainability.

I reviewed the latest release of CHAOSS metrics and narrowed down four metrics I want to measure in the next year. I also shared thoughts on why to collect this data and how to do it. This blog post is no more than me wondering out loud, to help me frame an analytical approach for this metrics strategy.

The four metrics are detailed below:

Contribution Attribution
Contributors
Collaboration Platform Activity
Labor Investment

Take note of your dependencies and contributors. Photo by Glenn Carstens-Peters (https://unsplash.com/@glenncarstenspeters?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) on Unsplash (https://unsplash.com/s/photos/lists?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText).

Contribution Attribution

Question: Who has contributed to an open source project and what attribution information about people and organizations is assigned for contributions?

chaoss.community/metric-contribution-attribution/

This metric is insightful because it reaches deep into team and project culture. It is a good representation of how much a team leans into an open source model of building their project. This work ethos and intention to forge on an open source path is difficult to understand at times. If a team takes care to attribute their software dependencies and other contributors to their code (if any), this is a good sign that the team accepts collaboration as a value and encourages working with others.

I would measure this across two types of contributions: attributions for software dependencies including those with permissive licenses, and for any other direct contributors to the code and how they are recognized for their participation. This could be filtered in a red-yellow-green light approach:

Red: No attributions are made, or all attributions are inadequate.
Yellow: Only one of the two attribution types is made, or one attribution type is inadequately attributed.
Green: All dependencies and used works are correctly attributed.
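As a minimal sketch of that red-yellow-green filter, assuming each of the two attribution types (dependencies and direct contributors) has already been reviewed by hand and judged adequate or not, the rating could be expressed in Python like this. The `Light` enum and `attribution_light` function are hypothetical names for illustration, not part of the CHAOSS metric.

```python
from enum import Enum


class Light(Enum):
    RED = "red"
    YELLOW = "yellow"
    GREEN = "green"


def attribution_light(dependencies_ok: bool, contributors_ok: bool) -> Light:
    """Rate a project by how many of the two attribution types are adequate."""
    adequate = sum([dependencies_ok, contributors_ok])
    if adequate == 2:
        return Light.GREEN   # everything is correctly attributed
    if adequate == 1:
        return Light.YELLOW  # one type is missing or inadequate
    return Light.RED         # no adequate attribution at all


# Example: dependencies are attributed, but contributors are not recognized.
print(attribution_light(dependencies_ok=True, contributors_ok=False).value)
```

The judgment of what counts as "adequate" stays with a human reviewer; the function only records the outcome consistently.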

Spend more time getting to know who participates and why. Photo by Alex Hudson (https://unsplash.com/@aliffhassan91?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) on Unsplash (https://unsplash.com/s/photos/bazaar?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText).

Contributors

Question: Who are the contributors to a project?

chaoss.community/metric-contributors/

This metric explores a more human dimension of the people and participants in an open source project. The metric defines contributors and contributions broadly, as “anyone who contributes to the project in any way.” Understanding the people participating in a community, their motivations, goals, and why they choose to be in that community is important to understand sustainability. Otherwise, you may lose out on good opportunities to attract contributions from people who are already engaged, and new engagements may be difficult because of a mismatch of expectations.

This metric is more a means than it is an end; that is, it provides opportunities to ask more questions than provide detailed answers. Nevertheless, it does provide some guidance towards understanding contributors in a project, and it can lead to some concrete actions based on gathered insights. For example, this metric will enable deeper looks in areas of diversity, equity, and inclusion.

Since I work with start-up companies with small, lean development teams, I look to understand the motivations of the developers working on their projects and where the motivations may align with another open source solution. This enables the two communities to leverage their combined brainstorming for meeting complementary goals around development and innovation.

To collect this data, I would have the team define what areas of contribution they seek for their open source solutions and then map those desired contributions to a specific project area or different team members. This enables a form of consistent accountability for checking expectations with reality and understanding team capacity. Each area could be a key-value pair, where the value is the project area, team lead, or delegated team member for the type of contribution solicited.
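A minimal sketch of those key-value pairs in Python might look like the following. The contribution areas and owners are hypothetical examples, not data from any Venture Fund company; the helper simply lists areas where a team asks for contributions but has nobody accountable for them.

```python
from typing import Dict, List, Optional

# Hypothetical mapping: type of contribution sought -> who owns it on the team.
contribution_areas: Dict[str, Optional[str]] = {
    "documentation": "technical writer",
    "bug reports": "QA lead",
    "translations": None,  # desired, but nobody is accountable yet
    "code review": "engineering lead",
}


def unowned_areas(areas: Dict[str, Optional[str]]) -> List[str]:
    """Return the contribution areas that have no assigned owner, sorted by name."""
    return sorted(area for area, owner in areas.items() if not owner)


print(unowned_areas(contribution_areas))
```

Reviewing the unowned list regularly is one way to check expectations against team capacity.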

There are many ways to collaborate, but the question is, are you counting the right ways? Photo by Kai Dahms (https://unsplash.com/@dilucidus?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) on Unsplash (https://unsplash.com/s/photos/measure?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText).

Collaboration Platform Activity

Question: What is the count of activities across digital collaboration platforms (e.g., GitHub, GitLab, Slack, email) used by a project?

chaoss.community/metric-collaboration-platform-activity/

Collaboration platform activity is one effective proxy metric for community engagement if measured accurately. The metric does not define collaboration as much as it provides a data structure to measure it. It abstracts collaboration into key data points like timestamp, sender, whether the platform has threaded or non-threaded discussions, data collection date, and platform message identifier. To a degree, collaboration can be abstracted out in this way: a person takes any given action at a given time in a given way, and this action is measured as project-related activity on the collaboration platform.
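To make that abstraction concrete for myself, here is a minimal Python sketch of one activity record and a count per platform. The field names are my own guesses at how the data points named above could be stored, and the `platform` field and sample records are hypothetical; this is not the schema published in the CHAOSS metric.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date, datetime
from typing import Iterable


@dataclass(frozen=True)
class PlatformActivity:
    platform: str        # e.g. "github" or "slack" (assumed field)
    message_id: str      # platform message identifier
    sender: str
    timestamp: datetime  # when the action happened
    threaded: bool       # whether the platform threads discussions
    collected_on: date   # data collection date


def count_by_platform(activities: Iterable[PlatformActivity]) -> Counter:
    """Count recorded activities for each collaboration platform."""
    return Counter(activity.platform for activity in activities)


# Hypothetical records for illustration only.
records = [
    PlatformActivity("github", "gh-1", "ana", datetime(2021, 12, 1, 9, 0), True, date(2021, 12, 31)),
    PlatformActivity("github", "gh-2", "ben", datetime(2021, 12, 2, 14, 30), True, date(2021, 12, 31)),
    PlatformActivity("slack", "sl-1", "ana", datetime(2021, 12, 3, 11, 15), False, date(2021, 12, 31)),
]

print(count_by_platform(records))
```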

There are a few possible approaches to collecting this data from UNICEF Venture Fund companies. The approaches are not mutually exclusive, and each could be combined with the others:

Measure common git activity like commits, issues, pull/merge requests. We already measure this data, but use it only in connection to validating Venture Fund workplans for each team with UNICEF portfolio manager(s).
Count communications like comments, reviews, public messages, and other outreach. Communications strategies and tools are typically inferred from common git activity. Measuring for engagement and stratifying those metrics into a smaller group could allow for deeper insights into the evolution of early-stage open source communities.
Make community hubs first-class citizens in the data curation process to draw inferences about informal engagement. Both open source projects and UNICEF Venture Fund portfolio companies use a variety of tools to communicate, especially in view of COVID-19 and its seismic impact on how we work. Platforms like Discord, Telegram, Mattermost, Slack, Rocket.chat, Matrix, and others are focal points where projects collaborate, ask questions, and support others. Bringing this data stream into the mix offers deeper insights into how teams engage and build community around their work, and also guidance on when to push for contribution opportunities.

Satisfying all three of these options is still not enough. To have the fullest impact, these metrics must tie into each other and connect back to a narrative. Why is this data being collected and what actions are influenced by the knowledge of this data? The data collection enables the evaluation of sustainability and understanding the birth, growth, and evolution of an open source technology product. Influenced actions can include moving more human resources (i.e. contractors or staff) to support a project, adopting a new open source best practice, and/or engaging new customers, talent, or other leads based on participation in the community.

Measuring collaboration platform activity is not black and white. Many new questions would likely come forward as part of measuring this activity. Yet this is the point—it lays the foundation for the next layer to the data collection, analysis, and reporting process around sustainability.

What is the impact of an investment on fair and equitable labor? Photo by Jon Tyson (https://unsplash.com/@jontyson?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) on Unsplash (https://unsplash.com/s/photos/worker?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText).

Labor Investment

Question: What was the cost of an organization for its employees to create the counted contributions (e.g., commits, issues, and pull requests)?

chaoss.community/metric-labor-investment/

This metric is perhaps the most ambitious of the group. How do you measure labor investment into an open source project? Or literally, the number of person-hours that go into software design, development, co-creation, and community management? It feels like a gargantuan effort, but there may be better ways to measure this in connection to other data the UNICEF Venture Fund already collects about the businesses.

Measuring labor investment impacts two narratives: the rate of development on the open source work, and the impact of UNICEF investment into a company backing an open source work.

Firstly, understanding the rate of development on an open source work is easier to infer by understanding who is allocated on a project and how much of their time they dedicate to it. If a team of three contributors spares a few hours a week, it will mean something different compared to a team of five engineers spread across different disciplines working full-time. Mapping the labor investment for open source projects supported by UNICEF would enable better planning by understanding the typical labor investment in open source workplan tasks as piloted by other Venture Fund portfolio companies.
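A rough Python sketch of that comparison, under loudly stated assumptions: "a few hours a week" is taken as 3 hours per person and "full-time" as 40 hours per person. Both numbers and the function name are hypothetical placeholders, not figures from any portfolio company.

```python
from typing import List


def weekly_person_hours(hours_per_person: List[float]) -> float:
    """Total person-hours a team puts into a project in one week."""
    return sum(hours_per_person)


# Assumed values for illustration only.
part_time_team = [3, 3, 3]   # three contributors sparing a few hours each
full_time_team = [40] * 5    # five engineers working full-time

print(weekly_person_hours(part_time_team))
print(weekly_person_hours(full_time_team))
```

Even this crude total shows why the same workplan task can mean very different things for two teams, which is the planning insight the metric is after.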

Secondly, this gives us a new way of talking about the impact of UNICEF Venture Fund investments as an investment not only in software products but also in labor. It gives us insight into the investment of labor in software engineering talent among portfolio companies. How does this measurement change over the course of the investment? Do projects receive more or less investment of labor during the 12-month period we work with them? This could also be used as a proxy metric for the impact of our unique mentorship and coaching opportunities.

What next?

Knowing is half the journey, even if the knowledge is not yet firmly rooted. The analysis and introspection are from me as an individual working with the UNICEF Venture Fund and do not represent the views and beliefs of UNICEF or the UN in any capacity. My intent is that sharing this analysis in the open creates a space where conversation can spark where it could not before. It also invites others to share ideas, feedback, and constructive criticism of an emerging metrics strategy for investments made into the open source ecosystem.

Next, more layers can be added and internal and external validation can help to keep this moving forward. An implementation plan would be the next step to follow this post. The implementation plan considers the process of how start-up companies move through the Venture Fund from start to finish. Who interacts with the companies and when? At what point is a company ready to begin building in a new metric or count in their monthly metrics? Do they understand the implications and assessments of these metrics? At what points in the process is data already being collected? Could these new data requests be added to existing requests? And so on.

I hope to formalize some of this new reporting and metrics strategy in upcoming cohorts in 2022, as part of a renewed effort into communicating how our open source investments tie into sustainable impact towards the U.N. Sustainable Development Goals.

This post will serve as a milestone marker on the metrics strategy discussion in the coming one to two months. See you in 2022.

Featured photo by Edward Howell on Unsplash. Modified by Justin Wheeler. CC BY-SA 4.0.

Cryptographic Autonomy License (CAL-1.0): My first license review

Wed, 28 Oct 2020 00:00:00 +0000

The bookmark was creeping on my browser’s toolbar for months. “Cryptographic Autonomy License” CAL-1.0 on the Open Source Initiative webpage. But today, I decided it was time to do my first amateur license review. This is a fun exercise (for me). Remember, I am not a lawyer and this does not constitute legal advice!

The Cryptographic Autonomy License is one of the newest Open Source licenses on the block. The Open Source Initiative approved it in February 2020. This license also made ripples when it came through. But the question I had, and could not find a clear answer to, was why is it so interesting?

This blog post is my attempt to do a casual coffee-table review of the license. If you agree or disagree, I encourage you to leave a comment and share your opinion and why!

This short article covers three sections:

CAL-1.0 provisions: What basic Free Software assumptions are present in the license, much like other copyleft licenses.
What’s fresh!!: What is the hype? Ready for the key information? It is covered here.
Personal takeaways: My personal thoughts on this license and where it might be applicable.

CAL-1.0 provisions

I learned there are basic assumptions and expectations that are true for all Open Source licenses, per the Open Source Definition. Copyleft licenses also have different degrees of rigidity depending on context and use. So, what basic ingredients of a Free Software license are present in the Cryptographic Autonomy License?

Note: The number in parentheses before each line is the corresponding section number in the license text.

Basic legal provisions

(6.0) Disclaimer of warranty, limit on liability: If someone uses the software and it causes unexpected disastrous side effects, the Licensor cannot be held responsible.
(2.0) Receiving a license: Anyone can receive a CAL-1.0 license. To receive it, you just have to agree to its rules.
(7.4) Attorney fees: If a case involving noncompliance with the CAL-1.0 is brought to court, loser pays legal fees for prosecution and defense.
(7.3) No sub-licensing: You cannot add another license “on top” of the CAL-1.0.
(3.0) Patent clause: Got patents? This license is equipped to interface with external patent licenses.

Permissive provisions 🔗

(4.1) Access: Source code must be made available over a network with this license.
(4.3) Attribution: Cite your sources. Retain all licensing, authorship, and/or attribution notices.

Copyleft provisions 🔗

(4.1) Modified Work: Changes to the original Work make it a Modified Work. Same license rules apply to a Modified Work.
(5.2) Reinstatement: A la GPLv3, for non-compliant derivative works, there is a 60-day grace period to come into compliance before your license is terminated.
(4.5) Combined Work Exception: Software in the Larger Work as well as the Larger Work as a whole may be licensed under the terms of your choice.
Network use: A la AGPL, it also includes a trigger for network use.

What’s fresh!! 🔗

What sets this license apart from other licenses is all in section 4.2, Maintain User Autonomy:

In addition to providing each Recipient the opportunity to have Access to the Source Code, You cannot use the permissions given under this License to interfere with a Recipient’s ability to fully use an independent copy of the Work generated from the Source Code You provide with the Recipient’s own User Data.

Section 4.2 Maintain User Autonomy: intro text

My non-lawyer take on this is that user data plays a much more prominent role in the terms of this license than other copyleft licenses. Just like the AGPL was a response to the changing world of network services and cloud computing, the CAL-1.0 is a response to the changing world of machine learning and data science.

The CAL-1.0 seems to define “user autonomy” in the context of actually using the software, versus something more holistic like Digital Autonomy. In other words, if you are running CAL-1.0 software, you cannot interfere with requests for personal user data from your users.

This might not sound so radical, but it really is. It is a radical way to assert users’ ownership of their data. If you are the end user of a distributed or cloud-based app licensed under CAL-1.0, you are enabled (to some degree) to request copies of personal user data without interference or obfuscation.

CAL-1.0 and Hatbrim Technologies 🔗

To better explain this, consider this made-up example.

I am a product manager at Hatbrim Technologies. Hatbrim develops an integrated calendar application, Holocal, to store events, meetings, and reminders. Holocal is an integrated application that includes a front-end component, back-end component, and a machine learning algorithm. The algorithm offers tailored suggestions to reduce my meeting load based on my common meeting patterns with other events or activities I have planned.

Oraculous, a competing company to Hatbrim Technologies, creates a fork of Holocal called OraCal. It is almost functionally identical to Holocal except it also adds an integration to other services from Oraculous. However, OraCal also modifies the calendar optimization algorithm. It adds a periodic random event suggestion based on events and activities in your calendar.

Meanwhile at Hatbrim… 🔗

Since I am a product manager at Hatbrim, I turn to my trusty team of developers and ask them to explore the OraCal fork of Holocal. I am curious to know how their calendar optimization method works, since Oraculous must also release OraCal under the Cryptographic Autonomy License (CAL-1.0). My team of developers review the OraCal code, try making changes to Holocal, but we are unable to replicate this feature of OraCal in our environment.

Eventually, one developer runs OraCal internally, but optimized for our data. Still no luck reproducing the nifty calendar event suggestion feature! Fortunately, the CAL-1.0 offers a protection here. So, the developer sends an email to Oraculous to request a copy of her personal user data from OraCal. Because the CAL-1.0 has provisions to prevent foul play or modifying the data, the developer receives a copy of her data and realizes another Oraculous tool was scrubbing and appending data for calendar predictions before it returned to OraCal.

In this hypothetical scenario, our developer is ultimately able to understand how the Modified Work is changed and how Oraculous adapted the original Work. Under another copyleft license like any GPL variant or the Mozilla Public License, a licensee has no obligation to share any user data with an end user. For any reason. Unless they happen to be nice or because another legal authority or body holds them accountable to share user data.

CAL-1.0 personal takeaways 🔗

Did I mention I am not a lawyer and this does not constitute legal or financial advice? In case I did not, I am not a lawyer and this does not constitute legal or financial advice.

This advice and interpretation of the license is raw and unfiltered. But you only read something for the first time once. So, with all other contemporary issues in the Free Software world going on, I thought it would be a fun exercise to draft this blog post as I read through the Cryptographic Autonomy License for the first time.

Ultimately, my takeaways after reading and reflecting on the license a few times are these:

Lack of transparency in motivation: Holo, the company behind the license, emphasizes all the good qualities of this license while sneakily dodging the fact that it is a mildly anti-competitive license for their business case.
Precedent-setting: This is the first approved Open Source license that explicitly does anything significant about data. It will be interesting to see if this inspires other licenses that make definitions on data.
Potentially powerful if picked up: If used more widely or in more popular projects, it has potential to disrupt the status quo of how Open Source thinks about user data and the autonomy of the end user.
No defining moment: To my knowledge, CAL-1.0 lacks a significant defining moment since its approval. It is unclear what real-world noncompliance litigation looks like. It lacks the battle-testing of other copyleft licenses.

I imagine I am not the only one who feels both excited and hesitant about the Cryptographic Autonomy License. I am not sure yet if it makes sense to apply it to any of my work or to recommend it as a default license to others. And licensing is only one of many pathways in the Free Software legal and policy world. But nonetheless, it is an interesting Free Software development that has been maturing since February 2020.

Photo by Markus Spiske on Unsplash. Modified by Justin Wheeler.

How five Queen songs went mainstream in totally different ways

Tue, 16 Oct 2018 00:00:00 +0000

Originally published on the MusicBrainz blog.

Making graphs is easy. Making intuitive, easy-to-understand graphs? It’s harder than most people think. At the Rochester Institute of Technology, the ISTE-260 (Designing the User Experience) course teaches the language of design to IT students. For an introductory exercise in the class, students are tasked to visualize any set of data they desire. Students David Kim, Jathan Anandham, Justin Wheeler, and Scott Tinker used the MusicBrainz database to look at how five different Queen songs went mainstream in different ways.

Five factors of Queen 🔗

Our mini data science experiment decided to look at five unique data points available to us via MusicBrainz Works:

Number of recorded covers
Number of artists who covered a song
Release year
Year of last recorded cover
Time elapsed between release year and year of last recorded cover

Originally, we looked at songs from different artists, but decided to look at five recordings from the same artist. With Queen being a notoriously famous band, there were several data points to work with in terms of how often a song was covered.

Studying five Queen songs: Another One Bites the Dust, Bohemian Rhapsody, Don’t Stop Me Now, Fat Bottomed Girls, We Will Rock You

Making sense of the data 🔗

A few explanations are necessary for some of the data, especially the difference in number of covers and number of artists. Don’t Stop Me Now, Fat Bottomed Girls, and We Will Rock You had the same number of recorded covers as number of artists who have covered the song. Why were Another One Bites the Dust and Bohemian Rhapsody different?

As it turns out, Another One Bites the Dust had more covers than the number of artists who have covered the song. This happens because some artists have covered the song twice (e.g. once on a studio release and another on a live recording release). On the other hand, Bohemian Rhapsody had more artists covering it than number of covers because some recordings featured multiple artists on the same cover (e.g. the 1992 live performance with Elton John and Axl Rose).
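To make the difference between the two counts concrete, here is a minimal Python sketch. The records, the field names (`work`, `artists`), and the `cover_stats` helper are all made up for illustration; this is not the MusicBrainz schema or the query we ran for the assignment.

```python
# Hypothetical cover records: each cover lists the artists credited on it.
# These rows are invented for illustration and are not MusicBrainz data.
covers = [
    {"work": "Song A", "artists": ["Artist 1"]},              # studio cover
    {"work": "Song A", "artists": ["Artist 1"]},              # live cover, same artist
    {"work": "Song B", "artists": ["Artist 2", "Artist 3"]},  # one cover, two artists
]


def cover_stats(records, work):
    """Return (number of covers, number of distinct covering artists)."""
    matching = [r for r in records if r["work"] == work]
    artists = {name for r in matching for name in r["artists"]}
    return len(matching), len(artists)


print(cover_stats(covers, "Song A"))  # (2, 1): more covers than artists
print(cover_stats(covers, "Song B"))  # (1, 2): more artists than covers
```

"Song A" mirrors the Another One Bites the Dust case (one artist, two recordings) and "Song B" mirrors the Bohemian Rhapsody case (one recording, several artists).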

The data opens many interesting questions. Why have some songs persisted longer than others (in terms of recent covers)? Have these songs impacted culture and society in different ways? How have they permeated culture? Is there geographical bias in the data?

This exercise was an exploratory assignment, but we had fun visualizing it and ended up learning an interesting pattern in music data.

Check out the presentation and paper 🔗

If you’re interested in the full details, the slides and a short paper about the presentation are available online. They provide deeper context for the research and the visualization details based on different design concepts.

You can see what else David Kim, Jathan Anandham, Justin Wheeler, and Scott Tinker are up to on LinkedIn. Thanks for tuning in to this adventure into music data analysis, powered by MusicBrainz!

Photo by Matthias Wagner on Unsplash.

Statistics proposal and self-hosting ListenBrainz

Mon, 18 Dec 2017 00:00:00 +0000

This post is part of a series of posts where I contribute to the ListenBrainz project for my independent study at the Rochester Institute of Technology in the fall 2017 semester. For more posts, find them in this tag.

This week is the last week of the fall 2017 semester at RIT. This semester, I spent time with the MetaBrainz community working on ListenBrainz for an independent study. This post explains what I was working on in the last month and reflects back on my original objectives for the independent study.

Running my own ListenBrainz 🔗

The RIT Linux Users Group hosts various virtual machines for our projects. I requested one to set up and host a “production” ListenBrainz site. The purpose of doing this was to…

Test my changes in a “production” environment
Offer a service for the RIT Linux Users Group to poke around with

I spent most of this time working with our system administrator to set up the machine and adjust hardware specs for ListenBrainz. Once we fixed storage space and memory issues, it was easy to set it up and get ListenBrainz running. My experience writing the development guide made it easy to get set up and get working. On the first run, it worked!

Now, listen.ritlug.com is live.

Figuring out HTTPS 🔗

My next challenge for the site is to set up HTTPS. I tried using a reverse proxy in nginx to set up HTTPS, but I received 502 Bad Gateway errors. I realized I spent too much time figuring this out on my own and decided to ask for help in the MetaBrainz community forums.

Proposing new statistics 🔗

Halfway through the independent study, I realized I would fall short of my original objective of implementing basic statistics in ListenBrainz. To compromise, I wrote a proposal for new statistics to start in the project. My proposal looked at other proprietary platforms that compete with ListenBrainz to see some of their statistics. I also came up with some of my own.

I proposed this to the MetaBrainz community on the community forums. I’m awaiting feedback on my ideas. Once I get feedback, I plan to file new tickets for each statistic to track their implementation over time.

I don’t expect statistics to be at the forefront of ListenBrainz for some time. A lot of work is going towards other areas of the project. But later in 2018, I expect more focus on the user-facing side of the project.

My statistic and Google BigQuery 🔗

My biggest blocker over the last month was Google BigQuery. I wrote a statistic to calculate play counts over a time period, but was asked to test my statistic. To test my statistic, I needed real data to work with.
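As a rough idea of what a play-count statistic does, here is a small Python sketch that counts plays per track inside a time window. It works on an in-memory list rather than BigQuery, and the field names (`track_name`, `listened_at`), the `ts` helper, and the sample listens are assumptions for illustration, not the code from my pull request.

```python
from collections import Counter
from datetime import datetime, timezone


def ts(year, month, day):
    """Hypothetical helper: Unix timestamp for midnight UTC on a given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


def play_counts(listens, start, end):
    """Count listens per track whose timestamp falls in [start, end)."""
    counts = Counter()
    for listen in listens:
        listened_at = datetime.fromtimestamp(listen["listened_at"], tz=timezone.utc)
        if start <= listened_at < end:
            counts[listen["track_name"]] += 1
    return counts


# Made-up listens, standing in for the real data I could not reach.
sample_listens = [
    {"track_name": "Bohemian Rhapsody", "listened_at": ts(2017, 11, 3)},
    {"track_name": "Bohemian Rhapsody", "listened_at": ts(2017, 11, 20)},
    {"track_name": "Don't Stop Me Now", "listened_at": ts(2017, 11, 21)},
    {"track_name": "Bohemian Rhapsody", "listened_at": ts(2017, 12, 1)},
]

november = play_counts(
    sample_listens,
    start=datetime(2017, 11, 1, tzinfo=timezone.utc),
    end=datetime(2017, 12, 1, tzinfo=timezone.utc),
)
print(november.most_common())
```

The window is half-open, so a listen at exactly midnight on December 1 counts towards December, not November. Even a tiny fixture like this would have let me test the counting logic without waiting on real data.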

Originally, I tried using the Simple Last.fm Scrobbler to submit listens to the local IP address for my development environment, but I wasn’t able to get the app to reach my ListenBrainz server. To get the data, I had to set up Google BigQuery credentials so I could make queries against data on the production site, listenbrainz.org.

I tried working through the Google BigQuery documentation. There’s a lot of documentation for using BigQuery as a developer, but it was confusing where to find the information I needed to set it up in my development environment. I tried creating a new project in the Google Cloud Platform, but I was confused because it prompted me to upload my own data instead of accessing data already in BigQuery.

Too late, I realized I spent too much time on my own and not asking for help. I submitted a pull request with the statistic I made and asked for help in the MetaBrainz community. I also offered to write documentation for setting this up once I learn how to do it.

Reflecting back 🔗

I looked back on my original objectives for the independent study, and I was satisfied and dissatisfied.

Not enough programming 🔗

I wanted this independent study to enhance my programming knowledge. I especially wanted to focus on Python because I wanted to become more familiar with the language. However, I actually didn’t do much programming during the independent study, through my own fault.

My biggest challenge was I bit off more than I could chew. I wanted to write code, and made a big goal before I knew the code base of the project. Even now, I still am not completely comfortable with the code yet. It’s a big project with a lot of things going on. I was able to understand the things I did work on, but there’s still a lot.

I realized that next time, I need to spend more time evaluating the code base of a project before writing out my milestones. I wish I set more realistic, smaller milestones for myself. My milestone of implementing basic reports was lofty given my existing programming knowledge.

Successes 🔗

One of my other objectives was to write documentation for the project. I felt I succeeded in this milestone, and actually found it enjoyable and interesting to do! I helped separate out documentation from the README into the dedicated ReadTheDocs site. I wrote the development environment guide and helped fix some build issues with the docs site. I also plan to write more for some of the other pain points I found, like Google BigQuery.

My last milestone was to create a use case for a data visualization course at RIT. While I didn’t implement my basic reports, I did create the proposal and make an effort to write new statistics. There’s a lot of potential now to work with the data in Google BigQuery and do front-end work with tools like D3.js and Plotly.js. I believe there’s significant potential to use ListenBrainz as a hands-on project for students to explore data visualization with real data. I hope to support my independent study professor, Prof. Roberts, with questions and logistics of using it as a tool for learning in the future.

Unexpected success 🔗

I think I also had an unplanned success: I immersed myself in the ListenBrainz community. Over the last few months, I realized that many of my strengths are in community management and tooling. During my time in the community, I did the following:

To the future! 🔗

This ends my independent study with ListenBrainz, but it doesn’t end my time contributing! I chose ListenBrainz because it’s a project I’m passionate about. An independent study allowed me to justify more time on it than a side project in my free time. I’m happy to have that opportunity, but I don’t want to end here!

I want to follow through on the statistics because I’m passionate about understanding music listening trends. I think there’s a lot of power for psychological research through music data. To this point, I filed a ticket to request tagging listens with “emotion” words that are synced back to MusicBrainz entities.

I won’t have as much time to work on the project without the course credit, but I hope to stay involved for the future. I love the project and I love the community. I’m thankful for the opportunity to work on this project as an independent study, and learn some things along the way.

Exploring Google Code-In, ListenBrainz easyfix bugs, D3.js

Sat, 21 Oct 2017 00:00:00 +0000

Last week moved quickly for me in ListenBrainz. I submitted multiple pull requests and participated in the weekly developer’s meeting on Monday. I was also invited to take part as a mentor for ListenBrainz for the upcoming round of Google Code-In! In addition to my changes and new role as a mentor, I’m researching libraries like D3.js to help build visualizations for music data. Suddenly, everything started moving fast!

Last week: Recap 🔗

The ListenBrainz team accepted my development environment improvements and documentation. This gave me an opportunity to better explore project documentation tools. I experimented with Sphinx and Read the Docs. Sphinx introduced me to reStructuredText for documentation formats. I’ve avoided it in favor of Markdown for a long time, but I see where reStructuredText is stronger for advanced documentation.

Since ListenBrainz is a new project, I plan to contribute documentation for any of my work and improve documentation for pre-existing work. One of the goals for this independent study is to make ListenBrainz a viable candidate for a future data analysis course. To make it easy to use and understand, ListenBrainz needs excellent documentation. Since one of my strengths is technical writing, I plan to contribute more documentation this semester.

You can see some of the new documentation already!

Google Code-In mentor 🔗

The MetaBrainz community manager, Freso Olesen, approached me to mentor for Google Code-In. Google Code-In is an opportunity for teenagers to meaningfully contribute to open source projects. Google describes Google Code-In as…

Pre-university students ages 13 to 17 are invited to take part in Google Code-in: Our global, online contest introducing teenagers to the world of open source development. With a wide variety of bite-sized tasks, it’s easy for beginners to jump in and get started no matter what skills they have.

Mentors from our participating organizations lend a helping hand as participants learn what it’s like to work on an open source project. Participants get to work on real software and win prizes from t-shirts to a trip to Google HQ!

MetaBrainz is a participating organization of Google Code-In this cycle. Because of my work with ListenBrainz, I will contribute a few hours a week to help mentor participating students with ListenBrainz. Beginner problems should be easy to help with since I’m still beginning too, and as I spend more time with ListenBrainz, I can help with harder problems.

I’m excited to give back to one of my favorite open source projects in this way! I’m grateful to have this chance to help out during Google Code-In.

Choosing easyfix bugs 🔗

After I figured out the development environment issues, I went through open tickets filed against ListenBrainz to find some to work on. I made a preliminary pass through all open tickets and left some comments for more information, when needed. The tickets I highlighted to look into next were

LB-85: Username in the profile URL should be case insensitive
LB-124: Install messybrainz as a python library from requirements
LB-176: Add stats module and begin calculating some user stats from BigQuery
LB-206: “playing_now” submissions not showing on profile
LB-212: Show the MetaBrainz logo on the listenbrainz footer.

Of these five, LB-124 and LB-212 are already closed. While drafting this article, I completed LB-124 in PR #266. This was part of a test to get the documentation building again because of odd import errors. Later, a new student also learning the project for the first time asked to work on LB-212. Since it was a good first task to explore the project code, I passed the ticket to him.
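To give a feel for what a ticket like LB-85 asks for, here is a minimal Python sketch of a case-insensitive username lookup. The `find_user` function and the sample data are hypothetical; the real fix in ListenBrainz would more likely live in the database query, and I have not confirmed how it was implemented.

```python
def find_user(users, requested_name):
    """Look up a profile by username, ignoring letter case.

    `users` is a hypothetical mapping of stored username to profile data.
    """
    wanted = requested_name.casefold()
    for name, profile in users.items():
        if name.casefold() == wanted:
            return profile
    return None


# Made-up data for illustration only.
users = {"ExampleListener": {"listen_count": 42}}

print(find_user(users, "examplelistener"))
print(find_user(users, "EXAMPLELISTENER"))
print(find_user(users, "someone-else"))
```

`str.casefold()` is used instead of `str.lower()` because it is designed for caseless comparison, including for non-ASCII usernames.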

I want to do one more “easyfix” bug before going into the main part of my independent study timeline. I don’t yet feel comfortable with the code and one more bug solved will help. After this, I plan to pursue the heavier lifting of the independent study to explore data operations and queries to make.

Researching D3.js 🔗

Prof. Roberts introduced D3.js as a library to build interactive, dynamic charts and visual representations of data. I haven’t yet looked into much front-end work, but this was a cool project that I wanted to highlight in my weekly report. This feels like it could be a powerful match for ListenBrainz, especially since the data has high detail.

Upcoming activity 🔗

This next week, I won’t have as much time to contribute to ListenBrainz. On October 21, I’m traveling to Raleigh, NC for All Things Open. On October 24, I present my talk, “What open source and J.K. Rowling have in common”. Since I’ll be out of Rochester and missing other classwork, I expect less time on my ListenBrainz work.

This next week will be slower than the last two weeks. Hopefully I’ll learn something at the conference too to bring back for ListenBrainz.

Until then… keep the FOSS flag high.

How to set up a ListenBrainz development environment

Wed, 04 Oct 2017 00:00:00 +0000

One of the first rites of passage when working on a new project is creating your development environment. It always seems simple, but sometimes there are bumps along the way. The first activity I did to begin contributing to ListenBrainz was create my development environment. I wasn’t successful with the documentation in the README, so I had to play around and work with the project before I was even running it.

The first part of this post details how to set up your own development environment. Then, the second half talks about the solution I came up with and my first contribution back to the project.

Install dependencies: Docker 🔗

This tutorial assumes you are using a Linux distribution. If you’re using a different operating system, install the necessary dependencies or packages with your preferred method.

ListenBrainz ships in Docker containers, which helps create your development environment and later deploy the application. Therefore, to work on the project, you need to install Docker and use containers for building the project. Containers save you from installing all of this on your own workstation! Since I’m using Fedora, I run this command.

sudo dnf install docker docker-compose

Register a MusicBrainz application 🔗

Next, you need to register your application and get an OAuth token from MusicBrainz. Using the OAuth token lets you sign into your development environment with your MusicBrainz account. Then, you can import your plays from somewhere else.

To register, visit the MusicBrainz applications page. There, look for the option to register your application. Fill out the form with these three options.

Name: (any name you want and will recognize, I used listenbrainz-server-devel)
Type: Web Application
Callback URL: http://localhost/login/musicbrainz/post

After entering this information, you’ll have an OAuth client ID and OAuth client secret. You’ll use these for configuring ListenBrainz.

Update config.py 🔗

With your new client ID and secret, update the ListenBrainz configuration file. If this is your first time configuring ListenBrainz, copy the sample to a live configuration.

cp listenbrainz/config.py.sample listenbrainz/config.py

Next, open the file with your favorite text editor and look for this section.

# MusicBrainz OAuth
MUSICBRAINZ_CLIENT_ID = "CLIENT_ID"
MUSICBRAINZ_CLIENT_SECRET = "CLIENT_SECRET"

Update the strings with your client ID and secret. After doing this, your ListenBrainz development environment is able to authenticate and log in from your MusicBrainz login.
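If you want to double-check that you replaced the sample values, a small sanity check like the following can help. This is only a sketch: the `unconfigured_settings` helper is hypothetical and not part of ListenBrainz, and it assumes the placeholder strings are exactly the ones shown in the sample section above.

```python
# Placeholder values as they appear in the sample configuration section.
PLACEHOLDERS = {"CLIENT_ID", "CLIENT_SECRET"}


def unconfigured_settings(settings):
    """Return the names of settings that still hold a sample placeholder."""
    return sorted(name for name, value in settings.items() if value in PLACEHOLDERS)


# Example: the client ID was never filled in, the secret was (made-up value).
settings = {
    "MUSICBRAINZ_CLIENT_ID": "CLIENT_ID",
    "MUSICBRAINZ_CLIENT_SECRET": "not-a-real-secret",
}

print(unconfigured_settings(settings))  # ['MUSICBRAINZ_CLIENT_ID']
```

An empty list means both values were changed from their placeholders. It does not prove the credentials are valid, only that they were edited.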

Initialize ListenBrainz databases 🔗

Your development environment needs some databases present to work. Before proceeding, run these three commands to initialize the databases.

docker-compose -f docker/docker-compose.yml -p listenbrainz run --rm web python3 manage.py init_db --create-db
docker-compose -f docker/docker-compose.yml -p listenbrainz run --rm web python3 manage.py init_msb_db --create-db
docker-compose -f docker/docker-compose.yml -p listenbrainz run --rm web python3 manage.py init_influx

Your development environment is now ready. Now, let’s actually see ListenBrainz load locally!

Run the magic script 🔗

Once you have done this, run the develop.sh script in the root of the repository. Using docker-compose, the script creates multiple Docker containers for the different services and parts of the ListenBrainz server. Running this script will start Redis, PostgreSQL, InfluxDB, and web server containers, to name a few. It also makes it easy to stop them all later.

./develop.sh

You will see the containers build and eventually run. Leave the script running to see your development environment. Later, you can shut it down by pressing Ctrl+C. Once everything is running, visit your new site from your browser!

http://localhost/

Now, you are all set to begin making changes and testing them in your development environment!

Making my first pull request 🔗

As mentioned earlier, my first attempt at a development environment was unsuccessful. My system kept denying permission to the processes in the containers. After looking at system audit logs and running a temporary setenforce 0, I tried the script one more time. Everything suddenly worked! So the issue was mostly with SELinux.

With my goal to get my environment set up, I figured out a few issues with the configuration offered by the project developers. I eventually made PR #257 against listenbrainz-server with my improvements.

Labeling SELinux volume mounts 🔗

To diagnose the issue, I started with a quick search and found a StackOverflow question describing the same problem. There, the question was about Docker containers and denied permissions in the container. The answers explained it was an SELinux error and the context for the containers was not set. However, temporarily changing the context for a directory didn’t seem too effective and doesn’t persist across reboots.

Continuing the search, I found an issue filed against docker-compose about the :z and :Z flags for volume mounts. These flags set SELinux context for containers, with the best explanation I found coming from this StackOverflow answer.

Two suffixes :z or :Z can be added to the volume mount. These suffixes tell Docker to relabel file objects on the shared volumes. The ‘z’ option tells Docker that the volume content will be shared between containers. Docker will label the content with a shared content label. Shared volumes labels allow all containers to read/write content. The ‘Z’ option tells Docker to label the content with a private unshared label.

Therefore, I added the :z flag to all the volume mounts in the docker-compose.yml file. I submitted a fix upstream for this in listenbrainz-server#257!
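To show what that change amounts to, here is a small Python sketch that adds the suffix to volume mount strings of the form host:container[:options]. The `add_selinux_label` function and the example paths are hypothetical and only illustrate the string format; in the actual pull request I edited the docker-compose.yml file by hand.

```python
def add_selinux_label(mount, label="z"):
    """Append an SELinux relabel option to a 'host:container[:options]' mount.

    Mounts that already carry a z or Z option, or that have no host part,
    are returned unchanged.
    """
    parts = mount.split(":")
    if len(parts) == 1:
        # Only a container path was given, so there is nothing to relabel.
        return mount
    if len(parts) == 2:
        return f"{mount}:{label}"
    options = parts[2].split(",")
    if "z" in options or "Z" in options:
        return mount
    return f"{parts[0]}:{parts[1]}:{','.join(options + [label])}"


# Made-up mounts for illustration; not the project's real volume list.
for mount in ["..:/code/listenbrainz", "./data:/data:ro", "./logs:/logs:Z"]:
    print(mount, "->", add_selinux_label(mount))
```

Remember the distinction from the quote above: lowercase z gives a shared label, while uppercase Z gives a private one, so swap the `label` argument if a volume should not be shared between containers.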

Correct the startup port 🔗

In the README, it says the server will start on port 8000, but the docker-compose.yml file actually started the server on port 80. I included a fix for this in my pull request as well.

git push! 🔗

This post makes a debugging experience that actually took hours look like it happened in minutes. But after getting over this hurdle, it was awesome to finally see ListenBrainz running locally on my workstation. It was an even better feeling when I could take my improvements and send them back in a pull request to ListenBrainz. Hopefully this will make it easier for others to create their own development environments and start hacking!

On the data refrain: Contributing to ListenBrainz

Mon, 02 Oct 2017 00:00:00 +0000

A unique opportunity of attending an open source-friendly university is when course credits and working on open source projects collide. This semester, I’m participating in an independent study at the Rochester Institute of Technology where I will contribute to the ListenBrainz project.

Many students take part in independent studies where they work on their own projects. However, in the spirit of open source collaboration, I wanted to contribute to a project that already existed. That way, my work would be helpful to a real-world project where it would have value even after the end of the semester. Additionally, I wanted a project to help me sharpen my Python skills. And ListenBrainz was a fun, exciting candidate for this.

Objectives 🔗

The independent study proposal included three primary goals I hoped to meet during this independent study:

Add basic reports to ListenBrainz from listening history data for top songs / artists / albums of week, month, year, etc.
Create documentation to improve ease to use and develop for the ListenBrainz project
Offer as a use case for the data visualization course in fall 2018 with instructions on how to use the data

Add basic reports 🔗

Methods for generating basic reports, charts, and statistics about listening history are important. They help make ListenBrainz a more interesting platform for a casual music listener, not just a developer. Therefore, my goal was to add a way to add basic reports or specific metrics for presenting to the user in the front-end.
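As a sketch of what such a basic report could look like, here is a short Python example that groups listens by calendar month and lists the most-played artists in each. The field names (`artist_name`, `listened_at`), the `ts` helper, and the sample listens are assumptions for illustration; this is not the ListenBrainz data model or an existing feature.

```python
from collections import Counter, defaultdict
from datetime import datetime, timezone


def ts(year, month, day):
    """Hypothetical helper: Unix timestamp for midnight UTC on a given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


def top_artists_by_month(listens, limit=3):
    """Map 'YYYY-MM' to the most-played artists in that month."""
    per_month = defaultdict(Counter)
    for listen in listens:
        when = datetime.fromtimestamp(listen["listened_at"], tz=timezone.utc)
        per_month[when.strftime("%Y-%m")][listen["artist_name"]] += 1
    return {month: counts.most_common(limit) for month, counts in per_month.items()}


# Made-up listening history for illustration.
history = [
    {"artist_name": "Queen", "listened_at": ts(2017, 10, 2)},
    {"artist_name": "Queen", "listened_at": ts(2017, 10, 10)},
    {"artist_name": "David Bowie", "listened_at": ts(2017, 10, 12)},
    {"artist_name": "Queen", "listened_at": ts(2017, 11, 1)},
]

report = top_artists_by_month(history)
print(report["2017-10"])  # [('Queen', 2), ('David Bowie', 1)]
print(report["2017-11"])  # [('Queen', 1)]
```

Swapping the `strftime` format (for example to `"%Y"`) would give yearly reports, and the same grouping idea applies to songs or albums.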

As a stretch goal, if I have extra time, I would work on generating content (e.g. charts / graphs / statistics) to show the user in the front-end.

Documentation 🔗

Documentation is something near and dear to me. I enjoy making it easier for other people to use a project or get started with contributing. Therefore, I will contribute some time as a technical writer and help improve documentation on the project. This includes improving existing documentation, like how to set up a development environment, or creating new content.

As an end deliverable, it would be nice to have someone who has never worked with the project get a development environment set up, import some data, and see something presented to them. Good documentation is key to making something like this possible.

Use case for data course 🔗

RIT will offer a data visualization course in future semesters and it would be helpful if ListenBrainz could be a use case or even tool for the course. Then, students could work with ListenBrainz for creating different visualizations for the music data. And maybe contribute some of their visualizations back upstream! For this to happen, we need comprehensive documentation and complete features.

One focus of this independent study is making ListenBrainz a good fit for this course.

Learn more about ListenBrainz 🔗

For the next few months, until December, I will blog regularly about contributing to the ListenBrainz project and my progress. Additionally, more posts about MusicBrainz, other MetaBrainz projects, or music data may follow. I’m hoping to either create new or improve old documentation as well, so I plan to write often anyways!

For now, you can learn a bit more about ListenBrainz and other projects in the MetaBrainz family, like MusicBrainz.

ListenBrainz
GitHub: metabrainz/listenbrainz-server
About MusicBrainz
About MetaBrainz