We spent the first couple of months of the new project phase getting a reality check from the community. We found out that Open Integrity was still needed and got a better sense of where to position ourselves, which informed our development work in the subsequent months.
In this post, you’ll find some of the insights we gained, and details on the technical approaches we’ve been developing.
Here are some highlights of the insights we gathered:
- There's more and more data available out there and we're needed to help make sense of it.
- There's a gap in nuanced, technically informed yet accessible data for end users, educators and advocates.
- Researchers are keen on having a place where, instead of one-off publications, research data can accumulate and frameworks can be re-used to gather actual data.
- There's great interest in how we want to approach this, but to make our case we need to build a credible platform!
We therefore spent most of the past months in the lab designing and developing the platform. Here's a breakdown of the key things we've focused on from a technical perspective. If you want more details, we're working completely in the open so you might want to:
The foundation of our efforts is gathering good data. For us this means facts about software, some of which can be gathered automatically, while others need to be created by people. All of it needs to be traceable and auditable, perhaps in time to a forensics standard.
We found out about Event Sourcing, which in a nutshell means relying on an immutable store of events from which the state of the app is built. The key difference from a standard database is that instead of modifying things in place (and possibly losing the information that was there before), this approach never deletes anything (hence immutable). If that sounds familiar, that's because it's used in a lot of places, from bookkeeping and how bank accounts are managed to version control.
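To make the idea concrete, here is a minimal sketch in Python of an append-only event log and a function that rebuilds state from it. The `Event` fields, the `EventStore` class and the example values are all hypothetical names chosen for illustration; they are not taken from our codebase.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Event:
    """An immutable fact. Field names are illustrative assumptions."""
    entity: str
    attribute: str
    value: Any


class EventStore:
    """Append-only log: events can be added, never changed or removed."""

    def __init__(self) -> None:
        self._log = []

    def append(self, event: Event) -> None:
        self._log.append(event)

    def events(self) -> tuple:
        # Hand out a tuple so callers cannot mutate the log.
        return tuple(self._log)


def build_state(events) -> dict:
    """Fold the log into the current state; later events win."""
    state = {}
    for e in events:
        state.setdefault(e.entity, {})[e.attribute] = e.value
    return state


store = EventStore()
store.append(Event("example-project", "license", "MIT"))
store.append(Event("example-project", "license", "GPL-3.0"))
state = build_state(store.events())
```

The current state reports the latest license, yet the earlier event is still in the log, which is exactly what an in-place update would have lost.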
We gain a lot from this: traceability (provided the metadata about where an event came from is stored with the event itself) and auditability (we could evolve towards an append-only log when needed). We also get "time travel". That's what it's called anyway, and it reflects something very important for us: we can get a view of how the world was at a particular point in time. That's useful, for instance, to get a sense of how OpenSSL would have fared against some criteria before Heartbleed (something which our friends at the Core Infrastructure Initiative are already keen on showing).
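As a rough illustration of traceability and time travel, the sketch below stores a timestamp and a source with every event and replays only the events recorded up to a chosen moment. The field names (`recorded_at`, `source`) and the sample data are assumptions made up for this example, not real findings.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any


@dataclass(frozen=True)
class Event:
    recorded_at: datetime  # when the fact was recorded
    source: str            # provenance metadata kept with the event
    entity: str
    attribute: str
    value: Any


def state_as_of(events, moment: datetime) -> dict:
    """Replay events in time order, stopping at `moment`."""
    state = {}
    for e in sorted(events, key=lambda e: e.recorded_at):
        if e.recorded_at > moment:
            break
        # Keep the source next to the value so every fact stays traceable.
        state.setdefault(e.entity, {})[e.attribute] = (e.value, e.source)
    return state


# Entirely hypothetical data.
log = [
    Event(datetime(2014, 1, 1), "hypothetical-audit-a",
          "example-lib", "known_critical_bugs", 0),
    Event(datetime(2014, 6, 1), "hypothetical-audit-b",
          "example-lib", "known_critical_bugs", 1),
]

before = state_as_of(log, datetime(2014, 3, 1))
after = state_as_of(log, datetime(2014, 12, 31))
```

Here `before` and `after` give two different pictures of the same entity, each fact carrying the source it came from.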
Finally, it's just a good pattern for scalability, both for performance (it generally goes hand in hand with CQRS, i.e. separating reads from writes, which means accepting a world of eventual consistency) and for ease of reasoning, which helps when applications get more complex. It has a cost too: it's more exotic than traditional RDBMS-based approaches and there are fewer frameworks available.
We ended up choosing CouchDB as our event store for the first phase of the project, given that it has eventual consistency at its core. It also has a polyglot app framework (Erlang, JS, Python and even Haskell) and we already had some experience with it. Its incremental map-reduce views seemed like a good fit for building Projections too.
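To show what a projection built in the map-reduce style looks like, here is a small imitation in plain Python. It does not use CouchDB or its API (CouchDB views are typically written in JavaScript and are updated incrementally, which this sketch does not attempt); the document fields and the function names are assumptions for illustration only.

```python
from collections import defaultdict


def map_event(doc):
    """Map step: emit (key, value) pairs for one event document."""
    if doc.get("attribute") == "license":
        yield doc["value"], 1


def reduce_counts(values):
    """Reduce step: combine all values emitted under the same key."""
    return sum(values)


def project(docs):
    """Group the mapped pairs by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_event(doc):
            grouped[key].append(value)
    return {key: reduce_counts(vals) for key, vals in grouped.items()}


# Hypothetical event documents.
events = [
    {"entity": "project-a", "attribute": "license", "value": "MIT"},
    {"entity": "project-b", "attribute": "license", "value": "MIT"},
    {"entity": "project-c", "attribute": "license", "value": "Apache-2.0"},
    {"entity": "project-c", "attribute": "homepage", "value": "https://example.org"},
]

licenses = project(events)
```

The projection is a read-only summary derived from the events; it can be thrown away and rebuilt from the log at any time.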
One of the databases that uses event sourcing is Datomic, but it's not open source. We looked at their data model for events and got inspiration from it. It's really a quad-store at heart. For the first iteration we knew we wanted to tie in data from very heterogeneous sources. It felt too costly to go for a full linked data triple/quad-based model, so we went for a compromise where the "object" of our triples can be a nested object. As our schema was still evolving in unpredictable ways, we also decided to model relationships in a linked data friendly way so that we had more flexibility. Being "linked data friendly" means that we'll be able to piggyback on Ontologies for reasoning and inference at some later point, as well as play nice with regard to data interoperability.
We therefore kept in our sights the possibility of linking our data to existing formats (SPDX notably). We reviewed a good number of data models relevant to the field to come up with a first iteration that we felt confident would help "hang facts" that were useful to accurately picture the reality of software projects without having to introduce too much complexity.
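The compromise of allowing a nested object in the "object" position can be sketched as follows. The `Fact` tuple, the `tx` field (standing in for the fourth element of a quad) and the dotted-predicate convention used when flattening are all assumptions for this example; they are neither Datomic's format nor a linked data standard.

```python
from typing import Any, NamedTuple


class Fact(NamedTuple):
    subject: str
    predicate: str
    obj: Any   # a plain value or a nested dict (the compromise)
    tx: str    # fourth element: the event or transaction the fact came from


def flatten(fact: Fact) -> list:
    """Expand a nested object into flat quads with dotted predicates."""
    def walk(predicate, value):
        if isinstance(value, dict):
            for key, inner in value.items():
                yield from walk(f"{predicate}.{key}", inner)
        else:
            yield Fact(fact.subject, predicate, value, fact.tx)
    return list(walk(fact.predicate, fact.obj))


# Hypothetical fact with a nested object.
fact = Fact(
    subject="package:example-1.0",
    predicate="release",
    obj={"version": "1.0", "signed": {"by": "maintainer", "valid": True}},
    tx="tx-42",
)

quads = flatten(fact)
```

Keeping nested objects makes heterogeneous sources cheap to ingest, while a function like `flatten` suggests how the same data could later be turned into strict quads.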
The 3 key entities in the data model are:
Project: Represents the name people use to refer to software. For example
Instance: A particular implementation for a given OS or audience. For example
Package: A particular release of a software component such as
In order to model the fact that software might have different capabilities or properties depending on its configuration or the set of features used, we use the following approach:
Configuration: A given Package can have any number of configurations. For example
Specifications: Configurations are defined by a set of specifications they implement. Such as
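The entities above could be sketched with Python dataclasses along these lines. The way the entities are nested (a Project holding Instances, an Instance holding Packages), the field names and the sample values are assumptions made for illustration; only the entity names come from the data model described here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Specification:
    name: str


@dataclass
class Configuration:
    label: str
    specifications: List[Specification] = field(default_factory=list)


@dataclass
class Package:
    component: str
    version: str
    configurations: List[Configuration] = field(default_factory=list)


@dataclass
class Instance:
    target: str  # the OS or audience this implementation is for
    packages: List[Package] = field(default_factory=list)


@dataclass
class Project:
    name: str
    instances: List[Instance] = field(default_factory=list)


def specifications_of(project: Project) -> set:
    """Collect every specification implemented by any configuration."""
    return {
        spec.name
        for inst in project.instances
        for pkg in inst.packages
        for cfg in pkg.configurations
        for spec in cfg.specifications
    }


# Hypothetical project, purely for illustration.
spec_a = Specification("hypothetical-spec-a")
spec_b = Specification("hypothetical-spec-b")
package = Package("example-messenger", "1.0", [
    Configuration("default", [spec_a]),
    Configuration("hardened", [spec_a, spec_b]),
])
project = Project("Example Messenger", [Instance("hypothetical-os", [package])])

specs = specifications_of(project)
```

Attaching specifications to configurations rather than to packages lets the same release carry different properties depending on how it is set up.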
Just collecting factual data about software is not enough, and there are plenty of projects that are already doing it. What we aim for with Open Integrity is to integrate these very different data sources and make sense of them as a whole. The data model is a good first iteration to "hang facts" that we collect ourselves, with the help of partners, or via external sources, but trying to implement a formal model of all the moving parts that play a role in software projects is too ambitious at this point (or maybe at any point).
What seems clear is that research and knowledge about software (and almost anything else) progresses through a dialectical process, which can be seen as an argumentation that unfolds as new techniques are developed and new knowledge emerges. It's also clear that some interpretations of facts (again, most of the interesting ones) are not universally accepted. So do we just give up and consider that it's all relative anyway? Surely not.
The approach we take is one where argumentation is part of the process of gathering more knowledge. Not to oversimplify but we're interested in these scenarios:
- Lack of credible facts. We are an evidence driven platform so facts are credible if they have publicly verifiable evidence.
- Disagreement on conclusions. Either on the premises, the inferences, or maybe
- Disagreement on values. The disagreement boils down to world views.
We're lucky that this problem is well researched in Argumentation Theory, which brings in various formal logic approaches, expressions of proof standards and the use of constraint-based programming. Value-based Argumentation Frameworks are of particular interest to us as they allow us to deal with the "Disagreement on values" scenario by letting different values underpin inferences in the system. We've adapted our argumentation layer data model in order to leverage the activity in this field and connect to existing systems.
One of the challenges we're working on now is to offer a user experience which isn't more complicated than a threaded discussion while enabling argumentations to be grounded in sound structures. Of course we have started to give a bit of thought to how this process could be abused, and we won't escape the need for skilled moderators if the platform draws interest.
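As a toy illustration of how a value-based argumentation framework handles disagreement on values, the sketch below follows the usual textbook formulation: an attack only succeeds as a defeat if the audience does not prefer the value of the attacked argument over the value of the attacker. The accepted arguments are then computed as the grounded extension. The argument names, values and helper functions are hypothetical, and this is not our argumentation layer.

```python
def defeats(attacks, value_of, prefers):
    """Keep an attack unless the audience prefers the target's value.

    `prefers` is a set of (preferred_value, other_value) pairs.
    """
    return {
        (a, b) for a, b in attacks
        if (value_of[b], value_of[a]) not in prefers
    }


def grounded_extension(arguments, defeat):
    """Iterate from the empty set: accept every argument whose defeaters
    are all defeated by an already accepted argument."""
    accepted = set()
    while True:
        defended = {
            a for a in arguments
            if all(
                any((d, attacker) in defeat for d in accepted)
                for attacker, target in defeat if target == a
            )
        }
        if defended == accepted:
            return accepted
        accepted = defended


# Two hypothetical arguments that attack each other.
arguments = {"a", "b"}
attacks = {("a", "b"), ("b", "a")}
value_of = {"a": "usability", "b": "security"}

security_first = {("security", "usability")}
usability_first = {("usability", "security")}

for_security_audience = grounded_extension(
    arguments, defeats(attacks, value_of, security_first))
for_usability_audience = grounded_extension(
    arguments, defeats(attacks, value_of, usability_first))
undecided = grounded_extension(
    arguments, defeats(attacks, value_of, set()))
```

The same facts and attacks lead to different accepted arguments depending on the audience's value ordering, and to none at all when no preference is stated, which makes the source of the disagreement explicit rather than hiding it.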
We're learning a lot in the process and we're trying to build in the flexibility that will allow the platform to evolve when it launches. We've made some good progress on how to tackle our problem space and we're gearing up to put these pieces together in a convincing way.
User experience is going to be key to lowering the barrier to contribution. So will good data that speaks for itself, and that's what we'll be in touch about next.
Let us know what you think via our GitLab instance (use our cypherpunks account if you don't want to create your own), for instance in our comments project.