The Future of Fields
A group of Drupal developers recently met in Chicago to make Drupal more flexible when it comes to dealing with external data sources. A driving factor was the desire to have internal access to external data sources without having to import huge amounts of data into the database. They also wanted to treat external data as Drupal nodes, for example, being able to reference any Flickr photo and vote on it with the Fivestar module without having to download and store every single photo.
The biggest conclusion that the design sprint participants came away with was to focus on getting more of CCK into core, specifically making fields first-class data objects. Some of their insights and open questions were discussed in the Future of Fields session. [Begins by showing a picture of the Data Architecture Design Sprint participants, including: chx, yched, bjaspan, crell, karens and nedjo]
Drupalcon used to be a developer conference where they would meet and figure out what was going to happen for the next year, but in Barcelona they realized that it was harder to do that.
So they asked who the right people to involve were, and they all flew into Chicago and did a sprint.
Barry would advocate making this a trend in future Drupal development: get together a targeted group to make some progress.
The DADS (Data Architecture Design Sprint): a re-design of Drupal's core data architecture
* Data API
* Object modeling - the node object is a dumping ground for stuff. Some is in arrays, some are in classes, some are fake classes.
* Fields in core -- What does that mean? CCK in core. Once the node object is cleaned up, let's clean up how it's rendered. They then realized that this was getting really big, and started to focus on getting fields into core, and then things would go from there.
* Many related ideas
Their goal was to propose a design at Drupalcon.
Our "Grand Conclusions"
Some key thoughts: What makes Drupal unique?
* Drupal has the hook-based system
* Drupal's architecture allows contributed modules to easily add value to content
* Everything from fivestar to views -- the fact that there are so many of these modules is what makes Drupal unique.
The Future
Web services -- sharing data from multiple sources
* Consuming -- Need to import Amazon data or RDF
* Providing access to our data -- no way at the moment to export data via JSON without custom coding.
* Enhancing -- Being able to do any Drupal action on fields that is currently limited to nodes
Drupal is all about our contributed modules, so we need to be able to provide our secret sauce throughout Drupal. Otherwise we're just sucking in and aggregating other people's data, and not leveraging Drupal's power and flexibility.
So we want to add Drupal's value to the data sent out to web services. If we take a photo from Flickr, it shouldn't have to be a node for us to be able to vote on it. And putting CCK into core is a big step towards that.
[Showing KarenS flowchart] Drupal takes in data, formats and saves data, and produces HTML (and XML, etc) -- The little triangles at the bottom are the Drupal hooks that hold Drupal together.
If getting fields into Core is where we want to go, then Karen will talk more about it.
KAREN STEVENSON: We all know that getting fields into core is the big thing that we want to do for D7. So why bother? The way CCK handles fields is what we want to do, and it's a good way of looking at the issue.
At the usability testing at UMN, people were confused that Title and Body were presented like fields -- even though they don't act the same way as fields. The reason why they behave differently is that fields are not in core. One way to get that consistency is to get fields into core.
In a way, we have developed something better for users than developers. Users can create fields on the fly, but developers need to be able to do that as well. So that's also a motivation to get CCK fields into core.
Hopefully people are familiar with CCK. "Field type" is how the data is stored, and what is unique about it. A field type tells you how it is stored in the database. The "field" is really the settings: how we set up the field, its length or what type of field it is. "Field instance" is how we bind a field to a content type. The instance also has settings.
It's confusing terminology, but that's the way CCK works now.
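To make the three terms a little more concrete, here is a minimal sketch in Python. It is not CCK's actual API: the class names, attribute names and the example settings (`max_length`, `required`) are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FieldType:
    """How a kind of value is stored (hypothetical model, not CCK code)."""
    name: str
    columns: tuple  # database columns this type needs


@dataclass
class Field:
    """A configured field: a field type plus its settings."""
    name: str
    type: FieldType
    settings: dict = field(default_factory=dict)


@dataclass
class FieldInstance:
    """A field bound to one content type, with instance-level settings."""
    base: Field
    content_type: str
    settings: dict = field(default_factory=dict)


# One field type, one field, two instances on different content types.
text = FieldType("text", ("value",))
subtitle = Field("field_subtitle", text, {"max_length": 255})
on_story = FieldInstance(subtitle, "story", {"required": True})
on_page = FieldInstance(subtitle, "page", {"required": False})
```

The same field is shared by both content types, while each instance carries its own settings.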
Let's take a look at existing things in core to see what types of things could be fields.
* Body: text area with a teaser splitter. It's a two-field, teaser area and body area.
* Created: date field
* Upload: file field
* User picture: image field
* IDs: Number + optionwidgets
* Taxonomy feels like a field -- but taxonomy has so much special processing, and will be a challenging thing
* Comments? Fields? Or are they data? They're not sure yet.
Several of these things are not even in core CCK like date, file or image field. So it's a double promotion from not existing in contrib to not existing in core. And the reason why they're not in CCK is because they're not simple fields, and all of those things should make it in core CCK.
When it comes to doing a split -- no one is talking about taking ALL of CCK and putting it into core. That's too much, and would add too much complication.
What is the Minimum Viable Possibility: how much of CCK should go into core D7, and what should stay in contrib?
Core
* Field API
* Field Storage Engine
* Node CRUD
* Widgets
* Forms
* Formatters
* Multiple values -- handle them as a whole or as separate things: like a GMAP -- "Add more" -- Core now has add more button -- Custom
* Field validation
Contrib
* Field UI -- that'll have to stay in contrib, because it's messy and hard -- Add, manage, display tabs
* Fieldgroups
* Allowed & default values -- probably more complicated than we need to go
* Content Copy
* Module integration -- Views, token, pathauto
The core module would be called "field.module" and we continue to have a contrib module.
Field storage options
CCK does dynamic storage of fields: single-value fields are stored per content type, and multiple-value fields are put into a 'per field' table. If you share a field between two content types, CCK has to create a new table for it.
Any time you change parameters, you have to alter the schema, and move the data from one table to another one, and potentially lose some data.
There are lots of places where it can go wrong and where data could get lost, and it's worse now because CCK has a dynamically-defined schema. It stores the field information in a field table and reads it whenever there is a call to determine the schema. They ran into race conditions if you want to actually change the schema. And the big concern is to get all of this into core without breaking everything and making it a maintenance nightmare.
So one option is fixed 'per field' tables with a cached node array
* Simple structure and code
* Downside is that there is a performance hit from numerous SQL JOINs
Another solution is to combine 'per field' tables with intermediate tables better optimized for queries
* Duplication of data yields a larger database
They are doing a separate cache for the node object, so that they're not doing the joining as often -- which works fine for node_load, but we're back to lots of JOINs with Views. And currently Views is already doing the JOINs, but they've been going round and round -- it's been a show stopper without figuring this out.
Their preliminary conclusion is that simple 'per field' tables seem to make sense. It gives simple code that's predictable, and let's find out with performance tests if these JOINs are an issue. They imagine coding it up and implementing it, and then seeing if there are killer performance issues. If so, then scrap it and try another approach.
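As a rough illustration of where the JOIN cost comes from, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (`field_subtitle`, `field_year`, `delta`, `value`) are assumptions for the example, not the schema CCK or core actually uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (nid INTEGER PRIMARY KEY, type TEXT);
    -- One fixed table per field (hypothetical names).
    CREATE TABLE field_subtitle (nid INTEGER, delta INTEGER, value TEXT);
    CREATE TABLE field_year (nid INTEGER, delta INTEGER, value INTEGER);
    INSERT INTO node VALUES (1, 'story');
    INSERT INTO field_subtitle VALUES (1, 0, 'A first look');
    INSERT INTO field_year VALUES (1, 0, 2008);
""")

# Loading one node needs one JOIN per field attached to it.
row = conn.execute("""
    SELECT n.nid, n.type, s.value, y.value
    FROM node n
    LEFT JOIN field_subtitle s ON s.nid = n.nid AND s.delta = 0
    LEFT JOIN field_year y ON y.nid = n.nid AND y.delta = 0
    WHERE n.nid = ?
""", (1,)).fetchone()
print(row)
```

The schema never changes when a field is reconfigured or shared, which is the appeal; the query grows by one JOIN for every field, which is the performance question they want to test.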
The Field API
There is a basic API in CCK D6. Giving modules the ability to create fields creates some problems.
* Should the fields be locked down and not be able to be altered?
* Who is going to own the fields?
* Are they available in the UI?
* Should those fields be shared?
A lot of these issues and more need to be worked through.
In CCK for D6, they are working on how to get some field-based permissions, and so getting field permissions into D7 may also be a Phase II issue, after a big chunk of functionality has been committed in Phase I.
If they had permissions for fields, then modules could do more things with fields. Jeff Eaton has been in discussion with KarenS, and they've agreed that it's better to start with them as being locked down permissions-wise, and then gradually move towards allowing people to have permissions on them.
* Can start the process by separating the API from the UI and creating a new fields.module in D6 with the idea to move it into D7.
* It'll have to include the basic fields (text, number, optionwidgets)
* Rework existing core fields and transition them into fields.module -- they think it makes sense to start with the Body field.
* Determine what else needs to be done for date, image, file handling etc.
* If they can get something into D7 early in the process, they can then possibly get some of the UI into core.
QUESTION: Would Page and Story be defined in the CCK way instead of the way they are?
The revisions table could go away because the body would become the revisions table (and CCK currently handles revisions on the field level)
QUESTION: If fields get into core, then instead of Webform would we now just use the core field.module? Not sure, a lot of things could happen.
QUESTION: What about the Image field -- Possibly grouping fields for the alt tags, etc. to have the same table? -- And a fourth option: not dynamically from the UI, but a programmatic way to have a hook that says that these fields will always stay together. You could keep consistency in the schema API.
The API will be able to define how fields will be grouped together.
LARRY GARFIELD gets up to talk about Local Fields on Remote Content
One challenge: If you want Drupal to talk to foreign data, then it's really, really hard.
What do we mean by data sources?
* local nodes
* Entire Amazon catalog
* Flickr photos
* legacy data -- accessible via SOAP, XML, etc.
* Other content from Drupal sites
For any given piece of content you want to perform any of the following operations:
* Display locally
* Comment
* Vote
* Search
* Views
These operations don't work for non-local data.
Importing ALL of that data can be really bad, because you have huge amounts of data -- and even if it was possible, then keeping it in sync could be really crazy. Lazy node creation also isn't much better.
But with everything we've talked about so far -- we don't get anything new and exciting. So what would this enable that we could really get excited about?
The Data Architecture Design Sprint participants talked about the "Institute of Contemporary Art" as a case study -- coincidentally a very similar project to what Palantir was working on.
* Data in a legacy database that was available via SOAP, and that they didn't want to import b/c it needed to be accessed via another non-Drupal CMS.
* They had artwork (title, year, image URL), artist (name, birthplace, bio), resources (video, audio, etc)
* Had to figure out how to make all of this available to Drupal.
* The goal is to have http://example.com/artwork/abc -- and it's not a node in the database, but they want to be able to treat it as a node.
So what does it mean to be a node? It has a nid, and some simple properties like created, status, and simple metadata, and then they have the Fields, which are the meat of the content -- title body, comments, taxonomy, CCK fields, and other complex structured data attached on by other modules.
So how can you provide a unified structure for the Node and the ICA artwork: a "Thingy"? [Thingy is a temporary name decided in Barcelona; it won't be called a thingy when it gets into core.]
It has a unique ID that is an opaque string, it has properties (i.e. metadata), and it has fields -- which could be intrinsic, like the title or body, as well as extrinsic fields that come from the Drupal database. Code that uses anything that is pulled in doesn't need to know where it's coming from -- when loading the object we shouldn't care where the external data is coming from. They want to have a clear separation between the data source and the data interface.
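A minimal sketch of that separation, in Python. Nothing here is a proposed Drupal API: `DataSource`, `InMemoryArtworkSource`, `Thingy` and `load_thingy` are hypothetical names, and the in-memory source simply stands in for a remote service such as the SOAP backend.

```python
from dataclasses import dataclass, field
from typing import Protocol


class DataSource(Protocol):
    """Anything that can return the intrinsic fields for an opaque ID."""
    def load(self, thingy_id: str) -> dict: ...


class InMemoryArtworkSource:
    """Stand-in for a remote service (hypothetical)."""
    def __init__(self, records: dict):
        self._records = records

    def load(self, thingy_id: str) -> dict:
        return dict(self._records[thingy_id])


@dataclass
class Thingy:
    type: str
    id: str  # opaque string, not necessarily a nid
    fields: dict = field(default_factory=dict)


def load_thingy(type_: str, thingy_id: str, source: DataSource,
                local_fields: dict) -> Thingy:
    """Merge intrinsic (remote) fields with extrinsic (local) ones."""
    intrinsic = source.load(thingy_id)
    extrinsic = local_fields.get((type_, thingy_id), {})
    return Thingy(type_, thingy_id, {**intrinsic, **extrinsic})


remote = InMemoryArtworkSource({"abc": {"title": "Untitled", "year": 1998}})
local = {("artwork", "abc"): {"vote": [5, 4]}}
artwork = load_thingy("artwork", "abc", remote, local)
print(artwork.fields)
```

The caller only sees a `Thingy`; swapping the in-memory source for a real remote one would not change the code that consumes it.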
How do you add a field to a thingy so that you can add Fivestar.module and add some votes? Instead of a single column for a nid, you have two columns: the type (node, artwork, artwork) and the id (12, abc [menu path on the file system], abc). Then come the Delta column (0, 0, 1) and the Vote (2, 5, 4).
The artwork "abc" does not exist anywhere else in SQL other than in this vote table.
The Delta column is a CCK-specific value that allows for multiple values for CCK fields.
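The example rows above can be written out as a small table. This is a sketch using Python's built-in sqlite3 module; the table name `field_vote` and the column names are assumptions, only the values come from the session.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE field_vote (
        type  TEXT,     -- kind of thingy: 'node', 'artwork', ...
        id    TEXT,     -- opaque ID within that type
        delta INTEGER,  -- position among multiple values
        vote  INTEGER,
        PRIMARY KEY (type, id, delta)
    )
""")
conn.executemany(
    "INSERT INTO field_vote VALUES (?, ?, ?, ?)",
    [("node", "12", 0, 2), ("artwork", "abc", 0, 5), ("artwork", "abc", 1, 4)],
)

# All votes on the remote artwork 'abc', in delta order.
votes = [v for (v,) in conn.execute(
    "SELECT vote FROM field_vote WHERE type = ? AND id = ? ORDER BY delta",
    ("artwork", "abc"),
)]
print(votes)
```

The artwork itself is never stored; only the locally added vote values are keyed by its type and ID.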
QUESTION: What if you have a combination of some images from Flickr and some stored locally?
Having a given field with some values coming from different data sources would be helpful, but would be really difficult to implement. The heavy lifting is on the field level.
QUESTION: What if the remote data is deleted, how do you sync? We don't know yet, that's yet to be determined, and part of the problem with dealing with external and stale data.
The thingy called "artwork" is equivalent to what we currently call a "node." Artwork would be a class, and then we load the object with "abc" to pull it in from the remote system, and the only thing stored locally is the Drupal value-data like a fivestar vote.
QUESTION: Works well for files or images, but what about text strings and caching?
It would have to be exposed in some form that you can query. But it depends on your use case.
This group left some of this discussion out of the presentation, and has more info in their group on g.d.o.
What about searching these?
Can't search local database AND remote SOAP within one SQL request -- possibly with SPARQL, but will probably still have to do a separate query for each data source and merge results.
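The "separate query per data source, then merge" fallback might look roughly like this in Python. The two search functions and the `score` key are invented for the example; how results from different sources would really be ranked against each other is one of the open questions.

```python
def search_all(query, sources):
    """Run the query against every source and merge the hits by score."""
    merged = []
    for name, search in sources.items():
        for hit in search(query):
            merged.append({**hit, "source": name})
    merged.sort(key=lambda hit: hit["score"], reverse=True)
    return merged


# Hypothetical stand-ins for a local SQL search and a remote SOAP search.
def search_local(query):
    return [{"id": "12", "title": "Local story about " + query, "score": 0.4}]


def search_remote(query):
    return [{"id": "abc", "title": "Artwork: " + query, "score": 0.9}]


results = search_all("sculpture", {"local": search_local,
                                   "soap": search_remote})
print([(r["source"], r["id"]) for r in results])
```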
Views -- probably does not work as we understand it today because Views is tied to a local SQL database. We haven't figured it out yet -- possibly lazy load SQL. Will probably have to just implement it and see what works and what doesn't work.
The Views query-builder will live on for local data, but the rest of views is unclear for how it will evolve or adapt for dealing with remote data.
NEDJO ROGERS: Brief summary of these ideas and proposals, and we need to hear back and have follow-up discussions for how much this makes sense. They will have to fundamentally rework how core works with data with nodes and users.
We also have some assumptions in Drupal that are being challenged:
* All data is entered into the DB via forms by users
* Also anything can be extended at any time.
* All data resides in a local SQL database
CCK knows what its object looks like; we don't have to guess b/c it's aware of its schema. It sort of represents the type of data API we're looking for. Let's do it, but let's do it right, by renewing our core Data API so that we can free it from these assumptions.
Some of these things can be taken care of now with CCK for Drupal 6. Looking at a minimal implementation of CCK for Drupal 7, and where we take it is an open question.
They'd like to hear back any questions and comments, does it make sense to people?
Does it ring true as a future direction for Drupal, and use the CCK fields as a broader goal for some of these things moving forward?
None of this is going to happen unless someone takes ownership, gets involved and helps make it happen.
If you want to help, and don't know how, then write unit tests so that it will extend the development cycle.
Much more info on Daily reports, Final Report, Proposed Content Models 1 and 2, and more ramblings: http://groups.drupal.org/data-architecture-design-sprint
Replace the concept of nodes with objects that can be uniquely identified -- how we do it: whether it's something on top of the node vs. something else, then take a look at Model 1 and 2.
QUESTION: Will the data structure be abstract and flexible enough to have a choice of fully normalized vs. de-normalized with no joins?
We don't know yet. When we think about how we can attach a comment to an external Flickr photo, whether the field table knows what the actual data looks like, or whether another option is used, is something actively being discussed.
They do have adding fields to fields in D6.
QUESTION: Stale data problem, and data unavailable.
Might be implementation specific: dependent on the field on whether it's a Flickr photo or for the museum. But what you say generally is "data unavailable"
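A generic "data unavailable" fallback could be as simple as the following Python sketch. `SourceUnavailable`, `load_or_placeholder` and `broken_flickr_load` are hypothetical names; a real implementation would decide per field how to react.

```python
class SourceUnavailable(Exception):
    """Raised by a (hypothetical) data source that cannot be reached."""


def load_or_placeholder(load, thingy_id, placeholder="data unavailable"):
    """Return the loaded value, or a generic placeholder on failure."""
    try:
        return load(thingy_id)
    except SourceUnavailable:
        return placeholder


def broken_flickr_load(photo_id):
    # Simulates a remote photo that was deleted or a service that is down.
    raise SourceUnavailable(photo_id)


print(load_or_placeholder(broken_flickr_load, "abc"))
```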
QUESTION: How would node revisions work if KarenS is saying that using the field.module for the body would get rid of node revisions. Do revisions go away? What about the diff.module?
Every CCK field keeps its own revisions, and so she is throwing that idea out there, because CCK has something very similar already built in for each field.
QUESTION: Also think about the possibility of storing the revisions as a diff, so that you have a whole version control system built in internally.
QUESTION: Why not just use XML, and then think of thingies as 'entities' and fields as 'attributes'?
It's good for some tasks, but it's not really a good mechanism to use internally. XML attributes aren't multi-value. You could easily map the thingy structure to XML if you want to export.
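As a sketch of such an export mapping, here is one way to do it with Python's standard xml.etree.ElementTree. The element and attribute names (`thingy`, `field`, `name`, `delta`) are assumptions; multi-value fields become repeated child elements, since an XML attribute can only hold one value.

```python
import xml.etree.ElementTree as ET


def thingy_to_xml(type_, thingy_id, fields):
    """Serialize a thingy; each field value becomes one <field> element."""
    root = ET.Element("thingy", {"type": type_, "id": thingy_id})
    for name, values in fields.items():
        for delta, value in enumerate(values):
            element = ET.SubElement(
                root, "field", {"name": name, "delta": str(delta)})
            element.text = str(value)
    return ET.tostring(root, encoding="unicode")


xml_text = thingy_to_xml("artwork", "abc",
                         {"title": ["Untitled"], "vote": [5, 4]})
print(xml_text)
```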