- Family and Friends
Posts Tagged GraphDB
It’s about time to import some data into our previously established object scheme. If you want to do this yourself you want to first run the Crunchbase mirroring tool and create your own mirror on your hard disk.
In the next step another small tool needs to be written. A tool that creates nice clean GQL import scripts for our data. Since every data source is different there’s not really a way around this step – in the end you’ll need to extract data here and import data here. One possible different solution could be to implement a dedicated importer for the GraphDB – but I’ll leave that for another article series. Back to our tool: It’s called “First-Import” and it’s only purpose is to create a first small graph out of the mirrored Crunchbase data and fill the mainly primitive data attributes. Download this tool here.
This is why in this first step we mainly focus on the following object types:
Additionally all edges to a company object and the competition will be imported in this part of the article series.
So what does the first-import tool do? Simple:
- it then maps all attributes of that deserialized JSON object to attribute names in our graph data object scheme and it does so by outputting a simple query
- Simple Attribute Types like String and Integer are just simply assigned using the “=” operator in the Graph Query Language
- 1:1 References are assigned by assigning a REF(…) to the attribute – for example: INSERT INTO Product VALUES (Company = REF(Permalink=’companyname’))
- 1:n References are assigned by assigning a SETOF(…) to the attribute – because we are not using a bulk import interface but the standard GQL REST Interface it’s necessary that the object(s) we’re going to reference are already in existence – therefore we chose to do this 1:n linking step after creating the objects itself in a separate UPDATE step. Knowing this the UPDATE looks like this: UPDATE Company SET (ADD TO Competitions SETOF(permalink=’…’,permalink=’…’)) WHERE Permalink = ’companyname’
For the most part of the work it’s copy-n-paste to get the first-import tool together – it could have been done in a more sophisticated way (like using reflection on the deserialized JSON objects) but that’s most probably part of another article.
When run in the “crunchbase” directory created by the Crunchbase Mirroring tool the first-import tool generates GQL scripts – 6 of them to be precise:
The last script is named “Step_3” because it’s supposed to come after all the others.
These scripts can be easily imported after establishing the object scheme. The thing is though – it won’t be that fast. Why is that? We’re creating several thousand nodes and the edges between them. To create such an edge the Query Language needs to identify the node the edge originates and the node the edge should point to. To find these nodes the user is free to specify matching criteria just like in a WHERE clause.
So if you do a UPDATE Company SET (ADD TO Competitions SETOF(Permalink=’company1’,Permalink=’company2’)) WHERE Permalink = ’companyname’ the GraphDB needs to access the node identified by the Permalink Attribute with the value “companyname” and the two nodes with the values “company1” and “company2” to create the two edges. It will work just like all the scripts are but it won’t be as fast as it could be. What can help to speed up things are indices. Indices are used by the GraphDB to identify and find specific objects. These indices are used mainly in the evaluation of a WHERE clause.
The sones GraphDB offers a number of integrated indices, one of which is HASHTABLE which we are going to use in this example. Furthermore everyone interested can implement it’s own index plugin – we will have a tutorial how to do that online in the future – if you’re interested now just ask how we can help you to make it happen!
Back to the indices in our example:
The syntax of creating an index is quite easy, the only thing you have to do is tell the CREATE INDEX query on which type and attribute the index should be created and of which indextype the index should be. Since we’re using the Permalink attribute of the Crunchbase objects as an identifier in the example (it could be any other attribute or group of attributes that identify one particular object) we want to create indices on the Permalink attribute for the full speed-up. This would look like this:
- CREATE INDEX ON Company (Permalink) INDEXTYPE HashTable
- CREATE INDEX ON FinancialOrganization (Permalink) INDEXTYPE HashTable
- CREATE INDEX ON Person (Permalink) INDEXTYPE HashTable
- CREATE INDEX ON ServiceProvider (Permalink) INDEXTYPE HashTable
- CREATE INDEX ON Product (Permalink) INDEXTYPE HashTable
Looks easy, is easy! To take advantage of course this index creation should be done before creating the first nodes and edges.
After we got that sorted the only thing that’s left is to run the scripts. This will, depending on your machine, take a minute or two.
So after running those scripts what happened is: all Company, FinancialOrganization, Person, ServiceProvider and Product objects are created and filled with primitive data types
- all attributes which are essentially references (1:1 or 1:n) to a Company object are being set, these are
That’s it for this part – in the next part of the series we will dive deeper into connecting nodes with edges. There is a ton of things that can be done with the data – stay tuned for the next part.
After the overview and the first use-case introduction it’s about time to play with some data objects.
So how can one actually access the data of crunchbase? Easy as pie: Crunchbase offers an easy to use interface to get all information out of their database in a fairly structured JSON format. So what we did is to write a tool that actually downloads all the available data to a local machine so we can play with it as we like in the following steps.
This small tool is called MirrorCrunchbase and can be downloaded in binary and sourcecode here. As for all sourcecode and tools in this series this runs on windows and linux (mono). You can use the sourcecode to get an impression what’s going on there or just the included binaries (in bin/Debug) to mirror the data of Crunchbase.
To say a few words about what the MirrorCrunchbase tool actually does first a small source code excerpt:
So first it gets the list of all objects like the company names and then it retrieves each company object according to it’s name and stores everything in .js files. Easy eh?
When it’s running you get an output similar to that:
And after the successful completion you should end up with a directory structure
The .js files store basically every information according to the data scheme overview picture of part 2. So what we want to do now is to transform this overview into a GQL data scheme we can start to work with. A main concept of sones GraphDB is to allow the user to evolve a data scheme over time. That way the user does not have to have the final data scheme before the first create statement. Instead the user can start with a basic data scheme representing only standard data types and add complex user defined types as migration goes along. That’s a fundamentally different approach from what database administrators and users are used to today.
Todays user generated data evolves and grows and it’s not possible to foresee in which way attributes need to be added, removed, renamed. Maybe the scheme changes completely. Everytime the necessity emerged to change anything on a established and populated data scheme it was about time to start a complex and costly migration process. To substantially reduce or even in some cases eliminate the need for such a complex process is a design goal of the sones GraphDB.
In the Crunchbase use-case this results in a fairly straight-forward process to establish and fill the data scheme. First we create all types with their correct name and add only those attributes which can be filled from the start – like primitives or direct references. All Lists and Sets of Edges can be added later on.
So these would be the Create-Type Statements to start with in this use-case:
CREATE TYPE Company ATTRIBUTES ( String Alias_List, String BlogFeedURL, String BlogURL, String Category, DateTime Created_At, String CrunchbaseURL, DateTime Deadpooled_At, String Description, String EMailAdress, DateTime Founded_At, String HomepageURL, Integer NumberOfEmployees, String Overview, String Permalink, String PhoneNumber, String Tags, String TwitterUsername, DateTime Updated_At, Set<Company> Competitions )
CREATE TYPE FinancialOrganization ATTRIBUTES ( String Alias_List, String BlogFeedURL, String BlogURL, DateTime Created_At, String CrunchbaseURL, String Description, String EMailAdress, DateTime Founded_At, String HomepageURL, String Name, Integer NumberOfEmployees, String Overview, String Permalink, String PhoneNumber, String Tags, String TwitterUsername, DateTime Updated_At )
CREATE TYPE Product ATTRIBUTES ( String BlogFeedURL, String BlogURL, Company Company, DateTime Created_At, String CrunchbaseURL, DateTime Deadpooled_At, String HomepageURL, String InviteShareURL, DateTime Launched_At, String Name, String Overview, String Permalink, String StageCode, String Tags, String TwitterUsername, DateTime Updated_At)
CREATE TYPE ExternalLink ATTRIBUTES ( String ExternalURL, String Title )
CREATE TYPE EmbeddedVideo ATTRIBUTES ( String Description, String EmbedCode )
CREATE TYPE Image ATTRIBUTES ( String Attribution, Integer SizeX, Integer SizeY, String ImageURL )
CREATE TYPE IPO ATTRIBUTES ( DateTime Published_At, String StockSymbol, Double Valuation, String ValuationCurrency )
CREATE TYPE Acquisition ATTRIBUTES ( DateTime Acquired_At, Company Company, Double Price, String PriceCurrency, String SourceDestination, String SourceURL, String TermCode )
CREATE TYPE Office ATTRIBUTES ( String Address1, String Address2, String City, String CountryCode, String Description, Double Latitude, Double Longitude, String StateCode, String ZipCode )
CREATE TYPE Milestone ATTRIBUTES ( String Description, String SourceDescription, String SourceURL, DateTime Stoned_At )
CREATE TYPE Fund ATTRIBUTES ( DateTime Funded_At, String Name, Double RaisedAmount, String RaisedCurrencyCode, String SourceDescription, String SourceURL )
CREATE TYPE Person ATTRIBUTES ( String AffiliationName, String Alias_List, String Birthplace, String BlogFeedURL, String BlogURL, DateTime Birthday, DateTime Created_At, String CrunchbaseURL, String FirstName, String HomepageURL, Image Image, String LastName, String Overview, String Permalink, String Tags, String TwitterUsername, DateTime Updated_At )
CREATE TYPE Degree ATTRIBUTES ( String DegreeType, DateTime Graduated_At, String Institution, String Subject )
CREATE TYPE Relationship ATTRIBUTES ( Boolean Is_Past, Person Person, String Title )
CREATE TYPE ServiceProvider ATTRIBUTES ( String Alias_List, DateTime Created_At, String CrunchbaseURL, String EMailAdress, String HomepageURL, Image Image, String Name, String Overview, String Permalink, String PhoneNumber, String Tags, DateTime Updated_At )
CREATE TYPE Providership ATTRIBUTES ( Boolean Is_Past, ServiceProvider Provider, String Title )
CREATE TYPE Investment ATTRIBUTES ( Company Company, FinancialOrganization FinancialOrganization, Person Person )
CREATE TYPE FundingRound ATTRIBUTES ( Company Company, DateTime Funded_At, Double RaisedAmount, String RaisedCurrencyCode, String RoundCode, String SourceDescription, String SourceURL )
You can directly download the according GQL script here. If you use the sonesExample application from our open source distribution you can create a subfolder “scripts” in the binary directory and put the downloaded script file there. When you’re using the integrated WebShell, which is by default launched on port 9975 an can be accessed by browsing to http://localhost:9975/WebShell you can execute the script using the command “execdbscript” followed by the filename of the script.
As you can see it’s quite straight forward a copy-paste action from the graphical scheme. Even references are not represented by a difficult relational helper, instead if you want to reference a company object you can just do that (we actually did that – look for example at the last line of the gql script above). As a result when you execute the above script you get all the Types necessary to fill data in in the next step.
So that’s it for this part – in the next part of this series we will start the initial data import using a small tool which reads the mirrored data and outputs gql insert queries.
Where to start: existing data scheme and API
This series already tells in it’s name what the use case is: The “CrunchBase”. On their website they speak for themselves to explain what it is: “CrunchBase is the free database of technology companies, people, and investors that anyone can edit.”. There are many reasons why this was chosen as a use-case. One important reason is that all data behind the CrunchBase service is licensed under Creative-Commons-Attribution (CC-BY) license. So it’s freely available data of high-tech companies, people and investors.
Currently there are more than 40.000 different companies, 51.000 different people and 4.200 different investors in the database. The flood of information is big and the scale of connectivity even bigger. The graph represented by the nodes could be even bigger than that but because of the limiting factors of current relational database technology it’s not feasible to try to do that.
sones GraphDB is coming to the rescue: because it’s optimized to handle huge datasets of strongly connected data. Since the CrunchBase data could be uses as a starting point to drive connectivity to even greater detail it’s a great use-case to show these migration and handling.
Thankfully the developers at CrunchBase already made one or two steps into an object oriented world by offering an API which answers queries in JSON format. By using this API everyone can access the complete data set in a very structured way. That’s both good and bad. Because the used technologies don’t offer a way to represent linked objects they had to use what we call “relational helpers”. For example: A person founded a company. (person and company being a JSON object). There’s no standardized way to model a relationship between those two. So what the CrunchBase developers did is they added an unique-Identifier to each object. And they added a new object which is uses as a “relational helper”-object. The only purpose of these helper objects is to point towards a unique-identifier of another object type. So in our example the relationship attribute of the person object is not pointing directly to a specific company or relationship, but it’s pointing to the helper object which stores the information which unique-identifier of which object type is meant by that link.
To visualize this here’s the data scheme behind the CrunchBase (+all currently available links):
As you can see there are many more “relational helper” dead-ends in the scheme. What an application had to do up until now is to resolve these dead-ends by going the extra mile. So instead of retrieving a person and all relationships, and with them all data that one would expect, the application has to split the data into many queries to internally build a structure which essentially is a graph.
Another example would be the company object. Like the name implies all data of a company is stored there. It holds an attribute called investments which isn’t a primitive data type (like a number or text) but a user defined complex data type. This user defined data type is called List<FundingRoundStructure>. So it’s a simple list of FundingRoundStructure objects.
When we take a look at the FundingRoundStructure there’s an attribute called company which is made up by the user defined data type CompanyStructure. This CompanyStructure is one of these dead-ends because there’s just a name and a unique-id. The application now needs retrieve the right company object with this unique-id to access the company information.
Simple things told in a simple way: No matter where you start, you always will end up in a dead-end which will force you to start over with the information you found in that dead-end. It’s not user-friendly nor easy to implement.
The good news is that there is a way to handle this type of data and links between data in a very easy way. The sones GraphDB provides a rich set of features to make the life of developers and users easier. In that context: If we would like to know which companies also received funding from the same investor like let’s say the company “facebook” the only thing necessary would be one short query. Beside that those “relational helpers” are redundant information. That means in a graph database this information would be stored in the form of edges but not in any helper objects.
The reason why the developers of CrunchBase had to use these helpers is that JSON and the relational table behind it isn’t able to directly store this information or to query it directly. To learn more about those relational tables and databases try this link.
I want to end this part of the series with a picture of the above relational diagram (without the arrows and connections).
The next part of the series will show how we can access the available information and how a graph scheme starts to evolve.
If you want to explain how easy it is for a user or developer to use the sones GraphDB to work on existing datasets you do that by showing him an example – a use case. And this is exactly what this short series of articles will do: It’ll show the important steps and concepts, technologies and designs behind the use case and the sones GraphDB.
The sones GraphDB is a DBMS focusing on strong connected unstructured and semi-structured data. As the name implies these data sets are organized in Nodes and Edges objectoriented in a graph data structure.
“a simple graph”
To handle these complex graph data structures the user is given a powerful toolset: the graph query language. It’s a lot like SQL when it comes to comprehensibility – but when it comes to functionality it’s completely designed to help the user do previously tricky or impossible things with one easy query.
This articles series is going to show how real conventional-relational data is aggregated and ported to an easy to understand and more flexible graph datastructure using the sones GraphDB. And because this is not only about telling but also about doing we will release all necessary tools and source codes along with this article. That means: This is a workshop and a use case in one awesome article series.
The requirements to follow all steps of this series are: You want to have a working sone GraphDB. Because we just released the OpenSource Edition Version 1.1 you should be fine following the documentation on how to download and install it here. Beside that you won’t need programming skills but if you got them you can dive deep into every aspect. Be our guest!
This first article is titled “Overview” and that’s what you’ll get:
part 1: Overview
part 2: A short introduction into the use-case and it’s relational data
part 3: Which data and how does a GQL data scheme start?
part 4: The initial data import
part 5: Linking nodes and edges: What’s connected with what and how does the scheme evolve?
part 6: Querying the data and how to access it from applications?