Machine Learning/Kaggle Social Network Contest/Problem Representation
TODO
- come up with a plan of attack.
Idea A
Construct a huge CSV file containing each possible directed link and a bunch of features associated with it, then do some supervised learning on it.
It would have the following format (see the sketch after the list below):
node_i, node_j, feature_ij_1, feature_ij_2, ...
- The node_i's would come from the set of sampled users (i.e. the 38k outbound nodes).
- The node_j's would come from the union of outbound and inbound nodes (1,133,518 of them).
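A minimal sketch of how such a file could be generated with networkx, assuming the contest edge list sits in a file called train.csv (one "follower,followee" line per observed edge) and using two throwaway placeholder features (common neighbours and whether the reverse edge exists); iterating over every sampled-by-candidate pair like this is exactly the scale problem worked out below:

 import csv
 import networkx as nx
 
 # Assumed input: one "follower,followee" line per observed edge.
 G = nx.read_edgelist("train.csv", delimiter=",", create_using=nx.DiGraph())
 
 sampled = [n for n in G if G.out_degree(n) > 0]   # the ~38k outbound nodes
 candidates = list(G)                              # outbound plus inbound nodes
 
 with open("pair_features.csv", "w", newline="") as out:
     writer = csv.writer(out)
     for i in sampled:
         neighbours_i = set(G.successors(i)) | set(G.predecessors(i))
         for j in candidates:
             if i == j:
                 continue
             # Placeholder features - anything cheap to compute per pair would do.
             neighbours_j = set(G.successors(j)) | set(G.predecessors(j))
             common = len(neighbours_i & neighbours_j)
             reciprocal = int(G.has_edge(j, i))
             writer.writerow([i, j, common, reciprocal])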
The length of this would be huge. The file would need about (37,689 * 1,133,547) - 1,133,547 = 42,721,119,336 rows.
Say each column took up 7 characters and there were 12 columns (i.e. 10 features); we'd have a row of about 84 bytes, which makes the whole file about 3,342 gigabytes.
(Note: if I have miscounted the number of unique nodes and there really are only 38k of them, we'd still be dealing with a roughly 112 GB file.)
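As a quick sanity check on that back-of-the-envelope estimate (the node counts are the ones quoted above; 7 characters per column is just a guess):

 # Rough size estimate for the full pairwise CSV.
 sampled_nodes = 37689          # outbound (sampled) users
 all_nodes = 1133547            # union of outbound and inbound nodes
 rows = sampled_nodes * all_nodes - all_nodes   # ~42,721,119,336 candidate pairs
 bytes_per_row = 7 * 12                         # 12 columns at ~7 characters each
 print(rows, rows * bytes_per_row / 2**30)      # roughly 3,342 GiB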
This number could be culled by considering just the nodes in some neighbourhood - but I figure that would only provide us with information about nodes which are connected.
Idea B
We could perform some kind of online learning on the network, where we compute features for a pair of nodes and then update the parameters. This would take about 42 billion steps - which sounds like a lot.
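A rough sketch of what one such online pass could look like, assuming a simple logistic-regression model trained by stochastic gradient descent; the features, the learning rate, and the way negative (non-edge) examples are chosen are all placeholders here, not part of any agreed plan:

 import numpy as np
 import networkx as nx
 
 def pair_features(G, i, j):
     # Placeholder pairwise features; anything cheap to compute per pair would do.
     common = len(set(G.successors(i)) & set(G.predecessors(j)))
     reciprocal = 1.0 if G.has_edge(j, i) else 0.0
     return np.array([1.0, float(common), reciprocal])   # leading 1.0 is a bias term
 
 def online_pass(G, examples, lr=0.01):
     # examples yields ((i, j), label): 1 for an observed edge, 0 for a sampled non-edge.
     w = np.zeros(3)
     for (i, j), y in examples:
         x = pair_features(G, i, j)
         p = 1.0 / (1.0 + np.exp(-w.dot(x)))   # predicted probability of edge i -> j
         w += lr * (y - p) * x                 # gradient step on the logistic loss
     return w
 
 # Toy usage: positives are observed edges, negatives are a couple of non-edges.
 G = nx.DiGraph([(1, 2), (2, 3), (1, 3)])
 examples = [((u, v), 1) for u, v in G.edges()] + [((3, 1), 0), ((2, 1), 0)]
 print(online_pass(G, examples))

One attraction of doing it this way is that the 42-billion-row file never has to exist: each pair's features are computed, used for one gradient step, and then thrown away.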