Machine Learning/Kaggle Social Network Contest/load data
Latest revision as of 13:28, 23 November 2010
= R =
== igraph ==
The full dataset loads quickly using the R package igraph; with everything loaded, R uses less than 900 MB of RAM.
Grab the package with:
<pre>
install.packages("igraph")
</pre>
Load the data using:
<pre>
data <- as.matrix(read.csv("social_train.csv", header = FALSE))
dg <- graph.edgelist(data, directed = TRUE)
</pre>
Note that the resulting graph contains an additional vertex with id zero, which has no edges. Deleting this vertex would renumber the remaining vertex ids, so it is best to leave it in place.
= Python =
== How to load the network into networkx ==
There is a network analysis package for Python called [http://networkx.lanl.gov/ networkx]. This package can be installed using easy_install.

The network can be loaded using the read_edgelist function in networkx, or by adding edges manually.

NOTE: John found that it took up about 5.5GB of memory to load the entire network. We may need to process it in chunks, or perhaps decompose it into smaller subnetworks.
'''Method 1'''
<pre>
import networkx as nx
DG = nx.read_edgelist('social_train.csv', create_using=nx.DiGraph(), nodetype=int, delimiter=',')
</pre>
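As a quick sanity check after a read_edgelist load, the node and edge counts can be inspected. The sketch below uses a tiny in-memory stand-in for social_train.csv (the "follower,followee one pair per line" format is an assumption; the real file is not reproduced here):

```python
import io
import networkx as nx

# Tiny stand-in for social_train.csv: one "follower,followee" pair per line.
# (Assumed format -- the real file is not reproduced here.)
sample = io.StringIO("1,2\n1,3\n2,3\n")

DG = nx.read_edgelist(sample, create_using=nx.DiGraph(), nodetype=int, delimiter=',')

print(DG.number_of_nodes())  # 3
print(DG.number_of_edges())  # 3
print(DG.has_edge(1, 2))     # True
print(DG.has_edge(2, 1))     # False -- the graph is directed
```

Because create_using is a DiGraph, edge direction is preserved: (1, 2) exists but (2, 1) does not.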
'''Method 2'''
<pre>
import networkx as nx
import csv
import time

t0 = time.clock()
DG = nx.DiGraph()
netcsv = csv.reader(open('social_train.csv', 'rb'), delimiter=',')
for row in netcsv:
    tmp1 = int(row[0])
    tmp2 = int(row[1])
    DG.add_edge(tmp1, tmp2)
print "Loaded in", str(time.clock() - t0), "s"
</pre>
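Given the ~5.5GB memory note above, one lower-memory alternative (a sketch, not something we have benchmarked on the full dataset) is to skip the networkx graph object entirely and stream the file into plain dicts of successor/predecessor sets, which is often enough for feature extraction:

```python
import csv
import io
from collections import defaultdict

def load_adjacency(lines):
    """Stream a follower,followee edge list into plain dicts of sets.

    Keeping successor/predecessor sets directly (instead of a full
    networkx DiGraph) trades graph-library conveniences for a smaller
    memory footprint.
    """
    successors = defaultdict(set)
    predecessors = defaultdict(set)
    for row in csv.reader(lines):
        src, dst = int(row[0]), int(row[1])
        successors[src].add(dst)
        predecessors[dst].add(src)
    return successors, predecessors

# Demo with an in-memory stand-in for social_train.csv (assumed format:
# one directed "follower,followee" edge per line).
sample = io.StringIO("1,2\n1,3\n2,3\n")
succ, pred = load_adjacency(sample)
print(sorted(succ[1]))  # [2, 3] -- nodes that 1 follows
print(sorted(pred[3]))  # [1, 2] -- nodes that follow 3
```

For the real file, pass an open file handle instead of the StringIO demo data.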
Below are the times to load different numbers of rows using the two methods on a 2.8 GHz quad-core machine with 3 GB of RAM. The second method seems quicker. Note that these figures are based on single runs and are intended as a guide rather than a rigorous analysis of the methods!
{| border="1"
|-
! Rows
! 1M
! 2M
! 3M
|-
| Method 1
| 20s
| 53s
| 103s
|-
| Method 2
| 15s
| 41s
| 86s
|}
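To soften the single-run caveat, one can repeat each load a few times and keep the minimum, which is less noisy than a single measurement. A stdlib sketch on synthetic data (time.perf_counter here is the modern replacement for the time.clock call used above; the file format and sizes are illustrative assumptions):

```python
import csv
import io
import time

def load_edges(text):
    """Parse a follower,followee CSV string into a list of int pairs."""
    return [(int(a), int(b)) for a, b in csv.reader(io.StringIO(text))]

# Synthetic edge list standing in for a slice of social_train.csv
# (assumed format: one directed edge per line).
text = "\n".join("%d,%d" % (i, i + 1) for i in range(10000))

# Time several repeated loads and keep the minimum.
times = []
for _ in range(5):
    t0 = time.perf_counter()
    edges = load_edges(text)
    times.append(time.perf_counter() - t0)

print("best of 5: %.4fs" % min(times))
```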
= Ruby =
== Note on CSV Libraries ==
If you happen to be using Ruby (like Jared) for loading data in and out of CSV files, you should definitely try [http://fastercsv.rubyforge.org/ FasterCSV] (require 'faster_csv') instead of the stock CSV library (require 'csv'). For example, loading the adjacency list was about ten times faster with FasterCSV than with the standard CSV library.
== Loading Adjacency Lists ==
<pre>
require 'rubygems'
require 'faster_csv'

def load_adj_list_faster(filename)
  adj_list_hash = {}
  FasterCSV.foreach(filename, :quote_char => '"', :col_sep => ',', :row_sep => :auto) do |row|
    node_id = row.shift
    list_of_adj = row
    adj_list_hash[node_id] = list_of_adj
  end
  return adj_list_hash
end

adj_list_lookup = load_adj_list_faster('adj_list.out.csv')
rev_adj_list_lookup = load_adj_list_faster('reverse_adj_list.out.csv')
</pre>