Machine Learning/Kaggle Social Network Contest/load data

From Noisebridge
Latest revision as of 13:28, 23 November 2010

= R =

== igraph ==

The full dataset loaded quickly using the R package igraph. With the full dataset loaded, R uses less than 900MB of RAM.

Grab the package with:

<pre>
install.packages("igraph")
</pre>

Load the data using:

<pre>
data <- as.matrix(read.csv("social_train.csv", header = FALSE))
dg <- graph.edgelist(data, directed = TRUE)
</pre>

Note that the resulting graph contains an additional vertex with id zero. If you delete this vertex, the vertex ids will not be preserved, so it is best to leave it in place; vertex zero has no edges anyway.

= Python =

== How to load the network into networkx ==

There is a network analysis package for Python called [http://networkx.lanl.gov/ networkx]. This package can be installed using easy_install.

The network can be loaded using the read_edgelist function in networkx, or by manually adding edges.

NOTE: John found that it took up about 5.5GB of memory to load the entire network. We may need to process it in chunks - or maybe decompose it into smaller sub networks.
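If the full DiGraph does not fit in memory, one lighter-weight option (not part of the original recipe, just a sketch assuming the same src,dst format of social_train.csv) is to skip networkx entirely and hold the edges as plain dicts of int sets:

```python
from collections import defaultdict

def load_adjacency(path):
    """Read an edge-list CSV (src,dst per line) into a dict of int sets.

    Plain dicts of ints carry far less per-edge overhead than a full
    networkx DiGraph, which may help if 5.5GB is too much for your machine.
    """
    adj = defaultdict(set)
    with open(path) as f:
        for line in f:
            src, dst = line.split(',')
            adj[int(src)].add(int(dst))
    return adj
```

The file is streamed line by line, so only the adjacency structure itself occupies memory; the same loop also makes it easy to process the edges in chunks.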

'''Method 1'''

<pre>
import networkx as nx
DG = nx.read_edgelist('social_train.csv', create_using=nx.DiGraph(), nodetype=int, delimiter=',')
</pre>

'''Method 2'''

<pre>
import networkx as nx
import csv
import time

t0 = time.clock()
DG = nx.DiGraph()

netcsv = csv.reader(open('social_train.csv', 'rb'), delimiter=',')

for row in netcsv:
    tmp1 = int(row[0])
    tmp2 = int(row[1])
    DG.add_edge(tmp1, tmp2)

print "Loaded in ", str(time.clock() - t0), "s"
</pre>
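Both methods should produce the same graph. A quick sanity check on a toy edge list (a hypothetical stand-in for social_train.csv, written in Python 3 syntax unlike the Python 2 snippets above):

```python
import csv
import os
import tempfile

import networkx as nx

# A tiny toy edge list standing in for social_train.csv.
fd, path = tempfile.mkstemp(suffix='.csv')
with os.fdopen(fd, 'w') as f:
    f.write('1,2\n2,3\n1,3\n')

# Method 1: read_edgelist.
G1 = nx.read_edgelist(path, create_using=nx.DiGraph(), nodetype=int, delimiter=',')

# Method 2: manual csv loop.
G2 = nx.DiGraph()
with open(path) as f:
    for row in csv.reader(f):
        G2.add_edge(int(row[0]), int(row[1]))

# Both loaders yield identical directed edges.
assert sorted(G1.edges()) == sorted(G2.edges())
os.remove(path)
```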

Below are the times to load different numbers of rows using the two methods on a 2.8GHz quad-core machine with 3GB RAM. The second method seems quicker. Note that these are based on single runs and are intended as a guide rather than a rigorous comparison of the methods!

{| class="wikitable"
! Rows !! 1M !! 2M !! 3M
|-
! Method 1
| 20s || 53s || 103s
|-
! Method 2
| 15s || 41s || 86s
|}


= Ruby =

== Note on CSV Libraries ==

If you happen to be using Ruby (like Jared) for loading data in and out of CSV files, you should definitely try [http://fastercsv.rubyforge.org/ FasterCSV] (require 'faster_csv') instead of the stock CSV library (require 'csv'). For example, loading the adjacency list was about ten times faster with FasterCSV than with the standard CSV library.

== Loading Adjacency Lists ==

<pre>
require 'rubygems'
require 'faster_csv'

def load_adj_list_faster(filename)
  adj_list_hash = {}
  FasterCSV.foreach(filename, :quote_char => '"', :col_sep => ',', :row_sep => :auto) do |row|
    node_id = row.shift
    list_of_adj = row
    adj_list_hash[node_id] = list_of_adj
  end
  return adj_list_hash
end

adj_list_lookup = load_adj_list_faster('adj_list.out.csv')
rev_adj_list_lookup = load_adj_list_faster('reverse_adj_list.out.csv')
</pre>
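For comparison, here is a Python counterpart of the Ruby loader above, using the stdlib csv module. This is a sketch, not part of the original write-up; it assumes the same layout, with the node id in the first column and its neighbours in the remaining columns:

```python
import csv

def load_adj_list(filename):
    """Load an adjacency-list CSV: first column is the node id,
    the remaining columns are that node's adjacent node ids."""
    adj_list = {}
    with open(filename) as f:
        for row in csv.reader(f):
            if not row:  # skip blank lines
                continue
            adj_list[row[0]] = row[1:]
    return adj_list
```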